
## TASK 1

## Use a user-defined k-NN classifier on the mice protein and breast cancer data

## MICE DATASET

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml

mice_data = fetch_openml(name='miceprotein', version=4)
print(mice_data.data.shape)
print(mice_data.target.shape)
print(np.unique(mice_data.target))
print(mice_data.DESCR)

(1080, 77)
(1080,)
['c-CS-m' 'c-CS-s' 'c-SC-m' 'c-SC-s' 't-CS-m' 't-CS-s' 't-SC-m' 't-SC-s']
**Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015   
**Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126.

Expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.

The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. There are 38 control mice and 34 trisomic mice (Down syndrome), for a total of 72 mice. In the experiments, 15 measurements were registered of each protein per sample/mouse. Therefore, for control mice, there ar

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(mice_data.data, mice_data.target,
                                       test_size=0.20,
                                       random_state=32,
                                       shuffle=True)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from collections import Counter

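k = 7  # number of neighbors that vote on each prediction

# Coerce every feature column to a numeric dtype; values that cannot be parsed become NaN.
# Caution: NaNs are not imputed here. A test sample containing a NaN has a NaN distance
# to every training sample, so its k "nearest" neighbors are effectively arbitrary.
X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')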

def predict(x):
    # Compute Euclidean distances between x and all examples in the training set
    distances = [np.linalg.norm(x - x_train) for x_train in X_train.values]

    # Sort by distance and return indices of the first k neighbors
    k_indices = np.argsort(distances)[:k]

    # Extract the labels of the k nearest neighbor training samples
    k_nearest_labels = [y_train.iloc[i] for i in k_indices]

    # Return the most common class label
    most_common = Counter(k_nearest_labels).most_common(1)
    return most_common[0][0]

y_pred = [predict(x) for x in X_test.values]
print(y_pred)

['c-SC-s', 't-SC-s', 'c-SC-s', 'c-CS-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-CS-m', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 't-SC-m', 'c-SC-s', 'c-SC-s', 't-SC-m', 't-CS-m', 't-SC-s', 'c-CS-s', 'c-SC-m', 'c-SC-s', 'c-SC-s', 'c-SC-s', 't-CS-m', 't-SC-s', 'c-CS-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 't-SC-s', 'c-SC-s', 't-SC-m', 't-CS-s', 'c-SC-m', 't-SC-m', 'c-SC-m', 't-CS-s', 't-CS-m', 'c-SC-s', 'c-SC-s', 't-SC-m', 'c-SC-s', 't-SC-s', 'c-SC-s', 't-CS-m', 't-SC-m', 't-CS-s', 'c-SC-s', 'c-SC-s', 't-CS-m', 't-SC-s', 'c-SC-s', 't-CS-m', 'c-SC-m', 'c-SC-m', 'c-SC-s', 't-CS-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 't-CS-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-CS-s', 't-SC-m', 't-CS-m', 'c-SC-s', 't-CS-s', 'c-SC-s', 't-CS-m', 't-SC-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-CS-m', 'c-CS-s', 't-SC-m', 'c-CS-m', 'c-SC-m', 'c-SC-m', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-SC-s', 't-CS-m', 'c-SC-s', 'c-SC-s', 'c-SC-s', 'c-CS-s', 'c-SC-s', 't-SC-m', 'c-SC-m', 'c-SC-s',

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[ 7,  4,  0, 21,  2,  0,  0,  0],
       [ 1,  9,  0, 12,  0,  0,  0,  0],
       [ 0,  0, 17, 23,  0,  0,  0,  0],
       [ 0,  0,  0, 22,  0,  0,  0,  0],
       [ 0,  0,  0,  8, 21,  0,  0,  0],
       [ 0,  0,  0,  4,  0, 12,  0,  0],
       [ 0,  0,  0, 16,  0,  0, 12,  0],
       [ 0,  0,  0, 11,  0,  0,  0, 14]], dtype=int64)
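Per-class precision and recall can be read directly off this matrix: rows are true classes and columns are predicted classes, in the sorted label order printed earlier. The following is a minimal sketch, assuming only NumPy and copying the matrix values from the output above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [7, 4, 0, 21, 2, 0, 0, 0],
    [1, 9, 0, 12, 0, 0, 0, 0],
    [0, 0, 17, 23, 0, 0, 0, 0],
    [0, 0, 0, 22, 0, 0, 0, 0],
    [0, 0, 0, 8, 21, 0, 0, 0],
    [0, 0, 0, 4, 0, 12, 0, 0],
    [0, 0, 0, 16, 0, 0, 12, 0],
    [0, 0, 0, 11, 0, 0, 0, 14],
])

true_positives = np.diag(cm)

# Recall: correct predictions divided by the number of true samples in each class (row sums)
recall = true_positives / cm.sum(axis=1)

# Precision: correct predictions divided by the number of times each class was predicted (column sums)
precision = true_positives / cm.sum(axis=0)

# Accuracy: all correct predictions divided by the total number of test samples
accuracy = true_positives.sum() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(float(accuracy), 2))
```

The fourth column (c-SC-s) collects most of the misclassified samples, which is why that class combines perfect recall with very low precision.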

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      c-CS-m       0.88      0.21      0.33        34
      c-CS-s       0.69      0.41      0.51        22
      c-SC-m       1.00      0.42      0.60        40
      c-SC-s       0.19      1.00      0.32        22
      t-CS-m       0.91      0.72      0.81        29
      t-CS-s       1.00      0.75      0.86        16
      t-SC-m       1.00      0.43      0.60        28
      t-SC-s       1.00      0.56      0.72        25

    accuracy                           0.53       216
   macro avg       0.83      0.56      0.59       216
weighted avg       0.85      0.53      0.58       216
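
Recall is low for most classes because the predictions collapse onto c-SC-s (precision 0.19, recall 1.00). One possible cause, assuming the protein columns contain missing values, is that a NaN in a test row turns all of its distances into NaN, so the neighbor ranking carries no information. The sketch below shows one way to handle this, mean imputation from the training split, on a small synthetic array rather than the mice data. The helper names `impute_with_train_mean` and `knn_predict` are hypothetical.

```python
import numpy as np
from collections import Counter

def impute_with_train_mean(X_train, X_test):
    """Replace NaNs in both splits with the per-column mean of the training split."""
    col_means = np.nanmean(X_train, axis=0)
    X_train_filled = np.where(np.isnan(X_train), col_means, X_train)
    X_test_filled = np.where(np.isnan(X_test), col_means, X_test)
    return X_train_filled, X_test_filled

def knn_predict(x, X_train, y_train, k=7):
    """Majority vote among the k training rows closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    k_indices = np.argsort(distances)[:k]
    return Counter(y_train[i] for i in k_indices).most_common(1)[0][0]

# Synthetic data with missing values (assumption: stands in for the protein features)
X_tr = np.array([[0.0, 0.1], [0.2, np.nan], [0.1, 0.0],
                 [5.0, 5.1], [5.2, 4.9], [np.nan, 5.0]])
y_tr = np.array(["a", "a", "a", "b", "b", "b"])
X_te = np.array([[0.1, np.nan], [5.1, 5.0]])

# Without imputation, the first test row has a NaN distance to every training row
raw_distances = np.linalg.norm(X_tr - X_te[0], axis=1)
print(np.isnan(raw_distances).all())

X_tr_f, X_te_f = impute_with_train_mean(X_tr, X_te)
preds = [knn_predict(x, X_tr_f, y_tr, k=3) for x in X_te_f]
print(preds)
```

Computing the column means on the training split only keeps information from the test split out of the model.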



## BREAST CANCER DATASET


In [None]:
cancer_data = fetch_openml(name='breast-cancer', version=1)
print(cancer_data.data.shape)
print(cancer_data.target.shape)
print(np.unique(cancer_data.target))
print(cancer_data.DESCR)

(286, 9)
(286,)
['no-recurrence-events' 'recurrence-events']
**Author**:   
**Source**: Unknown -   
**Please cite**:   

Citation Request:
    This breast cancer domain was obtained from the University Medical Centre,
    Institute of Oncology, Ljubljana, Yugoslavia.  Thanks go to M. Zwitter and 
    M. Soklic for providing the data.  Please include this citation if you plan
    to use this database.
 
 1. Title: Breast cancer data (Michalski has used this)
 
 2. Sources: 
    -- Matjaz Zwitter & Milan Soklic (physicians)
       Institute of Oncology 
       University Medical Center
       Ljubljana, Yugoslavia
    -- Donors: Ming Tan and Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
    -- Date: 11 July 1988
 
 3. Past Usage: (Several: here are some)
      -- Michalski,R.S., Mozetic,I., Hong,J., & Lavrac,N. (1986). The 
         Multi-Purpose Incremental Learning System AQ15 and its Testing 
         Application to Three Medical Domains.  In Proceedings of the 
         Fifth N

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cancer_data.data, cancer_data.target,
                                       test_size=0.20,
                                       random_state=32,
                                       shuffle=True)

In [None]:
print(X_train.iloc[0:12, :])

       age menopause tumor-size inv-nodes node-caps deg-malig breast  \
33   50-59      ge40      20-24       0-2        no         3  right   
283  30-39   premeno      30-34       6-8       yes         2  right   
46   60-69      ge40      10-14       0-2        no         2  right   
135  30-39   premeno      15-19       0-2        no         1  right   
116  40-49   premeno      30-34       6-8       yes         3  right   
235  60-69      ge40      40-44       0-2        no         2  right   
5    50-59   premeno      25-29       3-5        no         2  right   
255  50-59   premeno      25-29       0-2        no         3  right   
216  60-69      ge40      25-29       0-2        no         3   left   
200  30-39   premeno        5-9       0-2        no         2   left   
132  40-49   premeno      30-34       0-2        no         3  right   
180  30-39   premeno      40-44       0-2        no         2  right   

    breast-quad irradiat  
33      left_up       no  
283    ri

In [None]:
print(y_train.iloc[0:12])

33     no-recurrence-events
283    no-recurrence-events
46     no-recurrence-events
135       recurrence-events
116       recurrence-events
235       recurrence-events
5      no-recurrence-events
255       recurrence-events
216       recurrence-events
200    no-recurrence-events
132       recurrence-events
180    no-recurrence-events
Name: Class, dtype: category
Categories (2, object): ['no-recurrence-events', 'recurrence-events']


In [None]:
import pandas as pd

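k = 7  # number of neighbors that vote on each prediction

# Coerce every feature column to a numeric dtype; values that cannot be parsed become NaN.
# Caution: most columns here hold category strings such as '50-59' or 'premeno', so
# coercion turns them into NaN and every distance computed below is NaN as well.
X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')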

def predict(x):
    # Compute Euclidean distances between x and all examples in the training set
    distances = [np.linalg.norm(x - x_train) for x_train in X_train.values]

    # Sort by distance and return indices of the first k neighbors
    k_indices = np.argsort(distances)[:k]

    # Extract the labels of the k nearest neighbor training samples
    k_nearest_labels = [y_train.iloc[i] for i in k_indices]

    # Return the most common class label
    most_common = Counter(k_nearest_labels).most_common(1)
    return most_common[0][0]

y_pred = [predict(x) for x in X_test.values]

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[42,  0],
       [16,  0]], dtype=int64)

In [None]:
print(classification_report(y_test, y_pred))

                      precision    recall  f1-score   support

no-recurrence-events       0.72      1.00      0.84        42
   recurrence-events       0.00      0.00      0.00        16

            accuracy                           0.72        58
           macro avg       0.36      0.50      0.42        58
        weighted avg       0.52      0.72      0.61        58
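
The empty second column of the confusion matrix means that every test sample was predicted as no-recurrence-events. Since the features printed earlier are category strings, a numeric encoding is needed before Euclidean distance is meaningful. The sketch below illustrates one option, one-hot encoding with `pandas.get_dummies`, on a hypothetical toy frame that only mimics two of the columns; it is not the author's pipeline, and `knn_predict` is an illustrative helper.

```python
import numpy as np
import pandas as pd
from collections import Counter

# Hypothetical toy data that mimics two of the categorical columns shown above
train = pd.DataFrame({
    "age": ["30-39", "30-39", "60-69", "60-69"],
    "menopause": ["premeno", "premeno", "ge40", "ge40"],
})
y_train_toy = pd.Series([
    "recurrence-events", "recurrence-events",
    "no-recurrence-events", "no-recurrence-events",
])
test = pd.DataFrame({"age": ["60-69"], "menopause": ["ge40"]})

# One indicator column per category; the test frame is aligned to the training columns
train_enc = pd.get_dummies(train).astype(float)
test_enc = (
    pd.get_dummies(test)
    .reindex(columns=train_enc.columns, fill_value=0)
    .astype(float)
)

def knn_predict(x, X, y, k=3):
    """Majority vote among the k rows of X closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X - x, axis=1)
    k_indices = np.argsort(distances)[:k]
    return Counter(y.iloc[k_indices]).most_common(1)[0][0]

pred = knn_predict(test_enc.values[0], train_enc.values, y_train_toy, k=3)
print(pred)
```

Ordered categories such as the age bands could alternatively be mapped to integers; one-hot encoding is used here only because it makes no assumption about order.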





## TASK 2

## Use a user-defined k-NN for regression

In [None]:
from sklearn.datasets import fetch_california_housing
import numpy as np
from sklearn.model_selection import train_test_split

# All datasets available in scikit-learn are listed at https://scikit-learn.org/stable/datasets.html
X, y = fetch_california_housing(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
k = 7

def predict_regression(x):
    # Compute Euclidean distances between x and every training sample
    distances = [np.linalg.norm(x - x_train) for x_train in X_train]

    # Get the indices of the k closest training samples
    k_indices = np.argsort(distances)[:k]

    # Extract the corresponding target values
    k_nearest_values = [y_train[i] for i in k_indices]

    # Return the mean value of the neighbors
    return np.mean(k_nearest_values)

# Apply the predict_regression function to each test sample
predicted = [predict_regression(x) for x in X_test]
print(predicted)

[1.5071428571428573, 1.3718571428571429, 2.714857142857143, 2.4174285714285717, 1.468142857142857, 2.2914285714285714, 2.62943, 2.200142857142857, 1.7991428571428574, 2.4761428571428574, 1.5048571428571427, 1.5201428571428568, 1.8442857142857143, 2.334142857142857, 1.7054285714285715, 2.398714285714285, 2.1171442857142857, 1.8474285714285712, 2.7415714285714285, 1.6372857142857142, 1.5115714285714286, 2.4994285714285716, 1.5302857142857145, 1.8587142857142858, 1.8392857142857142, 1.426, 2.281857142857143, 1.7422857142857144, 1.184, 2.267572857142857, 1.3865714285714286, 1.6882857142857142, 2.3111428571428574, 2.2367142857142857, 2.78043, 1.81, 2.815714285714286, 1.5432857142857144, 1.4798571428571428, 2.7825728571428576, 1.7552857142857143, 2.1617142857142855, 1.8952857142857142, 1.8511428571428572, 1.9949999999999999, 2.55943, 1.8671428571428572, 1.995142857142857, 2.5500000000000003, 1.8571428571428574, 1.657714285714286, 1.8217142857142858, 2.4682871428571427, 1.685142857142857, 2.0

In [None]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

mae = mean_absolute_error(y_test, predicted)
mse = mean_squared_error(y_test, predicted)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, predicted)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (Coefficient of Determination):", r2)

Mean Squared Error (MSE): 1.1045381245194705
Root Mean Squared Error (RMSE): 1.0509700873571381
R-squared (Coefficient of Determination): 0.1571042759473391
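
An R-squared of about 0.16 is low. One plausible contributor is that the California housing features sit on very different scales, so Euclidean distance is dominated by the largest-valued columns. The sketch below shows the effect of standardizing features with training-split statistics on hypothetical data in which the first feature is informative and the second is a large-scale distractor. The helpers `standardize` and `knn_regress` are illustrative names, and whether scaling improves the score on the real data would need to be measured.

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale both splits to zero mean and unit variance using training statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std

def knn_regress(x, X_train, y_train, k=3):
    """Mean target of the k training rows closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    k_indices = np.argsort(distances)[:k]
    return float(np.mean(y_train[k_indices]))

# Hypothetical data: column 0 drives the target, column 1 is a large-scale distractor
X_tr = np.array([[1.0, 300.0], [1.1, 100.0], [0.9, 350.0],
                 [5.0, 110.0], [5.1, 120.0], [4.9, 500.0]])
y_tr = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
X_te = np.array([[1.0, 100.0]])

# Raw features: the distractor column decides which neighbors are chosen
raw_pred = knn_regress(X_te[0], X_tr, y_tr, k=3)

# Standardized features: both columns contribute on a comparable scale
X_tr_s, X_te_s = standardize(X_tr, X_te)
scaled_pred = knn_regress(X_te_s[0], X_tr_s, y_tr, k=3)

print(raw_pred, scaled_pred)
```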
