## **RADI608: Data Mining and Machine Learning**

### Assignment: K-Nearest Neighbors 
**Romen Samuel Rodis Wabina** <br>
Student, PhD Data Science in Healthcare and Clinical Informatics <br>
Clinical Epidemiology and Biostatistics, Faculty of Medicine (Ramathibodi Hospital) <br>
Mahidol University

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random 
import warnings 
warnings.filterwarnings('ignore')

from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, f1_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

## Question 1
#### Perform K-Nearest Neighbors to predict whether a patient has cancer using <code>weights = 'distance'</code>.

**SOLUTION** <br>
The class distribution is imbalanced: 22 patients have cancer and 40 do not. Resampling (e.g., undersampling, oversampling, or SMOTE) is therefore advisable. Here, we oversampled the minority class with the Synthetic Minority Oversampling Technique (SMOTE), instantiated below as <code>smote = SMOTE()</code>. SMOTE is applied to the full dataset before the split, so the test set also contains synthetic samples. We also checked the data types (<code>float</code> for $\mathbf{X}$ and <code>int</code> for $\mathbf{y}$) to ensure the data are suitable for KNN, and found no missing values. We then split the data into training and testing sets in an 80:20 ratio and standardized the features by removing the mean and scaling to unit variance, with the scaler fitted on the training set only.
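To make the resampling step more concrete, the sketch below illustrates the core idea behind SMOTE: a synthetic point is placed somewhere on the line segment between a minority sample and one of its nearest minority-class neighbours. The helper `smote_like_sample` and the toy array are hypothetical and only illustrate the interpolation; this is not the `imblearn` implementation used in this notebook.

```python
import numpy as np

def smote_like_sample(X_minority, k=5, rng=None):
    """Create one synthetic point by interpolating between a minority sample
    and one of its k nearest minority-class neighbours (hypothetical helper).
    Assumes the rows of X_minority are distinct points."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(X_minority))                  # pick a random minority sample
    dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]            # skip position 0 (the sample itself)
    j = rng.choice(neighbours)                         # pick one neighbour at random
    lam = rng.random()                                 # interpolation factor in [0, 1)
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])

# Toy minority class with two features (made-up values)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(X_min, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

Because the new point is a convex combination of two existing minority samples, it always lies between them in feature space.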

In [15]:
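# Load the colon dataset and separate the features from the class label
df_colon = pd.read_csv('../data/colon.csv')
XX = df_colon.drop('Class', axis = 1)
yy = df_colon['Class']

X, y = XX.to_numpy(), yy.to_numpy()
y = y.flatten()

# Oversample the minority class (runs before the split, so the test set also contains synthetic samples)
smote = SMOTE()
X, y = smote.fit_resample(X, y)

# random.seed only seeds Python's random module; the split itself is made reproducible by random_state
random.seed(413)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Fit the scaler on the training set only, then apply the same transformation to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)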
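As a rough illustration of what the standardization step computes, the sketch below reproduces the fit and transform logic by hand with NumPy. The toy arrays and variable names are made up for this example; the point is that the mean and standard deviation come from the training set only and are then reused for the test set.

```python
import numpy as np

# Toy data (made-up values): rows are samples, columns are features
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

# "fit": estimate mean and standard deviation from the training set only
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)      # population std (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to both sets
X_train_std = (X_train_toy - mu) / sigma
X_test_std = (X_test_toy - mu) / sigma

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

The standardized test values need not have zero mean or unit variance, because they are expressed relative to the training distribution.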

We used <code>weights = 'distance'</code> for this KNN model, which weights each point in the neighbourhood by the inverse of its distance to the query point. Closer neighbours therefore have a greater influence on the prediction than neighbours that are further away. In the code below, <code>n_neighbors</code> is searched over the integers from 1 to 12, i.e. <code>np.arange(1, 13, 1)</code>.
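The effect of the weighting scheme can be sketched with a small voting function. The helper `knn_vote` and the toy distances are hypothetical and assume all distances are strictly positive; they only illustrate how an inverse-distance vote can disagree with a plain majority vote.

```python
import numpy as np

def knn_vote(distances, labels, weights="uniform"):
    """Return the winning label among the k nearest neighbours (hypothetical helper)."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    # 'uniform': every neighbour counts once; 'distance': each counts 1 / distance
    w = np.ones_like(distances) if weights == "uniform" else 1.0 / distances
    classes = np.unique(labels)
    totals = np.array([w[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# Three neighbours: one very close point of class 1, two farther points of class 0
d = [0.1, 1.0, 1.2]
lab = [1, 0, 0]
print(knn_vote(d, lab, weights="uniform"))   # majority vote picks class 0
print(knn_vote(d, lab, weights="distance"))  # inverse-distance vote picks class 1
```

With uniform weights the two class-0 neighbours win by count, while with distance weights the single close class-1 neighbour contributes a weight of 10 and outweighs them.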

In [47]:
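def main(X_train, X_test, y_train, y_test, 
         model, param_grid = {'n_neighbors': np.arange(1, 13, 1)}):
    """Tune n_neighbors by cross-validated grid search, then evaluate on the test set."""

    random.seed(413)
    # 10 stratified random splits of the training data (default validation size of 10% each)
    cv = StratifiedShuffleSplit(n_splits = 10, random_state = 42)
    # Score candidates by micro-averaged F1 (equal to accuracy for single-label classification)
    # and refit the best candidate on the full training set
    grid = GridSearchCV(model, param_grid = param_grid, cv = cv, scoring = 'f1_micro', refit = True)
    grid.fit(X_train, y_train) 

    print(f"The best parameters are {grid.best_params_} with a score of {grid.best_score_:.2f}")
    
    # Predict with the refitted best estimator
    yhat = grid.predict(X_test)

    print('======================= Confusion Matrix =======================')
    print(confusion_matrix(y_test, yhat))
    
    print('==================== Classification Report =====================')
    # target_names are matched to the class labels in sorted order
    print(classification_report(y_test, yhat, target_names = ['Cancer', 'No Cancer']))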

In [48]:
model_distance = KNeighborsClassifier(weights = 'distance')
main(X_train, X_test, y_train, y_test, model_distance)

The best parameters are {'n_neighbors': 3} with a score of 0.86
[[7 1]
 [1 7]]
              precision    recall  f1-score   support

      Cancer       0.88      0.88      0.88         8
   No Cancer       0.88      0.88      0.88         8

    accuracy                           0.88        16
   macro avg       0.88      0.88      0.88        16
weighted avg       0.88      0.88      0.88        16



## Question 2
#### Perform KNN to predict whether a patient has cancer using <code>weights = 'uniform'</code>

We used <code>weights = 'uniform'</code> for this KNN model, so every point in the neighbourhood receives equal weight and the prediction is a simple majority vote among the nearest neighbours, regardless of how close each one is to the query point. As in Question 1, <code>n_neighbors</code> is searched over the integers from 1 to 12.

In [49]:
model_uniform = KNeighborsClassifier(weights = 'uniform')
main(X_train, X_test, y_train, y_test, model_uniform)

The best parameters are {'n_neighbors': 10} with a score of 0.90
[[6 2]
 [0 8]]
              precision    recall  f1-score   support

      Cancer       1.00      0.75      0.86         8
   No Cancer       0.80      1.00      0.89         8

    accuracy                           0.88        16
   macro avg       0.90      0.88      0.87        16
weighted avg       0.90      0.88      0.87        16
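As a quick check on how the report relates to the confusion matrix, the per-class metrics can be recomputed by hand. The helper `per_class_metrics` is a hypothetical name used only for this sketch; it assumes rows are true labels and columns are predicted labels, as in scikit-learn's `confusion_matrix`.

```python
def per_class_metrics(cm, index):
    """Precision, recall and F1 for one class of a confusion matrix
    (rows = true labels, columns = predicted labels)."""
    tp = cm[index][index]
    fp = sum(row[index] for row in cm) - tp      # predicted as this class but wrong
    fn = sum(cm[index]) - tp                     # this class predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

cm_uniform = [[6, 2], [0, 8]]   # confusion matrix printed above
for name, idx in [("Cancer", 0), ("No Cancer", 1)]:
    p, r, f = per_class_metrics(cm_uniform, idx)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For the Cancer class this gives a precision of 6/6 and a recall of 6/8, which matches the 1.00 and 0.75 in the report.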



## Model Comparison

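Both weighting schemes reach the same test accuracy of 0.88 on the 16 held-out samples, so the test set alone does not clearly separate them. With <code>weights = 'distance'</code>, the grid search selected <code>n_neighbors = 3</code> with a cross-validation score of 0.86, and the test errors are symmetric: one misclassification in each class, giving a precision, recall, and F1-score of 0.88 for both classes. With <code>weights = 'uniform'</code>, the grid search selected <code>n_neighbors = 10</code> with a higher cross-validation score of 0.90, but both test errors fall on the Cancer class: Cancer recall drops to 0.75 while its precision rises to 1.00, and No Cancer reaches a recall of 1.00 with a precision of 0.80. If missing a cancer case is considered the costlier error, the distance-weighted model's higher Cancer recall (0.88 vs. 0.75) is preferable on this split, although with only 16 test samples the difference amounts to a single patient.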