## **RADI608: Data Mining and Machine Learning**

### Assignment: K-Nearest Neighbors 
**Romen Samuel Rodis Wabina** <br>
Student, PhD Data Science in Healthcare and Clinical Informatics <br>
Clinical Epidemiology and Biostatistics, Faculty of Medicine (Ramathibodi Hospital) <br>
Mahidol University

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random 
import warnings 
warnings.filterwarnings('ignore')

from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
from time import time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, f1_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import label_binarize
from sklearn.metrics import average_precision_score

## Question 1
#### Perform K-Nearest Neighbors to predict whether a patient has cancer using <code>weights = 'distance'</code>.

**SOLUTION** <br>
The class distribution is imbalanced: 22 patients have cancer and 40 do not. Hence, a resampling technique (e.g., undersampling, oversampling, or SMOTE) is advisable. Here, we oversampled the minority class using the Synthetic Minority Oversampling Technique (SMOTE), instantiated below as <code>smote = SMOTE()</code>. We also verified the data types in the dataset (i.e., <code>float</code> for $\mathbf{X}$ and <code>int</code> for $\mathbf{y}$) to ensure proper data modeling in KNN. No missing values were detected in the dataset.

In addition, we standardize the features by removing the mean and scaling to unit variance, since they may be measured in different units. We also split the data into training and testing sets in an 80:20 ratio.
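The standardization step can be written out by hand as a minimal sketch. The toy arrays `X_train_toy` and `X_test_toy` are hypothetical values chosen only for illustration; the point is that the mean and standard deviation are estimated on the training split and then reused for the testing split.

```python
import numpy as np

# Hypothetical toy features on very different scales (not the colon data).
X_train_toy = np.array([[10.0, 1000.0],
                        [12.0, 3000.0],
                        [14.0, 5000.0]])
X_test_toy = np.array([[12.0, 1000.0]])

# Estimate the statistics on the training split only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0), as in StandardScaler

# z = (x - mean) / std, applied to both splits with the training statistics.
Z_train = (X_train_toy - mu) / sigma
Z_test = (X_test_toy - mu) / sigma

print(Z_train.mean(axis=0), Z_train.std(axis=0))
print(Z_test)
```

After this transformation each training feature has zero mean and unit variance, so no single feature dominates the distance computation in KNN merely because of its units.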

In [7]:
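# Load the data and separate the features from the class label
df_colon = pd.read_csv('../data/colon.csv')
XX = df_colon.drop('Class', axis = 1)
yy = df_colon['Class']

X, y = XX.to_numpy(), yy.to_numpy()
y = y.flatten()

# Oversample the minority class (SMOTE is unseeded here, so synthetic samples vary between runs)
smote = SMOTE()
X, y = smote.fit_resample(X, y)

# random.seed only seeds Python's random module; the split is controlled by random_state
random.seed(413)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Fit the scaler on the training set only, then apply it to the testing set
random.seed(413)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)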

We used <code>weights = 'distance'</code> for this KNN model, which sets how the votes of the points in the neighbourhood are weighted. With <code>distance</code>, each neighbour's vote is weighted by the inverse of its distance, so the closer neighbours of a query point have a greater influence than neighbours which are further away. In the code below, we searched <code>n_neighbors</code> over the grid $\{1, 2, 4, 6, 8, 10, 12\}$.
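To make the weighting concrete, the following is a minimal sketch of an inverse-distance vote written by hand. The helper `weighted_vote` and the five neighbour distances are hypothetical and serve only as an illustration; the sketch assumes no neighbour lies at distance zero.

```python
import numpy as np

def weighted_vote(distances, labels, weights="distance"):
    """Return (winning class, vote totals) among the k nearest neighbours."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    if weights == "distance":
        w = 1.0 / distances           # closer neighbours get larger weights
    else:
        w = np.ones_like(distances)   # every neighbour counts once
    classes = np.unique(labels)
    totals = np.array([w[labels == c].sum() for c in classes])
    winner = classes[np.argmax(totals)]
    return winner, dict(zip(classes.tolist(), totals.tolist()))

# Hypothetical neighbourhood: two close neighbours of class 1, three far ones of class 0.
d = [0.5, 0.6, 2.0, 2.5, 3.0]
lab = [1, 1, 0, 0, 0]

print(weighted_vote(d, lab, weights="distance"))
print(weighted_vote(d, lab, weights="uniform"))
```

In this toy neighbourhood the two close neighbours outweigh the three distant ones under inverse-distance weighting, whereas a plain count favours the three distant neighbours.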

In [13]:
def display_results(y, prediction, set_name = 'Training Set'):
    # Print the confusion matrix and the per-class metrics for one data split
    print(f'======================= {set_name} =======================')
    print(confusion_matrix(y, prediction))
    print(classification_report(y, prediction, target_names = ['No Cancer', 'Cancer']))

def main(X_train, X_test, y_train, y_test, model, param_grid = None, display_train = True):
    if param_grid is None:
        param_grid = {'n_neighbors': [1, 2, 4, 6, 8, 10, 12]}
    start = time()

    # 10 stratified random splits of the training set for cross-validation
    cv = StratifiedShuffleSplit(n_splits = 10, random_state = 42)

    # No `scoring` is given, so candidates are ranked by accuracy;
    # pass scoring = 'recall' to select on recall instead.
    random.seed(413)
    grid = GridSearchCV(model, param_grid = param_grid, cv = cv, refit = True)
    grid.fit(X_train, y_train)

    print(f'The best parameters are {grid.best_params_} with a score of {grid.best_score_:.2f}')

    # The refitted best model predicts both splits
    predictions = grid.predict(X_train)
    yhat = grid.predict(X_test)
    print(f"Fit and predict time: {np.round(time() - start, 4)} seconds")
    if display_train:
        display_results(y_train, predictions, set_name = 'Training Set')
    display_results(y_test, yhat, set_name = 'Testing Set')
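The grid search above passes no <code>scoring</code> argument, so it appears to rank the candidate values of <code>n_neighbors</code> by accuracy, which is the default score of a classifier. If the selection should instead be driven by recall, one option is to request it explicitly, as in the sketch below. The synthetic data from `make_classification` is a stand-in and an assumption, not the colon dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the colon dataset).
X_demo, y_demo = make_classification(n_samples=80, n_features=20, random_state=0)

cv = StratifiedShuffleSplit(n_splits=10, random_state=42)
grid_recall = GridSearchCV(
    KNeighborsClassifier(weights="distance"),
    param_grid={"n_neighbors": [1, 2, 4, 6, 8, 10, 12]},
    scoring="recall",  # recall of the positive class (label 1)
    cv=cv,
    refit=True,        # refit the best candidate on the full training data
)
grid_recall.fit(X_demo, y_demo)

print(grid_recall.best_params_, round(grid_recall.best_score_, 2))
```

With <code>scoring = 'recall'</code>, <code>best_score_</code> is the mean cross-validated recall of the positive class rather than the mean accuracy.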

In [28]:
model_distance = KNeighborsClassifier(weights = 'distance')
main(X_train, X_test, y_train, y_test, model_distance, display_train = False)

The best parameters are {'n_neighbors': 12} with a score of 0.91
Fit and predict time: 10.198 seconds
[[7 1]
 [0 8]]
              precision    recall  f1-score   support

   No Cancer       1.00      0.88      0.93         8
      Cancer       0.89      1.00      0.94         8

    accuracy                           0.94        16
   macro avg       0.94      0.94      0.94        16
weighted avg       0.94      0.94      0.94        16



The best parameter for <code>KNN('distance')</code> is <code>k = 12</code>. This implies that the twelve nearest neighbors of a sample vote on its class, each vote weighted by the inverse of its distance, and the sample receives the class with the largest weighted vote. Hence, a patient is assigned to the class (i.e., Cancer or No Cancer) that dominates among its twelve nearest neighbors, on the premise that patients with similar features (i.e., independent variables) share the same class. The <code>KNN('distance')</code> took 10.20 seconds to run the grid search on the training set and predict the testing data. Note that this time covers seven candidate values of <code>k</code> over ten cross-validation splits, not a single fit. It nevertheless suggests that the distance computations in <code>KNN</code> can become costly on large datasets.

The <code>KNN(weights = 'distance')</code>, however, still produced excellent performance metrics. Results have shown that there is only one misclassification among the 16 test samples: a single false positive, meaning that the model classified a patient as having cancer although the ground truth indicates otherwise. Despite this misclassification, the model achieved good <code>recall</code> for both classes: 88% for the No Cancer class and 100% for the Cancer class. It is important that we use <code>recall</code> because a false negative means that a patient with cancer goes undetected.
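The per-class recall and precision in the report can be recomputed directly from the confusion matrix, where rows are the true classes and columns are the predicted classes. The short sketch below uses the matrix printed above; the variable names are ours.

```python
import numpy as np

# Confusion matrix of the testing set (rows: true class, columns: predicted class).
# Class order: [No Cancer, Cancer].
cm = np.array([[7, 1],
               [0, 8]])

recall = np.diag(cm) / cm.sum(axis=1)     # correct predictions / true members of each class
precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / predicted members of each class
accuracy = np.trace(cm) / cm.sum()

print(np.round(recall, 2), np.round(precision, 2), round(accuracy, 2))
```

The recall of the Cancer class is 8/8 because no cancer patient was missed, while its precision is 8/9 because one healthy patient was flagged as having cancer.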

## Question 2
#### Perform KNN to predict whether a patient has cancer using <code>weights = 'uniform'</code>

We used <code>weights = 'uniform'</code> for this KNN model, which sets how the votes of the points in the neighbourhood are weighted. With <code>uniform</code>, all neighbors get an equally weighted *vote* about an observation's class. In the code below, we searched <code>n_neighbors</code> over the grid $\{1, 2, 4, 6, 8, 10, 12\}$.
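As a small illustration of how the two weighting options can disagree, the sketch below fits <code>KNeighborsClassifier</code> on a hypothetical one-dimensional toy set (our assumption, not the colon data) in which the query point has two close neighbors of class 1 and three farther neighbors of class 0.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D training points: two of class 1 near the query, three of class 0 farther away.
X_toy = np.array([[0.0], [0.2], [1.5], [1.6], [1.7]])
y_toy = np.array([1, 1, 0, 0, 0])
query = np.array([[0.5]])

preds = {}
for w in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=5, weights=w).fit(X_toy, y_toy)
    preds[w] = int(knn.predict(query)[0])

print(preds)
```

With uniform weights the three class-0 points win the count, whereas with distance weights the two nearby class-1 points carry more total weight.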

In [30]:
model_uniform = KNeighborsClassifier(weights = 'uniform')
main(X_train, X_test, y_train, y_test, model_uniform, display_train = False)

The best parameters are {'n_neighbors': 8} with a score of 0.90
Fit and predict time: 9.7412 seconds
[[7 1]
 [1 7]]
              precision    recall  f1-score   support

   No Cancer       0.88      0.88      0.88         8
      Cancer       0.88      0.88      0.88         8

    accuracy                           0.88        16
   macro avg       0.88      0.88      0.88        16
weighted avg       0.88      0.88      0.88        16



The best parameter for <code>KNN('uniform')</code> is <code>k = 8</code>. This implies that we consider the eight nearest neighbors of a sample and take their *majority* vote to determine its classification. Hence, a patient is assigned to the class (i.e., Cancer or No Cancer) held by the majority of its eight nearest neighbors, on the premise that patients with similar features (i.e., independent variables) share the same class. The <code>KNN('uniform')</code> took 9.74 seconds to fit and predict, which is approximately the same as the <code>KNN('distance')</code>.

While <code>KNN('uniform')</code> produced good performance metrics, its scores are still lower than those of <code>KNN('distance')</code>, particularly the <code>recall</code> metrics, where both the Cancer and No Cancer classes have 88% recall. Results have shown that <code>KNN('uniform')</code> produced one false positive and one false negative prediction. The false positive has the same meaning as in Question 1: a patient without cancer was classified as having cancer. Meanwhile, the single false negative indicates that the model classified one patient as healthy although the ground truth says the patient has colon cancer.

## Model Comparison

- <code>weights = 'uniform'</code> generated more **misclassifications** than <code>weights = 'distance'</code> (two versus one on the 16 test samples). One possible reason is that every point in the neighborhood gets an equally weighted vote (hence *uniform*) about the sample's class, regardless of its distance to the sample. Far neighbors of a different class therefore contribute to the majority vote as much as near neighbors do. In other words, with <code>weights = 'uniform'</code>, neighbors that are a bit further away still count as much towards the prediction, even if their classes differ from those of the nearer neighbors. Therefore, <code>weights = 'uniform'</code> is likely more prone to misclassifications.

- From our point above, it is expected that <code>weights = 'distance'</code> can **reduce** misclassifications since it prioritizes nearer neighbors. The nearer neighbors are more likely to share the sample's class, which makes the classification results more accurate. Furthermore, <code>distance</code> reduces bias by down-weighting data points that are less similar, but this increases variance because the prediction relies more on individual data points from the training sample.

- Since <code>weights = 'uniform'</code> gives far neighbors the same contribution as near ones, this strategy produces lower dispersion among its predictions. This implies that the uniform weighting strategy in KNN provides lower variance and higher bias than <code>weights = 'distance'</code>. Low variance and high bias correspond to underfitting, which is likely to produce misclassifications.

- Conversely, the distance-based weighting strategy provides higher variance and lower bias than <code>weights = 'uniform'</code>. Hence, we may expect that <code>weights = 'distance'</code> would tend to overfit, since it can overly prioritize the closest neighbor and disregard the other nearest neighbors if they are a bit further away.
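One way to probe the overfitting tendency discussed above is to compare training and testing accuracy for both weighting options. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the colon dataset) and a fixed <code>k = 12</code>. With <code>weights = 'distance'</code>, each training point is its own zero-distance neighbor, so the training accuracy is perfect as long as no two identical points carry different labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the colon dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

scores = {}
for w in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=12, weights=w).fit(X_tr, y_tr)
    # Store (training accuracy, testing accuracy) for each weighting option.
    scores[w] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))

for w, (train_acc, test_acc) in scores.items():
    print(f"{w:>8}: train = {train_acc:.2f}, test = {test_acc:.2f}")
```

A perfect training score alongside a lower testing score for <code>distance</code> would be consistent with the higher-variance behavior described in the bullets, although the size of the gap depends on the data.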