## **RADI608: Data Mining and Machine Learning**

### Assignment: K-Nearest Neighbors 
**Romen Samuel Rodis Wabina** <br>
Student, PhD Data Science in Healthcare and Clinical Informatics <br>
Clinical Epidemiology and Biostatistics, Faculty of Medicine (Ramathibodi Hospital) <br>
Mahidol University

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random 
import warnings 
warnings.filterwarnings('ignore')

from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, f1_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

## Question 1
#### Perform K-Nearest Neighbors to predict whether a patient has cancer using <code>weights = 'distance'</code>.

**SOLUTION** <br>
The class distribution is imbalanced: 22 patients have cancer and 40 do not. Hence, it is important to apply a resampling technique (i.e., undersampling, oversampling, or SMOTE). Here, we oversample the minority class with the Synthetic Minority Oversampling Technique (SMOTE), instantiated below as <code>smote = SMOTE()</code>. Note that SMOTE is applied to the full dataset before the split, so the test set can contain synthetic samples. We also verified the data types (i.e., <code>float</code> for $\mathbf{X}$ and <code>int</code> for $\mathbf{y}$) to ensure proper data modeling in KNN. No missing values were detected in the dataset. We then split the data in an 80:20 ratio into training and testing sets and standardize the features by removing the mean and scaling to unit variance, fitting the scaler on the training set only.
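To make the oversampling step more concrete, the following is a simplified sketch of the interpolation idea behind SMOTE, written with NumPy only. It is not the `imblearn` implementation: the function name `smote_sketch`, its parameters, and the toy data are assumptions made for illustration. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between neighbours.

    Hypothetical helper for illustration; not the imblearn algorithm verbatim.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances among minority samples only
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    nn_idx = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbours

    base = rng.integers(0, n, size=n_new)                 # starting samples
    neigh = nn_idx[base, rng.integers(0, k, size=n_new)]  # neighbour to move toward
    gap = rng.random((n_new, 1))                          # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Toy minority class with 22 samples; 18 synthetic samples bring it up to 40
rng = np.random.default_rng(1)
X_minority = rng.normal(size=(22, 3))
X_synth = smote_sketch(X_minority, n_new=40 - 22)
print(X_synth.shape)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies within the range spanned by the observed minority data.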

In [9]:
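# Load the colon dataset; 'Class' is the target column
df_colon = pd.read_csv('../data/colon.csv')
XX = df_colon.drop('Class', axis = 1)
yy = df_colon['Class']

X, y = XX.to_numpy(), yy.to_numpy()
y = y.flatten()

# Oversample the minority class; no random_state is set, so the synthetic samples vary between runs
smote = SMOTE()
X, y = smote.fit_resample(X, y)

# Note: this seeds Python's built-in random module only, not NumPy or scikit-learn
random.seed(413)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)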
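As a rough sketch of what the standardization step computes, the NumPy snippet below reproduces the arithmetic on made-up data (the arrays `X_tr` and `X_te` are hypothetical stand-ins, not the colon dataset). The mean and standard deviation are estimated from the training set only and then reused for the test set.

```python
import numpy as np

# Hypothetical stand-in data: 50 training and 10 test samples, 4 features
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=3.0, size=(50, 4))
X_te = rng.normal(loc=5.0, scale=3.0, size=(10, 4))

# Statistics come from the training set only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

X_tr_std = (X_tr - mu) / sigma
X_te_std = (X_te - mu) / sigma  # same mu and sigma; the test set is never used for fitting

print(X_tr_std.mean(axis=0).round(6), X_tr_std.std(axis=0).round(6))
```

The standardized training features have zero mean and unit variance by construction; the test features generally do not, which is expected.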

We used <code>weights = 'distance'</code> for this KNN model, which sets the weight assigned to points in the neighbourhood to the inverse of their distance. As a result, the closer neighbours of a query point have a greater influence than neighbours which are further away. In the code below, <code>n_neighbors</code> is searched over the integers in $[1, 13)$, i.e. $k = 1, \dots, 12$.
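The effect of the weighting scheme on a single prediction can be sketched as follows. The helper `knn_vote` and the toy distances are assumptions for illustration only; scikit-learn performs this vote internally. The example uses three neighbours, where one close neighbour belongs to class 1 and two more distant neighbours belong to class 0.

```python
import numpy as np

def knn_vote(distances, labels, weights="distance"):
    """Return the winning class and the per-class vote totals (hypothetical helper)."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    if weights == "uniform":
        w = np.ones_like(distances)   # every neighbour counts equally
    else:
        w = 1.0 / distances           # inverse distance; assumes no zero distances
    classes = np.unique(labels)
    totals = np.array([w[labels == c].sum() for c in classes])
    return int(classes[np.argmax(totals)]), dict(zip(classes.tolist(), totals.tolist()))

# One close neighbour of class 1, two farther neighbours of class 0
d = [0.5, 2.0, 2.5]
lab = [1, 0, 0]

pred_distance, totals_distance = knn_vote(d, lab, weights="distance")
pred_uniform, totals_uniform = knn_vote(d, lab, weights="uniform")
print(pred_distance, totals_distance)  # class 1 wins: 2.0 vs 0.9
print(pred_uniform, totals_uniform)    # class 0 wins: 2 votes vs 1
```

With the same three neighbours, the two schemes disagree: the inverse-distance vote follows the single closest point, while the uniform vote follows the majority.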

In [10]:
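def main(X_train, X_test, y_train, y_test, model, param_grid = None):
    # Default search space: k = 1, ..., 12 (avoids a mutable default argument)
    if param_grid is None:
        param_grid = {'n_neighbors': np.arange(1, 13, 1)}

    random.seed(413)
    # 10 stratified random splits of the training set (default test_size = 0.1)
    cv = StratifiedShuffleSplit(n_splits = 10, random_state = 42)
    # Score candidates by micro-averaged F1 (equal to accuracy for single-label problems)
    # and refit the best candidate on the full training set
    grid = GridSearchCV(model, param_grid = param_grid, cv = cv, scoring = 'f1_micro', refit = True)
    grid.fit(X_train, y_train) 

    print(f"The best parameters are {grid.best_params_} with a score of {grid.best_score_:.2f}")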
    
    yhat = grid.predict(X_test)

    print('======================= Confusion Matrix =======================')
    print(confusion_matrix(y_test, yhat))
    
    print('==================== Classification Report =====================')
    print(classification_report(y_test, yhat, target_names = ['No Cancer', 'Cancer']))

In [11]:
model_distance = KNeighborsClassifier(weights = 'distance')
main(X_train, X_test, y_train, y_test, model_distance)

The best parameters are {'n_neighbors': 4} with a score of 0.90
[[8 0]
 [0 8]]
              precision    recall  f1-score   support

   No Cancer       1.00      1.00      1.00         8
      Cancer       1.00      1.00      1.00         8

    accuracy                           1.00        16
   macro avg       1.00      1.00      1.00        16
weighted avg       1.00      1.00      1.00        16



The <code>KNeighborsClassifier(weights = 'distance')</code> model, with the selected $k = 4$, classified all 16 test samples correctly: there are no false positives and no false negatives. Consequently, precision, recall, and <code>F1-score</code> are all 1.00 for both classes, while the cross-validated score on the training set was 0.90. We use the F1-score as the primary performance metric since we utilized a high-dimensional dataset (i.e., the number of features is larger than the number of samples).

## Question 2
#### Perform KNN to predict whether a patient has cancer using <code>weights = 'uniform'</code>

We used <code>weights = 'uniform'</code> for this KNN model. With <code>uniform</code> weights, all neighbors get an equally weighted *vote* about an observation's class, regardless of their distance. As before, <code>n_neighbors</code> is searched over the integers in $[1, 13)$, i.e. $k = 1, \dots, 12$.

In [13]:
model_uniform = KNeighborsClassifier(weights = 'uniform')
main(X_train, X_test, y_train, y_test, model_uniform)

The best parameters are {'n_neighbors': 12} with a score of 0.87
[[6 2]
 [1 7]]
              precision    recall  f1-score   support

   No Cancer       0.86      0.75      0.80         8
      Cancer       0.78      0.88      0.82         8

    accuracy                           0.81        16
   macro avg       0.82      0.81      0.81        16
weighted avg       0.82      0.81      0.81        16
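The figures in the classification report can be recomputed by hand from the confusion matrix above. The snippet below is a small worked example, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, in the order No Cancer, Cancer.

```python
import numpy as np

# Confusion matrix for weights = 'uniform' (rows: true class, columns: predicted class)
cm = np.array([[6, 2],
               [1, 7]])

tp = np.diag(cm)                  # correctly classified samples per class
precision = tp / cm.sum(axis=0)   # column sums = number predicted as each class
recall = tp / cm.sum(axis=1)      # row sums = number truly in each class
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

for name, p, r, f in zip(['No Cancer', 'Cancer'], precision, recall, f1):
    print(f"{name:>9}: precision = {p:.3f}, recall = {r:.3f}, f1 = {f:.3f}")
print(f"accuracy = {accuracy:.4f}")
```

For example, the Cancer class has precision $7/9 \approx 0.78$ and recall $7/8 = 0.875$, giving an F1-score of $14/17 \approx 0.82$; overall accuracy is $13/16 \approx 0.81$.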



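## Model Comparison

On this dataset, the distance-weighted model scored higher than the uniform model both in cross-validation (0.90 vs. 0.87) and on the test set (accuracy of 1.00 vs. 0.81). In general, however, <code>weights = 'distance'</code> is expected to be more prone to overfitting. The reason is that it can prioritize the closest neighbor too strongly and largely disregard the other nearest neighbors if they are slightly further away.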

In contrast, <code>weights = 'uniform'</code> ensures that even if some of the nearest neighbors are slightly further away, they still count as much towards the prediction.

This is a good illustration of the bias-variance tradeoff.
- The <code>distance</code> option reduces bias by down-weighting data points that are less similar, but in doing so it increases variance, since the prediction relies more heavily on individual data points of the training sample.
- The <code>uniform</code> option does the opposite: it reduces variance by giving each of the nearest neighbors the same contribution, which lowers the dependence on individual training points. The cost is that neighbors which are still quite distant from the observation to label count equally, which leads to larger bias.
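One concrete consequence of the points above can be sketched with a small NumPy-only KNN. This is an illustrative re-implementation on synthetic data, not the scikit-learn model used in this assignment; the function `knn_predict` and the two overlapping Gaussian classes are assumptions. With distance weighting, each training point has distance zero to itself and therefore always predicts its own label, so training accuracy is perfect by construction and says little about generalization.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5, weights="uniform"):
    """Binary KNN prediction with uniform or inverse-distance votes (hypothetical helper)."""
    # Euclidean distances, shape (n_query, n_train)
    d = np.sqrt(((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))
    idx = np.argsort(d, axis=1)[:, :k]         # indices of the k nearest neighbours
    d_k = np.take_along_axis(d, idx, axis=1)   # their distances
    y_k = y_train[idx]                         # their labels
    if weights == "uniform":
        w = np.ones_like(d_k)
    else:
        zero = d_k == 0
        with np.errstate(divide="ignore"):
            w = 1.0 / d_k
        # An exact match (distance 0) takes the whole vote
        has_zero = zero.any(axis=1)
        w[has_zero] = zero[has_zero].astype(float)
    score1 = (w * (y_k == 1)).sum(axis=1)
    score0 = (w * (y_k == 0)).sum(axis=1)
    return (score1 > score0).astype(int)

# Two overlapping synthetic classes, 100 samples each
rng = np.random.default_rng(0)
X_syn = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                   rng.normal(1.0, 1.0, size=(100, 2))])
y_syn = np.array([0] * 100 + [1] * 100)

acc_distance = (knn_predict(X_syn, y_syn, X_syn, weights="distance") == y_syn).mean()
acc_uniform = (knn_predict(X_syn, y_syn, X_syn, weights="uniform") == y_syn).mean()
print(f"training accuracy, distance: {acc_distance:.2f}")
print(f"training accuracy, uniform:  {acc_uniform:.2f}")
```

This is why the two weighting schemes are better compared on held-out data, such as the cross-validation folds and the test set used above, than on the training set.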

To conclude, <code>distance</code> weighting is worth trying when the model appears to be underfitting, which is often characterized by many "average" predictions.