# Diabetes Diagnostics Using KNN

Author: Sahngyoon Rhee

K-Nearest-Neighbors (KNN) is a supervised machine learning algorithm commonly used for classification. To classify a datapoint, it finds the $K$ training points closest to that datapoint and predicts the class label that appears most often among them.

For example, if $K = 7$ and $4$ of a datapoint's seven closest neighbors are blue and $3$ are red, that datapoint would be classified as blue.

<img src='https://th.bing.com/th/id/OIP.hiUeM5Zt_4iruDN49aHAugHaDt?rs=1&pid=ImgDetMain'>

<p style="text-align: center;"> Demonstration of the KNN algorithm. </p>
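The majority vote described above can be sketched in a few lines of plain Python. This is only an illustration, not how `sklearn` implements KNN internally: the helper `knn_predict` and the toy points are hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter
import math

def knn_predict(points, labels, query, k):
    """Classify `query` by majority vote among its k nearest points (Euclidean distance)."""
    # Pair each training point's distance to the query with its label
    distances = [(math.dist(p, query), label) for p, label in zip(points, labels)]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among those neighbors and return the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 4 blue and 3 red points near the origin, 2 red points far away
points = [(0, 1), (1, 0), (-1, 0), (0, -1), (1, 1), (-1, -1), (1, -1), (10, 10), (11, 10)]
labels = ["blue", "blue", "blue", "blue", "red", "red", "red", "red", "red"]

prediction = knn_predict(points, labels, query=(0, 0), k=7)
print(prediction)  # blue: 4 of the 7 nearest neighbors are blue
```

Red is the more common label overall in this toy data, but only the seven nearest neighbors get a vote, so the query point is classified as blue.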

We shall build a diabetes diagnostic model using KNN through `sklearn`.

## Reading the training data

We shall read the dataset and display its first few rows. The split into training and test sets comes later.

In [1]:
import pandas as pd

# read data
df = pd.read_csv('data/diabetes/diabetes.csv')

# display sample rows
display(df.head())

print(f"There are {df.shape[0]} observations and "
      f"{df.shape[1]} columns (including the target) in our dataset.")

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |

There are 768 observations and 9 columns (including the target) in our dataset.


## Split up the Predictor and Target Variables

A supervised machine learning algorithm needs the predictor variables (the features) separated from the target variable. We shall then split each into training and test sets.

In [2]:
# Separate predictor and target variables
X = df.drop(columns = ['Outcome'])
y = df['Outcome']

In [3]:
from sklearn.model_selection import train_test_split

# split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, stratify = y)

In the above, `stratify = y` ensures that the training and testing sets keep the same proportion of class labels as the original dataset. This is particularly useful for an imbalanced dataset, as it helps both sets stay representative of the overall class distribution. Note that no `random_state` is set, so the split, and therefore the accuracy reported below, will vary from run to run.
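The effect of `stratify` can be checked on a small example. The sketch below uses hypothetical imbalanced labels (`X_demo`, `y_demo`), not the diabetes data, and fixes `random_state` only so that the illustration is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives and 20 positives
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# random_state is fixed only to make this sketch repeatable
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Each subset keeps the 20% share of positives found in the full label set
print(f"positive share overall:  {y_demo.mean():.2f}")
print(f"positive share in train: {y_tr.mean():.2f}")
print(f"positive share in test:  {y_te.mean():.2f}")
```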

## Building and Training the Model

We now create a KNN classifier and fit it to the training data.

In [4]:
from sklearn.neighbors import KNeighborsClassifier

# Create KNN Classifier
n_neighbors = 3
knn = KNeighborsClassifier(n_neighbors = n_neighbors)

# Fit the classifier to the data
knn.fit(X_train, y_train)

## Testing the model

We shall now run the testing dataset through our model.

In [5]:
# show first five model predictions on the test data
print(knn.predict(X_test)[0:5])

# check accuracy of our model on the test data, i.e. what percentage of the model's predictions are correct
print(f"The KNN with K = {n_neighbors} has an accuracy of {100*knn.score(X_test, y_test):.2f}%")

[1 0 0 0 1]
The KNN with K = 3 has an accuracy of 69.48%
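For a classifier, `score` returns the mean accuracy, that is, the fraction of predictions that match the true labels. A minimal sketch of that calculation, using hypothetical label arrays rather than the model's actual output:

```python
import numpy as np

# Hypothetical true labels and predictions for five patients
y_true = np.array([1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1])

# Accuracy is the share of predictions that match the true labels
accuracy = (y_true == y_pred).mean()
print(f"{100 * accuracy:.2f}%")  # 4 of 5 correct: 80.00%
```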


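## K-fold Cross Validation

We now employ K-fold cross validation to get a more reliable estimate of accuracy. Instead of relying on a single train/test split, the data is divided into several folds, and each fold serves once as the test set while the model is trained on the remaining folds. Note that the number of folds is unrelated to the $K$ in KNN.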

In [6]:
from sklearn.model_selection import cross_val_score

# create a new KNN model
knn_cv = KNeighborsClassifier(n_neighbors = n_neighbors)

# evaluate the model with 5-fold cross validation
cv_scores = cross_val_score(knn_cv, X, y, cv=5)

# print each cv score and take the average
print(cv_scores)
print(f"The Cross Validation score of our KNN model with K = {n_neighbors} is {100*cv_scores.mean():.2f}%")

[0.68181818 0.69480519 0.75324675 0.75163399 0.68627451]
The Cross Validation score of our KNN model with K = 3 is 71.36%
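To make the procedure concrete, the sketch below reproduces `cross_val_score` by hand. It assumes synthetic stand-in data from `make_classification` (not the diabetes dataset) and relies on the fact that, for a classifier with an integer `cv`, `sklearn` uses `StratifiedKFold` without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = KNeighborsClassifier(n_neighbors=3)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# Both approaches should report the same per-fold accuracies
print(manual_scores)
print(library_scores)
```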


## Tuning hyperparameters using GridSearchCV

Our choice of the number of neighbors to consider, `n_neighbors = 3`, was arbitrary. We want to try out several other values so that we may be able to get a better result. To do that, we use `GridSearchCV`.

In general, `GridSearchCV` is used for finding the best hyperparameters. It trains our model multiple times over the range of parameters that we give it and calculates the cross validation score for each combination of hyperparameters, so that we can identify the hyperparameter values that give the best accuracy.

In our case, we will test the values of `n_neighbors` from 1 to 24.

In [7]:
from sklearn.model_selection import GridSearchCV
import numpy as np

# create a new knn model
knn2 = KNeighborsClassifier()

# Create a dictionary of all values we want to test for n_neighbors
param_grid = {'n_neighbors': np.arange(1,25)}

# use gridsearch to test all values for n_neighbors
knn_gridsearch = GridSearchCV(knn2, param_grid, cv=5)

# fit model to data
knn_gridsearch.fit(X,y)

We can now check which value of `n_neighbors` performed best by reading the `best_params_` attribute of the fitted search. The `best_score_` attribute then gives the mean cross-validated accuracy for that value.

In [8]:
# check the best performing n_neighbors value
print("The best choice of the number of neighbors to consider is "
      f"{knn_gridsearch.best_params_['n_neighbors']}, "
      f"which gives us the accuracy of {100*knn_gridsearch.best_score_:.2f}%.")

The best choice of the number of neighbors to consider is 14, which gives us the accuracy of 75.79%.
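The search above amounts to computing the mean cross validation score for each candidate value and keeping the best one. The sketch below does this by hand and compares the result with `GridSearchCV`; it assumes synthetic stand-in data from `make_classification` rather than the diabetes dataset, so the numbers it prints will differ from those above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Manual search: mean 5-fold accuracy for every candidate n_neighbors
candidates = np.arange(1, 25)
mean_scores = np.array([
    cross_val_score(KNeighborsClassifier(n_neighbors=int(k)), X_demo, y_demo, cv=5).mean()
    for k in candidates
])
best_k = int(candidates[mean_scores.argmax()])
best_score = float(mean_scores.max())

# The same search through GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": candidates}, cv=5)
grid.fit(X_demo, y_demo)

print(best_k, round(best_score, 4))
print(grid.best_params_["n_neighbors"], round(grid.best_score_, 4))
```

Beyond the best value, the fitted search also stores the score of every candidate in its `cv_results_` attribute, which is useful for seeing how sensitive accuracy is to `n_neighbors`.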


Thank you for reading!