#### Use the diabetes dataset. The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. Use k-NN to build the classification model. Evaluate your model's performance. Use `GridSearchCV` to find the best value of `k`.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, accuracy_score

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/rahul96rajan/sample_datasets/master/diabetes.csv')
data.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
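
`data.info()` reports no nulls, yet the `data.head()` output shows zeros for `Insulin` and `SkinThickness`, which are hard to read as real measurements. A minimal sketch of how such zeros could be counted is given below. It assumes that a zero in the listed columns stands for "not measured"; that is an assumption about this dataset, not something the notebook states. The `sample` frame is just the five rows printed above, retyped by hand so the snippet runs without a download, and the names `sample`, `zero_as_missing`, and `zero_counts` are illustrative.

```python
import pandas as pd

# The five rows shown by data.head(), retyped by hand for illustration.
sample = pd.DataFrame({
    'Pregnancies': [6, 1, 8, 1, 0],
    'Glucose': [148, 85, 183, 89, 137],
    'BloodPressure': [72, 66, 64, 66, 40],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin': [0, 0, 0, 94, 168],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1],
    'DiabetesPedigreeFunction': [0.627, 0.351, 0.672, 0.167, 2.288],
    'Age': [50, 31, 32, 21, 33],
    'Outcome': [1, 0, 1, 0, 1],
})

# Assumption: a zero in these columns is a placeholder for a missing value.
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Count zeros per column; on the full data, replace `sample` with `data`.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)
```

If the counts on the full data turn out to be large, one option would be to replace the zeros with `NaN` and impute them inside the cross-validation folds, so that the imputation statistics never see the held-out rows.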


In [4]:
X = data.drop('Outcome', axis=1)
y = data['Outcome']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [6]:
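knn = KNeighborsClassifier(n_jobs=-1)
# 4-fold cross-validation, shuffled with a fixed seed for reproducibility
kf = KFold(n_splits=4, shuffle=True, random_state=42)
# Odd k from 3 to 19; p=1 is Manhattan distance, p=2 is Euclidean distance.
# leaf_size tunes the speed of tree-based neighbor search and should not change predictions.
params = {'n_neighbors': np.arange(3, 21, 2),
          'leaf_size': list(range(1, 5)),
          'p': [1, 2]}

# For single-label classification, 'f1_micro' is numerically equal to accuracy
gs = GridSearchCV(estimator=knn, param_grid=params, scoring='f1_micro', cv=kf)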

In [7]:
gs.fit(X_train, y_train)

GridSearchCV(cv=KFold(n_splits=4, random_state=42, shuffle=True),
             estimator=KNeighborsClassifier(n_jobs=-1),
             param_grid={'leaf_size': [1, 2, 3, 4],
                         'n_neighbors': array([ 3,  5,  7,  9, 11, 13, 15, 17, 19]),
                         'p': [1, 2]},
             scoring='f1_micro')

In [8]:
print(gs.best_estimator_)
print(gs.best_score_)

KNeighborsClassifier(leaf_size=1, n_jobs=-1, n_neighbors=15, p=1)
0.752387318563789
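
The search above runs k-NN on the raw features. Because k-NN is distance-based, columns with large numeric ranges (such as `Insulin`) can dominate columns with small ranges (such as `DiabetesPedigreeFunction`). One common remedy is to standardize the features inside a `Pipeline`, so that the scaler is refit on the training part of each fold. The sketch below illustrates the pattern on synthetic data from `make_classification`, not on the diabetes data; the names `X_demo`, `pipe`, and `gs_scaled` are illustrative, and whether scaling helps on the real dataset would need to be checked.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the diabetes features (768 x 8).
X_demo, y_demo = make_classification(n_samples=768, n_features=8,
                                     n_informative=5, random_state=0)
# Stretch one column so the features sit on very different scales.
X_demo[:, 0] *= 100.0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {'knn__n_neighbors': np.arange(3, 21, 2), 'knn__p': [1, 2]}
cv = KFold(n_splits=4, shuffle=True, random_state=42)

gs_scaled = GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy', cv=cv)
gs_scaled.fit(X_tr, y_tr)

print(gs_scaled.best_params_)
print(gs_scaled.best_score_)
```

To try this on the notebook's data, `X_tr` and `y_tr` would be replaced by `X_train` and `y_train`. The grid here leaves out `leaf_size`, since it is a speed setting rather than a modeling choice.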


In [9]:
y_pred_test = gs.predict(X_test)
y_pred_train = gs.predict(X_train)

In [10]:
print('F1 Score(Train): ', f1_score(y_train, y_pred_train, average='micro'))
print('F1 Score(Test): ', f1_score(y_test, y_pred_test, average='micro'))

print('\nAccuracy Score(Train): ', accuracy_score(y_train, y_pred_train))
print('Accuracy Score(Test): ', accuracy_score(y_test, y_pred_test))

F1 Score(Train):  0.783387622149837
F1 Score(Test):  0.8246753246753247

Accuracy Score(Train):  0.7833876221498371
Accuracy Score(Test):  0.8246753246753247
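
The F1 and accuracy lines above are identical because micro-averaged F1 pools true positives, false positives, and false negatives over all classes, and for single-label problems that pooled score reduces to accuracy. The plain-Python sketch below works through this on a small made-up label vector (not taken from the diabetes data) and contrasts it with the F1 of the positive class alone. The helper names `accuracy`, `micro_f1`, and `positive_f1` are illustrative and not part of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 computed from TP/FP/FN counts pooled over every class."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


def positive_f1(y_true, y_pred, positive=1):
    """F1 of the positive class only."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn)


# Made-up labels: one positive case is missed, everything else is correct.
y_true_demo = [1, 0, 0, 0, 1, 0, 0, 1]
y_pred_demo = [1, 0, 0, 0, 0, 0, 0, 1]

print('accuracy    :', accuracy(y_true_demo, y_pred_demo))
print('micro F1    :', micro_f1(y_true_demo, y_pred_demo))
print('positive F1 :', positive_f1(y_true_demo, y_pred_demo))
```

In scikit-learn, the positive-class score corresponds to `f1_score(y_test, y_pred_test)` with its default `average='binary'`, which may be the more informative number when the question is how well diabetic patients are identified.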
