## Pima Indians Diabetes Prediction (Using KNN)

This is a classification project in predictive analytics. Here I will use the k-Nearest Neighbours (KNN) algorithm to predict the occurrence of diabetes.

### Importing Libraries, Dataset, and EDA

In [None]:
#importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

from sklearn.preprocessing import MinMaxScaler



In [None]:
#importing dataset

diabetes_df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

In [None]:
diabetes_df.head()

In [None]:
#assessing target variable distribution
diabetes_df['Outcome'].hist()

In [None]:
#exploring distribution and individual values
sns.pairplot(diabetes_df,hue="Outcome")

In [None]:
diabetes_df.describe()

In [None]:
#assigning input and target variables
X=diabetes_df.drop('Outcome',axis=1)
y=diabetes_df['Outcome']

In [None]:
#rescaling data for analysis
scaler = MinMaxScaler()

X_ = scaler.fit_transform(X)

X_rescaled = pd.DataFrame(X_, columns= X.columns)

X_rescaled.describe()
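To make the rescaling step concrete, here is a minimal sketch of what `MinMaxScaler` computes with its default settings: each column is mapped to the range [0, 1] via (x - min) / (max - min). The array `X_toy` holds made-up values for illustration and is not taken from the diabetes dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (hypothetical values, not from the diabetes data)
X_toy = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [4.0, 600.0]])

# Manual min-max scaling, column by column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X_toy)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

Rescaling matters for KNN because distances are otherwise dominated by whichever feature has the largest numeric range.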



### K-Nearest Neighbour

To begin with, I will split the data into a 70% training set and a 30% test set. After that, I will build a KNN model on the training set and evaluate it on the 30% test set.

I will be using the scikit-learn ML library to split the dataset, build the model, and test the model.

- from sklearn.model_selection import train_test_split: to split our data into 70% training & 30% testing datasets
- from sklearn.neighbors import KNeighborsClassifier: to build the KNN model using the training dataset
- from sklearn.model_selection import cross_val_score: to find the optimal value of k
- from sklearn.metrics import classification_report: to get the classification report (precision, recall, F1, accuracy)
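For intuition about what a KNN classifier does at prediction time, the following is a minimal from-scratch sketch: compute the Euclidean distance from a new point to every training point, take the k closest, and return the majority class. The function `knn_predict_one` and the toy arrays are hypothetical names and values used only for illustration; this is not scikit-learn's actual implementation.

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbours."""
    # Euclidean distance from x_new to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy, already-rescaled data (hypothetical): three points of class 0, two of class 1
X_toy = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.15],
                  [0.80, 0.90], [0.90, 0.80]])
y_toy = np.array([0, 0, 0, 1, 1])

pred_a = knn_predict_one(X_toy, y_toy, np.array([0.12, 0.18]), k=3)
pred_b = knn_predict_one(X_toy, y_toy, np.array([0.85, 0.85]), k=3)
print(pred_a, pred_b)
```

The first query point sits among the class 0 points and the second sits between the two class 1 points, so the votes go to class 0 and class 1 respectively.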

In [None]:
#splitting data 70/30 into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X_rescaled,y,test_size=0.3,random_state=1)
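A common variation on the order of these steps, shown here only as a sketch and not as what this notebook does, is to split first and then fit the scaler on the training split alone, so that the minimum and maximum of the test rows do not influence the rescaling. The data below is synthetic and the names `X_syn`, `y_syn`, and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: 100 rows, 3 numeric features, binary target)
rng = np.random.default_rng(1)
X_syn = rng.uniform(0, 100, size=(100, 3))
y_syn = rng.integers(0, 2, size=100)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)

# ... then learn min/max from the training rows only and reuse them on the test rows
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
print(X_tr_scaled.min(), X_tr_scaled.max())
```

With this ordering the scaled test values can fall slightly outside [0, 1], which is expected.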

In [None]:
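#building KNN model
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train,y_train)

y_pred = knn.predict(X_test)

#ROC AUC is a ranking metric, so score it on the predicted probability of class 1
y_proba = knn.predict_proba(X_test)[:, 1]

print("ROC AUC : ", roc_auc_score(y_test, y_proba))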
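ROC AUC measures how well a model ranks positive cases above negative ones, so it is normally computed from continuous scores such as `predict_proba` output; hard 0/1 labels reduce the ROC curve to a single threshold. The sketch below compares the two on synthetic data from `make_classification` (an assumption for illustration, not the diabetes dataset); the names `model`, `auc_labels`, and `auc_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# Synthetic binary classification data (stand-in for the real dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)

model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# AUC from hard class labels (one operating point only)
auc_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from the predicted probability of class 1 (uses the full ranking)
auc_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print("AUC from labels       :", auc_labels)
print("AUC from probabilities:", auc_scores)
```

The two numbers generally differ, and the probability-based value is the one that matches the `scoring="roc_auc"` option of `cross_val_score`.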

In [None]:
#Determining the optimal value of k based on ROC AUC using the "cross_val_score" function
from sklearn.model_selection import cross_val_score
max_k = 100
cv_score = []

#try k = 1, 2, ..., max_k - 1 and store the mean 5-fold ROC AUC for each
for k in range(1, max_k):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring="roc_auc")
    cv_score.append(scores.mean())

In [None]:
sns.lineplot(x=range(1,max_k),y=cv_score)

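Here, we can see that the optimal value of k lies between 40 and 60. To find the exact optimal value of k, we will use the following expression, which takes the position of the highest score in the list and adds 1 because the list starts at k = 1: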

In [None]:
cv_score.index(max(cv_score))+1

Therefore, the optimal value of k is 43.
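The index-plus-one lookup can also be written with `numpy.argmax` over an explicit list of k values, which avoids the off-by-one bookkeeping. This is a minimal sketch; `cv_score_demo` contains made-up scores for illustration and not results from this notebook.

```python
import numpy as np

# Hypothetical mean CV scores for k = 1, 2, 3, 4 (not real results)
cv_score_demo = [0.71, 0.78, 0.80, 0.79]
k_values = list(range(1, len(cv_score_demo) + 1))

# Position of the best score, mapped back to the k value it belongs to
best_k = k_values[int(np.argmax(cv_score_demo))]

# Same answer as the list-based lookup
assert best_k == cv_score_demo.index(max(cv_score_demo)) + 1
print(best_k)  # 3
```

Both approaches return the first k that reaches the maximum if several values tie.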

In [None]:
#optimized KNN model
knn = KNeighborsClassifier(n_neighbors=43, metric='euclidean')
knn.fit(X_train,y_train)

y_pred = knn.predict(X_test)

#predicted probability of class 1, used for ROC AUC
y_proba = knn.predict_proba(X_test)[:, 1]

print("KNN Model\n")

print("ROC AUC : ", roc_auc_score(y_test, y_proba))

#classification report for KNN model
print(classification_report(y_test,y_pred))
