## KNN - K-Nearest Neighbors 

### Goal: Predict whether a person will have diabetes or not

### KNN is based on feature similarity
1. mostly used for classification
2. classifies a data point based on how its neighbors are classified
3. classifies new instances based on a similarity measure (e.g., Euclidean distance)
4. to find the nearest neighbors, calculate the Euclidean distance from the unknown data point to every point in the dataset
5. take the classes of the k nearest neighbors and assign the class that wins the majority vote
6. k in KNN is a parameter that refers to the number of nearest neighbors to include in the majority voting process
7. how to choose k: 
   - k ~ sqrt(n) is a common rule of thumb, where n is the total number of data points
   - choose an odd value of k to avoid tied votes in binary classification
8. when to use KNN:
   - data is labeled
   - data is noise free
   - dataset is small
9. lazy learner: doesn't learn a discriminative function from the training set; it stores the training data and defers all computation to prediction time
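
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this notebook; the function name `knn_predict` and the toy data are assumptions made up for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # majority vote over the labels of the k nearest neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# toy data (made up for illustration): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, [1.1, 1.0], k=3)
pred_high = knn_predict(X_toy, y_toy, [5.1, 5.0], k=3)
print(pred_low, pred_high)
```

Note that nothing is fitted in advance: every prediction scans the whole training set, which is why KNN is called a lazy learner and why it suits small datasets.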

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

### read dataset

In [2]:
dataset = pd.read_csv('diabetes.csv')  #https://github.com/plotly/datasets/blob/master/diabetes.csv

In [3]:
dataset.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
dataset.shape

(768, 9)

### impute missing values

In [5]:
#zeros in numeric features where zero is not a valid measurement are treated as missing and replaced with the column mean

In [6]:
#dataset[dataset == 0].count(axis=0)  #variant 1
#(~dataset.astype(bool)).sum(axis=0)  #variant 2
dataset.isin([0]).sum()  #var 3: DataFrame.isin(values): whether each element in the DataFrame is contained in values.

Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64

In [7]:
#replace zeros
zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']

In [8]:
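for column in zero_not_accepted:
    dataset[column] = dataset[column].replace(0, np.nan)  #mark zeros as missing (np.NaN was removed in NumPy 2.0; use np.nan)
    mean = int(dataset[column].mean(skipna=True))  #mean of the non-missing values, truncated to an integer
    dataset[column] = dataset[column].replace(np.nan, mean)  #fill missing values with that mean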
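
The same replace-zeros-with-mean idea can be written without a loop. The sketch below uses a small made-up DataFrame (`toy` and `cols` are assumptions for illustration, not the diabetes data), and unlike the cell above it does not truncate the mean to an integer.

```python
import pandas as pd

# toy frame (made up): zeros stand for missing measurements
toy = pd.DataFrame({"Glucose": [100, 0, 140], "BMI": [30.0, 20.0, 0.0]})
cols = ["Glucose", "BMI"]

# mask() turns every zero into NaN so it is excluded from the mean
masked = toy[cols].mask(toy[cols] == 0)
# fill each column's NaN values with that column's mean
toy[cols] = masked.fillna(masked.mean())
print(toy)
```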

In [9]:
dataset.isin([0]).sum()

Pregnancies                 111
Glucose                       0
BloodPressure                 0
SkinThickness                 0
Insulin                       0
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64

### split dataset and scale features

In [10]:
#split dataset
X = dataset.iloc[:,0:8]
y = dataset.iloc[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

In [11]:
#feature scaling: z = (x - u) / s
sc_X = StandardScaler()  #standardize features by removing the mean and scaling to unit variance
X_train = sc_X.fit_transform(X_train)  #fit on the training set only (compute mean and std), then transform it
X_test = sc_X.transform(X_test)  #transform the test set with the mean and std learned from the training set
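
To make the formula z = (x - u) / s concrete, the sketch below checks that a manual computation matches `StandardScaler` on a small made-up array (`X_demo` is an assumption for illustration). It relies on the fact that `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data: 3 samples, 2 features on very different scales
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

u = X_demo.mean(axis=0)  # per-feature mean
s = X_demo.std(axis=0)   # per-feature population std (ddof=0)
z_manual = (X_demo - u) / s

z_sklearn = StandardScaler().fit_transform(X_demo)
print(np.allclose(z_manual, z_sklearn))
```

Scaling matters for KNN in particular because Euclidean distance would otherwise be dominated by the features with the largest numeric range.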

### train classifier

In [12]:
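#determine number of nearest neighbors k with the k ~ sqrt(n) rule of thumb; choose k to be odd
#note: n here is the test-set size (154), not the total number of data points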
import math
math.sqrt(len(y_test))

12.409673645990857

In [13]:
#define model
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')  #metric='euclidean' is the l2 distance; p is the power parameter of the Minkowski metric, so p=2 is redundant here

In [14]:
#fit model
classifier.fit(X_train, y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=11)
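
The square-root rule is only a starting point for k. One common alternative, which this notebook does not use, is to compare several odd values of k with cross-validation. The sketch below runs on synthetic data from `make_classification` rather than the diabetes dataset, and the variable names (`X_syn`, `y_syn`, `scores`, `best_k`) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic binary classification data (not the diabetes dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

scores = {}
for k in range(1, 22, 2):  # odd values of k only: 1, 3, ..., 21
    # the pipeline refits the scaler inside each fold, so no fold sees the others' statistics
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X_syn, y_syn, cv=5).mean()

best_k = max(scores, key=scores.get)  # k with the highest mean accuracy
print(best_k, round(scores[best_k], 3))
```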

### evaluate model

In [15]:
#predict test set results
y_pred = classifier.predict(X_test)

In [16]:
#confusion_matrix: C(i,j) counts samples with actual class i and predicted class j; in binary classification,
#true negatives are C(0,0), false positives C(0,1), false negatives C(1,0) and true positives C(1,1).
#                     Prediction
#                     [Negative] [Positive]
#Actual [Negative]        TN         FP
#       [Positive]        FN         TP

In [17]:
#F1 score: harmonic mean of precision and recall
#F1 = 2/(1/precision + 1/recall); precision = TP/(TP+FP); recall = TP/(TP+FN)

In [18]:
#evaluate model
print(confusion_matrix(y_test, y_pred))
print(f1_score(y_test, y_pred))

[[94 13]
 [15 32]]
0.6956521739130436
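
As a check on the formulas, the F1 score can be recomputed by hand from the four counts in the confusion matrix printed above. The sketch only does arithmetic on those reported numbers.

```python
# counts read off the confusion matrix above: [[TN FP], [FN TP]]
tn, fp, fn, tp = 94, 13, 15, 32

precision = tp / (tp + fp)             # 32 / 45
recall = tp / (tp + fn)                # 32 / 47
f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of precision and recall
print(round(precision, 4), round(recall, 4), round(f1, 4))
# prints: 0.7111 0.6809 0.6957
```

The result agrees with the value returned by `f1_score`.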


### calculate accuracy

In [19]:
#calculate accuracy: (TN+TP)/len(y_test)
print(round(accuracy_score(y_test, y_pred),2))

0.82
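
The relation accuracy = (TN+TP)/n can be verified directly against `accuracy_score`. The sketch below uses short made-up label lists (`y_true_demo`, `y_pred_demo` are assumptions for illustration) and then repeats the arithmetic with the counts reported earlier in this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up labels for illustration
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
acc_manual = (tn + tp) / len(y_true_demo)
acc_sklearn = accuracy_score(y_true_demo, y_pred_demo)
print(acc_manual == acc_sklearn)

# the same arithmetic with the counts reported in this notebook
acc_reported = (94 + 32) / (94 + 13 + 15 + 32)
print(round(acc_reported, 2))
# prints: 0.82
```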
