# k-NN Classifier with Hyperparameter Tuning

> “the k-nearest neighbors algorithm (kNN) is a non-parametric machine learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression.”-Wikipedia

![Source: dslytics.files.wordpress.com](https://dslytics.files.wordpress.com/2017/11/knn.png?w=400)

## Introduction
kNN is a distance-based algorithm with no explicit learning step; instead, the dataset is stored in memory and used to classify new points on the fly. kNN is one of the simplest classification methods.

In kNN, 'k' is a parameter that refers to the number of nearest neighbours. Choosing an optimal 'k' value is often difficult. In this article we will see a basic implementation of kNN, and later on we will look at how to find an optimal 'k' value and how it improves the overall accuracy.


## How does kNN find the nearest neighbours?
When a new query point (say 'p') is added, the classification procedure for 'p' works in two steps:
1. kNN finds the K points in the dataset that are closest to the query point 'p'. Here K is the number of neighbours specified by the user.
2. These K neighbours determine the class of 'p' through a voting mechanism. For regression, the prediction is instead the mean of the neighbours' target values.
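
The two steps above can be sketched from scratch in a few lines. This is only an illustration, assuming Euclidean distance and a plain majority vote; the function name `knn_predict` and the toy data are made up for this example and are not part of scikit-learn or of the dataset used later.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, p, k=3):
    """Classify query point p by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    p = np.asarray(p, dtype=float)
    # Step 1: Euclidean distance from p to every stored point, then keep the k closest
    distances = np.linalg.norm(X_train - p, axis=1)
    nearest = np.argsort(distances)[:k]
    # Step 2: majority vote over the labels of those k points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = ["B", "B", "B", "M", "M", "M"]

print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))  # query near the first cluster
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))  # query near the second cluster
```

Note that nothing is "trained" here: all the work happens at prediction time, which is why kNN is often called a lazy learner.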

### Distance measures used
* Euclidean Distance  
* Manhattan Distance
* Chebyshev Distance
* Minkowski Distance
* Mahalanobis Distance
* Hamming Distance
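
To make the list above concrete, here is a small sketch that evaluates each measure on a pair of made-up vectors using `scipy.spatial.distance`. The vectors and the identity matrix passed to the Mahalanobis distance are assumptions chosen so the results are easy to check by hand.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])  # differs from a by (3, 4, 0)

euclidean = distance.euclidean(a, b)          # sqrt(3^2 + 4^2 + 0^2) = 5.0
manhattan = distance.cityblock(a, b)          # 3 + 4 + 0 = 7.0
chebyshev = distance.chebyshev(a, b)          # max(3, 4, 0) = 4.0
minkowski_p3 = distance.minkowski(a, b, p=3)  # (3^3 + 4^3)^(1/3), about 4.498

# Mahalanobis needs the inverse covariance matrix; with the identity
# matrix (an assumption for this demo) it reduces to Euclidean distance
mahalanobis = distance.mahalanobis(a, b, np.eye(3))  # 5.0

# Hamming: fraction of positions at which two sequences differ
u = [1, 0, 1, 1]
v = [1, 1, 0, 1]
hamming = distance.hamming(u, v)  # 2 of 4 positions differ = 0.5

print(euclidean, manhattan, chebyshev, round(minkowski_p3, 3), mahalanobis, hamming)
```

Minkowski distance generalises two of the others: `p=1` gives Manhattan and `p=2` gives Euclidean, which is what the `p` hyperparameter of `KNeighborsClassifier` controls.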


## Other variants of kNN
1. Radius Neighbour Classifier - Instead of a fixed number of neighbours, it uses all neighbours that fall within a fixed radius of the query point. The Radius Neighbour classifier can be helpful when the data is not uniformly sampled.
2. KD Tree Nearest Neighbour - This method uses a tree-based approach to speed up the neighbour search and is effective when the sample size is large.
3. kNN Regression - In this method the target variable is continuous, so the prediction is the average of the nearest neighbours' target values.
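
All three variants are available in `sklearn.neighbors`. The sketch below runs them on a tiny one-dimensional toy dataset (invented for illustration) so that each prediction can be verified by hand; the radius and neighbour counts are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import (
    KNeighborsClassifier,
    KNeighborsRegressor,
    RadiusNeighborsClassifier,
)

# Hypothetical toy data: one feature, two clusters
X_toy = np.array([[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]])
y_class = np.array(["B", "B", "B", "M", "M", "M"])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# 1. Radius neighbours: vote among every training point within radius 1.0
radius_clf = RadiusNeighborsClassifier(radius=1.0).fit(X_toy, y_class)
radius_pred = radius_clf.predict([[0.4]])  # neighbours 0.0, 0.5, 1.0 are all "B"

# 2. KD tree: same kNN vote, but neighbours are looked up through a KD tree
kd_clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree").fit(X_toy, y_class)
kd_pred = kd_clf.predict([[4.6]])  # neighbours 4.0, 4.5, 5.0 are all "M"

# 3. kNN regression: average the targets of the 3 nearest neighbours
reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_value)
reg_pred = reg.predict([[0.4]])  # mean of 1.0, 2.0, 3.0 = 2.0

print(radius_pred[0], kd_pred[0], reg_pred[0])
```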

## Pros and Cons
Pros - 
* The kNN algorithm is very simple to implement, and it makes predictions on the fly by calculating the similarity between an input sample and each training sample.
* It is robust and works with multiple classes.
* There are many distance measures to choose from.

Cons - 
* Can have a high computation cost when many dimensions are present.
* May suffer from the curse of dimensionality as the number of dimensions increases.
* It is less effective when the class distributions overlap.
* Users may find it challenging to choose an optimal 'k' value.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# KNN Classification
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline 
from sklearn.model_selection import train_test_split
from scipy.stats import zscore
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('/kaggle/input/wisc-bc-data/wisc_bc_data.csv')
df.columns

In [None]:
df.head()

In [None]:
# The first column is the id column, which holds the patient id and has nothing to do with the model attributes. So drop it.

df = df.drop(labels = "id", axis = 1)

In [None]:
X = df.drop(labels="diagnosis", axis = 1)
y = df["diagnosis"]

In [None]:
# Convert the features into z-scores, as we do not know what units / scales were used, and store them in a new dataframe
# It is always advised to scale numeric attributes in models that calculate distances.

XScaled  = X.apply(zscore)  # convert all attributes to Z scale 

XScaled.describe()
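
As a quick check on what `apply(zscore)` does, the sketch below compares it with the textbook formula `(x - mean) / std` on a small made-up dataframe. The column names and values are hypothetical; the only library fact relied on is that `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical features on very different scales
toy = pd.DataFrame({
    "radius": [10.0, 12.0, 14.0, 16.0],
    "area": [300.0, 450.0, 600.0, 750.0],
})

scaled = toy.apply(zscore)  # column-wise z-scores, as in the cell above

# Same computation written out by hand (population std, ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual))      # the two approaches agree
print(scaled.describe().loc[["mean"]])  # each column now has mean close to 0
```

After scaling, both columns contribute comparably to a distance, whereas on the raw values the larger-valued column would dominate.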

The code below gives us the estimated mean accuracy from 10-fold cross-validation. Now we can proceed further.

In [None]:
num_folds = 10
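# random_state only has an effect when shuffle=True; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)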
model = KNeighborsClassifier()
results = cross_val_score(model, XScaled, y, cv=kfold)
print(results.mean())

In [None]:
# Split X and y into training and test sets in a 70:30 ratio


X_train, X_test, y_train, y_test = train_test_split(XScaled, y, test_size=0.30, random_state=1)

In [None]:
# Create the nearest neighbour classifier: 5 neighbours, with votes weighted by inverse distance
KNN = KNeighborsClassifier(n_neighbors=5, weights='distance')

# Fit the model on the training data
KNN.fit(X_train, y_train)

In [None]:
# For every test data point, predict its label based on the 5 nearest neighbours in this model. Because weights='distance',
# closer neighbours count more, and the class with the largest weighted vote is assigned to the test data point

predicted_labels = KNN.predict(X_test)
KNN.score(X_test, y_test)

In [None]:
# calculate accuracy measures and confusion matrix
from sklearn import metrics

print("Confusion Matrix")
cm=metrics.confusion_matrix(y_test, predicted_labels, labels=["M", "B"])

df_cm = pd.DataFrame(cm, index = [i for i in ["M","B"]],
                  columns = [i for i in ["Predict M","Predict B"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True,fmt='.5g',cmap="YlGn")


In [None]:
# Only classification_report is used below; plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.metrics import classification_report

print(classification_report(y_test,predicted_labels))
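
The precision and recall figures in the report follow directly from the confusion matrix. The sketch below works this out on a handful of made-up labels (not the notebook's results), treating "M" as the positive class, and checks the hand calculation against scikit-learn.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, for illustration only
y_true = ["M", "M", "M", "M", "B", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "M", "B", "B", "B", "B", "B", "B", "M"]

# Rows are actual classes, columns are predicted classes, both ordered M then B
cm_demo = confusion_matrix(y_true, y_pred, labels=["M", "B"])
tp, fn = cm_demo[0]  # actual M: predicted M, predicted B
fp, tn = cm_demo[1]  # actual B: predicted M, predicted B

precision_m = tp / (tp + fp)  # of everything predicted M, how much really is M
recall_m = tp / (tp + fn)     # of everything that is M, how much was found
accuracy = (tp + tn) / cm_demo.sum()

print(precision_m, recall_m, accuracy)
print(precision_score(y_true, y_pred, pos_label="M"),
      recall_score(y_true, y_pred, pos_label="M"))
```

For a diagnosis task, recall on the malignant class is usually the number to watch, since a false negative is a missed malignant case.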

In [None]:
from sklearn.model_selection import GridSearchCV
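#List Hyperparameters that we want to tune.
#Note: leaf_size mainly affects the speed and memory of the tree-based neighbour search, so it makes the grid large without changing much
leaf_size = list(range(1,50))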
n_neighbors = list(range(1,30))
p=[1,2]
#Convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)
#Create new KNN object
knn_2 = KNeighborsClassifier()
#Use GridSearch
clf = GridSearchCV(knn_2, hyperparameters, cv=10)
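#Fit the model (note: this uses the full dataset, including the rows in X_test, so the test score below is optimistic)
best_model = clf.fit(XScaled,y)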
#Print The value of best Hyperparameters
print('Best leaf_size:', best_model.best_estimator_.get_params()['leaf_size'])
print('Best p:', best_model.best_estimator_.get_params()['p'])
print('Best n_neighbors:', best_model.best_estimator_.get_params()['n_neighbors'])
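
A variation on the search above is sketched here under a few assumptions: it uses scikit-learn's bundled breast cancer dataset as a stand-in for the Kaggle CSV, puts the scaling inside a `Pipeline` so each cross-validation fold is scaled using only its own training part, and fits the search on the training split only so the test rows stay unseen. The variable names and the smaller grid are choices made for this example, not the notebook's original setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's built-in breast cancer dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

# Scaling is part of the pipeline, so it is re-fit inside every CV fold
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Odd k values help avoid tied votes in a two-class problem
param_grid = {"knn__n_neighbors": list(range(1, 30, 2)), "knn__p": [1, 2]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)  # the test split is never seen during tuning

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 3))
print("Held-out test accuracy:", round(search.score(X_te, y_te), 3))
```

`search.best_params_` returns the winning combination as a dictionary, which is a shorter route than reading each value from `best_estimator_.get_params()`.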

In [None]:
predicted_labels1 = best_model.predict(X_test)
best_model.score(X_test, y_test)

In [None]:
# calculate accuracy measures and confusion matrix
from sklearn import metrics

print("Confusion Matrix")
cm=metrics.confusion_matrix(y_test, predicted_labels1, labels=["M", "B"])

df_cm = pd.DataFrame(cm, index = [i for i in ["M","B"]],
                  columns = [i for i in ["Predict M","Predict B"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True,fmt='.5g',cmap="YlGn")

In [None]:
print(classification_report(y_test,predicted_labels1))

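We can see that accuracy has improved from 96% to 98%. Note that the grid search was fit on the full dataset, which includes the test rows, so the 98% figure is likely optimistic; fitting the search on `X_train` and `y_train` only would give a fairer comparison.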