# K-Nearest Neighbor Classifier
- The K-NN algorithm classifies an unknown data point by finding the most common class among its **k** closest training examples.
- Each of the k closest data points casts a vote, and the class with the most votes wins.
- To apply K-NN, we first need to select a distance metric or similarity function, such as the Euclidean distance (also called the L2 distance):
$$
d(p,q) = \sqrt{\sum_{i=1}^{N}(q_i-p_i)^2}
$$
- Another option is the Manhattan (city block) distance, also called the L1 distance:
$$
d(p,q) = \sum_{i=1}^{N}|q_i-p_i|
$$
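
To make the two distance formulas and the voting rule concrete, here is a minimal from-scratch sketch in NumPy. The function names (`euclidean`, `manhattan`, `knn_predict`) and the toy data are made up for illustration; this is not how scikit-learn implements K-NN internally.

```python
from collections import Counter

import numpy as np


def euclidean(p, q):
    """L2 distance: square root of the summed squared differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((q - p) ** 2)))


def manhattan(p, q):
    """L1 distance: sum of the absolute differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(np.abs(q - p)))


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Label `query` with the most common class among its k nearest training points."""
    dists = [distance(x, query) for x in train_X]
    nearest = np.argsort(dists, kind="stable")[:k]   # indices of the k smallest distances
    votes = Counter(train_y[i] for i in nearest)     # one vote per neighbor
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled "a" and "b".
train_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = ["a", "a", "a", "b", "b", "b"]

print(euclidean([0, 0], [3, 4]))                        # 5.0
print(manhattan([0, 0], [3, 4]))                        # 7.0
print(knn_predict(train_X, train_y, [0.5, 0.5], k=3))   # a
```

The same pair of points is 5 units apart under L2 but 7 units apart under L1, which is why the choice of metric can change which neighbors count as "closest".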

## Hyperparameter Tuning
- There are two hyperparameters to tune:
    - Value of k: If it is too small, we gain efficiency but become more susceptible to noise and outliers. If it is too large, we risk over-smoothing the classification results and increasing bias.
    - Distance metric: L1 (Manhattan) or L2 (Euclidean).
- To tune the hyperparameters, we split our data into three sets: training, validation, and testing.
- Using this three-way split, we can:
    - Train our classifier on the training data using various values of k and distance metrics.
    - Evaluate each classifier on the validation set, keeping track of which parameters obtained the highest accuracy.
    - Take the parameters that obtained the highest accuracy and use them to train the final model.
    - Evaluate that best classifier on the test set.
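
The tuning procedure above can be sketched as a small search over both hyperparameters. This is one possible arrangement, assuming scikit-learn's `load_digits` data and the same split sizes and seeds used later in this notebook. The variable names and the `best` dictionary are illustrative only.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Three-way split: 25% test, then 10% of the remainder for validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.1, random_state=84)

# Try every (metric, k) pair and remember the best validation accuracy.
best = {"score": -1.0, "k": None, "metric": None}
for metric in ("manhattan", "euclidean"):      # L1 and L2
    for k in range(1, 30, 2):                  # odd k from 1 to 29
        model = KNeighborsClassifier(n_neighbors=k, metric=metric)
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)
        if score > best["score"]:              # strict '>' keeps the first of any ties
            best = {"score": score, "k": k, "metric": metric}

# Refit with the selected hyperparameters and touch the test set only once.
final_model = KNeighborsClassifier(n_neighbors=best["k"], metric=best["metric"])
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best, test_accuracy)
```

Keeping the test set out of the loop matters: if it were used to pick k or the metric, the final accuracy would no longer be an unbiased estimate.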

## Recognizing handwritten digits using MNIST
- Five-step pipeline:
    - Step 1: Structuring our initial dataset: Our dataset consists of grayscale images that have already been pre-processed, aligned, and centered, so we can skip this step.
    - Step 2: Splitting the dataset into training, validation, and testing sets.
    - Step 3: Extracting features: We'll use the raw grayscale pixel intensities of each image.
    - Step 4: Training our classification model.
    - Step 5: Evaluating our classifier.
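
As a quick illustration of the feature-extraction step, the sketch below shows what "raw pixel intensities" means for the data returned by scikit-learn's `load_digits`: each 8x8 image is flattened into a 64-element feature vector. The variable name `flattened` is only for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.images.shape)   # (1797, 8, 8)  one 8x8 grayscale image per sample
print(digits.data.shape)     # (1797, 64)    the same pixels as flat feature vectors

# Flattening each image row by row reproduces the feature matrix exactly.
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))   # True
```

Because the features are plain pixel values, the distance between two samples is simply the pixel-wise distance between the two images.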

In [1]:
import sys
sys.path.append("../")

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn import datasets
from skimage import exposure
import numpy as np
import imutils
import cv2
import sklearn
from cv_imshow import display_image, create_subplot
from sklearn.model_selection import train_test_split

In [4]:
# load the digits dataset bundled with scikit-learn: 1,797 grayscale 8x8 images,
# a small MNIST-like sample rather than the full MNIST dataset
mnist = datasets.load_digits()

# hold out 25% of the data for testing; the remaining 75% is used for training
(trainingData, testData, trainLabels, testLabels) = train_test_split(np.array(mnist.data),
                                                                    mnist.target, test_size=0.25, random_state=42)

# take 10% of the training data for validation
(trainData, valData, trainLabels, valLabels) = train_test_split(trainingData,
                                                               trainLabels, test_size=0.1, random_state=84)

print("training data points: {}".format(len(trainLabels)))
print("validation data points: {}".format(len(valLabels)))
print("testing data points: {}".format(len(testLabels)))

training data points: 1212
validation data points: 135
testing data points: 450


In [5]:
# candidate values of k: odd numbers from 1 to 29
kVals = range(1, 30, 2)
accuracies = []

for k in kVals:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(trainData, trainLabels)
    
    score = model.score(valData, valLabels)
    print("k=%d accuracy=%.2f%%" % (k, score*100))
    accuracies.append(score)
    
i = int(np.argmax(accuracies))
print("k=%d achieved highest accuracy of %.2f%% on validation data" % (kVals[i], accuracies[i] * 100))

k=1 accuracy=99.26%
k=3 accuracy=99.26%
k=5 accuracy=99.26%
k=7 accuracy=99.26%
k=9 accuracy=99.26%
k=11 accuracy=99.26%
k=13 accuracy=99.26%
k=15 accuracy=99.26%
k=17 accuracy=98.52%
k=19 accuracy=98.52%
k=21 accuracy=97.78%
k=23 accuracy=97.04%
k=25 accuracy=97.78%
k=27 accuracy=97.04%
k=29 accuracy=97.04%
k=1 achieved highest accuracy of 99.26% on validation data
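
Note that k=1 through k=15 all tie at 99.26% in the output above, and `np.argmax` returns the index of the first maximum, so the smallest tied k is selected. The sketch below reproduces that behavior with the printed accuracies and shows one possible alternative, picking the largest tied k. That alternative is an assumption for illustration, not something this notebook does.

```python
import numpy as np

k_vals = list(range(1, 30, 2))

# Validation accuracies (in percent) copied from the output above.
accuracies = [99.26] * 8 + [98.52, 98.52, 97.78, 97.04, 97.78, 97.04, 97.04]

# np.argmax returns the first index holding the maximum value.
first_best = k_vals[int(np.argmax(accuracies))]

# Collect every k that ties for the best accuracy.
best = max(accuracies)
tied = [k for k, acc in zip(k_vals, accuracies) if acc == best]
largest_tied = tied[-1]

print(first_best)     # 1
print(tied)           # [1, 3, 5, 7, 9, 11, 13, 15]
print(largest_tied)   # 15
```

With only 135 validation points, a single misclassified image changes the accuracy by about 0.74 percentage points, so ties like this are expected.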


In [6]:
#re-train our classifier using the best k value and predict the labels of the
# test data
model = KNeighborsClassifier(n_neighbors=kVals[i])
model.fit(trainData, trainLabels)
predictions = model.predict(testData)
 
# show a final classification report demonstrating the accuracy of the classifier
# for each of the digits
print("EVALUATION ON TESTING DATA")
print(classification_report(testLabels, predictions))

EVALUATION ON TESTING DATA
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        43
           1       0.95      1.00      0.97        37
           2       1.00      1.00      1.00        38
           3       0.98      0.98      0.98        46
           4       0.98      0.98      0.98        55
           5       0.98      1.00      0.99        59
           6       1.00      1.00      1.00        45
           7       1.00      0.98      0.99        41
           8       0.97      0.95      0.96        38
           9       0.96      0.94      0.95        48

    accuracy                           0.98       450
   macro avg       0.98      0.98      0.98       450
weighted avg       0.98      0.98      0.98       450
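
The report above summarizes per-digit precision and recall, but it does not show which digits are confused with which. A confusion matrix is one way to see that. The sketch below rebuilds the same splits and the k=1 model so that it runs on its own; the variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Same split sizes and seeds as the cells above.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.1, random_state=84)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are true digits, columns are predicted digits;
# off-diagonal entries count the misclassifications.
cm = confusion_matrix(y_test, predictions)
print(cm)
```

Each row sums to that digit's `support` value in the report, and the diagonal holds the correctly classified counts.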

