# k-Nearest Neighbors

In this lab you will explore tuning a k-Nearest Neighbors model for image classification.

We will use the [Imagenette dataset](https://github.com/fastai/imagenette), a small subset of the larger ImageNet dataset.  Download the 160 px version and extract it in the same folder as this notebook.

In [37]:
import glob
import imageio
import skimage.color    # rgb2gray
import skimage.feature  # hog
from matplotlib import pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from collections import Counter


Here is some code to load the dataset and remap the labels to nicer names.

In [30]:
label_map = {
    'n01440764':'fish',
    'n02102040':'dog',
    'n02979186':'cassette_tape',
    'n03000684':'chain_saw',
    'n03028079':'church',
    'n03394916':'french_horn',
    'n03417042':'garbage_truck',
    'n03425413':'gas_station',
    'n03445777':'golf_ball',
    'n03888257':'parachute'
}

In [32]:
def load_split(split):
    images = []
    labels = []
    for key,value in label_map.items():
        image_paths = sorted(glob.glob(f'imagenette2-160/{split}/{key}/*.JPEG'))
        for path in image_paths:
            image = imageio.imread(path)
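            # Convert colour images to grayscale; some files are already single-channel.
            if len(image.shape) > 2:
                image = skimage.color.rgb2gray(image)
            # Center-crop whichever side is longer than 160 px so every image is 160 x 160.
            if image.shape[0] > 160:
                image = image[image.shape[0]//2-80:image.shape[0]//2+80,:]
            if image.shape[1] > 160:
                image = image[:,image.shape[1]//2-80:image.shape[1]//2+80]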
            images.append(image)
            labels.append(value)
    return images, labels

train_images, train_labels = load_split('train')

val_images, val_labels = load_split('val')


len of train images 9469


We now have the lists `train_images` and `train_labels` containing the images and labels for the training set, and `val_images` and `val_labels` containing the same for the validation set.

1. Compute a HOG descriptor for each image to make two lists, `train_descriptors` and `val_descriptors` (```skimage.feature.hog```).

In [33]:
def hog_descriptors(images):
    # One HOG feature vector per image, stacked into an (n_images, n_features) array.
    descriptors = []
    for image in images:
        hog_descriptor = skimage.feature.hog(image)
        descriptors.append(hog_descriptor)
    return np.array(descriptors)

train_descriptors = hog_descriptors(train_images)
val_descriptors = hog_descriptors(val_images)


In [34]:
print(f"Number of training descriptors: {len(train_descriptors)}")
print(f"Number of training labels: {len(train_labels)}")
print(f"Number of validation descriptors: {len(val_descriptors)}")
print(f"Number of validation labels: {len(val_labels)}")


Number of training descriptors: 9469
Number of training labels: 9469
Number of validation descriptors: 3925
Number of validation labels: 3925


2. Build a k-nearest neighbors classifier on the training set (```sklearn.neighbors.KNeighborsClassifier```).

This model will find the $k$ nearest neighbors to the query point and output the most common label.  Use the default value of $k$.

Run the model on the validation set and print out the accuracy (```sklearn.metrics.accuracy_score```).
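
To make the voting rule concrete, here is a minimal pure-Python sketch of what a k-nearest neighbors prediction does for a single query point. The function name `knn_predict` and the toy 2-D points are hypothetical and only for illustration; `KNeighborsClassifier` uses its own optimized neighbor search and its own tie-breaking, so this is not the library's implementation. The default of `k=5` mirrors the default `n_neighbors` of `KNeighborsClassifier`.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=5):
    """Return the most common label among the k training points closest to query."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Count the labels of the k closest points and take the majority.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): three 'fish' points near the origin,
# two 'dog' points near (1, 1).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = ['fish', 'fish', 'fish', 'dog', 'dog']

print(knn_predict(points, labels, query=(0.15, 0.15), k=3))
```

In the lab itself the "points" are the HOG descriptors and the labels are the class names, but the rule is the same: find the closest training examples and let them vote.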

In [35]:
knn = KNeighborsClassifier()
knn.fit(train_descriptors, train_labels)
val_predictions = knn.predict(val_descriptors)

accuracy = accuracy_score(val_labels, val_predictions)



In [36]:
print(accuracy)

0.26038216560509553


In [39]:
counter = Counter(val_predictions)
most_common = counter.most_common(1)[0][0]
print(most_common)

golf_ball


3. Test $k$ from 1 to 20 and make a plot of the train and validation accuracy.  Explain how bias and variance change as $k$ increases.  Which is the best setting of $k$?

In [44]:

for i in range(1,21):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(train_descriptors, train_labels)
    val_predictions = knn.predict(val_descriptors)
    accuracy = accuracy_score(val_labels, val_predictions)
    print(f"k is : {i}, with an accuracy of ", accuracy)

k is : 1, with an accuracy of  0.27235668789808914
k is : 2, with an accuracy of  0.27261146496815286
k is : 3, with an accuracy of  0.2647133757961783
k is : 4, with an accuracy of  0.26522292993630575
k is : 5, with an accuracy of  0.26038216560509553
k is : 6, with an accuracy of  0.2647133757961783
k is : 7, with an accuracy of  0.2647133757961783
k is : 8, with an accuracy of  0.26420382165605094
k is : 9, with an accuracy of  0.2619108280254777
k is : 10, with an accuracy of  0.2624203821656051
k is : 11, with an accuracy of  0.2570700636942675
k is : 12, with an accuracy of  0.2580891719745223
k is : 13, with an accuracy of  0.25834394904458596
k is : 14, with an accuracy of  0.25910828025477706
k is : 15, with an accuracy of  0.2593630573248408
k is : 16, with an accuracy of  0.2578343949044586
k is : 17, with an accuracy of  0.25834394904458596
k is : 18, with an accuracy of  0.2573248407643312
k is : 19, with an accuracy of  0.2588535031847134
k is : 20, with an accuracy of  

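I am a bit lost; I expected the bias-variance trade-off to show up more clearly as $k$ increased. My understanding is that a smaller $k$ means less bias (the prediction stays close to the training data) but more variance, while a larger $k$ reduces variance at the cost of more bias. However, the best validation accuracy came from the smallest values: $k=2$ (0.2726), essentially tied with $k=1$ (0.2724), and accuracy drifts down to about 0.26 for larger $k$.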
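
The loop in step 3 prints only the validation accuracy, while the question also asks for the training accuracy and a plot. One way to get both is sketched below. The helper names `sweep_k` and `plot_sweep` are hypothetical, not part of scikit-learn; the sketch assumes the descriptors and labels are in the same form as `train_descriptors`, `train_labels`, `val_descriptors`, and `val_labels` above.

```python
from sklearn.neighbors import KNeighborsClassifier


def sweep_k(train_X, train_y, val_X, val_y, k_values):
    """Fit one k-NN model per k and return (train_acc, val_acc) lists."""
    train_acc, val_acc = [], []
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(train_X, train_y)
        # score() returns mean accuracy on the given data and labels.
        train_acc.append(float(knn.score(train_X, train_y)))
        val_acc.append(float(knn.score(val_X, val_y)))
    return train_acc, val_acc


def plot_sweep(k_values, train_acc, val_acc):
    """Plot train and validation accuracy against k and return the figure."""
    # Imported here so that sweep_k can be used without matplotlib installed.
    from matplotlib import pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(k_values, train_acc, marker='o', label='train')
    ax.plot(k_values, val_acc, marker='o', label='validation')
    ax.set_xlabel('k (n_neighbors)')
    ax.set_ylabel('accuracy')
    ax.legend()
    return fig
```

In the notebook this could be called as `ks = list(range(1, 21))`, then `train_acc, val_acc = sweep_k(train_descriptors, train_labels, val_descriptors, val_labels, ks)` and `plot_sweep(ks, train_acc, val_acc)`. Scoring on the training set roughly triples the prediction work per $k$, so expect the sweep to run noticeably longer than the validation-only loop. At $k=1$ the training accuracy should be at or near 1.0, because each training point is its own nearest neighbor; the gap between that curve and the validation curve is what makes the variance at small $k$ visible.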