# Q1: k-Nearest Neighbor (kNN) exercise

The kNN classifier consists of two stages, plus one hyperparameter to choose:

- During training, the classifier takes the training data and simply remembers it
- During testing, kNN classifies every test image by comparing it to all training images and transferring the labels of the k most similar training examples
- The value of k is cross-validated

In this part of the exercise, you will implement these steps, learn the basic image classification pipeline and cross-validation, and gain proficiency in writing efficient, vectorized code.


In [None]:
# Run some setup code for this notebook.

import random
import numpy as np
from test.data_utils import load_CIFAR10
import matplotlib.pyplot as plt


In [None]:
# Load the raw CIFAR-10 data.
# Make sure you've run datasets/get_datasets.sh beforehand.
cifar10_dir = 'datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, print out the size of the training and test data.



In [None]:
# Visualize some examples from the dataset.
# We show a few examples of training images from each class.


In [None]:
# Subsample the data for more efficient code execution in this exercise


In [None]:
# Reshape the image data into rows


We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps: 

1. First, compute the distances between all test examples and all training examples.
2. Given these distances, for each test example, find the k nearest training examples and have them vote for the label.

Let's begin by computing the distance matrix between all training and test examples. For example, if there are **Ntr** training examples and **Nte** test examples, this stage should result in an **Nte x Ntr** matrix where each element (i,j) is the distance between the i-th test example and the j-th training example.

First, open `test/clf/k_nearest_neighbor.py` and implement the function `compute_distances_two_loops` that uses a (very inefficient) double loop over all pairs of (test, train) examples and computes the distance matrix one element at a time.
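
As a rough illustration of the double-loop approach, here is a minimal sketch on toy data. It assumes the Euclidean (L2) distance, which this notebook does not state explicitly, and the function name `pairwise_l2_two_loops` is hypothetical; it is not the `compute_distances_two_loops` method you are asked to write.

```python
import numpy as np

def pairwise_l2_two_loops(X_test, X_train):
    """Return a (num_test, num_train) matrix of Euclidean distances.

    Assumption: L2 distance; names here are illustrative only.
    """
    num_test, num_train = X_test.shape[0], X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # Distance between the i-th test and j-th training example.
            diff = X_test[i] - X_train[j]
            dists[i, j] = np.sqrt(np.sum(diff ** 2))
    return dists

# Toy data: 2 test points and 3 training points in 2-D.
X_train_toy = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
X_test_toy = np.array([[0.0, 0.0], [3.0, 0.0]])
dists_toy = pairwise_l2_two_loops(X_test_toy, X_train_toy)
print(dists_toy.shape)  # one row per test example, one column per training example
```

Each row of the result corresponds to one test example, so the shape is Nte x Ntr as described above.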

In [None]:
from test.clf import KNearestNeighbor

# Create a kNN classifier instance. 
# Remember that training a kNN classifier is a no-op:
# the classifier simply remembers the data and does no further processing.




In [None]:
# You can visualize the distance matrix: each row is a single test example and
# its distances to training examples


**Inline Question #1:** Notice the structured patterns in the distance matrix, where some rows or columns are visibly brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)

- What in the data is the cause behind the distinctly bright rows?
- What causes the columns?

**Your Answer**: *fill here.*



In [None]:
# Now implement the function predict_labels and run the code below:
# use k = 1 (which is Nearest Neighbor).


# Compute and print the fraction of correctly predicted examples


You should expect to see approximately `27%` accuracy. Now let's try out a larger `k`, say `k = 5`:

You should expect to see a slightly better performance than with `k = 1`.
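
The voting step can be sketched as follows, given a precomputed distance matrix. This is only one possible approach: the function name `predict_from_dists` is hypothetical, labels are assumed to be non-negative integers, and ties are broken toward the smaller label as a side effect of `np.bincount` followed by `np.argmax`.

```python
import numpy as np

def predict_from_dists(dists, y_train, k=1):
    """Predict a label for each row of dists by majority vote.

    Assumptions: dists has shape (num_test, num_train) and y_train
    holds non-negative integer labels. Names are illustrative only.
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test, dtype=y_train.dtype)
    for i in range(num_test):
        # Indices of the k training examples closest to test example i.
        nearest = np.argsort(dists[i])[:k]
        closest_y = y_train[nearest]
        # Count votes per label; argmax picks the smallest label on ties.
        y_pred[i] = np.argmax(np.bincount(closest_y))
    return y_pred

# Toy distance matrix: 2 test examples, 4 training examples.
dists_toy = np.array([[0.1, 0.9, 0.2, 0.8],
                      [0.7, 0.3, 0.6, 0.2]])
y_train_toy = np.array([0, 1, 0, 1])

pred_k1 = predict_from_dists(dists_toy, y_train_toy, k=1)
pred_k3 = predict_from_dists(dists_toy, y_train_toy, k=3)
print(pred_k1, pred_k3)
```

Accuracy is then the fraction of predictions that match the true test labels, for example `np.mean(y_pred == y_test)`.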

In [None]:
# Now let's speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:




# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm of their
# difference. In case you haven't seen it before, this is the square root of
# the sum of squared differences of all elements; in other words, reshape the
# matrices into vectors and compute the Euclidean distance between them.





In [None]:
# Now implement the fully vectorized version inside compute_distances_no_loops
# and run the code below. Check that the distance matrix agrees with the one
# we computed before:




In [None]:
# Let's compare how fast the implementations are.
# You should see significantly faster performance with the fully vectorized implementation.


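The partially and fully vectorized distance computations described above, along with the Frobenius-norm agreement check, can be sketched on toy data like this. The sketch assumes Euclidean distance and uses the expansion ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2; the function names are hypothetical and are not the methods in `k_nearest_neighbor.py`.

```python
import numpy as np

def l2_dists_one_loop(X_test, X_train):
    """One loop over test examples; broadcasting handles the training axis."""
    dists = np.zeros((X_test.shape[0], X_train.shape[0]))
    for i in range(X_test.shape[0]):
        dists[i] = np.sqrt(np.sum((X_train - X_test[i]) ** 2, axis=1))
    return dists

def l2_dists_no_loops(X_test, X_train):
    """No Python loops: expand the squared distance into three terms."""
    test_sq = np.sum(X_test ** 2, axis=1, keepdims=True)  # shape (Nte, 1)
    train_sq = np.sum(X_train ** 2, axis=1)               # shape (Ntr,)
    cross = X_test @ X_train.T                            # shape (Nte, Ntr)
    sq = test_sq - 2.0 * cross + train_sq
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(sq, 0.0))

# Toy data (an assumption for illustration, not CIFAR-10).
rng = np.random.default_rng(0)
X_train_toy = rng.standard_normal((50, 8))
X_test_toy = rng.standard_normal((10, 8))

d_one_loop = l2_dists_one_loop(X_test_toy, X_train_toy)
d_no_loops = l2_dists_no_loops(X_test_toy, X_train_toy)

# Frobenius norm of the difference between the two distance matrices.
difference = np.linalg.norm(d_one_loop - d_no_loops, ord='fro')
print('Difference was:', difference)
```

A difference close to zero (up to floating-point error) suggests the two implementations agree.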

### Cross-validation

You have implemented the k-Nearest Neighbor classifier, but we set the value k = 5 arbitrarily. Now determine the best value of this hyperparameter with cross-validation.
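
A minimal sketch of k-fold cross-validation over candidate values of k, run on toy data. The helper names (`knn_predict`, `cross_validate`), the fold count of 5, and the candidate list `[1, 3, 5]` are all assumptions chosen for illustration, not values prescribed by this notebook.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Hypothetical compact kNN using squared L2 distances."""
    sq = (np.sum(X_test ** 2, axis=1, keepdims=True)
          - 2.0 * X_test @ X_train.T
          + np.sum(X_train ** 2, axis=1))
    nearest = np.argsort(sq, axis=1)[:, :k]
    return np.array([np.argmax(np.bincount(y_train[row])) for row in nearest])

def cross_validate(X, y, k_choices, num_folds=5):
    """Return a dict mapping each k to its list of per-fold accuracies."""
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    k_to_accuracies = {}
    for k in k_choices:
        accuracies = []
        for f in range(num_folds):
            # Fold f is held out for validation; the rest is used for training.
            X_val, y_val = X_folds[f], y_folds[f]
            X_tr = np.concatenate(X_folds[:f] + X_folds[f + 1:])
            y_tr = np.concatenate(y_folds[:f] + y_folds[f + 1:])
            y_pred = knn_predict(X_tr, y_tr, X_val, k)
            accuracies.append(np.mean(y_pred == y_val))
        k_to_accuracies[k] = accuracies
    return k_to_accuracies

# Toy data: two well-separated Gaussian blobs, shuffled.
rng = np.random.default_rng(0)
X_toy = np.concatenate([rng.normal(0.0, 1.0, (50, 2)),
                        rng.normal(5.0, 1.0, (50, 2))])
y_toy = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
perm = rng.permutation(100)
X_toy, y_toy = X_toy[perm], y_toy[perm]

results = cross_validate(X_toy, y_toy, k_choices=[1, 3, 5])
best_k = max(results, key=lambda k: np.mean(results[k]))
print('best k:', best_k)
```

The per-fold accuracies in `results` are what you would plot as raw observations, with their mean and standard deviation per k giving the trend line and error bars.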

In [None]:
# plot the raw observations
# plot the trend line with error bars that correspond to standard deviation


In [None]:
# Based on the cross-validation results above, choose the best value for k,   
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
