# Lab 3 (Part B): Classification with kNN

Make sure that you check the videos of lecture 3 before starting this Lab:
- Linear classification with Logistic Regression: https://youtu.be/KLcKZ-Hs7YY
- Logistic Regression with nonlinear features: https://youtu.be/GOCiEogF-qI

In this part of the Lab, you will implement the *k nearest neighbours* (kNN) classification method, and apply it to a dataset. Your task is to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly.

Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a *k nearest neighbours* classifier.

## Loading the data
We have a file `microchips-dataset.csv` which contains the dataset for our *nonlinear* classification problem. The first column corresponds to the result of "microchip test 1", the second column corresponds to the result of "microchip test 2", and the third column is the class-label indicating if the microchip has been accepted or rejected (1 = Accepted, 0 = Rejected).

<img src="imgs/MicroshipDataLab3B.png" />

Complete the following Python code to load the dataset from the csv file into the variables `X` (input data matrix) and `y` (output class-labels). `X` should be a matrix with $n$ rows and $2$ columns (i.e. two features) corresponding to "microchip test 1" and "microchip test 2". `y` should be a numpy array of $n$ elements.

**Note**: You DO NOT need to add an additional column of all ones to the dataset as we are NOT using a linear model of the form $h_{\theta}(x)={\theta}^T x$ in this part of the Lab.

In [2]:
%matplotlib notebook
import numpy as np

filename = "datasets/microchips-dataset.csv"

""" TODO:
Write the code to load the dataset from the `filename` into the variables X and y.
X should be a numpy array of n rows and 2 columns (the input data matrix).
y should be a numpy array of n elements (the outputs vector).
Try to do it by yourself and only check the code used in previous Labs if you are uncertain.
"""
data = np.genfromtxt(filename, delimiter=",")

X = data[:, :2]
y = data[:, -1]

print(X[:10])
print(y[:10])

[[ 0.051267  0.69956 ]
 [-0.092742  0.68494 ]
 [-0.21371   0.69225 ]
 [-0.375     0.50219 ]
 [-0.51325   0.46564 ]
 [-0.52477   0.2098  ]
 [-0.39804   0.034357]
 [-0.30588  -0.19225 ]
 [ 0.016705 -0.40424 ]
 [ 0.13191  -0.51389 ]]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


## Visualizing the data
As in the previous part of the Lab, it is good practice to visualize the data, if possible, before implementing anything. Complete the following Python code so that it displays a figure like the one shown below. The two dimensions (features) correspond to the two test results, and the class-labels are shown with different markers/colors.

<img src="imgs/MicroshipScatterPlotLab3B.png" width="500px" />

In [3]:
import matplotlib.pyplot as plt

""" TODO:
Write code here to produce a scatter plot of the training 
dataset like the one shown in the figure above.
"""
X0 = X[y==0]
X1 = X[y==1]

fig, ax = plt.subplots()
ax.scatter(X0[:,0], X0[:,1], color='r', marker='x')
ax.scatter(X1[:,0], X1[:,1], color='b', marker='+')
ax.set_xlabel('Microchip Test 1')
ax.set_ylabel('Microchip Test 2')
ax.legend(['Rejected (y=0)', 'Accepted (y=1)'])
plt.title('Plot of the training dataset')


Text(0.5, 1.0, 'Plot of the training dataset')

**Side Note:** From the figure you can see that our dataset cannot be separated into positive (class 1) and negative (class 0) examples by a straight line through the plot. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression can only find a linear decision boundary. If one still wants to use logistic regression, one way to fit the data better is to create more features from each data-point by mapping the features into polynomial terms of $x_1$ and $x_2$ (e.g. $x_1^2$, $x_1 x_2$, etc.). However, in this part of the Lab, we will use kNN, which is a nonlinear classifier.
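
For illustration only, the polynomial feature mapping mentioned in the side note could be sketched as follows. The helper name `map_polynomial_features`, the choice of `degree=2` and the two example points are assumptions made for this sketch; none of them is part of the Lab.

```python
import numpy as np

def map_polynomial_features(X, degree=2):
    """Map two features (x1, x2) to all monomials x1^a * x2^b with 1 <= a + b <= degree.

    Hypothetical helper for illustration; it is not used in the rest of this Lab.
    """
    x1, x2 = X[:, 0], X[:, 1]
    columns = []
    for total in range(1, degree + 1):
        for b in range(total + 1):
            a = total - b
            columns.append((x1 ** a) * (x2 ** b))
    return np.column_stack(columns)

# Two made-up points; with degree=2 the columns are x1, x2, x1^2, x1*x2, x2^2
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
X_poly = map_polynomial_features(X_demo, degree=2)
print(X_poly.shape)
print(X_poly)
```

A linear model trained on the mapped features can then produce a decision boundary that is curved in the original $(x_1, x_2)$ plane.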

## Implementing the k Nearest Neighbours (kNN) classifier

In the following code, you are first asked to implement the distance function `dist(u, v)`, which computes the Euclidean distance between two vectors $u \in \mathbb{R}^d$ and $v \in \mathbb{R}^d$. The Euclidean distance between $u$ and $v$ is defined as: $\left \| u - v \right \| = \sqrt{\sum_{j=1}^{d} (u_j - v_j)^2}$. Note that this is simply the norm of the vector $u - v$, so you can either code it yourself in pure Python or use the numpy function `np.linalg.norm(..)`, which returns the norm of a given vector (or try both to check that your implementation is correct).

Once you have implemented the distance function, you are asked to implement the function `prediction(x, X, y, k=5)`. This function takes as arguments a new data-point $x$ for which we want to predict the class-label, the training input data $X$, the output class-labels $y$, and a parameter $k$ corresponding to the number of nearest neighbours to use. The function should return the predicted class-label for $x$. To help you implement the function, you can follow the comments and make use of some predefined functions such as:
- [numpy.argsort](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html)
- [collections.Counter](https://docs.python.org/dev/library/collections.html#collections.Counter.most_common)
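
As a cross-check for a loop-based implementation, the distances from a point to every row of a data matrix can also be computed in a single call. This is a minimal sketch on made-up data; the arrays `X_demo` and `x_demo` and the names `dists_loop` and `dists_vec` are assumptions for this example.

```python
import numpy as np

X_demo = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [1.0, 1.0]])
x_demo = np.array([0.0, 0.5])

# Loop version: one Euclidean distance per row of X_demo
dists_loop = np.array([np.sqrt(np.sum((x_demo - xi) ** 2)) for xi in X_demo])

# Vectorized version: broadcasting subtracts x_demo from every row, then
# np.linalg.norm with axis=1 takes the norm of each row
dists_vec = np.linalg.norm(X_demo - x_demo, axis=1)

print(dists_loop)
print(dists_vec)
# np.argsort returns the indices that sort the distances in increasing order
print(np.argsort(dists_vec))
```

Both versions should agree up to floating-point rounding; the vectorized one is usually faster on larger datasets.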

In [20]:
from collections import Counter
from numpy.linalg import norm

""" TODO
Implement the definition of the function dist(u, v) which
computes the euclidean distance between two arrays u and v.
"""
def dist(u, v):
    # Square root of the dot product of (u - v) with itself
    return np.sqrt((u - v) @ (u - v))


u = np.array([1, 0])
v = np.array([0, 1])
print(dist(u, v))
print(norm(u - v))  # cross-check against numpy's built-in norm

""" TODO:
Implement the definition of the function prediction(x, X, y, k=5). You should return 
the predicted class-label for x, using the training dataset X, y, and k nearest neighbours.
"""
def prediction(x, X, y, k=5):
    # TODO: Compute the list of distances from x to each point in X
    dists = [dist(x, xi) for xi in X]
    # TODO: Get the list of indices sorted according to their corresponding distance
    order = np.argsort(dists)
    # TODO: Take the class-labels corresponding to the first k indices (closest points to x)
    labels = y[order[:k]]
    # TODO: The predicted class-label is the most common one among these class-labels
    return Counter(labels).most_common(1)[0][0]


""" TODO:
Test your function prediction(x, X, y, k=5) on x = np.array([0, 0]); it 
should return the class-label 1 (i.e. accepted microchip).
"""
x = np.array([0, 0])
prediction(x, X, y, k=5)

1.4142135623730951
1.4142135623730951


1.0

## Plotting the decision boundary



In the following code, a function `plot_decision_boundary(func, X, y, k)` is provided to you. This function takes the prediction function (that you defined previously) as its first argument, and plots the decision boundary together with the training dataset. You can read it if you want, but you DO NOT need to fully understand it. Your task here is simply to call the function `plot_decision_boundary(func, X, y, k)` a couple of times with different values of $k$, and see the difference in the decision boundary. Is the kNN decision boundary more complex when $k$ is smaller? It should be.

*Note*: when the function is called, it can take some time (a few seconds) before the plots are produced.

In [21]:
from matplotlib.colors import ListedColormap

# This function plots the decision boundary and the training dataset
# You can read it if you want, but you don't need to fully understand it.
def plot_decision_boundary(func, X, y, k):
    print("Please wait. This might take few seconds to plot ...")
    min_x1, max_x1 = min(X[:, 0]) - 0.1, max(X[:, 0]) + 0.1
    min_x2, max_x2 = min(X[:, 1]) - 0.1, max(X[:, 1]) + 0.1

    plot_x1, plot_x2 = np.meshgrid(np.linspace(min_x1, max_x1, 50), np.linspace(min_x2, max_x2, 50))
    points = np.c_[plot_x1.ravel(), plot_x2.ravel()]
    preds = np.array([ func(x, X, y, k) for x in points ])
    preds = preds.reshape(plot_x1.shape)

    X0 = X[y==0]
    X1 = X[y==1]

    fig, ax = plt.subplots()
    ax.pcolormesh(plot_x1, plot_x2, preds, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']))
    ax.scatter(X0[:, 0], X0[:, 1], color="red", label="Rejected")
    ax.scatter(X1[:, 0], X1[:, 1], color="blue", label="Accepted")
    ax.set_xlabel("Microchip Test 1")
    ax.set_ylabel("Microchip Test 2")
    ax.set_title("Decision boundary with k = {}".format(k))
    ax.legend()
    fig.show()


""" TODO:
Call here the function plot_decision_boundary(..) a couple of times with 
different values of k, and see the difference in the decision boundary. 
Normally, the decision boundary looks more complex when k is smaller.
"""
plot_decision_boundary(prediction, X, y, k=1)  # with k = 1
plot_decision_boundary(prediction, X, y, k=15) # with k = 15
plot_decision_boundary(prediction, X, y, k=30) # with k = 30


Please wait. This might take few seconds to plot ...

Please wait. This might take few seconds to plot ...

Please wait. This might take few seconds to plot ...

## Evaluating the kNN classifier
One way to evaluate the quality of our classifier is to see how well it predicts on our training set. In this part, your task is to complete the Python code below to report the training accuracy of your classifier by computing the percentage of examples for which you correctly predicted the class-label.

*Note*: We will see later in the course that computing the ***training** accuracy* is NOT a good way to evaluate the quality of your machine learning model.
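
One way to see why training accuracy can be misleading: with $k = 1$, every training point is its own nearest neighbour (at distance 0), so the classifier simply reproduces the training labels, whatever the data look like. The sketch below illustrates this on randomly generated points with random labels. The helper `knn_predict` and the synthetic data are assumptions for this example and are independent of the microchip dataset.

```python
from collections import Counter
import numpy as np

def knn_predict(x, X, y, k):
    """Plain kNN majority vote (hypothetical helper for this illustration)."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y[nearest].tolist()).most_common(1)[0][0]

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(50, 2))   # 50 random points in the plane
y_demo = rng.integers(0, 2, size=50)        # labels unrelated to the features

# Predict each training point from the training set itself, with k = 1
y_pred_demo = np.array([knn_predict(x, X_demo, y_demo, k=1) for x in X_demo])
train_accuracy = np.mean(y_pred_demo == y_demo)
print(train_accuracy)  # 1.0, even though the labels are pure noise
```

A perfect score here says nothing about how the classifier would behave on points it has not seen.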

In [22]:
""" TODO:
Predict the class-labels of the data-points in the training set by calling the function 
prediction(..) on each data-point in X. Then, compute the classification accuracy by comparing 
the predicted class-labels with the actual (true) class-labels y. Use a value of k = 15.
"""

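# Predict the class-label of every training point with k = 15
y_pred = np.array([prediction(x, X, y, 15) for x in X])
# Accuracy is the fraction of predictions that match the true class-labels
accuracy = np.mean(y_pred == y)
print(accuracy)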

0.8050847457627118


## OPTIONAL: Weighted k Nearest Neighbours (kNN) classifier
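This task is optional. Your task here is to write a new function `prediction_weighted(x, X, y, k=5)` that adapts your previous function `prediction(x, X, y, k=5)` to the weighted kNN: each of the $k$ nearest neighbours votes for its class-label with a weight, which you can define as the inverse of its distance to $x$.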
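
To make the effect of the weights concrete, here is a small worked example with three hypothetical neighbours (the distances and labels are made up, and the variable names are assumptions for this sketch). A plain majority vote picks class 1, whereas inverse-distance weighting picks class 0 because the single class-0 neighbour is much closer.

```python
from collections import Counter
import numpy as np

# Distances and class-labels of the k = 3 nearest neighbours (made-up values)
demo_dists = np.array([0.1, 0.5, 0.6])
demo_labels = np.array([0, 1, 1])

# Unweighted kNN: each neighbour counts as one vote
majority = Counter(demo_labels.tolist()).most_common(1)[0][0]

# Weighted kNN: each neighbour votes with weight 1 / distance
demo_weights = 1.0 / demo_dists
votes = Counter()
for label, w in zip(demo_labels.tolist(), demo_weights):
    votes[label] += w
weighted = votes.most_common(1)[0][0]

print(majority)   # 1  (two votes against one)
print(weighted)   # 0  (weight 10 against 1/0.5 + 1/0.6, about 3.67)
```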

In [59]:

""" TODO:
Implement the definition of the function prediction_weighted(x, X, y, k=5). You should return 
the predicted class-label for x, using the training dataset X, y, and k.
"""
def prediction_weighted(x, X, y, k=5):
    # TODO: Compute the list of distances from x to each point in X
    dists = np.array([dist(x, xi) for xi in X])
    # TODO: Get the list of indices sorted according to their corresponding distance
    order = np.argsort(dists)
    # TODO: Take the class-labels corresponding to the first k indices (closest points to x)
    nearest = order[:k]
    # Weight each neighbour by the inverse of its distance; the small constant
    # avoids a division by zero when x coincides with a training point
    weights = 1.0 / (dists[nearest] + 1e-12)
    # TODO: The predicted class-label is the one with the largest total weight
    votes = Counter()
    for label, w in zip(y[nearest], weights):
        votes[label] += w
    return votes.most_common(1)[0][0]

""" TODO
Test the function prediction_weighted(x, X, y, k) by calling it. Then, 
call the function plot_decision_boundary(..) a couple of times with 
different values of k, and see the difference in the decision boundary.
"""

x = np.array([0, 0])
print(prediction_weighted(x, X, y, k=5))

#plot_decision_boundary(prediction_weighted, X, y, k=1)  # with k = 1
#plot_decision_boundary(prediction_weighted, X, y, k=15) # with k = 15
#plot_decision_boundary(prediction_weighted, X, y, k=30) # with k = 30


1.0
