# Nearest Neighbors Homework

In this homework notebook we use **Nearest Neighbors** to classify back injuries for patients in a hospital. For each patient, measurements were taken on the shape and orientation of their pelvis and spine.

The data set contains information from **310** patients. For each patient there are six measurements (the x) and a label (the y). The label takes one of **3** possible values: `'NO'` (normal), `'DH'` (herniated disk), or `'SL'` (spondylolisthesis).

**Note:** Before attempting this homework read the <font color="magenta">*Predicting via Nearest Neighbors*</font> notebook.

# Setup Notebook

We import all necessary packages for the homework. Please read the documentation on
[sklearn.neighbors](http://scikit-learn.org/stable/modules/neighbors.html).


In [1]:
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.neighbors import KDTree

We now download the dataset. We divide the data into a training set of 248 patients and a separate test set of 62 patients. You will need to use the following variables in the exercises:

* **`trainx`** : The training data's features. `trainx` is the data used to predict `trainy`
* **`trainy`** : The training data's labels. The data you're trying to predict.
* **`testx`** : The test data's features.
* **`testy`** : The test data's labels.

Notice that we code `0. = 'NO', 1. = 'DH', 2. = 'SL'` as the outputs for `trainy` and `testy`.

In [2]:
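# Load data set and code labels as 0 = 'NO', 1 = 'DH', 2 = 'SL'
labels = ['NO', 'DH', 'SL']
# Depending on the NumPy version, loadtxt passes bytes or str to converters, so accept both
to_code = lambda s: labels.index(s.decode() if isinstance(s, bytes) else s)
data = np.loadtxt('column_3C.dat', converters={6: to_code})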

# Separate features from labels
x = data[:,0:6]
y = data[:,6]

# Divide into training and test set
training_indices = list(range(0,80)) + list(range(100,148)) + list(range(160,280))
test_indices = list(range(80,100)) + list(range(148,160)) + list(range(280,310))

trainx = x[training_indices,:]
trainy = y[training_indices]
testx = x[test_indices,:]
testy = y[test_indices]
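
As a quick sanity check on the split sizes stated above, the index ranges can be counted without loading the data file. This is only a sketch: the helper names `n_train`, `n_test`, `no_overlap`, and `covers_all` are introduced here for illustration.

```python
# Index ranges copied from the split cell above
training_indices = list(range(0, 80)) + list(range(100, 148)) + list(range(160, 280))
test_indices = list(range(80, 100)) + list(range(148, 160)) + list(range(280, 310))

n_train = len(training_indices)   # 80 + 48 + 120
n_test = len(test_indices)        # 20 + 12 + 30

# The two index sets should not overlap and should together cover all 310 patients
no_overlap = set(training_indices).isdisjoint(test_indices)
covers_all = sorted(training_indices + test_indices) == list(range(310))

print(n_train, n_test, no_overlap, covers_all)   # 248 62 True True
```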

# Exercise 1:

In this exercise we compute nearest neighbors on our dataset using the $\ell_2$ norm (the *"Euclidean distance"*).

Write a function, **NN_L2**, which uses `trainx`, `trainy`, and `testx` to compute the nearest neighbors fit of `testy`. For **NN_L2**, the $\ell_2$ norm should be used as the distance metric.
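
To make the idea concrete, here is a minimal brute-force sketch of nearest-neighbor prediction under the $\ell_2$ norm on made-up toy data. The function name `nn_l2_brute_force` and the `toy_*` arrays are hypothetical and are not part of the homework. Your own **NN_L2** may instead rely on the tree structures imported from `sklearn.neighbors`.

```python
import numpy as np

def nn_l2_brute_force(train_points, train_labels, query_points):
    # Pairwise differences, shape (n_query, n_train, n_features)
    diffs = query_points[:, np.newaxis, :] - train_points[np.newaxis, :, :]
    # Euclidean (l2) distance from every query point to every training point
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the closest training point for each query point
    nearest = dists.argmin(axis=1)
    # Each query point receives the label of its nearest training point
    return train_labels[nearest]

# Toy data (made up for illustration, not the homework data)
toy_trainx = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
toy_trainy = np.array([0.0, 1.0, 2.0])
toy_testx = np.array([[1.0, 1.0], [9.0, 1.0], [4.0, 6.0]])

toy_pred = nn_l2_brute_force(toy_trainx, toy_trainy, toy_testx)
print(toy_pred)   # [0. 2. 1.]
```

Computing all pairwise distances is fine for a few hundred points. Tree-based indexes mainly pay off on larger training sets.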


<font  style="color:blue"> **Code**</font>
```python
# test function 
testy_L2 = NN_L2(trainx, trainy, testx)
print( type( testy_L2) )
print( len(testy_L2) )
print( testy_L2[40:50] )
```

<font  style="color:magenta"> **Output**</font>
```
<class 'numpy.ndarray'>
62
[ 0.  0.  1.  1.  0.  2.  0.  0.  1.  1.]
```


In [3]:
# Modify this Cell

def NN_L2(trainx, trainy, testx):
    # inputs: trainx, trainy, testx <-- as defined above
    # output: an np.array of the predicted values for testy 
    
    ### BEGIN SOLUTION
    ball_tree = BallTree(trainx, metric='euclidean')
    test_neighbors = np.squeeze(ball_tree.query(testx, k=1, return_distance=False))
    return trainy[test_neighbors]
    ### END SOLUTION

In [4]:
# test function 

testy_L2 = NN_L2(trainx, trainy, testx)

assert( type( testy_L2).__name__ == 'ndarray' )
assert( len(testy_L2) == 62 ) 
assert( np.all( testy_L2[30:40] == [ 2.,  2.,  1., 0.,  0., 0.,  0.,  0.,  0.,  0.] )  )

### BEGIN HIDDEN TESTS
assert( np.all( testy_L2[0:10] == [ 2.,  2.,  2.,  2.,  2.,  2.,  0.,  2.,  2.,  2.] ) )
### END HIDDEN TESTS

# Exercise 2:

We now compute nearest neighbors using the $\ell_1$ norm (the *"Manhattan distance"*).

Write a function, **NN_L1**, which again uses `trainx`, `trainy`, and `testx` to compute the nearest neighbors fit of `testy`. For **NN_L1**, the $\ell_1$ norm should be used as the distance metric.

Notice that **NN_L1** and **NN_L2** produce different fits for the test labels.
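
The two norms can disagree about which point is nearest, which is why the fits can differ. The toy example below is a sketch on made-up points, with hypothetical names `query` and `candidates`: the diagonal point is closer in $\ell_2$, while the axis-aligned point is closer in $\ell_1$.

```python
import numpy as np

# Toy points (made up for illustration)
query = np.array([0.0, 0.0])
candidates = np.array([[3.0, 3.0],   # candidate 0
                       [0.0, 5.0]])  # candidate 1

# l2: square root of the sum of squared coordinate differences
l2_dists = np.sqrt(((candidates - query) ** 2).sum(axis=1))  # about [4.24, 5.0]
# l1: sum of absolute coordinate differences
l1_dists = np.abs(candidates - query).sum(axis=1)            # [6.0, 5.0]

nearest_l2 = int(l2_dists.argmin())
nearest_l1 = int(l1_dists.argmin())
print(nearest_l2, nearest_l1)   # 0 1
```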


<font  style="color:blue"> **Code**</font>
```python
# test function 
testy_L2 = NN_L2(trainx, trainy, testx)
testy_L1 = NN_L1(trainx, trainy, testx)

print( type( testy_L1) )
print( len(testy_L1) )
print( testy_L1[40:50] )
print( all(testy_L1 == testy_L2) )
```

<font  style="color:magenta"> **Output**</font>
```
<class 'numpy.ndarray'>
62
[ 0.  0.  0.  1.  0.  2.  0.  0.  1.  1.]
False
```


In [5]:
# Modify this Cell

def NN_L1(trainx, trainy, testx):
    # inputs: trainx, trainy, testx <-- as defined above
    # output: an np.array of the predicted values for testy 
    
    ### BEGIN SOLUTION
    ball_tree = BallTree(trainx, metric="manhattan")
    test_neighbors = np.squeeze(ball_tree.query(testx, k=1, return_distance=False))
    return trainy[test_neighbors]
    ### END SOLUTION

In [6]:
testy_L1 = NN_L1(trainx, trainy, testx)
testy_L2 = NN_L2(trainx, trainy, testx)

assert( type( testy_L1).__name__ == 'ndarray' )
assert( len(testy_L1) == 62 ) 
assert( not all(testy_L1 == testy_L2) )
assert( all(testy_L1[20:30]== [ 2.,  2.,  2.,  2.,  2.,  0.,  2.,  2.,  2.,  2.]) )

### BEGIN HIDDEN TESTS
assert( all( testy_L1[0:10] == [ 2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.]) )
### END HIDDEN TESTS

# Exercise 3:

Below we see that the $\ell_1$ and $\ell_2$ nearest neighbors have different levels of accuracy on the test dataset.

In [7]:
def accuracy(testy, testy_fit):
    return sum(testy==testy_fit)/len(testy) 

print("Accuracy of NN_L1: ", accuracy(testy,testy_L1) )
print("Accuracy of NN_L2: ", accuracy(testy,testy_L2) )

Accuracy of NN_L1:  0.822580645161
Accuracy of NN_L2:  0.774193548387


We will now look deeper into how the $\ell_1$ and $\ell_2$ nearest neighbors fared at predicting `testy` by constructing the <font color="magenta">*confusion matrix*</font>.

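The confusion matrix is a $3\times3$ matrix that shows the number of misclassifications for each label. Rows and columns follow the label coding (0 = NO, 1 = DH, 2 = SL). For example, the entry at row DH, column SL, contains the number of test points whose correct label was DH but which were classified as SL. Correctly classified points are not counted, so the diagonal entries are zero.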

<img style="width:200px" src="confusion_matrix.png">
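
The row and column convention can be illustrated on a handful of made-up labels. This is only a sketch: the `toy_*` names are hypothetical, and it uses NumPy's `np.add.at` to accumulate counts, which is one of several reasonable ways to fill the matrix.

```python
import numpy as np

# Toy labels (made up for illustration): 0 = 'NO', 1 = 'DH', 2 = 'SL'
toy_true = np.array([0., 0., 1., 2., 2., 1.])
toy_pred = np.array([0., 1., 1., 0., 2., 2.])

toy_errors = np.zeros((3, 3))
wrong = toy_true != toy_pred   # only misclassified points are counted

# Rows index the correct label, columns index the predicted label
rows = toy_true[wrong].astype(int)
cols = toy_pred[wrong].astype(int)
np.add.at(toy_errors, (rows, cols), 1)

print(toy_errors)
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```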




Write a function, **confusion**, which, given the correct labels and the fitted labels for `testy`, computes the confusion matrix. The confusion matrix should be an `np.array` of shape `(3,3)`.

<font  style="color:blue"> **Code**</font>
```python
L2_neo = confusion(testy, testy_L2)  
print( type(L2_neo) )
print( L2_neo.shape )
print( L2_neo )
```

<font  style="color:magenta"> **Output**</font>
```
<class 'numpy.ndarray'>
(3, 3)
[[ 0.  9.  2.]
 [ 0.  0.  0.]
 [ 3.  0.  0.]]
```
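
Because this matrix counts only misclassifications, its entries sum to the number of test errors, which ties it back to the accuracy printed earlier. The short check below copies the example output above. The names `L2_errors`, `n_wrong`, and `acc` are introduced here for illustration.

```python
import numpy as np

# Example output for the l2 confusion matrix, copied from above
L2_errors = np.array([[0., 9., 2.],
                      [0., 0., 0.],
                      [3., 0., 0.]])

n_test = 62                         # size of the test set
n_wrong = L2_errors.sum()           # 9 + 2 + 3 misclassified points
acc = (n_test - n_wrong) / n_test   # 48 / 62

print(n_wrong, acc)
```

The result, 48/62 ≈ 0.774, agrees with the accuracy reported for `NN_L2`.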


In [8]:
# Modify this cell

def confusion(testy,testy_fit):
    # inputs: the correct labels, the fitted NN labels 
    # output: a 3x3 np.array representing the confusion matrix as above
    
    ### BEGIN SOLUTION
    confusion = np.zeros( (3,3) )
    for i in range(len(testy)):
        if testy[i] != testy_fit[i]:
            confusion[ int(testy[i]), int(testy_fit[i]) ] += 1
    return confusion
    ### END SOLUTION

In [9]:
# Test Function

L1_neo = confusion(testy, testy_L1)  # <-- 'neo' because it's the Matrix
assert( type(L1_neo).__name__ == 'ndarray' )
assert( L1_neo.shape == (3,3) )
assert( np.all(L1_neo == [[ 0.,  8.,  2.],[ 0.,  0.,  0.],[ 1.,  0.,  0.]]) )

### BEGIN HIDDEN TESTS
L2_neo = confusion(testy, testy_L2)  
assert( np.all(L2_neo == [[ 0.,  9.,  2.],[ 0.,  0.,  0.],[ 3.,  0.,  0.]]) )
### END HIDDEN TESTS
