# *k* Nearest Neighbour
The objective is to work through this notebook and make sure you understand what is going on.  
Where necessary, look up Python help to understand what the methods take as arguments or return.

## Use `NearestNeighbors` to identify neighbours.  

### Athlete Selection Data  
First, load the dataset into a data frame.

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
athlete = pd.read_csv('data/AthleteSelection.csv',index_col = 'Athlete')
athlete.head()

In [None]:
names = athlete.index
names

In [None]:
# Store features and labels in numpy arrays X and y
y = athlete.pop('Selected').values
X = athlete.values
q = [5.0,7.5]
X[0]

In [None]:
athlete

### Plot this dataset

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
color= ['red' if l == 'No' else 'green' for l in y]
x1 = X[:,0]
x2 = X[:,1]
plt.figure(figsize=(8,5))
plt.scatter(x1,x2, color=color)
plt.scatter(q[0],q[1],color='black')
plt.annotate('q',(q[0]+0.05,q[1]))
plt.title("Athlete Selection")
plt.xlabel("Speed")
plt.ylabel("Agility")
plt.grid()
red_patch = mpatches.Patch(color='red', label='Not Selected')
green_patch = mpatches.Patch(color='green', label='Selected')
plt.legend(handles=[red_patch, green_patch],loc=4)
for i, txt in enumerate(names):
    plt.annotate(txt, (x1[i]+0.05, x2[i]))

## Data Normalization
Features may be measured on very different scales.  
(Not really an issue here.)  
Rescale the data so that all features are on the same scale. There are two options:
- N(0,1) rescale with zero mean and unit variance
- MinMax scaling - typically in the range (0,1)
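
As a rough illustration of what the two options compute, here is a minimal sketch in plain NumPy. The array `X_demo` is made-up data, not the athlete dataset, and the variable names are chosen for this example only.

```python
import numpy as np

# Hypothetical feature matrix: rows are examples, columns are features
X_demo = np.array([[2.0, 10.0],
                   [4.0, 20.0],
                   [6.0, 60.0]])

# N(0,1): subtract the column mean and divide by the column standard deviation
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)   # population std (ddof=0), which is what StandardScaler uses
X_std = (X_demo - mean) / std

# MinMax: map each column linearly so its minimum becomes 0 and its maximum 1
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_01 = (X_demo - col_min) / (col_max - col_min)

print(X_std)
print(X_01)
```

In both cases the statistics are computed per column, so each feature is rescaled independently of the others.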

### N(0,1) normalisation

In [None]:
scaler = preprocessing.StandardScaler().fit(X)  # keep a handle on the scaler so it can be reapplied to the query
X_scaled = scaler.transform(X)
q_scaled = scaler.transform([q])
q_scaled

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
color= ['red' if l == 'No' else 'green' for l in y]
x1 = X_scaled[:,0]
x2 = X_scaled[:,1]
plt.figure(figsize=(8,5))
plt.scatter(x1,x2, color=color)
plt.scatter(q_scaled[0,0],q_scaled[0,1],color='black')
plt.annotate('q',(q_scaled[0,0]+0.05,q_scaled[0,1]))
plt.title("Athlete Selection (Normalized)")
plt.xlabel("Speed N(0,1)")
plt.ylabel("Agility N(0,1)")
plt.grid()
red_patch = mpatches.Patch(color='red', label='Not Selected')
green_patch = mpatches.Patch(color='green', label='Selected')
plt.legend(handles=[red_patch, green_patch],loc=4)
for i, txt in enumerate(names):
    plt.annotate(txt, (x1[i]+0.05, x2[i]))

In [None]:
athlete_neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
athlete_neigh.fit(X_scaled)

In [None]:
athlete_neigh

In [None]:
# Find the 2 nearest neighbours for X4
x4 = X_scaled[3]
athlete_neigh.kneighbors([x4], 2, return_distance=True)

# What are the neighbours returned? Are these correct?
# Note that x4 is itself in the data the model was fitted on.

In [None]:
# Find nearest neighbours for X4 within a radius 
athlete_neigh.radius_neighbors([x4], 1.0, return_distance=True)

In [None]:
# Find three nearest neighbours for q - are these correct?
q = [5.0,7.5]
q3n = athlete_neigh.kneighbors(q_scaled, 3)[1][0]
# q3n contains the 'index' of the nearest neighbours
for n in q3n:
    print(names[n])

## *k*-NN Classifier
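
Before using `KNeighborsClassifier`, it may help to see the idea spelled out by hand: find the *k* closest training examples and take a majority vote over their labels. The following is a minimal brute-force sketch under that assumption. The function `knn_predict` and the arrays `X_demo` and `y_demo` are hypothetical and are not part of scikit-learn or of the athlete data.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Hypothetical helper: classify `query` by majority vote of its k nearest rows."""
    # Euclidean distance from the query to every training example
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbours and return the most common one
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up data for illustration only
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_demo = np.array(['No', 'No', 'Yes', 'Yes', 'Yes'])

print(knn_predict(X_demo, y_demo, np.array([0.8, 0.8]), k=3))
```

The scikit-learn classifier does the same job but also offers faster neighbour search structures and other distance metrics.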



In [None]:
kNN = KNeighborsClassifier(n_neighbors = 3)
kNN = kNN.fit(X_scaled,y)
kNN.predict(q_scaled)

In [None]:
q_scaled

## Forecast data

In [None]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix 
from sklearn import preprocessing

train = pd.read_csv('data/Forecast.csv')
train.head(5)

In [None]:
train.shape

In [None]:
y = train.pop('Go-Out').values # y is a numpy array with the class labels

In [None]:
X = train.values.astype(float)  # X is a numpy array with the training data converted to floats

In [None]:
X

In [None]:
X.shape

In [None]:
# Default k-NN metric is Minkowski with p = 2, i.e. Euclidean
forecast_kNN = KNeighborsClassifier(n_neighbors=3) 
forecast_kNN.fit(X,y)

In [None]:
forecast_kNN

In [None]:
# Generate predictions (forecasts) for 2 query examples
xinput = np.array([[8.,70.,11.],
                   [8,69,15]])
forecast_kNN.predict(xinput)

In [None]:
# Explicitly find the neighbours (and distances) for a query
q = [8,69,15]
forecast_kNN.kneighbors([q])

In [None]:
y_dash = forecast_kNN.predict(X) # Use training data as test
print('     y:',y)      #print actuals
print('y_dash:',y_dash) #print predictions

In [None]:
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion)) 

What is the accuracy here?  
What would we expect to happen when k=1? (Try it.)
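
Accuracy is the number of correct predictions divided by the total, which is the sum of the diagonal of the confusion matrix divided by the sum of all its entries. A minimal sketch, using made-up labels `y_true` and `y_pred` rather than the forecast data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical actual and predicted labels, for illustration only
y_true = np.array(['no', 'no', 'yes', 'yes', 'yes', 'no'])
y_pred = np.array(['no', 'yes', 'yes', 'yes', 'no', 'no'])

cm = confusion_matrix(y_true, y_pred)

# Correct predictions lie on the diagonal of the confusion matrix
acc_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(acc_from_cm, accuracy_score(y_true, y_pred))
```

The two printed accuracy values should agree, since `accuracy_score` computes the same ratio directly from the labels.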

### Normalise data
By default, `preprocessing.scale` standardises each feature to zero mean and unit variance, i.e. N(0,1).

In [None]:
X_scaled = preprocessing.scale(X)
forecast_kNN_S = KNeighborsClassifier(n_neighbors=3)
forecast_kNN_S.fit(X_scaled,y)
y_dash = forecast_kNN_S.predict(X_scaled)
confusion = confusion_matrix(y, y_dash)   
print("Confusion matrix:\n{}".format(confusion)) 
print('\n     y:',y)           
print('y_dash:',y_dash)          

What is the accuracy here?   
In this case, scaling actually makes things worse.

In [None]:
help(preprocessing)

In [None]:
X_scaled[:5] # First five rows of the scaled data.

In [None]:
# Find the neighbours of query case q
forecast_kNN_S.kneighbors([q])
# What is wrong with this?
# The model was fitted on scaled data, but the query has not been scaled.

In [None]:
# We need a 'handle' on the scaler so that we can reapply it to the query
scaler = preprocessing.StandardScaler().fit(X) #A scaler object
X_scaled = scaler.transform(X)
q_scaled = scaler.transform([q])
q_scaled[0]

In [None]:
forecast_kNN_S.fit(X_scaled,y)
forecast_kNN_S.kneighbors(q_scaled)
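
One way to avoid forgetting to scale the query is to chain the scaler and the classifier, so that any input is scaled with the training statistics before the neighbours are found. The sketch below uses scikit-learn's `make_pipeline` for this. The data in `X_demo` and `y_demo` are made up, so it is an illustration of the pattern rather than a replacement for the cells above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data with three features, for illustration only
X_demo = np.array([[8., 70., 11.],
                   [9., 30., 5.],
                   [7., 65., 14.],
                   [10., 25., 3.]])
y_demo = np.array(['yes', 'no', 'yes', 'no'])

# The pipeline fits the scaler on the training data and reuses it at predict time
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_demo, y_demo)

# The raw (unscaled) query can be passed directly
print(model.predict([[8, 69, 15]]))
```

The trade-off is that the intermediate scaled values are hidden inside the pipeline, so the explicit `scaler` handle used above is still the clearer choice when you want to inspect them.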

### MinMax Scaling - range (0,1)

In [None]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
X_scaled01 = min_max_scaler.fit_transform(X)
X_scaled01

## Instance weighting
Why should all neighbours have the same impact on the classification?  
Give nearer neighbours a larger vote.
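
A minimal sketch of the idea, assuming each neighbour votes with weight 1/distance, which is how scikit-learn documents `weights='distance'`. The function `weighted_vote` and its inputs are hypothetical and written only for this illustration.

```python
import numpy as np
from collections import defaultdict

def weighted_vote(distances, labels):
    """Hypothetical helper: each neighbour votes with weight 1/distance."""
    distances = np.asarray(distances, dtype=float)
    zero = distances == 0
    if zero.any():
        # A neighbour at distance zero takes the whole vote (avoids division by zero)
        weights = zero.astype(float)
    else:
        weights = 1.0 / distances
    # Sum the weights per class label and return the label with the largest total
    totals = defaultdict(float)
    for w, label in zip(weights, labels):
        totals[label] += w
    return max(totals, key=totals.get)

# One close 'yes' outvotes two distant 'no' neighbours
print(weighted_vote([0.5, 2.0, 2.5], ['yes', 'no', 'no']))
# A neighbour at distance zero decides the outcome on its own
print(weighted_vote([0.0, 0.1, 0.1], ['no', 'yes', 'yes']))
```

With uniform weights the first call would return 'no' by two votes to one, so the weighting can change the prediction.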

In [None]:
forecast_kNN_SW = KNeighborsClassifier(n_neighbors=3,weights='distance')
forecast_kNN_SW.fit(X_scaled,y)
y_dash = forecast_kNN_SW.predict(X_scaled)
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion)) 
print('\n     y:',y)
print('y_dash:',y_dash)

There are no errors now, because each training example is its own nearest neighbour (at distance zero) and so gets the largest vote.