## Implementing $k$-NN Classification

* Apply the $k$-NN Algorithm
* Use Cross Validation
* Apply Scaling

In [None]:
%matplotlib inline
from __future__ import division
import pandas as pd
import numpy as np
#from seaborn import plt
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

A common data set for validating a classification algorithm's performance is the [Fisher Iris data set](http://en.wikipedia.org/wiki/Iris_flower_data_set), which ships with most statistics and machine learning packages.

In [None]:
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets, feature_selection
# sklearn.cross_validation was replaced by sklearn.model_selection (scikit-learn 0.18+)
from sklearn.model_selection import train_test_split, cross_val_score

In [None]:
# the values of k in KNN
# we will examine the performance for different k values and explore which value gives the best result
n_neighbors = list(range(1, 51, 2))
print(n_neighbors)

In [None]:
# Load in the data
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

iris_df['Target'] = iris.target
print('iris data head:')
print(iris_df.head())
print('iris describe():')
print(iris_df.describe())

print("label set: " + repr(iris_df['Target'].unique()))


In [None]:
iris_df.plot?

In [None]:
# Let's explore the data to get some intuition about it
from itertools import combinations

# we'll plot 2x3 figures (why?)
fig, axes = plt.subplots(nrows=2, ncols=3)

colors = ['r', 'g', 'b']
# one panel per pair of feature columns, one color per class
for ax, (x_col, y_col) in zip(axes.ravel(), combinations(range(4), 2)):
    for i in range(3):
        tmp = iris_df[iris_df.Target == i]
        tmp.plot(x=x_col, y=y_col, kind='scatter', c=colors[i], ax=ax)


### Parameter Search

In [None]:
# Create the training (and test) set using scikit-learn's train_test_split function
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=12)

# Try this sequence again with the following random seed.
# observe how it changes the scores of K quite dramatically
# X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=8)

In [None]:
# Loop through each odd neighbors value from 1 to 49 and append
# the test-set score
scores = []
for n in n_neighbors:
    clf = neighbors.KNeighborsClassifier(n)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

In [None]:
plt.plot(n_neighbors, scores, linewidth=3.0)
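The curve above comes from a single train/test split, so its shape depends on the random seed. A minimal sketch of how one might check that sensitivity follows. It assumes a scikit-learn version that provides `sklearn.model_selection`. The helper `scores_for_seed`, the names `k_values` and `best_k_by_seed`, and the extra seeds 21 and 42 are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
k_values = list(range(1, 51, 2))  # same odd k values as n_neighbors above


def scores_for_seed(seed):
    """Return the test accuracy for every k in k_values on one 70/30 split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed)
    return [KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
            for k in k_values]


# Hypothetical set of seeds; 8 and 12 are the two used in this notebook
best_k_by_seed = {}
for seed in (8, 12, 21, 42):
    seed_scores = scores_for_seed(seed)
    # np.argmax returns the first maximum, i.e. the smallest k among ties
    best_k_by_seed[seed] = k_values[int(np.argmax(seed_scores))]
    print("seed {}: best k = {}, accuracy = {:.3f}".format(
        seed, best_k_by_seed[seed], max(seed_scores)))
```

If the best $k$ differs noticeably between seeds, a single split is probably too noisy to pick $k$ from.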

In [None]:
#Why does the classification rate go down with more neighbors?



#If we have N points in our dataset, what would happen if we use N neighbors
#to classify each point?
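
One way to probe the second question empirically is to set $k$ to the number of training points and look at the predictions. This is only a sketch under the same 70/30 split as above. The names `data`, `X_tr`, `X_te`, `y_tr`, `y_te`, `clf_all` and `pred_all` are introduced here for illustration, so that the notebook's own variables are not overwritten.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=12)

# With k equal to the training-set size, every query point has the same
# neighbors: all of the training points. With uniform weights the vote
# is therefore identical for every query.
clf_all = KNeighborsClassifier(n_neighbors=len(X_tr), weights='uniform')
clf_all.fit(X_tr, y_tr)
pred_all = clf_all.predict(X_te)

print("training class counts: {}".format(np.bincount(y_tr)))
print("distinct predictions:  {}".format(np.unique(pred_all)))
print("test accuracy:         {:.3f}".format(clf_all.score(X_te, y_te)))
```

Comparing the class counts with the distinct predictions shows what the classifier reduces to in this limit.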




### Application of Cross Validation

The single train/test split above suggests that $k = 11$ gives a good result without overfitting the data. Because that estimate depends on one particular split, we check it with cross validation.

In [None]:
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

from sklearn.model_selection import cross_val_score
clf = neighbors.KNeighborsClassifier(11, weights='uniform')
# cross_val_score fits a fresh copy of clf on each training fold,
# so no separate clf.fit call is needed here
scores = cross_val_score(clf, iris_df.values, iris.target, cv=5)


In [None]:
print(scores)
print(scores.mean())
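
The cell above cross-validates only $k = 11$. A natural extension, sketched here as one possible approach rather than the notebook's own method, is to cross-validate every candidate $k$ and compare the mean scores. The names `data`, `k_values`, `cv_means` and `best_index` are assumptions made for this example, and the import assumes `sklearn.model_selection` is available.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
k_values = list(range(1, 51, 2))  # same odd k values as n_neighbors above

# Mean 5-fold accuracy for each k
cv_means = []
for k in k_values:
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), data.data, data.target, cv=5)
    cv_means.append(fold_scores.mean())

# np.argmax returns the first maximum, i.e. the smallest k among ties
best_index = int(np.argmax(cv_means))
print("best k by 5-fold CV: {} (mean accuracy {:.3f})".format(
    k_values[best_index], cv_means[best_index]))
```

Plotting `cv_means` against `k_values` gives a curve that can be compared directly with the single-split curve from the parameter search.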

### Visualization of the Decision Boundary between Classes

We will consider only the last two features of the dataset (petal length and petal width) for this visualization.

In [None]:
clf = neighbors.KNeighborsClassifier(11, weights='uniform')
clf.fit(iris.data[:, 2:4], iris.target)

In [None]:
h = 0.01  # step size in the mesh
# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

In [None]:
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, y_min = iris_df.min()[['petal length (cm)', 'petal width (cm)']]
x_max, y_max = iris_df.max()[['petal length (cm)', 'petal width (cm)']]


* [np.meshgrid](http://docs.scipy.org/doc/numpy/reference/generated/numpy.meshgrid.html) (build grid)
* [ravel](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html) (flatten)
* [np.c_](http://docs.scipy.org/doc/numpy-1.6.0/reference/generated/numpy.c_.html#numpy.c_) (stack as columns)
    * `np.c_[np.array([1,2,3]), np.array([4,5,6])]` gives `[[1, 4],[2, 5],[3, 6]]`
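
To see how these three pieces fit together before applying them to the full mesh, here is a small sketch on a 3x2 grid. The arrays `xs` and `ys` and the stand-in `values` (used in place of classifier predictions) are made up for illustration.

```python
import numpy as np

xs = np.array([0, 1, 2])   # x coordinates of the grid
ys = np.array([10, 20])    # y coordinates of the grid

# meshgrid: two 2-D arrays of shape (len(ys), len(xs)) holding the
# x and y coordinate of every grid node
xx, yy = np.meshgrid(xs, ys)

# ravel flattens each array; np.c_ pairs them up as columns, giving
# one (x, y) row per grid node, the input shape that predict() expects
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points)

# Stand-in for clf.predict(grid_points): one value per grid node
values = grid_points.sum(axis=1)

# Reshape back to the grid layout so the result lines up with xx and yy
print(values.reshape(xx.shape))
```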

In [None]:
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

In [None]:
# Put the result into a color plot
Z = Z.reshape(xx.shape)

In [None]:
plt.figure(figsize=(18,6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

In [None]:
# Also plot the training points
plt.figure(figsize=(18,6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
plt.scatter(iris_df['petal length (cm)'], iris_df['petal width (cm)'], c=iris.target, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = {}, weights = '{}')".format(clf.n_neighbors, clf.weights))

### Scaling

In [None]:
iris_df.describe()

In [None]:
from sklearn.preprocessing import scale

In [None]:
df_norm = pd.DataFrame(scale(iris.data), columns=iris.feature_names)

In [None]:
df_norm.head()

In [None]:
df_norm.describe()
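
As a sanity check on what `scale` does, the sketch below reproduces it by hand: subtract each column's mean and divide by its standard deviation. The names `X` and `manual` are introduced for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

X = load_iris().data

# Standardize each feature column: zero mean, unit standard deviation.
# np.std defaults to the population standard deviation (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Compare against scikit-learn's result
print(np.allclose(manual, scale(X)))

# pandas describe() reports the sample standard deviation (ddof=1),
# which is why its 'std' row is slightly above 1 for scaled data
print(manual.std(axis=0, ddof=1))
```

Scaling matters for $k$-NN because distances are dominated by whichever feature has the largest numeric range.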

## Lab

1. Rerun the [parameter search](#Parameter-Search) with `random_state=8`. Do you get the same result for the optimal $k$?
2. Rerun the whole lab using [scaled](#Scaling) data.
3. (Advanced) Write your own `classifyByKNeighbors` method:
```
score = classifyByKNeighbors(k, X_train, y_train, X_test, y_test)
```
or even better, your own `MyKNeighborsClassifier` class:
```
clf = MyKNeighborsClassifier(k)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
```