# HOMEWORK: k-Nearest Neighbors

In [3]:
import os

import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 100)

# grid_search and cross_validation were folded into model_selection (scikit-learn 0.18+)
from sklearn import preprocessing, neighbors, model_selection

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [4]:
df = pd.read_csv('dataset-boston.csv')

In [5]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,BLACK,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


The Boston dataset concerns itself with housing values in suburbs of Boston.  A description of the dataset is as follows:

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sqft
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River binary/dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate (per ten thousands of dollars)
- PTRATIO: pupil-teacher ratio by town
- BLACK: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes (in thousands of dollars)

## Question 1.  
+ Let's first categorize `MEDV` into 4 groups: the bottom 20% as Level 1, the next 30% as Level 2, the next 30% as Level 3, and the top 20% as Level 4.  
+ Please create a new variable `MEDV_Category` that stores the level number
+ Remember the quantile function
+ Remember how to segment your pandas data frame

In [18]:
# Upper bounds of Levels 1-3: the 20th, 50th and 80th percentiles of MEDV
level_1_ceil = df.MEDV.quantile(0.2)
level_2_ceil = df.MEDV.quantile(0.5)
level_3_ceil = df.MEDV.quantile(0.8)
# Start everything at Level 1, then overwrite the higher bands
df['MEDV_Category'] = 1
df.loc[(df.MEDV <= level_2_ceil) & (df.MEDV > level_1_ceil), 'MEDV_Category'] = 2
df.loc[(df.MEDV <= level_3_ceil) & (df.MEDV > level_2_ceil), 'MEDV_Category'] = 3
df.loc[(df.MEDV > level_3_ceil), 'MEDV_Category'] = 4
df.head(10)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,BLACK,LSTAT,MEDV,MEDV_Category
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,3
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,3
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,4
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,4
5,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7,4
6,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9,3
7,0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1,3
8,0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5,2
9,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9,2
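The same 20/30/30/20 split can also be sketched with `pd.qcut`, which computes the quantile cut points and assigns the labels in one call. The `medv` series below is made-up toy data, not the Boston column, and `levels` is just an illustrative name.

```python
import pandas as pd

# Toy stand-in for the MEDV column (hypothetical values)
medv = pd.Series([5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0], name='MEDV')

# Cut at the 20th, 50th and 80th percentiles; intervals are closed on the right,
# matching the (floor < MEDV <= ceil) logic used in the cell above
levels = pd.qcut(medv, q=[0, 0.2, 0.5, 0.8, 1.0], labels=[1, 2, 3, 4]).astype(int)

print(levels.tolist())
```

Compared with chained `df.loc` assignments, this keeps the band boundaries in a single list, which may be easier to change later.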


### Our goal is to predict `MEDV_Category` based on `RM`, `PTRATIO`, and `LSTAT`

## Question 2.  

+ First normalize `RM`, `PTRATIO`, and `LSTAT`.  
+ By normalizing, we mean to scale each variable between 0 and 1 with the lowest value as 0 and the highest value as 1

+ Check out the documentation for MinMaxScaler()

In [93]:
y = list(df.MEDV_Category)
x = df[['RM', 'PTRATIO', 'LSTAT']].copy()
mms = preprocessing.MinMaxScaler()
X = mms.fit_transform(x)
print(X[:5])

[[ 0.57750527  0.28723404  0.08967991]
 [ 0.5479977   0.55319149  0.2044702 ]
 [ 0.6943859   0.55319149  0.06346578]
 [ 0.65855528  0.64893617  0.03338852]
 [ 0.68710481  0.64893617  0.09933775]]
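As a quick check of what min-max scaling does, the sketch below applies the formula `(x - min) / (max - min)` column by column and compares it with `MinMaxScaler`. The array `toy` holds made-up numbers, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix: 3 samples, 2 features (hypothetical values)
toy = np.array([[4.0, 12.0],
                [6.0, 18.0],
                [8.0, 21.0]])

# Manual min-max scaling, one column at a time
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

print(manual)
print(np.allclose(manual, scaled))
```

Each column ends up with its smallest value at 0 and its largest at 1, which keeps any one feature from dominating the distance computation in k-NN.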


## Question 3.  

+ Run a k-NN classifier with 5 nearest neighbors and report your misclassification error; set weights to uniform
+ Calculate your misclassification error on the training set

In [48]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
fit = knn.fit(X,y)
print(1 - fit.score(X, y))

0.213438735178


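Answer: With `k = 5` and uniform weights, the misclassification error on the training set is about 0.213 (21.3%).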
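To make the voting rule behind a uniform-weight k-NN classifier concrete, here is a minimal from-scratch sketch using Euclidean distance and a majority vote. The function `knn_predict` and the toy arrays are illustrative only; this is not how scikit-learn implements it, and tie-breaking may differ.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Uniform weights: each neighbor contributes one vote
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical): three points of class 1, two of class 4
X_toy = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [1.0, 1.0], [0.9, 1.0]])
y_toy = np.array([1, 1, 1, 4, 4])

pred_low = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([0.95, 0.95]), k=3)
print(pred_low, pred_high)
```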

## Question 4. 
+ Is this error reliable? 
+ What could we do to make it better?

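Answer: Not really. The error is measured on the same data the model was fit on, so it is likely optimistic about performance on unseen data. A more reliable estimate would come from a held-out test set or cross-validation, which can also be used to choose a better `k`.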
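One way to see why a training-set error can mislead is to compare it with the error on held-out data. The sketch below uses synthetic data with noisy labels (not the Boston data) and `k = 1`, where every training point is its own nearest neighbor, so the training error is zero by construction.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (hypothetical): the label depends on feature 0 plus noise
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] + 0.3 * rng.randn(200) > 0.5).astype(int)

# Hold out 30% of the rows for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_error = 1 - model.score(X_tr, y_tr)  # scored on data the model has seen
test_error = 1 - model.score(X_te, y_te)   # scored on unseen data

print(train_error, test_error)
```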

## Question 5.  
+ Now use 10-fold cross-validation to choose the best-performing `k`

In [50]:
params = {
    'n_neighbors': list(range(2, 20)),
    'weights': ['uniform']
}

# 10 consecutive folds (no shuffling), as in the older KFold(len(y), n_folds=10)
kf = model_selection.KFold(n_splits=10)
gs = model_selection.GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(),
    param_grid=params,
    cv=kf
)

gs.fit(X, y)
print(gs.best_score_)
print(gs.best_params_)

0.695652173913
{'n_neighbors': 18, 'weights': 'uniform'}
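To look at how accuracy varies across every candidate `k`, rather than only the winner, the search can be sketched by hand with `cross_val_score`. The data below is synthetic, and the shuffled `KFold` with a fixed `random_state` is an assumption made here for reproducibility; the cell above uses unshuffled folds.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (hypothetical): 150 samples, 3 features, binary label
rng = np.random.RandomState(0)
X_demo = rng.rand(150, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
k_values = list(range(2, 20))

# Mean 10-fold accuracy for each candidate k
mean_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=cv).mean()
    for k in k_values
]

best_k = k_values[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))
```

Plotting `mean_scores` against `k_values` can show whether the best `k` sits on a flat plateau, in which case small changes in the chosen `k` from run to run matter little.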


## Question 6.  

+ Explain your findings
+ What were your best parameters?
+ What was the best k?
+ What was the best model?

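Answer: The grid search selected `n_neighbors = 18` with uniform weights, with a mean 10-fold cross-validated accuracy of about 0.696. This is lower than the accuracy measured on the training set, which suggests the training-set score is optimistic because the model is scored on the same data it was fit on. Note that 18 is close to the upper end of the searched range (2 to 19), so a wider range may be worth trying.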

## Question 7.  

+ Train your model with the optimal `k` you found above 
+ (don't worry if it changes from time to time - if that is the case use the one that is usually the best)

In [64]:
knn = neighbors.KNeighborsClassifier(n_neighbors=18, weights='uniform')
fit = knn.fit(X,y)
print(fit.score(X, y))

0.749011857708


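Answer: With `k = 18`, accuracy on the training set is about 0.749, i.e. a misclassification error of about 0.251.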

## Question 8.  

+ After training your model with that `k`, 
+ use it to *predict* the class of a neighborhood with `RM = 2`, `PTRATIO = 19`, and `LSTAT = 3.5`
+ If you are confused, check out the sklearn documentation for KNN

In [111]:
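# transform() and predict() expect a 2D array: one row per sample, with columns
# in the order used to fit the scaler (RM, PTRATIO, LSTAT).
# Note: these inputs differ from the values stated in the question (2, 19, 3.5).
prediction = knn.predict(mms.transform([[3, 10, 20]]))
print(prediction)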

[2]




Answer: For the input passed to `predict` above, the model returns `MEDV_Category` 2.
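When predicting for new observations, the raw values need the same scaling that was applied to the training data. One way to avoid forgetting that step is to wrap the scaler and the classifier in a single pipeline, sketched below with `make_pipeline`. The rows in `X_raw`, the labels in `y_toy`, and `new_sample` are made-up values that only loosely mimic the `RM`, `PTRATIO`, `LSTAT` columns.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy training data (hypothetical): columns stand in for RM, PTRATIO, LSTAT
X_raw = np.array([
    [6.5, 15.3, 5.0],
    [6.4, 17.8, 9.1],
    [7.2, 17.8, 4.0],
    [5.6, 15.2, 29.9],
    [6.0, 15.2, 17.1],
    [5.9, 21.0, 20.0],
])
y_toy = np.array([3, 3, 4, 2, 2, 1])

# The pipeline scales the inputs, then classifies; predict() reuses the fitted scaler
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_raw, y_toy)

# One new sample in raw (unscaled) units, passed as a 2D list: one row, three columns
new_sample = [[6.3, 16.0, 8.0]]
predicted = pipe.predict(new_sample)
print(predicted)
```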