# Individual practice with kNN

For this exercise, load one of these datasets:

* Wisconsin breast cancer data
* Affairs data
* Project 3 data

---

You will explore the kNN algorithm on the dataset you picked. The practice is open-ended: you choose the target variable and the predictors that interest you.

The steps below are general guidelines. Whether you follow them is up to you.

[Feel free to borrow my matplotlib kNN boundary plotting code from the lecture, since it would be a pain to code up yourself.]

---

## 1. Load data and packages

In [14]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import patsy
%matplotlib inline

import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier

In [15]:
liquor = pd.read_csv('../../../../Iowa_Liquor_Sales_reduced.csv')

In [21]:
liquor.head(2)

| | Date | Store_Number | City | Zip_Code | County_Number | County | Category | Category_Name | Vendor_Number | Item_Number | Item_Description | Bottle_Volume_ml | State_Bottle_Cost | State_Bottle_Retail | Bottles_Sold | Sale_Dollars | Volume_Sold_Liters | Volume_Sold_Gallons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 03/31/2016 | 5029 | DAVENPORT | 52806 | 82.0 | Scott | 1022100.0 | TEQUILA | 370 | 87152 | Avion Silver | 375 | 9.99 | 14.99 | 12 | 179.88 | 4.5 | 1.19 |
| 1 | 03/31/2016 | 5029 | DAVENPORT | 52806 | 82.0 | Scott | 1022100.0 | TEQUILA | 395 | 89197 | Jose Cuervo Especial Reposado Tequila | 1000 | 12.5 | 18.75 | 2 | 37.5 | 2.0 | 0.53 |


In [53]:
invest = liquor[['Store_Number','Sale_Dollars']].groupby('Store_Number').sum()

In [57]:
print("Mean", invest['Sale_Dollars'].mean())
print("Median", invest['Sale_Dollars'].median())
print(invest['Sale_Dollars'].describe())

Mean 249362.021589
Median 86907.28
count    1.403000e+03
mean     2.493620e+05
std      6.182766e+05
min      6.031300e+02
25%      3.789939e+04
50%      8.690728e+04
75%      2.294446e+05
max      1.228265e+07
Name: Sale_Dollars, dtype: float64


Create an invest/don't-invest threshold for the stores at the mean store total, $249,362.
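
One way to turn that threshold into a binary target is sketched below. The `sales` frame is made-up stand-in data, not the Iowa file, and the `invest` column name is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the liquor data: one row per sale.
sales = pd.DataFrame({
    "Store_Number": [1, 1, 2, 3, 3, 3],
    "Sale_Dollars": [100.0, 50.0, 20.0, 400.0, 300.0, 30.0],
})

# Total sales per store, mirroring the groupby above.
store_totals = sales.groupby("Store_Number")[["Sale_Dollars"]].sum()

# Use the mean store total as the cut-off; stores at or above it are labeled 1.
threshold = store_totals["Sale_Dollars"].mean()
store_totals["invest"] = (store_totals["Sale_Dollars"] >= threshold).astype(int)

print(threshold)
print(store_totals)
```

In the real data the mean sits above the 75th percentile reported by `describe()`, so at most about a quarter of the stores would land in the invest class. Keep that imbalance in mind when reading accuracy scores.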

In [17]:
# Columns holding dollar values. These are the original names (with spaces),
# so this cell has to run before the columns are renamed.
dollar_columns = ['State Bottle Cost',
                  'State Bottle Retail',
                  'Sale (Dollars)']

# Strip dollar signs and thousands separators, then convert to float.
liquor[dollar_columns] = (
    liquor[dollar_columns]
    .replace(r'[\$,]', '', regex=True)
    .astype(float)
)
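
To see what the replacement does, here is a minimal sketch on made-up values. The `prices` frame is hypothetical and only imitates how the raw dollar strings might look: the pattern `[\$,]` matches dollar signs and commas, and `astype(float)` converts whatever text is left.

```python
import pandas as pd

# Made-up dollar strings standing in for the raw CSV values.
prices = pd.DataFrame({"Sale (Dollars)": ["$179.88", "$1,037.50"]})

# Remove every "$" and "," and convert the remaining text to float.
cleaned = prices.replace(r"[\$,]", "", regex=True).astype(float)

print(cleaned["Sale (Dollars)"].tolist())
```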

In [20]:
columns = liquor.columns.values

new_cols = []                          # Cleaned names, in the original order
for column in columns:
    col = re.sub(r'\s+', '_', column)  # Replace runs of whitespace with underscores
    col = re.sub(r'[()]', '', col)     # Remove parentheses
    new_cols.append(col)

liquor.columns = new_cols
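
The same renaming can probably be done without the loop by using the pandas string methods on the column index. This is a sketch on an empty, made-up frame (`raw` is hypothetical), assuming a pandas version where `str.replace` accepts the `regex` argument.

```python
import pandas as pd

# Empty frame whose column names imitate the raw liquor headers.
raw = pd.DataFrame(columns=["State Bottle Cost", "Sale (Dollars)", "Volume Sold (Liters)"])

# Whitespace runs become underscores, then parentheses are dropped.
raw.columns = (
    raw.columns
    .str.replace(r"\s+", "_", regex=True)
    .str.replace(r"[()]", "", regex=True)
)

print(list(raw.columns))
```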

## 2. Pick predictors and target of interest

In [23]:
model = KNeighborsClassifier(n_neighbors=3)


In [31]:
formula = "C(Category_Name) ~ Volume_Sold_Liters + Bottles_Sold + Sale_Dollars"

# Target: the liquor category.
# Predictors: volume sold, bottles sold, and total sale dollars.

## 3. Do exploratory data analysis with metrics (correlation, etc.) and plotting

In [32]:
liquor.corr()

| | Store_Number | County_Number | Category | Vendor_Number | Item_Number | Bottle_Volume_ml | State_Bottle_Cost | State_Bottle_Retail | Bottles_Sold | Sale_Dollars | Volume_Sold_Liters | Volume_Sold_Gallons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Store_Number | 1.0 | 0.006428 | -0.01189 | -0.004009 | -0.025724 | -0.057311 | -0.037101 | -0.037164 | 0.014656 | -0.018042 | -0.017166 | -0.017155 |
| County_Number | 0.006428 | 1.0 | -0.007005 | 0.000799 | 0.009171 | -0.026037 | 0.006268 | 0.006235 | 0.017886 | 0.016652 | 0.007545 | 0.007555 |
| Category | -0.01189 | -0.007005 | 1.0 | 0.093939 | 0.111992 | -0.006541 | -0.00764 | -0.007724 | 0.001862 | 0.006617 | -0.0049 | -0.004892 |
| Vendor_Number | -0.004009 | 0.000799 | 0.093939 | 1.0 | 0.134106 | 0.024174 | 0.001371 | 0.001179 | -0.000784 | -0.010784 | -0.006085 | -0.006081 |
| Item_Number | -0.025724 | 0.009171 | 0.111992 | 0.134106 | 1.0 | -0.042872 | 0.08036 | 0.08023 | -0.001215 | 0.01116 | -0.007188 | -0.007178 |
| Bottle_Volume_ml | -0.057311 | -0.026037 | -0.006541 | 0.024174 | -0.042872 | 1.0 | 0.343526 | 0.34395 | -0.013334 | 0.080751 | 0.144235 | 0.144081 |
| State_Bottle_Cost | -0.037101 | 0.006268 | -0.00764 | 0.001371 | 0.08036 | 0.343526 | 1.0 | 0.999991 | -0.030318 | 0.10619 | 0.00845 | 0.008436 |
| State_Bottle_Retail | -0.037164 | 0.006235 | -0.007724 | 0.001179 | 0.08023 | 0.34395 | 0.999991 | 1.0 | -0.030261 | 0.106257 | 0.008641 | 0.008627 |
| Bottles_Sold | 0.014656 | 0.017886 | 0.001862 | -0.000784 | -0.001215 | -0.013334 | -0.030318 | -0.030261 | 1.0 | 0.836082 | 0.890409 | 0.89044 |
| Sale_Dollars | -0.018042 | 0.016652 | 0.006617 | -0.010784 | 0.01116 | 0.080751 | 0.10619 | 0.106257 | 0.836082 | 1.0 | 0.840192 | 0.840204 |
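
In the matrix above, `State_Bottle_Cost` and `State_Bottle_Retail` correlate at 0.999991, so the two carry nearly the same information. Because kNN works on distances, keeping both would count that information twice. A small sketch for listing such pairs follows; the `demo` frame, its column names, and the 0.9 cut-off are all assumptions made up for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cost = rng.uniform(5, 30, size=200)

# Made-up data: "retail" is an exact multiple of "cost", "bottles" is unrelated.
demo = pd.DataFrame({
    "cost": cost,
    "retail": cost * 1.5,
    "bottles": rng.integers(1, 24, size=200),
})

corr = demo.corr()

# Visit each pair of columns once and keep the strongly correlated ones.
high = [
    (a, b, round(corr.loc[a, b], 3))
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > 0.9
]

print(high)
```

From each flagged pair you would usually keep only one column as a predictor.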


In [33]:
Y, X = patsy.dmatrices(formula, data=liquor)

In [34]:
Y = np.ravel(Y)

In [35]:
model.fit(X,Y)

ValueError: Found arrays with inconsistent numbers of samples: [  2703443 197351339]
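
The second count in the error is exactly 73 times the first (2703443 × 73 = 197351339). That is consistent with `C(Category_Name)` on the left-hand side being expanded into 73 indicator columns, which `np.ravel` then flattens into one long vector. This is an inference from the numbers, not something the traceback states. The sketch below reproduces the effect with `pd.get_dummies` on made-up rows and shows one possible way around it: pass the label column itself as the target. The `demo` frame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Made-up rows; the column names follow the cleaned liquor frame.
demo = pd.DataFrame({
    "Category_Name": ["TEQUILA", "VODKA", "TEQUILA", "RUM", "VODKA", "RUM"],
    "Volume_Sold_Liters": [4.5, 2.0, 9.0, 1.5, 3.0, 10.5],
    "Bottles_Sold": [12, 2, 12, 2, 4, 6],
    "Sale_Dollars": [179.88, 37.5, 250.0, 20.0, 45.0, 130.0],
})

# Indicator-coding the target gives one column per category, so flattening
# it yields rows * categories values instead of one value per row.
one_hot = pd.get_dummies(demo["Category_Name"])
flattened = one_hot.to_numpy().ravel()
print(len(demo), flattened.shape)

# Using the label column directly keeps one target value per row.
predictors = ["Volume_Sold_Liters", "Bottles_Sold", "Sale_Dollars"]
clean = demo.dropna(subset=predictors + ["Category_Name"])
X = clean[predictors]
y = clean["Category_Name"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
predictions = model.predict(X)
```

scikit-learn classifiers accept string labels, so the category names do not need to be encoded by hand.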

## 4. Make X and Y cross-validation folds

BONUS: Use StratifiedKFold
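
A minimal sketch of stratified folds, assuming a reasonably recent scikit-learn where `StratifiedKFold` lives in `sklearn.model_selection` (older releases kept it in a different module). The data is synthetic, generated with `make_classification`, and only stands in for your own `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced two-class data standing in for real predictors/target.
X, y = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)

# Five shuffled folds that each keep roughly the overall class proportions.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(X, y))

for train_idx, test_idx in folds:
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    print(len(test_idx), round(y_test.mean(), 3))
```

Stratifying matters most when one class is rare, since a plain split could leave a fold with very few examples of it.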

## 5. Create the kNN classifier from sklearn
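
kNN compares points by distance, so a predictor measured in large units (dollars, say) can drown out one measured in small units (bottle counts). One common option, sketched here as an assumption rather than a required step, is to standardize the predictors in a pipeline before the classifier. The six data points are made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up points: first column on a small scale, second on a large one.
X = np.array([[1, 1000], [2, 1100], [1, 1200],
              [2, 5000], [1, 5200], [2, 5100]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Standardize each column, then classify by the 3 nearest neighbors.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=3, weights="uniform"),
)
knn.fit(X, y)

predictions = knn.predict(np.array([[1.0, 1050.0], [2.0, 5050.0]]))
print(predictions)
```

Putting the scaler inside the pipeline means it is refit on each training fold during cross-validation, so the held-out fold never influences the scaling.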

## 6. Cross-validate accuracy

Try out:

* weights='uniform' and weights='distance'
* 3 different values for k
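
The comparison above could be run as a small loop over both weighting schemes and three values of k. This sketch uses synthetic data from `make_classification` and assumes `cross_val_score` from `sklearn.model_selection`; the k values 3, 7, and 15 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for your own predictors and target.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    random_state=1,
)

# Reuse the same folds for every setting so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for weights in ("uniform", "distance"):
    for k in (3, 7, 15):
        model = KNeighborsClassifier(n_neighbors=k, weights=weights)
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[(weights, k)] = scores.mean()

for (weights, k), accuracy in sorted(results.items()):
    print(f"weights={weights:<8} k={k:<2} mean accuracy={accuracy:.3f}")
```

Compare the scores against the share of the largest class in `y`: a model that does not beat that baseline is not learning much.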

## 7. Plot out data points and boundary for neighbors

This will require you to choose just 2 predictors and a target variable.

Please feel free to borrow my plotting code! If you like, walk through it to understand how it works, and ask me if you want more detail.
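
If you would rather see the idea in miniature first, here is a rough sketch. It is not the lecture code: `plot_knn_boundary` is a hypothetical helper, and the data is synthetic. It fits kNN on two predictors, predicts a class for every point on a grid, and shades the grid by predicted class.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier


def plot_knn_boundary(X, y, n_neighbors=5, weights="uniform", step=0.05, ax=None):
    """Fit kNN on two predictors and shade the predicted class regions."""
    model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    model.fit(X, y)

    # Grid covering the data with a margin of 1 unit on each side.
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))

    # Predict a class for every grid point, then reshape back to the grid.
    grid_pred = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    if ax is None:
        _, ax = plt.subplots(figsize=(6, 5))
    ax.contourf(xx, yy, grid_pred, alpha=0.3, cmap="coolwarm")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k", s=20)
    ax.set_title(f"kNN boundary (k={n_neighbors}, weights={weights})")
    return ax, grid_pred


# Synthetic two-predictor, two-class data standing in for your own columns.
X, y = make_classification(
    n_samples=150, n_features=2, n_informative=2, n_redundant=0,
    random_state=2,
)
ax, grid_pred = plot_knn_boundary(X, y, n_neighbors=5)
```

Try a small k and a large k side by side: small values tend to give jagged regions, while larger ones smooth the boundary.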