In [1]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf #needed for models in this script
import pylab as pl
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

In [2]:
pd.set_option('display.notebook_repr_html', True) #render dataframes as HTML tables in the notebook
%matplotlib inline

## k-Nearest Neighbors

K-Nearest Neighbors is a supervised learning algorithm that attempts to classify a new observation based on the known classification of the observations in our data that are closest to it. In this lesson, we'll discuss the algorithm's steps, and implement the algorithm to classify a new, randomly-generated point in the Iris dataset.

## K-NN Overview

K-Nearest Neighbors (K-NN) is among the simplest of all machine learning algorithms, simple enough that we're going to code each step ourselves in this lesson. K-NN is a supervised classification algorithm that classifies a data point based on a combination of the known classification of the k points that are closest to it. K-Nearest Neighbors does not attempt to fit a model to the data. Rather, the algorithm simply determines the "majority vote" (the class mode) of the k points that are nearest the point you are trying to classify.

Algorithm steps:

1. Determine k.
2. Calculate the distance between the new observation and all points in the training set.
3. Sort the distances and take the k observations with the smallest distances as the nearest neighbors.
4. Determine the class of those neighbors.
5. Determine the majority.
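
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `knn_classify`, the toy points, and the choice of Euclidean distance are all assumptions made for the example (the steps do not prescribe a particular distance measure).

```python
from collections import Counter

import numpy as np


def knn_classify(X_train, y_train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    new_point = np.asarray(new_point, dtype=float)

    # Step 2: Euclidean distance (an assumption) from the new observation
    # to every point in the training set
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))

    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]

    # Step 4: classes of those neighbors
    neighbor_classes = [y_train[i] for i in nearest]

    # Step 5: majority vote (the class mode)
    return Counter(neighbor_classes).most_common(1)[0][0]


# Hypothetical toy data: two small, well-separated clusters
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 7.0], [6.5, 8.0]]
y_toy = ['blue', 'blue', 'blue', 'red', 'red', 'red']

print(knn_classify(X_toy, y_toy, [2.0, 2.0], k=3))  # blue
```

Step 1 (choosing k) shows up here only as the `k` argument; how to pick a sensible value is a separate question.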

A very good post on K-NN: http://blog.yhathq.com/posts/classification-using-knn-and-python.html

<b> Walk through of that post:</b>

K-nearest neighbors, or KNN, is a supervised learning algorithm for either classification or regression. It's super intuitive and has been applied to many types of problems.

It's great for many applications, with personalization tasks being among the most common. To make a personalized offer to one customer, you might employ KNN to find similar customers and base your offer on their purchase behaviors. KNN has also been applied to medical diagnosis and credit scoring.

This is a post about the K-nearest neighbors algorithm and Python.

<b>What is K-nearest neighbors?</b>

Conceptually, KNN is very simple. Given a dataset for which class labels are known, you want to predict the class of a new data point.

The strategy is to compare the new observation to those observations already labeled. The predicted class will be based on the known classes of the nearest k neighbors (i.e. based on the class labels of the other data points most similar to the one you're trying to predict).

<b>An example</b>

Imagine a bunch of widgets which belong to either the "Blue" or the "Red" class. Each widget has 2 variables associated with it: x and y. We can plot our widgets in 2D.

![](files/knn1.jpg)

Now let's say I get another widget (black dot) that also has x and y variables. How can we tell whether this widget is blue or red?

![](files/knn2.jpg)

The KNN approach to classification calls for comparing this new point to the other nearby points. If we were using KNN with 3 neighbors, we'd grab the 3 nearest dots to our black dot and look at the colors. The nearest dots would then "vote", with the more predominant color being the color we'll assign to our new black dot.

![](files/knn3.jpg)

<b>Prediction with 5 Neighbors</b>

If we had chosen 5 neighbors instead of 3 neighbors, things would have turned out differently. Looking at the plot below, we can see that the vote would tally blue: 3, red: 2. So we would classify our new dot as blue.

![](files/knn4.jpg) 
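
To see how the vote can flip with k, here is a small sketch using scikit-learn's `KNeighborsClassifier`. The coordinates are made up for illustration (they are not the points in the figures); they are arranged so that the 3 nearest widgets are mostly red while the 5 nearest are mostly blue.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical widget coordinates, chosen so the vote changes between k=3 and k=5
X_widgets = [[1.0, 0.0], [0.0, 1.1], [4.0, 4.0],
             [-1.2, 0.0], [0.0, -1.5], [-1.0, -1.25], [-4.0, -4.0]]
y_widgets = ['red', 'red', 'red', 'blue', 'blue', 'blue', 'blue']
new_widget = [[0.0, 0.0]]  # the "black dot"

predictions = {}
for k in (3, 5):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_widgets, y_widgets)
    predictions[k] = clf.predict(new_widget)[0]
    print(k, predictions[k])
```

With these points, k=3 sees two red neighbors and one blue, while k=5 sees three blue and two red, so the predicted class changes from red to blue.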

<b>Choosing the right value of k</b>

KNN requires us to specify a value for k. Logically, the next question is "how do we choose k?" It turns out that choosing the right number of neighbors matters. A lot.

But choosing the right k is as challenging as it is important. Generally it's good to start with an odd number for k. This helps avoid situations where your classifier "ties" because two different classes receive the same number of votes. This is particularly true if your dataset has only two classes (e.g. if k=4 and an observation has nearest neighbors ['blue', 'blue', 'red', 'red'], you've got a tie on your hands). The good news is that scikit-learn does a lot to help you find the best value for k.
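
The tie situation is easy to reproduce. The helper below (`vote` is a hypothetical name, used only for this illustration) returns every class that shares the top vote count, so a tie shows up as more than one winner.

```python
from collections import Counter


def vote(neighbor_labels):
    """Return all labels that share the highest vote count."""
    counts = Counter(neighbor_labels).most_common()
    top_count = counts[0][1]
    return [label for label, count in counts if count == top_count]


print(vote(['blue', 'blue', 'red', 'red']))  # k=4: two winners, a tie
print(vote(['blue', 'blue', 'red']))         # k=3: a single winner
```

With two classes, an odd k always yields a single winner; with three or more classes ties are still possible even when k is odd.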

<b>Choose the right k example</b>

Let's take a look at the Wine Quality Data Set (http://archive.ics.uci.edu/ml/datasets/Wine+Quality) from the UCI Machine Learning Repository. Each record consists of some metadata about a particular wine, including the color of the wine (red/white). We're going to use density, sulphates, and residual_sugar to predict color.

Using scikit-learn, we can vary the parameter n_neighbors by just looping through a range of values, calculating the accuracy against a holdout set, and then plotting the results.

Looking at the plot, you can see that the classifier's accuracy peaks at around 23 neighbors and then gradually deteriorates. Although the classifier maxes out at K=23, this doesn't necessarily mean that you should select 23 neighbors.

![](files/knn5.jpg)

Take a look at K=13. With 13 neighbors, the classifier performs almost as well as it does with 23 neighbors. It's also interesting to note that K values 15-21 actually perform worse than K=13. This uneven pattern suggests that the peak at K=23 may reflect noise in the data rather than a real improvement, which is a sign of possible overfitting. Given both the comparable performance and the indication of possible overfitting, I would select 13 neighbors for this classifier.
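
The loop over n_neighbors described above might look roughly like the sketch below. It is not the code from the original post: the data here is synthetic (generated with `make_classification`) so the example runs on its own, and the variable names are assumptions. To reproduce the wine experiment you would swap in the three wine features and the color labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the wine data)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit one classifier per value of k and record its holdout accuracy
accuracy_by_k = {}
for k in range(1, 30, 2):  # odd values of k only
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    accuracy_by_k[k] = clf.score(X_holdout, y_holdout)

for k, acc in sorted(accuracy_by_k.items()):
    print(k, round(acc, 3))
```

The keys and values of `accuracy_by_k` can then be passed to `plt.plot` to draw an accuracy-versus-k curve like the one discussed above.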

<b>Code Walk-Through:</b>

In [3]:
from sklearn.neighbors import KNeighborsClassifier

In [4]:
#open wine data:
wine1 = pd.read_csv('winequality-red.csv')
wine2 = pd.read_csv('winequality-white.csv')
print(wine1.shape)
print(wine2.shape)

(1599, 1)
(4898, 1)


In [5]:
wine1.head()

Unnamed: 0,"fixed acidity;""volatile acidity"";""citric acid"";""residual sugar"";""chlorides"";""free sulfur dioxide"";""total sulfur dioxide"";""density"";""pH"";""sulphates"";""alcohol"";""quality"""
0,7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
1,7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
2,7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;...
3,11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58...
4,7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5


As we can see above, this file needs to be read differently because of its format: every row has been read into a single column. We can add the sep argument and specify ';' as the separator, and then the file is read in column by column:

In [12]:
wine_red = pd.read_csv('winequality-red.csv', sep=';')
wine_red.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5


In [14]:
#add in a new column to identify these wines as red:
wine_red['type'] = 'red'
wine_red.head(2)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5,red


Read in the white wine dataset:

In [15]:
wine_white = pd.read_csv('winequality-white.csv', sep=';')
wine_white.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45,170,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6


In [16]:
#add in new column to identify these wines as white:
wine_white['type'] = 'white'
wine_white.head(2)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.0,0.27,0.36,20.7,0.045,45,170,1.001,3.0,0.45,8.8,6,white
1,6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6,white


<b>Append the datasets together:</b>

In [17]:
#DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
wine_frame = pd.concat([wine_white, wine_red], ignore_index=True) #index has no meaning so ignoring it is appropriate
print(wine_frame.shape)
print(wine_red.shape)
print(wine_white.shape)

(6497, 13)
(1599, 13)
(4898, 13)


<b>Create Training and Testing Datasets:</b>

In [43]:
#Shuffle the dataframe 5 times to mix up red and white wines:
for shuffle in range(1, 6):
    wine_frame = wine_frame.reindex(np.random.permutation(wine_frame.index))
    print('Shuffle', shuffle)
    print(wine_frame['type'][:7])
    print('')

print('')
wine_frame = wine_frame.reset_index(drop=True) #reset index; the old index is not kept
print('Final Wine Frame')
wine_frame[2000:2010]

Shuffle 1
4897    white
5980      red
4735      red
4980    white
6195      red
955     white
1501    white
Name: type, dtype: object

Shuffle 2
1865      red
3351    white
2533      red
4591    white
5603    white
2377      red
5056    white
Name: type, dtype: object

Shuffle 3
5375    white
3183      red
11      white
5061      red
4216      red
2209    white
5539    white
Name: type, dtype: object

Shuffle 4
4965    white
1294    white
5602    white
334     white
4026    white
2992      red
887     white
Name: type, dtype: object

Shuffle 5
1503      red
2357    white
5738    white
590       red
2905    white
4817    white
233     white
Name: type, dtype: object


Final Wine Frame


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
2000,8.0,0.22,0.42,14.6,0.044,45,163,1.0003,3.21,0.69,8.6,7,white
2001,7.4,0.18,0.36,13.1,0.056,72,163,1.0,3.42,0.35,9.1,6,white
2002,9.3,0.41,0.39,2.2,0.064,12,31,0.9984,3.26,0.65,10.2,5,red
2003,6.6,0.26,0.27,11.8,0.048,28,112,0.99606,2.87,0.49,9.7,6,white
2004,6.9,0.29,0.23,8.6,0.056,56,215,0.9967,3.17,0.44,8.8,5,white
2005,6.5,0.41,0.22,4.8,0.052,49,142,0.9946,3.14,0.62,9.2,5,white
2006,7.2,0.19,0.31,6.3,0.034,17,103,0.99305,3.15,0.52,11.4,7,white
2007,6.9,0.4,0.43,6.2,0.065,42,178,0.99552,3.11,0.53,9.4,5,white
2008,5.8,0.26,0.18,1.2,0.031,40,114,0.9908,3.42,0.4,11.0,7,white
2009,8.9,0.43,0.45,1.9,0.052,6,16,0.9948,3.35,0.7,12.5,6,red


In [50]:
#Create testing and training data; Train with 80%, Test with 20%:
train80 = int(round(len(wine_frame) * 0.8))
wine_train = wine_frame[:train80]
print('Training Shape:', wine_train.shape)
print('')
wine_test = wine_frame[train80:]
print('Testing Shape:', wine_test.shape)

Training Shape: (5198, 13)

Testing Shape: (1299, 13)
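
As an aside, the shuffle-then-slice approach above can also be done in one step with scikit-learn's `train_test_split`. The sketch below is an alternative, not what this notebook uses: `toy_frame` is a made-up miniature stand-in for `wine_frame`, and the `stratify` argument (which keeps the red/white proportions similar in both parts) is an optional extra.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature stand-in for wine_frame (values are made up)
toy_frame = pd.DataFrame({
    'density': [0.9978, 0.9968, 0.9970, 0.9980, 1.0010,
                0.9940, 0.9951, 0.9956, 0.9908, 0.9948],
    'type': ['red', 'red', 'red', 'red', 'white',
             'white', 'white', 'white', 'white', 'white'],
})

# Shuffle and split 80/20 in one call; random_state makes the split repeatable
toy_train, toy_test = train_test_split(
    toy_frame, test_size=0.2, random_state=42, stratify=toy_frame['type'])

print('Training Shape:', toy_train.shape)
print('Testing Shape:', toy_test.shape)
```

Fixing `random_state` makes the split reproducible, which the repeated `np.random.permutation` shuffle above is not unless a seed is set first.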
