# Classification and the K Nearest Neighbors algorithm

**Classification** is one of the two main branches of supervised learning; the other is **regression**, which we covered last week.

Classification predicts **target classes**, which are categorical variables, from a set of predictor variables.
A classification model assigns new data to a class based on the predicted probability it derives for each class.

## kNN

The kNN algorithm in pseudocode:

```
for unclassified_point in sample:
    for known_point in known_class_points:
        calculate distance (euclidean or other) between known_point and unclassified_point
    find the k known_class_points with the smallest distance to unclassified_point
    assign class to unclassified_point using "votes" from those k nearest points
```
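
The pseudocode above can be turned into a few lines of plain Python. This is only a minimal sketch for intuition, not how sklearn implements it: the function name `knn_predict` and the toy points are made up for illustration, and it assumes Euclidean distance with one vote per neighbor.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(known_points, known_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest known points."""
    # Distance from the query to every labelled point.
    distances = [
        (euclidean(point, query), label)
        for point, label in zip(known_points, known_labels)
    ]
    # Keep the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Each neighbor casts one vote for its own class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.2, 1.4)))  # query near the first cluster
print(knn_predict(points, labels, (6.4, 6.6)))  # query near the second cluster
```

Note that there is no training step: the "model" is just the stored labelled points, and all the work happens at prediction time.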

---

[NOTE: when votes are tied under uniform weights, sklearn's `KNeighborsClassifier()` will just choose the first class! If this is unappealing to you, you can change the `weights` keyword argument to `'distance'`.]
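
To make the tie behaviour in the note concrete, here is a small sketch on made-up one-dimensional data. With `n_neighbors=2` and two classes, the vote is forced to be 1-1. The data and variable names are assumptions for illustration only, and how ties are resolved is an implementation detail of sklearn rather than something to rely on.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): the class-1 point sits closer to the query than the class-0 point.
X = np.array([[0.0], [3.0]])
y = np.array([1, 0])
query = np.array([[1.0]])

# k=2 with two classes forces a tied vote under uniform weights.
uniform = KNeighborsClassifier(n_neighbors=2, weights="uniform").fit(X, y)
distance = KNeighborsClassifier(n_neighbors=2, weights="distance").fit(X, y)

uniform_pred = uniform.predict(query)[0]
distance_pred = distance.predict(query)[0]
uniform_proba = uniform.predict_proba(query)[0]
distance_proba = distance.predict_proba(query)[0]

print("uniform: ", uniform_pred, uniform_proba)
print("distance:", distance_pred, distance_proba)
```

Under uniform weights the two classes split the vote evenly, so the prediction falls back to the first class. Under distance weights the closer neighbor counts for more, which breaks the tie in its favour.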

## 1. Load affairs dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


%matplotlib inline

In [2]:
affair = pd.read_csv('../assets/datasets/Fair.csv')
affair.head()

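| Unnamed: 0.1 | Unnamed: 0 | sex | age | ym | child | religious | education | occupation | rate | nbaffairs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | male | 37.0 | 10.0 | no | 3 | 18 | 7 | 4 | 0 |
| 1 | 2 | female | 27.0 | 4.0 | no | 4 | 14 | 6 | 4 | 0 |
| 2 | 3 | female | 32.0 | 15.0 | yes | 1 | 12 | 1 | 4 | 0 |
| 3 | 4 | male | 57.0 | 15.0 | yes | 5 | 18 | 6 | 5 | 0 |
| 4 | 5 | male | 22.0 | 0.75 | no | 2 | 17 | 6 | 3 | 0 |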


In [3]:
affair.sex.unique()

array(['male', 'female'], dtype=object)

## 2. Encode nbaffairs as binary

We only want to know whether a person has had any affair at all, not how many.

In [5]:
def binary_affair(x):
    # Collapse the affair count to 0 (none) or 1 (at least one).
    if x == 0:
        return 0
    else:
        return 1

affair['had_affair'] = affair.nbaffairs.map(binary_affair)
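
The same encoding can also be written without a helper function. The sketch below uses a made-up Series named `counts` standing in for the `nbaffairs` column, so it runs on its own; it is an alternative spelling, not a correction of the cell above.

```python
import pandas as pd

# Hypothetical counts standing in for the nbaffairs column.
counts = pd.Series([0, 3, 0, 12, 1])

# Comparing against 0 gives booleans; astype(int) turns them into 0/1.
had_affair = (counts != 0).astype(int)

print(had_affair.tolist())  # [0, 1, 0, 1, 1]
```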

In [6]:
affair.head()

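| Unnamed: 0.1 | Unnamed: 0 | sex | age | ym | child | religious | education | occupation | rate | nbaffairs | had_affair |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | male | 37.0 | 10.0 | no | 3 | 18 | 7 | 4 | 0 | 0 |
| 1 | 2 | female | 27.0 | 4.0 | no | 4 | 14 | 6 | 4 | 0 | 0 |
| 2 | 3 | female | 32.0 | 15.0 | yes | 1 | 12 | 1 | 4 | 0 | 0 |
| 3 | 4 | male | 57.0 | 15.0 | yes | 5 | 18 | 6 | 5 | 0 | 0 |
| 4 | 5 | male | 22.0 | 0.75 | no | 2 | 17 | 6 | 3 | 0 | 0 |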


## 3. Load sklearn KNeighborsClassifier and initialize with k=3

In [9]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)

## 4. Set up X and Y matrices (predict had_affair) with patsy

In [13]:
import patsy

formula = 'had_affair ~ C(sex) + age + ym + C(child) + religious + education + C(occupation) + rate -1'

ymat, xmat = patsy.dmatrices(formula, data=affair)
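
The design matrix patsy returns is a plain array of numbers, so it helps to check which column is which. The sketch below uses a tiny made-up frame (`toy`) with one categorical and one numeric predictor and reads the labels from `design_info.column_names`. The toy values are assumptions; the same attribute should be available on `xmat`.

```python
import pandas as pd
import patsy

# Tiny made-up frame with one categorical and one numeric predictor.
toy = pd.DataFrame({
    "had_affair": [0, 1, 0, 1],
    "sex": ["male", "female", "female", "male"],
    "age": [37.0, 27.0, 32.0, 57.0],
})

# "- 1" drops the intercept, so the first categorical term keeps one column per level.
y_toy, x_toy = patsy.dmatrices("had_affair ~ C(sex) + age - 1", data=toy)

column_names = x_toy.design_info.column_names
print(column_names)
print(x_toy.shape)
```

Categorical terms come first in the matrix, followed by the numeric predictors. That is why the leading columns of `xmat` are all 0/1 indicators and the raw values such as age sit at the end of each row.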

In [14]:
xmat[0:10]

array([[  0.  ,   1.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,
          1.  ,  37.  ,  10.  ,   3.  ,  18.  ,   4.  ],
       [  1.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   1.  ,
          0.  ,  27.  ,   4.  ,   4.  ,  14.  ,   4.  ],
       [  1.  ,   0.  ,   1.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,
          0.  ,  32.  ,  15.  ,   1.  ,  12.  ,   4.  ],
       [  0.  ,   1.  ,   1.  ,   0.  ,   0.  ,   0.  ,   0.  ,   1.  ,
          0.  ,  57.  ,  15.  ,   5.  ,  18.  ,   5.  ],
       [  0.  ,   1.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   1.  ,
          0.  ,  22.  ,   0.75,   2.  ,  17.  ,   3.  ],
       [  1.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   1.  ,   0.  ,
          0.  ,  32.  ,   1.5 ,   2.  ,  17.  ,   5.  ],
       [  1.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,
          0.  ,  22.  ,   0.75,   2.  ,  12.  ,   3.  ],
       [  0.  ,   1.  ,   1.  ,   0.  ,   0.  ,   1.  ,   0.  ,   0.  ,
          0.  ,  57.  ,  

## 5. Fit kNN classifier

In [16]:
knn.fit(xmat, np.ravel(ymat))

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

## 6. Validate the kNN classifier

In [25]:
# train_test_split lives in sklearn.model_selection (the old cross_validation module was removed)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(xmat, ymat, test_size=0.33)

In [26]:
knn.fit(X_train, np.ravel(Y_train))

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [27]:
knn.score(X_test, np.ravel(Y_test))

0.70351758793969854
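
For a classifier, `score` reports mean accuracy: the fraction of test rows whose predicted class matches the true class. The sketch below checks this by hand on synthetic stand-in data from `make_classification` (not the affairs dataset). All the `_demo` names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the affairs dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

# score() and the hand-computed share of correct predictions should agree.
score = knn_demo.score(X_te, y_te)
manual_accuracy = np.mean(knn_demo.predict(X_te) == y_te)

print(score, manual_accuracy)
```

Passing `random_state` to `train_test_split` makes the split repeatable. Without it, as in the cell above, the score changes a little on every run.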

## 7. Look at predictions and predicted probability

In [29]:
predictions = knn.predict(X_test)
pred_probability = knn.predict_proba(X_test)

In [30]:
print(predictions[0:15])
print(pred_probability[0:15])

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[[ 0.66666667  0.33333333]
 [ 0.66666667  0.33333333]
 [ 0.66666667  0.33333333]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 0.66666667  0.33333333]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 0.66666667  0.33333333]
 [ 1.          0.        ]
 [ 0.66666667  0.33333333]]
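
With uniform weights and k=3, the predicted probability is just the share of the three neighbors that belong to each class. That is why only values like 0, 1/3, 2/3 and 1 show up above. A minimal sketch of that calculation follows; the helper `uniform_proba` and the neighbor labels are made up for illustration.

```python
from collections import Counter


def uniform_proba(neighbor_labels, classes):
    """Share of neighbors voting for each class, in the order of `classes`."""
    counts = Counter(neighbor_labels)
    k = float(len(neighbor_labels))
    return [counts[c] / k for c in classes]


# Hypothetical labels of the 3 nearest neighbors of two test points.
print(uniform_proba([0, 0, 1], classes=[0, 1]))  # two of three neighbors are class 0
print(uniform_proba([0, 0, 0], classes=[0, 1]))  # all three neighbors are class 0
```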


## 8. Use weights='distance' and examine effect on score and predicted probability

In [31]:
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')

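# Flatten the target to 1-D, as in the earlier cells, to avoid a column-vector warning
knn.fit(X_train, np.ravel(Y_train))

print(knn.score(X_test, np.ravel(Y_test)))
print(knn.predict_proba(X_test)[0:15])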

0.708542713568
[[ 0.6545085   0.3454915 ]
 [ 0.69371294  0.30628706]
 [ 0.68989795  0.31010205]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 0.47503415  0.52496585]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 0.70882886  0.29117114]
 [ 1.          0.        ]
 [ 0.69047053  0.30952947]]
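
With `weights='distance'`, each neighbor's vote counts in proportion to the inverse of its distance, so the probabilities are no longer restricted to thirds. The sketch below reproduces that idea by hand. The helper name and the neighbor distances are made up, and it assumes all distances are strictly positive.

```python
def distance_weighted_proba(neighbor_labels, neighbor_distances, classes):
    """Weight each neighbor's vote by 1 / distance, then normalise to sum to 1."""
    totals = {c: 0.0 for c in classes}
    for label, dist in zip(neighbor_labels, neighbor_distances):
        totals[label] += 1.0 / dist  # assumes dist > 0
    weight_sum = sum(totals.values())
    return [totals[c] / weight_sum for c in classes]


# Hypothetical neighbors: one class-1 point very close, two class-0 points further away.
proba = distance_weighted_proba([1, 0, 0], [1.0, 4.0, 4.0], classes=[0, 1])
print(proba)
```

In this made-up case a uniform vote would favour class 0 by two votes to one, but the single close class-1 neighbor outweighs the two distant ones. The same mechanism explains how a row can move from about 0.67/0.33 under uniform weights to a narrow majority for the other class under distance weights.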



## 9. Keeping weights='distance', change k to 7 and examine effect on score and predicted probability

In [32]:
knn = KNeighborsClassifier(n_neighbors=7, weights='distance')

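# Flatten the target to 1-D, as in the earlier cells, to avoid a column-vector warning
knn.fit(X_train, np.ravel(Y_train))

print(knn.score(X_test, np.ravel(Y_test)))
print(knn.predict_proba(X_test)[0:15])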

0.743718592965
[[ 0.7072949   0.2927051 ]
 [ 0.59529042  0.40470958]
 [ 0.72650237  0.27349763]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 0.87959233  0.12040767]
 [ 0.89471298  0.10528702]
 [ 1.          0.        ]
 [ 0.88627438  0.11372562]
 [ 0.48534477  0.51465523]
 [ 0.86865701  0.13134299]
 [ 1.          0.        ]
 [ 0.8525542   0.1474458 ]
 [ 0.67895339  0.32104661]
 [ 0.57686907  0.42313093]]
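
Rather than changing k by hand one cell at a time, the comparison can be run in a loop. This is a sketch on synthetic stand-in data from `make_classification` (not the affairs dataset), with made-up variable names; the particular values of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the affairs dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Fit one distance-weighted classifier per k and record its test accuracy.
scores_by_k = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k, weights="distance")
    model.fit(X_tr, y_tr)
    scores_by_k[k] = model.score(X_te, y_te)

for k in sorted(scores_by_k):
    print("k=%d: %.3f" % (k, scores_by_k[k]))
```

Because every score here comes from the same single test split, small differences between values of k may just be noise.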


