# Classification and the K Nearest Neighbors algorithm

**Classification** is one of the two main branches of supervised learning; the other is **regression**, which we covered last week.

Classification predicts a **target class**, which is a categorical variable, from a set of predictor variables.
A classification model assigns new data to a class, typically the class with the highest predicted probability.

## kNN

The pseudocode algorithm for kNN is as follows:

```
for unclassified_point in sample:
    for known_point in known_class_points:
        calculate distance (euclidean or other) between known_point and unclassified_point
    find the k nearest points in known_class_points to unclassified_point
    assign class to unclassified_point using "votes" from the k nearest points
```
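The pseudocode above can be sketched in plain Python. This is a minimal illustration, not sklearn's implementation: the function name `knn_predict` and the toy points are hypothetical, and it assumes Euclidean distance and an unweighted majority vote.

```python
import math
from collections import Counter

def knn_predict(known_points, known_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest known points."""
    # Distance from the query to every labelled point (Euclidean here).
    distances = [math.dist(point, query) for point in known_points]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(known_points)), key=lambda i: distances[i])[:k]
    # Each neighbour casts one vote for its own class.
    votes = Counter(known_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up data: two well-separated clusters labelled 0 and 1.
known_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
known_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(known_points, known_labels, (1.2, 1.4)))
print(knn_predict(known_points, known_labels, (6.8, 7.0)))
```

Computing every distance for every query is the brute-force approach; it is fine for small data such as the 601-row dataset used below.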

---

[NOTE: when the vote is tied, sklearn's `KNeighborsClassifier()` with the default uniform weights simply picks the first of the tied classes. If that is unappealing, set the `weights` keyword argument to `'distance'` so that closer neighbors count for more.]
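To see why distance weighting helps with ties, here is a small sketch of the two voting schemes. The helper `vote` and the neighbour list are hypothetical, and the sketch assumes inverse-distance weights with no zero distances; it does not reproduce sklearn's internals.

```python
from collections import defaultdict

def vote(neighbours, weighted=False):
    """Return per-class vote totals for (distance, label) pairs."""
    totals = defaultdict(float)
    for distance, label in neighbours:
        # Uniform: one vote each. Weighted: closer neighbours count for more.
        # Assumes distance > 0 (an exact match would need special handling).
        totals[label] += 1.0 / distance if weighted else 1.0
    return dict(totals)

# Four neighbours, two per class: a tie under uniform weights.
neighbours = [(0.5, 1), (0.9, 1), (1.0, 0), (1.1, 0)]

uniform = vote(neighbours)
weighted = vote(neighbours, weighted=True)
print(uniform)   # both classes have the same total
print(weighted)  # class 1 wins because its neighbours are closer
```

An odd `k` also avoids ties in a two-class problem, which is one reason k=3 is a common starting point.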

## 1. Load affairs dataset

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


%matplotlib inline

In [3]:
affair = pd.read_csv('../../assets/datasets/Fair.csv')
affair.head()

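| | Unnamed: 0 | sex | age | ym | child | religious | education | occupation | rate | nbaffairs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | male | 37.0 | 10.0 | no | 3 | 18 | 7 | 4 | 0 |
| 1 | 2 | female | 27.0 | 4.0 | no | 4 | 14 | 6 | 4 | 0 |
| 2 | 3 | female | 32.0 | 15.0 | yes | 1 | 12 | 1 | 4 | 0 |
| 3 | 4 | male | 57.0 | 15.0 | yes | 5 | 18 | 6 | 5 | 0 |
| 4 | 5 | male | 22.0 | 0.75 | no | 2 | 17 | 6 | 3 | 0 |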


In [4]:
affair.sex.unique()

array(['male', 'female'], dtype=object)

## 2. Encode nbaffairs as binary

We only want to know whether each person has had an affair at all, not how many.

In [5]:
def binary_affair(x):
    if x == 0:
        return 0
    else:
        return 1
    
affair['had_affair'] = affair.nbaffairs.map(binary_affair)
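The same encoding can also be written as a vectorised comparison. The sketch below uses a made-up toy frame (the name `toy` and its values are assumptions for illustration) rather than the real dataset.

```python
import pandas as pd

# Toy stand-in for the `nbaffairs` column; the values are made up.
toy = pd.DataFrame({'nbaffairs': [0, 3, 0, 12, 1]})

# A boolean comparison cast to int gives the same 0/1 encoding as binary_affair:
# 0 stays 0, anything else becomes 1.
toy['had_affair'] = (toy['nbaffairs'] != 0).astype(int)
print(toy)
```

On a dataset this small the difference is cosmetic; the vectorised form mainly avoids a Python-level function call per row.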

In [6]:
affair.head()

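| | Unnamed: 0 | sex | age | ym | child | religious | education | occupation | rate | nbaffairs | had_affair |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | male | 37.0 | 10.0 | no | 3 | 18 | 7 | 4 | 0 | 0 |
| 1 | 2 | female | 27.0 | 4.0 | no | 4 | 14 | 6 | 4 | 0 | 0 |
| 2 | 3 | female | 32.0 | 15.0 | yes | 1 | 12 | 1 | 4 | 0 | 0 |
| 3 | 4 | male | 57.0 | 15.0 | yes | 5 | 18 | 6 | 5 | 0 | 0 |
| 4 | 5 | male | 22.0 | 0.75 | no | 2 | 17 | 6 | 3 | 0 | 0 |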


## 3. Load sklearn KNeighborsClassifier and initialize with k=3

In [7]:
from sklearn.neighbors import KNeighborsClassifier


## 4. Set up X and Y matrices (predict had_affair) with patsy


In [9]:
import patsy

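# Patsy formula: the response goes left of '~', the predictors go on the right.
# C(...) marks a column as categorical (dummy-coded); '-1' drops the intercept column.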
formula = 'had_affair ~ C(sex) + age + ym + C(child) + religious + education + C(occupation) + rate -1'
ymat, xmat = patsy.dmatrices(formula, data=affair)
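To make the formula syntax concrete, here is a small sketch on a made-up four-row frame (the name `toy` and its values are assumptions). It passes `return_type='dataframe'` so the generated column names are easy to inspect.

```python
import pandas as pd
import patsy

# Made-up rows with one categorical and one numeric predictor.
toy = pd.DataFrame({
    'had_affair': [0, 1, 0, 1],
    'sex': ['male', 'female', 'female', 'male'],
    'age': [37.0, 27.0, 32.0, 57.0],
})

# Same pattern as the notebook formula, cut down to two predictors.
y, X = patsy.dmatrices('had_affair ~ C(sex) + age - 1', data=toy,
                       return_type='dataframe')

# With the intercept removed, the first categorical term gets one column per level.
print(list(X.columns))
print(X)
```

The same logic is consistent with the 14 columns of `xmat` reported below: two for `sex`, one for `child`, five numeric predictors, and six for `occupation`, assuming `occupation` takes seven distinct values.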

In [10]:
#type(ymat)
#xmat[0:10]

In [72]:
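# Raw feature and target frames (columns sex through rate, and nbaffairs).
# These are not used below; the models are fit on the patsy matrices instead.
dfData = affair[list(affair.columns[1:9])]
dfTarget = affair['nbaffairs']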

## 5. Fit kNN classifier

In [73]:
print(np.array(ymat).shape)

print(np.ravel(ymat).shape)

xmat.shape

(601, 1)
(601,)


(601, 14)

In [74]:
knn = KNeighborsClassifier(n_neighbors=3)
# np.ravel flattens the (601, 1) column into the 1-D label array that fit expects
knn.fit(xmat, np.ravel(ymat))



KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

## 6. Validate the knn classifier

In [88]:
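# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(xmat, np.ravel(ymat), test_size=0.33)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)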

(402, 14)
(199, 14)
(402,)
(199,)


In [89]:
knn.fit(X_train, Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [90]:
knn.score(X_test, np.ravel(Y_test))

0.68341708542713564
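For a classifier, `score` returns mean accuracy: the fraction of test rows whose predicted class matches the true class. A minimal sketch of that calculation follows; the helper `accuracy` and the short label lists are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels: 3 of the 5 predictions are correct.
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 0]
print(accuracy(y_true, y_pred))

# The score reported above corresponds to 136 correct out of 199 test rows.
print(136 / 199)
```

Accuracy alone can flatter a model when one class dominates, so it is worth comparing it with the share of the majority class in the test set.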

## 7. Look at predictions and predicted probability

In [91]:
predictions = knn.predict(X_test)
pred_probability = knn.predict_proba(X_test)

In [92]:
print(predictions[0:15])
print(pred_probability[0:15])

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.]
[[ 0.66666667  0.33333333]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 0.66666667  0.33333333]
 [ 0.66666667  0.33333333]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 0.66666667  0.33333333]
 [ 1.          0.        ]
 [ 0.66666667  0.33333333]
 [ 0.33333333  0.66666667]
 [ 1.          0.        ]
 [ 0.66666667  0.33333333]]


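## 8. Use weights='distance' and examine effect on score and predicted probability

Note that the cell below also sets `n_neighbors=7`, so it changes k as well as the weighting relative to the k=3 model above.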

In [105]:
knn_weights = KNeighborsClassifier(n_neighbors=7, weights='distance')

knn_weights.fit(X_train, Y_train)

print(knn_weights.score(X_test, Y_test))
print(knn_weights.predict_proba(X_test)[0:15])

0.678391959799
[[ 0.85396748  0.14603252]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 0.71817553  0.28182447]
 [ 0.70194883  0.29805117]
 [ 1.          0.        ]
 [ 0.73029957  0.26970043]
 [ 1.          0.        ]
 [ 0.87645598  0.12354402]
 [ 0.71046128  0.28953872]
 [ 0.74056225  0.25943775]
 [ 0.56074531  0.43925469]
 [ 0.          1.        ]
 [ 1.          0.        ]
 [ 0.59174619  0.40825381]]
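The difference between the two sets of probabilities comes from how neighbours are weighted. With uniform weights each probability is a share of k equal votes, which is why the k=3 output earlier contains only 0, 1/3, 2/3 and 1. With `weights='distance'` each neighbour is weighted by the inverse of its distance, so the values vary continuously. A minimal sketch, assuming nonzero distances and a hypothetical helper `class_probabilities`:

```python
def class_probabilities(neighbours, classes=(0, 1), weighted=False):
    """Normalised vote totals per class for (distance, label) pairs."""
    totals = {c: 0.0 for c in classes}
    for distance, label in neighbours:
        # Assumes distance > 0; an exact match would need special handling.
        totals[label] += 1.0 / distance if weighted else 1.0
    norm = sum(totals.values())
    return [totals[c] / norm for c in classes]

# Made-up neighbours: two close points of class 0, one distant point of class 1.
neighbours = [(1.0, 0), (2.0, 0), (4.0, 1)]

uniform = class_probabilities(neighbours)
weighted = class_probabilities(neighbours, weighted=True)
print(uniform)   # shares of 3 equal votes
print(weighted)  # weights 1, 0.5 and 0.25, normalised
```

A probability of exactly 0 or 1 under distance weighting means that all of the neighbours that count belong to one class.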


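## 9. Keeping weights 'distance', change k to 7 and look at score, predicted probability

This cell's execution count (In [27]) is lower than that of the train/test split (In [88]), so it was probably run against a different random split; its score is therefore not directly comparable with the ones above.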

In [27]:
knn = KNeighborsClassifier(n_neighbors=7, weights='distance')

knn.fit(X_train, Y_train)

print(knn.score(X_test, Y_test))
print(knn.predict_proba(X_test)[0:15])

0.738693467337
[[ 0.66562394  0.33437606]
 [ 0.61149254  0.38850746]
 [ 1.          0.        ]
 [ 0.70772343  0.29227657]
 [ 0.84987755  0.15012245]
 [ 0.44074267  0.55925733]
 [ 1.          0.        ]
 [ 0.87226042  0.12773958]
 [ 0.72697884  0.27302116]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 0.85611528  0.14388472]
 [ 0.74017717  0.25982283]
 [ 0.73045807  0.26954193]
 [ 0.78033097  0.21966903]]
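To compare settings fairly, every model should be scored on the same split. The sketch below does this on synthetic data from `make_classification` (not the affairs dataset) and fixes `random_state` so the split is reproducible; the variable names and the grid of settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for xmat / ymat.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# One fixed split shared by every model below.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

scores = {}
for k in (3, 7):
    for weights in ('uniform', 'distance'):
        model = KNeighborsClassifier(n_neighbors=k, weights=weights)
        scores[(k, weights)] = model.fit(X_tr, y_tr).score(X_te, y_te)

for setting, score in sorted(scores.items()):
    print(setting, round(score, 3))
```

A single split still gives a noisy estimate on a few hundred rows, so small differences between settings should not be over-read.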


