# Classification and the K Nearest Neighbors algorithm

**Classification** is one of the two main branches of supervised learning; the other is **regression**, which we covered last week.

Classification means predicting **target classes**, which are categorical variables, from a set of predictor variables.
A classification model assigns new data to a class, typically the class with the highest predicted probability.

## kNN

The pseudocode algorithm for kNN is as follows:

```
for unclassified_point in sample:
    for known_point in known_class_points:
        calculate distance (euclidean or other) between known_point and unclassified_point
    find the k nearest known_class_points to unclassified_point (k = specified_neighbors_number)
    assign class to unclassified_point using "votes" from those k nearest points
```
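
The pseudocode above can be sketched in plain Python. This is a minimal illustration, not sklearn's implementation: the function name `knn_predict` and the toy points are made up for the example, it assumes Euclidean distance, and it breaks vote ties by whichever class was counted first.

```python
import math
from collections import Counter


def knn_predict(known_points, known_classes, unclassified_point, k=3):
    """Classify one point by majority vote among its k nearest known points."""
    # Distance from the unclassified point to every known point
    distances = [
        (math.dist(point, unclassified_point), label)
        for point, label in zip(known_points, known_classes)
    ]
    # Keep the k closest (sort by distance only)
    k_nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Each neighbor casts one vote for its class
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters
known_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
known_classes = [0, 0, 0, 1, 1, 1]

print(knn_predict(known_points, known_classes, (1.2, 1.4)))
print(knn_predict(known_points, known_classes, (6.4, 6.2)))
```

The first query sits inside the class 0 cluster and the second inside the class 1 cluster, so the votes are unanimous in both cases.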

---

[NOTE: when the vote is tied under uniform weights, sklearn's `KNeighborsClassifier()` will just choose the first of the tied classes! If this is unappealing to you, you can change the `weights` keyword argument to `'distance'` so that closer neighbors count for more.]
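
To see why the weighting matters for ties, here is a small sketch in plain Python. It illustrates the two voting schemes and is not sklearn's code: `vote` and the neighbor list are hypothetical, and the distance weighting is assumed to be inverse distance (1/d), which is how sklearn's documentation describes `weights='distance'`.

```python
from collections import defaultdict


def vote(neighbors, weights="uniform"):
    """Return per-class vote totals for (distance, label) neighbor pairs.

    Assumes all distances are non-zero.
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        # uniform: one vote each; distance: closer neighbors count for more
        totals[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return dict(totals)


# Hypothetical neighbors for k=4: two of each class, class 1 is closer
neighbors = [(0.5, 1), (1.0, 1), (2.0, 0), (4.0, 0)]

print(vote(neighbors, "uniform"))   # tied: 2.0 votes each
print(vote(neighbors, "distance"))  # class 1 wins: 3.0 vs 0.75
```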

## 1. Load affairs dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
affair = pd.read_csv('../assets/datasets/Fair.csv')

In [3]:
affair.head()

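| Unnamed: 0.1 | Unnamed: 0 | sex | age | ym | child | religious | education | occupation | rate | nbaffairs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | male | 37.0 | 10.0 | no | 3 | 18 | 7 | 4 | 0 |
| 1 | 2 | female | 27.0 | 4.0 | no | 4 | 14 | 6 | 4 | 0 |
| 2 | 3 | female | 32.0 | 15.0 | yes | 1 | 12 | 1 | 4 | 0 |
| 3 | 4 | male | 57.0 | 15.0 | yes | 5 | 18 | 6 | 5 | 0 |
| 4 | 5 | male | 22.0 | 0.75 | no | 2 | 17 | 6 | 3 | 0 |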


## 2. Encode nbaffairs as binary

We only care whether a person has had any affair at all, not how many.

In [14]:
# 1 if the person reported at least one affair, 0 otherwise
affair['affair_binary'] = (affair.nbaffairs > 0).astype(int)

In [8]:
affair.head()

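| Unnamed: 0.1 | Unnamed: 0 | sex | age | ym | child | religious | education | occupation | rate | nbaffairs | affair_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | male | 37.0 | 10.0 | no | 3 | 18 | 7 | 4 | 0 | 0 |
| 1 | 2 | female | 27.0 | 4.0 | no | 4 | 14 | 6 | 4 | 0 | 0 |
| 2 | 3 | female | 32.0 | 15.0 | yes | 1 | 12 | 1 | 4 | 0 | 0 |
| 3 | 4 | male | 57.0 | 15.0 | yes | 5 | 18 | 6 | 5 | 0 | 0 |
| 4 | 5 | male | 22.0 | 0.75 | no | 2 | 17 | 6 | 3 | 0 | 0 |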


## 3. Load sklearn KNeighborsClassifier and initialize with k=3

In [11]:
from sklearn.neighbors import KNeighborsClassifier

In [12]:
model = KNeighborsClassifier(n_neighbors = 3)

## 4. Set up X and Y matrices (predicting affair_binary) with patsy

In [13]:
import patsy

In [21]:
# C(n) marks 'n' as a categorical variable.
# The -1 at the end of the formula removes the intercept: kNN only uses
# distances between points, so a constant column adds nothing.
formula = "affair_binary ~ C(sex) + age + ym + religious + C(occupation) -1"

Y, X = patsy.dmatrices(formula, data=affair)

In [22]:
print(Y[0:10])
print(X[0:10])

[[ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]]
[[  0.     1.     0.     0.     0.     0.     0.     1.    37.    10.     3.  ]
 [  1.     0.     0.     0.     0.     0.     1.     0.    27.     4.     4.  ]
 [  1.     0.     0.     0.     0.     0.     0.     0.    32.    15.     1.  ]
 [  0.     1.     0.     0.     0.     0.     1.     0.    57.    15.     5.  ]
 [  0.     1.     0.     0.     0.     0.     1.     0.    22.     0.75
    2.  ]
 [  1.     0.     0.     0.     0.     1.     0.     0.    32.     1.5    2.  ]
 [  1.     0.     0.     0.     0.     0.     0.     0.    22.     0.75
    2.  ]
 [  0.     1.     0.     0.     1.     0.     0.     0.    57.    15.     2.  ]
 [  1.     0.     0.     0.     0.     0.     0.     0.    32.    15.     4.  ]
 [  0.     1.     0.     0.     1.     0.     0.     0.    22.     1.5    4.  ]]
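
To make the column layout of `X` concrete, the sketch below rebuilds single rows by hand in plain Python. It illustrates the coding patsy appears to use here and is not patsy's implementation: `design_row` is a hypothetical helper, and it assumes occupation takes the levels 1 through 7 with level 1 as the reference. Because the intercept is dropped, the first categorical term (sex) keeps an indicator column for every level, while occupation drops its reference level.

```python
SEX_LEVELS = ["female", "male"]
OCCUPATION_LEVELS = [1, 2, 3, 4, 5, 6, 7]  # assumed levels; 1 is the reference


def design_row(sex, age, ym, religious, occupation):
    """Build one row of X: sex indicators, occupation indicators, then numerics."""
    sex_cols = [1.0 if sex == level else 0.0 for level in SEX_LEVELS]
    # Skip the reference level, so occupation 1 is encoded as all zeros
    occ_cols = [1.0 if occupation == level else 0.0 for level in OCCUPATION_LEVELS[1:]]
    return sex_cols + occ_cols + [float(age), float(ym), float(religious)]


# First two rows of the affairs data shown earlier
print(design_row("male", 37.0, 10.0, 3, 7))
print(design_row("female", 27.0, 4.0, 4, 6))
```

The two printed rows line up with the first two rows of `X` above: two sex columns, six occupation columns, then age, ym and religious.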


In [26]:
Y = np.ravel(Y)  # flatten the (n, 1) column from patsy into the 1D array sklearn expects
print(Y[0:10])

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]


## 5. Fit kNN classifier

In [37]:
model.fit(X, Y)  # Y must be 1D here; patsy returns a 2D column, so flatten it with np.ravel first
model.score(X, Y)  # accuracy on the same data the model was fit on

0.81031613976705485

## 6. Validate the kNN classifier

In [38]:
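# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)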


In [39]:
model.fit(X_train, Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [40]:
model.score(X_test, Y_test)

0.69346733668341709
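
The gap between the training score (about 0.81) and the test score (about 0.69) is why we hold data out. The sketch below shows the idea behind `train_test_split` and `score` in plain Python: shuffle, cut off a test fraction, and measure the share of correct predictions. The helper names and toy labels are made up for illustration, and sklearn's own functions handle more cases than this.

```python
import random


def holdout_split(rows, test_size=0.33, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    return indices[n_test:], indices[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


train_idx, test_idx = holdout_split(range(100))
print(len(train_idx), len(test_idx))  # 67 33

# Hypothetical labels and predictions
print(accuracy([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.75
```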

## 7. Look at predictions and predicted probability

In [41]:
predictions = model.predict(X_test)

In [42]:
predictions[0:30]

array([ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.])

In [43]:
pp = model.predict_proba(X_test)

In [44]:
print(pp[0:15])

[[ 0.33333333  0.66666667]
 [ 1.          0.        ]
 [ 0.66666667  0.33333333]
 [ 1.          0.        ]
 [ 0.66666667  0.33333333]
 [ 1.          0.        ]
 [ 1.          0.        ]
 [ 0.66666667  0.33333333]
 [ 0.66666667  0.33333333]
 [ 0.66666667  0.33333333]
 [ 0.66666667  0.33333333]
 [ 0.66666667  0.33333333]
 [ 0.          1.        ]
 [ 1.          0.        ]
 [ 1.          0.        ]]


## 8. Use weights='distance' and examine effect on score and predicted probability

In [45]:
model = KNeighborsClassifier(n_neighbors = 3, weights='distance')
model.fit(X_train, Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='distance')

In [46]:
model.predict_proba(X_test)[0:15]

array([[ 0.32214793,  0.67785207],
       [ 1.        ,  0.        ],
       [ 0.66666667,  0.33333333],
       [ 1.        ,  0.        ],
       [ 0.5       ,  0.5       ],
       [ 1.        ,  0.        ],
       [ 1.        ,  0.        ],
       [ 0.66666667,  0.33333333],
       [ 0.5       ,  0.5       ],
       [ 0.71010205,  0.28989795],
       [ 1.        ,  0.        ],
       [ 0.66666667,  0.33333333],
       [ 0.        ,  1.        ],
       [ 1.        ,  0.        ],
       [ 1.        ,  0.        ]])
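
One way to read the numbers above: with uniform weights and k=3, every probability is a multiple of 1/3, while distance weighting normalizes inverse-distance votes. The sketch below is plain Python with made-up distances. `neighbor_proba` is a hypothetical helper, and the zero-distance rule (exact matches take all the weight) is our understanding of how sklearn handles that case; it could explain rows of exactly 0.5/0.5.

```python
def neighbor_proba(neighbors, classes=(0, 1), weights="uniform"):
    """Class probabilities from (distance, label) pairs of the k nearest neighbors."""
    if weights == "uniform":
        w = [1.0 for _ in neighbors]
    elif any(d == 0 for d, _ in neighbors):
        # Exact matches take all the weight; everything else gets none
        w = [1.0 if d == 0 else 0.0 for d, _ in neighbors]
    else:
        w = [1.0 / d for d, _ in neighbors]
    total = sum(w)
    return [
        sum(wi for wi, (_, label) in zip(w, neighbors) if label == c) / total
        for c in classes
    ]


# Hypothetical distances, not taken from the affairs data
three = [(1.0, 1), (2.0, 0), (4.0, 1)]
print(neighbor_proba(three))                      # uniform: 1/3 and 2/3
print(neighbor_proba(three, weights="distance"))  # 0.5/1.75 and 1.25/1.75

# Two training points identical to the query but with different classes
duplicates = [(0.0, 0), (0.0, 1), (3.0, 0)]
print(neighbor_proba(duplicates, weights="distance"))  # [0.5, 0.5]
```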

## 9. Keeping weights='distance', change k to 7 and look at the score and predicted probability

In [47]:
model = KNeighborsClassifier(n_neighbors = 7, weights='distance')
model.fit(X_train, Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
           weights='distance')

In [48]:
model.predict_proba(X_test)[0:15]

array([[ 0.46423312,  0.53576688],
       [ 1.        ,  0.        ],
       [ 0.74547792,  0.25452208],
       [ 1.        ,  0.        ],
       [ 0.5       ,  0.5       ],
       [ 1.        ,  0.        ],
       [ 1.        ,  0.        ],
       [ 0.60283967,  0.39716033],
       [ 0.5       ,  0.5       ],
       [ 0.61234122,  0.38765878],
       [ 1.        ,  0.        ],
       [ 0.73385564,  0.26614436],
       [ 0.25      ,  0.75      ],
       [ 0.85760003,  0.14239997],
       [ 1.        ,  0.        ]])