# K Nearest Neighbors with Sklearn

This notebook shows how to train, use, and evaluate a neighbors-based classification model.

* Method: [Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html)
* Dataset: Iris

## Imports

In [None]:
import numpy as np

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report

from mlxtend.evaluate import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

import seaborn as sb
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib import rcParams

%matplotlib inline
rcParams['figure.figsize'] = 10, 8
sb.set_style('whitegrid')

In [None]:
# Additional matplotlib settings for the decision-boundary plots
# Step size in the mesh
h = 0.02

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

## Load and Prepare the Data

In [None]:
# Load the data
data = load_iris()

In [None]:
# Find out about the data
print(data.DESCR)

In [None]:
# Split the data into features and targets, using only the first two features
X = data.data[:, :2]
y = data.target

In [None]:
# Create test and training sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)

## Fit a Neighbors Model

Arguments
* n_neighbors: number of neighbors to use (set below through the `neighbors` variable)
* weights: weight function used in prediction
  * uniform : uniform weights. All points in each neighborhood are weighted equally.
  * distance : weight points by the inverse of their distance. In this case, closer neighbors of a query point have a greater influence than neighbors that are farther away.
  * callable : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
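
The callable option is the least obvious of the three, so here is a minimal sketch of one. The function `gaussian_weights` and its `bandwidth` value are hypothetical, chosen only for illustration; the names `X_demo`, `y_demo`, and `knn` are likewise not part of this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier


def gaussian_weights(distances, bandwidth=0.5):
    """Hypothetical weight function: weights decay smoothly with distance.

    Accepts an array of neighbor distances and returns an array of the
    same shape, which is what the callable option requires.
    """
    return np.exp(-(distances ** 2) / (2 * bandwidth ** 2))


iris = load_iris()
X_demo = iris.data[:, :2]  # first two features, as in this notebook
y_demo = iris.target

# Pass the function itself instead of 'uniform' or 'distance'
knn = KNeighborsClassifier(n_neighbors=15, weights=gaussian_weights)
knn.fit(X_demo, y_demo)

demo_predictions = knn.predict(X_demo[:5])
print(demo_predictions)
```

Unlike the built-in `distance` option, a weight function of this kind stays finite when a query point coincides with a stored point, which is one possible reason to write a custom one.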

In [None]:
# Set the number of neighbors
neighbors = 15

In [None]:
for weight in ['uniform', 'distance']:
    # Create a KNeighborsClassifier instance and fit it to the data.
    clf = KNeighborsClassifier(neighbors, weights=weight)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Also plot the data points the classifier was fitted on
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (neighbors, weight))

plt.show()

In [None]:
model = KNeighborsClassifier(neighbors, weights='distance')
model.fit(X_train, y_train)

## Create Predictions

In [None]:
# Create predictions
predictions = model.predict(X_test)
print(predictions)

In [None]:
# Predict the probability of each class
pred_probs = model.predict_proba(X_test)
print(pred_probs[0])

In [None]:
# Create a plot to compare the actual class (y_test) and the predicted class (predictions)
fig = plt.figure(figsize=(20, 10))
plt.scatter(y_test, predictions)
# Raw strings keep the LaTeX backslashes from being read as escape sequences
plt.xlabel(r"Actual Class: $Y_i$")
plt.ylabel(r"Predicted Class: $\hat{Y}_i$")
plt.title(r"Actual vs. Predicted Class: $Y_i$ vs. $\hat{Y}_i$")
plt.show()

## Model Evaluation

### Accuracy

The accuracy score is either the fraction of correct predictions (the default) or their count (`normalize=False`).
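
To make the two modes concrete, here is a small sketch on made-up labels. The lists `y_true_demo` and `y_pred_demo` are illustrative only and are not taken from the Iris split; four of the five labels match.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only (4 of the 5 predictions are correct)
y_true_demo = [0, 1, 2, 2, 1]
y_pred_demo = [0, 1, 1, 2, 1]

# Default: fraction of correct predictions
fraction_correct = accuracy_score(y_true_demo, y_pred_demo)

# normalize=False: number of correct predictions
count_correct = accuracy_score(y_true_demo, y_pred_demo, normalize=False)

print("Fraction correct:", fraction_correct)
print("Count correct:", count_correct)
```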

In [None]:
print("Accuracy Score: %.2f" % accuracy_score(y_test, predictions))

### K-Fold Cross Validation

This estimates the accuracy of a k-nearest neighbors model by splitting the data into 5 folds, then fitting a model and computing the score 5 consecutive times, each time holding out a different fold. The result is an array with the score from each run.
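
The split, fit, and score loop can be written out by hand as a rough sketch. It assumes the full two-feature Iris data rather than the training split used in the cells here, and the names `X_cv`, `y_cv`, `manual_scores`, and `helper_scores` are illustrative. For a classifier with an integer `cv`, scikit-learn uses stratified folds, so `StratifiedKFold` is used to mirror that.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_cv = iris.data[:, :2]  # assumption: full data set, first two features
y_cv = iris.target

# Split into 5 stratified folds; each fold is held out once for scoring
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in folds.split(X_cv, y_cv):
    fold_model = KNeighborsClassifier(n_neighbors=15, weights='uniform')
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    # score() returns the mean accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_cv[val_idx], y_cv[val_idx]))
manual_scores = np.array(manual_scores)

# The same procedure through the helper function
helper_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=15, weights='uniform'),
    X_cv, y_cv, cv=5)

print("Manual scores: {}".format(manual_scores))
print("Helper scores: {}".format(helper_scores))
```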

In [None]:
# Get scores for 5 folds over the training data using uniform weights
clf = KNeighborsClassifier(neighbors, weights='uniform')
scores = cross_val_score(clf, X_train, y_train, cv=5)

# Print the scores and mean score
print("Scores: {}".format(scores))
print("Mean Score: %0.2f" % np.mean(scores))

In [None]:
# Get scores for 5 folds over the training data using distance weights
clf = KNeighborsClassifier(neighbors, weights='distance')
scores = cross_val_score(clf, X_train, y_train, cv=5)

# Print the scores and mean score
print("Scores: {}".format(scores))
print("Mean Score: %0.2f" % np.mean(scores))

### Confusion Matrix

In [None]:
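# Plot the binarized confusion matrix: binary=True maps the positive class
# (positive_label, 1 by default) to 1 and all other classes to 0
cm = confusion_matrix(y_target=y_test,
                      y_predicted=predictions,
                      binary=True)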

fig, ax = plot_confusion_matrix(conf_mat=cm)
plt.title("Confusion Matrix")
plt.show()

**Interpretation**: 28 and 8 are the numbers of correct predictions; 7 and 7 are the numbers of incorrect predictions.
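
As a quick arithmetic check, the counts quoted in the interpretation can be turned into an accuracy figure. The cell layout in `cm_demo` is an assumption (correct predictions on the diagonal, incorrect ones off the diagonal); the counts themselves come from the interpretation.

```python
import numpy as np

# Counts from the interpretation. Layout is assumed:
# correct predictions on the diagonal, incorrect ones off the diagonal.
cm_demo = np.array([[28, 7],
                    [7, 8]])

correct = np.trace(cm_demo)   # sum of the diagonal
total = cm_demo.sum()         # all test samples
incorrect = total - correct
matrix_accuracy = correct / total

print("Correct: %d, incorrect: %d, total: %d" % (correct, incorrect, total))
print("Accuracy from matrix: %.2f" % matrix_accuracy)
```

Because the matrix is binarized, this is the accuracy of the one-class-versus-the-rest problem, which need not equal the multi-class accuracy score computed earlier.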