In [1]:
import random
import math
import numpy as np
import pandas as pd
from generated_data import *
from statistics import mode
from sklearn.neighbors import KNeighborsClassifier

## K Nearest Neighbors: A Tutorial

- Introduction and motivation
- General overview of the algorithm
- Implementing KNN from scratch
- Using `SciKitLearn`'s built in function
- Selecting the optimal K
- Final notes and nuances
- Further reading

### Introduction and motivation

Imagine a dataset consisting of $n$ points, each with $M$ numerical features arranged in a vector, $\vec{x}$, and an associated value $y$. If we want to predict $y$ using the feature vector, we can use the K Nearest Neighbors algorithm, whether $y$ is a real number or a discrete label.

Consider predicting the species of an iris based on its physical characteristics. Each data point represents a flower with $M = 2$ features: its sepal length and petal length. Each flower has an associated $y$ value, which is its species: one of _setosa_, _versicolor_, or _virginica_. Here, is the sepal length of a flower plotted against its petal width, and colored to reflect which species it belongs to.

<img src="pics/initial_iris.png">

If we gave this image to a person, along with a new, unclassified data point, they might notice the clusters that have formed and make an educated guess based on the visual: "Any point that falls in or near Circle 1 will be _setosa_, a point in or near Circle 2 will be _versicolor_, and a point in or near Circle 3 will be _virginica_." 

<img src="pics/initial_iris_human.png">

The K Nearest Neighbors algorithm works in a similar way: it stores a dataset, and when given a new $\vec{x}$ will predict its associated $y$ value based on its "proximity" to other points.

### General overview of the algorithm

KNN starts by storing a data set composed of feature vectors and their associated $y$ values. (This is called the training data.) Then, when given a new $\vec{x}$ it searches through the stored data and identifies points that are similar to this new $\vec{x}$. It grabs exactly $K$ of these points, and uses them to predict a $y$ value.

So how do we determine when points are similar to each other?

KNN most commonly uses Euclidian distance. The points with the lowest Euclidian distance to the point are the points that are the most similar. In two dimensions, distance is extremely easy to visualize. In higher dimensions, for $M > 2$ we can compute the distance mathematically:

$$D = \sqrt{\sum_{m=1}^{M} (u_m - v_m)^2}$$

Once the $K$ points of interest are acquired, there are two ways to predict the associated $y$ value:

- If this is a *classification* problem, where there is a fixed set of discrete values that $y$ can take on, then we predict the most common $y$ value of the $K$ points
- If this is a *regression* problem, where $y$ is a real number, then we predict the mean of the $y$ values of the $K$ points

Notice that for the absolute most efficient implementation of KNN, the runtime of the first stage of the algorithim-- storing the data-- is $O(1)$ since there is no computaiton. However, to predict a point, the runtime is $O(n)$ since you have to check the prediction against every one of the $n$ points in the training data.

With this in mind, we can code a from scratch implementation of the algorithm.

### Implementing KNN from scratch

We'll code two version of KNN: a very read-able version that uses loops, and a more concise version that makes use of the vectorized operations in `numpy` and `pandas`. To test our function, we'll use the following generated data set with 30 points:

| n | x1 | x2 | Class |
|---|----|----|-------|
|1 | 2.0187462 |0.14625955 | A|
|2 | 1.8157475 |1.92205393  |A|
| ... | ... | ... | ... |
|29 |3.8982390 |0.35410575 |B|
|30 |3.7462195 |1.29098749 |B|

<img src = "pics/generated.png">

We'll start with a function that uses basic Python data structures, like lists and dictionaries, that predicts a single point. Then we'll call this function as we loop through the whole data set. Notice that each single prediction requires calculating distance $n$ times, where $n$ is the number of points that we have.

In [26]:
def predict_point_basic(X, dataset, K, version):
    '''
    args: X: a list representing an x vector: [x_1, ..., x_m]
          dataset: a list of n data points: [([x_11, ..., x_1m], y_1),
                                             ...,
                                             ([x_n1, ..., x_nm], y_n)]
          K: integer
          version: one of "regression" or "classification"
          
    returns: a predicted y value for the given point
    '''
    
    # Initialize a dict to store the K points closed to the input point
    # Maps distance to y value
    best_points = {}
    
    # Write a helper to calculate Euclidian distance
    def distance(x):
        summation = 0
        for (X_i, x_i) in zip(X, x):
            summation += (X_i - x_i) ** 2
        dist = math.sqrt(summation)
        return dist
    
    # Loop through the points...
    for (x, y) in dataset:
        d = distance(x)
        
        if (len(best_points) < K):
            # We don't even have K points yet, so add this one
            best_points[d] = y
            
        max_dist = max(best_points.keys())
        if (d < max_dist):
            # We have K points, but this one is nearer to X
            best_points.pop(max_dist)
            best_points[d] = y
            
    # Use the K points to make a prediction
    y_values = list(best_points.values())
    if (version == "regression"):
        prediction = mean(y_values)
    else:
        prediction = mode(y_values)
        
    return prediction       

Now let's call the function on a data set! The table I introduced above would be represented as a list of tuples like this:


<center>`[([x_11, class_12], y_1), ..., ([x_n1, x_n2], class_n)]`</center>

In [27]:
from generated_data import *
K = 5
data = build_data_basic()

# Make a prediction for each point
predictions = [predict_point_basic(x, data, K, "classification") for (x, y) in data]
true_classes = [y for (x, y) in data]
    
# See how our predictions compare to the true values
for i in range(len(predictions)):
    # Print the indices of the misclassifed values
    if (predictions[i] != true_classes[i]):
        print(i, end=" ")

# prints:
# 10 11 13 21 23

10 11 13 21 23 

For $K=5$, the model incorrectly predicted 5 of the points in the dataset that we used to train the algorithm. These mistakes are called the training errors. We divide the number of mistakes by the total number of data points to get that $K=5$ has training error of 16.7%. This measurment will come in handy later on.

For now, let's look at another way of predicting points, which assumes the data is stored in a `pandas` dataframe with $M+1$ columns.

In [18]:
def predict_point_vectorized(X, dataset, K, version):
    '''
    args: X: np.array representing an x vector: [x_1, ..., x_m]
          dataset: n x (M + 1) pandas dataframe representing a data set with M features and n points
          K: integer
          version: one of "regression" or "classification"
          
    returns: a predicted y value for the given point
    '''
    
    # Write a helper to calculate Euclidian distance
    def distance(x):
        temp = (x - X) ** 2
        dist = math.sqrt(sum(temp))
        return dist
    
    # Add a column to the dataset that contains the distance of each point to X
    xs = dataset.iloc[:, 0:2]
    dataset["distance"] = xs.apply(distance, axis=1)
    
    # Sort the array by the distance column
    dataset = dataset.sort_values("distance")
    
    # Grab the K y values with the lowest distance
    y_values = dataset.iloc[0:K, -2]
    
    # Use the K points to make a prediction
    if (version == "regression"):
        prediction = mean(y_values)
    else:
        prediction = mode(y_values)
        
    return prediction  
    

The function works for any arbitrary $M$ so long as the last column of the dataframe stores the $y$ values.

In [19]:
K = 21
data = build_data_vectorized()

# Make a prediction for each point
predictions = []
true_classes = []
for i in range(len(data)):
    x = data.iloc[i, 0:2]
    y = data.iloc[i, 2]
    p = predict_point_vectorized(x, data, K, "classification")
    predictions.append(p)
    true_classes.append(y)
    
# See how our predictions compare to the true values
for i in range(len(predictions)):
    # Print the indices of the misclassifed values
    if (predictions[i] != true_classes[i]):
        print(i, end=" ")
        
# prints:
# 4 5 10 11 13 21 23

4 5 10 11 13 14 21 23 

Now we've used $K=21$, and the algorithm we came up with incorrectly predicted 7 of the points, which works out to a 23.3% training error, worse than the first model. Later we'll see if training error is a good way to determine if a model is good.

### Using `SciKitLearn`'s built in function

The nice thing about Python is that it has a built in library or package for everything. And indeed, if you don't feel like writing a from-scratch KNN classifier, Python has a package that you can use. The `SciKitLearn` package has a KNN class that can predict points in a few lines. Here it is in action.

In [20]:
# Explicitly define the X and y from earlier
X = data.iloc[:, 0:2]
y = data.iloc[:, 2]

# Specify the number of neighbors
neighbors = KNeighborsClassifier(n_neighbors=3)

# Pass in your training data
neighbors.fit(X, y)

# Predict values for new points
# (in this case, I just pass in the original X values, this time unlabled.)
predictions = neighbors.predict(X)

# See how our predictions compare to the true values
for i in range(len(predictions)):
    # Print the indices of the misclassifed values
    if (predictions[i] != y[i]):
        print(i, end=" ")
        
# prints:
# 21 23
# Training error = 6.7%

21 23 

### Selecting the optimal K

Something interesting was happening in the three examples above. Though we used the exact same data set each time, adjusting the value of $K$ led to a difference in the predictions, and therefore, a difference in the number of errors made. The KNN prediction relies very heavily on whatever value is chosen for K, so we want to be sure that we are selecting the optimal value.

First, let's return to the irises example to see just how different values of $K$ affect the predictions. A good way to visualize how the algorithm is predicting is to treat each pixel of the plot like a new data point, and shade it to match the class that the algorithm would assign to it. The two plots below have identical training data, but different values for $K$. As you can see, the predictions are quite different. Note how with a larger value of K, the decision boundaries are smoother.

<img src="pics/iris_1.png">
<img src="pics/iris_99.png">

Now let's look at a regression example. Here, visualization works best when $M = 1$, that is, there is just a single predicting feature, which I'll show on the $x$ axis. To illustrate the prediction, you can plot points of all the predicted $y$ values for the range of $x$ values on the graph. The plot below shows three different values of $K$, where we are predicting the value of a house based on the income of its owners. For a small value of $K$, the points lie in the center of the data, but are scattered and chaotic. With a larger $K$, the predictions meld into a smooth line.

<img src="pics/houses.png">

Now that we've seen how K can have a big outcome on the predictions that are made, how do we choose the best value of K? We follow this rough outline:

- Split the data into two parts
- Fit many KNN models using one part of the data (the training data)
- Use these models to make predictions on the second part of the data (the validation data)
- See how well the predictions performed
- Select the value of K that provided the best predictions

The most important thing to notice here is the way we split our data into two parts. This is to reduce overfitting. One of the worst things that could happen while modelling is to create a predictor that captures every detail of the data you used to build it, to the point where it does not accurately describe the true underlying pattern.

To minimize this, we build our model with one data set (the training data) and measure its performance with a totally separate data set (the validation data). Here is the code for doing that using the built in `sci kit learn` class:

In [25]:
# Import the data
data = build_iris()
features = data.iloc[:, 0:2]
labels = data.iloc[:, 2]

# Generate a random sample to split the data
# We want half the data to be training, and half to be test
n = len(data)
sample = list(range(0, n))
random.shuffle(sample)
sample_train = sample[:75]
sample_test = sample[75:]

# Separate data into training and validation
X = features.iloc[sample_train, :]
y = labels[sample_train]
X_valid = features.iloc[sample_test, :]
y_valid = labels[sample_test]

# Initialize lists to store results
train_errors = []
valid_errors = []

# Loop through values of K
K = 75
for k in range (1, K + 1):

    # Specify the number of neighbors
    neighbors = KNeighborsClassifier(n_neighbors=k)

    # Pass in your training data
    neighbors.fit(X, y)

    # Predict values for the data
    pred_train = neighbors.predict(X)
    pred_valid = neighbors.predict(X_valid)
    
    # compute and store error
    mistakes_train = 0
    mistakes_valid = 0
    for i in range(len(pred_train)):
        if (y.iloc[i] != pred_train[i]):
            mistakes_train += 1
        if (y_valid.iloc[i] != pred_valid[i]):
            mistakes_valid += 1
                    
    train_errors.append(mistakes_train / len(y))
    valid_errors.append(mistakes_valid / len(y_valid))

Below are the training and validation errors of various models fit on the iris data set, with training error in black and validation error in red. There are some important trends to notice:

- _The validation error is consistently higher than the training error_

This is because a model fit to the training data has more information about the training data, and will do a better job predicting those values.

- _At a certain value for K, the error shoots up_

This happens when the value for K gets so big that the points the algorithm says are "similar" really aren't that similar to the new point.

Here, the best value for $K$ is 5, 9, or 11, where the validation error is 5.3%. Any one of those should be used to fit the final model.

<img src="pics/error.png">

## Final notes and nuances

The KNN algorithm is not intuitively difficult. However, there are a few nuances to it that come up when it is used with larger and more complicated data sets. 

- **Ties in distance**

Let's say that we have $K=3$ and the algorithm is calculating distances. The closest point is off by 0.3, the next closest is off by 1.2, but then there are _two_ points 1.6. Which of these points should be used? The most common way to deal with this scenario is to simply randomly select one of the tied points. The idea is that since our new point is equally close to each of these tied points, they are equally likely to help predict the assigned value.

- **Ties in classification**

Another way that points can tie has to do with the stage of the algorithm where the collection of $K$ points is voting on the final predicted label. Notice that this only matters in the classification setting-- for regression, we just take the average. For binary classification, this problem can be totally eliminated by always using an odd value for $K$. However, for more complex classification, that doesn't always help. If there are three classes, for instance, and $K=5$ we can end up with a {A, A, B, B, C} tie. The tie is broken the same way as above, a randomized assignment from the tied labels, and for the same reason: each class is equally likely.

- **Weights**

One of the inherent biases to KNN is that it assumes each feature in $\vec{x}$ is equally important in computing $y$. This might not always be true! If you know that some features should have greater importance in predicting the final value of $y$ than you can weight them more highly. There is a `weights` parameter in the `sklearn` class that we used above which takes in a callable function. The function should take in an array of true distances, and return an array of the same length which manipulates the distances in some way. This can also be added to our from-scratch implementation: the `predict`function would simply be modified to take in a similar function.

- **Regression error**

The last thing to discuss is how to compute error in the regression setting. For classification, what we used is called the _misclassifcation rate_ which essentially counts the number of mistakes that are made. In regression, you are likely to make a mistake on every prediction! So instead of counting the number of mistakes, we count the magnitude of mistakes using the following formula:

$$Error = \frac{1}{N}\sum_{i=1}^{N}{(\hat{y} - y)^2}$$

## Further reading

Lastly, here is some further reading:
- [More specific information about the `sklearn` implementation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- [A more mathematical description of weighted KNN](http://www.data-machine.com/nmtutorial/distanceweightedknnalgorithm.htm)
- [`matplotlib` instructions on visualizing KNN with decision boundaries](http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html)