# KNN

Now that we have learned about optimizing code, let's implement a machine learning algorithm called `K-Nearest Neighbors`, or KNN.


In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.[1] In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

 * In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

 * In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
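
As a minimal sketch of the two output rules above, assume the distances from a query point to each training example have already been computed and are given as `(distance, value)` pairs. The helper names `knn_classify` and `knn_regress` are illustrative and are not used elsewhere in this notebook.

```python
from collections import Counter


def knn_classify(neighbors, k):
    """Majority vote over the labels of the k closest neighbors."""
    nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(neighbors, k):
    """Average of the values of the k closest neighbors."""
    nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    return sum(value for _, value in nearest) / k


# (distance to the query point, class label)
labeled = [(0.1, "setosa"), (0.4, "versicolor"), (0.2, "setosa"), (0.9, "virginica")]
print(knn_classify(labeled, k=3))  # setosa

# (distance to the query point, numeric target)
valued = [(0.1, 1.0), (0.4, 4.0), (0.2, 2.0), (0.9, 10.0)]
print(knn_regress(valued, k=3))  # mean of 1.0, 2.0 and 4.0
```

With `k=1` both helpers simply return the value attached to the single closest neighbor.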

First, let's download the Iris dataset and split it into training and test sets. Then we will go over a simple implementation of KNN in strict Python.

In [None]:
%%bash
mkdir -p ../.data
cd ../.data
if [ ! -f iris.data.txt ]; then
    echo "File not found. Downloading from github"
    wget -q https://raw.githubusercontent.com/WinVector/Logistic/master/iris.data.txt
else
    echo "File exists, not downloading from github"
fi

In [None]:
import numpy as np
import pandas as pd
import random

def read_iris_data():
    data = pd.read_csv('../.data/iris.data.txt').values
    np.random.shuffle(data)
    return data[:,0:4], data[:,4]

X, Y = read_iris_data()
X = X.astype('float64')
x_train = X[:120]
y_train = Y[:120]
x_test = X[120:]
y_test = Y[120:]

## Running our Model
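Below is a simple runner class. It trains the model on the training split, then calls the model's `predict` method on every test point and counts the correct and incorrect predictions, which gives the classifier accuracy.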
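
The runner reports raw counts rather than a ratio. A small sketch of turning those counts into an accuracy figure follows. The `accuracy` helper is hypothetical and not part of the runner, and the counts in the example are made up for illustration.

```python
def accuracy(correct, incorrect):
    """Fraction of test points that were classified correctly."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("no test points were evaluated")
    return correct / total


# Made-up example: 28 of 30 held-out points classified correctly.
print(f"{accuracy(28, 2):.1%}")  # 93.3%
```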

In [None]:
class Runner():
    def __init__(self, model):
        self.model = model
        model.train(x_train, y_train)

    def test(self):
        correct = 0
        incorrect = 0
        for x, y in zip(x_test, y_test):
            if self.model.predict(x) == y:
                correct += 1
            else:
                incorrect += 1
        return correct, incorrect
        

## Vanilla Python
First, let's walk through KNN in strict Python.
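
The docstring in the implementation below notes that skipping the square root does not change which points are closest. Here is a small check of that claim, using made-up points with four features each, like the Iris rows. The names `query`, `points` and `squared_distance` are illustrative.

```python
import math

query = (5.0, 3.4, 1.5, 0.2)
points = [
    (5.1, 3.5, 1.4, 0.2),
    (6.3, 3.3, 6.0, 2.5),
    (5.9, 3.0, 4.2, 1.5),
]


def squared_distance(a, b):
    """Sum of squared per-feature differences (no square root)."""
    return sum((a_n - b_n) ** 2 for a_n, b_n in zip(a, b))


squared = [squared_distance(query, p) for p in points]
euclidean = [math.sqrt(d) for d in squared]

# Rank the point indices by each measure; sqrt is monotonic,
# so both rankings come out the same.
rank_squared = sorted(range(len(points)), key=lambda i: squared[i])
rank_euclidean = sorted(range(len(points)), key=lambda i: euclidean[i])
print(rank_squared == rank_euclidean)  # True
```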

In [None]:
from collections import Counter
from heapq import nsmallest

import numpy as np


class KNN:
    def __init__(self, k=5):
        self.k = k
        self.train_data = None
        self.labels = None


    def train(self, train_data, labels):
        self.train_data = list(train_data)
        self.labels = labels
    
    def _get_distance_squared(self, point):
        '''
        Euclidean distance = sqrt((p_1-r_1)^2 + (p_2-r_2)^2 + ...)
        Skipping the square root speeds up the code and does not change
        the ordering of which points are closest to our point.
        '''
        distances = [0.] * len(self.train_data)
        for i, row in enumerate(self.train_data):
            for r_n, p_n in zip(row, point):
                # accumulate the squared difference for each feature
                distances[i] += (p_n - r_n) ** 2
        return distances
    
    def predict(self, data):
        distances = self._get_distance_squared(data)
        distance_labels = list(zip(distances, self.labels))
        distance_labels.sort(key=lambda x: x[0])
        k_nearest = [x[1] for x in distance_labels[:self.k]]
        most_common = Counter(k_nearest).most_common(1)[0][0]
        return most_common


In [None]:
knn = KNN()
runner = Runner(knn)
correct, incorrect = runner.test()
print(correct, incorrect)
%timeit runner.test()

## NumPy

Now let's briefly go over the same algorithm implemented in NumPy.
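
The NumPy version relies on two ideas: broadcasting a single query row against the whole training matrix, and `np.argpartition` to pick the nearest rows without a full sort. Here is a minimal sketch on made-up data. The arrays `train` and `query` and the value of `k` are illustrative.

```python
import numpy as np

train = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
    [5.9, 3.0, 4.2, 1.5],
    [4.9, 3.0, 1.4, 0.2],
])
query = np.array([5.0, 3.4, 1.5, 0.2])
k = 2

# Broadcasting subtracts the query from every training row at once,
# then the squared differences are summed per row: result shape (4,).
squared_distances = np.sum((train - query) ** 2, axis=1)

# argpartition puts the indices of the k smallest distances in the first
# k slots (in no guaranteed order) without fully sorting the array.
nearest = np.argpartition(squared_distances, k - 1)[:k]
print(sorted(nearest.tolist()))  # [0, 3]
```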

In [None]:
from collections import Counter
from heapq import nsmallest

import numpy as np


class KNN_np:
    def __init__(self, k=5):
        self.k = k
        self.train_data = None
        self.labels = None


    def train(self, train_data, labels):
        self.train_data = train_data
        self.labels = labels

    
    def _get_distances(self, data):
        distances = data - self.train_data
        squared_distances = distances**2
        return np.sum(squared_distances, axis=1)
    
    def predict(self, data):
        distances = self._get_distances(data)
        
        # partition around the k-th smallest distance instead of a hardcoded 5
        partition = np.argpartition(distances, self.k - 1)
        kclosest = self.labels[partition[0 : self.k]]
        return Counter(kclosest).most_common(1)[0][0]


In [None]:
knn = KNN_np()
runner = Runner(knn)
correct, incorrect = runner.test()
print(correct, incorrect)
%timeit runner.test()

## Numba

See if you can implement KNN using Numba + NumPy.
Or you can review my code and try to optimize it. I left a lot of code to be optimized.

> Numba is still in beta, so I ran into an exception during the JIT compile.
> At the time of writing, enumerating over a 2D ndarray hasn't been implemented yet.
> So in the code comments I left the fix (a two-line refactor from enumerate to range).
> But it is still a 0.x release, so give Numba some credit. They have done a lot for a JIT compiler.


In [None]:
from collections import Counter
from heapq import nsmallest

import numpy as np
from numba import njit


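# Hand-rolled 1-D sum (currently unused).
# Named sum_1d so it does not shadow Python's built-in sum().
@njit
def sum_1d(x):
    s = 0
    for i in range(x.shape[0]):
        s += x[i]
    return s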

@njit
def get_distances(x, train_data):
    distances = np.zeros(train_data.shape[0], dtype=np.float64)
#     for i, row in enumerate(train_data):
#         This throws a not implemented exception. The bugfix is below
    for i in range(train_data.shape[0]):
        row = train_data[i]
        s = 0
        for j in range(row.shape[0]):
            s += (x[j] - row[j])**2
        distances[i] = s
    return distances


class KNN_Numba:
    def __init__(self, k=5):
        self.k = k
        self.train_data = None
        self.labels = None


    def train(self, train_data, labels):
        self.train_data = train_data
        self.labels = labels
        self.max_values = np.amax(np.absolute(self.train_data), axis=0)


    def predict(self, data):
        distances = get_distances(data, self.train_data)
        
        # partition around the k-th smallest distance instead of a hardcoded 5
        partition = np.argpartition(distances, self.k - 1)
        kclosest = self.labels[partition[0 : self.k]]
        return Counter(kclosest).most_common(1)[0][0]


In [None]:
from collections import Counter
from heapq import nsmallest

import numpy as np
from numba import njit


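# Exercise template: fill in train() and predict() yourself.
# Note: running this cell replaces the KNN_Numba class defined above.
class KNN_Numba: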
    def __init__(self, k=5):
        self.k = k
        self.train_data = None
        self.labels = None

    def train(self, train_data, labels):
        pass
    
    def predict(self, data):
        pass


In [None]:
knn = KNN_Numba()
runner = Runner(knn)
correct, incorrect = runner.test()
print(correct, incorrect)
%timeit runner.test()

## Cython

Try to implement this in Cython.
In my implementation I was able to get speeds much faster than NumPy.

NumPy is really fast, but Cython can be faster when you know exactly what you want.

Give it a shot and see if you can beat NumPy speeds on your machine.

In [None]:
%load_ext Cython

In [None]:
%%cython --annotate
import cython
cimport cython

from collections import Counter

cdef struct Score:
    int index
    float distance

ctypedef double[:,:] Matrix
ctypedef double[:] Vector

cdef float get_distance(int vector_length, Vector vector1, Vector vector2):
    cdef float distance, total_distance
    cdef int i
    total_distance = 0.0
    for i in range(vector_length):
        distance = vector1[i] - vector2[i]
        if distance < 0:
            distance *= -1
        total_distance += distance
    return total_distance

cdef class KNNC:

    cdef int _k, _vector_length, _training_instances
    cdef Matrix _train_data
    cdef object labels

    def __init__(self, int k):
        self._k = k
        self._vector_length = 0
        self._training_instances = 0

    def train(self, Matrix train_data, labels):
        # self._train_data = train_data
        self.labels = labels
        self._vector_length = len(train_data[0])
        self._training_instances = len(train_data)
        self._train_data = train_data


    @cython.nonecheck(False)
    def predict(self, Vector data):
        cdef Score[50] closest
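        cdef int i, j, tmp_index, candidate_index
        cdef float distance, tmp_distance
        cdef Vector compared

        # fill the k slots with a large sentinel distance
        for i in range(self._k):
            closest[i] = Score(index=0, distance=100000.)
        for i in range(self._training_instances):
            compared = self._train_data[i]
            distance = get_distance(self._vector_length, data, compared)
            # track the candidate separately rather than reassigning the loop counter i
            candidate_index = i
            for j in range(self._k):
                if distance < closest[j].distance:
                    # swap the candidate into slot j and keep pushing
                    # the displaced entry further down the list
                    tmp_distance = closest[j].distance
                    tmp_index = closest[j].index
                    closest[j].distance = distance
                    closest[j].index = candidate_index
                    candidate_index = tmp_index
                    distance = tmp_distance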
        closest_classes = Counter([self.labels[x.index] for x in closest[:self._k]])
        return closest_classes.most_common(1)[0][0]





In [None]:
%%cython --annotate

cimport numpy as np
import numpy as np

cdef class KNNC:
    cdef int k
    cdef np.ndarray train_data


    def __init__(self, int k):
        self.k = k
        
    def train(self, np.ndarray train_data, np.ndarray labels):
        pass

    # Vector is a ctypedef from the previous cell's module, so spell out the type here
    def predict(self, double[:] data):
        pass


In [None]:
knn = KNNC(5)
runner = Runner(knn)
correct, incorrect = runner.test()
print(correct, incorrect)
%timeit runner.test()


## TensorFlow

TensorFlow is a popular framework for writing machine learning algorithms.
It is designed to run very complex algorithms very quickly, with some startup overhead to set up `tensors`.
The code below uses the TensorFlow 1.x graph API (`tf.placeholder`, `tf.Session`).

My implementation is definitely not the most efficient algorithm.
But if you are interested in learning TensorFlow, KNN is a very simple algorithm to help you learn a widely used neural network framework.

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import random
import math

import numpy as np
import tensorflow as tf

class KNNtf:
    def __init__(self, k=5):
        self.k = k


    def train(self, train_data, labels):
        instance_shape = train_data[0].shape
        self.labels = labels
        knn_graph = tf.Graph()
        with knn_graph.as_default():
            # set up variables
            train_features = tf.constant(train_data)
            X = tf.placeholder(shape=instance_shape, dtype=train_data.dtype, name='X')
            Y = tf.placeholder(shape=(1,), dtype=np.float32, name='Y')
            tf_labels = tf.constant(labels)

            # get distance
            difference = tf.abs(train_features - X)
            distance = tf.reduce_sum(difference, 1)

            # find the k nearest neighbors
            k_nearest_neighbors_indices = tf.nn.top_k(-distance, k=self.k).indices
            k_nearest_neighbors = tf.gather(tf_labels, k_nearest_neighbors_indices)

            # get the class of the most common class in k nearest neighbors
            y, idx, count = tf.unique_with_counts(k_nearest_neighbors)
            max_index = tf.argmax(count)
            nearest_neighbor = tf.gather(y, max_index)

            sess = tf.Session()
        # set a function to call to get the prediction
        self._model = lambda features: sess.run(nearest_neighbor, feed_dict={'X:0': features})

    def predict(self, data):
        nn = tf.compat.as_text(self._model(data))
        return nn


In [None]:
knn = KNNtf()
runner = Runner(knn)
correct, incorrect = runner.test()
print(correct, incorrect)
%timeit runner.test()