# Introduction

The k-Nearest Neighbors algorithm, or KNN for short, is a simple yet effective algorithm used primarily for classification in supervised learning problems. Because of its simplicity, it is one of the most widely used learning algorithms, and it can even outperform many more complicated algorithms on a wide variety of problems. This tutorial serves as an in-depth guide to learning the foundations of KNN, understanding the pros and cons of the algorithm, and ultimately implementing the algorithm from scratch to tackle real-world examples. Furthermore, we will analyze our performance on these examples and see ways to improve the algorithm.

# What is KNN?

To understand the KNN algorithm, we must first understand the problem it is trying to solve. The classical KNN algorithm is usually used for classification and falls under supervised learning. In other words, we are trying to use information in the training data to predict the group or class of a target. An example of this type of problem is predicting the gender of a person given hair length, height, and weight. Supervised learning uses a data set of training examples, each with an associated correct label. We use these labels to infer a function or pattern in the data and then classify a new test datum. But how does KNN use this training information to predict the class of the test data?

KNN achieves this task in a simple and naive way. The underlying premise of the algorithm is that data points in the same group or class have similar values for their variables. Looking back at our gender example, we might see that women tend to have longer hair and to be shorter and lighter than men. This information is then leveraged to classify a new point as either male or female. The KNN algorithm begins by taking the new point that needs to be classified and finding the k most similar points (its neighbors) in the training data. It then takes a majority vote over the classes of those neighbors to decide the class of the new point.

Assume we have a person who is 5'6", weighs 135 pounds, and has a hair length of 12". Using KNN with a k-value of 5, we would find the 5 people in the training data who are most similar to the person whose gender we are trying to predict. Say that of these 5 neighbors, 4 are Female and 1 is Male. The algorithm would then classify that person's gender as Female.

The inquisitive reader might ask what metric should be used to find the k "most similar" points. Similarity between points is measured with a distance metric, of which Euclidean distance is the most commonly used. Euclidean distance is defined as $d(x,x')=\sqrt{(x_1-x'_1)^{2}+(x_2-x'_2)^{2}+\dots+(x_n-x'_n)^{2}}$. Other popular distance metrics include the Manhattan distance, Hamming distance, and Chebyshev distance. Using different metrics can influence the results and success of the algorithm, as we will see later in an example problem.
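
The vote in the example above can be traced in a few lines of Python. This is only a sketch: the ten training people below are made-up values chosen so that the 5 nearest neighbors split 4 to 1, and the helper name `knn_vote` is hypothetical. It uses `math.dist` (Python 3.8+) for the Euclidean distance.

```python
import math
from collections import Counter

# Hypothetical training data: (height in inches, weight in pounds, hair length in inches)
train_points = [
    (65, 130, 14), (64, 128, 11), (67, 140, 13), (66, 138, 10), (62, 115, 16),
    (68, 142, 9), (70, 170, 2), (72, 185, 3), (69, 160, 4), (71, 175, 1),
]
train_labels = ["Female"] * 5 + ["Male"] * 5


def knn_vote(points, labels, query, k):
    """Count the labels of the k training points closest to the query."""
    # pair each training point's Euclidean distance to the query with its label
    by_distance = sorted(
        (math.dist(point, query), label) for point, label in zip(points, labels)
    )
    # keep only the labels of the k nearest points and tally them
    return Counter(label for _, label in by_distance[:k])


# the person from the text: 5'6" (66 inches), 135 pounds, 12 inches of hair
votes = knn_vote(train_points, train_labels, (66, 135, 12), k=5)
print(votes)                       # Counter({'Female': 4, 'Male': 1})
print(votes.most_common(1)[0][0])  # Female
```

With these particular values, four of the five closest people are labeled Female, so the majority vote returns Female.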

Now that we have learned the basics of how the KNN algorithm works in theory, let us code the algorithm from scratch to deepen our understanding.

# Implementation

Let us begin by implementing a metric for comparing how similar two points are to each other. We will be using the Euclidean distance, the straight-line distance between two points in a Euclidean space. The distance between two points is given by the function $d(x,x')=\sqrt{(x_1-x'_1)^{2}+(x_2-x'_2)^{2}+\dots+(x_n-x'_n)^{2}}$.

In [55]:
import math

# Euclidean distance between two points given as equal-length sequences
def euclideanDistance(x, x_i):
    # start the running sum of squared differences at 0
    distance = 0
    # for every explanatory variable, add the squared difference between the two points.
    # NOTE: len(x) - 1 leaves out the last entry of each point, which suits rows that
    # end with the class label; use range(len(x)) when rows hold features only.
    for feature in range(len(x) - 1):
        distance += (x[feature] - x_i[feature]) ** 2
    # return the square root of the sum of squared differences
    return math.sqrt(distance)
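
The other metrics mentioned earlier (Manhattan, Chebyshev, and Hamming) could be written in the same style. The sketch below is not part of the tutorial's implementation; the function names are assumptions, and each function compares every coordinate, so it expects points that contain features only.

```python
def manhattanDistance(x, x_i):
    # sum of the absolute differences along each coordinate
    return sum(abs(a - b) for a, b in zip(x, x_i))


def chebyshevDistance(x, x_i):
    # largest absolute difference along any single coordinate
    return max(abs(a - b) for a, b in zip(x, x_i))


def hammingDistance(x, x_i):
    # number of positions at which the two sequences differ
    return sum(1 for a, b in zip(x, x_i) if a != b)


print(manhattanDistance((0, 0), (3, 4)))      # 7
print(chebyshevDistance((0, 0), (3, 4)))      # 4
print(hammingDistance("karolin", "kathrin"))  # 3
```

For the same pair of points, (0, 0) and (3, 4), the Euclidean distance is 5, so the choice of metric can change which neighbors count as closest.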

Next, it is important to note that KNN does not need to be trained and thus has no real training function. This is because it is a lazy learner, a property we will discuss later in this tutorial. We will therefore move straight to the "meat" of the algorithm: classifying our test data, which is done in the predict function. We first calculate the distance between the point we are predicting a class for and every training point. We then select the k points (neighbors) closest to the test point by sorting the list of distances in ascending order and taking the indices of the k smallest. Next, we store the classes of these neighbors in a list and find the class that occurs most often. In case of a tie between classes, we select the tying class that occurred first in the list. Finally, we return that most common class.

In [2]:
from collections import Counter
def predict(x_train, x_test, y_train, k):
    #create a list to keep track of the distances between x_test and the points in x_train.
    distList = []
    classList = []
    #iterate through every training point and store the euclidean distance between x_test and 
    #the point in the distance list as a list with distance and its index.
    for i in range(len(x_train)):
        distList.append([euclideanDistance(x_test,x_train[i]),i])
    #sort the nested list by distance
    distList.sort()
    # get the index for the first k elements of the list and index y_train to append its class to the list of classes.
    for j in range(k):
        index = distList[j][1]
        classList.append(y_train[index])
    #create a counter object passing in the list of classes
    c = Counter(classList)
    #return the most common class.
    return c.most_common(1)[0][0]
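
The tie-breaking rule described above comes from how `Counter.most_common` orders elements with equal counts (first encountered comes first, as documented for Python 3.7 and later). A small sketch with made-up neighbor classes, listed from nearest to farthest:

```python
from collections import Counter

# classes of the k = 4 nearest neighbors, nearest first (made-up values)
classList = ['Male', 'Female', 'Female', 'Male']

# both classes have 2 votes; the one seen first in the list is returned
winner = Counter(classList).most_common(1)[0][0]
print(winner)  # Male
```

Because the neighbor list is sorted by distance, a tie goes to the tied class that appears earliest among the neighbors. For a two-class problem, choosing an odd k avoids ties altogether.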
    
        

Putting it all together we have:

In [65]:

class KNN:
    def __init__(self,x_train,x_test,y_train,y_test,k):
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test
        self.k = k
    def train(self):
        #do nothing! remember that the KNN is a lazy learner!
        pass
    def euclideanDistance(self,x,x_i):
        distance = 0
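        # NOTE: len(x) - 1 leaves out the last entry of each point;
        # use range(len(x)) when rows hold features only.
        for feature in range(len(x)-1):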
            distance += (x[feature] - x_i[feature])**2
        return math.sqrt(distance)
    def predict(self,x_test):
        distList = []
        classList = []
        for i in range(len(self.x_train)):
            distList.append([self.euclideanDistance(x_test,self.x_train[i]),i])
        distList.sort()
        for j in range(self.k):
            index = distList[j][1]
            classList.append(self.y_train[index])
        c = Counter(classList)
        return c.most_common(1)[0][0]
    def knncomplete(self):
        classifications = []
        for i in range(len(self.x_test)):
            # use the test set stored on the instance, not a global variable
            classifications.append(self.predict(self.x_test[i]))
        return classifications
    def accuracy(self):
        classifications = self.knncomplete()
        return sum(1 for x,y in zip(classifications,self.y_test) if x == y) / len(self.y_test)
         
        

# Examples:

## Classifying Iris Flower Data:

The Iris flower data is commonly used as a benchmark for testing implementations of classification algorithms. The dataset was first introduced by British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems. The multivariate dataset contains 50 samples from each of 3 species of iris, for 150 samples in total. The dataset has 4 explanatory variables: length of petals, width of petals, length of sepals, and width of sepals. We will use these variables to classify a flower into one of the 3 species of iris. We will begin by loading the data into a pandas data frame as shown below. Since the data file does not have column names, let us first define these and pass them into pd.read_csv when we load the data.

In [63]:
import pandas as pd
names = ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'Class']
df = pd.read_csv('/Users/arjuncomputerscience/Desktop/iris.data', header=None, names=names)
df.head()

|   | sepalLength | sepalWidth | petalLength | petalWidth | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


Next, we will split the data into training and test data and instantiate our KNN class object. We will then call the accuracy function to see how accurately the model classifies the held-out test data.

In [64]:
from sklearn.model_selection import train_test_split
import numpy
x = numpy.array(df.loc[:,['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth']])
y = numpy.array(df['Class'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=23)
kNN = KNN(x_train,x_test,y_train,y_test,4)
kNN.accuracy()

0.96

We see that the algorithm achieved a 96 percent accuracy in classifying the species of iris given the sepal length and width and the petal length and width. A 96 percent accuracy is pretty good, but we will later explore a better statistic for model validation and comparison and see how to further improve our model by tuning our model parameter k. Let us look at some pros and cons of KNN before talking about algorithm improvements and solving our second example.

# Pros and Cons of KNN

## Pros:
The first major pro of the KNN model is that it is very intuitive and easy to understand and implement. As such, it can be applied quickly as a first-approach algorithm in classification problems. Furthermore, the KNN model makes no assumptions about the underlying distribution of the data, which makes the algorithm non-parametric. Thus, it can be used for a wider range of problems and can beat models whose assumptions the data does not in fact satisfy. Another major pro of the KNN algorithm is that it is a lazy learner. This means the algorithm only generalizes from the training data when a query to predict a new point is made. As such, the KNN model needs no training phase; it simply stores the entire dataset. An eager learner is the opposite: it learns a model or distribution from the data and stores that before making any prediction, so it has to be trained first. The last pro we will discuss is the algorithm's ability to classify into multiple categories, not just two as some other algorithms are limited to.
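
The lazy versus eager distinction can be sketched in code. The two classes below are illustrative only and are not part of this tutorial's implementation: `LazyNearestNeighbor` is a 1-nearest-neighbor stand-in for KNN, and `EagerNearestCentroid` is a simple nearest-centroid classifier used here as an example of an eager learner. All names and data are hypothetical, and `math.dist` requires Python 3.8+.

```python
import math


class LazyNearestNeighbor:
    """Lazy: fit only stores the data; all the work happens in predict."""

    def fit(self, points, labels):
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # scan every stored point to find the one closest to the query
        nearest = min(range(len(self.points)),
                      key=lambda i: math.dist(self.points[i], query))
        return self.labels[nearest]


class EagerNearestCentroid:
    """Eager: fit condenses the data into one mean point per class."""

    def fit(self, points, labels):
        grouped = {}
        for point, label in zip(points, labels):
            grouped.setdefault(label, []).append(point)
        # average each coordinate over the members of a class
        self.centroids = {
            label: tuple(sum(coord) / len(members) for coord in zip(*members))
            for label, members in grouped.items()
        }
        return self

    def predict(self, query):
        # compare the query against the stored centroids only
        return min(self.centroids,
                   key=lambda label: math.dist(self.centroids[label], query))


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
lazy = LazyNearestNeighbor().fit(points, labels)
eager = EagerNearestCentroid().fit(points, labels)
print(lazy.predict((2, 2)), eager.predict((2, 2)))  # a a
print(len(lazy.points), len(eager.centroids))       # 6 2
```

Both classifiers agree on this toy query, but the lazy one keeps all 6 training points around, while the eager one keeps only 2 centroids after training.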

## Cons:
The first major con is the downside of being a lazy learning algorithm. Since no training is done up front, the testing/predicting phase is computationally intensive: the algorithm has to go through all the training data each time it needs to make a prediction. This cost grows with the size of the training set, which is not ideal in industry settings. Furthermore, a big downside to the KNN algorithm is its sensitivity to class imbalance. If many more data points belong to one class than to another, the majority vote is skewed toward the more common class, which is then selected more often. Another big downside to the KNN algorithm appears with high-dimensional data. As the number of dimensions grows, distances between points become less informative: the difference between the distance to the closest point and the distance to the farthest point shrinks relative to the distances themselves. The result is poorer classification accuracy.
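
The effect of high dimensionality on distances can be observed with a small experiment. This is a sketch under stated assumptions: the points are drawn uniformly at random from the unit hypercube, and the helper name `relative_contrast` is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance, and uses `math.dist` (Python 3.8+).

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for one random query point."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


# the contrast shrinks as the number of dimensions grows
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

In low dimensions the farthest point is many times farther away than the nearest one, while in 1000 dimensions the two distances differ by only a small fraction, so the "nearest" neighbors carry much less information.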

However, it should ultimately be noted that the KNN algorithm is still widely used and effective in many settings. There are also ways to reduce the impact of these cons, as we will see later in this tutorial.

# Examples cont.