# Introduction

This tutorial introduces the k-nearest neighbors algorithm (k-NN), a non-parametric method used in pattern recognition for classification and regression. "Non-parametric" means that the method makes no assumptions about the underlying data distribution; the model structure is determined from the data itself. Real-world data often does not obey the typical theoretical assumptions, so k-NN is a popular choice for a classification study when there is little or no prior knowledge about the distribution of the data.
Its purpose is to use a database in which the data points are separated into several classes to predict the class of a new sample point. The neighbors are taken from a set of objects for which the class or object property value is known; this set can be considered the training set for the algorithm. The contributions of the neighbors can also be weighted so that nearer neighbors contribute more than more distant ones. A basic visual representation would look like this:
___

[<img src="https://cdn-images-1.medium.com/max/1000/0*Sk18h9op6uK9EpT8.">](https://cdn-images-1.medium.com/max/1000/0*Sk18h9op6uK9EpT8.)
___

The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle). In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).
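
The effect of k in the figure can be reproduced with a few lines of Python. This is only a sketch: the list below encodes the neighbor counts described above (2 triangles and 1 square among the 3 nearest, then 2 more squares), the ordering within each circle is an assumption, and `majority_label` is an illustrative helper rather than part of the tutorial code.

```python
from collections import Counter

# Neighbor labels ordered from nearest to farthest. The counts match the figure;
# the exact ordering within each circle is assumed for illustration.
labels_by_distance = ["triangle", "triangle", "square", "square", "square"]

def majority_label(labels, k):
    """Return the most common label among the k nearest neighbors."""
    return Counter(labels[:k]).most_common(1)[0][0]

print(majority_label(labels_by_distance, 3))  # triangle
print(majority_label(labels_by_distance, 5))  # square
```

The same sample point changes class when k goes from 3 to 5, which is why the choice of k matters.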


## Tutorial Content

In this tutorial, we will focus on the classification aspect with the help of a dataset. The test problem that we will be using is iris classification, with data collected from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris. The iris dataset contains 150 observations of iris flowers in 3 classes of 50 instances each, where each class refers to a species of iris plant. One class is linearly separable from the other 2. Each observation has 4 measurements: sepal length, sepal width, petal length and petal width, all in centimeters. The predicted attribute is the species, which is one of setosa, versicolor or virginica.

It is a standard dataset where we can split the data into training and test datasets and use the results to evaluate our algorithm implementation. Good classification accuracy on this problem is above 90%, typically 96% or better. We will cover the following topics in this tutorial:

- [Getting started](#Getting-started)
- [Handling the Data](#Handling-the-Data)
- [Calculating the Similarity](#Calculating-the-Similarity)
- [Finding the Neighbors](#Finding-the-Neighbors)
- [Generating a Response](#Generating-a-Response)
- [Checking the Accuracy](#Checking-the-Accuracy)
- [Tieing it all together](#Tieing-it-all-together)

## Getting started
The k-nearest neighbors algorithm is based around the simple idea of predicting unknown values by matching them with the most similar known values. Let's say that we have 3 different types of cars. We know the name of the car, its Horsepower, whether or not it is Fuel Efficient, and whether or not it's Fast. 

| Car | Horsepower | Fuel efficient | Fast |
|-----|------------|----------------|------|
| Honda | 180 | True | False |
| Ferrari | 500 | False | True |
| Audi | 200 | True | True |

Let's say that we now have another car, but we don't know how fast it is:

| Car | Horsepower | Fuel efficient | Fast |
|-----|------------|----------------|------|
| BMW | 400 | True | Unknown |

We want to figure out if the car is fast or not. To predict this with k-nearest neighbors, we first find the most similar known car. In this case, we compare the horsepower and fuel-efficiency values to find the most similar car, which is the Ferrari. Since the Ferrari is fast, we predict that the BMW is also fast. This is an example of 1-nearest neighbors.

- If we performed 2-nearest neighbors, the two most similar cars would be the Ferrari and the Audi, giving us 2 True values, which would average out to True.
- If we did 3-nearest neighbors, we would end up with 2 True values and a False value, which would average out to True.

The number of neighbors we use for k-nearest neighbors (k) can be any positive integer up to the number of rows in our dataset. In practice, looking at only a few neighbors often works better, because the less similar the neighbors are to our data, the less informative they are for the prediction.
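
The car example can be sketched in code as follows. For simplicity, this sketch ranks cars by horsepower difference only, which is an assumption (the text compares fuel efficiency as well), and `predict_fast` is a hypothetical helper name.

```python
# Known cars from the table above: (name, horsepower, fast)
cars = [
    ("Honda", 180, False),
    ("Ferrari", 500, True),
    ("Audi", 200, True),
]

def predict_fast(horsepower, k):
    """Predict 'fast' by majority vote among the k cars closest in horsepower."""
    ranked = sorted(cars, key=lambda car: abs(car[1] - horsepower))
    votes = [fast for _, _, fast in ranked[:k]]
    # True counts as 1, so the sum is the number of 'fast' votes
    return sum(votes) > len(votes) / 2

for k in (1, 2, 3):
    print(k, predict_fast(400, k))
```

With a horsepower of 400 for the BMW, the ranking is Ferrari, Audi, Honda, so all three values of k predict True, matching the reasoning above.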

The steps in the following diagram provide a high-level overview of the tasks that we'll need to accomplish in our code. 

[<img src="https://cambridgecoding.files.wordpress.com/2016/01/knn2.jpg">](https://cambridgecoding.files.wordpress.com/2016/01/knn2.jpg)

- Here the input consists of the k closest training examples in the feature space. The output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. If k = 1, then the object is simply assigned to the class of that single nearest neighbor.


## Handling the Data

- The first step is to download iris.data from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data. Save the file as iris.data.csv in the same directory as your code so that it can be read from the local machine.
- Next we will load our data file. As the data is in CSV format without a header line, we can open the file with the open function and read the data lines using the csv.reader function in the [csv module](https://docs.python.org/3.1/library/csv.html). To use csv.reader, we first need to import the csv module.

In [1]:
import csv

In [2]:
# open function
with open('iris.data.csv') as data:
    # csv.reader returns a reader object that yields each line as a list of strings
    lines = csv.reader(data)
    # Can use the reader object to iterate over the lines in the data
#     for line in lines:
#         print (' '.join(line))

- Next we will split our data into a training dataset that kNN can use to make predictions about the classifications of a new sampling point and into a test dataset that we can use to evaluate how accurately our model is predicting the new sample points.

- We will first convert the flower measurements, which are loaded as strings, into floating-point numbers so that we can perform arithmetic on them.

- Next we need to split our dataset randomly into a training dataset and a test dataset in a predetermined ratio, using the [random module](https://docs.python.org/2/library/random.html). Since 67/33 is a commonly used ratio, we will go ahead with it. You can pick another value if you want. According to one [study](https://scialert.net/fulltext/?doi=jas.2014.171.176), a 95% training split gave the highest accuracy.

- By bringing it all together, we can define a function called loadData that loads a CSV with the provided name of the file and splits it randomly into training datasets and test datasets using the defined split ratio.

In [3]:
import random
def loadData(file, split, trainSet, testSet):
    # trainSet and testSet are lists supplied by the caller and filled in place
    # (mutable default arguments such as [] would be shared between calls)
    with open(file) as data:
        # Reader object
        lines = csv.reader(data)
        for row in lines:
            # Skipping blank lines (for example, an empty line at the end of the file)
            if not row:
                continue
            for b in range(4):
                # Converting the data type of flower measures from string to float
                row[b] = float(row[b])
            # random.random returns the next random floating point number in the range [0.0, 1.0).
            if random.random() < split:
                trainSet.append(row)
            else:
                testSet.append(row)

We can test this function out with our iris dataset.

In [4]:
trainSet=[]
testSet=[]
loadData('iris.data.csv', 0.67, trainSet, testSet)
print ('Training Data: ' + repr(len(trainSet)))
print ('Test Data: ' + repr(len(testSet)))

Training Data: 101
Test Data: 49
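
Because each row is assigned by an independent random draw, the sizes of the two sets vary from run to run. If a reproducible split with fixed sizes is preferred, one option is to shuffle with a seeded generator and cut the list at the desired fraction. The sketch below is an alternative, not the tutorial's method: `split_rows`, the seed value, and the placeholder rows are all illustrative assumptions.

```python
import random

def split_rows(rows, train_fraction, seed=None):
    """Shuffle a copy of rows and cut it into (train, test) at train_fraction."""
    rng = random.Random(seed)   # a seeded generator makes the split repeatable
    shuffled = list(rows)       # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder rows standing in for the 150 iris observations
rows = [[float(i), "label"] for i in range(150)]
train, test = split_rows(rows, 0.67, seed=42)
print(len(train), len(test))  # 100 50
```

Calling the function again with the same seed returns the same split, which makes accuracy figures comparable between runs.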


## Calculating the Similarity

- Now in order to make predictions we need to calculate how similar two given data instances are. This step is crucial as we need to locate the k most similar data instances in the training dataset for a given member of the test dataset in order to make a prediction.

- Before we can predict using KNN, we need to find some way to figure out which data rows are "closest" to the row we're trying to predict on. A simple way to do this is to use Euclidean distance.

- Since all the flower measurements have the same units and are floats, we can directly use the [Euclidean distance measure](https://en.wikipedia.org/wiki/Euclidean_distance): the square root of the sum of squared differences between corresponding attributes. With the math module, this can be written as **math.sqrt(sum(pow(a - b, 2) for a, b in zip(value1, value2)))**. In Python we can also apply the [numpy.linalg.norm function](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linalg.norm.html) to the difference of the two vectors to calculate the Euclidean distance directly.

- Also, since there are 4 float attributes and one string, we need to control which fields are included in the Euclidean distance calculation, as we only require the first 4 values. One approach is to limit the Euclidean distance to a fixed length, ignoring the last dimension.

- Hence by putting this all together, we can define a Euclidean distance function:

In [5]:
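import numpy as np
def eucDistance(value1, value2, length):
    # Using only the first `length` attributes, which excludes the trailing class label
    a = np.array(value1[:length], dtype=float)
    b = np.array(value2[:length], dtype=float)
    # The norm of the difference vector is the square root of the sum of squared differences
    return float(np.linalg.norm(a - b))

We can check our function with some sample data:

In [6]:
data1 = [3, 1, 2, 'a']
data2 = [7, 4, 3, 'b']
distance = eucDistance(data1, data2, 3)
print ('Distance: ' + format(distance, '.4f'))

Distance: 5.0990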
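
As a worked check of the arithmetic for these sample vectors, the sketch below computes the Euclidean distance with the standard library only and, for comparison, the sum of absolute differences (the Manhattan distance). The helper names `euclidean` and `manhattan` are illustrative and not part of the tutorial code.

```python
import math

def euclidean(a, b, length):
    """Square root of the sum of squared differences over the first `length` fields."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(length)))

def manhattan(a, b, length):
    """Sum of absolute differences over the first `length` fields."""
    return sum(abs(a[i] - b[i]) for i in range(length))

p = [3, 1, 2, 'a']
q = [7, 4, 3, 'b']
# Differences are 4, 3 and 1: sqrt(16 + 9 + 1) = sqrt(26), while 4 + 3 + 1 = 8
print(round(euclidean(p, q, 3), 3))  # 5.099
print(manhattan(p, q, 3))            # 8
```

The two measures generally give different values, so it is worth confirming which one a distance function actually computes.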


## Finding the Neighbors

- Now that we have a similarity measure, we can use it to collect the k most similar instances for a given unseen instance.
- The process involves calculating the distance to all instances and selecting the subset with the smallest distance values.
- We will define a function that returns the k most similar neighbors from the training set for a given test instance, using the already defined Euclidean distance function.

In [7]:
# Finding k similar neighbors from train set for a given test instance
def findNeighbors(trainSet, testInstance, k):
    distances = []
    # Reducing the length by 1 to exclude the 5th string parameter  
    length = len(testInstance) - 1
    for a in range(len(trainSet)):
        # Calculating the distance between all instances
        dist = eucDistance(testInstance, trainSet[a], length)
        distances.append((trainSet[a], dist))
    # Sorting the list of distances in ascending order    
    distances.sort(key=lambda x: x[1])
    neighbors = []
    for a in range(k):
        # Returning the k training set instances with the shortest distances
        neighbors.append(distances[a][0])
    return neighbors

You can check on some sample data whether the function returns the training instances with the smallest distances.

In [8]:
trainSet = [[1, 4, 7, 'class_x'], [2, 5, 8, 'class_y'],[3, 6, 9,'class_z'],[0, 3, 6,'class_y']]
testInstance = [3, 2, 1, 'class_z']
k = 2
neighbors = findNeighbors(trainSet, testInstance, k)
print('Neighbors = ',neighbors)

Neighbors =  [[0, 3, 6, 'class_y'], [1, 4, 7, 'class_x']]
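
Sorting every distance is simple, but only the k smallest are needed. As a possible variation, the standard library's heapq.nsmallest can select them directly. The sketch below is self-contained and uses its own distance helper; `find_neighbors_heap` is a hypothetical name, not a function from this tutorial.

```python
import heapq
import math

def find_neighbors_heap(train_set, test_instance, k):
    """Return the k training rows closest to test_instance (last field is the label)."""
    n = len(test_instance) - 1  # exclude the trailing class label

    def dist(row):
        return math.sqrt(sum((row[i] - test_instance[i]) ** 2 for i in range(n)))

    # nsmallest returns the k rows with the smallest distance, in ascending order
    return heapq.nsmallest(k, train_set, key=dist)

train = [[1, 4, 7, 'class_x'], [2, 5, 8, 'class_y'], [3, 6, 9, 'class_z'], [0, 3, 6, 'class_y']]
print(find_neighbors_heap(train, [3, 2, 1, 'class_z'], 2))
# [[0, 3, 6, 'class_y'], [1, 4, 7, 'class_x']]
```

For a dataset as small as iris the difference is negligible; the variation mainly helps when the training set is large and k is small.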


## Generating a Response

- In KNN there is no training phase. For each new data point, the algorithm calculates the Euclidean distance to the stored instances and finds the nearest k neighbors. The class with the most data points among those k neighbors is chosen as the class of the new data point.
- So after locating the most similar neighbors for a particular test instance, we need to devise a predicted response based on those neighbors. 
- One way of doing this is to let each neighbor vote for its class and take the class with the majority of votes as the prediction.
- Using the nearest neighbors we just identified, majority voting simply involves tallying up which class comes up most often among them.
- The function below returns the majority-voted response from a number of neighbors. In line with our dataset, it assumes the class is the last attribute of each neighbor.

In [9]:
def getPrediction(neighbors):
    votes = {}
    # There can be multiple neighbors based on value of k
    for a in range(len(neighbors)):
        # Getting the last index which is the string class of flowers
        response = neighbors[a][-1]
        if response in votes:
            # If the response is same from the previously checked neighbors
            votes[response] += 1
        else:
            votes[response] = 1
    # Sorting the responses in a descending order based on their values
    sortedVotes = sorted(votes.items(), key=lambda x : x[1], reverse=True)
    # Returning the class which came up most often among the nearest neighbors
    return sortedVotes[0][0]

- Let's check this function with some sample data. This approach returns only one response even in the case of a draw. You can handle a draw differently, for example by returning no response or by selecting an unbiased random response among the tied classes.

In [10]:
neighbors = [[1,1,1,'c'], [2,2,2,'a'], [3,3,3,'b'], [4,4,4,'c'], [5,5,5,'a'], [6,6,6,'c']]
response = getPrediction(neighbors)
print('The class which comes up most often as the nearest neighbor is "'+response+'"')

The class which comes up most often as the nearest neighbor is "c"
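
The two ways of handling a draw mentioned above could look roughly like this. It is a sketch under assumptions: `get_prediction_with_ties` and its `rng` parameter are hypothetical, and the neighbor list is assumed to be non-empty.

```python
import random
from collections import Counter

def get_prediction_with_ties(neighbors, rng=None):
    """Majority vote over the last field of each neighbor.

    On a draw, return None when rng is None; otherwise pick one of the
    tied classes uniformly at random using rng.
    """
    counts = Counter(row[-1] for row in neighbors)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    if len(tied) == 1:
        return tied[0]
    if rng is None:
        return None
    return rng.choice(tied)

clear = [[1, 1, 1, 'c'], [2, 2, 2, 'a'], [4, 4, 4, 'c']]
draw = [[1, 1, 1, 'a'], [2, 2, 2, 'b']]
print(get_prediction_with_ties(clear))                      # c
print(get_prediction_with_ties(draw))                       # None
print(get_prediction_with_ties(draw, rng=random.Random(0))) # 'a' or 'b'
```

Choosing an odd k reduces the chance of a draw when there are two classes, but with three classes, as in the iris data, draws remain possible.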


## Checking the Accuracy

- Now we have all the pieces and have implemented our KNN algorithm. An important remaining concern is how to evaluate the accuracy of our predictions: how do we find out whether our algorithm gives acceptable results?
- In this case, accuracy is the ratio of the number of correctly classified data points to the total number of data points, also called the classification accuracy.
- Below is the checkAccuracy function, which counts the correct predictions and returns the accuracy as a percentage of correct classifications.

In [11]:
def checkAccuracy(testSet, predictions):
    correct = 0
    for a in range(len(testSet)):
        if testSet[a][-1] == predictions[a]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

We can test this function with a sample test dataset and some sample predictions. On the actual dataset, we seek an accuracy of at least 90% in order to consider our predictions successful. 

In [12]:
testSet = [[1,1,1,'c'], [2,2,2,'a'], [3,3,3,'b'], [4,4,4,'c'], [5,5,5,'a'], [6,6,6,'c']]
predictions = ['c', 'c', 'c', 'c', 'c', 'c']
accuracy = checkAccuracy(testSet, predictions)
print('Accuracy is',accuracy,'%')

Accuracy is 50.0 %
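
A single accuracy figure can hide how the errors are distributed across classes. As an optional extension, the sketch below breaks the same ratio down per actual class; `per_class_accuracy` is an illustrative helper and not part of the tutorial code.

```python
from collections import defaultdict

def per_class_accuracy(test_set, predictions):
    """Return {class: percentage of its instances that were predicted correctly}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row, predicted in zip(test_set, predictions):
        actual = row[-1]          # the class is the last attribute of each row
        totals[actual] += 1
        if actual == predicted:
            correct[actual] += 1
    return {label: 100.0 * correct[label] / totals[label] for label in totals}

sample_test = [[1,1,1,'c'], [2,2,2,'a'], [3,3,3,'b'], [4,4,4,'c'], [5,5,5,'a'], [6,6,6,'c']]
sample_predictions = ['c', 'c', 'c', 'c', 'c', 'c']
print(per_class_accuracy(sample_test, sample_predictions))
# {'c': 100.0, 'a': 0.0, 'b': 0.0}
```

Here the overall accuracy of 50% comes entirely from class 'c', because every sample prediction is 'c'.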


## Tieing it all together

- We now have all the elements that we needed to test our algorithm. We can now tie them all together with a main function to check the accuracy of our predictions on our loaded dataset. Below is the complete integration of all the steps.

In [13]:
import csv
import random
import numpy as np

def loadData(file, split, trainSet, testSet):
    # trainSet and testSet are lists supplied by the caller and filled in place
    # (mutable default arguments such as [] would be shared between calls)
    with open(file) as data:
        # Reader object
        lines = csv.reader(data)
        for row in lines:
            # Skipping blank lines (for example, an empty line at the end of the file)
            if not row:
                continue
            for b in range(4):
                # Converting the data type of flower measures from string to float
                row[b] = float(row[b])
            # random.random returns the next random floating point number in the range [0.0, 1.0).
            if random.random() < split:
                trainSet.append(row)
            else:
                testSet.append(row)

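def eucDistance(value1, value2, length):
    # Using only the first `length` attributes, which excludes the trailing class label
    a = np.array(value1[:length], dtype=float)
    b = np.array(value2[:length], dtype=float)
    # The norm of the difference vector is the square root of the sum of squared differences
    return float(np.linalg.norm(a - b))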

def findNeighbors(trainSet, testInstance, k):
    distances = []
    # Reducing the length by 1 to exclude the 5th string parameter  
    length = len(testInstance) - 1
    for a in range(len(trainSet)):
        # Calculating the distance between all instances
        dist = eucDistance(testInstance, trainSet[a], length)
        distances.append((trainSet[a], dist))
    # Sorting the list of distances in ascending order    
    distances.sort(key=lambda x: x[1])
    neighbors = []
    for a in range(k):
        # Returning the k training set instances with the shortest distances
        neighbors.append(distances[a][0])
    return neighbors

def getPrediction(neighbors):
    votes = {}
    # There can be multiple neighbors based on value of k
    for a in range(len(neighbors)):
        # Getting the last index which is the string class of flowers
        response = neighbors[a][-1]
        if response in votes:
            # If the response is same from the previously checked neighbors
            votes[response] += 1
        else:
            votes[response] = 1
    # Sorting the responses in a descending order based on their values
    sortedVotes = sorted(votes.items(), key=lambda x : x[1], reverse=True)
    # Returning the class which came up most often among the nearest neighbors
    return sortedVotes[0][0]

def checkAccuracy(testSet, predictions):
    correct = 0
    for a in range(len(testSet)):
        if testSet[a][-1] == predictions[a]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def kNN():
    trainSet = []
    testSet = []
    # the study cited earlier reported its highest accuracy with split = 0.95
    split = 0.67
    # Loading the Dataset
    loadData('iris.data.csv', split, trainSet, testSet)
    print('Train set: ' + repr(len(trainSet)))
    print('Test set: ' + repr(len(testSet)) + '\n')
    # Generating predictions 
    predictions = []
    k = 4
    for a in range(len(testSet)):
        neighbors = findNeighbors(trainSet, testSet[a], k)
        result = getPrediction(neighbors)
        predictions.append(result)
        print('*Predicted Value is = ' + repr(result) + '*--*-Actual Value is = ' + repr(testSet[a][-1]) + '*')
    accuracy = checkAccuracy(testSet, predictions)
    # Formatting the accuracy to 2 decimal places
    accuracy = "{0:.2f}".format(accuracy)
    # accuracy is already a string, so repr() would wrap it in quotes
    print('\nAccuracy = ' + accuracy + '%')
    
kNN()

Train set: 99
Test set: 51

*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = 'Iris-setosa'*--*-Actual Value is = 'Iris-setosa'*
*Predicted Value is = '

## Summary and references

This tutorial highlighted how the kNN algorithm works on a particular dataset and how it can be used for classification. Further details on how to extend the idea and gain a better understanding are available from the following links.

- Further understanding of algorithm: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
- Nearest Neighbors: http://scikit-learn.org/stable/modules/neighbors.html
- knn Regression: http://scikit-learn.org/stable/auto_examples/neighbors/plot_regression.html
- Iris flower dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set