# Introduction
"In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

+ In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
+ In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms." [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)

<img src="KnnClassification.svg">  
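The regression case from the quote above is not implemented in this notebook. As a minimal sketch, assuming each training instance is stored as a `(features, target)` pair (a layout chosen here only for illustration) and Python 3.8+ for `math.dist`, it could look like this:

```python
import math

def knn_regress(training_set, query, k):
    """Predict a numeric value as the mean target of the k nearest training points.

    Assumption: each training instance is a (features, target) pair.
    """
    # sort the training instances by Euclidean distance to the query point
    by_distance = sorted(training_set, key=lambda item: math.dist(item[0], query))
    nearest = by_distance[:k]
    # average the target values of the k nearest neighbors
    return sum(target for _, target in nearest) / len(nearest)

# made-up toy data: one feature, target = 2 * feature
toy_training_set = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0), ([10.0], 20.0)]
prediction = knn_regress(toy_training_set, [2.1], k=3)
print(prediction)  # 4.0
```

The three nearest points to 2.1 are 2.0, 3.0 and 1.0, so the prediction is the mean of their targets 4.0, 6.0 and 2.0.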

The following steps are necessary to implement a k-nearest neighbor algorithm:
+ process dataset
+ calculate similarity
+ locate nearest neighbors
+ classification of new instance
+ compute accuracy

# Imports

In [3]:
import csv
import random
import math
from operator import itemgetter
from collections import defaultdict

# Example
This example is based on the work of Jason Brownlee; see his blog article in the Further reading section. I ported the example from Python 2.x to Python 3.x and added comments and intermediate results for clarification.

"The problem is comprised of 150 observations of iris flowers from three different species. There are 4 measurements of given flowers: sepal length, sepal width, petal length and petal width, all in the same unit of centimeters. The predicted attribute is the species, which is one of setosa, versicolor or virginica.

It is a standard dataset where the species is known for all instances. As such we can split the data into training and test datasets and use the results to evaluate our algorithm implementation. Good classification accuracy on this problem is above 90% correct, typically 96% or better."

# Exercises

## Load a dataset

In [4]:
def load_dataset(filename):
    
    ## Create a list to store your cleaned data
    result = []
    
            ## append the cleaned row to your output list
            
    
    ## return your result
    return result

In [5]:
dataset = load_dataset('data/iris2.csv')
print('Solution:', [5.1, 3.5, 1.4, 0.2, 'setosa'])
print('Your solution:', dataset[0])

Solution: [5.1, 3.5, 1.4, 0.2, 'setosa']


IndexError: list index out of range

## Calculate similarity
The similarity of two instances is derived from the distance between them. There are several methods to calculate the distance.  
The "ordinary" straight-line distance between two points is called the Euclidean distance.

### Euclidean distance
The formula to compute the Euclidean distance in n-dimensional space is:  
$\sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$  
+ calculate the difference along each axis and square it
+ sum the squared differences
+ take the square root of the sum

For more information see [Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)
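As a worked example of the formula, take the two points $p = (1, 2, 3)$ and $q = (4, 5, 6)$ used in the test cell of this exercise (the class labels are ignored):

$$\sqrt{(4 - 1)^2 + (5 - 2)^2 + (6 - 3)^2} = \sqrt{9 + 9 + 9} = \sqrt{27} \approx 5.196$$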

In [6]:
def euclidean_distance(instance1, instance2):
    distance = 0
    
    ## returns the square root of the calculated sum of squared distances 
    return math.sqrt(distance)

In [7]:
## euclidean distance
data1 = [1, 2, 3, 'a']
data2 = [4, 5, 6, 'b']

distance = euclidean_distance(data1, data2)
print('Solution:', 5.196152422706632)
print('Your Solution: ', distance)

Solution: 5.196152422706632
Your Solution:  0.0


# Solutions

In [8]:
def load_dataset(filename):
    result = []
    with open(filename, 'r') as csvfile:
        ## uses csv.reader() to process the input file and create a list
        lines = csv.reader(csvfile)
        raw_dataset = list(lines)
        
        ## range(len()) to get the row index, not the row itself
        for row in range(len(raw_dataset)):
            ## converts the values from string to float, excludes class label
            ## range excludes the specified value, range 4 is from 0 to 3
            for col in range(4):
                floatie = float(raw_dataset[row][col])
                raw_dataset[row][col] = floatie

            result.append(raw_dataset[row])
            
        return result 

In [9]:
def euclidean_distance(instance1, instance2):
    distance = 0
    ## len(instance1) - 1 skips the last element, the class label
    for x in range(len(instance1) - 1):
        ## for the example above: distance = (1-4)² + (2-5)² + (3-6)²
        distance += pow((instance1[x] - instance2[x]), 2)
    
    ## returns the square root of the calculated sum of squared distances 
    return math.sqrt(distance)

# Implementation
## Process dataset

In [10]:
def load_dataset(filename, split):
    
    ## create empty lists for training and test set
    training_set = []
    test_set = []
    
    with open(filename, 'r') as csvfile:
        ## uses csv.reader() to process the input file and create a list
        lines = csv.reader(csvfile)
        dataset = list(lines)
        
        for x in range(len(dataset)):
            ## converts the values from string to float, excludes class label
            ## range excludes the specified value, range 4 is from 0 to 3
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])

            ## appends to training or test set according to the specified split
            if random.random() < split:
                training_set.append(dataset[x])
            else:
                test_set.append(dataset[x])

        return training_set, test_set

In [11]:
## load_dataset, adjust the split parameter to change the ratio between training and test set
split = 0.9

training_set, test_set = load_dataset('iris2.csv', split)
print('Training instances: ', len(training_set))
print('Test instances: ', len(test_set))
print('Example Training instance: ', training_set[0])
print('Example Test instance: ', test_set[0])


Training instances:  130
Test instances:  20
Example Training instance:  [5.1, 3.5, 1.4, 0.2, 'setosa']
Example Test instance:  [4.9, 3.0, 1.4, 0.2, 'setosa']
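Because `load_dataset` decides row by row with `random.random() < split`, the resulting sizes only approximate the requested ratio and change from run to run (130/20 above rather than 135/15). A minimal sketch of an alternative, assuming a hypothetical helper `split_dataset` that is not part of the original example, shuffles the rows with a seeded generator and cuts the list at a fixed position:

```python
import random

def split_dataset(dataset, split, seed=None):
    """Shuffle a copy of the dataset and cut it at the requested ratio."""
    rng = random.Random(seed)   # a seeded generator makes the split repeatable
    shuffled = list(dataset)    # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * split)
    return shuffled[:cut], shuffled[cut:]

# stand-in rows for illustration; the real rows would come from the CSV file
rows = [[float(i), 'label'] for i in range(150)]
train_rows, test_rows = split_dataset(rows, 0.9, seed=42)
print(len(train_rows), len(test_rows))  # 135 15
```

Passing the same `seed` reproduces the same split, which makes accuracy figures comparable between runs.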


## Locate nearest neighbors
Because the k-nearest neighbor algorithm is a lazy learner, most of the computation happens at classification time.  
The function computes the distance from the test instance to every training instance and returns the k nearest neighbors.

In [12]:
def get_neighbors(training_set, test_instance, k):
    distances = []
    
    for x in range(len(training_set)):
        ## finds the distance from the test instance to all the training instances
        dist = euclidean_distance(test_instance, training_set[x])
        distances.append((training_set[x], dist))
    
    ## sorts ascending, smallest distance comes first
    ## each entry has the format ([2, 2, 2, 'a'], dist)
    distances.sort(key=itemgetter(1))
    neighbors = []
    
    for x in range(k):
        ## collects the k nearest training instances
        neighbors.append(distances[x][0])
    return neighbors

In [13]:
## get neighbors, adjust k to calculate k-neighbors
k = 1

train_set = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
test_instance = [5, 5, 5, 'b']

neighbors = get_neighbors(train_set, test_instance, k)
print(neighbors)

[[4, 4, 4, 'b']]
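The neighbor search above sorts every distance before taking the first k. As a hedged alternative sketch, `heapq.nsmallest` from the standard library selects the k smallest items directly. The function name `get_neighbors_heap` is made up for this illustration, and `math.dist` requires Python 3.8 or newer:

```python
import heapq
import math

def get_neighbors_heap(training_set, test_instance, k):
    """Return the k training instances closest to test_instance, nearest first."""
    def distance_to_test(train_instance):
        # compare the numeric features only; the last element is the class label
        return math.dist(train_instance[:-1], test_instance[:-1])
    # nsmallest returns at most k items, ordered from smallest to largest key
    return heapq.nsmallest(k, training_set, key=distance_to_test)

toy_train = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
toy_test = [5, 5, 5, 'b']
print(get_neighbors_heap(toy_train, toy_test, 1))  # [[4, 4, 4, 'b']]
```

If k is larger than the training set, this variant simply returns all training instances instead of raising an `IndexError`.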


## Classification of new instance
The classification is based on a majority vote of all the neighbors found in the last step.  
Every neighbor votes for its own class. Therefore the class with the most votes is assigned to the new instance.

In [14]:
def get_classification(neighbors):
    class_votes = defaultdict(int)
    
    ## each neighbor votes with its own class label
    for neighbor in neighbors:
        ## negative indexing indicates the last element, the class label of an instance
        response = neighbor[-1]
        class_votes[response] += 1
    
    ## sorts all the class votes descending by vote count
    ## sorted_votes is a list of ('class_name', count) tuples
    sorted_votes = sorted(class_votes.items(), key=itemgetter(1), reverse=True)
    
    ## returns just the class label 
    return sorted_votes[0][0]

In [15]:
## get classification
neighbors = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
response = get_classification(neighbors)
print(response)

a
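The same majority vote can be sketched more compactly with `collections.Counter`. This is an illustrative variant rather than a replacement for the function above, and the name `get_classification_counter` is made up here:

```python
from collections import Counter

def get_classification_counter(neighbors):
    """Return the most common class label among the neighbors."""
    # the last element of each neighbor is its class label
    votes = Counter(neighbor[-1] for neighbor in neighbors)
    # most_common(1) returns a list holding one (label, count) tuple
    label, count = votes.most_common(1)[0]
    return label

toy_neighbors = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
print(get_classification_counter(toy_neighbors))  # a
```

With an even k two classes can tie. On Python 3.7+ `most_common` keeps equal counts in the order they were first encountered, so with neighbors sorted nearest first the tie should go to the label seen first.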


## Compute accuracy
To assess the quality of the predictions we calculate the accuracy, returned as the percentage of correct predictions.

In [16]:
def get_accuracy(test_set, predictions):
    correct = 0

    for x in range(len(test_set)):
        ## negative indexing; -1 targets the last element
        if test_set[x][-1] == predictions[x]:
            correct += 1
    return (correct / len(test_set)) * 100.0

In [17]:
## get accuracy
test_set = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = get_accuracy(test_set, predictions)
print(accuracy)

66.66666666666666


## Full test run

In [18]:
## parameters, adjust and see how the result changes
split = 0.66
k = 3

training_set, test_set = load_dataset('iris2.csv', split)
print('Training instances: ', len(training_set))
print('Test instances: ', len(test_set))

## generate predictions
predictions=[]

for test_inst in test_set:
    neighbors = get_neighbors(training_set, test_inst, k)
    result = get_classification(neighbors)
    predictions.append(result)
    #print('> predicted=', result, ', actual=', test_inst[-1])

accuracy = get_accuracy(test_set, predictions)
print('Accuracy: ', accuracy, '%')

Training instances:  99
Test instances:  51
Accuracy:  96.07843137254902 %


## Next steps
This is a very simple implementation that leaves room for improvement. For example:

### Find best k  
Write a function that runs the algorithm with different values for k and returns the value with the highest accuracy.
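One possible shape for such a function is sketched below. It is self-contained, so it re-implements the prediction step in compressed form. The names `predict` and `find_best_k`, the toy data, and the use of a separate validation set (so that k is not tuned on the final test set) are assumptions made for this illustration:

```python
import math
from collections import Counter

def predict(training_set, test_instance, k):
    """Classify one instance by majority vote among its k nearest neighbors."""
    nearest = sorted(
        training_set,
        key=lambda row: math.dist(row[:-1], test_instance[:-1]),
    )[:k]
    return Counter(row[-1] for row in nearest).most_common(1)[0][0]

def find_best_k(training_set, validation_set, candidates):
    """Return (best_k, accuracy_by_k) for the candidate values of k."""
    accuracy_by_k = {}
    for k in candidates:
        correct = sum(
            predict(training_set, row, k) == row[-1] for row in validation_set
        )
        accuracy_by_k[k] = correct / len(validation_set) * 100.0
    # max() returns the first best candidate, so ties favor the earlier k
    best_k = max(accuracy_by_k, key=accuracy_by_k.get)
    return best_k, accuracy_by_k

# made-up toy data: two clusters plus one 'b' outlier inside the 'a' cluster
toy_train = [[1.0, 1.0, 'a'], [1.2, 0.9, 'a'], [0.8, 1.1, 'a'], [1.1, 1.1, 'b'],
             [5.0, 5.0, 'b'], [5.2, 4.9, 'b'], [4.8, 5.1, 'b']]
toy_validation = [[1.1, 1.05, 'a'], [5.1, 5.0, 'b']]

best_k, accuracy_by_k = find_best_k(toy_train, toy_validation, [1, 3, 5])
print(best_k, accuracy_by_k)  # 3 {1: 50.0, 3: 100.0, 5: 100.0}
```

With k = 1 the outlier decides the first validation instance and the prediction is wrong; with k = 3 the surrounding 'a' instances outvote it.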

### Weighted neighbors  
Assigning weights to the neighbors, for example the inverse of their distance, gives the closest neighbors more influence on the vote.
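A minimal sketch of one common weighting scheme, inverse distance, is shown below. The function name `get_weighted_classification` and the `epsilon` parameter are assumptions for this illustration, and other weighting functions are equally possible:

```python
import math
from collections import defaultdict

def get_weighted_classification(neighbors, test_instance, epsilon=1e-9):
    """Vote with weight 1 / distance so that closer neighbors count more."""
    class_weights = defaultdict(float)
    for neighbor in neighbors:
        # compare the numeric features only; the last element is the class label
        dist = math.dist(neighbor[:-1], test_instance[:-1])
        # epsilon avoids a division by zero when a neighbor matches exactly
        class_weights[neighbor[-1]] += 1.0 / (dist + epsilon)
    # return the label with the largest summed weight
    return max(class_weights, key=class_weights.get)

# made-up neighbors: two distant 'a' instances and one very close 'b' instance
toy_neighbors = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [5, 5, 5, 'b']]
toy_test_instance = [5, 5, 4, '?']
print(get_weighted_classification(toy_neighbors, toy_test_instance))  # b
```

An unweighted majority vote over the same three neighbors would return 'a'; the weighting lets the single close 'b' neighbor win.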

# Further reading
+ [Jason Brownlee](http://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/)
+ [Natasha Latysheva](https://blog.cambridgecoding.com/2016/01/16/machine-learning-under-the-hood-writing-your-own-k-nearest-neighbour-algorithm/)
+ [scikit-learn](http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py)