# kNN

kNN (k-nearest neighbours) is one of the most widely used tools in machine learning, and it is supported by many libraries. It makes no assumptions about the fields of the data or how they should be categorized; it is context agnostic.

As often happens in AI, practice is far ahead of theory:
- There is no scientific proof of why it works.
- There is just empirical evidence.

In this notebook we will implement kNN step by step, and skip the selection of k for lack of time.

![kNN mechanics](images/knn1.png)

kNN does not assume anything about the data, other than that a distance measure can be calculated consistently between any two instances. As such, it is called non-parametric or non-linear, as it does not assume a functional form.
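To make the distance requirement concrete, here is a minimal sketch of two interchangeable distance measures. The function names `euclidean` and `manhattan` are illustrative only and are not part of the notebook's code; any function that returns a consistent, non-negative number for a pair of instances could take their place.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Which measure is appropriate depends on the data; the rest of this notebook uses the Euclidean distance.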

## instance-based 

Instance-based algorithms are those algorithms that model the problem using data instances (or rows) in order to make predictive decisions. The kNN algorithm is an extreme form of instance-based methods because all training observations are retained as part of the model.
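As a rough illustration of "all training observations are retained", the sketch below uses a hypothetical class, `MemorizingModel`, whose training step only stores the rows. It is a simplified 1-nearest-neighbour lookup written for this explanation, not the implementation built later in the notebook.

```python
class MemorizingModel:
    """Hypothetical illustration: 'training' only stores the rows."""

    def fit(self, rows, labels):
        # No parameters are estimated; the data itself is the model.
        self.rows = list(rows)
        self.labels = list(labels)
        return self

    def predict_one(self, query):
        # 1-nearest-neighbour lookup using squared Euclidean distance.
        def sq_dist(row):
            return sum((a - b) ** 2 for a, b in zip(row, query))
        best = min(range(len(self.rows)), key=lambda i: sq_dist(self.rows[i]))
        return self.labels[best]

model = MemorizingModel().fit([(1.0, 1.0), (5.0, 5.0)], ["a", "b"])
print(len(model.rows))                # 2: every training row is kept
print(model.predict_one((4.0, 4.5)))  # b
```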


## competitive learning

It is a competitive learning algorithm, because it internally uses competition between model elements (data instances) in order to make a predictive decision. The objective similarity measure between data instances causes each data instance to compete to “win” or be most similar to a given unseen data instance and contribute to a prediction.


## lazy learning algorithms

Lazy learning refers to the fact that the algorithm does not build a model until the time that a prediction is required. It is lazy because it only does work at the last second. This has the benefit of only including data relevant to the unseen data, called a localized model. A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger training datasets.
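The cost of laziness can be made visible by counting distance evaluations. The sketch below assumes a brute-force search over a made-up training set of 1000 points; the names `counter`, `counted_distance` and `nearest` are hypothetical helpers for this illustration.

```python
counter = {"calls": 0}

def counted_distance(a, b):
    """Squared Euclidean distance that also counts how often it is called."""
    counter["calls"] += 1
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(training_rows, query):
    # All the work happens here, at prediction time: one distance per stored row.
    return min(training_rows, key=lambda row: counted_distance(row, query))

training_rows = [(float(i), float(i)) for i in range(1000)]

nearest(training_rows, (10.2, 10.1))
print(counter["calls"])  # 1000 distance evaluations for one query
nearest(training_rows, (10.2, 10.1))
print(counter["calls"])  # 2000: the identical query repeats the full scan
```

Nothing is cached between queries, so every prediction pays for a full pass over the training data.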

## kNN Algorithm

The kNN task can be broken down into writing 3 primary functions:

1. Calculate the distance between any two points
2. Find the nearest neighbours based on these pairwise distances
3. Take a majority vote on the class label based on the nearest-neighbour list
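Before writing the functions, the three steps above can be traced on a toy dataset. This is only a compact preview under stated assumptions: the points and labels are made up, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

# Toy labelled points: (features, label). Values are made up for illustration.
training = [((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((5.0, 5.0), "blue"),
            ((2.5, 2.5), "blue"), ((2.0, 1.0), "red")]
query, k = (2.0, 2.0), 3

# Step 1: distance from the query to every training point.
scored = [(math.dist(features, query), label) for features, label in training]
# Step 2: keep the k closest.
nearest = sorted(scored, key=lambda pair: pair[0])[:k]
# Step 3: majority vote over their labels.
prediction = Counter(label for _, label in nearest).most_common(1)[0][0]
print(prediction)  # red
```

The three nearest points are labelled red, blue and red, so red wins the vote two to one.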

![kNN Algorithm](images/knn2.jpg)

## Data preparation

This should always be the first step when tackling a Machine Learning problem. We must clean, sanitize and organize the data.
In this problem we will skip this part, as it usually does not require any domain-specific knowledge, just a lot of patience.

Once the data is clean, we can split it into sets for the three phases:
- Train 60%
- Test 20%
- Validate 20%

This can be a good ratio for splitting your data; 80, 10, 10 will also work fine. Do not underestimate the importance of the test and validation sets, as they are key to avoiding overfitting and to checking the accuracy of your models.
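A three-way split like this can be sketched with the standard library alone. The helper `split_rows` and its parameters are hypothetical names chosen for this example; in practice a library routine would usually be used instead.

```python
import random

def split_rows(rows, train_frac=0.6, test_frac=0.2, seed=1):
    """Shuffle a copy of rows and cut it into train / test / validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train_rows = shuffled[:n_train]
    test_rows = shuffled[n_train:n_train + n_test]
    validation_rows = shuffled[n_train + n_test:]  # whatever remains
    return train_rows, test_rows, validation_rows

train_rows, test_rows, validation_rows = split_rows(range(150))
print(len(train_rows), len(test_rows), len(validation_rows))  # 90 30 30
```

Shuffling before cutting matters when the rows are stored in class order, otherwise one set could end up containing a single class.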

In our case we will need only the Train and Test sets, as we are working on a simplified problem.

![data preparation](images/knn3.jpg)

In [1]:
from sklearn.datasets import load_iris
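# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides train_test_split
from sklearn import model_selection as cross_validation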
from sklearn.metrics import classification_report, accuracy_score
from operator import itemgetter
import numpy as np
import math
from collections import Counter
 
# 1) given two data points, calculate the euclidean distance between them
def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))



In [11]:
# 2) given a training set and a test instance, use get_distance to calculate all pairwise distances
def get_neighbours(training_set, test_instance, k):
    distances = [_get_tuple_distance(training_instance, test_instance) for training_instance in training_set]
    # index 1 is the calculated distance between training_instance and test_instance
    sorted_distances = sorted(distances, key=itemgetter(1))
    # extract only training instances (the loop variable is not named `tuple`, to avoid shadowing the builtin)
    sorted_training_instances = [pair[0] for pair in sorted_distances]
    # select first k elements
    return sorted_training_instances[:k]

def _get_tuple_distance(training_instance, test_instance):
    # training_instance is a (features, label) pair; index 0 holds the features
    return (training_instance, get_distance(test_instance, training_instance[0]))

In [12]:
# 3) given an array of nearest neighbours for a test case, tally up their classes to vote on test case class
def get_majority_vote(neighbours):
    # index 1 is the class
    classes = [neighbour[1] for neighbour in neighbours]
    count = Counter(classes)
    return count.most_common()[0][0] 
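One detail the vote leaves implicit is what happens on a tie. In Python 3.7 and later, `Counter.most_common` lists elements with equal counts in the order they were first encountered, so when the neighbours arrive nearest first, a tie goes to whichever tied class appears earliest in the list. The labels below are made up for illustration.

```python
from collections import Counter

# Labels of 4 hypothetical neighbours, listed nearest first: classes 2 and 1 tie at 2 votes each.
labels_nearest_first = [2, 1, 1, 2]
print(Counter(labels_nearest_first).most_common()[0][0])  # 2

# Same votes, different order: now class 1 is encountered first and wins the tie.
labels_reordered = [1, 2, 2, 1]
print(Counter(labels_reordered).most_common()[0][0])  # 1
```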

In [17]:
# load the data and create the training and test sets
# random_state = 1 is just a seed to permit reproducibility of the train/test split
iris = load_iris()
X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.4, random_state=1)

# reformat train/test datasets for convenience: lists of (features, label) pairs
# (in Python 3, zip returns an iterator, so np.array(zip(...)) wraps the zip object itself
# and indexing it raises "IndexError: too many indices for array")
train = list(zip(X_train, y_train))
test = list(zip(X_test, y_test))

# generate predictions
predictions = []

# We are skipping k optimization due to shortage of time
# let's arbitrarily set k equal to 5, meaning that to predict the class of new instances,
# we take a majority vote among the 5 nearest neighbours
k = 5
# for each instance in the test set, get nearest neighbours and majority vote on predicted class
for x in range(len(X_test)):
    print('Classifying test instance number ' + str(x))
    print(test[x])
    neighbours = get_neighbours(training_set=train, test_instance=test[x][0], k=k)
    majority_vote = get_majority_vote(neighbours)
    predictions.append(majority_vote)
    print('Predicted label=' + str(majority_vote) + ', Actual label=' + str(test[x][1]))

# summarize performance of the classification
print('\nThe overall accuracy of the model is: ' + str(accuracy_score(y_test, predictions)) + "\n")
report = classification_report(y_test, predictions, target_names = iris.target_names)
print('A detailed classification report: \n\n' + report)
