## Introduction

In this tutorial, you will learn the basic ideas behind the k-Nearest Neighbors algorithm (kNN for short) and apply kNN using available packages. You will also implement the kNN algorithm from scratch for classification. As one of the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM), kNN is popular in data analysis for its simplicity and good performance in practice.


### What is kNN

The kNN algorithm is an instance-based method that can be used for both classification and regression. The idea behind it is simple: use the characteristics of an object's k nearest neighbors to evaluate the object itself. It is a supervised learning algorithm, which means it learns from samples with known labels or values.

[<img src="https://upload.wikimedia.org/wikipedia/commons/e/e7/KnnClassification.svg">](https://upload.wikimedia.org/wikipedia/commons/e/e7/KnnClassification.svg)

The figure above, taken from Wikipedia, is a classic illustration of the basic idea of kNN. The samples in the figure belong to two classes: blue squares and red triangles. The green circle at the center is a sample whose class is still to be decided. The result depends on the parameter k and is obtained by applying the majority vote rule.

If k = 3, the class of the green circle depends on its three nearest neighbors (two red triangles and one blue square), so the green circle is classified as a red triangle.

If k = 5, the class of the green circle depends on its five nearest neighbors (three blue squares and two red triangles), so the green circle is classified as a blue square.
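
The effect of k on the majority vote can be reproduced with a few lines of plain Python. This is only a minimal sketch: the coordinates below are made up to imitate the figure (they are not its real coordinates), and the helper names `euclidean` and `majority_vote` are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical labeled points (not the exact coordinates of the figure).
points = [
    ((0.5, 0.5), "red triangle"),
    ((0.6, -0.8), "red triangle"),
    ((-1.0, 0.5), "blue square"),
    ((-1.2, -1.0), "blue square"),
    ((1.5, 1.0), "blue square"),
    ((2.5, 2.5), "red triangle"),
]
query = (0.0, 0.0)  # plays the role of the green circle

def majority_vote(query, points, k):
    """Return the most common label among the k points nearest to query."""
    ranked = sorted(points, key=lambda item: euclidean(query, item[0]))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

print(majority_vote(query, points, 3))
print(majority_vote(query, points, 5))
```

With these made-up points, the three nearest neighbors contain two triangles and the five nearest contain three squares, so the vote flips between k = 3 and k = 5 just as in the figure.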

How to choose the best k depends largely on the dataset. A larger k decides the label or value of an unknown sample based on more neighbors, which reduces the effect of outliers and noise to a certain degree. The downside is that the boundaries between classes become less distinct.
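
One common way to compare candidate values of k is to measure the validation error of each. The sketch below uses leave-one-out validation, which is an assumption of this example and not something the tutorial prescribes. The one-dimensional data, including one deliberately mislabeled point, and the helper names are made up for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k):
    """Majority label among the k training points nearest to query."""
    ranked = sorted(train, key=lambda item: euclidean(query, item[0]))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def leave_one_out_error(data, k):
    """Predict each point from all the others and return the error rate."""
    errors = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) != label:
            errors += 1
    return float(errors) / len(data)

# Hypothetical data: class "a" around 1..4, class "b" around 7..10,
# plus one noisy "b" sample placed inside the "a" region.
data = [
    ((1.0,), "a"), ((2.0,), "a"), ((3.0,), "a"), ((4.0,), "a"),
    ((3.4,), "b"),
    ((7.0,), "b"), ((8.0,), "b"), ((9.0,), "b"), ((10.0,), "b"),
]

for k in (1, 3):
    print("k = {}: leave-one-out error = {:.3f}".format(k, leave_one_out_error(data, k)))
```

On this toy data the noisy point misleads its two closest neighbors when k = 1, while k = 3 outvotes it, which is the smoothing effect described above.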

The kNN algorithm is also a lazy learning algorithm: no explicit model is built during training, and the computation is deferred until a prediction is needed.


### Analysis

kNN can be applied to both classification and regression problems, and the form of its output depends on the type of problem.

In a classification problem, the output for an object is a class label determined by a majority vote over the labels of the object's k nearest neighbors. The attribute types decide the measure of similarity: Euclidean distance is commonly used for real-valued data and Hamming distance for categorical data.
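
The two distance measures mentioned above can be written down directly. This is a minimal sketch with hypothetical function names and made-up sample values.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences (real-valued attributes)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming_distance(x, y):
    """Number of positions at which the attributes differ (categorical attributes)."""
    return sum(1 for a, b in zip(x, y) if a != b)

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
print(hamming_distance(("red", "small", "round"), ("red", "large", "round")))
```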

In a regression problem, the output for an object is a value computed as the average of the values of the object's k nearest neighbors.
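
The regression case differs from classification only in the last step: the neighbors' values are averaged instead of counted. A minimal sketch, with made-up data and a hypothetical helper `knn_regress`:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_regress(train, query, k):
    """Average the target values of the k training points nearest to query."""
    ranked = sorted(train, key=lambda item: euclidean(query, item[0]))
    nearest_values = [value for _, value in ranked[:k]]
    return sum(nearest_values) / float(k)

# Hypothetical (features, value) pairs.
train = [((1.0,), 10.0), ((2.0,), 12.0), ((3.0,), 14.0), ((10.0,), 40.0)]

# The two nearest neighbors of 2.4 are 2.0 and 3.0, so the values 12 and 14 are averaged.
print(knn_regress(train, (2.4,), 2))
```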

The training data used by kNN lives in a feature space, either one-dimensional or multi-dimensional, in which a distance can be calculated to compare objects and find the k nearest neighbors.



## sklearn.neighbors

If you want to use an available package to apply kNN, sklearn.neighbors is a good choice.

There are two main classes in sklearn.neighbors for k-Nearest Neighbors: sklearn.neighbors.KNeighborsRegressor for regression and sklearn.neighbors.KNeighborsClassifier for classification. Let's start with classification.

### KNeighborsClassifier

Besides KNeighborsClassifier, scikit-learn provides another class called RadiusNeighborsClassifier for nearest neighbors classification. This class works well when the data is not uniformly sampled, because the user specifies a radius r that defines the region from which reference instances are taken. As a result, points in sparse regions use fewer neighbors for classification. However, it is less effective for high-dimensional data because of the "curse of dimensionality", a term for various phenomena that arise only in high-dimensional spaces.
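
The radius-based idea can be sketched in plain Python to show why sparse regions end up with fewer voters. This is not scikit-learn's implementation; the data and the helper `radius_vote` are hypothetical, and returning `None` when no neighbor falls inside the radius is a choice made for this sketch.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def radius_vote(train, query, radius):
    """Vote among all training points within radius of query.

    Returns (label, number_of_voters); label is None when nobody is in range.
    """
    inside = [label for point, label in train if euclidean(query, point) <= radius]
    if not inside:
        return None, 0
    return Counter(inside).most_common(1)[0][0], len(inside)

# Hypothetical data: a dense cluster of "a" and two isolated "b" points.
train = [
    ((0.0, 0.0), "a"), ((0.2, 0.1), "a"), ((0.1, 0.3), "a"), ((-0.2, 0.1), "a"),
    ((5.0, 5.0), "b"), ((7.0, 8.0), "b"),
]

print(radius_vote(train, (0.1, 0.1), 1.0))    # dense region: several voters
print(radius_vote(train, (5.5, 5.5), 1.0))    # sparse region: a single voter
print(radius_vote(train, (20.0, 20.0), 1.0))  # no training point within the radius
```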

KNeighborsClassifier is the more commonly used of the two, and it is the one you will use here. The dataset is the iris dataset from sklearn.datasets. Scikit-learn ships with many other datasets; if you want to try kNN on them, you can load them and run commands similar to those below.

You can load the data using load_iris(). The default value of the parameter return_X_y is False, in which case the return value is a Bunch. A Bunch is a dictionary-like object that provides 'target_names' (label names), 'data' (data without labels), 'target' (labels), 'feature_names' (names of the features) and 'DESCR' (description of the dataset).

In [1]:
from sklearn.datasets import load_iris
import numpy as np

iris_data = load_iris()
# print the basic information of this dataset
print("class: ", iris_data['target_names'])
print("feature: ", iris_data['feature_names'])
print("first five rows of data: ", iris_data['data'][:5])
print("labels: ", np.unique(iris_data['target']))  # labels are the numeric codes of the classes
print("number of samples: ", len(iris_data['data']))
# count the samples that belong to each class
for label, name in enumerate(iris_data['target_names']):
    print("number of samples for " + name + ": ", np.sum(iris_data['target'] == label))

class:  ['setosa' 'versicolor' 'virginica']
feature:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
first five rows of data:  [[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
labels:  [0 1 2]
number of samples:  150
number of samples for setosa:  50
number of samples for versicolor:  50
number of samples for virginica:  50


Then we need to split the dataset into a training dataset and a test dataset. We can do the split randomly with a ratio of 66% to 34% (99 training samples and 51 test samples). Before that, let's combine the data with the labels so that it is easier to randomly select training and test data. You can do this by inserting the labels as the last column of the data.

In [2]:
# add labels to the dataset
data = np.array(iris_data['data'])
data = np.insert(data, 4, iris_data['target'], axis = 1)

To split the data, you can create a random selector and then select training and test rows by index. You can split the attributes and the labels at the same time.

In [3]:
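# create a random selector: a shuffled array of row indices
# (np.random.permutation works on both Python 2 and Python 3,
# whereas shuffling range() in place fails on Python 3)
selector = np.random.permutation(len(data))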

# select the training dataset
train = data[selector[:99]][:, :-1]
train_label = data[selector[:99]][:,-1]

# select the test dataset
test = data[selector[99:]][:, :-1]
test_label = data[selector[99:]][:,-1]

Now, with the datasets ready, you can instantiate a KNeighborsClassifier. All the parameters are optional, and the default value of n_neighbors is 5. Another interesting parameter, weights, lets you choose whether to give every neighbor the same weight ('uniform') or to weight each neighbor by the inverse of its distance from the unknown data point ('distance').
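
The difference between the two weighting schemes can be illustrated in plain Python. The sketch below is a simplified illustration and not scikit-learn's code: the data and function names are hypothetical, and returning the label of an exact match at distance zero is a shortcut chosen here to avoid dividing by zero.

```python
import math
from collections import Counter, defaultdict

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train, query, k):
    """The k (point, label) pairs closest to query."""
    return sorted(train, key=lambda item: euclidean(query, item[0]))[:k]

def uniform_vote(train, query, k):
    """Every neighbor counts as one vote."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def distance_weighted_vote(train, query, k):
    """Every neighbor votes with weight 1 / distance."""
    scores = defaultdict(float)
    for point, label in k_nearest(train, query, k):
        d = euclidean(query, point)
        if d == 0:
            return label  # shortcut for an exact match (assumption of this sketch)
        scores[label] += 1.0 / d
    return max(scores, key=scores.get)

# Hypothetical data: one very close "a" and two farther "b" points.
train = [((0.1, 0.0), "a"), ((1.0, 0.0), "b"), ((0.0, 1.0), "b")]
query = (0.0, 0.0)

print(uniform_vote(train, query, 3))
print(distance_weighted_vote(train, query, 3))
```

With uniform weights the two "b" points outvote the single "a"; with inverse-distance weights the much closer "a" dominates.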

In [4]:
from sklearn.neighbors import KNeighborsClassifier

knnClassifier = KNeighborsClassifier(n_neighbors = 10, weights = 'uniform')

Then you need to train the classifier on the training dataset using the method fit(X, y).

In [5]:
knnClassifier.fit(train, train_label)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

The next step is to use the classifier to predict the labels of the test dataset and calculate the error rate.

In [6]:
# predict the label of test data
predict_label1 = knnClassifier.predict(test)

# print the error rate: the fraction of test samples whose predicted label is wrong
error = np.mean(predict_label1 != test_label)
print("error rate: ", error)

error rate:  0.0196078431373


Now you can do kNN using KNeighborsClassifier! Let's look at KNeighborsRegressor.

### KNeighborsRegressor

As in the classification case, besides KNeighborsRegressor, scikit-learn provides another class called RadiusNeighborsRegressor for regression. RadiusNeighborsRegressor works well when the data is not uniformly sampled, and it lets the user select a radius r within which neighbors are considered.

KNeighborsRegressor is used in much the same way as KNeighborsClassifier, so you can try it on a small example.


In [7]:
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# create a small example
x = [[1, 2], [2, 2], [2, 3], [5, 7], [6, 8], [8, 8], [8, 12], [10, 14]]
variables = np.array(x)
value = [10, 8, 1, 1, 3, 2, 4, 3]

# initiate the regressor with a neighbor number of 2
neighbor = KNeighborsRegressor(2)

# fit the regressor with training data and value
neighbor.fit(variables, value)

# use the regressor to predict 
predicted_value = neighbor.predict([[5, 6]])

print("for the predicted value of [[5, 6]]: ", predicted_value)

for the predicted value of [[5, 6]]:  [ 2.]


## Implement kNN 

Instead of using available packages, you may be interested in writing your own kNN. Let's write our own kNN from scratch for classification!

To keep things complete, the whole code is kept in one cell below, with comments in between. It relies on heapdict, a third-party priority-queue package that has to be installed separately.

In [8]:
import numpy as np
from heapdict import heapdict
import math
from collections import Counter

class kNNClassfication():

    def __init__(self, train_data, label_data, k_data):
        """To initiate, we pass in the training data, the corresponding labels,
        and k, the number of neighbors used to decide the label of an unknown instance."""
        self.train = np.array(train_data)
        self.label = label_data
        self.k = k_data

    def predict(self, test_data):
        """To predict, we pass in the test data and return the predicted labels.
        For each unknown instance we find its k nearest neighbors and
        decide its label based on the majority vote rule."""
        predict_label = []

        for instance in test_data:
            # find the labels of this instance's k nearest neighbors
            kneighbor = self.find_neighbors(instance)
            # take the majority label among those neighbors
            predict_label.append(self.find_majority(kneighbor))

        return predict_label

    def find_neighbors(self, instance):
        """Find the k nearest neighbors of an instance and return their labels
        as a dictionary that maps the training row index to its label."""
        # use a heapdict() so that popitem() returns the entry with the smallest distance
        container = heapdict()

        # calculate the distance between the unknown instance and each training row
        for i in range(len(self.train)):
            container[i] = self.calculate_distance(instance, self.train[i])

        # add the labels of the k nearest neighbors to the dictionary
        neighbors = {}
        for _ in range(self.k):
            key, _distance = container.popitem()
            neighbors[key] = self.label[key]

        return neighbors

    def calculate_distance(self, x, y):
        """Calculate the Euclidean distance between two instances."""
        squared_sum = 0
        for i in range(len(x)):
            squared_sum += (x[i] - y[i]) ** 2
        return math.sqrt(squared_sum)

    def find_majority(self, kneighbor):
        """Given the labels of an instance's k nearest neighbors, return the majority label."""
        # use Counter() to get the count of each label
        c = Counter(kneighbor.values())
        return c.most_common(1)[0][0]
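
If you would rather avoid the third-party heapdict dependency, the neighbor search step can also be sketched with the standard library's `heapq.nsmallest`. This is an alternative offered as an assumption, not part of the class above; the function name `k_nearest_indices` and the sample rows are hypothetical.

```python
import heapq
import math

def k_nearest_indices(train, instance, k):
    """Return the indices of the k training rows closest to instance, nearest first."""
    def distance_to(i):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(instance, train[i])))
    # nsmallest keeps only the k best candidates instead of ordering every row
    return heapq.nsmallest(k, range(len(train)), key=distance_to)

# Hypothetical training rows.
rows = [[1, 2, 3], [2, 2, 1], [2, 3, 4], [5, 7, 6], [6, 8, 7]]
print(k_nearest_indices(rows, [5, 8, 7], 2))
```

The returned indices can then be used to look up the labels, in the same way `find_neighbors` does with the keys popped from the heapdict.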

Our own kNN class is finished! Next, let's write a simple test with three made-up classes that have different distributions.

In [9]:
train_small = np.array([[1, 2, 3], [2, 2, 1], [2, 3, 4], [5, 7, 6], [6, 8, 7], [8, 8, 4], [8, 6, 10], [10, 16, 8]])
label_small = np.array(['dog', 'dog', 'dog', 'cat', 'cat','cat', 'tiger', 'tiger'])
test_small = np.array([[4,3,2], [12, 10, 10], [5,8,7]])
test_label_small = np.array(['dog', 'tiger', 'cat'])

k = 2

# initiate the classifier and predict the test data
knnClassifier = kNNClassfication(train_small, label_small, k)
label_test_small = knnClassifier.predict(test_small)
print("the predicted value: ", label_test_small)

# calculate the error rate
error_small = 0
for i in range(len(test_small)):
    if test_label_small[i] != label_test_small[i]:
        error_small += 1

# multiply by 1.0 so that the division is never truncated to an integer
error_small = error_small * 1.0 / len(test_small)
print("error rate: ", error_small)

the predicted value:  ['dog', 'tiger', 'cat']
error rate:  0.0


Let's also test on the iris dataset which we have used when learning how to use KNeighborsClassifier from scikit-learn.

In [10]:
# use the training data split from iris 
knnClassifier3 = kNNClassfication(train, train_label, 10)

# predict the test data
predict_label2 = knnClassifier3.predict(test)

# replace the integer number of output with specific class name
class_names = iris_data['target_names']
test_class = [class_names[int(i)] for i in test_label]
predict_class = [class_names[int(i)] for i in predict_label2]

# print the actual labels and the predicted labels
for i in range(len(test_label)):
    print("actual class: ", test_class[i], "   predicted class: ", predict_class[i])

# print the error rate
error2 = 0
for i in range(len(test_label)):
    if test_label[i] != predict_label2[i]:
        error2 += 1

error2 = error2 * 1.0 / len(test)
print("error rate: ", error2)

actual class:  setosa    predicted class:  setosa
actual class:  setosa    predicted class:  setosa
actual class:  setosa    predicted class:  setosa
actual class:  setosa    predicted class:  setosa
actual class:  virginica    predicted class:  virginica
actual class:  setosa    predicted class:  setosa
actual class:  versicolor    predicted class:  versicolor
actual class:  virginica    predicted class:  virginica
actual class:  versicolor    predicted class:  versicolor
actual class:  virginica    predicted class:  virginica
actual class:  versicolor    predicted class:  versicolor
actual class:  setosa    predicted class:  setosa
actual class:  virginica    predicted class:  virginica
actual class:  versicolor    predicted class:  versicolor
actual class:  setosa    predicted class:  setosa
actual class:  versicolor    predicted class:  virginica
actual class:  virginica    predicted class:  virginica
actual class:  virginica    predicted class:  virginica
actual class:  versicolor

You have now implemented your own kNN classifier! One further point: you can normalize all the attributes, that is, scale them to the range between 0 and 1, before calculating the Euclidean distance. Try it and see whether it improves the model.
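
Scaling to the range between 0 and 1 is often called min-max normalization. A minimal plain-Python sketch follows; the helper `min_max_scale` and the sample rows are hypothetical, and mapping a constant column to 0.0 is a choice made for this sketch.

```python
def min_max_scale(rows):
    """Scale every column of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / float(high - low) if high != low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled

# The second attribute has a much larger range and would otherwise
# dominate the Euclidean distance.
rows = [[1, 100], [2, 300], [3, 200]]
print(min_max_scale(rows))
```

When a separate test set is involved, a reasonable practice is to compute the minimum and maximum on the training data only and reuse them to scale the test data.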

You can also try writing a kNN regressor to predict a real-valued attribute; the basic ideas are quite similar.


## Applications 

Since kNN is easy to understand and implement, it is applied in many areas in the real world.

For example, kNN can be applied in text mining for text classification. Given labeled training text, kNN predicts the category of a new word, sentence or paragraph from the categories of the most similar training examples. A special case is spam filtering: given a training dataset of emails labeled as spam or not spam, when a new email comes in, kNN compares it with the training emails to predict whether it is spam.
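
A toy version of the spam example might look like the sketch below. Everything in it is an assumption made for illustration: the messages are invented, each text is represented as a simple bag of words, and cosine distance is used to compare texts. A real spam filter would need far more data and preprocessing.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

# Hypothetical labeled messages ("ham" means not spam).
training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("cheap pills free shipping", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("please review the attached report", "ham"),
    ("lunch on monday with the team", "ham"),
]

def classify(text, k=3):
    """Majority label among the k training messages most similar to text."""
    query = bag_of_words(text)
    ranked = sorted(training, key=lambda item: cosine_distance(query, bag_of_words(item[0])))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

print(classify("claim your free prize"))
print(classify("team meeting on monday"))
```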

kNN can also be applied in marketing. Given a training dataset that records customers' attributes together with a target attribute, the target attribute of a future customer can be predicted by comparing that customer's attributes with those of the customers in the training dataset.

kNN can also be used alongside algorithms like linear regression and ridge regression to solve practical tasks. For example, scikit-learn includes an example that predicts the lower half of a face from the upper half using kNN, linear regression, randomized trees and other estimators.

## Summary & Resources to Learn More

The k-Nearest Neighbors algorithm is a powerful non-parametric method that can be applied to practical classification and regression problems. You can use the kNN classes available in scikit-learn, or you can implement your own kNN algorithm from scratch.

This tutorial has introduced some basic ideas about k-Nearest Neighbors. The following additional resources let you learn more if you are interested in this topic.

### Implementation

Here are some implementations of kNN for your reference:

The kNN algorithm in scikit-learn: https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/neighbors

A kNN package for Python: https://pypi.python.org/pypi/KNN

### Examples

Here are some examples from scikit-learn about applying kNN to solve practical tasks:

Classifying documents from the 20 newsgroups dataset:
http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py

Predicting the lower half of a face given the upper half:
http://scikit-learn.org/stable/auto_examples/plot_multioutput_face_completion.html#sphx-glr-auto-examples-plot-multioutput-face-completion-py

### Papers

#### More on theory:
Shemim Begum, Debasis Chakraborty, Ram Sarkar, "Data Classification Using Feature Selection and kNN Machine Learning Approach" (http://ieeexplore.ieee.org/document/7546208/).

Lei Wang, Latifur Khan, Bhavani Thuraisingham, "An Effective Evidence Theory Based K-Nearest Neighbor (KNN) Classification" (http://ieeexplore.ieee.org/document/4740552/).

#### More on practice:
Q. Peter He, Jin Wang, "Fault Detection Using the k-Nearest Neighbor Rule for Semiconductor Manufacturing Processes" (http://ieeexplore.ieee.org/document/4369338/).

Anand Upadhyay, Aditya Shetty, et al., "Land use and land cover classification of LISS-III satellite image using KNN and decision tree" (http://ieeexplore.ieee.org/document/7724471/).

