### K-Nearest Neighbors
The KNN algorithm assumes that similar things exist in close proximity: examples whose features are close together are expected to share the same label (classification) or have similar target values (regression).

### The KNN Algorithm
1. Load the data
2. Initialize K to your chosen number of neighbors
3. For each example in the data
    - Calculate the distance between the query example and the current example from the data
    - Add the distance and the index of the example to ordered collection
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. If regression, return the mean of the K labels
8. If classification, return the mode of the K labels
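
The numbered steps above can be sketched as a single function using only the standard library. This is a minimal illustration, not the notebook's implementation: the name `knn_predict`, the `task` parameter, the `(features, label)` data layout, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter
from statistics import mean


def knn_predict(data, query, k, task="classification"):
    """Follow the numbered steps literally.

    data  : list of (features, label) pairs (hypothetical layout)
    query : feature sequence with the same length as each example
    """
    # Step 3: distance from the query to every example, stored with its index
    distances_and_indices = []
    for index, (features, _label) in enumerate(data):
        distance = math.dist(features, query)  # Euclidean distance
        distances_and_indices.append((distance, index))

    # Steps 4-5: sort ascending by distance and keep the first k entries
    nearest = sorted(distances_and_indices)[:k]

    # Step 6: look up the labels of the selected entries
    k_labels = [data[index][1] for _distance, index in nearest]

    # Steps 7-8: mean for regression, mode for classification
    if task == "regression":
        return mean(k_labels)
    return Counter(k_labels).most_common(1)[0][0]


# Toy data invented for illustration
clf_data = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((5.0, 5.0), "b"),
            ((5.2, 4.9), "b"), ((0.9, 1.1), "a")]
reg_data = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_predict(clf_data, (1.1, 1.0), k=3))                     # a
print(knn_predict(reg_data, (2.1,), k=3, task="regression"))      # 20.0
```

The only difference between the two tasks is the final aggregation; the neighbor search is identical.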


In [16]:
import numpy as np
import pandas as pd
from collections import Counter

In [14]:
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum(np.power((x1-x2), 2)))

In [41]:
class KNN:
    """
    K Nearest Neighbors algorithm, Python implementation
    """
    
    def __init__(self, k=3):
        """
        Args:
            k: number of nearest neighbors
        """
        self.k = k
    
    def fit(self, X, y):
        self.X_train = X
        self.y_train = y
    
    def predict(self, X):
        predicted_labels = [self._predict(x) for x in X]
        return np.array(predicted_labels)
        
    def _predict(self, x):
        # compute distances
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        
        # get k nearest samples, labels
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        
        # majority vote, most common class label 
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]

In [42]:
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.model_selection import train_test_split

In [43]:
cmap = ListedColormap(["#FF0000", "#00FF00", "#0000FF"])

In [44]:
def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred)/len(y_true)

In [45]:
iris = datasets.load_iris()

X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=530)

In [46]:
clf = KNN(k=3)
clf.fit(X_train, y_train)

In [47]:
predictions = clf.predict(X_test)
print(f"KNN classification accuracy {accuracy(y_test, predictions)}")

KNN classification accuracy 0.9333333333333333


### Choosing the right value for K
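1. As we decrease the value of K toward 1, our predictions become less stable: with K = 1 each prediction depends on a single neighbor, so one noisy or mislabeled example can change the result. A K that is too small therefore tends to reduce accuracy on unseen data.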
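2. A larger K gives more stable predictions, because the vote or average is taken over more neighbors. If K grows too large, however, the neighborhood includes many examples whose labels are wrong for the query, and the model becomes less reliable.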
3. Prefer an odd K so that a majority vote between two classes cannot tie. With more than two classes, ties are still possible and need a tie-breaking rule.
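
One common way to act on these guidelines is to try several odd values of K and compare accuracy on held-out data. The sketch below assumes a simple train/validation split on synthetic two-class data; the helper names (`knn_classify`, `validation_accuracy`), the data, and the range of K values are illustrative choices, not part of the notebook above.

```python
import math
import random
from collections import Counter


def knn_classify(train, query, k):
    """Majority vote among the k nearest (features, label) pairs."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _features, label in nearest).most_common(1)[0][0]


def validation_accuracy(train, validation, k):
    """Fraction of validation examples whose label is predicted correctly."""
    hits = sum(knn_classify(train, x, k) == label for x, label in validation)
    return hits / len(validation)


# Synthetic two-class data, invented for illustration: two Gaussian blobs
rng = random.Random(0)
points = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(60)]
points += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(60)]
rng.shuffle(points)
train, validation = points[:90], points[90:]

# Try odd values of K only, so a two-class vote cannot tie
scores = {k: validation_accuracy(train, validation, k) for k in range(1, 16, 2)}
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k:2d}  validation accuracy={score:.3f}")
print("best k on this split:", best_k)
```

The best K found this way depends on the particular split, so it is a starting point rather than a definitive answer; repeating the comparison over several splits gives a steadier estimate.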

### Advantages
1. The algorithm is simple and easy to implement.
2. There is no need to build a model, tune several parameters, or make additional assumptions.
3. The algorithm is versatile. It can be used for classification, regression, and search.

### Disadvantages
1. The algorithm gets significantly slower as the number of examples and/or predictors (independent variables) increases, because every prediction computes the distance from the query to every stored example.
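
To make the cost concrete, the sketch below counts how much work a brute-force neighbor search does for a single query. `CountingDistance` and `brute_force_neighbors` are hypothetical helpers written for this illustration, and the data is synthetic.

```python
import math


class CountingDistance:
    """Euclidean distance that records how often it is called."""

    def __init__(self):
        self.calls = 0        # number of distance computations
        self.feature_ops = 0  # features touched across all computations

    def __call__(self, a, b):
        self.calls += 1
        self.feature_ops += len(a)
        return math.dist(a, b)


def brute_force_neighbors(train, query, k, distance):
    """Return the indices of the k training rows closest to the query."""
    order = sorted(range(len(train)), key=lambda i: distance(train[i], query))
    return order[:k]


for n_examples, n_features in [(100, 4), (1000, 4), (1000, 40)]:
    train = [[float(i + j) for j in range(n_features)] for i in range(n_examples)]
    distance = CountingDistance()
    brute_force_neighbors(train, [0.0] * n_features, k=3, distance=distance)
    print(n_examples, n_features, distance.calls, distance.feature_ops)

# 100 4 100 400
# 1000 4 1000 4000
# 1000 40 1000 40000
```

The work per query grows with the product of the number of examples and the number of features, and it is repeated for every query, since nothing is precomputed at fit time.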