# KNN

[Source](https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/)

## What is k-Nearest Neighbors

The model for kNN is the entire training dataset. When a prediction is required for an unseen data instance, the kNN algorithm searches the training dataset for the k most similar instances. The prediction attribute of those instances is summarized and returned as the prediction for the unseen instance.

The similarity measure depends on the type of data. For real-valued data, the Euclidean distance can be used. For other types of data, such as categorical or binary data, the Hamming distance can be used.
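
As a rough illustration of the Hamming distance mentioned above, the sketch below counts the positions at which two equal-length sequences differ. The function name `hamming_distance` and the sample instances are hypothetical and do not come from the original tutorial.

```python
def hamming_distance(a, b):
    """Count the positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical categorical instances: (color, size, shape)
p = ["red", "small", "round"]
q = ["red", "large", "square"]
print(hamming_distance(p, q))  # 2
```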

## How does k-Nearest Neighbors Work

The kNN algorithm belongs to the family of instance-based, competitive learning and lazy learning algorithms.

Instance-based algorithms are those algorithms that model the problem using data instances (or rows) in order to make predictive decisions. The kNN algorithm is an extreme form of instance-based methods because all training observations are retained as part of the model.

It is a competitive learning algorithm, because it internally uses competition between model elements (data instances) in order to make a predictive decision. The objective similarity measure between data instances causes each data instance to compete to “win” or be most similar to a given unseen data instance and contribute to a prediction.

Lazy learning refers to the fact that the algorithm does not build a model until the time that a prediction is required. It is lazy because it only does work at the last second. This has the benefit of only including data relevant to the unseen data, called a localized model. A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger training datasets.
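
To make the "lazy" idea concrete, here is a minimal sketch of a 1-nearest-neighbor classifier whose `fit` step only stores the data, so all the distance computations happen inside `predict`. The class name `LazyNearestNeighbor` and the toy data are assumptions made for illustration; `math.dist` requires Python 3.8 or later.

```python
import math


class LazyNearestNeighbor:
    """Hypothetical 1-nearest-neighbor classifier, for illustration only."""

    def fit(self, rows, labels):
        # "Training" only stores the data; no model is built here.
        self.rows = list(rows)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # All the work happens at prediction time: scan every stored row.
        distances = [math.dist(row, query) for row in self.rows]
        nearest = distances.index(min(distances))
        return self.labels[nearest]


model = LazyNearestNeighbor().fit([[0.0, 0.0], [5.0, 5.0]], ["a", "b"])
print(model.predict([4.0, 4.5]))  # b
```

Because every call to `predict` rescans all stored rows, the cost per prediction grows with the size of the training set, which is the disadvantage described above.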

In [0]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)

In [0]:
d_train = np.hstack([X_train, y_train.reshape((len(y_train), 1))])
d_test = np.hstack([X_test, y_test.reshape((len(y_test), 1))])

In [3]:
d_test[:3]

array([[6.1, 2.8, 4.7, 1.2, 1. ],
       [5.7, 3.8, 1.7, 0.3, 0. ],
       [7.7, 2.6, 6.9, 2.3, 2. ]])

Steps to compute kNN:

- Similarity: Calculate the distance between two data instances.
- Neighbors: Locate k most similar data instances.
  - Need to compute the distance column wise and then sort it
  - Then select the k most similar neighbors from the training set for a given test instance
- Response: Generate a response from a set of data instances.
- Accuracy: Summarize the accuracy of predictions.
- Main: Tie it all together.

### Similarity

In order to make predictions we need to calculate the similarity between any two given data instances. This is needed so that we can locate the k most similar data instances in the training dataset for a given member of the test dataset and in turn make a prediction.

$$\begin{aligned} d(\mathbf{p}, \mathbf{q})=d(\mathbf{q}, \mathbf{p}) &=\sqrt{\left(q_{1}-p_{1}\right)^{2}+\left(q_{2}-p_{2}\right)^{2}+\cdots+\left(q_{n}-p_{n}\right)^{2}} \\ &=\sqrt{\sum_{i=1}^{n}\left(q_{i}-p_{i}\right)^{2}} \end{aligned}$$


In [0]:
def euclideanDistance(vect_1, vect_2):
  
  """
  Compute the Euclidean distance 
  """
  
  sum_ = np.sum(np.power(vect_1 - vect_2, 2))
  
  euclideanD = np.sqrt(sum_)
  return euclideanD

In [5]:
vect1 = np.array([2, 3, 4])
vect2 = np.array([3,3,5])

euclideanDistance(vect1, vect2)

1.4142135623730951

### Neighbors

Calculate the distance between a test instance and each row of the training data.

For each training row, we subtract its columns from the corresponding columns of the test row, square the differences, and sum them.

For example, suppose we want the Euclidean distance for this test instance:
- [7.2, 3.6, 5.1, 2.5]: one row with four columns

We compute the Euclidean distance to every row of the training dataset, which also has four columns, subtracting the columns of the test and training rows pairwise.

Example with the first row of the training data:
- [5.1, 3.5, 1.4, 0.2]

The Euclidean distance is $$\sqrt{((7.2 - 5.1)^2  + (3.6 - 3.5)^2 + (5.1 -1.4)^2 + (2.5 - 0.2)^2)}$$
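
The arithmetic for this single pair of rows can be checked with a few lines of plain Python. This is only a sketch of the calculation above; the variable names are illustrative.

```python
import math

test_row = [7.2, 3.6, 5.1, 2.5]
train_row = [5.1, 3.5, 1.4, 0.2]

# Squared difference per column: 4.41, 0.01, 13.69, 5.29 (sum 23.40)
squared_diffs = [(t - r) ** 2 for t, r in zip(test_row, train_row)]
distance = math.sqrt(sum(squared_diffs))
print(round(distance, 4))  # 4.8374
```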

We repeat this for all the rows in the training dataset.

Once we have all the distances, we sort them in ascending order.

Finally, we extract the k nearest neighbors. Note that if k = 1, we only select the closest neighbor.
 

In [0]:
def neighbors(train, test, k = 1):
  """
  Compute the Euclidean distance to each row of the train dataset.
  Train contains the y column as the last column of the array; it
  should not be used in the computation.
  """
  
  columns_x = train[:, :-1]
  
  n_rows = len(columns_x)
  list_distance = []
  list_label = []
  
  for i in range(0, n_rows):
    
    distance = euclideanDistance(columns_x[i], test)
    
    ### Store the distance
    list_distance.append(distance)
    
    ### Store the label from the training data
    list_label.append(train[i,-1])
    
  dic_distance_label = {
      'distances':list_distance,
      'labels' : list_label
  }
  
  #### Get a DataFrame, easier to visualize
  
  EuclideanDis = pd.DataFrame(dic_distance_label).sort_values('distances',
                                                              ascending = True)
  
  #### Extract the k neighbor
  
  k_neighbors = EuclideanDis.head(k)
    
  return k_neighbors
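
The row-by-row loop above can also be written with NumPy broadcasting. The sketch below is an alternative formulation, not the tutorial's own code: the function name `k_nearest_indices` and the toy arrays are hypothetical, and it returns plain arrays rather than a DataFrame.

```python
import numpy as np


def k_nearest_indices(train_x, query, k=1):
    """Return indices and distances of the k rows of train_x closest to query."""
    # Broadcasting subtracts the query from every training row at once.
    diffs = train_x - query
    distances = np.sqrt(np.sum(diffs ** 2, axis=1))
    # argsort gives row indices ordered from nearest to farthest.
    order = np.argsort(distances)[:k]
    return order, distances[order]


# Hypothetical toy data, not taken from the iris dataset.
train_x = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
idx, dist = k_nearest_indices(train_x, np.array([1.0, 2.0]), k=2)
print(idx, dist)
```

Here `train_x` is assumed to hold only the feature columns, so the label column would have to be sliced off first, as `train[:, :-1]` does in `neighbors`.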

Let's make a test with the first row of the test array.

The test array has the following values:

- [6.1, 2.8, 4.7, 1.2]
- class:  1

In [10]:
vect1 = d_test[:1,:-1]

k_neighbors = neighbors(train = d_train, test = vect1, k = 1)
k_neighbors

| | distances | labels |
|---|---|---|
| 59 | 0.223607 | 1.0 |


## Response

We want the most frequent class among the k neighbors returned in the previous step.

We can do this by allowing each neighbor to vote for their class attribute, and take the majority vote as the prediction.

Since the `neighbors` function returns a pandas DataFrame, we can use the built-in `groupby` together with `count` to get the majority.

In [0]:
def responce(k_neighbors):
  """
  We need to count the number of occurrences of each class.
  
  The class with the most votes wins and is declared the predicted class.
  
  Note that if k = 1, no vote is needed.
  
  We also want to return the index of the nearest rows in the training set

  """
  
  groupby = k_neighbors.groupby('labels').aggregate(
      {'distances': 'count'}).sort_values('distances',
                                          ascending = False)
  
  groupby = groupby.rename(index=str, columns={"distances": "Count"})
  
  majority = groupby.reset_index().head(1)
  winner = float(majority['labels'][0])
  #### 
  index_winner = k_neighbors[k_neighbors['labels'] == winner].index.tolist()
  
  dic_final = {
      
      'class_winner': winner,
      'index_train': index_winner
  }
  
  return dic_final
  

In [68]:
responce(k_neighbors = k_neighbors)

{'class_winner': 1.0, 'index_train': [59]}
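
For comparison, the same majority vote can be sketched without pandas using `collections.Counter` from the standard library. The function name `majority_vote` and the sample labels are hypothetical; this is an alternative to the `groupby` approach above, not a replacement for it.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label among the neighbors' labels."""
    # most_common(1) returns a list with one (label, count) pair.
    # Ties go to the label that appears first in `labels`.
    label, _count = Counter(labels).most_common(1)[0]
    return label


print(majority_vote([1.0, 2.0, 1.0]))  # 1.0
```

If the labels are passed in nearest-first order, a tie is resolved in favor of the tied class whose first neighbor is closest.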

## Wrap algorithm

In [0]:
def knn(train, test, k = 1):
  """
  Wrap two functions to get the predicted class
  
  1: neighbors: compute the Euclidean distance between the test
  instance and each row of the train set,
  sort ascending and extract the k nearest neighbors
  2: responce: compute the winning class as the majority label
  
  Return the winning class and the neighbors' indices in the train set
  """
  
  k_neighbors = neighbors(train = train, test = test, k = k)
  
  winner = responce(k_neighbors = k_neighbors)
  
  return winner

In [75]:
knn(train = d_train, test = vect1, k = 3)

{'class_winner': 1.0, 'index_train': [59, 70, 19]}
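
The step list earlier mentions an accuracy step that the walkthrough does not implement. A minimal sketch is shown below, assuming the true and predicted labels are available as two equal-length sequences. The function name `accuracy` and the label values are hypothetical and are not results from the iris data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only (not actual results).
y_true = [1, 0, 2, 1]
y_pred = [1, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 0.75
```

With the objects defined in this notebook, the predictions could presumably be collected with something like `[knn(d_train, row[:-1], k=3)['class_winner'] for row in d_test]` and compared against `d_test[:, -1]`.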

## Test with scikit-learn

We can compare the results with scikit-learn.

In [76]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)

neigh.fit(X_train, y_train)

print(neigh.predict(vect1.reshape((1,4))),
neigh.kneighbors(vect1.reshape((1,4)))[1])

[1] [[59 70 19]]
