# K-Nearest Neighbors

KNN is considered a lazy learner because it doesn't technically train a model to make predictions. Instead, an observation is assigned to a class based on the proportion of its k nearest observations that belong to that class. (For example, if most of the elements around an item are of class A, the item will be predicted as A.)

It is one of the most widely used classifiers in supervised machine learning.

The indices array contains the locations (row positions) of the observations in our dataset that are closest to the new observation.

features_standardized[indices] displays the values of those observations.

Distance is a measure of similarity: the closer two elements are, the more similar they are.

By default, the distance used is the Minkowski distance with p = 2. (p = 1 gives Manhattan distance; p = 2 gives Euclidean distance.)
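To make the role of p concrete, here is a minimal sketch of the Minkowski distance written with NumPy. The helper name `minkowski_distance` and the two points are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two equal-length vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum the absolute differences raised to p, then take the p-th root
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Two made-up points
point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4| = 7
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2) = 5
print(manhattan, euclidean)
```

The same pair of points is 7 units apart under Manhattan distance and 5 units apart under Euclidean distance, so the choice of p changes the distances even when the ranking of neighbors stays the same.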



In [40]:
# Load libraries

from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

In [41]:
# Load data

iris = load_iris()
features = iris.data

In [42]:
# Create standardizer. Features must be on the same scale.

standardizer = StandardScaler()

In [43]:
# Standardize features

features_standardized  = standardizer.fit_transform(features)
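As a rough illustration of what standardization does, the sketch below z-scores a small made-up array by hand with NumPy. StandardScaler performs the equivalent computation per column (using the population standard deviation, ddof=0); the toy values are assumptions chosen only to show two features on very different scales.

```python
import numpy as np

# Toy data: two features on very different scales (made-up values)
raw = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])

# z-score each column: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0)
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

Without this step the second feature would dominate any distance calculation simply because its values are larger.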

In [44]:
# Fit a two-nearest-neighbors model using Manhattan distance (p=1)

nearest_neighbors = NearestNeighbors(n_neighbors=2, p=1).fit(features_standardized)

In [45]:
# Create an observation 

observation = [1,1,1,1]

In [46]:
# Find distances and indices of the observation's nearest neighbors

distances, indices = nearest_neighbors.kneighbors([observation])

In [47]:
distances

array([[0.76874397, 1.16709368]])

In [48]:
indices

array([[124, 110]])

In [49]:
# View the nearest neighbors

features_standardized[indices]

array([[[1.03800476, 0.55861082, 1.10378283, 1.18556721],
        [0.79566902, 0.32841405, 0.76275827, 1.05393502]]])
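The search above can be cross-checked by brute force. The sketch below recomputes the Manhattan distance from the observation to every row with NumPy and compares the two smallest values with what kneighbors returns. The names `manual_distances`, `manual_indices`, and `model` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

features_standardized = StandardScaler().fit_transform(load_iris().data)
observation = np.array([1, 1, 1, 1])

# Manhattan distance (p=1) from the observation to every row
manual_distances = np.abs(features_standardized - observation).sum(axis=1)

# Indices of the two smallest distances, closest first
manual_indices = np.argsort(manual_distances)[:2]

# Compare with scikit-learn's answer
model = NearestNeighbors(n_neighbors=2, p=1).fit(features_standardized)
distances, indices = model.kneighbors([observation])

print(manual_indices, manual_distances[manual_indices])
print(indices[0], distances[0])
```

Both lines should report the same two distances, which suggests that kneighbors is doing nothing more exotic than ranking rows by distance (scikit-learn may use a tree structure internally to avoid scanning every row).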

In [50]:
# Set the distance metric using the metric parameter

nearestneighbors_euclidean = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(features_standardized)
nearestneighbors_euclidean

NearestNeighbors(algorithm='auto', leaf_size=30, metric='euclidean',
                 metric_params=None, n_jobs=None, n_neighbors=2, p=2,
                 radius=1.0)

In [51]:
distances, indices = nearestneighbors_euclidean.kneighbors([observation])

In [52]:
distances

array([[0.49140089, 0.74294782]])

In [53]:
indices

array([[124, 110]])

We can use kneighbors_graph to create a matrix indicating each observation's nearest neighbors.



In [54]:
# Find each observation's two nearest neighbors based on Euclidean distance (including itself)

nearest_neighbors_with_self = nearestneighbors_euclidean.kneighbors_graph(features_standardized).toarray()
nearest_neighbors_with_self


array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [55]:
# Remove the 1s marking each observation as a nearest neighbor to itself

for i, x in enumerate(nearest_neighbors_with_self):
    x[i] = 0

In [56]:
# View the first observation's remaining nearest neighbor (its own entry was removed)

nearest_neighbors_with_self[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
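As a possible alternative to the row-by-row loop, the sketch below clears the diagonal with np.fill_diagonal, and also shows that calling kneighbors_graph() with no argument leaves each point out of its own neighbor list. The four points are made up so the result is easy to check by hand.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four made-up points on a line
points = np.array([[0.0], [1.0], [3.0], [7.0]])
model = NearestNeighbors(n_neighbors=2).fit(points)

# Passing the training data back in: each point counts as its own neighbor
with_self = model.kneighbors_graph(points).toarray()

# Clear the diagonal in one call instead of looping over rows
without_self = with_self.copy()
np.fill_diagonal(without_self, 0)

# With no argument, a point is not considered its own neighbor,
# so every row keeps n_neighbors *other* points
others_only = model.kneighbors_graph().toarray()

print(without_self)
print(others_only)
```

Note the difference between the two results: after clearing the diagonal each row has one neighbor left, while the no-argument call returns two other neighbors per row.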

# Creating a K-Nearest Neighbors Classifier (KNN)

Given an observation of unknown class, you need to predict its class based on the classes of its neighbors.

In KNN, given an observation with an unknown target class, the algorithm first identifies the k closest observations based on some distance metric (for example, Euclidean distance).

Then these k observations vote based on their class, and the class that wins the vote is the predicted class.

Each class's share of the votes can be read as a probability, and the class with the highest probability becomes the predicted class.
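The voting procedure described above can be sketched from scratch in a few lines. This is a simplified illustration, not how scikit-learn implements it: the helper `knn_predict` and the toy training data are hypothetical, and ties are resolved by whichever class Counter happens to list first.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Majority vote among the k nearest training rows (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training rows
    nearest = np.argsort(distances)[:k]
    # Count the classes among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    # Vote shares act as class probabilities
    probabilities = {label: count / k for label, count in votes.items()}
    predicted = votes.most_common(1)[0][0]
    return predicted, probabilities

# Made-up training data: class "A" near the origin, class "B" near (5, 5)
X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]]
y_train = ["A", "A", "A", "B", "B"]

predicted, probabilities = knn_predict(X_train, y_train, [4, 4], k=3)
print(predicted, probabilities)
```

For the point (4, 4), the three nearest rows are the two "B" points and one "A" point, so "B" wins with two of the three votes.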



In [58]:
# Load libraries

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

In [61]:
# Load data

iris = load_iris()

X = iris.data
y = iris.target

In [None]:
# Create standardizer

standardizer = StandardScaler()

In [62]:
# Standardize features

X_std = standardizer.fit_transform(X)

In [64]:
# Train a KNN classifier with 5 neighbors

knn = KNeighborsClassifier(n_neighbors = 5, n_jobs = -1).fit(X_std, y)

In [65]:
# Create two observations 

new_observations = [[0.75, 0.75, 0.75, 0.75],[1,1,1,1]]

In [66]:
# Predict the class of two observations 

knn.predict(new_observations)

array([1, 2])

In [67]:
# View the probability that each observation belongs to each of the three classes

knn.predict_proba(new_observations)

array([[0. , 0.6, 0.4],
       [0. , 0. , 1. ]])
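With the default uniform weighting, these probabilities should simply be the share of the five nearest neighbors that fall in each class. The sketch below checks that by counting neighbor classes by hand. The name `vote_shares` is introduced here for illustration, and the check assumes the default weights="uniform" setting.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
y = iris.target

knn = KNeighborsClassifier(n_neighbors=5).fit(X_std, y)
new_observations = [[0.75, 0.75, 0.75, 0.75], [1, 1, 1, 1]]

# Indices of the five nearest training rows for each new observation
_, neighbor_indices = knn.kneighbors(new_observations)

# Count neighbor classes per observation and convert counts to shares
vote_shares = np.array([
    np.bincount(y[row], minlength=3) / knn.n_neighbors
    for row in neighbor_indices
])

print(vote_shares)
print(knn.predict_proba(new_observations))
```

The two printed arrays should match row for row. For instance, a probability of 0.6 means three of the five neighbors belong to that class.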