# Nearest Neighbors Classification

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
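The majority vote described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # The most common class among those neighbors wins the vote
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two clusters in 2-D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15])))  # prints 1 (first cluster)
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.00])))  # prints 2 (second cluster)
```

Note that nothing is "trained" here: the function only stores the data and searches it at prediction time, which is what makes the method instance-based.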

Scikit-learn provides two nearest neighbors classifiers: KNeighborsClassifier and RadiusNeighborsClassifier. The first is based on the k nearest neighbors of each query point, where k is an integer specified by the user. The second is based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
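As a small sketch of how the two classifiers are configured, the example below fits both on made-up toy data. The values `n_neighbors=3` and `radius=0.5` are arbitrary choices for this illustration, not recommendations.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Toy data (made up for illustration): two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([1, 1, 1, 2, 2, 2])
queries = np.array([[0.15, 0.15], [0.95, 1.00]])

# k is set through the n_neighbors parameter
knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
# r is set through the radius parameter
rnn = RadiusNeighborsClassifier(radius=0.5).fit(X_toy, y_toy)

knn_pred = knn.predict(queries)
rnn_pred = rnn.predict(queries)
print(knn_pred, rnn_pred)
```

On this toy data both classifiers agree. On real data they can differ, because a fixed radius may capture many neighbors in dense regions and few in sparse ones.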

There are three nearest neighbor search algorithms: brute force, K-D Tree, and Ball Tree. Brute force is the most naive method: it computes the distances between all pairs of points in the dataset. For N samples in D dimensions, this approach scales as O[D N^2].
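To make the O[D N^2] cost concrete, here is a rough NumPy sketch of a brute-force search. The sizes `N` and `D` and the random points are hypothetical; the point is that the intermediate array holds one difference for every pair of points and every dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3                      # hypothetical sizes: 5 points in 3 dimensions
points = rng.normal(size=(N, D))

# Compare every point with every other point: an N x N x D array of differences
diffs = points[:, None, :] - points[None, :, :]
# Summing over the D coordinates gives the N x N matrix of Euclidean distances
dist = np.sqrt((diffs ** 2).sum(axis=-1))

# Ignore each point's zero distance to itself, then pick the closest other point
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)
print(nearest)
```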

The K-D Tree addresses the computational inefficiency of the brute force approach by reducing the number of distance calculations. The basic idea is that if point A is very distant from point B, and point B is very close to point C, then we know that points A and C are very distant without computing their distance explicitly.

The Ball Tree addresses the inefficiencies of the K-D Tree in higher dimensions.
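In scikit-learn the search strategy is selected with the `algorithm` parameter of the classifier (the default, `'auto'`, lets the library choose). The sketch below, on synthetic data invented for the example, fits the same model with each strategy and checks whether the predictions agree. The strategy is expected to change the speed of the search rather than its result.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic data (made up for illustration): 200 points in 3 dimensions, 2 classes
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_query = rng.normal(size=(20, 3))

predictions = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    model = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    model.fit(X_demo, y_demo)
    predictions[algorithm] = model.predict(X_query)

# Compare the tree-based results against the brute-force result
same = all(np.array_equal(predictions["brute"], p) for p in predictions.values())
print(same)
```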

# Example using the wine data

In [2]:
import numpy as np
import pandas as pd
from sklearn import neighbors
import matplotlib.pyplot as plt
%matplotlib inline

## Read in the wine data

In [3]:
wine = pd.read_csv('../Data/wine.data.csv', header=None)
wine.columns = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
wine.head(10)

Unnamed: 0,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
5,1,14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450
6,1,14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290
7,1,14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295
8,1,14.83,1.64,2.17,14.0,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045
9,1,13.86,1.35,2.27,16.0,98,2.98,3.15,0.22,1.85,7.22,1.01,3.55,1045


## Nearest Neighbors Classification

In [4]:
from sklearn.model_selection import train_test_split

#### Prepare the data

In [5]:
X = wine.iloc[:, 1:14]
Y = wine['Class']
print(X.shape)
print(Y.shape)

(178, 13)
(178,)


#### Separate the data into training data and test data

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.33, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(119, 13)
(59, 13)
(119,)
(59,)


#### Instantiate a nearest neighbors model

In [7]:
nbrs = neighbors.KNeighborsClassifier(n_neighbors= 20)

In [8]:
nbrs.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=20, p=2,
                     weights='uniform')

Predict the classes of the test data with the trained model

In [9]:
y_pred = nbrs.predict(X_test)
y_pred

array([3, 2, 3, 3, 1, 3, 2, 1, 2, 2, 1, 3, 2, 1, 2, 2, 2, 1, 2, 1, 1, 2,
       3, 1, 3, 3, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 1, 1, 2, 2, 3,
       1, 3, 3, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 3])

In [10]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred)

0.6949152542372882

In [11]:
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.90      0.79      0.84        24
           2       0.68      0.77      0.72        22
           3       0.38      0.38      0.38        13

    accuracy                           0.69        59
   macro avg       0.66      0.65      0.65        59
weighted avg       0.71      0.69      0.70        59



From the report above, the performance of this model is poor, and it is especially inaccurate for class 3 wine (precision and recall of 0.38). Accuracy improved as n_neighbors increased, but beyond a value of 20 it stayed the same. The maximum accuracy for this model is 69.5%.
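One way to check how accuracy depends on n_neighbors is to refit the model over a range of values. The sketch below is self-contained, so it makes an assumption: it uses scikit-learn's bundled `load_wine` data as a stand-in for `wine.data.csv` (its classes are labeled 0 to 2 rather than 1 to 3). The list of k values is arbitrary, and the results may not match the numbers reported above exactly.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled wine data stands in for wine.data.csv
X_w, y_w = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_w, y_w, test_size=0.33, random_state=1)

accuracy_by_k = {}
for k in (1, 5, 10, 20, 30, 40):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    accuracy_by_k[k] = accuracy_score(y_te, model.predict(X_te))

for k, acc in accuracy_by_k.items():
    print(f"n_neighbors={k:2d}  test accuracy={acc:.3f}")
```

Because these scores come from a single train/test split, small differences between neighboring values of k should not be over-interpreted.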

In [12]:
cm = metrics.confusion_matrix(y_test, y_pred)
# Rows are true classes and columns are predicted classes, labeled 1 to 3 as in the data
cmdf = pd.DataFrame(cm, index = ['1', '2', '3'], columns = ['1', '2', '3'])
cmdf

Unnamed: 0,1,2,3
1,19,0,5
2,2,17,3
3,0,8,5
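The per-class scores can be derived directly from this confusion matrix, which helps when reading it. In the matrix returned by `confusion_matrix(y_test, y_pred)`, rows are true classes and columns are predicted classes. The sketch below copies the values from the output above into a hypothetical array `cm_values` so that it runs on its own.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes
cm_values = np.array([[19, 0, 5],
                      [2, 17, 3],
                      [0, 8, 5]])

correct = np.diag(cm_values)
recall = correct / cm_values.sum(axis=1)      # correct / true members of each class
precision = correct / cm_values.sum(axis=0)   # correct / predictions of each class
accuracy = correct.sum() / cm_values.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(float(accuracy), 3))
```

The third class has 13 true members and 13 predictions but only 5 correct ones, which is why both its precision and recall are low.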


Subset the data to include only the first four variables

In [13]:
X1 = wine.iloc[:, 1:5]
X1

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash
0,14.23,1.71,2.43,15.6
1,13.20,1.78,2.14,11.2
2,13.16,2.36,2.67,18.6
3,14.37,1.95,2.50,16.8
4,13.24,2.59,2.87,21.0
...,...,...,...,...
173,13.71,5.65,2.45,20.5
174,13.40,3.91,2.48,23.0
175,13.27,4.28,2.26,20.0
176,13.17,2.59,2.37,20.0


In [14]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, Y, test_size = 0.33, random_state=1)
print(X1_train.shape)
print(X1_test.shape)
print(y1_train.shape)
print(y1_test.shape)

(119, 4)
(59, 4)
(119,)
(59,)


In [15]:
nbrs1 = neighbors.KNeighborsClassifier(n_neighbors=10)
nbrs1.fit(X1_train, y1_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')

In [16]:
y1_pred = nbrs1.predict(X1_test)

In [17]:
metrics.accuracy_score(y1_test, y1_pred)

0.847457627118644

In [18]:
cm1 = metrics.confusion_matrix(y1_test, y1_pred)
# Rows are true classes and columns are predicted classes, labeled 1 to 3 as in the data
cmdf1 = pd.DataFrame(cm1, index = ['1', '2', '3'], columns = ['1', '2', '3'])
cmdf1

Unnamed: 0,1,2,3
1,22,1,1
2,2,19,1
3,0,4,9


In [19]:
# classification_report expects the true labels first, then the predictions
print(classification_report(y1_test, y1_pred))

              precision    recall  f1-score   support

           1       0.92      0.92      0.92        24
           2       0.79      0.86      0.83        22
           3       0.82      0.69      0.75        13

    accuracy                           0.85        59
   macro avg       0.84      0.82      0.83        59
weighted avg       0.85      0.85      0.85        59



After reducing the number of variables, the accuracy of the model increased to 85%. From the report above, class 3 still has the lowest F1-score (0.75).

Bias refers to the simplifying assumptions a model or algorithm makes about the target function: low bias means fewer assumptions, while high bias means more. Variance describes how much the estimate of the target function changes when the training data changes: low variance means small changes, while high variance means large changes.

The goal for any model is to achieve both low bias and low variance. Overfitting happens when a model or algorithm fits the noise in the training data. In other words, the model fits the training data too well. Specifically, overfitting occurs when the model or algorithm shows low bias but high variance.

K-nearest neighbors with a small k is an example of a low-bias, high-variance algorithm. This can be changed by increasing k, the number of neighbors: a larger k lowers the variance and increases the bias of the model.
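The effect of k on this trade-off can be seen by comparing training and test accuracy. The sketch below uses synthetic data generated with `make_classification`, not the wine data, and the values of k are arbitrary. With k=1 each training point is its own nearest neighbor, so the model reproduces the training labels perfectly, noise included. That is the low-bias, high-variance extreme.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (made up for illustration), with some label noise via flip_y
X_s, y_s = make_classification(n_samples=400, n_features=5, n_informative=3,
                               n_redundant=0, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_s, y_s, test_size=0.33, random_state=1)

scores = {}
for k in (1, 15, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # score() returns the mean accuracy on the given data
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k:3d}  train accuracy={scores[k][0]:.3f}  test accuracy={scores[k][1]:.3f}")
```

A large gap between training and test accuracy suggests high variance. As k grows, the two scores usually move closer together, at the cost of a smoother and more biased decision boundary.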