# Nearest Neighbors Classifier

### Hands-on example

In this hands-on example we will explore a multiclass classification problem.  
We will use a wine dataset to classify wines into **3 classes** using real-valued features.  
  
Outline:
- Demonstrate how a simple k-NN classifier works in scikit-learn
- Load the wine dataset
- Run k-NN on the wine dataset using scikit-learn
- Vary the distance measure and compare performance

In [1]:
# Import the classifier
from sklearn.neighbors import KNeighborsClassifier

API reference: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [2]:
# Sample dataset - 4 samples, two classes (0 and 1)
# Each sample has a single feature, so the points are 1-D
X = [[0], [1], [2], [3]] # shape (4, 1): 4 samples, 1 feature
y = [0, 0, 1, 1]

The points above are one-dimensional. Intuitively, the decision boundary should lie at x = 1.5, midway between the two classes. Let us verify this.

In [3]:
# Initialize the classifier to vote among the 3 nearest neighbors
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y);

In [4]:
# Predict for a point just left of the expected boundary
print(neigh.predict([[1.1]]))

[0]


In [5]:
# Predict for a point just right of the expected boundary
print(neigh.predict([[1.6]]))

[1]


In [6]:
# Get the class probabilities for x = 2 (fraction of neighbors in each class)
print(neigh.predict_proba([[2]]))

[[0.33333333 0.66666667]]
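The probabilities above are the fractions of the 3 nearest neighbors that belong to each class. The following is a minimal pure-Python sketch of that vote, not scikit-learn's implementation; the helper name `knn_predict_proba` is hypothetical, and ties between equally distant points are broken here by label order, which may differ from what scikit-learn does.

```python
from collections import Counter

def knn_predict_proba(X, y, query, k=3):
    """Return {label: fraction of the k nearest neighbors with that label}."""
    # Distance from the query to every training point
    # (the data is 1-D, so the distance is an absolute difference)
    dists = sorted((abs(x[0] - query), label) for x, label in zip(X, y))
    # Labels of the k closest points
    nearest = [label for _, label in dists[:k]]
    counts = Counter(nearest)
    return {label: counts[label] / k for label in sorted(set(y))}

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# Neighbors of x = 2 are the points 2, 1 and 3, with labels 1, 0 and 1
proba = knn_predict_proba(X, y, 2)
print(proba)
```

For x = 2, one neighbor has label 0 and two have label 1, which gives the 1/3 and 2/3 shown in the output above.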


Now let's train a k-NN classifier on the Wine dataset!

### Data Description
1. Multiclass classification problem with 3 labels, $\{1,2,3\}$, representing 3 different cultivars
2. 13 continuous attributes describing properties of the wine, such as _'Alcohol'_ and _'Malic acid'_
3. Source: UCI Machine Learning Repository

In [7]:
from matplotlib.colors import ListedColormap         # Visualization
import numpy as np                                   # Numerical operations
import matplotlib.pyplot as plt                      # Plotting
from sklearn.model_selection import train_test_split # Data splitting
import pandas as pd                                  # Data management

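# Load the dataset and separate the class labels from the features
data = pd.read_csv('wine_original.csv')
labels = data['class']
del data['class']  # 'data' now holds only the 13 feature columns
X = data
y = labels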

In [8]:
data.head()

| | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [9]:
data.describe()

| | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.02927 | 0.361854 | 1.590899 | 5.05809 | 0.957449 | 2.611685 | 746.893258 |
| std | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.70999 | 314.907474 |
| min | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 |
| 25% | 12.3625 | 1.6025 | 2.21 | 17.2 | 88.0 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 500.5 |
| 50% | 13.05 | 1.865 | 2.36 | 19.5 | 98.0 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 673.5 |
| 75% | 13.6775 | 3.0825 | 2.5575 | 21.5 | 107.0 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 985.0 |
| max | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 |


In [10]:
# Split into training (80%) and testing (20%) data; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

**Distance functions**  
k-NN classifies a point by its nearest neighbors, so its accuracy depends on the distance function that defines "nearest".  

![Minkowski distance](minkowski2.png)
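The Minkowski distance of order $p$ between two points $a$ and $b$ is $\left(\sum_i |a_i - b_i|^p\right)^{1/p}$. As a minimal sketch, it can be written in plain Python as follows; the function name `minkowski_distance` is a hypothetical helper for illustration and is not part of scikit-learn.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length sequences."""
    # Sum the p-th powers of the per-feature absolute differences,
    # then take the p-th root of the total
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

a, b = [0, 0], [3, 4]

manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)

print(manhattan, euclidean)
```

Setting $p = 1$ gives the Manhattan ($l_1$) distance and $p = 2$ gives the Euclidean ($l_2$) distance, so the same pair of points can be 7 units apart under one measure and 5 under the other.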

In [11]:
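k = 3  # Number of neighbors

# Train with the default distance measure
clf = KNeighborsClassifier(k)
clf.fit(X_train, y_train)

# Accuracy = fraction of test samples predicted correctly
predictions = clf.predict(X_test)
print('accuracy = ' + str(np.sum(predictions == y_test) / len(y_test)))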

accuracy = 0.6944444444444444


By default, scikit-learn uses the Minkowski distance with $p = 2$, which is the Euclidean ($l_2$) distance.  
Let us switch to the Manhattan ($l_1$) distance by setting $p = 1$.

In [12]:
# Parameter 'p' is the power parameter of the Minkowski metric.
# p = 1 --> Manhattan distance
# p = 2 --> Euclidean distance (the default)

clf = KNeighborsClassifier(k, p=1) # Use Manhattan distance
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
accuracy = np.sum(predictions == y_test) / len(y_test)
print("Accuracy = " + str(accuracy) + " at k = " + str(k))

Accuracy = 0.7777777777777778 at k = 3
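The two runs above can be folded into a single loop over `p`. The sketch below assumes that scikit-learn's bundled `load_wine` dataset is an acceptable stand-in for `wine_original.csv`; that substitution is an assumption made here so the example is self-contained. Its labels are 0, 1 and 2 rather than 1, 2 and 3, and the accuracies it prints are not guaranteed to match the ones above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: load_wine stands in for wine_original.csv
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

k = 3
results = {}
for p in (1, 2):  # 1 = Manhattan, 2 = Euclidean
    clf = KNeighborsClassifier(n_neighbors=k, p=p)
    clf.fit(X_train, y_train)
    # score() returns the mean accuracy on the given test data
    results[p] = clf.score(X_test, y_test)
    print("p = " + str(p) + ", accuracy = " + str(results[p]))
```

Using `clf.score` is equivalent to the manual `np.sum(predictions == y_test) / len(y_test)` computation used earlier.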
