## K nearest neighbor classifier

We start with the iris dataset. It has three classes (species of flower) and four features (sepal length, sepal width, petal length, and petal width, all in cm).

We can split the dataset into training and test parts and check the out-of-sample performance of the trained model.



In [84]:
import numpy as np
from sklearn import datasets, metrics
np.random.seed(123456)

In [94]:
iris = datasets.load_iris()
# print(iris.DESCR)
# iris.data is the data container. It is a numpy.ndarray object
# iris.target contains the class variable (in numeric form)

# iris.data.shape
# iris.data[0:5,:]    #  See first five rows. numpy indexing is zero-based, start:stop:step 
                    # https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
    
dm = iris.data      # data matrix    

In [95]:
# Splitting the data and classifications into training and test sets

nrows = dm.shape[0]         # number of rows in the data matrix
f = 2/3                     # fraction to be used for training

ntrain = int(f*nrows)     # Number of observations to be used for training
ntest = nrows - ntrain

v1 = np.random.choice(nrows,size=ntrain,replace=False) # ntrain distinct random row indices from 0 to nrows-1 (training rows)
v2 = np.setdiff1d(np.arange(nrows),v1)                 # the remaining row indices (test rows)


dtrain = dm[v1,:] ; ctrain = iris.target[v1]    # training data and corresponding classifications

dtest = dm[v2,:]  ; ctest = iris.target[v2]     # test data and classifications
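
The manual index-based split above can also be written with scikit-learn's `train_test_split`. This is a minimal sketch, not the method used in this notebook: the variable names (`X_train`, `y_test`, and so on), the `random_state` value, and the use of `stratify` (which keeps the three species in similar proportions in both parts) are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 50 of the 150 observations (one third) for testing.
# stratify=iris.target keeps the class proportions similar in both parts;
# random_state is an arbitrary seed chosen only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, stratify=iris.target, random_state=0
)

print(X_train.shape, X_test.shape)   # sizes of the two parts
print(np.bincount(y_test))           # observations per class in the test set
```

Because the split is random, the rows selected here will generally differ from those chosen by `np.random.choice` above, so downstream error figures would not match exactly.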

From the scikit-learn [documentation](https://scikit-learn.org/stable/tutorial/statistical_inference/settings.html):

> Fitting data: the main API implemented by scikit-learn is that of the estimator. An estimator is any object
> that learns from data; it may be a classification, regression or clustering algorithm or a transformer that
> extracts/filters useful features from raw data.

> All estimator objects expose a fit method that takes a dataset (usually a 2-d array):

In [87]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()                       # Instantiate a classifier object

In [62]:
knn.fit(dtrain,ctrain)                            # Train the object. Note the parameters in the output

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
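
To make the default settings shown above (`n_neighbors=5`, Euclidean distance, `weights='uniform'`) more concrete, the idea behind the classifier can be sketched in plain NumPy: for each new observation, find the k closest training observations and take a majority vote over their classes. The function `knn_predict` and the toy data are hypothetical, written only for illustration; scikit-learn's actual implementation differs (for example, it may use tree-based neighbor searches and its own tie-breaking).

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=5):
    """Classify each row of X_new by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_new = np.asarray(X_new, dtype=float)
    preds = []
    for x in X_new:
        # Euclidean distance from x to every training observation
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # indices of the k closest training observations
        nearest = np.argsort(dist)[:k]
        # majority vote; ties go to the smallest label (an arbitrary choice here)
        preds.append(np.bincount(y_train[nearest]).argmax())
    return np.array(preds)

# Toy data: two well-separated clusters labelled 0 and 1
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, [[0.2, 0.1], [5.5, 5.2]], k=3))   # prints [0 1]
```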

In [72]:
yhat = knn.predict(dtest)
ctest
yhat
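# MSE treats the class labels as numbers. Here every error confuses classes 1 and 2
# (squared difference 1), so the MSE equals the misclassification rate.
metrics.mean_squared_error(ctest,yhat)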

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       2, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2])

0.08
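
Since the targets are class labels rather than quantities, classification metrics are often easier to interpret than the mean squared error. The sketch below rebuilds the two arrays printed above and computes the accuracy and the confusion matrix with `metrics.accuracy_score` and `metrics.confusion_matrix`. The names `y_true`, `y_pred`, `acc`, `cm`, and `mse` are assumptions introduced for this example.

```python
import numpy as np
from sklearn import metrics

# Rebuild the two arrays printed above: 12, 17 and 21 test observations of classes 0, 1, 2
y_true = np.repeat([0, 1, 2], [12, 17, 21])
y_pred = y_true.copy()
y_pred[[19, 22, 24]] = 2    # three class-1 flowers predicted as class 2
y_pred[31] = 1              # one class-2 flower predicted as class 1

acc = metrics.accuracy_score(y_true, y_pred)      # share of correct predictions
cm = metrics.confusion_matrix(y_true, y_pred)     # rows: true class, columns: predicted class
mse = metrics.mean_squared_error(y_true, y_pred)  # same quantity as in the cell above

print(acc, mse)
print(cm)
```

The confusion matrix shows where the four errors fall: all of them are between classes 1 and 2, and class 0 is classified perfectly. If a class-0 flower were predicted as class 2, the MSE would count that error four times (squared difference 4), while the accuracy would count it once.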

In [86]:
# Testing if the distance metric makes a difference

for m in ["euclidean","manhattan","chebyshev"]:     # manhattan: sum(|x-y|), chebyshev: max(|x-y|)
    knn = KNeighborsClassifier(metric=m)
    knn.fit(dtrain,ctrain)
    yhat = knn.predict(dtest)
    MSE = metrics.mean_squared_error(ctest,yhat)
    print(m,MSE)
    
    

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

euclidean 0.08


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='manhattan',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

manhattan 0.1


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='chebyshev',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

chebyshev 0.06
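
For reference, the three distances compared above can be computed by hand in NumPy. This is a small illustrative sketch; the vectors `x` and `y` are made-up values, not rows of the iris data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
d = np.abs(x - y)                     # coordinate-wise absolute differences: [3, 2, 0]

euclidean = np.sqrt((d ** 2).sum())   # sqrt(9 + 4 + 0) = sqrt(13)
manhattan = d.sum()                   # 3 + 2 + 0 = 5
chebyshev = d.max()                   # max(3, 2, 0) = 3

print(euclidean, manhattan, chebyshev)
```

The three metrics can rank neighbors differently, which is why the set of five nearest neighbors, and hence the prediction, may change with the metric.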


In [91]:
# Same distance metric, but different numbers of neighbors

for n in [3,5,7,10]:     
    knn = KNeighborsClassifier(metric="euclidean",n_neighbors=n)
    knn.fit(dtrain,ctrain)
    yhat = knn.predict(dtest)
    MSE = metrics.mean_squared_error(ctest,yhat)
    print(n,MSE)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

3 0.08


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

5 0.08


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                     weights='uniform')

7 0.06


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')

10 0.06
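
With only 50 test observations, a change of 0.02 in the error corresponds to roughly one flower, so the differences above may simply reflect the particular random split. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an assumption about how one might do this, not part of the original analysis; the choice `cv=5` and the names `knn_cv` and `cv_accuracy` are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

cv_accuracy = {}
for n in [3, 5, 7, 10]:
    knn_cv = KNeighborsClassifier(metric="euclidean", n_neighbors=n)
    # 5 folds: each observation is used for testing exactly once
    scores = cross_val_score(knn_cv, iris.data, iris.target, cv=5)
    cv_accuracy[n] = scores.mean()     # mean accuracy over the five folds
    print(n, cv_accuracy[n])
```

Note that `cross_val_score` reports accuracy for classifiers by default, so larger values are better, unlike the error figures printed above.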


## Logistic regression