# **CS 5361/6361 Machine Learning**

**Classification using k-NN and the scikit-learn library**

**Author:** Ruben Martinez
**Last modified:** 9/16/2024


In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import time


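Download the MNIST data, flatten each 28x28 image into a 784-dimensional vector, and scale the pixel values to the range [0, 1].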

In [2]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = np.float32(X_train.reshape(X_train.shape[0],-1)/255)
X_test = np.float32(X_test.reshape(X_test.shape[0],-1)/255)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(60000, 784)
(60000,)
(10000, 784)
(10000,)


Now we will classify the test set using the sklearn implementation of k-nearest neighbors with default parameters.

The documentation can be found here:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
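
To make the idea concrete, here is a minimal brute-force sketch of k-nearest-neighbors classification in NumPy. It is an illustration only, not scikit-learn's implementation; the function name `knn_predict` and the toy data are assumptions made for this example. It assumes Euclidean distance, an unweighted majority vote, and non-negative integer labels.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Brute-force k-NN with Euclidean distance and an unweighted majority vote."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (
        np.sum(X_query ** 2, axis=1)[:, None]
        - 2.0 * X_query @ X_train.T
        + np.sum(X_train ** 2, axis=1)[None, :]
    )
    # Indices of the k smallest distances in each row (order within the k is arbitrary)
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    # Majority vote; bincount(...).argmax() breaks ties toward the smaller label
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data (made up for illustration): two well-separated 2-D clusters
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
y_toy = np.array([0] * 20 + [1] * 20)
X_new = np.array([[0.05, -0.02], [4.9, 5.1]])
print(knn_predict(X_toy, y_toy, X_new, k=3))  # prints [0 1]
```

In this sketch there is no training step at all: the work happens at prediction time, when every query is compared against every stored training example. That is one reason a k-NN classifier can be fast to fit and slow to predict.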



In [3]:
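# Default parameters: n_neighbors = 5, weights = 'uniform'
classifier = KNeighborsClassifier()
# Time the fit (training) step
start = time.time()
classifier.fit(X_train, y_train)
end = time.time()
print('Elapsed time training= {:.4f} secs'.format(end-start))
# Time the prediction (classification) step on the whole test set
start = time.time()
pred = classifier.predict(X_test)
end = time.time()
print('Elapsed time testing= {:.4f} secs'.format(end-start))
# Rows of the confusion matrix are true labels, columns are predicted labels
print(f'Accuracy: {accuracy_score(y_test,pred):6.4f}')
print(f'Confusion matrix:\n{confusion_matrix(y_test,pred)}')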

Elapsed time training= 0.1290 secs
Elapsed time testing= 48.7400 secs
Accuracy: 0.9688
Confusion matrix:
[[ 974    1    1    0    0    1    2    1    0    0]
 [   0 1133    2    0    0    0    0    0    0    0]
 [  11    8  991    2    1    0    1   15    3    0]
 [   0    3    3  976    1   13    1    6    3    4]
 [   3    7    0    0  944    0    4    2    1   21]
 [   5    0    0   12    2  862    4    1    2    4]
 [   5    3    0    0    3    2  945    0    0    0]
 [   0   22    4    0    3    0    0  988    0   11]
 [   8    3    5   13    6   12    5    5  913    4]
 [   5    7    3    9    7    3    1   10    2  962]]


By default, the k-neighbors classifier uses 5 nearest neighbors (n_neighbors = 5). Would you expect the accuracy to increase or decrease using:
*   k = 1?
*   k = 15?
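
One way to check such an expectation quickly is to loop over several values of k on a small dataset before running the full experiment. The sketch below is an assumption-laden stand-in: it uses the 8x8 digits dataset bundled with scikit-learn instead of MNIST, and the split and variable names are chosen only for this example, so the numbers it prints will not match the MNIST results.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the small bundled digits dataset stands in for MNIST
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.25, random_state=0, stratify=y_small
)

accuracy_by_k = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    accuracy_by_k[k] = accuracy_score(y_te, model.predict(X_te))
    print(f'k = {k:2d}  accuracy = {accuracy_by_k[k]:6.4f}')
```

The same loop works with the notebook's `X_train`, `y_train`, `X_test`, and `y_test`, at the cost of a much longer prediction time per value of k.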



By default, the algorithm assigns the same weight to all the nearest neighbors of a test example (weights = 'uniform'), while distance weighting assigns a larger weight to neighbors that are closer to the test example. Would you expect the accuracy to increase or decrease using distance weighting and:
*   k = 1?
*   k = 2?
*   k = 5?
*   k = 15?
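
The difference between uniform and distance weighting can be seen on a single test example. The sketch below assumes inverse-distance weights (each neighbor votes with weight 1/distance), which is how the scikit-learn documentation describes weights = 'distance'. The function `weighted_vote`, the zero-distance handling, and the neighbor labels and distances are assumptions made up for this illustration.

```python
import numpy as np

def weighted_vote(labels, distances, n_classes):
    """Return the class with the largest sum of inverse-distance weights."""
    labels = np.asarray(labels)
    distances = np.asarray(distances, dtype=float)
    if np.any(distances == 0):
        # Exact matches take all the weight (avoids dividing by zero)
        weights = (distances == 0).astype(float)
    else:
        weights = 1.0 / distances
    scores = np.bincount(labels, weights=weights, minlength=n_classes)
    return int(np.argmax(scores))

# Hypothetical neighbors of one test example, sorted by distance
labels = [7, 1, 1]
distances = [0.5, 2.0, 2.5]

print(weighted_vote(labels[:1], distances[:1], n_classes=10))  # k = 1, weighted: 7
print(weighted_vote(labels[:2], distances[:2], n_classes=10))  # k = 2, weighted: 7
print(weighted_vote(labels, distances, n_classes=10))          # k = 3, weighted: 7
print(int(np.bincount(labels, minlength=10).argmax()))         # k = 3, uniform: 1
```

With k = 1 there is only one vote, so the weighting cannot matter. With k = 2 and two distinct distances, the nearer neighbor always carries the larger weight, so the weighted prediction is the same as the 1-nearest-neighbor prediction. Weighting can only change the outcome once there are enough neighbors for several farther ones to outvote a closer one.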

Suppose we select the n features with the largest variance in the training set. If we select half of the features (that is, n = 784/2 = 392), how would you expect each of the following to change?

*   Training time
*   Classification time
*   Accuracy
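
The feature selection described above can be sketched as follows. The helper `top_variance_features` and the demo arrays are assumptions introduced for this example; the demo uses made-up data with six features rather than MNIST. Note that the features are chosen using the training set only, and the same columns are then kept in the test set.

```python
import numpy as np

def top_variance_features(X, n):
    """Return the sorted column indices of the n features with the largest variance."""
    variances = np.var(X, axis=0)
    # argsort is ascending, so the last n entries have the largest variance
    return np.sort(np.argsort(variances)[-n:])

# Made-up data standing in for X_train / X_test: 6 features with increasing spread
rng = np.random.default_rng(0)
scales = np.array([0.0, 0.1, 0.5, 1.0, 2.0, 4.0])
X_tr_demo = rng.normal(size=(200, 6)) * scales
X_te_demo = rng.normal(size=(50, 6)) * scales

keep = top_variance_features(X_tr_demo, n=3)  # chosen from the training set only
X_tr_sel, X_te_sel = X_tr_demo[:, keep], X_te_demo[:, keep]
print(keep, X_tr_sel.shape, X_te_sel.shape)  # prints [3 4 5] (200, 3) (50, 3)
```

With the notebook's arrays, the equivalent call would be `keep = top_variance_features(X_train, 392)`, followed by fitting on `X_train[:, keep]` and predicting on `X_test[:, keep]`.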





# Predicting with 1 neighbor (unweighted)

In [4]:
classifier = KNeighborsClassifier(n_neighbors = 1)
start = time.time()
classifier.fit(X_train, y_train)
end = time.time()
print('Elapsed time training= {:.4f} secs'.format(end-start))
start = time.time()
pred = classifier.predict(X_test)
end = time.time()
print('Elapsed time testing= {:.4f} secs'.format(end-start))
print(f'Accuracy: {accuracy_score(y_test,pred):6.4f}')
print(f'Confusion matrix:\n{confusion_matrix(y_test,pred)}')

Elapsed time training= 0.0450 secs
Elapsed time testing= 41.5560 secs
Accuracy: 0.9691
Confusion matrix:
[[ 973    1    1    0    0    1    3    1    0    0]
 [   0 1129    3    0    1    1    1    0    0    0]
 [   7    6  992    5    1    0    2   16    3    0]
 [   0    1    2  970    1   19    0    7    7    3]
 [   0    7    0    0  944    0    3    5    1   22]
 [   1    1    0   12    2  860    5    1    6    4]
 [   4    2    0    0    3    5  944    0    0    0]
 [   0   14    6    2    4    0    0  992    0   10]
 [   6    1    3   14    5   13    3    4  920    5]
 [   2    5    1    6   10    5    1   11    1  967]]


# Predicting with 15 neighbors (unweighted)

In [5]:
classifier = KNeighborsClassifier(n_neighbors = 15)
start = time.time()
classifier.fit(X_train, y_train)
end = time.time()
print('Elapsed time training= {:.4f} secs'.format(end-start))
start = time.time()
pred = classifier.predict(X_test)
end = time.time()
print('Elapsed time testing= {:.4f} secs'.format(end-start))
print(f'Accuracy: {accuracy_score(y_test,pred):6.4f}')
print(f'Confusion matrix:\n{confusion_matrix(y_test,pred)}')

Elapsed time training= 0.0276 secs
Elapsed time testing= 42.2809 secs
Accuracy: 0.9633
Confusion matrix:
[[ 970    1    1    0    0    2    5    1    0    0]
 [   0 1131    2    1    0    0    1    0    0    0]
 [  15   15  968    3    1    0    3   20    7    0]
 [   0    3    2  975    1   14    0    7    4    4]
 [   1   13    0    0  934    0    5    2    1   26]
 [   3    1    0   10    1  863    8    2    0    4]
 [   7    4    0    0    3    1  943    0    0    0]
 [   0   28    3    0    2    0    0  980    0   15]
 [   7    4    5   13    7   12    5    7  907    7]
 [   6    7    2    9    9    2    1   10    1  962]]


# Predicting with 1 neighbor (weighted)


In [6]:
classifier = KNeighborsClassifier(n_neighbors = 1, weights = "distance")
start = time.time()
classifier.fit(X_train, y_train)
end = time.time()
print('Elapsed time training= {:.4f} secs'.format(end-start))
start = time.time()
pred = classifier.predict(X_test)
end = time.time()
print('Elapsed time testing= {:.4f} secs'.format(end-start))
print(f'Accuracy: {accuracy_score(y_test,pred):6.4f}')
print(f'Confusion matrix:\n{confusion_matrix(y_test,pred)}')

Elapsed time training= 0.0391 secs
Elapsed time testing= 44.4569 secs
Accuracy: 0.9691
Confusion matrix:
[[ 973    1    1    0    0    1    3    1    0    0]
 [   0 1129    3    0    1    1    1    0    0    0]
 [   7    6  992    5    1    0    2   16    3    0]
 [   0    1    2  970    1   19    0    7    7    3]
 [   0    7    0    0  944    0    3    5    1   22]
 [   1    1    0   12    2  860    5    1    6    4]
 [   4    2    0    0    3    5  944    0    0    0]
 [   0   14    6    2    4    0    0  992    0   10]
 [   6    1    3   14    5   13    3    4  920    5]
 [   2    5    1    6   10    5    1   11    1  967]]


# Predicting with 2 neighbors (weighted)

In [7]:
classifier = KNeighborsClassifier(n_neighbors = 2, weights = "distance")
start = time.time()
classifier.fit(X_train, y_train)
end = time.time()
print('Elapsed time training= {:.4f} secs'.format(end-start))
start = time.time()
pred = classifier.predict(X_test)
end = time.time()
print('Elapsed time testing= {:.4f} secs'.format(end-start))
print(f'Accuracy: {accuracy_score(y_test,pred):6.4f}')
print(f'Confusion matrix:\n{confusion_matrix(y_test,pred)}')

Elapsed time training= 0.0341 secs
Elapsed time testing= 42.1870 secs
Accuracy: 0.9691
Confusion matrix:
[[ 973    1    1    0    0    1    3    1    0    0]
 [   0 1129    3    0    1    1    1    0    0    0]
 [   7    6  992    5    1    0    2   16    3    0]
 [   0    1    2  970    1   19    0    7    7    3]
 [   0    7    0    0  944    0    3    5    1   22]
 [   1    1    0   12    2  860    5    1    6    4]
 [   4    2    0    0    3    5  944    0    0    0]
 [   0   14    6    2    4    0    0  992    0   10]
 [   6    1    3   14    5   13    3    4  920    5]
 [   2    5    1    6   10    5    1   11    1  967]]


# Predicting with 5 neighbors (weighted)

In [8]:
classifier = KNeighborsClassifier(n_neighbors = 5, weights = "distance")
start = time.time()
classifier.fit(X_train, y_train)
end = time.time()
print('Elapsed time training= {:.4f} secs'.format(end-start))
start = time.time()
pred = classifier.predict(X_test)
end = time.time()
print('Elapsed time testing= {:.4f} secs'.format(end-start))
print(f'Accuracy: {accuracy_score(y_test,pred):6.4f}')
print(f'Confusion matrix:\n{confusion_matrix(y_test,pred)}')

Elapsed time training= 0.0365 secs
Elapsed time testing= 44.1822 secs
Accuracy: 0.9691
Confusion matrix:
[[ 974    1    1    0    0    1    2    1    0    0]
 [   0 1133    2    0    0    0    0    0    0    0]
 [  11    7  989    2    0    0    2   17    4    0]
 [   0    2    3  973    1   13    1    7    4    6]
 [   2    7    0    0  943    0    4    3    0   23]
 [   4    0    0    9    2  861    6    1    4    5]
 [   5    3    0    0    3    2  945    0    0    0]
 [   0   20    4    0    3    0    0  990    0   11]
 [   7    3    5   12    5   11    5    5  916    5]
 [   3    5    3    7    7    3    1   11    2  967]]


# Predicting with 15 neighbors (weighted)

In [9]:
classifier = KNeighborsClassifier(n_neighbors = 15, weights = "distance")
start = time.time()
classifier.fit(X_train, y_train)
end = time.time()
print('Elapsed time training= {:.4f} secs'.format(end-start))
start = time.time()
pred = classifier.predict(X_test)
end = time.time()
print('Elapsed time testing= {:.4f} secs'.format(end-start))
print(f'Accuracy: {accuracy_score(y_test,pred):6.4f}')
print(f'Confusion matrix:\n{confusion_matrix(y_test,pred)}')

Elapsed time training= 0.0281 secs
Elapsed time testing= 42.2814 secs
Accuracy: 0.9647
Confusion matrix:
[[ 970    1    1    0    0    2    5    1    0    0]
 [   0 1131    2    1    0    0    1    0    0    0]
 [  13   13  972    3    1    0    3   20    7    0]
 [   0    3    2  973    1   13    1    8    5    4]
 [   1   13    0    0  933    0    5    2    1   27]
 [   3    1    0    8    1  865    8    1    0    5]
 [   7    4    0    0    3    1  943    0    0    0]
 [   0   27    3    0    2    0    0  981    0   15]
 [   6    4    4   12    6   11    3    8  913    7]
 [   5    6    2    8    9    2    1    9    1  966]]
