# Name: Harsh Siddhapura
# ASU ID: 1230169813

# Lab 8: Building a KNN Classification Model in SciKit Learn

In [3]:
import sklearn.datasets as sk
import sklearn.neighbors as kn
import pandas as pd

(X,y) = sk.load_svmlight_file("colon_cancer_train_1.libsvm")
(XTesting,yTesting) = sk.load_svmlight_file("colon_cancer_test_1.libsvm")
knn_c = kn.KNeighborsClassifier(n_neighbors=3)
knn_c.fit(X,y)
# Built-in round() works whether score() returns a Python float or a NumPy float
testingAccuracy = round(knn_c.score(XTesting, yTesting), 3)
print(f'Testing Accuracy: {testingAccuracy:.1%}')


Testing Accuracy: 78.3%


In [4]:
# Built-in round() works whether score() returns a Python float or a NumPy float
trainingAccuracy = round(knn_c.score(X, y), 3)
print(f'Training Accuracy: {trainingAccuracy:.1%}')

Training Accuracy: 82.1%


### Experimenting with Different Parameters

In [5]:
neighbors = [3,5,7,9,11]

In [6]:
metrics = list(kn.VALID_METRICS_SPARSE['brute'])
print(f'Metrics: {metrics}')

Metrics: ['cosine', 'l1', 'manhattan', 'precomputed', 'euclidean', 'cityblock', 'l2']


In [7]:
if 'precomputed' in metrics:
    metrics.remove('precomputed')
print(f'Metrics: {metrics}')

Metrics: ['cosine', 'l1', 'manhattan', 'euclidean', 'cityblock', 'l2']


In [8]:
testingList = []
trainingList = []
columnNames = ['Neighbors', 'Metric', 'Accuracy']

### Testing Dataset

In [9]:
for n in neighbors:
    for m in metrics:
        knn_c = kn.KNeighborsClassifier(n_neighbors=n, metric=m)
        knn_c.fit(X,y)
        testingAccuracy = round(knn_c.score(XTesting, yTesting), 3)
        testingList.append([n,m,testingAccuracy])
        print(f'{n} neighbors with {m} metric has testing accuracy of {testingAccuracy}')
    print()

3 neighbors with cosine metric has testing accuracy of 0.826
3 neighbors with l1 metric has testing accuracy of 0.783
3 neighbors with manhattan metric has testing accuracy of 0.783
3 neighbors with euclidean metric has testing accuracy of 0.783
3 neighbors with cityblock metric has testing accuracy of 0.783
3 neighbors with l2 metric has testing accuracy of 0.783

5 neighbors with cosine metric has testing accuracy of 0.87
5 neighbors with l1 metric has testing accuracy of 0.87
5 neighbors with manhattan metric has testing accuracy of 0.87
5 neighbors with euclidean metric has testing accuracy of 0.87
5 neighbors with cityblock metric has testing accuracy of 0.87
5 neighbors with l2 metric has testing accuracy of 0.87

7 neighbors with cosine metric has testing accuracy of 0.87
7 neighbors with l1 metric has testing accuracy of 0.913
7 neighbors with manhattan metric has testing accuracy of 0.913
7 neighbors with euclidean metric has testing accuracy of 0.87
7 neighbors with cityblock

In [10]:
testingDF = pd.DataFrame(testingList, columns = columnNames)
testingDF

Index,Neighbors,Metric,Accuracy
0,3,cosine,0.826
1,3,l1,0.783
2,3,manhattan,0.783
3,3,euclidean,0.783
4,3,cityblock,0.783
5,3,l2,0.783
6,5,cosine,0.87
7,5,l1,0.87
8,5,manhattan,0.87
9,5,euclidean,0.87


### Training Dataset

In [11]:
for n in neighbors:
    for m in metrics:
        knn_c = kn.KNeighborsClassifier(n_neighbors=n, metric=m)
        knn_c.fit(X,y)
        trainingAccuracy = round(knn_c.score(X, y), 3)
        trainingList.append([n,m,trainingAccuracy])
        print(f'{n} neighbors with {m} metric has training accuracy of {trainingAccuracy}')
    print()

3 neighbors with cosine metric has training accuracy of 0.795
3 neighbors with l1 metric has training accuracy of 0.821
3 neighbors with manhattan metric has training accuracy of 0.821
3 neighbors with euclidean metric has training accuracy of 0.821
3 neighbors with cityblock metric has training accuracy of 0.821
3 neighbors with l2 metric has training accuracy of 0.821

5 neighbors with cosine metric has training accuracy of 0.692
5 neighbors with l1 metric has training accuracy of 0.744
5 neighbors with manhattan metric has training accuracy of 0.744
5 neighbors with euclidean metric has training accuracy of 0.744
5 neighbors with cityblock metric has training accuracy of 0.744
5 neighbors with l2 metric has training accuracy of 0.744

7 neighbors with cosine metric has training accuracy of 0.769
7 neighbors with l1 metric has training accuracy of 0.692
7 neighbors with manhattan metric has training accuracy of 0.692
7 neighbors with euclidean metric has training accuracy of 0.692
7 

In [12]:
trainingDF = pd.DataFrame(trainingList, columns = columnNames)
trainingDF

Index,Neighbors,Metric,Accuracy
0,3,cosine,0.795
1,3,l1,0.821
2,3,manhattan,0.821
3,3,euclidean,0.821
4,3,cityblock,0.821
5,3,l2,0.821
6,5,cosine,0.692
7,5,l1,0.744
8,5,manhattan,0.744
9,5,euclidean,0.744


### Maximum Values

1) What is the maximum obtained accuracy on the test set? What are the parameter values (number of neighbors and distance metric) associated with this accuracy?  

    - Maximum Accuracy: 91.3%
    - Neighbors: 7
    - Metrics: l1, manhattan, cityblock

In [13]:
maxTestingAccuracy = testingDF['Accuracy'].max()
maxTestingRows = testingDF[testingDF['Accuracy'] == maxTestingAccuracy]
maxTestingRows

Index,Neighbors,Metric,Accuracy
13,7,l1,0.913
14,7,manhattan,0.913
16,7,cityblock,0.913
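
The three metrics tied at the maximum are not independent results: in scikit-learn, `l1`, `manhattan`, and `cityblock` name the same distance, and `l2` is an alias of `euclidean`. A minimal sketch that checks this with `sklearn.metrics.pairwise_distances`, using a small random matrix `A` that is made up for illustration and is not the lab data:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in data (not the colon cancer dataset).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))

# Pairwise distance matrices under each metric name.
d_l1 = pairwise_distances(A, metric="l1")
d_manhattan = pairwise_distances(A, metric="manhattan")
d_cityblock = pairwise_distances(A, metric="cityblock")
d_l2 = pairwise_distances(A, metric="l2")
d_euclidean = pairwise_distances(A, metric="euclidean")

# The aliases should agree entry by entry.
same_l1 = bool(np.allclose(d_l1, d_manhattan) and np.allclose(d_l1, d_cityblock))
same_l2 = bool(np.allclose(d_l2, d_euclidean))
print(f"l1/manhattan/cityblock identical: {same_l1}")
print(f"l2/euclidean identical: {same_l2}")
```

In effect the grid compares three distinct distances (cosine, Manhattan, Euclidean) rather than six.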


2) What is the maximum obtained accuracy on the training set? What are the parameter values associated with this accuracy?  

    - Maximum Accuracy: 82.1%
    - Neighbors: 3
    - Metrics: l1, manhattan, euclidean, cityblock, l2

In [14]:
maxTrainingAccuracy = trainingDF['Accuracy'].max()
maxTrainingRows = trainingDF[trainingDF['Accuracy'] == maxTrainingAccuracy]
maxTrainingRows

Index,Neighbors,Metric,Accuracy
1,3,l1,0.821
2,3,manhattan,0.821
3,3,euclidean,0.821
4,3,cityblock,0.821
5,3,l2,0.821


3) Are the parameter values achieving maximum accuracy the same for the training and test sets?  

    No. In the training set, the highest accuracy (82.1%) is achieved with 3 neighbors and the l1, manhattan, euclidean, cityblock, and l2 metrics, while in the test set, the highest accuracy (91.3%) is achieved with 7 neighbors and the l1, manhattan, and cityblock metrics.
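
The same comparison can be made for every configuration by joining the two result tables on `Neighbors` and `Metric`. The sketch below is one way to do that; it re-enters a few rows from the tables above by hand so that it runs on its own, and the names `testDF`, `trainDF`, `comparison`, and `Gap` are illustrative rather than part of the lab code.

```python
import pandas as pd

columnNames = ['Neighbors', 'Metric', 'Accuracy']

# A subset of rows copied from the testing and training tables above.
testingRows = [[3, 'cosine', 0.826], [3, 'l1', 0.783], [3, 'euclidean', 0.783],
               [5, 'cosine', 0.87], [5, 'l1', 0.87], [5, 'euclidean', 0.87]]
trainingRows = [[3, 'cosine', 0.795], [3, 'l1', 0.821], [3, 'euclidean', 0.821],
                [5, 'cosine', 0.692], [5, 'l1', 0.744], [5, 'euclidean', 0.744]]

testDF = pd.DataFrame(testingRows, columns=columnNames)
trainDF = pd.DataFrame(trainingRows, columns=columnNames)

# One row per (Neighbors, Metric) pair with both accuracies side by side.
comparison = trainDF.merge(testDF, on=['Neighbors', 'Metric'],
                           suffixes=('_train', '_test'))

# Positive gap: training accuracy is higher than testing accuracy.
comparison['Gap'] = (comparison['Accuracy_train'] - comparison['Accuracy_test']).round(3)
print(comparison)
```

A positive `Gap` marks a configuration that scores higher on the training set than on the test set.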

4) If no, what does this tell you about the classification performance of the KNN classifier?  

    This suggests the KNN classifier is overfitting to the training data. Overfitting occurs when a model learns the detail and noise in the training data to the extent that it hurts performance on new data: random fluctuations in the training data are learned as if they were real patterns, and because those patterns do not carry over to new data, the model generalizes poorly. With 3 neighbors and the non-cosine metrics, the classifier scores higher on the training set (82.1%) than on the test set (78.3%), which is a common sign of overfitting. At 5 and 7 neighbors, however, the recorded test accuracy is higher than the training accuracy, so the parameters that look best on one split are not a reliable guide to the other.

    It’s also worth noting that KNN classifiers are especially prone to overfitting because they rely on local data points: in high-dimensional spaces, data points are often “far away” from each other, so the nearest neighbors may not be meaningfully similar. To improve the model, you could try adjusting the parameters (a larger number of neighbors smooths the decision boundary) or using feature selection to reduce the dimensionality. It’s also important to ensure that the training data is representative of the data the model will encounter in production.
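
One way to act on these suggestions is to pick the number of neighbors and the metric by cross-validation on the training data, with feature selection placed inside the pipeline so that it is refit on each fold. The following is a minimal sketch, not the lab's method. It uses synthetic data from `make_classification` as a stand-in for the colon cancer files, and the sample count, feature count, `k=20`, and `cv=5` are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic high-dimensional data (assumption: stands in for the real training set).
X_demo, y_demo = make_classification(n_samples=60, n_features=200,
                                     n_informative=10, random_state=0)

# Feature selection followed by KNN; both steps are refit within each CV fold.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("knn", KNeighborsClassifier()),
])

# Same neighbor counts as the lab; one name per distinct distance.
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__metric": ["cosine", "manhattan", "euclidean"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", round(search.best_score_, 3))
```

Choosing parameters this way keeps the test set out of the selection step, so the final test accuracy remains an estimate of performance on unseen data.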