
This notebook is part of Dr. Christoforos Christoforou's course materials. You may not, nor may you knowingly allow others to, reproduce or distribute lecture notes, course materials, or any of their derivatives without the instructor's express written consent.

# Problem Set 02 - Instance-based Classifiers
**Professor:** Dr. Christoforos Christoforou

For this problem set you will need the following libraries, which come pre-installed in the Colab environment:

* [Numpy](https://www.numpy.org/) is an array manipulation library, used for linear algebra, Fourier transform, and random number capabilities.
* [Pandas](https://pandas.pydata.org/) is a library for data manipulation and data analysis.
* [Matplotlib](https://matplotlib.org/) is a library that generates figures and provides a graphical user interface toolkit.

You can load them using the following import statement:

In [240]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# import sklearn 
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, which was removed in scikit-learn 1.2

import warnings
warnings.simplefilter("ignore")

## 1. Objective 
As part of this problem set, you will work on the `wine quality dataset` in order to:
- Explore the physicochemical properties of red wine
- Determine an optimal machine learning model for red wine quality classification

For that, you will use an `instance-based` classifier, namely the K-NN algorithm. Review the information provided in the problem set and complete all the challenges listed.
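
To make the idea of an instance-based classifier concrete, here is a minimal sketch of K-NN written directly in NumPy. It is only an illustration, not the scikit-learn implementation used later: the helper `knn_predict` and the toy arrays `X_toy` and `y_toy` are hypothetical names introduced for this example, and Euclidean distance with a simple majority vote is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training instance
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training instances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical, not the wine dataset): two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # query near the first group
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3))  # query near the second group
```

Note that there is no training step beyond storing the data: all the work happens at prediction time, which is what "instance-based" refers to.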

## 2. Wine Quality Dataset - Data Description

For this problem set you will be using the `wine quality dataset`. Below is a description of the various parameters listed in that dataset (i.e. potential features):

* fixed.acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily) 
* volatile.acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste 
* citric.acid (g / dm^3): found in small quantities, citric acid can add 'freshness' and flavor to wines 
* residual.sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet 
* chlorides (sodium chloride - g / dm^3): the amount of salt in the wine 
* free.sulfur.dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine 
* total.sulfur.dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine 
* density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content 
* pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale 
* sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant 
* alcohol (% by volume): the percent alcohol content of the wine 
* quality: quality score between 0 and 10



## Download dataset from Kaggle
You will use the Kaggle CLI to download the `Wine Quality Dataset` to your Colab environment. You will need to upload your Kaggle API key (see problem_set 01 for directions on how to obtain your API key).

In [241]:
# install kaggle CLI
!pip install -q kaggle

In [242]:
# Upload the kaggle API key of your account 
from google.colab import files 
files.upload()
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle (5).json


In [243]:
# View list of data files available in the dataset. 
# Format : kaggle datasets files <dataset-URI>
!kaggle datasets files cchristoforou/practice-dataset-for-tutorials

name                  size  creationDate         
-------------------  -----  -------------------  
wine.data             11KB  2021-01-23 15:26:18  
country_total.csv    533KB  2021-01-23 15:26:18  
countries.csv          2KB  2021-01-23 15:26:18  
wineQualityReds.csv   92KB  2021-01-23 15:26:18  


In [244]:
# Download - Specify the parameters.  
kaggle_dataset_URI = "cchristoforou/practice-dataset-for-tutorials"
output_folder = "sample_data/problem_set02"
kaggle_data_file1 = "wineQualityReds.csv"

In [245]:
# Download the data file from the dataset - wineQualityReds.csv
!kaggle datasets download $kaggle_dataset_URI --file $kaggle_data_file1 --path $output_folder 


wineQualityReds.csv: Skipping, found more recently modified local copy (use --force to force download)


## Load the data 
The code below shows how to load the data into a pandas `DataFrame` and apply a train_test_split to the data. 

In [246]:
# Code to load the data from file. Here we use the pandas library to read the csv file. 
datafile = "./sample_data/problem_set02/wineQualityReds.csv"
wine_df = pd.read_csv(datafile)
wine_df.drop(wine_df.columns[0],axis=1,inplace=True)

In [247]:
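# Split the data into a training and testing set using the sklearn function train_test_split
# Note that test_size=.25 holds out 25% of the samples for testing, and random_state=42 makes the split reproducible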
X_train, X_test, y_train, y_test = train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.25, random_state=42)
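
As a quick check of what `test_size` and `random_state` do, the sketch below applies the same call to stand-in data. The arrays `X_demo` and `y_demo` are hypothetical placeholders with arbitrary values; only the row count (1599, the total of the training and testing samples reported in this notebook) is taken from the problem set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the wine table (1599); values are arbitrary
X_demo = np.arange(1599 * 2).reshape(1599, 2)
y_demo = np.arange(1599) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=.25, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=.25, random_state=42)
print(np.array_equal(X_te, X_te2))
```

With 1599 rows, a 25% test fraction is rounded up to 400 test samples, leaving 1199 for training.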


## Challenge 1
Use the variables `X_train`, `X_test`, `y_train`, and `y_test` to explore your data. In particular, calculate and display the following information.

* Number of samples in the training set in total and in each class.
* Number of samples in the testing set in total and in each class.
* Number of features in the dataset. 
* Number of classes in the dataset.
* IDs of the classes (i.e. the class labels).


In [248]:
# Solution 1:

wine_df_train, wine_df_features = X_train.shape 
wine_df_test, _ = X_test.shape
wine_df_classes = len(np.unique(y_train))

#1 Number of samples in the training set in total and in each class
print("Number of samples in training set:  %d \n(%d : Class 3, %d : Class 4, %d : Class 5, %d : Class 6, %d : Class 7, %d : Class 8) " % (wine_df_train, np.sum(y_train==3), np.sum(y_train==4), np.sum(y_train==5), np.sum(y_train==6), np.sum(y_train==7), np.sum(y_train==8)))

#2 Number of samples in the testing set in total and in each class
print("Number of samples in testing set: %d \n(%d : Class 3, %d : Class 4, %d : Class 5, %d : Class 6, %d : Class 7, %d : Class 8)" % (wine_df_test, np.sum(y_test==3), np.sum(y_test==4), np.sum(y_test==5), np.sum(y_test==6), np.sum(y_test==7), np.sum(y_test==8)))

#3 Number of features in the dataset.
print("Number of features: "+ str(wine_df_features))

#4 Number of classes in the dataset.
print("Number of classes: " + str(wine_df_classes))

#5 IDs of the classes (class labels).
print("IDs for class labels: " + str(np.unique(y_train)))

Number of samples in training set:  1199 
(9 : Class 3, 40 : Class 4, 517 : Class 5, 469 : Class 6, 151 : Class 7, 13 : Class 8) 
Number of samples in testing set: 400 
(1 : Class 3, 13 : Class 4, 164 : Class 5, 169 : Class 6, 48 : Class 7, 5 : Class 8)
Number of features: 11
Number of classes: 6
IDs for class labels: [3 4 5 6 7 8]
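
The per-class counts can also be obtained without hard-coding the class IDs. The sketch below is one possible approach using pandas `value_counts`; the series `y_demo` is a hypothetical stand-in for `y_train`, not the real data.

```python
import pandas as pd

# Hypothetical labels standing in for y_train (not the real data)
y_demo = pd.Series([5, 6, 5, 7, 6, 5, 3, 8, 6, 5], name="quality")

# Samples per class, ordered by class ID
counts = y_demo.value_counts().sort_index()
print(counts)
print("Number of classes:", counts.size)
print("Class IDs:", counts.index.tolist())
```

This keeps working if a class is missing from one of the splits, since only the labels actually present are counted.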


# Challenge 2

Train a **K-NN** classifier using the `(X_train,y_train)` dataset and use the trained model to predict the underlying classes for the observations in the test dataset `X_test`. Store your prediction in a variable called `y_pred`.

In [249]:
# Solution 2:

# Create the K-NN classifier (K = 5 neighbors)
model = KNeighborsClassifier(n_neighbors=5) 

# Fit the model to the training data
model.fit(X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [250]:
# To predict output
y_pred = model.predict(X_test)
print(f"x_test.shape {X_test.shape} y_pred.shape{y_pred.shape}")

x_test.shape (400, 11) y_pred.shape(400,)


In [251]:
# Inspect the content of y_pred
y_pred

array([5, 5, 6, 5, 6, 6, 5, 5, 5, 5, 8, 5, 6, 6, 6, 7, 5, 6, 7, 6, 7, 5,
       5, 6, 5, 6, 5, 5, 5, 6, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 5, 6, 5,
       5, 6, 6, 6, 5, 6, 5, 5, 5, 5, 5, 5, 6, 5, 6, 6, 5, 5, 5, 5, 6, 6,
       6, 5, 5, 6, 5, 5, 6, 5, 6, 5, 5, 5, 5, 5, 5, 5, 7, 5, 6, 5, 5, 6,
       6, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 5, 5, 5, 6, 7, 6, 6, 5, 6, 5, 5,
       6, 6, 7, 5, 6, 6, 6, 5, 5, 6, 5, 5, 6, 5, 5, 5, 6, 6, 5, 6, 6, 6,
       5, 7, 6, 5, 6, 5, 6, 6, 5, 5, 6, 6, 5, 5, 6, 7, 6, 5, 5, 6, 5, 5,
       6, 6, 5, 6, 6, 5, 5, 6, 5, 5, 6, 6, 5, 5, 5, 5, 6, 5, 6, 5, 5, 5,
       5, 5, 5, 5, 5, 6, 5, 7, 5, 5, 6, 6, 7, 5, 5, 6, 5, 6, 5, 5, 6, 6,
       6, 6, 5, 5, 5, 6, 6, 5, 6, 5, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 5, 6,
       6, 5, 5, 6, 6, 6, 6, 5, 5, 5, 7, 6, 7, 5, 5, 6, 5, 6, 5, 5, 6, 5,
       6, 6, 6, 6, 5, 5, 7, 5, 6, 5, 5, 6, 6, 5, 6, 6, 6, 6, 5, 6, 5, 7,
       6, 7, 5, 5, 6, 5, 6, 5, 6, 6, 5, 5, 6, 7, 5, 6, 5, 6, 5, 5, 5, 6,
       5, 5, 6, 6, 5, 6, 5, 6, 5, 6, 5, 6, 6, 6, 5,

# Challenge 3

Evaluate the performance of your classifier. Calculate and display the following:
* the `confusion matrix`.
* the `normalized confusion matrix`. 
* the probability of correct classification (accuracy score). 
* the `precision`, `recall`, and `f1-score` for each class.

In [252]:
y_true = y_test

In [253]:
# Solution 3:

# Print the confusion matrix
print("This is the confusion matrix")
cnf_mx = metrics.confusion_matrix(y_true, y_pred)
print(cnf_mx)

# Normalized confusion matrix (each cell divided by the total number of test samples)
print("\nThis is the normalized confusion matrix")
cnf_mx_joint = cnf_mx.astype('float')/cnf_mx.sum()
print(cnf_mx_joint)

# The probability of correct classification (accuracy score).
acc = metrics.accuracy_score(y_true, y_pred)
print("\n Accuracy: %.3f" % acc)

# The precision, recall, and f1-score for each class.

# Precision
precision = metrics.precision_score(y_true, y_pred, average = None)
print("\n Precision for each class" , precision)

# Recall 
recall = metrics.recall_score(y_true, y_pred, average = None)
print("\n Recall for each class:" , recall)

# f1-score
f1 = metrics.f1_score(y_true, y_pred, average = None)
print("\n f1-score for each class:" , f1)

This is the confusion matrix
[[  0   0   1   0   0   0]
 [  0   0   6   6   1   0]
 [  0   1 105  53   5   0]
 [  0   1  84  75   9   0]
 [  0   0  16  24   7   1]
 [  0   0   1   3   1   0]]

This is the normalized confusion matrix
[[0.     0.     0.0025 0.     0.     0.    ]
 [0.     0.     0.015  0.015  0.0025 0.    ]
 [0.     0.0025 0.2625 0.1325 0.0125 0.    ]
 [0.     0.0025 0.21   0.1875 0.0225 0.    ]
 [0.     0.     0.04   0.06   0.0175 0.0025]
 [0.     0.     0.0025 0.0075 0.0025 0.    ]]

 Accuracy: 0.468

 Precision for each class [0.         0.         0.49295775 0.46583851 0.30434783 0.        ]

 Recall for each class: [0.         0.         0.6402439  0.44378698 0.14583333 0.        ]

 f1-score for each class: [0.         0.         0.55702918 0.45454545 0.1971831  0.        ]
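
The solution above normalizes the confusion matrix by the total number of test samples. Another common choice is to normalize each row by the number of true samples in that class. The sketch below does this for the matrix printed above, assuming the scikit-learn convention that rows are true classes and columns are predicted classes; the name `row_normalized` is introduced only for this example.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true classes 3..8, columns: predicted)
cnf_mx = np.array([
    [0, 0,   1,  0, 0, 0],
    [0, 0,   6,  6, 1, 0],
    [0, 1, 105, 53, 5, 0],
    [0, 1,  84, 75, 9, 0],
    [0, 0,  16, 24, 7, 1],
    [0, 0,   1,  3, 1, 0],
])

# Divide each row by its total so that every row sums to 1
row_normalized = cnf_mx / cnf_mx.sum(axis=1, keepdims=True)
print(np.round(row_normalized, 3))

# The diagonal of the row-normalized matrix is the per-class recall
print(np.round(np.diag(row_normalized), 3))
```

The diagonal reproduces the recall values reported above, which makes the weak performance on the rare classes 3, 4 and 8 easy to see.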


# Challenge 4

The code below loads the same dataset but treats it as a binary classification problem. That is, instead of classifying an observation into one of 11 categories (scores 0 to 10), we consider all observations with a score above 5 as good and all observations with a score of 5 or below as bad.





In [254]:
# Code to load the data from file. Here we use the pandas library to read the csv file. 
datafile = "./sample_data/problem_set02/wineQualityReds.csv"
wine_df = pd.read_csv(datafile)
wine_df.drop(wine_df.columns[0],axis=1,inplace=True)

wine_df['quality'] = np.where(wine_df['quality']>5,"Good","Bad")

In [255]:
wine_df.head()

Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,Bad
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,Bad
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,Bad
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,Good
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,Bad
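
The relabeling step relies on `np.where`, which picks one of two values for each element depending on a condition. A small sketch on made-up scores (the series `scores` is hypothetical, not rows from the dataset) shows how the threshold behaves at the boundary value 5:

```python
import numpy as np
import pandas as pd

# Hypothetical quality scores (not rows from the dataset)
scores = pd.Series([3, 5, 6, 8, 5, 7])

# Scores strictly above 5 become "Good"; scores of 5 or below become "Bad"
labels = np.where(scores > 5, "Good", "Bad")
print(labels.tolist())
```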


In [256]:
X_train, X_test, y_train, y_test = train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.25, random_state=42)


## Challenge 4.1
Use the variables `X_train`, `X_test`, `y_train`, and `y_test` to explore your data. In particular, calculate and display the following information.
* Number of samples in the training set in total and in each class.
* Number of samples in the testing set in total and in each class.
* Number of features in the dataset. 
* Number of classes in the dataset.
* IDs of the classes (i.e. the class labels).




In [257]:
# Solution 4.1:

wine_df_train, wine_df_features = X_train.shape 
wine_df_test, _ = X_test.shape
wine_df_classes = len(np.unique(y_train))

#1 Number of samples in the training set in total and in each class
print("Number of samples in training set:  %d  (%d Good, %d Bad) " % (wine_df_train, np.sum(y_train=="Good"), np.sum(y_train=="Bad")))

#2 Number of samples in the testing set in total and in each class
print("Number of samples in testing set:  %d  (%d Good, %d Bad) " % (wine_df_test, np.sum(y_test=="Good"), np.sum(y_test=="Bad")))

#3 Number of features in the dataset.
print("Number of features: "+ str(wine_df_features))

#4 Number of classes in the dataset.
print("Number of classes: " + str(wine_df_classes))

#5 IDs of the number of classes.
print("IDs for class labels: " + str(np.unique(y_train)))

Number of samples in training set:  1199  (633 Good, 566 Bad) 
Number of samples in testing set:  400  (222 Good, 178 Bad) 
Number of features: 11
Number of classes: 2
IDs for class labels: ['Bad' 'Good']


## Challenge 4.2 
Train a **K-NN** classifier using the `(X_train,y_train)` dataset and use the trained model to predict the underlying classes for the observations in the test dataset `X_test`. Store your prediction in a variable called `y_pred`.

In [258]:
# Create the K-NN classifier (K = 5 neighbors)
model = KNeighborsClassifier(n_neighbors=5) 

# Fit the model to the training data
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [259]:
# To predict output
y_pred = model.predict(X_test)
print(f"x_test.shape {X_test.shape} y_pred.shape{y_pred.shape}")

x_test.shape (400, 11) y_pred.shape(400,)


In [260]:
# Inspect the content of y_pred
y_pred

array(['Bad', 'Bad', 'Good', 'Good', 'Good', 'Good', 'Bad', 'Bad', 'Bad',
       'Bad', 'Good', 'Bad', 'Good', 'Good', 'Good', 'Good', 'Bad',
       'Good', 'Good', 'Good', 'Bad', 'Bad', 'Good', 'Good', 'Good',
       'Good', 'Bad', 'Bad', 'Bad', 'Good', 'Bad', 'Bad', 'Bad', 'Bad',
       'Good', 'Bad', 'Good', 'Good', 'Good', 'Good', 'Good', 'Bad',
       'Good', 'Bad', 'Good', 'Good', 'Good', 'Good', 'Bad', 'Good',
       'Bad', 'Bad', 'Bad', 'Good', 'Bad', 'Bad', 'Good', 'Good', 'Good',
       'Good', 'Bad', 'Good', 'Bad', 'Bad', 'Good', 'Good', 'Good', 'Bad',
       'Good', 'Good', 'Bad', 'Bad', 'Good', 'Bad', 'Good', 'Bad', 'Bad',
       'Bad', 'Bad', 'Bad', 'Bad', 'Good', 'Good', 'Bad', 'Good', 'Bad',
       'Bad', 'Good', 'Good', 'Bad', 'Good', 'Bad', 'Bad', 'Bad', 'Bad',
       'Good', 'Good', 'Good', 'Good', 'Bad', 'Bad', 'Bad', 'Good',
       'Good', 'Good', 'Good', 'Bad', 'Good', 'Bad', 'Bad', 'Good',
       'Good', 'Good', 'Bad', 'Good', 'Good', 'Good', 'Bad', 'Bad',
      

## Challenge 4.3
Evaluate the performance of your classifier. Calculate and display the following:
* the `confusion matrix`.
* the `normalized confusion matrix`. 
* the probability of correct classification (accuracy score). 
* the `precision`, `recall`, and `f1-score` for each class.

In [261]:
y_true = y_test

In [262]:
# Solution 4.3: 

# Confusion Matrix
print("This is the confusion matrix")
cnf_mx = metrics.confusion_matrix(y_true, y_pred)
print(cnf_mx)

# Normalized confusion matrix (each cell divided by the total number of test samples)
print("\nThis is the normalized confusion matrix")
cnf_mx_joint = cnf_mx.astype('float')/cnf_mx.sum()
print(cnf_mx_joint)

# The probability of correct classification (accuracy score).
acc = metrics.accuracy_score(y_true, y_pred)
print("\nAccuracy: %.3f" % acc)

# The precision, recall, and f1-score for each class.

# Precision
precision = metrics.precision_score(y_true, y_pred, pos_label="Good")
print("\nPrecision (for class 'Good'): %.3f" % precision)
precision = metrics.precision_score(y_true, y_pred, pos_label="Bad")
print("Precision (for class 'Bad'): %.3f" % precision)

# Recall 
recall = metrics.recall_score(y_true, y_pred,pos_label="Good")
print("\nRecall (for class 'Good'): %.3f" % recall)
recall = metrics.recall_score(y_true, y_pred,pos_label="Bad")
print("Recall (for class 'Bad'): %.3f" % recall)

# f1-score
f1 = metrics.f1_score(y_true, y_pred, pos_label="Good")
print("\nf1-score (for class 'Good'): %.3f" % f1)
f1 = metrics.f1_score(y_true, y_pred, pos_label="Bad")
print("f1-score (for class 'Bad'): %.3f" % f1)



This is the confusion matrix
[[104  74]
 [ 78 144]]

This is the normalized confusion matrix
[[0.26  0.185]
 [0.195 0.36 ]]

Accuracy: 0.620

Precision (for class 'Good'): 0.661
Precision (for class 'Bad'): 0.571

Recall (for class 'Good'): 0.649
Recall (for class 'Bad'): 0.584

f1-score (for class 'Good'): 0.655
f1-score (for class 'Bad'): 0.578
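
To see where these numbers come from, the sketch below recomputes them directly from the confusion matrix printed above. It assumes the scikit-learn convention (rows are true classes, columns are predicted classes) with the labels in sorted order, `['Bad', 'Good']`, which agrees with the 178 Bad and 222 Good test samples counted in Challenge 4.1.

```python
import numpy as np

# Binary confusion matrix from the output above.
# Rows are true classes, columns are predicted classes, ordered ['Bad', 'Good'].
cnf_mx = np.array([[104, 74],
                   [78, 144]])

accuracy = np.trace(cnf_mx) / cnf_mx.sum()
precision = np.diag(cnf_mx) / cnf_mx.sum(axis=0)   # correct / predicted, per class
recall = np.diag(cnf_mx) / cnf_mx.sum(axis=1)      # correct / actual, per class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(["Bad", "Good"], precision, recall, f1):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"accuracy={accuracy:.3f}")
```

For example, the precision for 'Good' is 144 / (74 + 144), the share of samples predicted as 'Good' that really are 'Good', while the recall for 'Good' is 144 / (78 + 144), the share of truly 'Good' samples that the model found.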


# Challenge 5

The **K-NN** classifier accepts a number of parameters. One of those parameters is the number K (i.e. the number of nearest neighbors to consider when making a prediction). Evaluate the classifier for different values of K and identify which configuration achieves the best performance on the testing set. Plot or print your results.


In [263]:
# Solution 5:

# Evaluate the K-NN classifier for K = 2, ..., 10
k_values = range(2, 11)
for k in k_values:
    # Create the K-NN classifier with k neighbours
    model = KNeighborsClassifier(n_neighbors=k)
    # Fit the model
    model.fit(X_train, y_train)
    # Predict the classes of the test observations
    y_pred = model.predict(X_test)
    if k == k_values[0]:
        # Sanity-check the shapes once, on the first iteration
        print(f"x_test.shape {X_test.shape} y_pred.shape{y_pred.shape}")
    # The probability of correct classification (accuracy score)
    acc = metrics.accuracy_score(y_true, y_pred)
    print("\nAccuracy score of %d Neighbours: %.3f" % (k, acc))
    # Confusion matrix
    print("This is the confusion matrix")
    cnf_mx = metrics.confusion_matrix(y_true, y_pred)
    print(cnf_mx)

x_test.shape (400, 11) y_pred.shape(400,)

Accuracy score of 2 Neighbours: 0.588
This is the confusion matrix
[[136  42]
 [123  99]]

Accuracy score of 3 Neighbours: 0.620
This is the confusion matrix
[[108  70]
 [ 82 140]]

Accuracy score of 4 Neighbours: 0.593
This is the confusion matrix
[[130  48]
 [115 107]]

Accuracy score of 5 Neighbours: 0.620
This is the confusion matrix
[[104  74]
 [ 78 144]]

Accuracy score of 6 Neighbours: 0.598
This is the confusion matrix
[[128  50]
 [111 111]]

Accuracy score of 7 Neighbours: 0.598
This is the confusion matrix
[[103  75]
 [ 86 136]]

Accuracy score of 8 Neighbours: 0.598
This is the confusion matrix
[[119  59]
 [102 120]]

Accuracy score of 9 Neighbours: 0.605
This is the confusion matrix
[[101  77]
 [ 81 141]]

Accuracy score of 10 Neighbours: 0.595
This is the confusion matrix
[[116  62]
 [100 122]]


**K = 3 and K = 5 achieve the best performance on the testing set, each with an accuracy score of 0.620 (248 of 400 test samples classified correctly).**
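
The comparison can also be done programmatically rather than by reading the printed output. The sketch below uses the accuracy values reported above, copied into a dictionary (`accuracy_by_k` is a name introduced for this example), and collects every K that reaches the maximum so that ties are not hidden.

```python
# Test-set accuracies copied from the output above, keyed by K
accuracy_by_k = {2: 0.588, 3: 0.620, 4: 0.593, 5: 0.620, 6: 0.598,
                 7: 0.598, 8: 0.598, 9: 0.605, 10: 0.595}

best_accuracy = max(accuracy_by_k.values())
# Several values of K can tie, so collect all of them
best_ks = [k for k, acc in accuracy_by_k.items() if acc == best_accuracy]

print(f"Best accuracy: {best_accuracy:.3f} at K = {best_ks}")
```

In a live notebook the dictionary would be filled inside the evaluation loop instead of being typed in by hand.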


Copyright Statement: Copyright © 2020 Christoforou. The materials provided by the instructor of this course, including this notebook, are for the use of the students enrolled in the course. Materials are presented in an educational context for personal use and study and should not be shared, distributed, disseminated or sold in print — or digitally — outside the course without permission. You may not, nor may you knowingly allow others to reproduce or distribute lecture notes, course materials as well as any of their derivatives without the instructor's express written consent.