This notebook is part of Dr. Christoforos Christoforou's course materials. You may not, nor may you knowingly allow others to, reproduce or distribute lecture notes, course materials or any of their derivatives without the instructor's express written consent.

# Problem Set 02 - Instance-based Classifiers
**Professor:** Dr. Christoforos Christoforou

For this problem set you will need the following libraries, which are pre-installed in the Colab environment:

* [Numpy](https://www.numpy.org/) is an array manipulation library, used for linear algebra, Fourier transform, and random number capabilities.
* [Pandas](https://pandas.pydata.org/) is a library for data manipulation and data analysis.
* [Matplotlib](https://matplotlib.org/) is a library that generates figures and provides a graphical user interface toolkit.

You can load them using the following import statement:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 1. Objective 
As part of this problem set, you will work with the `wine quality dataset` in order to:
- explore the physicochemical properties of red wine
- determine an optimal machine learning model for red wine quality classification

For that, you will be using an `instance-based` classifier, namely the K-NN algorithm. Review the information provided in the problem set, and complete all challenges listed.
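
To make the idea of an instance-based classifier concrete, here is a minimal NumPy sketch of K-NN prediction, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy data are hypothetical and not part of the problem set; the challenges themselves use scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Label each query row by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in X_query:
        # Euclidean distance from the query point to every stored training instance
        distances = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k closest training instances
        nearest = np.argsort(distances)[:k]
        # Majority vote; on a tie the smallest label wins because np.unique sorts labels
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Hypothetical toy data: two well-separated groups in 2-D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_predictions = knn_predict(X_toy, y_toy, [[0.05, 0.05], [5.0, 5.1]], k=3)
print(toy_predictions)
```

Note that there is no fitting step in this sketch beyond keeping the training data around: all of the work happens at prediction time, which is what "instance-based" refers to.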

## 2. Wine Quality Dataset - Data Description

For this problem set you will be using the `wine quality dataset`. Below is a description of the various parameters listed in that dataset (i.e. potential features):

* fixed.acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily) 
* volatile.acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste 
* citric.acid (g / dm^3): found in small quantities, citric acid can add 'freshness' and flavor to wines 
* residual.sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet 
* chlorides (sodium chloride - g / dm^3): the amount of salt in the wine 
* free.sulfur.dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine 
* total.sulfur.dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine 
* density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content 
* pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale 
* sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant 
* alcohol (% by volume): the percent alcohol content of the wine 
* quality: quality score between 0 and 10



## Download the dataset from Kaggle
You will use the Kaggle CLI to download the `Wine Quality Dataset` to your Colab environment. You will need to upload your Kaggle API key (see Problem Set 01 for directions on how to obtain your API key).

In [None]:
# install kaggle CLI
!pip install -q kaggle

In [None]:
# Upload the kaggle API key of your account 
from google.colab import files 
files.upload()
# -p avoids an error if the folder already exists (e.g. when the cell is re-run)
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [None]:
# View list of data files available in the dataset. 
# Format: kaggle datasets files <dataset-URI>
!kaggle datasets files cchristoforou/practice-dataset-for-tutorials

name                  size  creationDate         
-------------------  -----  -------------------  
wine.data             11KB  2021-01-23 15:26:18  
countries.csv          2KB  2021-01-23 15:26:18  
wineQualityReds.csv   92KB  2021-01-23 15:26:18  
country_total.csv    533KB  2021-01-23 15:26:18  


In [None]:
# Download - Specify the parameters.  
kaggle_dataset_URI = "cchristoforou/practice-dataset-for-tutorials"
output_folder = "sample_data/problem_set02"
kaggle_data_file1 = "wineQualityReds.csv"

In [None]:
# Download the wine quality file from the dataset - wineQualityReds.csv
!kaggle datasets download $kaggle_dataset_URI --file $kaggle_data_file1 --path $output_folder 


Downloading wineQualityReds.csv to sample_data/problem_set02
  0% 0.00/92.1k [00:00<?, ?B/s]
100% 92.1k/92.1k [00:00<00:00, 39.2MB/s]


## Load the data 
The code below shows how to load the data into a pandas `DataFrame` and apply a `train_test_split` to the data.

In [3]:
# Code to load the data from file. Here we use the pandas library to read the csv file. 
datafile = "/content/sample_data/wineQualityReds.csv"
wine_df = pd.read_csv(datafile)
# Drop the extra first column (a row index, not one of the features described above)
wine_df.drop(wine_df.columns[0],axis=1,inplace=True)

In [6]:
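# Split the data into a training and a testing set using the sklearn function train_test_split.
# Note that test_size=.25 holds out 25% of the samples for testing and random_state=42 makes the split reproducible.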
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
 
X_train, X_test, y_train, y_test = train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.25, random_state=42)
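
As a rough illustration of what the split above does, the following sketch applies `train_test_split` with the same `test_size` and `random_state` to a small made-up frame. The frame `toy_df` and all variable names here are hypothetical stand-ins, not the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: one feature column plus a "quality" label column
toy_df = pd.DataFrame({"feature": np.arange(20), "quality": [5, 6] * 10})

features = toy_df.drop("quality", axis=1)
labels = toy_df["quality"]

# test_size=.25 holds out a quarter of the rows; random_state fixes the shuffle
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    features, labels, test_size=.25, random_state=42)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    features, labels, test_size=.25, random_state=42)

print("train shape:", Xa_train.shape, "test shape:", Xa_test.shape)
# Repeating the call with the same random_state selects the same rows
print("same test rows on both calls:", Xa_test.index.equals(Xb_test.index))
```

With 20 rows, a quarter (5 rows) goes to the test set and the remaining 15 to the training set, and no row appears in both.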

## Challenge 1
Use the variables `X_train`, `X_test`, `y_train`, and `y_test` to explore your data. In particular, calculate and display the following information.

* Number of samples in the training set in total and in each class.
* Number of samples in the testing set in total and in each class.
* Number of features in the dataset. 
* Number of classes in the dataset.
* IDs (labels) of the classes.


In [14]:
# Your Solution here
print("x_train:", X_train.shape, ". y_train:", y_train.shape)
print("x_test:", X_test.shape, ". y_test:", y_test.shape)
print("ID: " + str(np.unique(y_test)))

x_train: (1199, 11) . y_train: (1199,)
x_test: (400, 11) . y_test: (400,)
ID: [3 4 5 6 7 8]


In [None]:
########
# Samples = 1599 (1199 training + 400 testing)
# Features = 11
# Classes = 6
# Class IDs: 3, 4, 5, 6, 7, 8
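
The per-class sample counts asked for in Challenge 1 could be obtained with pandas `value_counts`. The sketch below uses a short, made-up label series (`y_demo`) as a stand-in for `y_train` or `y_test`; the values are hypothetical.

```python
import pandas as pd

# Hypothetical labels standing in for y_train / y_test
y_demo = pd.Series([5, 6, 5, 7, 6, 5, 3], name="quality")

# Count how many samples fall in each class, ordered by class ID
class_counts = y_demo.value_counts().sort_index()

print("Total samples:", len(y_demo))
print("Samples per class:")
print(class_counts)
print("Number of classes:", y_demo.nunique())
print("Class IDs:", class_counts.index.tolist())
```

Applied to the notebook's variables, the same calls would be `y_train.value_counts().sort_index()` and `y_test.value_counts().sort_index()`.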

## Challenge 2

Train a **K-NN** classifier using the `(X_train,y_train)` dataset and use the trained model to predict the underlying classes for the observations in the test dataset `X_test`. Store your prediction in a variable called `y_pred`.

In [13]:
# Your solution 
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred

array([5, 5, 6, 5, 6, 5, 5, 5, 6, 6, 8, 5, 6, 6, 6, 7, 6, 5, 7, 5, 4, 5,
       5, 5, 5, 5, 7, 5, 5, 5, 5, 5, 5, 6, 5, 5, 6, 6, 6, 6, 6, 5, 6, 5,
       6, 6, 6, 6, 5, 4, 5, 5, 5, 7, 4, 4, 6, 7, 6, 5, 5, 8, 6, 5, 6, 6,
       7, 5, 5, 6, 5, 5, 6, 5, 6, 5, 5, 5, 5, 5, 5, 7, 5, 5, 6, 5, 5, 6,
       6, 4, 5, 5, 4, 6, 5, 6, 5, 4, 5, 5, 5, 5, 6, 7, 6, 6, 6, 6, 5, 5,
       6, 6, 7, 5, 6, 6, 5, 5, 5, 7, 5, 6, 7, 5, 5, 5, 6, 6, 5, 6, 6, 6,
       5, 5, 4, 5, 6, 6, 4, 6, 5, 5, 7, 6, 6, 5, 6, 7, 6, 5, 6, 6, 5, 5,
       6, 6, 5, 4, 6, 5, 7, 5, 5, 5, 6, 6, 6, 5, 5, 5, 6, 4, 7, 6, 5, 5,
       4, 4, 5, 7, 6, 5, 5, 6, 5, 6, 6, 6, 7, 6, 6, 6, 5, 7, 4, 5, 6, 5,
       3, 6, 5, 5, 5, 6, 7, 5, 6, 7, 4, 5, 7, 5, 6, 7, 6, 5, 7, 6, 5, 5,
       6, 5, 6, 6, 6, 6, 5, 6, 5, 5, 5, 5, 7, 4, 5, 6, 5, 6, 5, 5, 7, 5,
       5, 5, 6, 7, 5, 5, 7, 5, 6, 5, 5, 6, 6, 5, 6, 6, 8, 6, 6, 6, 4, 7,
       6, 6, 5, 5, 6, 6, 6, 4, 6, 6, 5, 5, 6, 7, 5, 6, 5, 6, 5, 5, 5, 6,
       5, 5, 6, 6, 5, 6, 5, 6, 5, 5, 5, 6, 5, 5, 5,

## Challenge 3

Evaluate the performance of your classifier. Calculate and display the following:
* the `confusion matrix`.
* the `normalized confusion matrix`.
* the probability of correct classification (accuracy score).
* the `precision`, `recall`, and `f1-score` for each class.

In [18]:
# Your solution 
matrix = metrics.confusion_matrix(y_test, y_pred)
matrix

array([[ 0,  0,  1,  0,  0,  0],
       [ 0,  1,  3,  8,  1,  0],
       [ 1, 12, 98, 48,  4,  1],
       [ 0, 15, 74, 62, 18,  0],
       [ 0,  0, 10, 21, 15,  2],
       [ 0,  0,  1,  2,  2,  0]])

In [20]:
normalized_matrix = matrix.astype('float32') / matrix.sum()
normalized_matrix

array([[0.    , 0.    , 0.0025, 0.    , 0.    , 0.    ],
       [0.    , 0.0025, 0.0075, 0.02  , 0.0025, 0.    ],
       [0.0025, 0.03  , 0.245 , 0.12  , 0.01  , 0.0025],
       [0.    , 0.0375, 0.185 , 0.155 , 0.045 , 0.    ],
       [0.    , 0.    , 0.025 , 0.0525, 0.0375, 0.005 ],
       [0.    , 0.    , 0.0025, 0.005 , 0.005 , 0.    ]], dtype=float32)
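
The normalization above divides every cell by the total number of test samples. Another common convention divides each row by its own total, so that each row shows how the samples of one true class were distributed across predictions. A minimal NumPy sketch of that variant, using the confusion matrix printed above, is given here; as far as I know, recent scikit-learn versions offer the same via `metrics.confusion_matrix(y_test, y_pred, normalize='true')`, but treat that as something to verify against your installed version.

```python
import numpy as np

# Confusion matrix printed above (rows: true classes 3..8, columns: predicted classes 3..8)
matrix = np.array([[ 0,  0,  1,  0,  0,  0],
                   [ 0,  1,  3,  8,  1,  0],
                   [ 1, 12, 98, 48,  4,  1],
                   [ 0, 15, 74, 62, 18,  0],
                   [ 0,  0, 10, 21, 15,  2],
                   [ 0,  0,  1,  2,  2,  0]])

# Divide each row by its total so that every row sums to 1
row_normalized = matrix / matrix.sum(axis=1, keepdims=True)
print(np.round(row_normalized, 2))

# Accuracy is the share of samples that lie on the main diagonal
accuracy = np.trace(matrix) / matrix.sum()
print("accuracy:", round(accuracy, 2))
```

In the row-normalized form, the diagonal entries are the per-class recall values.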

In [28]:
print(metrics.classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           3       0.00      0.00      0.00         1
           4       0.04      0.08      0.05        13
           5       0.52      0.60      0.56       164
           6       0.44      0.37      0.40       169
           7       0.38      0.31      0.34        48
           8       0.00      0.00      0.00         5

    accuracy                           0.44       400
   macro avg       0.23      0.23      0.22       400
weighted avg       0.45      0.44      0.44       400
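
To see where the numbers in the report come from, the per-class precision, recall and f1-score can be recomputed by hand from the confusion matrix of this challenge. The sketch below is one way to do it with NumPy; the variable names are illustrative only.

```python
import numpy as np

class_ids = [3, 4, 5, 6, 7, 8]
# Confusion matrix from this challenge (rows: true class, columns: predicted class)
matrix = np.array([[ 0,  0,  1,  0,  0,  0],
                   [ 0,  1,  3,  8,  1,  0],
                   [ 1, 12, 98, 48,  4,  1],
                   [ 0, 15, 74, 62, 18,  0],
                   [ 0,  0, 10, 21, 15,  2],
                   [ 0,  0,  1,  2,  2,  0]])

true_positives = np.diag(matrix).astype(float)
predicted_totals = matrix.sum(axis=0)  # column sums: how often each class was predicted
actual_totals = matrix.sum(axis=1)     # row sums: the support of each class

# Report 0 instead of dividing by zero when a total is 0
zeros = np.zeros_like(true_positives)
precision = np.divide(true_positives, predicted_totals, out=zeros.copy(), where=predicted_totals > 0)
recall = np.divide(true_positives, actual_totals, out=zeros.copy(), where=actual_totals > 0)
denominator = precision + recall
f1 = np.divide(2 * precision * recall, denominator, out=zeros.copy(), where=denominator > 0)

for class_id, p, r, f, s in zip(class_ids, precision, recall, f1, actual_totals):
    print(f"{class_id}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

Precision divides the diagonal by the column total (all samples predicted as that class), while recall divides it by the row total (all samples that truly belong to that class).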



## Challenge 4

The code below loads the same dataset but treats it as a binary classification problem. That is, instead of classifying an observation into one of 11 categories (quality scores 0 to 10), we consider all observations with a score above 5 as good and all observations with a score of 5 or below as bad.





In [29]:
# Code to load the data from file. Here we use the pandas library to read the csv file. 
datafile = "/content/sample_data/wineQualityReds.csv"
wine_df = pd.read_csv(datafile)
wine_df.drop(wine_df.columns[0],axis=1,inplace=True)

wine_df['quality'] = np.where(wine_df['quality']>5,"Good","Bad")
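
The relabeling step relies on `np.where`, which picks one of two values per element depending on a condition. A small sketch on made-up scores (the series `scores` is hypothetical) shows the effect of the threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical quality scores
scores = pd.Series([3, 5, 6, 8, 5, 7], name="quality")

# Scores strictly above 5 become "Good"; 5 and below become "Bad"
binary_labels = np.where(scores > 5, "Good", "Bad")
print(binary_labels)
```

A score of exactly 5 falls on the "Bad" side because the comparison is strict.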

In [30]:
X_train, X_test, y_train, y_test = train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.25, random_state=42)

### Challenge 4.1
Use the variables `X_train`, `X_test`, `y_train`, and `y_test` to explore your data. In particular, calculate and display the following information.
* Number of samples in the training set in total and in each class.
* Number of samples in the testing set in total and in each class.
* Number of features in the dataset. 
* Number of classes in the dataset.
* IDs (labels) of the classes.




In [31]:
# Your Solution 
print("x_train:", X_train.shape, ". y_train:", y_train.shape)
print("x_test:", X_test.shape, ". y_test:", y_test.shape)
print("ID: " + str(np.unique(y_test)))

x_train: (1199, 11) . y_train: (1199,)
x_test: (400, 11) . y_test: (400,)
ID: ['Bad' 'Good']


### Challenge 4.2
Train a **K-NN** classifier using the `(X_train,y_train)` dataset and use the trained model to predict the underlying classes for the observations in the test dataset `X_test`. Store your prediction in a variable called `y_pred`.

In [34]:
# Your solution 
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred

array(['Bad', 'Bad', 'Good', 'Bad', 'Good', 'Bad', 'Bad', 'Bad', 'Good',
       'Good', 'Good', 'Bad', 'Good', 'Good', 'Good', 'Good', 'Good',
       'Bad', 'Good', 'Bad', 'Bad', 'Good', 'Bad', 'Good', 'Bad', 'Good',
       'Good', 'Bad', 'Bad', 'Good', 'Bad', 'Bad', 'Bad', 'Good', 'Bad',
       'Bad', 'Good', 'Good', 'Good', 'Good', 'Good', 'Bad', 'Good',
       'Bad', 'Good', 'Good', 'Good', 'Good', 'Bad', 'Bad', 'Bad', 'Bad',
       'Bad', 'Good', 'Bad', 'Bad', 'Good', 'Good', 'Good', 'Bad', 'Bad',
       'Good', 'Good', 'Bad', 'Good', 'Good', 'Good', 'Bad', 'Bad',
       'Good', 'Bad', 'Bad', 'Good', 'Bad', 'Good', 'Bad', 'Bad', 'Bad',
       'Bad', 'Bad', 'Bad', 'Good', 'Bad', 'Bad', 'Good', 'Bad', 'Bad',
       'Good', 'Good', 'Bad', 'Good', 'Bad', 'Bad', 'Good', 'Bad', 'Good',
       'Bad', 'Good', 'Bad', 'Bad', 'Bad', 'Bad', 'Good', 'Good', 'Good',
       'Good', 'Good', 'Good', 'Bad', 'Bad', 'Good', 'Good', 'Good',
       'Bad', 'Good', 'Good', 'Good', 'Bad', 'Bad', 'Good', 'B

### Challenge 4.3
Evaluate the performance of your classifier. Calculate and display the following:
* the `confusion matrix`.
* the `normalized confusion matrix`.
* the probability of correct classification (accuracy score).
* the `precision`, `recall`, and `f1-score` for each class.

In [40]:
# Your Solution 
matrix = metrics.confusion_matrix(y_test, y_pred)
normalized_matrix = matrix.astype('float32') / matrix.sum()
print("Matrix:",matrix)
print("Normalized Matrix:", normalized_matrix)
class_report = metrics.classification_report(y_test, y_pred)
print(class_report)

Matrix: [[108  70]
 [ 82 140]]
Normalized Matrix: [[0.27  0.175]
 [0.205 0.35 ]]
              precision    recall  f1-score   support

         Bad       0.57      0.61      0.59       178
        Good       0.67      0.63      0.65       222

    accuracy                           0.62       400
   macro avg       0.62      0.62      0.62       400
weighted avg       0.62      0.62      0.62       400



## Challenge 5

The **K-NN** classifier accepts a number of parameters. One of those parameters is the number K (i.e. the number of nearest neighbors to consider when making a prediction). Evaluate the classifier for different values of K and identify which configuration achieves the best performance on the testing set. Plot or print your results.


In [48]:
# Your solution here.
# Fit and evaluate a K-NN classifier for each candidate value of K.
for k in (2, 3, 4):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred_k = knn.predict(X_test)
    print("K =", k)
    # classification_report expects the true labels first, then the predictions
    print(metrics.classification_report(y_test, y_pred_k))

K = 2
              precision    recall  f1-score   support

         Bad       0.53      0.76      0.62       178
        Good       0.70      0.45      0.55       222

    accuracy                           0.59       400
   macro avg       0.61      0.60      0.58       400
weighted avg       0.62      0.59      0.58       400

K = 3
              precision    recall  f1-score   support

         Bad       0.57      0.61      0.59       178
        Good       0.67      0.63      0.65       222

    accuracy                           0.62       400
   macro avg       0.62      0.62      0.62       400
weighted avg       0.62      0.62      0.62       400

K = 4
              precision    recall  f1-score   support

         Bad       0.53      0.73      0.61       178
        Good       0.69      0.48      0.57       222

    accuracy                           0.59       400
   macro avg       0.61      0.61      0.59       400
weighted avg       0.62      0.59      0.59       400



In [None]:
# K = 3 achieved the highest accuracy (0.62) of the three values tested; K = 2 and K = 4 both reached 0.59.
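
A wider range of K values can be compared with a simple loop that records the test accuracy for each setting. The sketch below is only an illustration: it runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), and names such as `X_demo`, `k_values` and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Synthetic stand-in data (NOT the wine dataset): 400 samples, 11 features, 2 classes
X_demo, y_demo = make_classification(n_samples=400, n_features=11,
                                     n_informative=5, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=.25, random_state=42)

k_values = list(range(1, 16))
accuracies = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(Xd_train, yd_train)
    # True labels first, predictions second
    accuracies.append(metrics.accuracy_score(yd_test, knn.predict(Xd_test)))

# Pick the K with the highest test accuracy (the first one wins on ties)
best_k = k_values[int(np.argmax(accuracies))]
for k, acc in zip(k_values, accuracies):
    print(f"K = {k:2d}  accuracy = {acc:.3f}")
print("Best K on this test split:", best_k)
```

To run this on the wine data, replace the synthetic arrays with `X_train`, `X_test`, `y_train` and `y_test`; the lists `k_values` and `accuracies` can then be passed to `plt.plot` to produce the requested plot.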


Copyright Statement: Copyright © 2020 Christoforou. The materials provided by the instructor of this course, including this notebook, are for the use of the students enrolled in the course. Materials are presented in an educational context for personal use and study and should not be shared, distributed, disseminated or sold in print — or digitally — outside the course without permission. You may not, nor may you knowingly allow others to reproduce or distribute lecture notes, course materials as well as any of their derivatives without the instructor's express written consent.