<a href="https://colab.research.google.com/github/shreyashrestha07/CUS615_ShreyaShrestha/blob/master/Problem_set_05_ModelEvaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is part the of Dr. Christoforos Christoforou's course materials. You may not, nor may you knowingly allow others to reproduce or distribute lecture notes, course materials or any of their derivatives without the instructor's express written consent.





## Problem Set 05: Hyperparameter Tuning and Model Selection

## Cross-validation example code
Oftentimes, we might consider one of several classification models as our predictive function, or we might want to identify the optimal set parameters for a particular classifier. For example, the KNN classifier expects the parameter `K` which indicates the number neighbors, the SVM classifier expects some hyperparameters such as the value `C`, and the `kernel` function. 

Typically, we want to select the best set of parameters for our particular model. One method for selecting such hyperparameters is `cross-validation`. 

The code below illustrates how you can use `sklearn` functions to perform `cross-validation`.

In [157]:
import numpy as np
import matplotlib.pyplot as plt 
from sklearn import metrics
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold 
from sklearn.svm import SVC

In [158]:
# Generate a sample Dataset. 
n_samples = 100
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.35)
X, y = noisy_circles

In [159]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

#
# Split the original dataset into Training and Testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=54, test_size=0.45, 
                                                    shuffle=True, stratify=y)

In [160]:
y_test

array([0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       0])

In [161]:
# Run one of the data generation cell above before executing this cell; 

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

#
# Split the original dataset into Training and Testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=54, test_size=0.45, 
                                                    shuffle=True, stratify=y)

n_splits = 25       # Specify cross validation splits parameters (i.e. how many folds)
n_neighbors = 10    # Specify classifer parameters

# Generate a cross validation slplit iterator. 
cv = StratifiedKFold(n_splits=n_splits, random_state=54, shuffle=True)

# use it to keep track of the accuracy total accross all folds 
kfold_acc = 0.

for train_idx, valid_idx in cv.split(X_train, y_train):
  
  # Define the classificaiton model - KNeighborsClassifier() classifer.
  model = KNeighborsClassifier(n_neighbors=n_neighbors)

  # Train the classifier on the subset of the training data that are in the training index set current cross validation fold. 
  model.fit(X_train[train_idx],y_train[train_idx])

  # Test the classifer on the subset of the training data that are in the validation index set of the current cross validation fold.
  y_pred = model.predict(X_train[valid_idx])

  # Obtain an estimate of the accuracy of the prediciton. 
  # Here we can use any of the classifer perfomrance matrics we want. 

  acc_metric = metrics.accuracy_score(y_pred,y_train[valid_idx])
  kfold_acc += acc_metric 

# cross validation loop is complete; calcualte the average accuracy. 
kfold_acc = kfold_acc/n_splits

#
# Calculate the accuracy of the classifer on an independent test set. 
# 
 
model = KNeighborsClassifier(n_neighbors=n_neighbors)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_acc = metrics.accuracy_score(y_pred, y_test)

print("Cross-validation Accuracy: {:.2f}".format(kfold_acc))
print("Test Accuracy: {:.2f}".format(test_acc))


Cross-validation Accuracy: 0.75
Test Accuracy: 0.53


Alternatively, we can use the `cross_val_score` to simplify the procedure as follows:

In [162]:
#
# Since cross-validation use-case is so common when assessing the performance of a classifer, sklearn library 
# has a dedicated method (i.e. the corss_val_score) that implements the entire cross-validation pipeline. 
#
# A more compact way to implement the same logic as the code above is to use a method provided by sklearn library.
 

from sklearn.model_selection import cross_val_score

#
# Split the original dataset into Training and Testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=54, test_size=0.45, 
                                                    shuffle=True, stratify=y)
# Specify the classifier. 
model = KNeighborsClassifier(n_neighbors=n_neighbors)

# Specify the corss validation parameters
cv = StratifiedKFold(n_splits=10, random_state=43, shuffle=True)

# Rum the corss-validation processing pipline 
cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1)

# Report results of the accuracy. 
print("Cross-validation: {:.2f}".format(np.mean(cv_acc)))

Cross-validation: 0.75


### Challenges : Cross Valudation for Hyperparameter Tuning
In this exercise, you will be working with the Pima Indians Diabetes Database database by Vincent Sigillito, which is available from the UCI database (https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes) or OpenML (https://www.openml.org/d/37).

The dataset contains information about 768 patients along with the Diabetes diagnosis. The Diabetes diagnosis is a binary label, where "tested_positive" means that a patient has diabetes and "tested_negative" means that a patient does not have diabetes.

I addition to the class label, there are 8 numeric features in the dataset, which are listed below:

- Number of times pregnant
- Plasma glucose concentration a 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)

### Download the dataset 

In [163]:
!pip install -q kaggle

In [164]:
#Upload the kaggle API key of your account 
from google.colab import files 
files.upload()
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle (1).json
mkdir: cannot create directory ‘/root/.kaggle’: File exists


In [165]:
# Download - Specify the parameters.  
kaggle_dataset_URI = "cchristoforou/practice-dataset-for-tutorials"
output_folder = "sample_data/problem_set05"
kaggle_data_file1 = "dataset_37_diabetes.csv"

In [166]:
# Download the first file from dataset - countries.csv
!kaggle datasets download $kaggle_dataset_URI --file $kaggle_data_file1 --path $output_folder

dataset_37_diabetes.csv: Skipping, found more recently modified local copy (use --force to force download)


###  Load the Dataset

Use pandas to load the dataset from the `dataset_37_diabetes.csv` CSV file located under the `sample_data/problem_set05` folder in your colab environment. Make sure you downloaded the data by executing `kaggle` command at the top of this notebook.


In [167]:
import pandas as pd
df = pd.read_csv('./sample_data/problem_set05/dataset_37_diabetes.csv')
df.head()

Unnamed: 0,preg,plas,pres,skin,insu,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,tested_positive
1,1,85,66,29,0,26.6,0.351,31,tested_negative
2,8,183,64,0,0,23.3,0.672,32,tested_positive
3,1,89,66,23,94,28.1,0.167,21,tested_negative
4,0,137,40,35,168,43.1,2.288,33,tested_positive


### Challenge 1 : Preprocess the class label

Convert the class label using pandas `apply` or `map` method. The mapping should be as follows:

- 'tested_positive' should be converted to 1 
- 'tested_negative' should be converted to 0

Check the documentation of the `map` method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) and the `appy` method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply)

In [168]:
# Solution 1:
df['class'] = df['class'].map(lambda v: 0 if v == 'tested_negative' else 1)
df.head()

Unnamed: 0,preg,plas,pres,skin,insu,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


### Challenge 2:  Split dataset into training and test sets

Split the dataset into 70% training and 30% test data. Perform a `stratified split` use 0 as the random seed for shuffling. You might want to check the documentatin of the `train_test_split` method of sklearn available [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [169]:
# Solution 2:

X_train, X_test, y_train, y_test = train_test_split(df.drop('class',axis=1), df['class'], random_state=0, test_size=0.30, shuffle=True, stratify=None)

### Challenge 3:  Perform model selection on the KNN classifier

Use the above dataset to identify and evaluate the best model for this dataset when you are using the K-NN classifier. Explore different parameters for K.

1. Report which parameter K gives best result under the cross validation procedure.

2. Report the classification accuracy for the optimal parameter using cross validation for model evaluation.

3. Report the generalization performance of the optimal model on the holdout dataset.

In [171]:
# 1. Report which parameter K gives best result under the cross validation procedure.

crossValResults = [] # to store the cross validation results of all the cv_acc 1-25
for i in range(1,26): # to loop the n_neighbours parameter 1-25
  n_neighbors = i 

  model = KNeighborsClassifier(n_neighbors=n_neighbors)
  model.fit(X_train, y_train)

  # Specify the corss validation parameters
  cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

  # Rum the corss-validation processing pipline 
  cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1) 
  result = np.mean(cv_acc)

  crossValResults.append(result) # to append (add) the result to "crossValResults" list
print("\nThe Cross Validation results are: ",crossValResults) # to print the list



The Cross Validation results are:  [0.6740567670474212, 0.6889235029421945, 0.6647802007615091, 0.700103842159917, 0.7001384562132225, 0.7038421599169263, 0.7225856697819314, 0.7187608168916579, 0.737435098650052, 0.7336967808930426, 0.7336967808930426, 0.7168916580131534, 0.7205607476635513, 0.7113014884042921, 0.7243336794738664, 0.7243509865005192, 0.7430252682589131, 0.7373831775700934, 0.7299584631360332, 0.7262720664589823, 0.7336967808930426, 0.718760816891658, 0.7225164416753201, 0.7169608861197645, 0.7262547594323295]


In [172]:
# Solution 3(i)
optimalCrossVal = np.argmax(crossValResults) # to see the 
print(optimalCrossVal)
# since the indexing starts from 0 the best parameter for K should be 16+1 = 17

16


In [173]:
# to check if the answer is correct

crossValResults = [] # to store the cross validation results of all the cv_acc 1-25
for i in range(1,26): # to loop the n_neighbours parameter 1-25
  n_neighbors = i 

  model = KNeighborsClassifier(n_neighbors=n_neighbors)
  model.fit(X_train, y_train)

  # Specify the corss validation parameters
  cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

  # Rum the corss-validation processing pipline 
  cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1) 
  result = np.mean(cv_acc)
  print("When the n_neighbour parameter is %0.f" %i,"'Cross-validation' is: {:.3f}".format(np.mean(cv_acc)))

When the n_neighbour parameter is 1 'Cross-validation' is: 0.674
When the n_neighbour parameter is 2 'Cross-validation' is: 0.689
When the n_neighbour parameter is 3 'Cross-validation' is: 0.665
When the n_neighbour parameter is 4 'Cross-validation' is: 0.700
When the n_neighbour parameter is 5 'Cross-validation' is: 0.700
When the n_neighbour parameter is 6 'Cross-validation' is: 0.704
When the n_neighbour parameter is 7 'Cross-validation' is: 0.723
When the n_neighbour parameter is 8 'Cross-validation' is: 0.719
When the n_neighbour parameter is 9 'Cross-validation' is: 0.737
When the n_neighbour parameter is 10 'Cross-validation' is: 0.734
When the n_neighbour parameter is 11 'Cross-validation' is: 0.734
When the n_neighbour parameter is 12 'Cross-validation' is: 0.717
When the n_neighbour parameter is 13 'Cross-validation' is: 0.721
When the n_neighbour parameter is 14 'Cross-validation' is: 0.711
When the n_neighbour parameter is 15 'Cross-validation' is: 0.724
When the n_neighbou

In [174]:
# 2. Report the classification accuracy for the optimal parameter using cross validation for model evaluation.
# Solution 3 (ii)

# since the optimal parameter for K is 17 now we see the test accuracy for K=17

model = KNeighborsClassifier(n_neighbors = 17) # 17 is the optimal parameter of K
model.fit(X_train, y_train)

# Specify the corss validation parameters
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

# Rum the corss-validation processing pipline 
cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1) 

print(" The classification accuracy of the optimal parameter of K = 17 is: {:.3f}".format(np.mean(cv_acc)))

 The classification accuracy of the optimal parameter of K = 17 is: 0.743


In [175]:
# 3. Report the generalization performance of the optimal model on the holdout dataset.
# Solution 3(iii)
model = KNeighborsClassifier(n_neighbors = 17)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_acc = metrics.accuracy_score(y_pred, y_test)

# Specify the corss validation parameters
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

# Rum the corss-validation processing pipline 
cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1) 


print("Cross-validation is: {:.3f}".format(np.mean(cv_acc)), "and the 'Test Accuracy' is: {:.3f}".format(test_acc) )

# Confidence interval
# z = 1.96 for 95% confidence
import math

confidenceInterval = 1.96*math.sqrt((test_acc*(1-test_acc))/len(y_test))
print("Confidence interval is: {:.3f}".format(test_acc), "+/- {:.3f}".format(confidenceInterval))

Cross-validation is: 0.743 and the 'Test Accuracy' is: 0.745
Confidence interval is: 0.745 +/- 0.056


##  3 Perform model selection on the SVM classifier.
Use the above dataset to identify and evaluate the best model for this dataset when you are using the SVM classifier. Explore different parameters for C, gamma and kernel.

1. Report which parameter configuration (C, gamma, kernel) gives best result under the cross validation procedure.

2. Report the classification accuracy for the optimal parameter using cross validation for model evaluation.

3. Report the generalization performance of the optimal model on the holdout dataset.

In [176]:
# Use this cell to run cross validation and select the optimial model for SVM classifier
# (i.e. the optimal C, gamma, kernal parameter configuration)

# 1. Report which parameter configuration (C, gamma, kernel) gives best result under the cross validation procedure.

crossValResults = [] # to store the cross validation results of all the cv_acc 1-25
for i in range(1,26): # to loop the n_neighbours parameter 1-25
  C = i 

  model = SVC(C=C, kernel="rbf")
  model.fit(X_train, y_train)

  # Specify the corss validation parameters
  cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

  # Rum the corss-validation processing pipline 
  cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1) 
  result = np.mean(cv_acc)

  crossValResults.append(result) # to append (add) the result to "crossValResults" list
print("\nThe Cross Validation results are: ",crossValResults) # to print the list



The Cross Validation results are:  [0.7523710626514364, 0.7523710626514364, 0.7598476981654552, 0.7598476981654552, 0.7598476981654552, 0.7561266874350986, 0.7505192107995846, 0.7523883696780892, 0.7486500519210799, 0.7505019037729318, 0.7486327448944271, 0.7393042575285566, 0.7393042575285566, 0.7411734164070612, 0.7430425752855659, 0.7411734164070612, 0.7374524056767048, 0.7374524056767048, 0.7374524056767048, 0.7355832467982001, 0.7355832467982001, 0.7355832467982001, 0.7337140879196954, 0.7337140879196954, 0.7336967808930426]


In [177]:
# Solution 3(i)
optimalCrossVal = np.argmax(crossValResults) # to see the 
print(optimalCrossVal)
# since the indexing starts from 0 the best parameter for K should be 2+1 = 3

2


In [178]:
# to check
crossValResults = [] # to store the cross validation results of all the cv_acc 1-25
for i in range(1,26): # to loop the n_neighbours parameter 1-25
  C = i

  model = SVC(C=C, kernel="rbf")
  model.fit(X_train, y_train)

  # Specify the corss validation parameters
  cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

  # Rum the corss-validation processing pipline 
  cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1) 
  result = np.mean(cv_acc)

  crossValResults.append(result) # to append (add) the result to "crossValResults" list
  
  print("When the C parameter is %0.f" %i,"'Cross-validation' is: {:.4f}".format(np.mean(cv_acc)))

When the C parameter is 1 'Cross-validation' is: 0.7524
When the C parameter is 2 'Cross-validation' is: 0.7524
When the C parameter is 3 'Cross-validation' is: 0.7598
When the C parameter is 4 'Cross-validation' is: 0.7598
When the C parameter is 5 'Cross-validation' is: 0.7598
When the C parameter is 6 'Cross-validation' is: 0.7561
When the C parameter is 7 'Cross-validation' is: 0.7505
When the C parameter is 8 'Cross-validation' is: 0.7524
When the C parameter is 9 'Cross-validation' is: 0.7487
When the C parameter is 10 'Cross-validation' is: 0.7505
When the C parameter is 11 'Cross-validation' is: 0.7486
When the C parameter is 12 'Cross-validation' is: 0.7393
When the C parameter is 13 'Cross-validation' is: 0.7393
When the C parameter is 14 'Cross-validation' is: 0.7412
When the C parameter is 15 'Cross-validation' is: 0.7430
When the C parameter is 16 'Cross-validation' is: 0.7412
When the C parameter is 17 'Cross-validation' is: 0.7375
When the C parameter is 18 'Cross-valida

**Using kernel = sigmoid**

In [179]:

crossValResults = [] # to store the cross validation results of all the cv_acc 1-25
for i in range(1,26): # to loop the n_neighbours parameter 1-25
  C = i 

  model = SVC(C=C, kernel="sigmoid")
  model.fit(X_train, y_train)

  # Specify the corss validation parameters
  cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

  # Rum the corss-validation processing pipline 
  cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1) 
  result = np.mean(cv_acc)

  crossValResults.append(result) # to append (add) the result to "crossValResults" list
print("\nThe Cross Validation results are: ",crossValResults) # to print the list


The Cross Validation results are:  [0.47677397023191415, 0.4674800969193493, 0.4525614399446175, 0.4451021114572516, 0.43766008999653855, 0.43579093111803396, 0.43579093111803396, 0.43579093111803396, 0.43579093111803396, 0.433939079266182, 0.4358082381446867, 0.43206992038767744, 0.43392177223952927, 0.4320526133610246, 0.43020076150917264, 0.43206992038767733, 0.43206992038767733, 0.43206992038767733, 0.43206992038767733, 0.43206992038767733, 0.4320526133610246, 0.4320526133610246, 0.4320526133610246, 0.4320526133610246, 0.43020076150917264]


In [180]:
# Solution 3(i)
optimalCrossVal = np.argmax(crossValResults) # to see the 
print(optimalCrossVal)
# since the indexing starts from 0 the best parameter for K should be 0+1 = 1

0


**Now we know that C = 3, kernel = rbf and gamma = 0 is the best parameter of K now we look at the cross validation score**


Note: I did not test different parameters for gamma because i got the same results for every parameter

In [183]:
# Use this cell to calculate and report the generalization performance of the optimal model
# 2. Report the classification accuracy for the optimal parameter using cross validation for model evaluation.
# Solution 3 (ii)

# since the optimal parameter for K is 17 now we see the test accuracy for K=17

model = SVC(C=3, kernel="rbf")
model.fit(X_train, y_train)

# Specify the corss validation parameters
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

# Rum the corss-validation processing pipline 
cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1) 

print(" The classification accuracy of the optimal parameter of K = 17 is: {:.3f}".format(np.mean(cv_acc)))

 The classification accuracy of the optimal parameter of K = 17 is: 0.760


In [184]:
# 3. Report the generalization performance of the optimal model on the holdout dataset.
# Solution 3(iii)
model = SVC(C=3, kernel="rbf")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_acc = metrics.accuracy_score(y_pred, y_test)

# Specify the corss validation parameters
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

# Rum the corss-validation processing pipline 
cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1) 


print("Cross-validation is: {:.3f}".format(np.mean(cv_acc)), "and the 'Test Accuracy' is: {:.3f}".format(test_acc) )

# Confidence interval
# z = 1.96 for 95% confidence
import math

confidenceInterval = 1.96*math.sqrt((test_acc*(1-test_acc))/len(y_test))
print("Confidence interval is: {:.3f}".format(test_acc), "+/- {:.3f}".format(confidenceInterval)) 


Cross-validation is: 0.760 and the 'Test Accuracy' is: 0.766
Confidence interval is: 0.766 +/- 0.055


Copyright Statement: Copyright © 2020 Christoforou. The materials provided by the instructor of this course, including this notebook, are for the use of the students enrolled in the course. Materials are presented in an educational context for personal use and study and should not be shared, distributed, disseminated or sold in print — or digitally — outside the course without permission. You may not, nor may you knowingly allow others to reproduce or distribute lecture notes, course materials as well as any of their derivatives without the instructor's express written consent