This notebook is part of Dr. Christoforos Christoforou's course materials. You may not, nor may you knowingly allow others to, reproduce or distribute lecture notes, course materials, or any of their derivatives without the instructor's express written consent.





## Problem Set 05: Hyperparameter Tuning and Model Selection

## Cross-validation example code
Oftentimes, we might consider one of several classification models as our predictive function, or we might want to identify the optimal set of parameters for a particular classifier. For example, the KNN classifier expects the parameter `K`, which indicates the number of neighbors, and the SVM classifier expects hyperparameters such as the value `C` and the `kernel` function.

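Typically, we want to select the best set of hyperparameters for our particular model. One method for selecting them is `cross-validation`: the training data are partitioned into several folds, and each fold in turn serves as a validation set for a model trained on the remaining folds.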
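As a minimal sketch of how the folds are formed, the snippet below splits a small made-up dataset with `StratifiedKFold` and prints the size and class balance of each validation fold. The toy arrays and the names `X_demo`, `y_demo`, and `splitter` are assumptions introduced only for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 12 samples, two balanced classes.
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0] * 6 + [1] * 6)

# Three stratified folds; stratification keeps the class ratio in every fold.
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold_sizes = []
seen_valid = []
for fold, (train_idx, valid_idx) in enumerate(splitter.split(X_demo, y_demo)):
    fold_sizes.append((len(train_idx), len(valid_idx)))
    seen_valid.extend(valid_idx.tolist())
    print("fold", fold,
          "train size:", len(train_idx),
          "validation size:", len(valid_idx),
          "validation class counts:", np.bincount(y_demo[valid_idx]))
```

Every sample lands in exactly one validation fold, so each sample is used for validation once and for training in the remaining folds.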

The code below illustrates how you can use `sklearn` functions to perform `cross-validation`.

In [1]:
import numpy as np
import matplotlib.pyplot as plt 
from sklearn import metrics
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold 

In [2]:
# Generate a sample Dataset. 
n_samples = 100
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.35)
X, y = noisy_circles

In [3]:
# Run the data generation cell above before executing this cell.

#
# Split the original dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=54, test_size=0.45, 
                                                    shuffle=True, stratify=y)

n_splits = 25       # Number of cross-validation folds
n_neighbors = 10    # Classifier hyperparameter (K)

# Generate a cross-validation split iterator. 
cv = StratifiedKFold(n_splits=n_splits, random_state=54, shuffle=True)

# Running total of the accuracy across all folds
kfold_acc = 0.

for train_idx, valid_idx in cv.split(X_train, y_train):
  
  # Define the classification model: a KNeighborsClassifier.
  model = KNeighborsClassifier(n_neighbors=n_neighbors)

  # Train the classifier on the training indices of the current cross-validation fold.
  model.fit(X_train[train_idx], y_train[train_idx])

  # Predict on the validation indices of the current cross-validation fold.
  y_pred = model.predict(X_train[valid_idx])

  # Estimate the accuracy of the predictions.
  # Any other classifier performance metric could be used here instead.

  acc_metric = metrics.accuracy_score(y_train[valid_idx], y_pred)
  kfold_acc += acc_metric 

# The cross-validation loop is complete; calculate the average accuracy.
kfold_acc = kfold_acc/n_splits

#
# Calculate the accuracy of the classifier on an independent test set. 
# 
 
model = KNeighborsClassifier(n_neighbors=n_neighbors)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_acc = metrics.accuracy_score(y_test, y_pred)

print("Cross-validation Accuracy: {:.2f}".format(kfold_acc))
print("Test Accuracy: {:.2f}".format(test_acc))


Cross-validation Accuracy: 0.74
Test Accuracy: 0.71


Alternatively, we can use the `cross_val_score` function to simplify the procedure as follows:

In [4]:
#
# Because cross-validation is such a common way to assess a classifier, the sklearn library
# provides a dedicated function, cross_val_score, that implements the entire cross-validation pipeline.
# It is a more compact way to implement the same logic as the loop above.
#

#
# Split the original dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=54, test_size=0.45, 
                                                    shuffle=True, stratify=y)
# Specify the classifier. 
model = KNeighborsClassifier(n_neighbors=n_neighbors)

# Specify the cross-validation parameters.
cv = StratifiedKFold(n_splits=10, random_state=43, shuffle=True)

# Run the cross-validation pipeline.
cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1)

# Report results of the accuracy. 
print("Cross-validation: {:.2f}".format(np.mean(cv_acc)))

Cross-validation: 0.80
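Since the stated goal is to choose a hyperparameter, the same function can be called once per candidate value. The following is a minimal sketch, not part of the original exercise: the candidate list, the fixed `random_state`, and the names `X_demo`, `cv_demo`, and `mean_scores` are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data like the cells above; random_state is added here (an assumption)
# so that repeated runs give the same numbers.
X_demo, y_demo = datasets.make_circles(n_samples=200, factor=.5, noise=.35, random_state=0)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical grid of candidate values for K.
candidate_k = [1, 3, 5, 7, 9, 11, 15]

mean_scores = {}
for k in candidate_k:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=cv_demo)
    mean_scores[k] = scores.mean()
    print("K={:2d}  mean CV accuracy={:.3f}  (std={:.3f})".format(k, scores.mean(), scores.std()))

# The candidate with the highest mean validation accuracy is selected.
best_k = max(mean_scores, key=mean_scores.get)
print("Best K under cross-validation:", best_k)
```

Using the same `cv_demo` splitter for every candidate means all values of K are compared on identical folds.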


### Challenges: Cross-Validation for Hyperparameter Tuning
In this exercise, you will work with the Pima Indians Diabetes Database by Vincent Sigillito, which is available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes) or OpenML (https://www.openml.org/d/37).

The dataset contains information about 768 patients along with their diabetes diagnosis. The diagnosis is a binary label, where "tested_positive" means that a patient has diabetes and "tested_negative" means that a patient does not have diabetes.

In addition to the class label, there are 8 numeric features in the dataset, which are listed below:

- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)

### Download the dataset 

In [None]:
!pip install -q kaggle

In [None]:
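# Upload the Kaggle API key (kaggle.json) of your account.
from google.colab import files 
files.upload()
# -p avoids an error if the folder already exists (e.g. when the cell is re-run).
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json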

In [None]:
# Download - Specify the parameters.  
kaggle_dataset_URI = "cchristoforou/practice-dataset-for-tutorials"
output_folder = "sample_data/problem_set05"
kaggle_data_file1 = "dataset_37_diabetes.csv"

In [None]:
# Download the data file (dataset_37_diabetes.csv) from the Kaggle dataset.
!kaggle datasets download $kaggle_dataset_URI --file $kaggle_data_file1 --path $output_folder

###  Load the Dataset

Use pandas to load the dataset from the `dataset_37_diabetes.csv` file located under the `sample_data/problem_set05` folder in your Colab environment. Make sure you have downloaded the data by running the `kaggle` commands above.


In [30]:
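import pandas as pd
# Adjust this path if the file was saved elsewhere (the download cell above targets sample_data/problem_set05).
df = pd.read_csv('/content/sample_data/dataset_37_diabetes.csv')
# Drop the first column. As the preview below shows, this removes the pregnancy-count
# feature, so 7 of the 8 features listed above remain.
df.drop(df.columns[0],axis=1,inplace=True)
df.head()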

   plas  pres  skin  insu  mass   pedi  age            class
0   148    72    35     0  33.6  0.627   50  tested_positive
1    85    66    29     0  26.6  0.351   31  tested_negative
2   183    64     0     0  23.3  0.672   32  tested_positive
3    89    66    23    94  28.1  0.167   21  tested_negative
4   137    40    35   168  43.1  2.288   33  tested_positive


### Challenge 1 : Preprocess the class label

Convert the class label using pandas `apply` or `map` method. The mapping should be as follows:

- 'tested_positive' should be converted to 1 
- 'tested_negative' should be converted to 0

Check the documentation of the `map` method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) and the `apply` method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply).
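The two methods named above can be compared on a tiny example. This is a sketch on a made-up three-element Series (the names `labels`, `mapping`, `via_map`, and `via_apply` are assumptions), showing that both produce the same encoding and how `map` behaves for a value missing from the dictionary.

```python
import pandas as pd

# Made-up labels standing in for the dataset's class column.
labels = pd.Series(["tested_positive", "tested_negative", "tested_positive"])

# Option 1: map() with a dictionary lookup.
mapping = {"tested_positive": 1, "tested_negative": 0}
via_map = labels.map(mapping)

# Option 2: apply() with a function evaluated on each element.
via_apply = labels.apply(lambda s: 1 if s == "tested_positive" else 0)

print(via_map.tolist())
print(via_apply.tolist())

# A value that is absent from the dictionary becomes NaN under map().
with_unknown = pd.Series(["tested_positive", "unknown"]).map(mapping)
print(with_unknown.isna().tolist())
```

The dictionary form makes unexpected labels visible as missing values, whereas the lambda above would silently encode them as 0.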

In [31]:
# Implement Challenge 1 here.
df['class'] = df['class'].map({'tested_positive': 1, 'tested_negative': 0})
df.head()

   plas  pres  skin  insu  mass   pedi  age  class
0   148    72    35     0  33.6  0.627   50      1
1    85    66    29     0  26.6  0.351   31      0
2   183    64     0     0  23.3  0.672   32      1
3    89    66    23    94  28.1  0.167   21      0
4   137    40    35   168  43.1  2.288   33      1


### Challenge 2:  Split dataset into training and test sets

Split the dataset into 70% training and 30% test data. Perform a stratified split, using 0 as the random seed for shuffling. You might want to check the documentation of sklearn's `train_test_split` function, available [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [44]:
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
 
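# NOTE: this split is not stratified; passing stratify=df['class'] would make it so,
# as the challenge text requests. The outputs recorded below come from the split as written.
X_train, X_test, y_train, y_test = train_test_split(df.drop('class',axis=1), df['class'], test_size=.30, random_state=0)
n_splits = 25        

# Specify the cross-validation parameters.
cv = StratifiedKFold(n_splits=10, random_state=43, shuffle=True)

# Run the cross-validation pipeline.
# NOTE: `model` is not defined in this cell; it is whichever estimator an earlier cell left behind.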
cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1)

kfold_acc = 0.

In [45]:
print("x_train:", X_train.shape, ". y_train:", y_train.shape)
print("x_test:", X_test.shape, ". y_test:", y_test.shape)
print("ID: " + str(np.unique(y_test)))

x_train: (537, 7) . y_train: (537,)
x_test: (231, 7) . y_test: (231,)
ID: [0 1]
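To illustrate what the `stratify` argument of `train_test_split` does, here is a minimal sketch on made-up imbalanced labels. The arrays and the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are assumptions for illustration; the diabetes data is not used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives and 30 positives.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo)

print("positive fraction in train:", y_tr.mean())
print("positive fraction in test: ", y_te.mean())
```

Both subsets keep the 30% positive rate of the full label vector, which a purely random split does not guarantee.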


### Challenge 3:  Perform model selection on the KNN classifier

Using the dataset above, identify and evaluate the best K-NN classifier. Explore different values of the parameter K.

1. Report which value of K gives the best result under the cross-validation procedure.

2. Report the classification accuracy for the optimal parameter, using cross-validation for model evaluation.

3. Report the generalization performance of the optimal model on the holdout dataset.

In [58]:
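# NOTE: cv_acc is not recomputed here; the cross-validation score printed below was produced
# by an earlier cell and does not measure this n_neighbors=3 model.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_acc = metrics.accuracy_score(y_test, y_pred)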

print("Cross-validation: {:.2f}".format(np.mean(cv_acc)))
print("Test Accuracy: {:.2f}".format(test_acc))

Cross-validation: 0.76
Test Accuracy: 0.71


In [59]:
# Use this cell to calculate and report optimal model parameters and cross-validation performance.
# NOTE: cv_acc is not recomputed here; the cross-validation score printed below was produced
# by an earlier cell and does not measure this n_neighbors=7 model.
model = KNeighborsClassifier(n_neighbors=7)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_acc = metrics.accuracy_score(y_test, y_pred)

print("Cross-validation: {:.2f}".format(np.mean(cv_acc)))
print("Test Accuracy: {:.2f}".format(test_acc))

Cross-validation: 0.76
Test Accuracy: 0.75


In [60]:
# Use this cell to calculate and report the generalization performance of the optimal model.
# NOTE: cv_acc is not recomputed here; the cross-validation score printed below was produced
# by an earlier cell and does not measure this n_neighbors=15 model.
model = KNeighborsClassifier(n_neighbors=15)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_acc = metrics.accuracy_score(y_test, y_pred)

print("Cross-validation: {:.2f}".format(np.mean(cv_acc)))
print("Test Accuracy: {:.2f}".format(test_acc))

Cross-validation: 0.76
Test Accuracy: 0.74
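The three reporting steps of Challenge 3 can be sketched end to end as follows. This is an illustration under stated assumptions, not the reference solution: it uses synthetic stand-in data from `make_classification` (not the diabetes file), a hypothetical grid of odd K values, and made-up variable names such as `cv_means` and `final_model`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): 768 samples and 8 features, like the shape of the diabetes table.
X_syn, y_syn = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=0, stratify=y_syn)

cv_syn = StratifiedKFold(n_splits=10, shuffle=True, random_state=43)

# Step 1: score every candidate K with cross-validation on the training portion only.
cv_means = {}
for k in range(1, 32, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=cv_syn)
    cv_means[k] = scores.mean()
best_k = max(cv_means, key=cv_means.get)

# Step 2: the cross-validation accuracy of the selected K.
best_cv_acc = cv_means[best_k]

# Step 3: refit on the full training set and evaluate once on the untouched holdout set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
holdout_acc = accuracy_score(y_te, final_model.predict(X_te))

print("Best K:", best_k)
print("Cross-validation accuracy: {:.2f}".format(best_cv_acc))
print("Holdout accuracy: {:.2f}".format(holdout_acc))
```

The holdout set plays no part in choosing K; it is scored only after the choice is made, so its accuracy remains an estimate of generalization performance.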


### Challenge 4: Perform model selection on the SVM classifier
Using the dataset above, identify and evaluate the best SVM classifier. Explore different values for C, gamma, and kernel.

1. Report which parameter configuration (C, gamma, kernel) gives the best result under the cross-validation procedure.

2. Report the classification accuracy for the optimal configuration, using cross-validation for model evaluation.

3. Report the generalization performance of the optimal model on the holdout dataset.
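One way to search several hyperparameters at once is scikit-learn's `GridSearchCV`. The sketch below is an assumption-laden illustration rather than the expected answer: it runs on synthetic stand-in data, uses a small hypothetical grid, and adds a `StandardScaler` step (a choice made here, not taken from the exercise) because SVMs are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption), not the diabetes file.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=0, stratify=y_syn)

# Scaling lives inside the pipeline so it is re-fit on each training fold.
pipe = make_pipeline(StandardScaler(), SVC())

# Two sub-grids: gamma is searched only for the RBF kernel, since the linear kernel ignores it.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
]

search = GridSearchCV(
    pipe, param_grid, scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=43))
search.fit(X_tr, y_tr)

holdout_acc = search.score(X_te, y_te)  # uses the best configuration, refit on all of X_tr
print("Best configuration:", search.best_params_)
print("Cross-validation accuracy: {:.2f}".format(search.best_score_))
print("Holdout accuracy: {:.2f}".format(holdout_acc))
```

`best_params_`, `best_score_`, and the holdout score correspond to the three items requested in the list above.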

In [61]:
import pandas as pd
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv('/content/sample_data/dataset_37_diabetes.csv')
df.drop(df.columns[0],axis=1,inplace=True)

# Convert the class label to 0/1 (Challenge 1).
df['class'] = df['class'].map({'tested_positive': 1, 'tested_negative': 0})

X_train, X_test, y_train, y_test = train_test_split(df.drop('class',axis=1), df['class'], test_size=.30, random_state=0)
n_splits = 25      
# Specify the cross-validation parameters.
cv = StratifiedKFold(n_splits=10, random_state=43, shuffle=True)

# Run the cross-validation pipeline.
# NOTE: at this point `model` is still the estimator left over from an earlier cell;
# the SVC below is created only after this call, so cv_acc does not measure it.
cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1)

# gamma has no effect with the linear kernel.
model = SVC(C = 5.0, kernel="linear", gamma = 25)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_acc = metrics.accuracy_score(y_test, y_pred)

print("Cross-validation: {:.2f}".format(np.mean(cv_acc)))
print("Test Accuracy: {:.2f}".format(test_acc))
print(classification_report(y_test,y_pred))

Cross-validation: 0.74
Test Accuracy: 0.77
              precision    recall  f1-score   support

           0       0.80      0.89      0.84       157
           1       0.70      0.53      0.60        74

    accuracy                           0.77       231
   macro avg       0.75      0.71      0.72       231
weighted avg       0.77      0.77      0.77       231



In [57]:
# Use this cell to run cross-validation and select the optimal model for the SVM classifier
# (i.e. the optimal C, gamma, kernel parameter configuration)
import pandas as pd
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv('/content/sample_data/dataset_37_diabetes.csv')
df.drop(df.columns[0],axis=1,inplace=True)

# Convert the class label to 0/1 (Challenge 1).
df['class'] = df['class'].map({'tested_positive': 1, 'tested_negative': 0})

X_train, X_test, y_train, y_test = train_test_split(df.drop('class',axis=1), df['class'], test_size=.30, random_state=0)
n_splits = 10      

# Specify the cross-validation parameters.
cv = StratifiedKFold(n_splits=10, random_state=43, shuffle=True)

# Run the cross-validation pipeline.
# NOTE: at this point `model` is still the estimator left over from an earlier cell;
# the SVC below is created only after this call, so cv_acc does not measure it.
cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1)

# gamma has no effect with the linear kernel.
model = SVC(C = 5.0, kernel="linear", gamma = 25)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_acc = metrics.accuracy_score(y_test, y_pred)

print("Cross-validation: {:.2f}".format(np.mean(cv_acc)))
print("Test Accuracy: {:.2f}".format(test_acc))
print(classification_report(y_test,y_pred))

Cross-validation: 0.76
Test Accuracy: 0.77
              precision    recall  f1-score   support

           0       0.80      0.89      0.84       157
           1       0.70      0.53      0.60        74

    accuracy                           0.77       231
   macro avg       0.75      0.71      0.72       231
weighted avg       0.77      0.77      0.77       231



In [62]:
# Use this cell to run cross-validation and select the optimal model for the SVM classifier
# (i.e. the optimal C, gamma, kernel parameter configuration)
import pandas as pd
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv('/content/sample_data/dataset_37_diabetes.csv')
df.drop(df.columns[0],axis=1,inplace=True)

# Convert the class label to 0/1 (Challenge 1).
df['class'] = df['class'].map({'tested_positive': 1, 'tested_negative': 0})

X_train, X_test, y_train, y_test = train_test_split(df.drop('class',axis=1), df['class'], test_size=.30, random_state=0)
n_splits = 25      
n_neighbors = 10   

# Specify the cross-validation parameters.
cv = StratifiedKFold(n_splits=10, random_state=43, shuffle=True)

# Run the cross-validation pipeline.
# NOTE: at this point `model` is still the estimator left over from an earlier cell;
# the SVC below is created only after this call, so cv_acc does not measure it.
cv_acc = cross_val_score(estimator=model,
                         X=X_train,
                         y=y_train,
                         cv=cv,
                         n_jobs=-1)

model = SVC(C = 10.0, kernel="rbf", gamma = 'auto')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_acc = metrics.accuracy_score(y_test, y_pred)

print("Cross-validation: {:.2f}".format(np.mean(cv_acc)))
print("Test Accuracy: {:.2f}".format(test_acc))
print(classification_report(y_test,y_pred))

Cross-validation: 0.77
Test Accuracy: 0.68
              precision    recall  f1-score   support

           0       0.68      1.00      0.81       157
           1       0.00      0.00      0.00        74

    accuracy                           0.68       231
   macro avg       0.34      0.50      0.40       231
weighted avg       0.46      0.68      0.55       231
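In the report above, class 0 has a recall of 1.00 and class 1 a recall of 0.00, which means the model predicted class 0 for every test sample. Precision for class 1 is then a 0/0 ratio. The sketch below reproduces that situation on made-up labels (the arrays and variable names are assumptions) and uses the `zero_division` argument of scikit-learn's metric functions to state explicitly which value to report.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up ground truth with two positives, and a classifier that always predicts class 0.
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.zeros_like(y_true)

# No sample is predicted positive, so precision for class 1 is 0/0;
# zero_division=0 reports it as 0.0 instead of emitting a warning.
precision_pos = precision_score(y_true, y_pred, zero_division=0)

# Recall for class 1 is well defined: 0 of the 2 positives were found.
recall_pos = recall_score(y_true, y_pred)

print("precision (class 1):", precision_pos)
print("recall (class 1):   ", recall_pos)
```

A row of zeros like this is a sign that accuracy alone (0.68 in the report above) can hide a model that never detects the positive class.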





Copyright Statement: Copyright © 2020 Christoforou. The materials provided by the instructor of this course, including this notebook, are for the use of the students enrolled in the course. Materials are presented in an educational context for personal use and study and should not be shared, distributed, disseminated or sold, in print or digitally, outside the course without permission. You may not, nor may you knowingly allow others to, reproduce or distribute lecture notes, course materials, or any of their derivatives without the instructor's express written consent.