<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Model Selection & Evaluation

In this notebook we look at strategies for dividing a dataset into subsets for model selection and testing, in ways that do not bias our measurement of model performance.

We are going to use a dataset which comes from a study done to try to use sonar signals to differentiate between a mine (simulated using a metal cylinder) and a rock.  We have 208 observations (sonar readings), and each observation has 60 features (energy in a particular frequency band summed over a set period of time) and a target value (rock 'R' or mine 'M').  Our goal will be to build a model which can use the sonar readings to predict whether the object is a mine or rock.

Details on the dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)).

In [1]:
# Import the libraries we know we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

In [2]:
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
data = pd.read_csv(url, header=None)
print(data.shape)
data.head()

(208, 61)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,51,52,53,54,55,56,57,58,59,60
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,R


The output confirms that we have 208 rows and 61 columns: 60 feature columns (columns 0 through 59, each the energy in a particular frequency band) and the target value in column 60 (rock 'R' or mine 'M').

In [3]:
# Create feature matrix using the first 60 columns as the features
X = data.iloc[:,:60]

# Create target vector from the last column
y = data.iloc[:,60]

X.shape,y.shape

((208, 60), (208,))

## Model Evaluation: Splitting data into training and test sets

When we split the data into a training and a test set, we use only the training data to fit the model.  Once we have trained our model, we use it to generate predictions on the test set data and calculate error metrics based on those predictions.  This ensures that we are evaluating the model based on its ability to create predictions for data it has not seen before, which is more representative of what the model will need to do in the real world.
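As a rough illustration of the idea, a hold-out split can be sketched in plain Python by shuffling the row indices and setting a fraction aside. This is not how scikit-learn's `train_test_split` is implemented; the helper name `holdout_split` and the choice to round the test share up are assumptions made for this sketch.

```python
import math
import random

def holdout_split(n_samples, test_size=0.15, seed=0):
    """Return (train_idx, test_idx) for a shuffled hold-out split."""
    # Assumption for this sketch: round the test share up to a whole observation
    n_test = math.ceil(n_samples * test_size)
    indices = list(range(n_samples))
    # Shuffle so the split does not depend on the order of the rows
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(208, test_size=0.15, seed=0)
print(len(train_idx), len(test_idx))

# No observation appears in both sets
print(set(train_idx).isdisjoint(test_idx))
```

The key property is that the two index sets are disjoint, so the model is never scored on an observation it was fit on.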

In [4]:
# Split the data into training and test sets, using 85% of the data for training
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=0,test_size=0.15)

In [5]:
# Let's use a multi-layer perceptron, a form of neural network
from sklearn.neural_network import MLPClassifier

# Create an instance of the MLPClassifier algorithm and set the hyperparameter values
model = MLPClassifier(hidden_layer_sizes=(100,50,10),activation='tanh',
                      solver='sgd',learning_rate_init=0.01,max_iter=2000)

In [6]:
# Fit the model using only the training set data
model.fit(X_train,y_train)

MLPClassifier(activation='tanh', hidden_layer_sizes=(100, 50, 10),
              learning_rate_init=0.01, max_iter=2000, solver='sgd')

In [7]:
# Evaluate the performance of our model using the test set data
preds = model.predict(X_test)
acc_test = sum(preds==y_test)/len(y_test)
print('Accuracy of our classifier on the test set is {:.3f}'.format(acc_test))

Accuracy of our classifier on the test set is 0.875
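The accuracy computed above is simply the fraction of predictions that match the true labels. A minimal plain-Python sketch of that calculation follows; the function name `accuracy` and the four example labels are hypothetical and used only for illustration. scikit-learn also provides `sklearn.metrics.accuracy_score`, which computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels, for illustration only
y_true = ['R', 'M', 'M', 'R']
y_pred = ['R', 'M', 'R', 'R']
print(accuracy(y_true, y_pred))  # 3 of 4 predictions are correct

# With 32 test observations, 28 correct predictions give the accuracy reported above
print(28 / 32)
```

With a test set this small, each individual prediction moves the accuracy by 1/32, or roughly 3 percentage points, so small differences between models should be read with caution.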


But what if we want to compare different models (for example, evaluate different algorithms or fine-tune our hyperparameters)?  Can we use the same strategy of training each model on the training data and then comparing their performance on the test set to select the best model?

## Model Selection: Splitting data into training, validation and test sets
When we are seeking to optimize models by tuning hyperparameters or comparing different algorithms, it is best practice to compare the candidate models on a "validation" set and to reserve the test set for evaluating the final model we have selected.  If we instead picked the model that scored best on the test set, that score would be an optimistic estimate of real-world performance, because the test set itself would have influenced our choice.  To use this approach we must split our data three ways to create a training set, a validation set, and a test set.
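To see why selecting a model on the test set biases the reported score, consider a small simulation. It is a sketch under stated assumptions: the 50 "models" are hypothetical and guess at random, so every one of them has a true accuracy of 0.5, and the set sizes are chosen only for illustration.

```python
import random

rng = random.Random(0)
n_test, n_fresh, n_models = 32, 10000, 50

def random_guess_accuracy(n):
    """Accuracy of a model that guesses R or M at random on n observations."""
    # Each guess is correct with probability 0.5
    return sum(rng.random() < 0.5 for _ in range(n)) / n

# Score every model on the same small test set and keep the best score
test_scores = [random_guess_accuracy(n_test) for _ in range(n_models)]
best_test_score = max(test_scores)

# Score the selected model again on fresh data that played no part in the selection
fresh_score = random_guess_accuracy(n_fresh)

print('Best test-set accuracy among {} models: {:.3f}'.format(n_models, best_test_score))
print('Accuracy of the selected model on fresh data: {:.3f}'.format(fresh_score))
```

The best score on the small test set looks better than chance purely because we picked the maximum of many noisy measurements, while the same model on fresh data falls back to about 0.5. Keeping the test set out of the selection step avoids reporting that inflated number.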

In [8]:
# Split data first into training and testing to get test set using 15% of data for test
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=0,test_size=0.15)

# Now split the training set again into training and validation, using 15% of training data for validation
X_train,X_val,y_train,y_val = train_test_split(X_train,y_train,random_state=0,test_size=0.15)

# Verify we have what we expect in each set
X_train.shape, X_val.shape, X_test.shape

((149, 60), (27, 60), (32, 60))
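The shapes above follow from applying the 15% fraction twice. The arithmetic can be sketched as below, assuming the held-out share is rounded up to a whole observation at each step, which is consistent with the shapes printed above. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size, val_size):
    """Sizes of the train, validation and test sets after two successive splits."""
    n_test = math.ceil(n_samples * test_size)   # held out first
    n_remaining = n_samples - n_test
    n_val = math.ceil(n_remaining * val_size)   # held out from what remains
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(208, test_size=0.15, val_size=0.15)
print(n_train, n_val, n_test)

# Share of the full dataset that ends up in each set
for name, n in [('train', n_train), ('validation', n_val), ('test', n_test)]:
    print('{}: {:.1%} of the full dataset'.format(name, n / 208))
```

Note that the validation set is 15% of the remaining training data, not 15% of the full dataset, so it is slightly smaller than the test set.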

Now let's compare two different models and determine which one gives us better performance.  Both of the models below are multilayer perceptron (simple neural network) models with different layer shapes and activation functions.  Do not worry about what these are or how they work for now; we will get to that in a later lesson.  For now you can treat them as black box models to compare.

In [9]:
# Create an instance of each model we want to evaluate

model1 = MLPClassifier(hidden_layer_sizes=(100,50,10),activation='tanh',
                      solver='sgd',learning_rate_init=0.01,max_iter=2000)

model2 = MLPClassifier(hidden_layer_sizes=(100,50),activation='relu',
                      solver='sgd',learning_rate_init=0.01,max_iter=2000)

In [10]:
# Compare the performance of the two models using the validation set
model1.fit(X_train,y_train)
val_preds_model1 = model1.predict(X_val)
acc_val_model1 = sum(val_preds_model1==y_val)/len(y_val)

model2.fit(X_train,y_train)
val_preds_model2 = model2.predict(X_val)
acc_val_model2 = sum(val_preds_model2==y_val)/len(y_val)

print('Accuracy of model1 on the validation set is {:.3f}'.format(acc_val_model1))
print('Accuracy of model2 on the validation set is {:.3f}'.format(acc_val_model2))

Accuracy of model1 on the validation set is 0.778
Accuracy of model2 on the validation set is 0.852


Based on the performance of our two models on the validation set, we would select model2 to use as our model.  Let's now use the test set to evaluate its performance on data it has not yet seen so we can state a more accurate performance level.  Because the validation set has served its purpose and the test set has not yet been touched, we can retrain model2 on a combination of the training and validation data before evaluating it.

In [11]:
# Train our selected model on the training plus validation sets
model2.fit(pd.concat([X_train,X_val]),pd.concat([y_train,y_val]))

# Evaluate its performance on the test set
preds_test = model2.predict(X_test)
acc_test = sum(preds_test==y_test)/len(y_test)
print('Accuracy of our model on the test set is {:.3f}'.format(acc_test))

Accuracy of our model on the test set is 0.938


## Model Selection: Cross-validation

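A common approach to comparing and optimizing models is to use cross-validation rather than a single validation set to compare model performance.  In k-fold cross-validation the training data is divided into k folds; each fold serves once as the validation set while the model is trained on the remaining folds, and the k validation scores are averaged.  We then select the better model based on the cross-validation performance and use the test set to determine its performance.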
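The way k-fold cross-validation partitions the data can be sketched in plain Python. This sketch mirrors the documented behavior of an unshuffled `KFold` (the first `n_samples % n_splits` folds receive one extra observation), but it is an illustration, not scikit-learn's implementation, and the function name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one more observation than the rest
        size = base + 1 if fold < extra else base
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Small example: 10 observations split into 3 folds
for train_idx, val_idx in kfold_indices(n_samples=10, n_splits=3):
    print('train:', train_idx, 'validation:', val_idx)
```

Every observation lands in exactly one validation fold, so each one is used for validation once and for training in all the other iterations.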

Let's do another example comparing two different models (model2 and model3) using cross-validation, with accuracy as our evaluation metric.  Don't worry about the details of model2 and model3 for now; you can treat them as black box models that we want to compare in order to select the better one.

In [None]:
# Let's set aside a test set and use the remainder for training and cross-validation
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=0,test_size=0.15)

In [12]:
# Instantiate the KFold class
from sklearn.model_selection import KFold
kf = KFold(n_splits=10)

# Set up the two models we want to compare: a neural network model and a KNN model
model2 = MLPClassifier(hidden_layer_sizes=(100,50),activation='relu',
                      solver='sgd',learning_rate_init=0.01,max_iter=1000)

from sklearn.neighbors import KNeighborsClassifier
model3 = KNeighborsClassifier(n_neighbors=5)

# For each model, use k-fold cross-validation to calculate the CV accuracy

for model in [model2,model3]:
    print(model)
    
    acc_folds = [] # List to hold the validation fold accuracy at each iteration
    # For each iteration, train the model on the training folds and calculate the accuracy on the validation fold
    for (train_idx,val_idx) in kf.split(X=X_train,y=y_train):

        # Split training and validation sets for each fold
        X_fold_train, X_fold_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_fold_train, y_fold_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

        # Fit model to the training data for this iteration
        model.fit(X_fold_train,y_fold_train)

        # Get predictions for the validation fold and calculate accuracy
        preds = model.predict(X_fold_val)
        acc_val = sum(preds==y_fold_val)/len(y_fold_val)
        
        print('Fold accuracy: {:.3f}'.format(acc_val))

        # Add the accuracy score to the acc_folds list
        acc_folds.append(acc_val)
        
    # Calculate the mean accuracy across all iterations
    mean_acc = np.mean(acc_folds)

    print('Mean cross-validation accuracy across all folds is {:.3f} \n'.format(mean_acc))
        

MLPClassifier(hidden_layer_sizes=(100, 50), learning_rate_init=0.01,
              max_iter=1000, solver='sgd')
Fold accuracy: 0.933
Fold accuracy: 0.600
Fold accuracy: 0.933
Fold accuracy: 0.933
Fold accuracy: 0.800
Fold accuracy: 0.600
Fold accuracy: 0.867
Fold accuracy: 0.667
Fold accuracy: 0.867
Fold accuracy: 0.929
Mean cross-validation accuracy across all folds is 0.813 

KNeighborsClassifier()
Fold accuracy: 0.733
Fold accuracy: 0.533
Fold accuracy: 0.933
Fold accuracy: 0.800
Fold accuracy: 0.667
Fold accuracy: 0.600
Fold accuracy: 0.600
Fold accuracy: 0.667
Fold accuracy: 0.867
Fold accuracy: 0.786
Mean cross-validation accuracy across all folds is 0.719 



As we can see above, the mean cross-validation accuracy of model2 (0.813) is higher than that of model3 (0.719), so we will use model2 as our final model.  Let's now evaluate the performance of model2 on the test set.
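Before relying on the means alone, it can help to look at how much the fold accuracies vary. The sketch below copies the rounded fold accuracies printed above and summarizes them with the standard library; the variable names are chosen for this illustration.

```python
from statistics import mean, stdev

# Fold accuracies copied from the output above (rounded to three decimals)
acc_model2 = [0.933, 0.600, 0.933, 0.933, 0.800, 0.600, 0.867, 0.667, 0.867, 0.929]
acc_model3 = [0.733, 0.533, 0.933, 0.800, 0.667, 0.600, 0.600, 0.667, 0.867, 0.786]

for name, scores in [('model2', acc_model2), ('model3', acc_model3)]:
    print('{}: mean {:.3f}, std {:.3f}, min {:.3f}, max {:.3f}'.format(
        name, mean(scores), stdev(scores), min(scores), max(scores)))
```

Individual folds range from 0.600 to 0.933 for model2 and from 0.533 to 0.933 for model3, which reflects how small each validation fold is. Averaging over folds gives a steadier basis for comparison than any single fold would.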

In [13]:
# Train our selected model on the full training set
model2.fit(X_train,y_train)
    
# Evaluate its performance on the test set
preds_test = model2.predict(X_test)
acc_test = sum(preds_test==y_test)/len(y_test)
print('Accuracy of our model on the test set is {:.3f}'.format(acc_test))

Accuracy of our model on the test set is 0.875
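The manual k-fold loop used above can also be written more compactly with scikit-learn's `cross_val_score`. The sketch below is illustrative only: it uses synthetic stand-in data from `make_classification` (an assumption, so that it runs without downloading the sonar file), so its scores will not match the numbers in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for this sketch), not the sonar dataset
X_demo, y_demo = make_classification(n_samples=176, n_features=60, random_state=0)

kf = KFold(n_splits=10)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit and score the model once per fold, returning one accuracy per fold
scores = cross_val_score(knn, X_demo, y_demo, cv=kf, scoring='accuracy')

print('Fold accuracies:', scores.round(3))
print('Mean cross-validation accuracy: {:.3f}'.format(scores.mean()))
```

In the notebook itself, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the same call could be repeated for each candidate model.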
