<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Model Selection & Evaluation

In this notebook we look at strategies for dividing a dataset into subsets for model selection and testing, in ways that do not bias our measurement of model performance.

We are going to use a dataset from a study that used sonar signals to differentiate between a mine (simulated using a metal cylinder) and a rock. Details on the dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)).

In [1]:
# Import the libraries we know we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import KFold

import warnings
warnings.filterwarnings("ignore")

In [2]:
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
data = pd.read_csv(url, header=None)
print(data.shape)
data.head()

(208, 61)


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.02 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.018 | 0.0084 | 0.009 | 0.0032 | R |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.014 | 0.0049 | 0.0052 | 0.0044 | R |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.228 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0232 | 0.0166 | 0.0095 | 0.018 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 | R |
| 3 | 0.01 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 | ... | 0.0121 | 0.0036 | 0.015 | 0.0085 | 0.0073 | 0.005 | 0.0044 | 0.004 | 0.0117 | R |
| 4 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.059 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 | ... | 0.0031 | 0.0054 | 0.0105 | 0.011 | 0.0015 | 0.0072 | 0.0048 | 0.0107 | 0.0094 | R |


We can see that we have 208 observations (sonar readings), and each observation has 60 features (energy in a particular frequency band summed over a set period of time) and a target value (rock 'R' or mine 'M').  

Let's do one more thing now: set up an instance of our model. We will use a Multi-layer Perceptron classifier (a simple form of neural network). Don't worry about the details for now; we will learn them later. Until then, you can treat this model as a black box.

In [18]:
# Create an instance of the MLPClassifier algorithm and set the hyperparameter values
model = MLPClassifier(hidden_layer_sizes=(100,50,10),activation='tanh',
                      solver='sgd',learning_rate_init=0.001,max_iter=2000, random_state=0)

## Part 1: Training and test sets
In this part, you should complete the following:  
- Split your data into a feature matrix X and a target vector y  
- Split the data into a training set and a test set, using 85% of the data for training and 15% for testing (hint: use scikit-learn's `train_test_split()` function, already imported for you). Name the resulting arrays `X_train, y_train, X_test, y_test`
- Train (fit) your model on the X and y training sets  
- Use your trained model to get predictions on the `X_test` test set, and name the predictions `preds`  
- Finally, run the next code cell to calculate and display the accuracy of your classifier model
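Conceptually, a single hold-out split shuffles the row indices and slices off a fraction of them as the test set. The function below is a hypothetical numpy sketch of that idea, not scikit-learn's implementation (it uses a different random generator, so its indices will not match `train_test_split(..., random_state=0)`). Like scikit-learn, it rounds a fractional test size up, so 15% of 208 rows gives 32 test rows and 176 training rows.

```python
import math
import numpy as np

def holdout_split(n_samples, test_size, seed=0):
    """Hypothetical helper: return (train_idx, test_idx) for one shuffled hold-out split."""
    n_test = math.ceil(test_size * n_samples)  # round the test-set size up
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)  # shuffle row indices before splitting
    return perm[n_test:], perm[:n_test]

train_idx, test_idx = holdout_split(208, 0.15)
print(len(train_idx), len(test_idx))  # 176 32
```

Shuffling matters here: rows in a data file are often grouped (for example by class), and slicing off the last 15% without shuffling could produce a test set that is not representative of the whole dataset.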

In [19]:
### YOUR CODE HERE ###
# Create feature matrix using the first 60 columns as the features
X = data.iloc[:,:60]

# Create target vector from the last column
y = data.iloc[:,60]

X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=0,test_size=0.15)

model.fit(X_train,y_train)

preds = model.predict(X_test)

### END CODE ###

In [20]:
# Evaluate the performance of our model using the test predictions
acc_test = np.sum(preds==y_test)/len(y_test)
print('Accuracy of our classifier on the test set is {:.3f}'.format(acc_test))

Accuracy of our classifier on the test set is 0.812
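Accuracy is simply the fraction of correct predictions. With a 32-row test set (15% of 208 rows, rounded up), an accuracy of 0.812 corresponds to 26 correct predictions (26/32 = 0.8125). The sketch below uses made-up labels and predictions to show the same computation, and that `np.mean` on the boolean comparison is an equivalent shortcut.

```python
import numpy as np

# Hypothetical labels and predictions for a 32-sample test set
y_true = np.array(['R'] * 16 + ['M'] * 16)
y_pred = y_true.copy()
y_pred[:6] = 'M'  # make 6 of the 32 predictions wrong

# Accuracy = number of correct predictions / number of predictions
acc = np.sum(y_pred == y_true) / len(y_true)
print('{:.3f}'.format(acc))  # 0.812, i.e. 26 of 32 correct

# The mean of the boolean comparison gives the same value
print(np.mean(y_pred == y_true) == acc)  # True
```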


## Part 2: Model selection using validation sets
But what if we want to compare different models (for example, evaluate different algorithms or fine-tune our hyperparameters)?  Can we use the same strategy of training each model on the training data and then comparing their performance on the test set to select the best model?

When we optimize models by tuning hyperparameters or comparing different algorithms, best practice is to compare the candidate models on a separate "validation" set and to reserve the test set for evaluating only the final model we select. If we picked the model with the best test-set score, the test set would become part of the selection process, and its score would be an optimistically biased estimate of how the model performs on new data. To use this approach, we split our data three ways: a training set, a validation set, and a test set.
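The selection bias can be seen in a small simulation where everything is hypothetical: 50 "models" that guess binary labels at random all have a true accuracy of 0.5, yet the one with the highest score on a 27-sample validation set (sized like the validation set used below) looks better than chance on that set. Scoring the selected model on a separate test set gives an estimate that is not inflated by the selection step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_models = 27, 32, 50

# Hypothetical setup: random binary labels, and 50 "models" that guess at random,
# so every model's true accuracy is 0.5
y_val = rng.integers(0, 2, n_val)
y_test = rng.integers(0, 2, n_test)
val_preds = rng.integers(0, 2, (n_models, n_val))
test_preds = rng.integers(0, 2, (n_models, n_test))

# Model selection: keep the model with the highest validation accuracy
val_acc = (val_preds == y_val).mean(axis=1)
best = int(np.argmax(val_acc))
test_acc = (test_preds[best] == y_test).mean()

print('Selected model, validation accuracy: {:.3f}'.format(val_acc[best]))  # inflated by selection
print('Selected model, test accuracy:       {:.3f}'.format(test_acc))       # not used for selection
```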

To illustrate this, let's compare two different models, which are defined for you below.

In [21]:
# Create an instance of each model we want to evaluate

model1 = MLPClassifier(hidden_layer_sizes=(100,50,10),activation='tanh',
                      solver='sgd',learning_rate_init=0.001,max_iter=2000, random_state=0)

model2 = MLPClassifier(hidden_layer_sizes=(100,50),activation='relu',
                      solver='sgd',learning_rate_init=0.01,max_iter=2000, random_state=0)

In this part you should complete the following:  
- Split your X and y arrays into a training set and a test set, using 15% of data for the test set.  Store the training data as `X_train_full, y_train_full` and the test set data as `X_test, y_test`
- Now, split your training set again into a training set and a validation set, using 15% of the training set for the new validation set (and the remaining 85% is still available for training). Store the final training data as `X_train, y_train` and the validation set data as `X_val, y_val`
- Train (fit) model1 and model2 using the training data only  
- Now, use your trained model1 and model2 to generate predictions on the validation set.  Store model1's predictions as `val_preds_model1` and model2's predictions as `val_preds_model2`  
- Finally, run the code cell below to calculate the accuracy of each on the validation set.  Based on this, which model would you select as your final model?
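The nested split shrinks the training set twice. Assuming scikit-learn's rule of rounding a fractional `test_size` up, this small calculation (with a hypothetical helper `split_sizes`) shows the resulting set sizes for the 208-row sonar data:

```python
import math

def split_sizes(n_samples, test_size):
    """Hypothetical helper: (n_train, n_test) for a fractional test_size, test size rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train_full, n_test = split_sizes(208, 0.15)     # first split: (train + validation) vs. test
n_train, n_val = split_sizes(n_train_full, 0.15)  # second split: train vs. validation
print(n_train, n_val, n_test)  # 149 27 32
```

So the models in this part are fit on 149 rows, about 72% of the data, which is why the selected model is retrained on the combined training and validation data before its final test.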

In [22]:
### YOUR CODE HERE ###

# Split data first into training and testing to get test set using 15% of data for test
X_train_full,X_test,y_train_full,y_test = train_test_split(X, y, random_state=0,test_size=0.15)

# Now split the training set again into training and validation, using 15% of training data for validation
X_train,X_val,y_train,y_val = train_test_split(X_train_full,y_train_full,random_state=0,test_size=0.15)

# Compare the performance of the two models using the validation set
model1.fit(X_train,y_train)
val_preds_model1 = model1.predict(X_val)

model2.fit(X_train,y_train)
val_preds_model2 = model2.predict(X_val)

### END CODE ###

Now let's compare the validation accuracy of the two models to determine which one performs better.

In [23]:
# Calculate the validation accuracy of each model
acc_val_model1 = np.sum(val_preds_model1==y_val)/len(y_val)
acc_val_model2 = np.sum(val_preds_model2==y_val)/len(y_val)

print('Accuracy of model1 on the validation set is {:.3f}'.format(acc_val_model1))
print('Accuracy of model2 on the validation set is {:.3f}'.format(acc_val_model2))

Accuracy of model1 on the validation set is 0.778
Accuracy of model2 on the validation set is 0.889


Now that we've chosen our final model (model2, which has the higher validation accuracy), we can use the test set to evaluate its performance. Before we do that, let's retrain it on the combined training and validation data (`X_train_full, y_train_full`).

In [24]:
# Train our selected model on the training plus validation sets
model2.fit(X_train_full,y_train_full)

# Evaluate its performance on the test set
preds_test = model2.predict(X_test)
acc_test = np.sum(preds_test==y_test)/len(y_test)
print('Accuracy of our model on the test set is {:.3f}'.format(acc_test))

Accuracy of our model on the test set is 0.875


## Part 3: Model selection using cross-validation

A common approach to comparing and optimizing models is to use cross-validation, rather than a single validation set, to compare model performance. We then select the better model based on its cross-validation performance and use the test set to measure that model's performance.
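To see what `KFold(n_splits=3)` does with the 176 training rows, here is a simplified numpy re-implementation of unshuffled k-fold splitting (a sketch, not scikit-learn's code; `kfold_indices` is a hypothetical name). Each contiguous block of rows takes a turn as the validation fold, and the remainder of 176 / 3 is spread over the first folds, giving validation folds of 59, 59 and 58 rows.

```python
import numpy as np

def kfold_indices(n_samples, n_splits):
    """Hypothetical helper: yield (train_idx, val_idx) pairs using contiguous, unshuffled folds."""
    fold_sizes = np.full(n_splits, n_samples // n_splits)
    fold_sizes[: n_samples % n_splits] += 1  # spread the remainder over the first folds
    indices = np.arange(n_samples)
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]  # this block is the validation fold
        train_idx = np.concatenate([indices[:start], indices[start + size:]])  # the rest trains
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in kfold_indices(176, 3):
    print(len(train_idx), len(val_idx), val_idx[0], val_idx[-1])
# 117 59 0 58
# 117 59 59 117
# 118 58 118 175
```

Every row lands in exactly one validation fold, so averaging the fold accuracies uses all of the training data for validation while never scoring a model on rows it was fit on.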

In [38]:
# Let's set aside a test set and use the remainder for training and cross-validation
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=0,test_size=0.15)

# Set up the two models we want to compare: a neural network model and a KNN model
model_a = MLPClassifier(hidden_layer_sizes=(100,50),activation='relu',
                      solver='sgd',learning_rate_init=0.01,max_iter=1000,random_state=0)

model_b = KNeighborsClassifier(n_neighbors=5)

### YOUR CODE HERE ###

# Instantiate a KFold splitter, which lets you iterate through the data 3 times, splitting the data
# into training folds and a validation fold each time

kf = KFold(n_splits=3)


# For each model, use K-folds cross validation to calculate the cross-validation accuracy
for model in [model_a,model_b]:
    print(model)
    
    # List to hold the validation fold accuracy at each iteration of the below cross-validation loop
    acc_folds = [] 
    
    # For each iteration, train the model on the training folds and calculate the accuracy on the validation folds
    for (train_idx,val_idx) in kf.split(X=X_train,y=y_train):
        
        # Use the indices to get the training and validation sets for each fold
        X_fold_train, X_fold_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_fold_train, y_fold_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

        # Fit model to the training data for this iteration
        model.fit(X_fold_train,y_fold_train)

        # Get predictions for the validation fold and calculate validation fold accuracy for this iteration
        preds = model.predict(X_fold_val)
        acc_val = np.sum(preds==y_fold_val)/len(y_fold_val)
        
        print('Fold accuracy: {:.3f}'.format(acc_val))

        # Add the accuracy score of this iteration to the acc_folds list
        acc_folds.append(acc_val)
        
    # Calculate the mean validation accuracy across all iterations
    mean_acc = np.mean(acc_folds)
    
    print('Mean cross-validation accuracy across all folds is {:.3f} \n'.format(mean_acc))
    
### END CODE ###
        

MLPClassifier(hidden_layer_sizes=(100, 50), learning_rate_init=0.01,
              max_iter=1000, random_state=0, solver='sgd')
Fold accuracy: 0.780
Fold accuracy: 0.847
Fold accuracy: 0.845
Mean cross-validation accuracy across all folds is 0.824 

KNeighborsClassifier()
Fold accuracy: 0.712
Fold accuracy: 0.763
Fold accuracy: 0.776
Mean cross-validation accuracy across all folds is 0.750 



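Inspecting the `KFold` object shows its default settings. With `shuffle=False`, each validation fold is a contiguous block of rows in their existing order; here that order is already random, because `train_test_split` shuffles the data before splitting it.

In [3]: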
kf = KFold(n_splits=3)

In [4]:
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [41]:
# Alternative approach: let cross_val_score run the cross-validation loop for us
# Let's set aside a test set and use the remainder for training and cross-validation
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=0,test_size=0.15)

# Set up the two models we want to compare: a neural network model and a KNN model
model_a = MLPClassifier(hidden_layer_sizes=(100,50),activation='relu',
                      solver='sgd',learning_rate_init=0.01,max_iter=1000,random_state=0)

model_b = KNeighborsClassifier(n_neighbors=5)

# Cross-validation using cross_val_score
from sklearn.model_selection import cross_val_score
for model in [model_a,model_b]:
    scores = cross_val_score(model,X_train,y_train,scoring="accuracy",cv=3)
    mean_score = np.mean(scores)
    print(model)
    print('Mean cross-validation accuracy across all folds is {:.3f} \n'.format(mean_score))

MLPClassifier(hidden_layer_sizes=(100, 50), learning_rate_init=0.01,
              max_iter=1000, random_state=0, solver='sgd')
Mean cross-validation accuracy across all folds is 0.796 

KNeighborsClassifier()
Mean cross-validation accuracy across all folds is 0.756 
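The means reported by `cross_val_score` differ from those of the manual loop because the folds differ: when the estimator is a classifier and `cv` is an integer, scikit-learn splits with `StratifiedKFold`, which keeps each fold's class proportions close to those of the full training set, whereas the manual loop used plain `KFold`. The sketch below checks this on synthetic data from `make_classification` (a stand-in, not the sonar data): `cv=3` gives the same scores as passing `StratifiedKFold(n_splits=3)` explicitly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with the same number of rows and features as our training set
X_demo, y_demo = make_classification(n_samples=176, n_features=60, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

# Integer cv: scikit-learn chooses the splitter (stratified, for a classifier)
scores_int = cross_val_score(knn, X_demo, y_demo, scoring="accuracy", cv=3)

# Explicit splitters for comparison
scores_skf = cross_val_score(knn, X_demo, y_demo, scoring="accuracy", cv=StratifiedKFold(n_splits=3))
scores_kf = cross_val_score(knn, X_demo, y_demo, scoring="accuracy", cv=KFold(n_splits=3))

print(scores_int)
print(scores_skf)  # same as scores_int
print(scores_kf)   # generally different, because the folds differ
```

If you want `cross_val_score` to reproduce the manual loop exactly, pass the same splitter object to it, e.g. `cv=kf`.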



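With either approach, model_a has a higher mean cross-validation accuracy than model_b (the exact means differ because the two approaches split the training data into different folds), so we will use model_a. Let's now evaluate the performance of model_a on the test set.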

In [42]:
# Train our selected model on the full training set
model_a.fit(X_train,y_train)
    
# Evaluate its performance on the test set
preds_test = model_a.predict(X_test)
acc_test = np.sum(preds_test==y_test)/len(y_test)
print('Accuracy of our model on the test set is {:.3f}'.format(acc_test))

Accuracy of our model on the test set is 0.875
