<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Model Selection & Evaluation

In this notebook we look at strategies for dividing a dataset into subsets so that model selection and testing do not bias our measurement of model performance.

We will use a dataset from a study that used sonar signals to differentiate between a mine (simulated using a metal cylinder) and a rock.  Details on the dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)).

In [1]:
# Import the libraries we know we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import KFold

import warnings
warnings.filterwarnings("ignore")

In [2]:
def load_data(url):
    # Load the data
    data = pd.read_csv(url, header=None)
    print(data.shape)
    display(data.head())

    # Separate into X and y 
    # Create feature matrix using the first 60 columns as the features
    X = data.iloc[:,:60].to_numpy()
    # Create target vector from the last column
    y = data.iloc[:,60].to_numpy()

    return X,y

X,y = load_data('https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv')

(208, 61)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,51,52,53,54,55,56,57,58,59,60
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,R


## Part 1: Training and test sets
First, complete the `split()` function, which does the following: 
- Takes the feature matrix X and target vector y created by `load_data()` above  
- Then splits the data into a training set and a test set, using the fraction `pct` of the data (e.g. 0.15 for 15%) for the test set.  Use `random_state=0` while splitting for repeatability.  Use the `stratify` parameter to ensure that the splits contain the same distribution of labels as the original data.

Then, complete the function `run_model()` which does the following:
- Trains (fit) your model on the training data 
- Uses your trained model to get predictions on the `X_test` test set and returns the predictions 

Finally, run the next code cell to calculate and display the accuracy of your classifier model.

In [3]:
def split(X,y,pct):
    '''
    Splits the data into training and test sets

    Inputs:
        X(np.ndarray): array of input data
        y(np.ndarray): array of targets
        pct(float): fraction of data to use for the test set (e.g. 0.15 for 15%)

    Returns:
        X_train(np.ndarray): training set inputs
        y_train(np.ndarray): training set targets
        X_test(np.ndarray): test set inputs
        y_test(np.ndarray): test set targets
    '''
    ### BEGIN SOLUTION ###
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=pct, random_state=0, stratify=y)
    return X_train, y_train, X_test, y_test
    ### END SOLUTION ###

def run_model(X_train,y_train,X_test,model):
    '''
    Trains a model on the training data and then generates and returns predictions on the test set

    Inputs:
        X_train(np.ndarray): training set inputs
        y_train(np.ndarray): training set targets
        X_test(np.ndarray): test set inputs
        model(sklearn.base.BaseEstimator): instantiated scikit-learn model object

    Returns:
        preds(np.ndarray): numpy array containing the model predictions for the test set
    '''
    ### BEGIN SOLUTION ###
    model.fit(X_train,y_train)
    preds = model.predict(X_test)
    return preds
    ### END SOLUTION ###

In [4]:
# Create an instance of the MLPClassifier algorithm and set the hyperparameter values
model = MLPClassifier(hidden_layer_sizes=(100,50,10),activation='tanh',
                      solver='sgd',learning_rate_init=0.001,max_iter=2000, random_state=0)
                      
# Evaluate the performance of our model using the test predictions
X_train, y_train, X_test, y_test = split(X,y,pct=0.15)
preds = run_model(X_train,y_train,X_test,model)
assert len(preds) == len(y_test)
acc_test = np.sum(preds==y_test)/len(y_test)
print('Accuracy of our classifier on the test set is {:.3f}'.format(acc_test))

Accuracy of our classifier on the test set is 0.750
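
To see what the `stratify=y` argument in `split()` does, here is a minimal sketch on synthetic labels. The arrays `X_demo` and `y_demo` and the helper `minority_fraction()` are assumptions made up for illustration and are not part of the sonar data. With stratification, the minority class should make up the same share of the training and test subsets as it does of the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (assumption for illustration): 80 'R' and 20 'M'
y_demo = np.array(['R'] * 80 + ['M'] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

def minority_fraction(labels):
    # Fraction of samples labelled 'M'
    return np.mean(labels == 'M')

# Stratified split: class proportions are preserved in both subsets
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

print('Full data :', minority_fraction(y_demo))
print('Train     :', minority_fraction(y_tr_strat))
print('Test      :', minority_fraction(y_te_strat))
```

Without `stratify`, the class proportions in a small test set can drift from those of the full data purely by chance, which makes the accuracy estimate noisier.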


## Part 2: Model selection using validation sets
But what if we want to compare different models (for example, evaluate different algorithms or fine-tune our hyperparameters)?  Can we use the same strategy of training each model on the training data and then comparing their performance on the test set to select the best model?

When we are seeking to optimize models by tuning hyperparameters or comparing different algorithms, it is a best practice to do so by comparing the performance of your model options using a "validation" set, and then reserve use of the test set to evaluate the performance of the final model you have selected.  To utilize this approach we must split our data three ways to create a training set, validation set, and test set.
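
The reason for keeping the test set out of model selection can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "models" are hypothetical random guessers (so each has a true accuracy of 0.5), and `n_models` is an arbitrary choice. Picking the best of many candidates on the same small test set tends to produce an accuracy well above what any of them would achieve on new data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_test = 32      # size of a 15% test split of 208 samples
n_models = 200   # hypothetical number of candidate models

# Random binary labels for the pretend test set
y_true = rng.integers(0, 2, size=n_test)

# Each "model" guesses at random, so its true accuracy is 0.5
guesses = rng.integers(0, 2, size=(n_models, n_test))
accs = (guesses == y_true).mean(axis=1)

print('Mean accuracy over all models: {:.3f}'.format(accs.mean()))
print('Accuracy of the "best" model : {:.3f}'.format(accs.max()))
```

The gap between the two printed numbers is pure selection bias. Choosing on a validation set and reporting on an untouched test set avoids carrying that bias into the final performance estimate.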

To illustrate this, let's compare two different models.  Complete the function below which performs the following:
- Split your training set again into a training set and a validation set, using 15% of the training set for the new validation set (and the remaining 85% is still available for training). Use the `stratify` parameter to ensure that the splits contain the same distribution of labels as the original data.
- Train (fit) the models provided as inputs on the training data
- Now, use each of your trained models to generate predictions on the validation set inputs and calculate the accuracy of the predictions for each model.  Return the model with the higher validation set accuracy

In [5]:
def compare_models(models,X_train,y_train):
    '''
    Compares models using a validation set and returns the model with the highest validation set performance

    Inputs:
        models(list): list of instantiated models to compare
        X_train(np.ndarray): training set inputs
        y_train(np.ndarray): training set targets

    Returns:
        best_model(sklearn.base.BaseEstimator): model with the highest validation set performance
    '''
    ### BEGIN SOLUTION ###
    best_acc = 0
    best_model = None
    
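    # Hold out 15% of the training data as a stratified validation set
    X_train_new, X_val, y_train_new, y_val = train_test_split(X_train, y_train, test_size=0.15,
                                                              random_state=0, stratify=y_train)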
    for model in models:
        model.fit(X_train_new,y_train_new)
        preds = model.predict(X_val)
        assert len(preds) == len(y_val)
        acc_val = np.sum(preds==y_val)/len(y_val)
        if acc_val >= best_acc:
            best_acc = acc_val
            best_model = model
    return best_model
    ### END SOLUTION ###

In [6]:
X_train_full,y_train_full,X_test,y_test = split(X,y,pct=0.15)

# Create an instance of each model we want to evaluate
model1 = MLPClassifier(hidden_layer_sizes=(100,50,10),activation='tanh',
                      solver='sgd',learning_rate_init=0.001,max_iter=2000, random_state=0)

model2 = MLPClassifier(hidden_layer_sizes=(100,50),activation='relu',
                      solver='sgd',learning_rate_init=0.01,max_iter=2000, random_state=0)

models = [model1,model2]
best_model = compare_models(models,X_train_full,y_train_full)

Now that we've chosen our final model, we can use the test set to evaluate its performance.  Before we do that, let's retrain our model using the training plus validation data. Since we are now done with model comparison, we can use the validation set as part of our training data for our final model.

In [7]:
# Train our selected model on the full training data (training plus validation sets)
best_model.fit(X_train_full,y_train_full)

# Evaluate its performance on the test set
preds_test = best_model.predict(X_test)
acc_test = sum(preds_test==y_test)/len(y_test)
print('Accuracy of our model on the test set is {:.3f}'.format(acc_test))

Accuracy of our model on the test set is 0.844


## Part 3: Model selection using cross-validation

A common approach to comparing and optimizing models is to use cross-validation rather than a single validation set to compare model performance.  Complete the function `run_kfolds()` below, which performs k-fold cross-validation on the models provided as inputs.  Your function should use `nsplits` folds in the cross-validation and validation accuracy as the comparison metric.  After your function calculates the mean validation set accuracy for each model, it should return the model with the best performance.
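
As a rough illustration of what `KFold` produces, the sketch below splits ten placeholder samples into five folds and prints the row indices for each. The array `X_demo` and the counter `val_counts` are assumptions for illustration only. Each sample should land in the validation fold exactly once, and the rest of the data is used for training in that round.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # 10 placeholder samples

kf = KFold(n_splits=5)
val_counts = np.zeros(len(X_demo), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo)):
    # train_idx and val_idx are integer positions into the rows of X_demo
    print('Fold {}: train={} val={}'.format(fold, train_idx, val_idx))
    val_counts[val_idx] += 1

# Each sample was used for validation this many times
print(val_counts)
```

Because `kf.split()` yields integer positions, they can be used to index NumPy arrays directly, for example `X_train[train_idx]`.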


In [None]:
def run_kfolds(models,X_train,y_train,nsplits):
    '''
    Performs k-folds cross validation on an arbitrary number of models provided as inputs and returns the model with the highest 
    validation set accuracy

    Inputs:
        X_train(np.ndarray): numpy array containing the training set features
        y_train(np.ndarray): numpy array containing the training set labels
        models(list): list of instantiated scikit-learn model objects to compare
        nsplits(int): number of folds for cross-validation

    Returns:
        best_model(sklearn.base.BaseEstimator): model with the highest cross-validation accuracy
    '''

    ### BEGIN SOLUTION ###
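    kf = KFold(n_splits=nsplits)
    best_acc = -1
    best_model = None
    for model in models:
        fold_accs = []
        for train_idx, val_idx in kf.split(X=X_train, y=y_train):
            # Select the training and validation rows for this fold
            X_fold_train, X_fold_val = X_train[train_idx], X_train[val_idx]
            y_fold_train, y_fold_val = y_train[train_idx], y_train[val_idx]
            # Fit on the training folds and score on the held-out fold
            model.fit(X_fold_train, y_fold_train)
            preds = model.predict(X_fold_val)
            fold_accs.append(np.mean(preds == y_fold_val))
        # Mean validation accuracy across all folds for this model
        mean_acc = np.mean(fold_accs)
        print('Mean validation accuracy for {}: {:.3f}'.format(type(model).__name__, mean_acc))
        if mean_acc > best_acc:
            best_acc = mean_acc
            best_model = model
    return best_model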
    ### END SOLUTION ###
            

In [None]:
# Set up the two models we want to compare: a neural network model and a logistic regression model
model2 = MLPClassifier(hidden_layer_sizes=(100,50),activation='relu',
                    solver='sgd',learning_rate_init=0.01,max_iter=1000)
model3 = LogisticRegression()
models = [model2,model3]

# Split data
X_train, y_train, X_test, y_test = split(X,y,pct=0.15)

# Run cross validation
best_model = run_kfolds(models,X_train,y_train,nsplits=10)

As we can see above, the cross-validation accuracy of model2 is higher than that of model3, so we will use model2 as our best model.  Let's now evaluate its performance on the test set.

In [None]:
# Retrain our selected model on the full training set and generate predictions on the test set
preds = run_model(X_train,y_train,X_test,best_model)

# Evaluate its performance on the test set
acc_test = np.sum(preds==y_test)/len(y_test)
print('Accuracy of our model on the test set is {:.3f}'.format(acc_test))
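
A hand-written loop like `run_kfolds()` can be cross-checked against scikit-learn's built-in `cross_val_score`. The sketch below is one possible way to do this, not the notebook's method: it uses synthetic data from `make_classification` (an assumption, standing in for the sonar training set) and swaps in `StratifiedKFold`, which keeps the label distribution similar across folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# 10 stratified folds, shuffled with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy value per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=cv, scoring='accuracy')

print('Fold accuracies:', scores.round(3))
print('Mean accuracy  : {:.3f}'.format(scores.mean()))
```

Looking at the spread of the per-fold accuracies, and not only their mean, gives a sense of how much the estimate depends on which samples end up in each fold.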