# Introduction to Model Selection

-----

With the growth of machine learning, there is an ever-growing list of algorithms that can be used to model data. In previous notebooks, we introduced many of the most popular machine learning algorithms, including linear and logistic regression, support vector machines, decision trees, k-nearest neighbors, and ensemble techniques like random forests. Each of these algorithms has its own set of hyperparameters, whose values must be carefully selected for optimal performance.

As a result, selecting the best model and associated set of hyperparameters can be a daunting task. In this notebook, we explore the topic of [**Model Selection**][skms] by first manually evaluating one hyperparameter for one machine learning algorithm on a specific data set. This introduces the basic concepts required to perform model selection, before we move on to more automated techniques that build on cross-validation, which was introduced in the previous lesson. Following this, we will look at grid searches to find the best combinations of multiple hyperparameters. Finally, we will look at additional model selection techniques such as random hyperparameter searches.

-----
[skms]: http://scikit-learn.org/stable/modules/grid_search.html



## Table of Contents

[Model Selection](#Model-Selection)

[Grid Search](#Grid-Search)

- [Multi-Dimensional Grid Search](#Multi-Dimensional-Grid-Search)

[Randomized Grid Search](#Randomized-Grid-Search)

-----

Before proceeding with the rest of this Notebook, we first have our standard notebook setup code.

-----

In [1]:
# Set up Notebook

%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress all warnings (including Pandas warnings) to keep the output clean
import warnings
warnings.filterwarnings("ignore")

# Set global figure properties
import matplotlib as mpl
mpl.rcParams.update({'axes.titlesize' : 20,
                     'axes.labelsize' : 18,
                     'legend.fontsize': 16})

# Set default seaborn plotting style
sns.set_style('white')

# Some cells take a while to run, so we will time them
from time import time

-----

[[Back to TOC]](#Table-of-Contents)

## Model Selection

Formally, model selection is the task of choosing the best machine learning model for a given data set. Thus, to get started, we first need a data set to which we can apply machine learning algorithms. For this purpose we will use the adult income data set. The first Code cell below loads and prepares the data set.

For the purpose of simplicity, we will use only one machine learning algorithm in this notebook, and focus on finding the best hyperparameters for this one algorithm on these data. The machine learning algorithm we will use is k-nearest neighbors classification. KNN works better with normalized data, but for simplicity, we will bypass this step.

In this simple example, we will evaluate only one hyperparameter, `n_neighbors`, for this one algorithm. On a real-world problem, however, we would likely evaluate multiple algorithms, each with various hyperparameter combinations, in order to determine the best model.
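
Although this notebook skips normalization, the sketch below illustrates one way the step could be included without leaking validation data: wrapping a scaler and the KNN classifier in a pipeline so that the scaler is refit on each training fold. The synthetic data, the exaggerated feature scale, and the choice of `n_neighbors=11` are assumptions made purely for illustration. Whether scaling actually helps depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is not the adult income data)
X, y = make_classification(n_samples=500, n_features=6, random_state=23)

# Inflate the scale of one feature so it dominates raw Euclidean distances
X[:, 0] *= 1000

# KNN on raw features versus KNN preceded by standardization
raw_knn = KNeighborsClassifier(n_neighbors=11)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11))

# The pipeline refits the scaler inside every cross-validation fold
raw_scores = cross_val_score(raw_knn, X, y, cv=5)
scaled_scores = cross_val_score(scaled_knn, X, y, cv=5)

print(f'Raw features:    mean accuracy = {raw_scores.mean():.3f}')
print(f'Scaled features: mean accuracy = {scaled_scores.mean():.3f}')
```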

-----


In [2]:
from sklearn.preprocessing import LabelEncoder

# Read CSV data
adult_data = pd.read_csv('data/adult_income.csv')

# Create label column, one for >50K, zero otherwise.
adult_data['Label'] = adult_data['Salary'].map(lambda x : 1 if '>50K' in x else 0)

# Select categorical features (with string values)
categorical_features = adult_data[['Workclass', 'Education', 'MaritalStatus', 
               'Occupation', 'Relationship', 'Race', 'Sex', 'NativeCountry']]

# Encode each categorical column as integer codes
categorical_features = categorical_features.apply(LabelEncoder().fit_transform)

# Extract numerical features
numerical_features = adult_data[['Age', 'FNLWGT', 'EducationLevel', 'CapitalGain', 'CapitalLoss', 'HoursPerWeek']]

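# Combine numerical and encoded categorical features
all_features = pd.concat([numerical_features, categorical_features], axis=1)

# Keep the subset of features used for modeling
features = all_features[['Age', 'EducationLevel', 'MaritalStatus', 'Occupation', 
                         'Relationship', 'CapitalGain', 'HoursPerWeek']]

# Target labels
label = adult_data['Label']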

# Display five sample rows (for display purposes only)
pd.concat([features, label], axis=1).sample(5, random_state=2)

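| | Age | EducationLevel | MaritalStatus | Occupation | Relationship | CapitalGain | HoursPerWeek | Label |
|---|---|---|---|---|---|---|---|---|
| 3846 | 22 | 10 | 4 | 10 | 4 | 0 | 20 | 0 |
| 848 | 55 | 14 | 2 | 4 | 0 | 0 | 60 | 1 |
| 1658 | 34 | 12 | 2 | 1 | 0 | 0 | 28 | 0 |
| 3415 | 53 | 9 | 0 | 1 | 1 | 0 | 40 | 0 |
| 3678 | 31 | 9 | 2 | 1 | 2 | 0 | 35 | 0 |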


---
### Hold Unused Test Set

The next step is to split the dataset into training and test sets. We will use only the training set for model selection, because the test set is reserved for evaluating the final trained model and should never be used to change the model or its hyperparameters. The whole process is:
1. Split the data into training and test sets.
2. Select the model hyperparameter values with the training set only.
3. Train the model with the optimal hyperparameter values on the training set.
4. Evaluate the trained model with the test set.

In this lesson, we will only do the first two steps to focus on model selection. 

In the next Code cell, we split the dataset into training and test sets. In the rest of this notebook, we use only the training set to select model hyperparameters.

---

In [3]:
from sklearn.model_selection import train_test_split

d_train, d_test, l_train, l_test = train_test_split(features, label, test_size=0.3, random_state=23)
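
This notebook covers only the first two steps of the process listed above. For completeness, the following is a minimal sketch of all four steps on synthetic data. The data set, the candidate `n_neighbors` values, and the variable names are assumptions made for illustration, not part of this lesson's analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; this is not the adult income data)
X, y = make_classification(n_samples=600, n_features=7, random_state=23)

# Step 1: hold out a test set that is not touched during model selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=23)

# Step 2: select the hyperparameter value using the training set only
candidates = [1, 5, 11, 31]
cv_means = {}
for n in candidates:
    knc = KNeighborsClassifier(n_neighbors=n)
    cv_means[n] = cross_val_score(knc, X_train, y_train, cv=5).mean()
best_n = max(cv_means, key=cv_means.get)

# Step 3: retrain on the full training set with the selected value
final_model = KNeighborsClassifier(n_neighbors=best_n).fit(X_train, y_train)

# Step 4: evaluate once on the held-out test set
test_accuracy = final_model.score(X_test, y_test)

print(f'Selected n_neighbors = {best_n}')
print(f'Test accuracy = {test_accuracy:.3f}')
```

The test score is computed only once, after the hyperparameter has been fixed, so it remains an unbiased estimate of how the selected model performs on unseen data.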

-----
### Select Model with Cross Validation

The next step is to create a cross-validation iterator that will be applied to the training data to evaluate the model hyperparameters. In this case, we employ a stratified k-fold cross-validation technique to maintain the class balance within each fold. We also specify 10 folds (via the `n_splits` parameter), which will produce 10 unique training and validation data sets for each hyperparameter value.
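
To make the effect of stratification concrete, the short sketch below builds 10 stratified folds on a toy label array and prints the fraction of positive labels in each validation fold. The label counts (80 zeros and 20 ones) are an assumption chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (an assumption for illustration): 80 zeros, 20 ones
y = np.array([0] * 80 + [1] * 20)

# Placeholder features; only the labels determine how the folds are built
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=10)

positive_fractions = []
for train_idx, valid_idx in skf.split(X, y):
    # Fraction of the positive class in this validation fold
    fraction = y[valid_idx].mean()
    positive_fractions.append(fraction)
    print(f'validation size = {len(valid_idx)}, positive fraction = {fraction:.2f}')
```

Each validation fold holds 10 samples with a positive fraction of 0.20, matching the 20% positive rate of the full label array.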

We create an array of nine `n_neighbors` values between 1 and 107 for the model selection process. We pick these values arbitrarily, for demonstration purposes only. You could instead test every integer under 100, but the whole process would take much longer to run.

We then iterate through the array of hyperparameter values, computing and displaying the mean cross-validation score for each value. Finally, we display the total time taken for this specific cross-validation.

For only nine different values of `n_neighbors`, the code takes almost 2 seconds to run, as shown below.

-----

In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

start = time()
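# random_state only applies when shuffle=True (recent scikit-learn versions
# raise an error otherwise), so it is omitted for these unshuffled folds
skf = StratifiedKFold(n_splits=10)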

neighbors = [1, 3, 5, 11, 17, 23, 31, 53, 107]
for n in neighbors:
    knc = KNeighborsClassifier(n_neighbors=n)
    score = cross_val_score(knc, d_train, l_train, cv=skf)
    print(f'neighbors={n}, score={np.mean(score)*100:4.1f}%')
# Display compute time
print(f'Compute time = {time() - start:4.2f} seconds.')

neighbors=1, score=79.7%
neighbors=3, score=81.5%
neighbors=5, score=82.5%
neighbors=11, score=82.8%
neighbors=17, score=82.7%
neighbors=23, score=82.9%
neighbors=31, score=83.1%
neighbors=53, score=82.9%
neighbors=107, score=82.3%
Compute time = 1.91 seconds.


-----

The results above show that, among the values tested, 31 neighbors gives the best accuracy score. Thus 31 is our best value for the KNN hyperparameter `n_neighbors`. Keep two things in mind, however. First, this is only the best value among the `n_neighbors` values we tested. Second, this is based on the accuracy score only. Switching to another scoring function such as `'precision'`, `'recall'`, or `'roc_auc'` may give different results.
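
As a rough illustration of how the preferred hyperparameter value can depend on the metric, the sketch below repeats the cross-validation loop with several `scoring` strings. The synthetic imbalanced data and the three candidate `n_neighbors` values are assumptions, so the specific winners it prints say nothing about the adult income data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data (an assumption; roughly 25% positive labels)
X, y = make_classification(n_samples=600, n_features=7,
                           weights=[0.75, 0.25], random_state=23)

skf = StratifiedKFold(n_splits=5)
neighbors = [1, 11, 53]

results = {}
for metric in ['accuracy', 'precision', 'recall', 'roc_auc']:
    results[metric] = {}
    for n in neighbors:
        knc = KNeighborsClassifier(n_neighbors=n)
        # The scoring argument selects the metric computed on each fold
        scores = cross_val_score(knc, X, y, cv=skf, scoring=metric)
        results[metric][n] = scores.mean()
    best = max(results[metric], key=results[metric].get)
    print(f'{metric:>9}: best n_neighbors = {best} '
          f'(score = {results[metric][best]:.3f})')
```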

-----

<font color='red' size = '5'> Student Exercise </font>

In the preceding cells, we applied cross-validation using a KNN classification estimator to the adult income data set to determine the best `n_neighbors` hyperparameter. Now that you have seen this technique work, try making the following changes and see if you can explain the results.

1. Set the `scoring` argument of `cross_val_score` to `'roc_auc'`.
2. Change the algorithm to a random forest classifier, and vary hyperparameters such as `n_estimators` or `max_features`.

-----

[[Back to TOC]](#Table-of-Contents)

## Grid Search

Many machine learning algorithms have hyperparameters that can be adjusted to tune the performance of the algorithm on a particular data set, for example, the `n_neighbors` parameter in KNN. While in some cases there is a theoretical justification for a particular parameter value when applied to a specific data set, in many cases, we must determine the parameter values programmatically. With multiple parameters, however, this process can quickly become tedious.

Rather than repeatedly changing parameter values and computing the resulting model scores, a better approach is to employ a grid search. In a grid search, one defines a grid of parameter values, applies the model over all possible parameter value combinations in the grid, and identifies the set of parameters that produces the best model performance score. The scikit-learn library provides a [`GridSearchCV`][skgs] object that performs a grid search by using cross-validation to score each parameter combination and then reports the best combination and its score.

In the following Code cell, we demonstrate using `GridSearchCV` to compute the best value for the `n_neighbors` parameter when running the KNN classifier on the adult income data.

-----
[skgs]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [5]:
from sklearn.model_selection import GridSearchCV

# Start clock
start = time()

# Unshuffled stratified folds (random_state is only valid with shuffle=True)
skf = StratifiedKFold(n_splits=10)

knc = KNeighborsClassifier()

# Create a dictionary of hyperparameters and values
neighbors = [1, 3, 5, 11, 17, 23, 31, 53, 107]
params = {'n_neighbors':neighbors}

# Create grid search cross validator
gse = GridSearchCV(estimator=knc, param_grid=params, cv=skf)

# Fit estimator
gse.fit(d_train, l_train)

# Display time and best estimator results.
print(f'Compute time = {time() - start:4.2f} seconds.\n')

print(f'Best n_neighbors={gse.best_estimator_.get_params()["n_neighbors"]:5.4f}')
print(f'Best CV Score = {gse.best_score_:4.3f}')

Compute time = 2.71 seconds.

Best n_neighbors=31.0000
Best CV Score = 0.831
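
Beyond the single best value, a fitted grid search also stores the score of every parameter value it tried in its `cv_results_` attribute. The sketch below shows one way to view those results as a table. The synthetic data and the shortened list of `n_neighbors` values are assumptions made to keep the example self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; this is not the adult income data)
X, y = make_classification(n_samples=400, n_features=7, random_state=23)

search = GridSearchCV(estimator=KNeighborsClassifier(),
                      param_grid={'n_neighbors': [1, 5, 11, 31]},
                      cv=5)
search.fit(X, y)

# cv_results_ is a dictionary of arrays with one entry per parameter setting
results = pd.DataFrame(search.cv_results_)
columns = ['param_n_neighbors', 'mean_test_score',
           'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))

# best_params_ gives the winning setting directly as a dictionary
print(search.best_params_)
```

Looking at the full table, rather than only the best score, shows how sensitive the model is to the hyperparameter and whether the best value sits at the edge of the tested range.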


-----

[[Back to TOC]](#Table-of-Contents)

### Multi-Dimensional Grid Search

The previous example performed a grid search on one hyperparameter. This approach can be extended to multiple hyperparameters by constructing a dictionary that maps the hyperparameters to the hyperparameter values. In the following Code cell, we demonstrate this by first creating arrays of our hyperparameter values, for both the `n_neighbors` and `weights`. Next, we construct the grid of the two hyperparameters for the KNN algorithm by using an explicit dictionary. Then we perform stratified k-fold cross-validation using our KNN model. At the end of the Code cell, we compute and display the processing time, the best fit hyperparameters, and the best cross-validation score.
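
To see what the grid contains before running a search, the sketch below expands the same two-hyperparameter dictionary with scikit-learn's `ParameterGrid` helper and counts the resulting model fits. The variable names `grid` and `n_folds` are introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# The same hyperparameter dictionary used for the two-dimensional grid search
params = {'n_neighbors': [1, 3, 5, 11, 17, 23, 31, 53, 107],
          'weights': ['uniform', 'distance']}

# ParameterGrid enumerates every combination of the listed values
grid = ParameterGrid(params)

# Each combination is fit once per cross-validation fold
n_folds = 10
print(f'{len(grid)} combinations x {n_folds} folds = '
      f'{len(grid) * n_folds} model fits')

# Show a few of the combinations
for combination in list(grid)[:3]:
    print(combination)
```

With nine `n_neighbors` values and two `weights` options, the grid holds 18 combinations, which amounts to 180 model fits under 10-fold cross-validation. Adding a third hyperparameter multiplies this count again, which is why grid searches become expensive quickly.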

-----

In [6]:
# Start clock
start = time()

# Define individual hyperparameter values:
# hand-picked neighbor counts and both KNN weighting schemes
neighbors = [1, 3, 5, 11, 17, 23, 31, 53, 107]
weights = ['uniform', 'distance']

knc = KNeighborsClassifier()
# Unshuffled stratified folds (random_state is only valid with shuffle=True)
skf = StratifiedKFold(n_splits=10)


# Create a dictionary of hyperparameters and values
params = {'n_neighbors':neighbors, 'weights':weights}

# Create Grid Search Cross validation iterator
mgse = GridSearchCV(estimator=knc, param_grid=params, cv=skf)

# Fit the grid search estimator
mgse.fit(d_train, l_train)

# Compute and display results
print(f'Compute time = {time() - start:4.2f} seconds.\n')

mgbe = mgse.best_estimator_
print(f'Best n_neighbors={mgbe.get_params()["n_neighbors"]:5.4f}')
print(f'Best weights={mgbe.get_params()["weights"]}')
print(f'Best CV Score = {mgse.best_score_:4.3f}')

Compute time = 3.99 seconds.

Best n_neighbors=53.0000
Best weights=distance
Best CV Score = 0.841


-----

[[Back to TOC]](#Table-of-Contents)

## Randomized Grid Search

For a large number of hyperparameters, the number of possible combinations quickly becomes excessive. Since we must evaluate the model for each possible combination of parameters, model selection with a standard grid search can become computationally intractable. The code in the previous Code cell takes about four seconds to run with only two hyperparameters that have limited options.

An alternative approach is to randomly select possible hyperparameter combinations from the supplied grid of values to identify good parameter combinations.

In the following Code cell, we demonstrate a random grid search by using the [`RandomizedSearchCV`][skrgs] estimator. First, we pass in our KNN classifier, along with the parameter grid and the number of parameter combinations to sample (`n_iter`). As this last value increases, more parameter value combinations are sampled. The value for each parameter is sampled randomly from that parameter's distribution. If a list of values is provided (as we have done with previous grid searches), the values are randomly sampled from the list. Alternatively, a random distribution can be used. The requirement for these distributions is that they must provide an `rvs` method, as the random distributions in the `scipy.stats` module do.

We demonstrate the technique in the example below. In the code, instead of nine neighbor values, we use `range()` to create a range of integers from 1 to 50. If we used `GridSearchCV`, there would be 50 × 2 = 100 parameter combinations to evaluate, each requiring its own cross-validation, which would take a long time to finish. With `RandomizedSearchCV`, we can set `n_iter` to 20, which means we compute scores for only twenty different parameter combinations. The twenty combinations are picked randomly from all combinations. Even with many more hyperparameter value options, the code takes about 2 seconds to run.

In the second Code cell, we extract the best estimator and display the optimal parameters and associated score. Note that while this is not guaranteed to be the best score (or the best hyperparameter values), we achieve a score similar to the one from `GridSearchCV`. Thus, random search cross-validation can be useful for finding reasonable hyperparameter values much faster than an exhaustive grid search.

-----

[skrgs]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

In [7]:
from sklearn.model_selection import RandomizedSearchCV

# Start clock
start = time()

knc = KNeighborsClassifier()
# Unshuffled stratified folds (random_state is only valid with shuffle=True)
skf = StratifiedKFold(n_splits=10)

neighbors = range(1, 51)
weights = ['uniform', 'distance']
# Create a dictionary of hyperparameters and values
params = {'n_neighbors':neighbors, 'weights':weights}

# Number of random parameter samples
num_samples = 20

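# Run randomized search. Note: no cv argument is passed here, so the
# default cross-validation splitter is used instead of skf
rscv = RandomizedSearchCV(knc, param_distributions=params, n_iter=num_samples, random_state=23)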

# Fit randomized search estimator and display compute time
rscv.fit(d_train, l_train)

print(f'Compute time = {time() - start:4.2f} seconds', end='')
print(f' for {num_samples} parameter combinations')

Compute time = 1.86 seconds for 20 parameter combinations


In [8]:
# Get best estimator
be = rscv.best_estimator_

# Display parameter values
print(f'Best n_neighbors={be.get_params()["n_neighbors"]:5.4f}')
print(f'Best weights={be.get_params()["weights"]}')

# Display best score
print(f'Best CV Score = {rscv.best_score_:4.3f}')

Best n_neighbors=36.0000
Best weights=distance
Best CV Score = 0.838
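
The example above samples `n_neighbors` from a list of values. As mentioned earlier, a distribution object with an `rvs` method can be supplied instead. The following sketch uses `scipy.stats.randint` for that purpose. The synthetic data, the 5-fold setting, and `n_iter=10` are assumptions chosen to keep the example small.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; this is not the adult income data)
X, y = make_classification(n_samples=400, n_features=7, random_state=23)

param_distributions = {
    # Discrete uniform distribution over 1..50 (upper bound is exclusive)
    'n_neighbors': randint(1, 51),
    # Lists are still allowed and are sampled uniformly
    'weights': ['uniform', 'distance'],
}

search = RandomizedSearchCV(KNeighborsClassifier(),
                            param_distributions=param_distributions,
                            n_iter=10, cv=5, random_state=23)
search.fit(X, y)

print(f'Best parameters = {search.best_params_}')
print(f'Best CV Score = {search.best_score_:4.3f}')
```

A distribution is most useful for continuous or very wide ranges, where listing every candidate value by hand is impractical.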


-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. The scikit-learn library’s [introduction][1] to hyperparameter tuning
2. A discussion on [grid search][2] for algorithmic tuning from the machine learning mastery website
3. The Wikipedia article on [hyperparameters][3]


-----

[1]: http://scikit-learn.org/stable/modules/grid_search.html
[2]: https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/
[3]: https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)

**&copy; 2019: Gies College of Business at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode