**Under construction**

Alexander S. Lundervold, September 28th, 2018

# Introduction

As you know very well by now, machine learning models typically have a number of _hyperparameters_ that have to be chosen carefully to get good performance.

> **What is a hyperparameter?** During training the model's _parameters_ are automatically tuned to make the model produce useful outputs. This is typically achieved by using an optimization algorithm (for example gradient descent) to minimize some cost function (for example mean square error or cross-entropy). 

> However, there are typically other parameters in the model that are not automatically tuned during training. These are the things you pass to the scikit-learn estimators as arguments, for example `RandomForestClassifier(max_depth=2, n_estimators=100, ...)`. They are called _hyperparameters_. Examples include the learning rate, the amount of regularization, the number of layers in a neural network (as we shall see), and much more! Some models have a large number of hyperparameters, which can influence their performance heavily.
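
To make the distinction concrete, here is a minimal sketch in plain Python (the toy data and the function name `fit_slope` are made up for illustration, they are not part of scikit-learn). The slope `w` is a parameter, tuned by gradient descent on the mean squared error. The learning rate is a hyperparameter: we have to pick it ourselves, and a poor choice gives a poor model.

```python
# Toy data generated from y = 2 * x, so the ideal parameter is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def fit_slope(learning_rate, n_steps=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # parameter: tuned automatically by the loop below
    for _ in range(n_steps):
        # d/dw of mean((w*x - y)^2) is mean(2 * (w*x - y) * x)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w


# The learning rate is a hyperparameter: we choose it, training does not.
w_good = fit_slope(learning_rate=0.05)
w_slow = fit_slope(learning_rate=0.0001)
print(f"learning_rate=0.05   -> w = {w_good:.4f}")
print(f"learning_rate=0.0001 -> w = {w_slow:.4f}")
```

With the larger learning rate the slope ends up very close to 2, while the tiny learning rate has not come close after the same number of steps.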

**How do we select good hyperparameters?**

It's essentially a learning task: train the model so that it also obtains good hyperparameter settings. However, it's typically not easy to formulate the task in a way where the usual training methods work. For example, it's difficult to create a cost function for this task that can be optimized using gradient descent: many hyperparameters are discrete (the number of trees, say), so the cost is not differentiable with respect to them.

> **There are some very interesting methods that use machine learning models to optimize other ML models, but that is beyond our scope in this notebook**. Have a look at <a href="https://ai.googleblog.com/2017/05/using-machine-learning-to-explore.html">AutoML</a> or <a href="https://en.wikipedia.org/wiki/Hyperparameter_optimization#Evolutionary_optimization">evolutionary algorithms</a> if you're curious.

**One approach: use search!**

# Searching for good hyperparameters

Two standard ways are
1. **brute-force search**: try all parameter combinations within a specified range, and
2. **random search**: try out random combinations of parameter settings within a specified range.

(A third way is to be more clever and use the results obtained from previous parameter settings to select a next setting that is expected to be better. This leads to things like **Bayesian hyperparameter optimization** and **evolutionary hyperparameter optimization**, which are not covered here.)

A commonly used brute-force method is **grid search**. When it makes sense to search through a very large space of parameter settings, or when each setting you try requires a lot of compute, random search is usually a better choice than grid search.
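
Before turning to scikit-learn, the two strategies can be sketched with the standard library alone. Everything here is illustrative: `toy_score` is a hypothetical stand-in for "train a model with these settings and return its validation score", and the search space is made up.

```python
import itertools
import random

# A small, made-up search space.
space = {"max_depth": [2, 3, 5, 8], "n_estimators": [10, 50, 100]}


def toy_score(params):
    # Hypothetical stand-in for training a model and scoring it.
    # By construction the best setting is max_depth=5, n_estimators=50.
    return -abs(params["max_depth"] - 5) - abs(params["n_estimators"] - 50) / 50


# Brute force / grid search: enumerate every combination.
names = list(space)
all_settings = [
    dict(zip(names, values)) for values in itertools.product(*space.values())
]
best_grid = max(all_settings, key=toy_score)

# Random search: evaluate only a fixed number of randomly chosen settings.
rng = random.Random(0)
sampled = rng.sample(all_settings, k=5)
best_random = max(sampled, key=toy_score)

print(len(all_settings), "settings tried by grid search, best:", best_grid)
print(len(sampled), "settings tried by random search, best:", best_random)
```

Grid search is guaranteed to find the best setting in the grid, but its cost grows with the product of the list lengths. Random search has a fixed budget and may or may not hit the best setting.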

Let's get concrete and try these out on some model trained on some data.

# Data

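We'll look at two data sets, one for regression and one for classification. Both are built into scikit-learn (note that `load_boston` was removed in scikit-learn 1.2, so the Boston cells below need an older version):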

**Regression**: Boston housing data set

**Classification**: Breast cancer data set

In [1]:
from sklearn.datasets import load_boston, load_breast_cancer

In [2]:
boston = load_boston()
X_boston, y_boston = boston['data'], boston['target']

In [3]:
X_boston.shape, y_boston.shape

((506, 13), (506,))

In [4]:
breast = load_breast_cancer()
X_breast, y_breast = breast['data'], breast['target']

In [5]:
X_breast.shape, y_breast.shape

((569, 30), (569,))

# Machine learning model

Let's use random forests as our example model:

In [6]:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [7]:
rf_reg = RandomForestRegressor(random_state=42)
rf_clf = RandomForestClassifier(random_state=42)

We can use these models on our data to see how they perform:

In [9]:
from sklearn.model_selection import train_test_split

**Boston:**

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_boston, y_boston, random_state=42)

In [11]:
rf_reg.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

In [12]:
rf_reg.score(X_test, y_test)

0.8504443818584292

Note from the documentation of `rf_reg.score` that for regressors the score is the coefficient of determination $R^2$: the best possible score is 1.0 and higher is better.
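
For regressors, scikit-learn's `score` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes it by hand on made-up numbers (not the Boston data; the helper name `r2_score_by_hand` is ours) to show what the values mean.

```python
def r2_score_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]  # made-up targets

# Perfect predictions give the best possible score.
perfect = r2_score_by_hand(y_true, y_true)
# Always predicting the mean of the targets gives 0.
mean_only = r2_score_by_hand(y_true, [6.0] * 4)
# Predictions worse than the mean give a negative score.
bad = r2_score_by_hand(y_true, [9.0, 7.0, 5.0, 3.0])

print(perfect, mean_only, bad)
```

So a score of 0.85 means the model explains most, but not all, of the variation in the targets that a constant prediction would miss.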

Random forests have a bunch of hyperparameters, as the printed model above shows.

**Breast cancer:**

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X_breast, y_breast, random_state=42)

In [14]:
rf_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [15]:
rf_clf.score(X_test, y_test)

0.951048951048951

That is about 95% accuracy.

> Can we do better with these models?

Our only hope is to find better hyperparameter settings. We could try to change some parameters one by one and track what happens, but it's easier if we let the computer search for us.

## Grid search

In grid search we select some parameters to change, make a set or range of values for each of them, and try every combination.

We specify our grid as a Python dictionary, or a list of dictionaries if we want to be more specific about parameter combinations to try (for example, "if `n_estimators` is 10, try `max_depth` 2, 3 and 4").
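
To see what a grid expands to without training anything, scikit-learn's `ParameterGrid` can enumerate the settings. The grids below are small made-up examples; the point is the difference between a single dictionary and a list of dictionaries.

```python
from sklearn.model_selection import ParameterGrid

# A single dictionary: every combination of the listed values.
full_grid = {"n_estimators": [10, 50], "max_depth": [2, 3, 4]}

# A list of dictionaries: each dictionary is expanded on its own,
# so only the combinations we actually want are tried.
restricted_grid = [
    {"n_estimators": [10], "max_depth": [2, 3, 4]},
    {"n_estimators": [50, 100], "max_depth": [None]},
]

full_settings = list(ParameterGrid(full_grid))
restricted_settings = list(ParameterGrid(restricted_grid))

print(len(full_settings))        # 2 * 3 combinations
print(len(restricted_settings))  # 3 + 2 combinations
for setting in restricted_settings:
    print(setting)
```

The same dictionary or list of dictionaries can be passed as `param_grid` to `GridSearchCV`.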

In [16]:
from sklearn.model_selection import GridSearchCV

In [None]:
?GridSearchCV

In [17]:
param_grid = {
    'n_estimators': [10, 50, 100, 150],
    'max_depth': [2, 3, 100, 150, 1000],
}

**Boston:**

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X_boston, y_boston, random_state=42)

In [19]:
gs_reg = GridSearchCV(estimator=rf_reg, param_grid=param_grid, cv=3, verbose=1)

In [20]:
gs_reg.fit(X_train, y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    5.7s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [10, 50, 100, 150], 'max_depth': [2, 3, 100, 150, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [21]:
gs_reg.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=100,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

In [22]:
best_reg = gs_reg.best_estimator_

In [23]:
best_reg.score(X_test, y_test)

0.8556781281650231

Slightly better than before (about 0.856 versus 0.850)!

**Breast cancer:**

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X_breast, y_breast, random_state=42)

In [25]:
gs_clf = GridSearchCV(rf_clf, param_grid, verbose=1)

In [26]:
gs_clf.fit(X_train, y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    4.8s finished


GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [10, 50, 100, 150], 'max_depth': [2, 3, 100, 150, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [27]:
gs_clf.score(X_test, y_test)

0.972027972027972

Also better!
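
It can be useful to look at what the search actually did, not only at the final test score. The sketch below is self-contained and uses a deliberately tiny grid (`small_grid` is our own choice, not the grid used above) on the breast cancer data. It reads the fitted search object's `best_params_`, `best_score_` and `cv_results_` attributes. Note that the best setting is selected using the cross-validated score on the training data; the test set is only used afterwards.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Deliberately tiny grid so the sketch runs in a few seconds.
small_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=3)
search.fit(X_train, y_train)

print("best setting:", search.best_params_)
print("mean cross-validated accuracy of that setting:", search.best_score_)

# One entry per candidate: its parameters and mean validation score.
results = search.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(mean_score, 4))

# With refit=True (the default), search.score uses the refitted best model.
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```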

## Random search

We use it in a similar way to `GridSearchCV`: we specify the estimator to use and the parameter grid to search through. But we also specify the number of settings to try, and the method searches randomly through the parameter space that many times.

In [21]:
from sklearn.model_selection import RandomizedSearchCV

In [29]:
?RandomizedSearchCV

As our data is small it doesn't cost much time to search through *all* the settings in our specified parameter grid. That is, using grid search is okay. However, if the parameter space were chosen to be much larger, or our data set were more complicated, random search would be a much better approach.
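
To get a feel for which settings a random search would try, scikit-learn's `ParameterSampler` draws candidates without fitting any model. The sketch below shows two cases. With plain lists, as in our `param_grid`, candidates are drawn without replacement, so no combination is tried twice. With distributions (here `scipy.stats.randint`, with ranges that are our own assumption), every candidate is an independent draw, which lets the search cover a range instead of a few hand-picked values.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same lists as param_grid above: 4 * 5 = 20 possible combinations.
param_lists = {
    "n_estimators": [10, 50, 100, 150],
    "max_depth": [2, 3, 100, 150, 1000],
}
# All entries are lists, so candidates are drawn without replacement.
from_lists = list(ParameterSampler(param_lists, n_iter=8, random_state=0))

# With distributions, each candidate is an independent draw.
# randint(low, high) yields integers with low <= value < high.
param_dists = {"n_estimators": randint(10, 200), "max_depth": randint(2, 30)}
from_dists = list(ParameterSampler(param_dists, n_iter=8, random_state=0))

for candidate in from_lists:
    print("from lists:", candidate)
for candidate in from_dists:
    print("from distributions:", candidate)
```

Either kind of dictionary can be passed as the parameter argument of `RandomizedSearchCV`.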

**Boston:**

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X_boston, y_boston, random_state=42)

In [31]:
r_reg = RandomizedSearchCV(rf_reg, param_grid, n_iter=20, cv=3, verbose=1)

In [32]:
r_reg.fit(X_train, y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    6.0s finished


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=20, n_jobs=1,
          param_distributions={'n_estimators': [10, 50, 100, 150], 'max_depth': [2, 3, 100, 150, 1000]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=1)

Note that we have total control over the number of fits: `cv * n_iter` (plus one final refit of the best setting on the full training set, since `refit=True`).

In [33]:
r_reg.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=100,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

In [34]:
best_reg = r_reg.best_estimator_

In [35]:
best_reg.score(X_test, y_test)

0.8556781281650231

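It found the same settings as grid search, and therefore got the same score. That is expected here: the grid contains only 4 × 5 = 20 combinations and we asked for `n_iter=20`. Since the values are given as lists, they are sampled without replacement, so random search ended up trying every combination.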

**Breast cancer:**

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X_breast, y_breast, random_state=42)

In [37]:
r_clf = RandomizedSearchCV(rf_clf, param_grid, n_iter=20, verbose=1)

In [38]:
r_clf.fit(X_train, y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    4.9s finished


RandomizedSearchCV(cv=None, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=20, n_jobs=1,
          param_distributions={'n_estimators': [10, 50, 100, 150], 'max_depth': [2, 3, 100, 150, 1000]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=1)

In [39]:
r_clf.score(X_test, y_test)

0.972027972027972

Same as for grid search, which is again expected since `n_iter=20` covers all 20 combinations in the grid.

# Tips and tricks

To efficiently use hyperparameter search you have to know what settings it makes sense to try. That depends on the model, of course, and it's one of the reasons why it's important to have good insights into the workings of various models. 

Which parameters it makes sense to try out also depends on the data you have. This is another skill one should acquire: guesstimating sensible parameter ranges based on knowledge of the data.

Two models used extremely often are random forests and XGBoost, which we'll learn about in Part 4 of the course. 

In case you want to try out hyperparameter optimization on these before we get to the details about the models, here are some good parameter spaces to use. 

## Random forests

As we shall see when we learn about random forests, the parameters to focus on are:

- `bootstrap`, `max_depth`, `max_features`, `min_samples_leaf`, `min_samples_split`, `n_estimators`

In [40]:
param_grid_large = {
     'bootstrap': [True, False],
     'max_depth': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
     'max_features': ['auto', 'sqrt'],
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10],
     'n_estimators': [10, 100, 500, 1000]
    }


param_grid_small = {
    
    'max_depth': [5, 10, 15, 20, 30, 100, None],
    'n_estimators': [50, 100, 500, 1000]
    
}

Let's try it on the Boston data set:

### Grid search

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X_boston, y_boston, random_state=42)

In [42]:
gs_reg = GridSearchCV(rf_reg, param_grid_large, cv=3, verbose=1)

In [None]:
%%time
gs_reg.fit(X_train, y_train)

...this takes too much time: the grid has 1728 combinations, which with 3-fold cross-validation means 5184 fits!
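
The number of fits is easy to work out before launching a search: multiply the lengths of all the lists in the grid, then multiply by the number of folds. A small sketch using the large grid from above:

```python
import math

param_grid_large = {
    "bootstrap": [True, False],
    "max_depth": [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    "max_features": ["auto", "sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [10, 100, 500, 1000],
}

cv = 3  # number of cross-validation folds

# Grid search tries every combination: the product of the list lengths.
n_combinations = math.prod(len(values) for values in param_grid_large.values())
n_fits_grid = n_combinations * cv

# Random search tries only n_iter of them.
n_iter = 50
n_fits_random = n_iter * cv

print(n_combinations, n_fits_grid, n_fits_random)
print(f"random search covers {n_iter / n_combinations:.1%} of the grid")
```

Adding one more list to the grid multiplies the cost of grid search, while the cost of random search stays at `n_iter * cv`.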

But we're willing to do, say, 150 fits (`n_iter=50` settings times 3 folds) using random search:

In [44]:
r_reg = RandomizedSearchCV(rf_reg, param_grid_large, n_iter=50, verbose=1)

In [45]:
%%time
r_reg.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed:  1.1min finished


Wall time: 1min 5s


RandomizedSearchCV(cv=None, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=50, n_jobs=1,
          param_distributions={'bootstrap': [True, False], 'max_depth': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None], 'max_features': ['auto', 'sqrt'], 'min_samples_leaf': [1, 2, 4], 'min_samples_split': [2, 5, 10], 'n_estimators': [10, 100, 500, 1000]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=1)

In [46]:
r_reg.best_estimator_

RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=90,
           max_features='sqrt', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=5,
           min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

In [47]:
best = r_reg.best_estimator_

In [48]:
best.score(X_test, y_test)

0.8523790350580436

In [49]:
# Could also have done
r_reg.score(X_test, y_test)
# since, with refit=True, the search object scores with its best estimator

0.8523790350580436

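This is slightly better than the default model (about 0.852 versus 0.850), but slightly below what the small grid gave us earlier (about 0.856). A larger search space does not guarantee a better result when we only sample 50 settings from it.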

## XGBoost

In [50]:
from xgboost import XGBRegressor

Some reasonable search spaces to start from (you'll be better able to select settings to try once you understand how XGBoost works):

In [20]:
xgb_param_grid = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.0, 1.0, 1.5],
        'subsample': [0.6, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [5, 6, 7, 8, 10]
        }


xgb_param_grid_large = {
        'learning_rate': [0.1, 0.05, 0.2],
        'n_estimators': [50, 100, 500, 600],
        'min_child_weight': [1, 5, 10],
        'gamma': [0.0, 1.0, 1.5],
        'subsample': [0.6, 1.0],
        'colsample_bytree': [0.5, 0.6, 0.8],
        'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
        'reg_lambda': [1, ]
        }


Let's try them out on our regression task:

In [52]:
# To place it on the GPU
#xgb_reg = XGBRegressor(tree_method='gpu_hist', predictor='gpu_predictor')
# On the CPU
xgb_reg = XGBRegressor()

In [53]:
r_xgb_reg = RandomizedSearchCV(xgb_reg, xgb_param_grid, n_iter=200, verbose=1)

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X_boston, y_boston, random_state=42)

In [55]:
r_xgb_reg.fit(X_train, y_train)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:   20.3s finished


RandomizedSearchCV(cv=None, error_score='raise',
          estimator=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
          fit_params=None, iid=True, n_iter=200, n_jobs=1,
          param_distributions={'min_child_weight': [1, 5, 10], 'gamma': [0.0, 1.0, 1.5], 'subsample': [0.6, 1.0], 'colsample_bytree': [0.6, 0.8, 1.0], 'max_depth': [5, 6, 7, 8, 10]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=1)

In [56]:
r_xgb_reg.best_estimator_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.6, gamma=0.0, learning_rate=0.1,
       max_delta_step=0, max_depth=5, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None, objective='reg:linear',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=0.6)

In [57]:
r_xgb_reg.score(X_test, y_test)

0.8913840638026145

# You should try this out on every data set you study!

# Under construction
Use a more complicated data set to really illustrate the power of hyperparameter optimization.

In [4]:
import numpy as np

In [5]:
from sklearn.datasets import fetch_mldata

In [6]:
mnist = fetch_mldata('MNIST Original')

In [7]:
mnist

{'DESCR': 'mldata.org dataset: mnist-original',
 'COL_NAMES': ['label', 'data'],
 'target': array([0., 0., 0., ..., 9., 9., 9.]),
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)}

In [8]:
X = mnist["data"]
y = mnist["target"]

In [9]:
X.shape

(70000, 784)

In [10]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [11]:
np.random.seed(seed=42)
shuffled_indices = np.random.permutation(len(X_train))
X_train, y_train = X_train[shuffled_indices], y_train[shuffled_indices]

## Standard random forest and XGBoost

**Random forest**

In [12]:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [13]:
rf = RandomForestClassifier()

In [14]:
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [15]:
rf.score(X_test, y_test)

0.9462

**XGB classifier**

In [16]:
# On the CPU
#xgb = XGBClassifier()
# To place it on the GPU
xgb = XGBClassifier(tree_method='gpu_hist', predictor='gpu_predictor')

In [17]:
xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob',
       predictor='gpu_predictor', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=1, tree_method='gpu_hist')

In [18]:
xgb.score(X_test, y_test)

0.9366

## Optimization

In [21]:
xgb = XGBClassifier(tree_method='gpu_hist', predictor='gpu_predictor')

In [19]:
from sklearn.model_selection import RandomizedSearchCV

In [20]:
xgb_param_grid = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.0, 1.0, 1.5],
        'subsample': [0.6, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [5, 6, 7, 8, 10]
        }


xgb_param_grid_large = {
        'learning_rate': [0.1, 0.05, 0.2],
        'n_estimators': [50, 100, 500, 600],
        'min_child_weight': [1, 5, 10],
        'gamma': [0.0, 1.0, 1.5],
        'subsample': [0.6, 1.0],
        'colsample_bytree': [0.5, 0.6, 0.8],
        'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
        'reg_lambda': [1, ]
        }


In [22]:
rs_xgb = RandomizedSearchCV(xgb, xgb_param_grid, n_iter=100, n_jobs=-1, verbose=2)

In [None]:
rs_xgb.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


In [None]:
rs_xgb.score(X_test, y_test)