<a href="https://colab.research.google.com/github/victorviro/Machine-Learning-Python/blob/master/Hyperparameter_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter optimization

# Table of contents

1. [Introduction](#1)
2. [Hyperparameters are important](#2)
3. [Tuning approaches](#3)
    1. [Manual tuning](#3.1)
    2. [Grid search](#3.2)
    3. [Random search](#3.3)
    4. [Model-based optimization](#3.4)
    5. [More advanced techniques](#3.5)
4. [Example](#4)
    1. [Prepare the dataset](#4.1)
    2. [Define function to minimize](#4.2)
    3. [Define search space](#4.3)
    4. [Select search algorithm](#4.4)
    5. [Run the tuning algorithm](#4.5)
    6. [Tuning with Hyperopt-sklearn](#4.6)
5. [References](#5)






# Introduction <a name="1"></a>

In machine learning, **hyperparameter optimization or tuning** is the problem of **choosing a set of optimal hyperparameters** for a learning algorithm. A hyperparameter is a parameter whose value is used to **control the learning process**. By contrast, the values of other parameters (typically node weights) are learned.

In this notebook, we will see why the choice of the values for the hyperparameters is important, what options we have for tuning a model, and finally, we will see an example.

# Hyperparameters are important <a name="2"></a>

With all of the advanced feature engineering, ETL, visualization work, and coding effort involved in building an ML solution, seeing an **untuned model using defaults** is like buying a high-performance sports car and filling it with regular gas. **It will run, but it will not perform well**. Not only will it under-perform, but the chances of it breaking are very high once we take it out into the "real world".

While some algorithms have no hyperparameters, the majority have anywhere from one to dozens. These influence not only the core functionality of the algorithm (e.g. the `family` parameter in generalized linear regression directly influences the predictive performance of such a model) but also how the optimizer executes its search for the minimum of the objective function. **Most of these hyperparameters influence how the algorithm will "learn" an optimal fit to the data**, which is exceptionally important.

The effect of different hyperparameter values on a model depends not only on the type of algorithm being used, but also on the nature of the data contained in the feature vector and the attributes of the target variable. This is why **every model needs to be tuned**.

The next figure shows two simplified examples of such critical hyperparameters for linear regression models.

![](https://i.ibb.co/Sfv4wLQ/hyperparameter-opt.png)

The figure shows the effects of using defaults (left) or selecting a poor hyperparameter value (an aggressive learning rate, right) on a linear regressor that uses stochastic gradient descent as its optimizer. **Without properly tuning the model, the final result could be a model that gets stuck in a local minimum or a model that fails to converge at all**. Either way, the predictions are going to be less than optimal.
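To make the effect of an aggressive learning rate concrete, here is a minimal sketch. It does not reproduce the figure: it assumes a toy one-parameter loss `f(w) = w**2` in place of a real regression loss, and the function name and the two learning rates are illustrative choices only.

```python
def gradient_descent(learning_rate, n_steps=50, w_start=5.0):
    """Minimize the toy loss f(w) = w**2 and return the final value of w."""
    w = w_start
    for _ in range(n_steps):
        gradient = 2.0 * w              # derivative of w**2
        w -= learning_rate * gradient   # plain gradient descent update
    return w


# With learning_rate=0.1 every step multiplies w by 0.8, so w shrinks towards 0.
w_small = gradient_descent(learning_rate=0.1)
# With learning_rate=1.1 every step multiplies w by -1.2, so |w| keeps growing.
w_large = gradient_descent(learning_rate=1.1)

print(f"learning_rate=0.1 -> final w = {w_small:.2e}")
print(f"learning_rate=1.1 -> final w = {w_large:.2e}")
```

The only thing that changes between the two runs is a hyperparameter, yet one run approaches the minimum and the other moves further away from it at every step.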

# Tuning approaches <a name="3"></a>

In the early stage of an experimentation phase, a manual process can be used to adjust and tune the model to get an idea of its predictive performance. However, depending on the use case, the difference between "ok" and "very good" predictions could translate to millions of dollars. **Spending time manually tuning** by just trying a bunch of different hyperparameters simply **won’t scale, either for predictive performance or for timeliness of delivery**.

If we want to come up with a better approach for tuning the model, we need to look at **what options there are**. The figure below shows the different approaches to tune models.

![](https://i.ibb.co/sQh2hWd/model-tuning-approaches.png)

As we can see, there are several options to pursue for arriving at an optimal set of hyperparameters. The top section is typically how prototypes are built. Manually testing values of hyperparameters is an understandable approach when doing rapid testing, since the goal of the prototype is to get an approximation of the tunability of a solution. When moving towards a production solution, however, more maintainable and powerful approaches need to be considered.

## Manual tuning <a name="3.1"></a>

We will see later, when applying hyperopt to an example, **how difficult it would be to arrive at the optimal hyperparameters for the model by hand**. **It’s very unlikely to get even close to optimal parameters with a manual process**. The amount of time required to arrive at a set of hyperparameters that gives an acceptably good answer could be weeks or months.

**Another issue** with this method is **tracking what has been tested**. Even if there were a system in place to record the values tried and ensure that none are repeated, the amount of work required is overwhelming and prone to errors. After the rapid prototyping phase, project work should abandon this approach to tuning as soon as is practicable. We have many better things to do with our time.

## Grid Search <a name="3.2"></a>

The **brute-force-search approach** of grid-based testing of hyperparameters has been around for quite some time. The process requires **selecting a set of values to test for each of the hyperparameters**. The grid search then assembles **collections of hyperparameters to test by creating every combination of the values from each group** that has been specified.

The **problem** with this approach appears **when there is a large collection of hyperparameters to search through** (**and a large space** to search, mainly for those that are continuously distributed). The number of combinations that need to be tested can become overwhelming rather quickly. The trade-off is between the time required to run all of the combinations and the search capability of the optimization. If we want to explore more of the hyperparameter response surface, we’re going to have to run more iterations.
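As a rough illustration of how quickly a grid grows, the sketch below builds every candidate with `itertools.product`. The hyperparameter names and values are assumptions chosen for illustration; they are not taken from the example later in this notebook.

```python
from itertools import product

# Hypothetical grid: a handful of values per hyperparameter.
grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],
    "l1_ratio": [0.1, 0.5, 0.9],
    "max_iter": [1000, 5000],
}

names = list(grid)
# One candidate per element of the Cartesian product of the value lists.
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(candidates))  # 4 * 3 * 2 = 24 models to fit
print(candidates[0])
```

Adding a fourth hyperparameter with five values would multiply the count by five, to 120 candidates, which is why the cost of a grid escalates so quickly.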

## Random Search <a name="3.3"></a>

**An alternative to Grid Search**, which tests the influence of different hyperparameters simultaneously rather than relying on explicit combinations to determine the optimal values, **is random sampling of each of the hyperparameter groups**. Instead of specifying exact conditions to test (as in Grid Search), this algorithm **selects candidates from each hyperparameter space and assembles, at random, candidate hyperparameter collections to test**. This approach, **controlled via a number of iterations**, can ensure that **large spaces are tested in a much shorter period of time**. The trade-off, however, is in the coverage of the hyperparameter space.

In their paper [Random Search for Hyper-Parameter Optimization](https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), James Bergstra and Yoshua Bengio concluded that grid search is flawed as an approach: **some hyperparameters are far more influential on the overall quality of a particular trained model than others, yet those with greater effect get the same amount of "coverage" as those with negligible influence**, which limits the effective search because of the computation time and cost of more expansive testing. Random Search is, in their opinion, a better approach than Grid Search, but it isn’t the most effective approach.
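The coverage argument can be sketched with a small comparison. Everything here is an assumption made for illustration: the two hyperparameter names, their ranges, and the budget of nine trials. With the same budget, a 3x3 grid only ever tries three distinct values of each hyperparameter, while random sampling tries a new value of each on every trial.

```python
import random


def random_candidates(n_iter, seed=0):
    """Draw n_iter hyperparameter settings at random (hypothetical ranges)."""
    rng = random.Random(seed)
    return [
        {
            "alpha": 10 ** rng.uniform(-3, 1),   # log-uniform between 0.001 and 10
            "l1_ratio": rng.uniform(0.0, 1.0),   # uniform between 0 and 1
        }
        for _ in range(n_iter)
    ]


# Nine grid trials: three alpha values crossed with three l1_ratio values.
grid_trials = [(alpha, ratio)
               for alpha in (0.01, 0.1, 1.0)
               for ratio in (0.1, 0.5, 0.9)]
# Nine random trials with the same budget.
random_trials = random_candidates(n_iter=9)

distinct_grid_alphas = len({alpha for alpha, _ in grid_trials})
distinct_random_alphas = len({trial["alpha"] for trial in random_trials})
print(distinct_grid_alphas, distinct_random_alphas)
```

If `alpha` happens to be the hyperparameter that matters most, the random trials have explored it far more densely for the same number of model fits.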

## Model-based optimization. TPE (hyperopt) <a name="3.4"></a>

With a complex hyperparameter search space, all of the approaches mentioned above are too time-consuming, too expensive, or too unlikely to reach adequate fit characteristics when validated against hold-out data.

The same team that published the paper arguing that random search is a superior methodology to grid search also arrived at a **process for selecting an optimized hyperparameter response surface through the use of Bayesian techniques in a model-based optimization** relying on either Gaussian Processes or Tree of Parzen Estimators. The results of their research are provided in the open-source Python package [hyperopt](https://github.com/hyperopt/hyperopt).

This technique is not only remarkably **capable of exploring complex hyperparameter spaces, it can also do so in far fewer iterations** than other methodologies. For further reading on this topic, the original paper is available: [Algorithms for Hyper-Parameter Optimization](https://papers.nips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf).
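To give a feel for the model-based idea, the following is a heavily simplified, one-dimensional sketch in the spirit of Tree of Parzen Estimators. It is not hyperopt's implementation: the function names, the fixed kernel bandwidth, the 25% "good" fraction, and the toy objective are all assumptions made for illustration. Past trials are split into a "good" group (lowest losses) and a "bad" group, each group is modeled with a Parzen (kernel) density estimate, and the next point to evaluate is the candidate where the good density is largest relative to the bad density.

```python
import math
import random


def parzen_density(x, points, bandwidth):
    """Gaussian-kernel (Parzen window) density estimate at x."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def tpe_like_minimize(objective, low, high, n_startup=10, n_iter=40,
                      gamma=0.25, n_candidates=24, seed=0):
    """Toy 1-D search loosely following the TPE idea; returns (best_x, best_loss)."""
    rng = random.Random(seed)
    bandwidth = (high - low) / 10.0  # fixed kernel width, an arbitrary choice
    # Start with a few purely random evaluations.
    history = [(x, objective(x))
               for x in (rng.uniform(low, high) for _ in range(n_startup))]
    for _ in range(n_iter):
        history.sort(key=lambda pair: pair[1])
        n_good = max(1, int(gamma * len(history)))
        good = [x for x, _ in history[:n_good]]   # lowest losses so far
        bad = [x for x, _ in history[n_good:]]    # everything else
        # Draw candidates from the "good" density: pick a good point, add noise.
        candidates = [min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                      for _ in range(n_candidates)]
        # Evaluate the candidate where good density / bad density is largest.
        chosen = max(candidates,
                     key=lambda x: parzen_density(x, good, bandwidth)
                     / (parzen_density(x, bad, bandwidth) + 1e-12))
        history.append((chosen, objective(chosen)))
    return min(history, key=lambda pair: pair[1])


best_x, best_loss = tpe_like_minimize(lambda x: (x - 2.0) ** 2, low=-10.0, high=10.0)
print(f"best x = {best_x:.3f}, loss = {best_loss:.4f}")
```

Unlike grid or random search, each new trial here depends on the results of the earlier ones, which is what allows this family of methods to concentrate evaluations in promising regions.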

## More advanced/complex techniques <a name="3.5"></a>

More advanced techniques typically mean paying a company that offers an **automated-ML (autoML)** solution or building your own.

When **building a custom tuner**, we might look into a mixture of genetic algorithms and Bayesian prior search optimization to create search candidates within the hyperparameter space that have the highest likelihood of giving a good result, leveraging the selective optimization that genetic algorithms are known for.

However, going down this path **is not usually recommended** unless we’re building out a custom framework for hundreds of different projects and have a distinct need for a high-performance, lower-cost optimization tool. **Otherwise, the effort is simply not worth it**.

# Example <a name="4"></a>

As we mentioned previously, [Hyperopt](https://github.com/hyperopt/hyperopt) is a Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions. It is designed for large-scale optimization for models with hundreds of parameters and allows the optimization procedure to be scaled across multiple cores and multiple machines.

In this section, we are going to see an example using this library to find the best hyperparameters for a regression problem. We are going to use the [Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html), which contains data representing characteristics of houses. The target variable is the median value of homes in $1000’s. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code below needs an earlier scikit-learn release.

In [None]:
from sklearn.datasets import load_boston
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

## Prepare the dataset <a name="4.1"></a>

In [None]:
X, y = load_boston(return_X_y=True)
X = pd.DataFrame(X)
print(X.shape)
X.head()

(506, 13)


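|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |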


We split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

## Define function to minimize <a name="4.2"></a>

In this example, we want to search for a linear regression model. The search space passes the model name in `params['type']`; the remaining entries are the hyperparameters of that model. We define a function that trains the model with the given hyperparameters and returns the mean squared error.

In [None]:
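def objective(params):
    # Work on a copy so the dictionary supplied by hyperopt is not mutated
    params = dict(params)
    regressor_type = params.pop('type')
    if regressor_type != 'ElasticNet':
        # Fail loudly instead of reporting a misleading loss of 0
        raise ValueError(f'Unknown regressor type: {regressor_type}')
    # Scale the features, then fit the model with the sampled hyperparameters
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('ElasticNet', ElasticNet(**params)),
    ])
    pipeline.fit(X_train, y_train)
    y_pred_test = pipeline.predict(X_test)
    # The mean squared error on the held-out split is what fmin minimizes
    loss = mean_squared_error(y_test, y_pred_test)
    return {'loss': loss, 'status': STATUS_OK}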

## Define search space <a name="4.3"></a>

Details on defining a search space and parameter expressions are available in the [Hyperopt docs](https://github.com/hyperopt/hyperopt/wiki/FMin#21-parameter-expressions).

In [None]:
search_space = hp.choice('regressor_type', [
    {
        'type': 'ElasticNet',
        'alpha': hp.lognormal('alpha', 0, 1.0),
        'l1_ratio': hp.lognormal('l1_ratio', 0, 1.0)
    },
])
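One detail of this search space is worth checking. `hp.lognormal(label, mu, sigma)` draws values distributed as `exp(normal(mu, sigma))`, which are always positive but unbounded above, whereas scikit-learn documents `l1_ratio` as a value between 0 and 1. The sketch below uses only the standard library (not hyperopt itself) to estimate how often a log-normal draw with `mu=0` and `sigma=1` exceeds 1; the sample size and seed are arbitrary choices.

```python
import random

rng = random.Random(0)

# Same distribution as exp(normal(mu=0, sigma=1)).
samples = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
share_above_one = sum(s > 1.0 for s in samples) / len(samples)
print(f"share of draws above 1: {share_above_one:.2f}")

# A bounded alternative for a ratio: uniform draws between 0 and 1.
bounded = [rng.uniform(0.0, 1.0) for _ in range(10_000)]
print(f"largest bounded draw: {max(bounded):.3f}")
```

Because the median of this distribution is 1, roughly half of the sampled `l1_ratio` values fall outside the documented range. A bounded expression such as `hp.uniform('l1_ratio', 0, 1)` may therefore be a safer choice for this hyperparameter.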

## Select search algorithm <a name="4.4"></a>

The two main choices are:

- `hyperopt.tpe.suggest`: Tree of Parzen Estimators, a Bayesian approach which iteratively and adaptively selects new hyperparameter settings to explore based on past results.

- `hyperopt.rand.suggest`: Random search, a non-adaptive approach which samples over the search space.

In [None]:
algorithm = tpe.suggest

## Run the tuning algorithm <a name="4.5"></a>

We set `max_evals` to the maximum number of points in hyperparameter space to test, that is, the maximum number of models to fit and evaluate.

In [None]:
best_result = fmin(
    fn=objective, 
    space=search_space,
    algo=algorithm,
    max_evals=50)

In [None]:
print(f'best parameters: {best_result}')

best parameters: {'alpha': 0.038833880217567576, 'l1_ratio': 1.7404673537285114, 'regressor_type': 0}


**Note**: Our results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. We can run the example a few times.
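For dimensions defined with `hp.choice`, `fmin` reports the index of the selected option rather than the option itself, which is why `regressor_type` is `0` in the output above. The sketch below decodes that index by hand; the `regressor_options` list and the `decode` helper are hypothetical names introduced for illustration. Hyperopt also provides `space_eval(search_space, best_result)`, which performs this mapping on the real search space.

```python
# Values copied from the output above.
best_result = {
    "alpha": 0.038833880217567576,
    "l1_ratio": 1.7404673537285114,
    "regressor_type": 0,
}

# Labels in the same order as the options passed to hp.choice (assumed here).
regressor_options = ["ElasticNet"]


def decode(result, options):
    """Return a copy of result with the hp.choice index replaced by its label."""
    decoded = dict(result)
    decoded["regressor_type"] = options[decoded["regressor_type"]]
    return decoded


print(decode(best_result, regressor_options))
```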

## Tuning with hyperopt-sklearn <a name="4.6"></a>

[Hyperopt-sklearn](https://github.com/hyperopt/hyperopt-sklearn) is Hyperopt-based model selection among ML algorithms in Scikit-learn.

In [None]:
!pip install git+https://github.com/hyperopt/hyperopt-sklearn.git

In [None]:
from hpsklearn import (HyperoptEstimator, any_regressor, any_preprocessing) 
from hyperopt import tpe
from sklearn.metrics import mean_squared_error

We define the search procedure. We will explore all regressor algorithms and all data transforms available to the library and use the Tree of Parzen Estimators search algorithm.

The search will evaluate 50 pipelines and limit each evaluation to 30 seconds.

In [None]:
# Define search
hyperopt_estimator = HyperoptEstimator(regressor=any_regressor('reg'),
                          preprocessing=any_preprocessing('pre'),
                          loss_fn=mean_squared_error,
                          algo=tpe.suggest, max_evals=50, trial_timeout=30)

We then start the search.

In [None]:
hyperopt_estimator.fit(X_train, y_train)

We can report the performance of the model on the holdout dataset and summarize the best performing pipeline.

In [None]:
y_test_predicted = hyperopt_estimator.predict(X_test)
mse = mean_squared_error(y_test, y_test_predicted)
print("MSE: %.3f" % mse)
# Summarize the best model
print(f'Best model: {hyperopt_estimator.best_model()}')

MSE: 9.826
Best model: {'learner': RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features=None, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=41, n_jobs=1, oob_score=False,
                      random_state=2, verbose=False, warm_start=False), 'preprocs': (), 'ex_preprocs': ()}
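Since the target is expressed in thousands of dollars, the MSE is in squared units and is hard to read directly. Taking the square root gives an error in the units of the target. This is a small illustrative calculation on the value printed above, not part of the original run.

```python
import math

mse = 9.826              # value reported in the output above
rmse = math.sqrt(mse)    # back in the units of the target: thousands of dollars
print(f"RMSE: {rmse:.3f} (about ${rmse * 1000:,.0f})")
```

An RMSE of roughly 3.13 corresponds to a typical prediction error of a little over $3,100 on the median home value.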


# References <a name="5"></a>

- [Machine learning engineering in action](https://livebook.manning.com/book/machine-learning-engineering/)

- [Random Search for Hyper-Parameter Optimization](https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)

- [Algorithms for Hyper-Parameter Optimization](https://papers.nips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf)

- [Hyperopt](https://github.com/hyperopt/hyperopt)

- [Hyperopt-sklearn](https://github.com/hyperopt/hyperopt-sklearn)

- [Hyperopt-Sklearn paper](https://www.ml4aad.org/wp-content/uploads/2018/07/automl_book_draft_hyperopt-sklearn.pdf)