<a class="anchor" id="0"></a>
# Bayesian Optimization using Hyperopt

Hello friends,

In this kernel, I will walk through the **Bayesian Optimization process using Hyperopt**. Bayesian Optimization is used to find an optimal set of hyperparameters for a machine learning model.

**I hope you find this kernel useful and your <font color="red"><b>UPVOTES</b></font> would be very much appreciated**

<a class="anchor" id="0.1"></a>

## Table of Contents

1. [Introduction to Bayesian Optimization](#1)
1. [Bayesian Optimization method](#2)
1. [Introduction to Hyperopt](#3)
1. [4 parts of optimization problem](#4)
   1. [Objective function](#4.1)
   1. [Domain space](#4.2)
   1. [Optimization algorithm](#4.3)
   1. [Results](#4.4)
1. [Bayesian Optimization implementation](#5)
   1. [Optimization](#5.1)
   1. [Results](#5.2)
1. [Important points to note](#6)
1. [Conclusion](#7)



## 1. Introduction to Bayesian Optimization <a class="anchor" id="1"></a>


[Back to Table of Contents](#0.1)


Suppose we have found the best machine learning model for a Kaggle competition and now want to improve it by tuning its hyperparameters. There are standard hyperparameter optimization techniques like GridSearch and RandomSearch. Grid search evaluates every combination in the grid, while random search samples combinations at random; neither uses the results of earlier evaluations to decide what to try next. If we have a small set of parameter values, this is fine. But with large parameter spaces, it is quite time-consuming and frustrating.

A popular alternative for tuning model hyperparameters is **Bayesian Optimization**. Bayesian Optimization is a probabilistic model-based technique used to find the minimum of a function. This approach can yield better performance on the test set while requiring fewer iterations than random search. It takes past evaluations into account when choosing the next set of hyperparameters to try, so it chooses its parameter combinations in an informed way. In doing so, it focuses on the regions of the parameter space that yield the best scores and spends little time on unpromising regions. Thus, this technique typically requires fewer iterations to find a good set of parameter values. Hence, it is less time-consuming and far less frustrating.

Please visit the following link for more information on Bayesian Optimization.

https://github.com/fmfn/BayesianOptimization




## 2. Bayesian Optimization Method <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

Bayesian optimization is also called **Sequential Model-Based Optimization (SMBO)**. It finds the value that minimizes an objective function by building a surrogate function, which is a probability model of the objective based on past evaluation results. The surrogate is much cheaper to evaluate than the objective itself, so the next input values are selected by optimizing a criterion on the surrogate, usually expected improvement. Because Bayesian methods use past evaluation results to choose the next input values, they avoid poor regions of the input space and limit the number of objective function evaluations by concentrating on regions that have done well in the past.
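As a hedged sketch of the expected improvement criterion: writing $y^*$ for a threshold loss (for example, the best loss observed so far) and $p(y \mid x)$ for the surrogate's predicted distribution of the loss at input $x$ (both symbols are notation introduced here, not taken from this kernel), the criterion is commonly written as:

$$
\mathrm{EI}_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y)\, p(y \mid x)\, dy
$$

The integrand is non-zero only for losses below the threshold, so inputs that the surrogate believes are likely to beat $y^*$ score highest and are evaluated next.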

Nowadays, there are a number of Python libraries that enable us to implement Bayesian Optimization for machine learning models. Examples include Spearmint, Hyperopt and SMAC. There is also **Scikit-optimize**, a separate library built on top of Scikit-learn, for Bayesian optimization.

Bayesian Optimization methods differ in how they construct the surrogate function. Spearmint uses a Gaussian Process surrogate while SMAC uses Random Forest regression. Hyperopt uses the Tree-structured Parzen Estimator (TPE).




## 3. Introduction to Hyperopt  <a class="anchor" id="3"></a>


[Back to Table of Contents](#0.1)


In this kernel, I use **Hyperopt** to apply Bayesian Optimization. **Hyperopt** is a Python library for optimizing over search spaces, and it is commonly used to tune model hyperparameters.


More information on Hyperopt can be found at the following link:-


https://hyperopt.github.io/hyperopt/?source=post_page



## 4. 4 parts of Optimization Problem <a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

Implementing an optimization problem in Hyperopt requires 4 parts as follows:-


**1. Objective function**

The objective function can be any function which returns a real value that we want to minimize. In this case, we want to minimize the validation error of a machine learning model with respect to the hyperparameters. If the metric is one that we want to maximize, such as accuracy, then the function should return the negative of that metric.

**2. Domain space**

The domain space is the input values over which we want to search.

**3. Optimization algorithm**

It is the method used to construct the surrogate objective function and choose the next values to evaluate.

**4. Results**

Results are the score-value pairs (the loss together with the input values that produced it) that the algorithm uses to build the surrogate model.


## 5. Bayesian Optimization Implementation <a class="anchor" id="5"></a>


[Back to Table of Contents](#0.1)


Enough of theory, let's get to the implementation. In this implementation, I will walk through :-

- the workings of Bayesian Optimization.
- its application by means of Hyperopt.
        
        
First of all, I will generate a synthetic dataset using Scikit-learn's `make_classification` function. I will generate a random binary classification dataset with 1000 samples and 100 features, and split it into a train and a test set.

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Generate dataset with 1000 samples, 100 features and 2 classes

def gen_dataset(n_samples=1000, n_features=100, n_classes=2, random_state=123):  
    X, y = datasets.make_classification(
        n_features=n_features,
        n_samples=n_samples,  
        n_informative=int(0.6 * n_features),    # the number of informative features
        n_redundant=int(0.1 * n_features),      # the number of redundant features
        n_classes=n_classes, 
        random_state=random_state)
    return (X, y)

X, y = gen_dataset(n_samples=1000, n_features=100, n_classes=2)


In [2]:
import pandas as pd


# convert X to a dataframe and y to a series (a 1-d label vector)
X = pd.DataFrame(X)
y = pd.Series(y)

In [3]:
# ignore warnings

import warnings
warnings.filterwarnings('ignore')

In [4]:
# train / test split (fixed random_state for reproducibility)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

### Gradient Boosting Model

For implementation purposes, I will use a Gradient Boosting Model (GBM), specifically LightGBM's `LGBMClassifier`. GBM is an ensemble boosting method in which weak learners are trained sequentially to form a strong learner. Usually, the weak learners are decision trees. I chose GBM because it has a large set of hyperparameters to tune, some relating to the entire ensemble and some to the individual decision trees.



Now, I will write the 4 parts of the optimization problem as described earlier. The first is the objective function:-



### 1. Objective function

Our aim is to minimize the objective function. It takes in a set of values as input - in this case hyperparameters of GBM model - and outputs a real value to minimize - the cross validation loss. 

I will write the objective function for the GBM model with 5-fold cross validation.

In [5]:
from sklearn.model_selection import cross_val_score
from hyperopt import STATUS_OK
import lightgbm as lgb

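def objective_function(params):
    # hp.quniform returns floats, but LightGBM expects integers for these
    params = dict(params)
    for name in ('max_depth', 'n_estimators', 'num_leaves'):
        params[name] = int(params[name])
    clf = lgb.LGBMClassifier(**params)
    # mean score over 5-fold cross-validation
    score = cross_val_score(clf, X_train, y_train, cv=5).mean()
    # hyperopt minimizes the loss, so return the negative score
    return {'loss': -score, 'status': STATUS_OK}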
    

In the objective function, I implement cross-validation. Once the cross-validation is complete, we get the mean score (accuracy, the default for classifiers). Since Hyperopt needs a value to minimize, we take the negative of the score. This value is then returned under the `loss` key of the returned dictionary.

The objective function returns a dictionary of values - loss and status.

Next, I will define the domain space.

### 2. Domain space

The domain space is the range of values that we want to evaluate for each hyperparameter. In each iteration of the search, the Bayesian optimization algorithm will choose one value for each hyperparameter from the domain space. In Bayesian optimization, this space is defined by a probability distribution for each hyperparameter rather than a list of discrete values. When first tuning a model, we should create a wide domain space centered around the default values and then refine it in subsequent searches.

Now, I will define the domain space as follows:-

In [6]:
from hyperopt import hp
import numpy as np

space= {
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(1)),
    'max_depth': hp.quniform('max_depth', 5, 15, 1),
    'n_estimators': hp.quniform('n_estimators', 5, 35, 1),
    'num_leaves': hp.quniform('num_leaves', 5, 50, 1),
    'boosting_type': hp.choice('boosting_type', ['gbdt', 'dart']),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.6, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
}

Here we use a number of different domain distribution types which are as follows:-

- choice : categorical variables
- quniform : discrete uniform (evenly spaced values, returned as floats such as `7.0`)
- uniform : continuous uniform (floats drawn uniformly between the two bounds)
- loguniform : continuous log uniform (floats drawn uniformly on a log scale)

Next, I will define the optimization algorithm.

### 3. Optimization algorithm

Specifying the optimization algorithm in Hyperopt is very simple. It just involves a single line of code. We will use the Tree-structured Parzen Estimator (`tpe`). The code snippet is as follows:-

`from hyperopt import tpe`

`tpe_algorithm = tpe.suggest`


Hyperopt offers TPE alongside random search. During optimization, the TPE algorithm constructs the probability model from the past results and decides the next set of hyperparameters to evaluate in the objective function by maximizing the expected improvement.
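Here is a rough sketch of how TPE builds its probability model, using notation introduced here for illustration ($y^*$ is a loss threshold chosen as a quantile $\gamma$ of the observed losses). Instead of modelling $p(y \mid x)$ directly, TPE models the hyperparameters given the loss with two densities:

$$
p(x \mid y) =
\begin{cases}
l(x) & \text{if } y < y^* \\
g(x) & \text{if } y \ge y^*
\end{cases}
\qquad
\mathrm{EI}_{y^*}(x) \propto \left( \gamma + \frac{g(x)}{l(x)}\,(1 - \gamma) \right)^{-1}
$$

Here $l(x)$ is fitted to the hyperparameters of the better trials and $g(x)$ to the rest, so maximizing expected improvement amounts to proposing values that are likely under $l(x)$ and unlikely under $g(x)$.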


Next, we should document the results history.

### 4. Results history

Hyperopt will track the results internally for the algorithm. To take a look into the results, we can use a `Trials` object which will store basic training information and also the dictionary returned from the objective function (which includes the loss and status), together with the hyperparameter values that were evaluated. All we need is one line of code to make a `Trials` object. The code snippet is as follows:-

`from hyperopt import Trials`

`trials = Trials()`


Alternatively, we can monitor the progress of a long training process by writing a line to a csv file with each search iteration. This also saves all the results to disk. We can do this using the `csv` library. Before training we should open a new csv file and write the headers. Within the objective function we can add lines to write to the csv on every iteration. The complete code is as follows:-

In [7]:
import csv

# File to save the results of each iteration
out_file = 'gbm_trials.csv'

# Write the headers to the file; the with-block closes it automatically
with open(out_file, 'w', newline='') as of_connection:
    writer = csv.writer(of_connection)
    writer.writerow(['loss', 'params', 'iteration', 'estimators', 'train_time'])

Writing to a csv file means we can check the progress by opening the file while training. Every time the objective function is called, it will write one line to this file.
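The kernel does not show the logging lines inside the objective function, so here is one possible sketch of them. It is an assumption-laden illustration: the loss is a toy quadratic standing in for the cross-validation loss, the log goes to a temporary file instead of `gbm_trials.csv`, the `estimators` column is left out, and the names `log_path`, `iteration` and `logged_objective` are hypothetical.

```python
import csv
import os
import tempfile
import time

STATUS_OK = "ok"  # same string value as hyperopt.STATUS_OK

# Hypothetical log location; the kernel itself writes to 'gbm_trials.csv'.
log_path = os.path.join(tempfile.mkdtemp(), "demo_trials.csv")
with open(log_path, "w", newline="") as f:
    csv.writer(f).writerow(["loss", "params", "iteration", "train_time"])

iteration = 0

def logged_objective(params):
    """Evaluate a toy loss and append one row to the CSV log."""
    global iteration
    iteration += 1
    start = time.perf_counter()
    loss = (params["x"] - 3.0) ** 2  # stand-in for the cross-validation loss
    train_time = time.perf_counter() - start
    # Append mode keeps earlier rows; closing the file flushes the new row.
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([loss, params, iteration, train_time])
    return {"loss": loss, "status": STATUS_OK}

# Call it directly twice, as fmin would on successive iterations.
logged_objective({"x": 0.0})
logged_objective({"x": 2.0})

with open(log_path, newline="") as f:
    rows = list(csv.reader(f))
print(len(rows))  # the header plus one row per call
```

Opening and closing the file on every call is slightly slower than keeping it open, but it means each row is on disk as soon as the iteration finishes.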

### Optimization

Once we have the four parts in place, optimization is run with fmin as follows:-

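In [8]:
from hyperopt import fmin, tpe, Trials
import numpy as np

num_eval = 500
trials = Trials()

# rstate fixes the random seed; recent hyperopt releases expect a numpy Generator
# (older releases expected np.random.RandomState(1) instead)
best_param = fmin(objective_function, space, algo=tpe.suggest, max_evals=num_eval,
                  trials=trials, rstate=np.random.default_rng(1))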




### Results

Within each iteration, the algorithm chooses new hyperparameter values from the surrogate function which is constructed based on the previous results and evaluates these values in the objective function. This continues for num_eval evaluations of the objective function with the surrogate function continually updated with each new result. 

The `best_param` object that is returned from `fmin` contains the hyperparameters that yielded the lowest loss on the objective function. Note that for `hp.choice` parameters it holds the index of the chosen option rather than the option itself; `hyperopt.space_eval(space, best_param)` converts the result back to actual values. Once we have these hyperparameters, we can use them to train a model on the full training data and then evaluate on the test data.


## 6. Important points to note  <a class="anchor" id="6"></a>


[Back to Table of Contents](#0.1)


- The optimal hyperparameters are those that do best in cross validation and not necessarily those that do best on the test data. When we use cross validation, we hope that these results generalize to the test data.

- Even with cross-validation, hyperparameter tuning can overfit to the training data. The best score from cross-validation is often noticeably higher than the score on the test data.

- Random search may return better hyperparameters just by luck. Bayesian optimization is not guaranteed to find better hyperparameters and can get stuck in a local minimum of the objective function.

## 7. Conclusion <a class="anchor" id="7"></a>

[Back to Table of Contents](#0.1)


- In this kernel, I demonstrated the idea of Bayesian Optimisation by using Hyperopt.

- Bayesian optimisation chooses the next hyperparameters in an informed way, and as such spends more time evaluating areas of the parameter distribution it believes have the highest chance of bringing a cross-validation score improvement versus previous iterations.

- This can result in fewer evaluations of the objective function and better generalisation performance on the test set compared to random or grid search.

- The relative benefits of Bayesian Optimisation depend on the size of the dataset and the size of the parameter space. The larger the dataset and/or the parameter space, the higher the potential for efficiency gains.

- Random Search can still outperform Bayesian Optimisation, as it could stumble upon the optimal set of hyperparameters right at the start, just by luck.

I hope you find this kernel useful and enjoyable.



Your comments and feedback are most welcome.

Thank you.

[Go to Top](#0)