# Bayesian Hyperparameter Optimization

In competitions as close as those on Kaggle, the lift you can get from optimizing hyperparameters often becomes the decisive factor. That seems to be even more the case in this competition, where we have to predict future prices of something as volatile as crypto.

# Data Loading and All That Stuff

Check out this amazing notebook! 

[https://www.kaggle.com/craniket/gresearch-submitting-lagged-features-via-api?scriptVersionId=85838940](https://www.kaggle.com/craniket/gresearch-submitting-lagged-features-via-api?scriptVersionId=85838940)

In [None]:

import os
import random
import pandas as pd
import numpy as np
import lightgbm as lgb
from lightgbm import LGBMRegressor
import gresearch_crypto
import time
import datetime
from hyperopt import tpe
from hyperopt import hp
from hyperopt import Trials, STATUS_OK,fmin
from hyperopt.pyll.base import scope
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
TRAIN_CSV = '/kaggle/input/g-research-crypto-forecasting/train.csv'
ASSET_DETAILS_CSV = '/kaggle/input/g-research-crypto-forecasting/asset_details.csv'

In [None]:
df_train = pd.read_csv(TRAIN_CSV)
df_train.head()

In [None]:
df_asset_details = pd.read_csv(ASSET_DETAILS_CSV).sort_values("Asset_ID")
df_asset_details

In [None]:
def get_features(df, 
                 asset_id, 
                 train=True):
    '''
    This function takes a dataframe with all asset data and returns the lagged features for a single asset.
    
    df - Full dataframe with all assets included
    asset_id - integer from 0-13 inclusive to represent a cryptocurrency asset
    train - True - you are training your model
          - False - you are submitting your model via api
    '''
    
    df = df[df['Asset_ID']==asset_id]
    df = df.sort_values('timestamp')
    if train:
        df_feat = df.copy()
        # define a train_flg column to split your data into train and validation
        totimestamp = lambda s: np.int32(time.mktime(datetime.datetime.strptime(s, "%d/%m/%Y").timetuple()))
        valid_window = [totimestamp("12/03/2021")]
        df_feat['train_flg'] = np.where(df_feat['timestamp']>=valid_window[0], 0,1)
        df_feat = df_feat[['timestamp','Asset_ID','Close','Target','train_flg']].copy()
    else:
        df = df.sort_values('row_id')
        df_feat = df[['Asset_ID','Close','row_id']].copy()
    
    # Create your features here, they can be lagged or not
    df_feat['sma15'] = df_feat['Close'].rolling(15).mean()/df_feat['Close'] -1
    df_feat['sma60'] = df_feat['Close'].rolling(60).mean()/df_feat['Close'] -1
    df_feat['sma240'] = df_feat['Close'].rolling(240).mean()/df_feat['Close'] -1
    
    df_feat['return15'] = df_feat['Close']/df_feat['Close'].shift(15) -1
    df_feat['return60'] = df_feat['Close']/df_feat['Close'].shift(60) -1
    df_feat['return240'] = df_feat['Close']/df_feat['Close'].shift(240) -1
    df_feat = df_feat.fillna(0)
    
    return df_feat

In [None]:
# create your feature dataframe for each asset and concatenate them in one pass
feature_df = pd.concat([get_features(df_train, i, train=True) for i in range(14)])

In [None]:
# attach each asset's weight to the feature dataframe
feature_df = pd.merge(feature_df, df_asset_details[['Asset_ID','Weight']], how='left', on=['Asset_ID'])

In [None]:
# define features for LGBM
features = ['Asset_ID','sma15','sma60','sma240','return15','return60','return240','Weight']
categoricals = ['Asset_ID']

In [None]:
del df_train

In [None]:
# define the evaluation metric
def weighted_correlation(a, train_data):
    '''
    Weighted Pearson correlation between predictions and targets.

    a - model predictions
    train_data - dataframe holding the Weight column and the true values as `target`
    '''
    w = np.ravel(train_data.Weight.values)
    a = np.ravel(a)
    b = np.ravel(train_data.target)

    # weighted means and variances
    sum_w = np.sum(w)
    mean_a = np.sum(a * w) / sum_w
    mean_b = np.sum(b * w) / sum_w
    var_a = np.sum(w * np.square(a - mean_a)) / sum_w
    var_b = np.sum(w * np.square(b - mean_b)) / sum_w

    # weighted covariance, then correlation
    cov = np.sum(a * b * w) / sum_w - mean_a * mean_b
    corr = cov / np.sqrt(var_a * var_b)

    return corr
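
A quick way to sanity-check a metric like this is to confirm that, with equal weights, it reduces to the ordinary Pearson correlation. The sketch below uses a standalone helper, `weighted_corr`, which is a hypothetical name and takes the targets and weights as plain arrays instead of a dataframe. The data is synthetic and only there for the check.

```python
import numpy as np

def weighted_corr(a, b, w):
    """Weighted Pearson correlation of a and b with non-negative weights w."""
    a, b, w = (np.ravel(np.asarray(x, dtype=float)) for x in (a, b, w))
    sum_w = np.sum(w)
    mean_a = np.sum(a * w) / sum_w
    mean_b = np.sum(b * w) / sum_w
    cov = np.sum(w * (a - mean_a) * (b - mean_b)) / sum_w
    var_a = np.sum(w * (a - mean_a) ** 2) / sum_w
    var_b = np.sum(w * (b - mean_b) ** 2) / sum_w
    return cov / np.sqrt(var_a * var_b)

# Synthetic predictions and targets (assumption: purely illustrative data).
rng = np.random.default_rng(0)
pred = rng.normal(size=1000)
target = 0.3 * pred + rng.normal(size=1000)

# With equal weights the metric should match numpy's Pearson correlation.
unweighted = np.corrcoef(pred, target)[0, 1]
weighted = weighted_corr(pred, target, np.ones(1000))
print(unweighted, weighted)
```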

# Tuning

Here is the part where we use Bayesian hyperparameter tuning. Simply put, Bayesian hyperparameter tuning is an "intelligent" mechanism in which each model evaluation uses information from the previous iterations. Compare this to grid search or random search, which are [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) problems: both evaluate the objective function at a bunch of points in the hyperparameter space, each independently of the others. Bayesian tuning, by contrast, is sequential, which is why we used the term "iterations" above.

# How does it work?
# What does hyperparameter optimization entail?

What do we REALLY want to find when we optimize hyperparameters? We want to find the set of values for which the loss function or objective function is minimized. The objective function depends on 
* The observations
* The hyperparameters


What happens if we fix the hyperparameters? The loss then behaves like a random variable with its own probability distribution: it takes values in a range that depends on the distribution of the predictor variables.

This can be denoted by ***P(loss|hyperparameters)***.

Once we have this, we can estimate the loss for this set of hyperparameters with something like the median of the assumed distribution.

However, we do not know this distribution for "each" value of the hyperparameter. The idea is to figure out this distribution at a sufficient number of points. That is, sufficient to get a decent idea of the joint distribution. 
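
To make the idea of *P(loss|hyperparameters)* concrete, here is a minimal sketch. It assumes a made-up noisy objective, `noisy_loss`, which is hypothetical and not the competition metric. It samples the loss repeatedly at a few fixed hyperparameter values and summarizes each sample with its median.

```python
import math
import random
import statistics

def noisy_loss(learning_rate, rng):
    """Hypothetical loss: a smooth bowl in log10(learning_rate) plus noise."""
    signal = (math.log10(learning_rate) + 2.0) ** 2
    return signal + rng.gauss(0.0, 0.1)

rng = random.Random(0)
summary = {}
for lr in (0.001, 0.01, 0.1):
    # Repeated evaluations at a fixed hyperparameter approximate P(loss | lr).
    samples = [noisy_loss(lr, rng) for _ in range(500)]
    summary[lr] = statistics.median(samples)
    print(f"learning_rate={lr}: median loss = {summary[lr]:.3f}")
```

In a real tuning run each evaluation is a full model fit, so we cannot afford hundreds of samples per point. That is exactly why the distribution has to be inferred from only a few evaluations.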

# How many points is sufficient?

Let's try to understand this visually. I am shamelessly stealing these pictures from [this amazing lecture](https://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec21.pdf).

Let's say we want to approximate a function, and we have evaluated its value at three points.


![](https://i.imgur.com/ujmYbRB.png)

One non-linear estimate of the function could be


![](https://i.imgur.com/hKeBPjt.png)

Note the labelled 80th, 90th, and 95th percentiles. We are extremely certain of the function's value at the points where we have evaluated it. We grow less and less certain as we move away.

We want to evaluate the function at enough points so that we are "pretty" certain of the function. And here is where the Bayesian optimization process comes in: it tells us where to evaluate the function next so as to gain the most certainty.
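
The "certain at evaluated points, less certain elsewhere" behaviour can be reproduced with a small Gaussian-process sketch. Everything here is an assumption made for illustration: the three observations are invented, the kernel is a plain squared-exponential with unit length scale, and the prior has zero mean and unit variance. The model behind the lecture's pictures may well differ.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential (Gaussian) kernel between two 1-D point sets."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

# Three hypothetical evaluations of an unknown function.
x_obs = np.array([-2.0, 0.0, 1.5])
y_obs = np.array([0.5, -1.0, 0.8])

x_grid = np.linspace(-4.0, 4.0, 81)
jitter = 1e-8  # small diagonal term for numerical stability

K = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
K_s = rbf_kernel(x_grid, x_obs)

# Posterior mean and standard deviation on the grid
# (zero prior mean, unit prior variance).
post_mean = K_s @ np.linalg.solve(K, y_obs)
post_var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
post_std = np.sqrt(np.clip(post_var, 0.0, None))

# Compare an observed point, a point between observations, and a far-away point.
for x in (0.0, 0.75, 4.0):
    i = int(np.argmin(np.abs(x_grid - x)))
    print(f"x={x_grid[i]:+.2f}  mean={post_mean[i]:+.3f}  std={post_std[i]:.3f}")
```

The standard deviation is close to zero at an observed point and approaches the prior value of 1 far away from all observations.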

Let's look at the components here.

# Components 

We obviously have the objective function to think about. In practice, however, it is too expensive to evaluate everywhere, so we approximate it with a cheaper model. People tend to use models built on different kinds of Gaussian kernels as proxies, or **surrogate functions**.

And then there's the function that tells us which point to look at next. This is called the **acquisition function**. In practice, people tend to use the expected improvement, E[max(γ − f(θ), 0)], where γ is the lowest value of the objective found so far. The θ at which this is maximum is chosen for the next evaluation. This is how it could look:


![](https://i.imgur.com/OAXvXTj.png)
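
When the surrogate's prediction at θ is Gaussian with mean μ and standard deviation σ, the expected improvement has a closed form: (γ − μ)Φ(z) + σφ(z), with z = (γ − μ)/σ. The sketch below assumes that Gaussian case. The candidate names and their (mean, std) values are invented for illustration.

```python
import math

def expected_improvement(mu, sigma, gamma):
    """E[max(gamma - f, 0)] when f ~ Normal(mu, sigma**2) (minimization)."""
    if sigma <= 0.0:
        return max(gamma - mu, 0.0)
    z = (gamma - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (gamma - mu) * cdf + sigma * pdf

gamma = 0.20  # lowest loss observed so far (hypothetical)

# Hypothetical surrogate predictions: theta -> (mean, std).
candidates = {
    "theta_a": (0.25, 0.01),  # worse mean, very certain
    "theta_b": (0.22, 0.10),  # slightly worse mean, very uncertain
    "theta_c": (0.19, 0.01),  # slightly better mean, very certain
}

ei = {name: expected_improvement(mu, sd, gamma)
      for name, (mu, sd) in candidates.items()}
best = max(ei, key=ei.get)
for name, value in ei.items():
    print(f"{name}: EI = {value:.5f}")
print("next point to evaluate:", best)
```

Note that the uncertain candidate wins here even though its mean is slightly worse than the current best. Expected improvement rewards uncertainty as well as a low predicted loss.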

In this notebook, we will use the TPE (Tree-structured Parzen Estimator) algorithm, which uses Gaussian kernels and expected improvement. It uses Bayes' theorem to calculate the expected improvement. The expectation is calculated over the conditional surrogate function, that is, over *P(loss|hyperparameters)*. TPE expresses *P(loss|hyperparameters)* in terms of *P(hyperparameters|loss)*, which is defined as


![](https://i.imgur.com/LcHJN92.png)


where y* is a threshold on the loss, chosen beforehand (typically as a quantile of the losses observed so far). This creates two distributions over the hyperparameters x: *l(x)* for those with scores lower than the threshold and *g(x)* for those with scores higher than it. The idea is to make it more likely to draw hyperparameters from regions resulting in lower losses. And indeed, when the expectation is calculated, it turns out to grow with the ratio *l(x)/g(x)*, so maximizing the expected improvement amounts to maximizing that ratio!
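
The selection step can be sketched in one dimension as follows. This is a simplified illustration, not hyperopt's actual implementation: the objective `toy_loss`, the fixed bandwidth, the 25% quantile, and the way candidates are drawn are all assumptions made to keep the example short.

```python
import math
import random

def kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built on `points`."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm
    return density

def toy_loss(x):
    """Hypothetical objective with its minimum at x = 2."""
    return (x - 2.0) ** 2

rng = random.Random(0)
xs = [rng.uniform(-5.0, 5.0) for _ in range(40)]  # hyperparameter values tried so far
losses = [toy_loss(x) for x in xs]

# y* is chosen as a quantile of the observed losses (here the best 25%).
y_star = sorted(losses)[len(losses) // 4]
good = [x for x, y in zip(xs, losses) if y < y_star]
bad = [x for x, y in zip(xs, losses) if y >= y_star]

l = kde(good, bandwidth=0.5)  # density of hyperparameters with loss < y*
g = kde(bad, bandwidth=0.5)   # density of hyperparameters with loss >= y*

# Draw candidates around the good points and keep the one with the largest l(x)/g(x).
candidates = [rng.choice(good) + rng.gauss(0.0, 0.5) for _ in range(50)]
next_x = max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))
print(f"y* = {y_star:.3f}, next point to evaluate: x = {next_x:.3f}")
```

The chosen point lands in the region where good observations are dense and bad ones are sparse, which here is the neighbourhood of the true minimum.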

More details can be found in [this brilliant article](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f).

And finally, there is, of course, the hyperparameter space over which we want to find the minimum.

To summarize, the components are:

* Surrogate function
* Acquisition function
* Hyperparameter space

Let's execute all of this. We will use the hyperopt package in Python.


In [None]:
def hyperopt(param_space, X_train, y_train, X_test, y_test, num_eval):

    start = time.time()

    # keep the targets next to the weights so the metric can read both,
    # without mutating the caller's X_test
    eval_df = X_test.assign(target=y_test)

    def objective_function(params):
        model = LGBMRegressor(**params).fit(X_train, y_train)
        y_pred = model.predict(X_test)
        score = weighted_correlation(y_pred, eval_df)
        # fmin minimizes, so return the negative correlation as the loss
        return {'loss': -score, 'status': STATUS_OK}

    trials = Trials()
    best_param = fmin(objective_function, 
                      param_space, 
                      algo=tpe.suggest, 
                      max_evals=num_eval, 
                      trials=trials)
    loss = [x['result']['loss'] for x in trials.trials]

    print("")
    print("##### Results")
    print("Best parameters: ", best_param)
    print("Best score: ", -min(loss))
    print("Time elapsed: ", time.time() - start)
    print("Parameter combinations evaluated: ", num_eval)

    return trials

In [None]:
# define train and validation weights and datasets
#weights_train = feature_df.query('train_flg == 1')[['Weight']]
#weights_test = feature_df.query('train_flg == 0')[['Weight']]

#train_dataset = lgb.Dataset(feature_df.query('train_flg == 1')[features], 
#                            feature_df.query('train_flg == 1')['Target'].values, 
#                            feature_name = features, 
#                            categorical_feature= categoricals)
#val_dataset = lgb.Dataset(feature_df.query('train_flg == 0')[features], 
#                          feature_df.query('train_flg == 0')['Target'].values, 
#                          feature_name = features, 
#                          categorical_feature= categoricals)

#train_dataset.add_w = weights_train
#val_dataset.add_w = weights_test

evals_result = {}

space = { 'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'max_depth': scope.int(hp.quniform('max_depth', 4, 10, 1)),
    'n_estimators': scope.int(hp.quniform('n_estimators', 200, 2000, 50)),
#    'early_stopping_rounds': scope.int(hp.quniform('early_stopping_rounds', 20, 500, 5)),
    'objective': 'regression',
    'metric': 'None',
    'boosting_type': hp.choice('boosting_type',['goss','gbdt']),
    'verbose': -1,
    'seed': 46
}

train_df = feature_df.query('train_flg == 1')
valid_df = feature_df.query('train_flg == 0')

results_hyperopt = hyperopt(space,
                            train_df[features], train_df['Target'].values,
                            valid_df[features], valid_df['Target'].values,
                            25)

##### Results

Best parameters:  {'boosting_type': 1, 'learning_rate': 0.010257600094741514, 'max_depth': 4.0, 'n_estimators': 1550.0}
Best score:  0.03316784490207032
Time elapsed:  26925.81499361992
Parameter combinations evaluated:  25
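
One thing to watch in the output above: for `hp.choice` parameters, `fmin` reports the index of the chosen option rather than the option itself, and `quniform` values come back as floats. Here is a minimal sketch of decoding the result by hand, using the values printed above. The `decoded` dict is just an illustrative name, and hyperopt's `space_eval(space, best_param)` is the library's own way to do the same mapping.

```python
# Values copied from the printed result above.
best_param = {
    'boosting_type': 1,
    'learning_rate': 0.010257600094741514,
    'max_depth': 4.0,
    'n_estimators': 1550.0,
}

# Order must match the list passed to hp.choice in the search space.
boosting_choices = ['goss', 'gbdt']

decoded = dict(best_param)
decoded['boosting_type'] = boosting_choices[best_param['boosting_type']]
# quniform returns floats; LightGBM expects integers for these two.
decoded['max_depth'] = int(best_param['max_depth'])
decoded['n_estimators'] = int(best_param['n_estimators'])
print(decoded)
```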

Thanks for checking this out! Let me know what you think! 