## Bayesian methods of hyperparameter optimization

This kernel shows how to use the BayesianOptimization library, which makes hyperparameter tuning much easier. The library is well documented, so I draw on its documentation here; you can find much more detail there.

Documentation:
https://github.com/fmfn/BayesianOptimization

First, some simple data preparation, so that we have something to demonstrate the library on.

In [1]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
import lightgbm
from bayes_opt import BayesianOptimization
from catboost import CatBoostClassifier, cv, Pool

In [2]:
train_df = pd.read_csv('../input/flight_delays_train.csv')
test_df = pd.read_csv('../input/flight_delays_test.csv')
train_df = train_df[train_df.DepTime <= 2400].copy()
y_train = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values

In [3]:
def label_enc(df_column):
    df_column = LabelEncoder().fit_transform(df_column)
    return df_column

def make_harmonic_features_sin(value, period=2400):
    value *= 2 * np.pi / period 
    return np.sin(value)

def make_harmonic_features_cos(value, period=2400):
    value *= 2 * np.pi / period 
    return np.cos(value)

def feature_eng(df):
    df['flight'] = df['Origin']+df['Dest']
    df['Month'] = df.Month.map(lambda x: x.split('-')[-1]).astype('int32')
    df['DayofMonth'] = df.DayofMonth.map(lambda x: x.split('-')[-1]).astype('uint8')
    df['begin_of_month'] = (df['DayofMonth'] < 10).astype('uint8')
    df['midddle_of_month'] = ((df['DayofMonth'] >= 10)&(df['DayofMonth'] < 20)).astype('uint8')
    df['end_of_month'] = (df['DayofMonth'] >= 20).astype('uint8')
    df['DayOfWeek'] = df.DayOfWeek.map(lambda x: x.split('-')[-1]).astype('uint8')
    df['hour'] = df.DepTime.map(lambda x: x/100).astype('int32')
    df['morning'] = df['hour'].map(lambda x: 1 if (x <= 11)& (x >= 7) else 0).astype('uint8')
    df['day'] = df['hour'].map(lambda x: 1 if (x >= 12) & (x <= 18) else 0).astype('uint8')
    df['evening'] = df['hour'].map(lambda x: 1 if (x >= 19) & (x <= 23) else 0).astype('uint8')
    df['night'] = df['hour'].map(lambda x: 1 if (x >= 0) & (x <= 6) else 0).astype('int32')
    df['winter'] = df['Month'].map(lambda x: x in [12, 1, 2]).astype('int32')
    df['spring'] = df['Month'].map(lambda x: x in [3, 4, 5]).astype('int32')
    df['summer'] = df['Month'].map(lambda x: x in [6, 7, 8]).astype('int32')
    df['autumn'] = df['Month'].map(lambda x: x in [9, 10, 11]).astype('int32')
    df['holiday'] = (df['DayOfWeek'] >= 5).astype(int) 
    df['weekday'] = (df['DayOfWeek'] < 5).astype(int)
    df['airport_dest_per_month'] = df.groupby(['Dest', 'Month'])['Dest'].transform('count')
    df['airport_origin_per_month'] = df.groupby(['Origin', 'Month'])['Origin'].transform('count')
    df['airport_dest_count'] = df.groupby(['Dest'])['Dest'].transform('count')
    df['airport_origin_count'] = df.groupby(['Origin'])['Origin'].transform('count')
    df['carrier_count'] = df.groupby(['UniqueCarrier'])['Dest'].transform('count')
    df['carrier_count_per month'] = df.groupby(['UniqueCarrier', 'Month'])['Dest'].transform('count')
    df['deptime_cos'] = df['DepTime'].map(make_harmonic_features_cos)
    df['deptime_sin'] = df['DepTime'].map(make_harmonic_features_sin)
    df['flightUC'] = df['flight']+df['UniqueCarrier']
    df['DestUC'] = df['Dest']+df['UniqueCarrier']
    df['OriginUC'] = df['Origin']+df['UniqueCarrier']
    return df.drop('DepTime', axis=1)
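
The harmonic features above place the departure time on a circle, so that times just before and just after midnight end up close together. Here is a minimal sketch of that idea using only the standard library; the helper names `harmonic` and `circle_distance` are mine and are not part of the kernel.

```python
import math

def harmonic(dep_time, period=2400):
    """Return (sin, cos) of a departure time given as an HHMM number."""
    angle = dep_time * 2 * math.pi / period
    return math.sin(angle), math.cos(angle)

def circle_distance(t1, t2):
    """Euclidean distance between two times in (sin, cos) space."""
    return math.dist(harmonic(t1), harmonic(t2))

# 23:59 and 00:01 are far apart as raw numbers but close on the circle.
late_vs_early = circle_distance(2359, 1)
# 23:59 and 12:00 sit on nearly opposite sides of the circle.
late_vs_noon = circle_distance(2359, 1200)
print(late_vs_early, late_vs_noon)
```

A tree model that only sees the raw `DepTime` would treat 2359 and 1 as the two extremes of the range, which is why both the sine and the cosine are kept.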

In [4]:
full_df = pd.concat([train_df.drop('dep_delayed_15min', axis=1), test_df])
full_df = feature_eng(full_df)

for column in ['UniqueCarrier', 'Origin', 'Dest','flight',  'flightUC', 'DestUC', 'OriginUC']:
    full_df[column] = label_enc(full_df[column])

X_train = full_df[:train_df.shape[0]]
X_test = full_df[train_df.shape[0]:]

Now we have data to tune parameters for different models.

In [5]:
X_train.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,UniqueCarrier,Origin,Dest,Distance,flight,begin_of_month,midddle_of_month,end_of_month,hour,morning,day,evening,night,winter,spring,summer,autumn,holiday,weekday,airport_dest_per_month,airport_origin_per_month,airport_dest_count,airport_origin_count,carrier_count,carrier_count_per month,deptime_cos,deptime_sin,flightUC,DestUC,OriginUC
0,8,21,7,1,19,82,732,171,0,0,1,19,0,0,1,0,0,0,1,0,1,0,746,1016,8290,11375,18024,1569,0.34366,-0.939094,265,494,67
1,4,20,3,19,226,180,834,3986,0,0,1,15,0,1,0,0,0,1,0,0,0,1,313,105,3523,1390,13069,1094,-0.612907,-0.790155,6907,1085,1441
2,9,2,5,21,239,62,416,4091,1,0,0,14,0,1,0,0,0,0,0,1,1,0,166,136,2246,1747,11737,977,-0.835807,-0.549023,7064,359,1518
3,11,25,6,16,81,184,872,1304,0,0,1,10,1,0,0,0,0,0,0,1,1,0,136,514,1785,6222,15343,1242,-0.884988,0.465615,2258,1122,484
4,10,7,6,20,182,210,423,2979,1,0,0,18,0,1,0,0,0,0,0,1,1,0,48,226,687,2571,30958,2674,0.073238,-0.997314,5144,1313,1103


## How does it work

Bayesian optimization works by constructing a posterior distribution of functions (a Gaussian process) that best describes the function you want to optimize. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in parameter space are worth exploring and which are not, as seen in the picture below.

<img src="https://github.com/fmfn/BayesianOptimization/blob/master/examples/bo_example.png?raw=true" />
As you iterate, the algorithm balances exploration and exploitation, taking into account what it knows about the target function. At each step a Gaussian process is fitted to the known samples (points previously explored), and the posterior distribution, combined with an exploration strategy (such as UCB (Upper Confidence Bound) or EI (Expected Improvement)), is used to determine the next point to explore (see the gif below).
<img src="https://github.com/fmfn/BayesianOptimization/raw/master/examples/bayesian_optimization.gif" />
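
To make the step described above more concrete, here is a rough one-dimensional sketch: fit a Gaussian process to the known samples, then pick the candidate with the highest upper confidence bound. It is not the library's implementation. The RBF kernel, the unit length scale, the toy `target` function and the value of `kappa` are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_oo, k_on)        # K^-1 k*
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_on * solved, axis=0)   # prior variance of this kernel is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))

def next_point_ucb(x_obs, y_obs, candidates, kappa=2.576):
    """Pick the candidate maximizing mean + kappa * std (upper confidence bound)."""
    mean, std = gp_posterior(x_obs, y_obs, candidates)
    return candidates[np.argmax(mean + kappa * std)]

def target(x):
    """Toy function to maximize (hypothetical)."""
    return -(x - 2.0) ** 2

x_obs = np.array([0.0, 1.0, 4.0])          # points explored so far
y_obs = target(x_obs)
candidates = np.linspace(0.0, 4.0, 401)    # grid of points we could try next
x_next = next_point_ucb(x_obs, y_obs, candidates)
print(x_next)
```

A larger `kappa` favours regions where the standard deviation is high (exploration), a smaller one favours regions where the mean is already high (exploitation).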

## Simple example

First, create an optimizer. It needs two things:
* the function to optimize
* the bounds of its parameters

In our case the function is a procedure that computes a quality metric for our model.

**!** Keep in mind that the optimizer maximizes the value of the function. If a smaller value of your metric is better, return the negative of the metric.
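
As a small illustration of the sign flip, here is a sketch with a "smaller is better" metric turned into a target for a maximizer. The toy data and the names `negative_mse` and `candidates` are made up for this example, and a plain `max` over a few candidates stands in for the optimizer.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference; smaller is better."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data generated by y = 2 * x (hypothetical, for illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def negative_mse(slope):
    """Target for a maximizer: the error is negated so that higher is better."""
    predictions = [slope * x for x in xs]
    return -mean_squared_error(ys, predictions)

# Maximizing the negated error is the same as minimizing the error.
candidates = [1.0, 1.5, 2.0, 2.5]
best_slope = max(candidates, key=negative_mse)
print(best_slope, negative_mse(best_slope))
```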

In [6]:
def simple_function(a, b):
    return a + b

In [7]:
optimizer = BayesianOptimization(
    f=simple_function,
    pbounds={'a': (1, 3),
             'b': (4, 7)})

Main parameters of `maximize`:

* **n_iter**: how many steps of Bayesian optimization to perform. The more steps, the more likely you are to find a good maximum.
* **init_points**: how many steps of random exploration to perform. Random exploration helps by diversifying the points explored.
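
To give a feel for what random exploration means, here is a sketch that draws a few points uniformly inside the bounds. This only illustrates the idea; the library's own sampling may differ, and `sample_random_points` is a hypothetical helper.

```python
import random

def sample_random_points(pbounds, init_points, seed=0):
    """Draw init_points parameter sets uniformly inside the given bounds."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in pbounds.items()}
        for _ in range(init_points)
    ]

pbounds = {'a': (1, 3), 'b': (4, 7)}
initial_points = sample_random_points(pbounds, init_points=3)
for point in initial_points:
    print(point)
```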

In [8]:
optimizer.maximize(init_points=3, n_iter=2)

|   iter    |  target   |     a     |     b     |
-------------------------------------------------
| 1         | 6.9       | 2.869     | 4.031     |
| 2         | 8.432     | 2.137     | 6.294     |
| 3         | 7.177     | 1.488     | 5.689     |
| 4         | 6.34      | 2.201     | 4.139     |
| 5         | 9.851     | 2.851     | 7.0       |


Ideal! We can see the best params:

In [9]:
optimizer.max['params']

{'a': 2.851454859656846, 'b': 7.0}

... and the best result

In [10]:
optimizer.max['target']

9.851454859656846

**!** The optimizer works with float values of the parameters. If a parameter must be an integer, cast it with int() inside the function.

## Test it on data

### LightGBM

My kernel that applies this to real data and real parameters with LightGBM:
https://www.kaggle.com/clair14/gold-is-the-reason-teams-and-bayes-for-lightgbm

Here I will use arbitrary parameter ranges, just for testing.

In [11]:
categorical_features = ['Month',  'DayOfWeek', 'UniqueCarrier', 'Origin', 'Dest','flight',  'flightUC', 'DestUC', 'OriginUC']

This is the function we want to maximize: it computes the cross-validation metric of LightGBM for the given parameters.

Some parameters, such as num_leaves, max_depth, min_child_samples and min_data_in_leaf, must be integers. (In LightGBM, min_child_samples is an alias of min_data_in_leaf, so tuning both is redundant.)

In [12]:
def lgb_eval(num_leaves, max_depth, lambda_l2, lambda_l1, min_child_samples, min_data_in_leaf):
    # The optimizer passes floats, so integer-valued parameters are cast with int().
    params = {
        "objective" : "binary",
        "metric" : "auc", 
        'is_unbalance': True,
        "num_leaves" : int(num_leaves),
        "max_depth" : int(max_depth),
        "lambda_l2" : lambda_l2,
        "lambda_l1" : lambda_l1,
        "num_threads" : 20,
        "min_child_samples" : int(min_child_samples),
        'min_data_in_leaf': int(min_data_in_leaf),
        "learning_rate" : 0.03,
        "subsample_freq" : 5,
        "bagging_seed" : 42,
        "verbosity" : -1
    }
    lgtrain = lightgbm.Dataset(X_train, y_train, categorical_feature=categorical_features)
    # 3-fold stratified cross-validation, up to 1000 rounds with early stopping.
    cv_result = lightgbm.cv(params,
                       lgtrain,
                       1000,
                       early_stopping_rounds=100,
                       stratified=True,
                       nfold=3)
    # Mean cross-validated AUC at the last recorded boosting round.
    return cv_result['auc-mean'][-1]

In [13]:
lgbBO = BayesianOptimization(lgb_eval, {'num_leaves': (25, 4000),
                                                'max_depth': (5, 63),
                                                'lambda_l2': (0.0, 0.05),
                                                'lambda_l1': (0.0, 0.05),
                                                'min_child_samples': (50, 10000),
                                                'min_data_in_leaf': (100, 2000)
                                                })

lgbBO.maximize(n_iter=10, init_points=2)

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
| 1         | 0.7292    | 0.01147   | 0.02856   | 11.78     | 1.22e+03  | 873.0     | 2.075e+0  |
| 2         | 0.7466    | 0.0144    | 0.02283   | 44.47     | 7.188e+0  | 1.464e+0  | 3.649e+0  |
| 3         | 0.7142    | 0.02691   | 0.03727   | 38.44     | 9.994e+0  | 214.9     | 43.16     |
| 4         | 0.722     | 0.002346  | 0.003244  | 42.38     | 9.939e+0  | 108.5     | 3.99e+03  |
| 5         | 0.7457    | 0.01357   | 0.007288  | 21.44     | 9.947e+0  | 1.964e+0  | 3

Now you can see the result

In [14]:
lgbBO.max

{'target': 0.746558505094885,
 'params': {'lambda_l1': 0.01440144517073837,
  'lambda_l2': 0.02282841464037362,
  'max_depth': 44.46957161169711,
  'min_child_samples': 7188.477326338226,
  'min_data_in_leaf': 1463.6619898784904,
  'num_leaves': 3648.7264465083626}}
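
The best parameters come back as floats, so before training a final model with them the integer-valued ones need the same int() cast that `lgb_eval` applies. Here is a small sketch; `cast_integer_params` is a hypothetical helper, not part of the library.

```python
INTEGER_PARAMS = {'num_leaves', 'max_depth', 'min_child_samples', 'min_data_in_leaf'}

def cast_integer_params(params, integer_names=INTEGER_PARAMS):
    """Copy params, truncating the integer-valued ones with int() as lgb_eval does."""
    return {
        name: int(value) if name in integer_names else value
        for name, value in params.items()
    }

# Values copied from the lgbBO.max output above.
best_params = {
    'lambda_l1': 0.01440144517073837,
    'lambda_l2': 0.02282841464037362,
    'max_depth': 44.46957161169711,
    'min_child_samples': 7188.477326338226,
    'min_data_in_leaf': 1463.6619898784904,
    'num_leaves': 3648.7264465083626,
}
final_params = cast_integer_params(best_params)
print(final_params)
```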

And the full history, step by step. For example, the first step:

In [15]:
lgbBO.res[0]

{'target': 0.7292442020076888,
 'params': {'lambda_l1': 0.011471649107357002,
  'lambda_l2': 0.02856379808602373,
  'max_depth': 11.78349479847602,
  'min_child_samples': 1219.9290147496015,
  'min_data_in_leaf': 873.0480733096188,
  'num_leaves': 2074.997747323033}}
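
Since `res` is a list of dicts with `target` and `params` keys, it is easy to rank the steps yourself. Here is a sketch on two entries copied from the outputs above (a real `res` has one entry per step); `rank_steps` is a hypothetical helper.

```python
# Two entries copied from the outputs above.
history = [
    {'target': 0.7292442020076888,
     'params': {'lambda_l1': 0.011471649107357002, 'lambda_l2': 0.02856379808602373,
                'max_depth': 11.78349479847602, 'min_child_samples': 1219.9290147496015,
                'min_data_in_leaf': 873.0480733096188, 'num_leaves': 2074.997747323033}},
    {'target': 0.746558505094885,
     'params': {'lambda_l1': 0.01440144517073837, 'lambda_l2': 0.02282841464037362,
                'max_depth': 44.46957161169711, 'min_child_samples': 7188.477326338226,
                'min_data_in_leaf': 1463.6619898784904, 'num_leaves': 3648.7264465083626}},
]

def rank_steps(res):
    """Return (step_number, target, params) tuples, best target first."""
    numbered = [(i + 1, step['target'], step['params']) for i, step in enumerate(res)]
    return sorted(numbered, key=lambda row: row[1], reverse=True)

ranked = rank_steps(history)
for step, target, params in ranked:
    print(step, round(target, 4), int(params['num_leaves']))
```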

## Loading progress

This is a wonderful feature. You can run your optimizer, collect some points, and then adjust something (the bounds, for example, once you see that some values are not interesting). There is no need to start from the beginning; you can simply reuse the previous results.
Maybe we change the data just a little and then continue searching for the best parameters.

In [16]:
from bayes_opt.logger import JSONLogger
from bayes_opt.event import Events

logger = JSONLogger(path="./logs.json")
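# The bayes_opt release used here spells the event OPTMIZATION_STEP;
# newer releases use Events.OPTIMIZATION_STEP.
lgbBO.subscribe(Events.OPTMIZATION_STEP, logger)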

In [17]:
lgbBO.maximize(n_iter=10, init_points=3)

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
| 13        | 0.7462    | 0.04216   | 0.02283   | 18.87     | 7.272e+0  | 1.68e+03  | 3.798e+0  |
| 14        | 0.7465    | 0.01389   | 0.03158   | 12.08     | 241.5     | 1.739e+0  | 2.871e+0  |
| 15        | 0.7455    | 0.001313  | 0.02852   | 9.068     | 5.419e+0  | 1.815e+0  | 2.155e+0  |
| 16        | 0.7458    | 0.02807   | 0.04261   | 10.66     | 9.921e+0  | 1.985e+0  | 56.49     |
| 17        | 0.7455    | 0.006499  | 0.02464   | 61.1      | 2.256e+0  | 1.998e+0  | 1.323e+0

Now we can load the log into another optimizer.

In [18]:
new_opt = BayesianOptimization(lgb_eval, {'num_leaves': (25, 100),
                                                'max_depth': (5, 63),
                                                'lambda_l2': (0.0, 0.05),
                                                'lambda_l1': (0.0, 0.05),
                                                'min_child_samples': (50, 100),
                                                'min_data_in_leaf': (50, 200)
                                                })

In [19]:
from bayes_opt.util import load_logs

load_logs(new_opt, logs=["./logs.json"]);
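
If you want to look inside the log yourself, here is a sketch that reads such a file with the standard library. It assumes the log holds one JSON object per line with `target` and `params` keys; that layout is my assumption, and the real file may carry extra fields. A small stand-in file is used here, with values taken from rows 13 and 14 of the table above and only two of the parameters kept.

```python
import json
import os
import tempfile

def read_json_lines(path):
    """Parse a file holding one JSON object per line into a list of dicts."""
    with open(path) as handle:
        return [json.loads(line) for line in handle if line.strip()]

# Stand-in entries (rows 13 and 14 above, abbreviated to two parameters).
sample_entries = [
    {"target": 0.7462, "params": {"lambda_l1": 0.04216, "max_depth": 18.87}},
    {"target": 0.7465, "params": {"lambda_l1": 0.01389, "max_depth": 12.08}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_logs.json")
    with open(sample_path, "w") as handle:
        for entry in sample_entries:
            handle.write(json.dumps(entry) + "\n")
    entries = read_json_lines(sample_path)

best_entry = max(entries, key=lambda entry: entry["target"])
print(len(entries), best_entry["target"])
```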

In [20]:
new_opt.maximize(n_iter=5, init_points=1)

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
| 1         | 0.7174    | 0.01712   | 0.03644   | 6.972     | 82.28     | 185.3     | 66.29     |
| 2         | 0.7165    | 0.03837   | 0.02931   | 61.46     | 94.69     | 199.9     | 99.72     |
| 3         | 0.7177    | 0.02965   | 0.02506   | 5.49      | 99.81     | 198.5     | 97.95     |
| 4         | 0.7178    | 0.02404   | 0.02519   | 5.61      | 88.52     | 197.9     | 98.57     |
| 5         | 0.7174    | 0.03907   | 0.04168   | 5.016     | 97.89     | 199.3     | 91.32

In [21]:
new_opt.max

{'target': 0.7464723999907198,
 'params': {'lambda_l1': 0.013888878564793185,
  'lambda_l2': 0.03157713050090246,
  'max_depth': 12.076290564328213,
  'min_child_samples': 241.5320237877821,
  'min_data_in_leaf': 1738.5442220037476,
  'num_leaves': 2870.6031128571854}}

As you can see, the best point comes from the previous run: it was loaded from the log.
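
One way to check this is to compare the reported best point with the narrower bounds given to `new_opt`: a point that lies outside them cannot have been sampled in the new run. Here is a sketch with the values copied from the cells above; `out_of_bounds` is a hypothetical helper.

```python
def out_of_bounds(point, pbounds):
    """Return the names of parameters whose value falls outside (low, high)."""
    return sorted(
        name for name, value in point.items()
        if not pbounds[name][0] <= value <= pbounds[name][1]
    )

# Bounds passed to new_opt.
new_bounds = {
    'num_leaves': (25, 100),
    'max_depth': (5, 63),
    'lambda_l2': (0.0, 0.05),
    'lambda_l1': (0.0, 0.05),
    'min_child_samples': (50, 100),
    'min_data_in_leaf': (50, 200),
}
# Best point reported by new_opt.max.
loaded_best = {
    'lambda_l1': 0.013888878564793185,
    'lambda_l2': 0.03157713050090246,
    'max_depth': 12.076290564328213,
    'min_child_samples': 241.5320237877821,
    'min_data_in_leaf': 1738.5442220037476,
    'num_leaves': 2870.6031128571854,
}
outside = out_of_bounds(loaded_best, new_bounds)
print(outside)
```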

## Try certain points

We can also pick specific points and evaluate them. By default `probe` only queues a point; it is evaluated on the next call to `maximize`.

In [22]:
new_opt.probe({'num_leaves': 10,
                'max_depth': 100,
                'lambda_l2': 1,
                'lambda_l1': 1,
                'min_child_samples': 300,
                'min_data_in_leaf': 1000 })

new_opt.probe({'num_leaves': 55,
                'max_depth': 400,
                'lambda_l2': 5,
                'lambda_l1': 5,
                'min_child_samples': 100,
                'min_data_in_leaf': 10 })

In [23]:
new_opt.maximize(n_iter=0, init_points=0)

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
| 7         | 0.7301    | 1.0       | 1.0       | 100.0     | 300.0     | 1e+03     | 10.0      |
| 8         | 0.7149    | 5.0       | 5.0       | 400.0     | 100.0     | 10.0      | 55.0      |


### CatBoost

Let's test another model.

In [24]:
def cat_eval(num_leaves,max_depth,bagging_temperature, l2_leaf_reg):
    params = {'bagging_temperature': bagging_temperature,
              'num_leaves': int(num_leaves),
              'max_depth': int(max_depth),
              'l2_leaf_reg': l2_leaf_reg,
              'iterations': 500,
              'learning_rate':0.1,
              'early_stopping_rounds':100,
              'eval_metric': "AUC",
              'verbose': False}
    cv_dataset = Pool(data=X_train,
                  label=y_train,
                  cat_features=categorical_features)
    scores = cv(cv_dataset,
            params,
            fold_count=3)
    return scores['test-AUC-mean'].max()

In [25]:
cat_opt = BayesianOptimization(cat_eval, {'num_leaves': (25, 100),
                                          'max_depth': (5, 15),
                                          'bagging_temperature': (0.1, 0.9),
                                          'l2_leaf_reg': (2,5)
                                        })

In [26]:
cat_opt.maximize(n_iter=5, init_points=2)

|   iter    |  target   | baggin... | l2_lea... | max_depth | num_le... |
-------------------------------------------------------------------------
| 1         | 0.7533    | 0.2264    | 2.235     | 6.151     | 34.84     |
| 2         | 0.7488    | 0.4136    | 4.38      | 10.55     | 63.16     |
| 3         | 0.7329    | 0.3405    | 4.969     | 15.0      | 25.16     |
| 4         | 0.7485    | 0.7435    | 2.091     | 5.035     | 99.69     |
| 5         | 0.7479    | 0.8444    | 4.962     | 5.01      | 47.41     |
| 6         | 0.7503    | 0.1       | 2.0       | 5.0       | 65.7      |
| 7         | 0.7314    | 0.3017    | 2.093     | 14.91     | 99.96

## Conclusion

This library gives good results, and it is much more efficient than random search or grid search, since it uses the previous results to choose the next point instead of trying every point.
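
To put a rough number on "trying every point", here is a back-of-the-envelope count for a full grid over the six LightGBM parameters. The grid resolution of 10 values per parameter is a hypothetical choice; the 12 evaluations are the `init_points=2` plus `n_iter=10` of the first LightGBM run above.

```python
def grid_search_evaluations(values_per_parameter, n_parameters):
    """Number of model fits needed to try every combination on a full grid."""
    return values_per_parameter ** n_parameters

n_parameters = 6            # lgb_eval takes six parameters
values_per_parameter = 10   # hypothetical grid resolution
grid_evaluations = grid_search_evaluations(values_per_parameter, n_parameters)

bayes_evaluations = 2 + 10  # init_points + n_iter of the first LightGBM run

print(grid_evaluations, bayes_evaluations)
```

Each evaluation here is a full 3-fold cross-validation, so the gap in the number of evaluations translates directly into running time.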

Thanks for your attention!

<img src='https://sites.google.com/site/bayesforvietnam/_/rsrc/1465811460099/home/Bayes%201.jpg'/>