# Bayesian Optimization

Bayesian optimization is a sequential design strategy for global optimization of black-box functions that does not assume any functional forms. It is usually employed to optimize expensive-to-evaluate functions.


General properties of Bayesian optimization methods:
1. It is a sequential approach: each evaluation is chosen using the results of the previous ones, so the calculations are not parallelized.
2. It does not use the derivative of the objective function. [This suits problems where the derivative is unknown or the variables are discrete.
Depending on the problem, the derivative may not even be defined.
Of course, it can also be applied when a derivative is defined but complicated to calculate. In that case, automatic differentiation may be used instead.]
3. It uses machine learning methods to predict where a better solution is likely to be. (surrogate model) [using artificial intelligence]
4. Various surrogate models are possible, and the results can be sensitive to the choice of model.
5. It is known as an appropriate method when the objective function cannot be evaluated many times.
  - That is, when each evaluation is too expensive.
In other words, it works best when building the surrogate model with machine learning costs far less than evaluating the objective function.
(However, too much data can become a problem:
the cost grows with the cube of the number of data points, which is the complexity of the matrix inversion involved.)
  - How much time does one evaluation of the objective function take? It should be compared directly with the time needed to compute the surrogate model's predictions.
6. It is a general optimization algorithm that can be used even when the objective function is noisy.
7. The machine learning model can be updated continuously as additional data arrives.
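
The properties above can be made concrete with a minimal sketch of the loop in one dimension, using only the standard library. It is an illustration under stated assumptions, not the implementation inside the `bayes_opt` package: the helper names (`rbf`, `solve`, `gp_posterior`, `expected_improvement`, `bayes_opt_1d`), the kernel length scale, and the toy objective are all hypothetical. The surrogate is a Gaussian process with a squared-exponential kernel, and each new point is chosen by maximizing Expected Improvement over a fixed grid.

```python
import math
import random

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two scalars (length scale is an assumption)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (cubic cost in len(b))."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)       # K^-1 y
    v = solve(K, k_star)       # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 1e-12)
    return mean, math.sqrt(var)

def expected_improvement(mean, std, best, xi=0.01):
    """Expected Improvement for a maximization problem."""
    z = (mean - best - xi) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mean - best - xi) * cdf + std * pdf

def bayes_opt_1d(f, lo, hi, init_points=3, n_iter=10, seed=0):
    """Maximize f on [lo, hi]: random initial points, then EI-guided points."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(init_points)]
    ys = [f(x) for x in xs]
    grid = [lo + (hi - lo) * i / 100 for i in range(101)]
    for _ in range(n_iter):  # sequential: each step uses all earlier results
        best = max(ys)
        x_next = max(grid, key=lambda x: expected_improvement(*gp_posterior(xs, ys, x), best))
        xs.append(x_next)
        ys.append(f(x_next))
    i_best = max(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best]

# Toy objective with its maximum at x = 0.7 (purely illustrative).
x_best, y_best = bayes_opt_1d(lambda x: -(x - 0.7) ** 2, 0.0, 1.0)
print(x_best, y_best)
```

The `solve` helper is where the cubic cost mentioned in item 5 comes from: every added observation enlarges the kernel matrix that has to be factorized.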


https://github.com/fmfn/BayesianOptimization


- Note that its use case differs from that of a typical local minimization algorithm.
- For example, traditional local minimization algorithms are still better for locally minimizing very simple functions. [Nelder-Mead]
- If the derivative of the objective function is known analytically, traditional gradient-based methods should be used. [BFGS]
- The machine learning approach is meant for problems that are more complicated than that.





<img src="https://blog.kakaocdn.net/dn/c5AeYX/btqNqC6y46P/yCsuigpZvzKbDUK6dHZg51/img.png" width="500" height="600">



Here, we will use this technique to find optimal hyperparameters.

## Pre-Modeling Operations


In [11]:
# Mount Google Drive (Colab only)
from google.colab import drive
drive.mount('/content/gdrive/')

import time
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
from tqdm import tnrange, tqdm_notebook, notebook, tqdm

from lightgbm import LGBMRegressor
from sklearn import preprocessing  # provides LabelEncoder, used below
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.metrics import *

plt.style.use('fivethirtyeight')
warnings.filterwarnings('ignore')
%matplotlib inline

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [12]:
path = '/content/gdrive/MyDrive/kaggle/'

In [13]:
train = pd.read_csv(path + 'train.csv')
test = pd.read_csv(path + 'test.csv')
submission = pd.read_csv(path + 'sample_submission.csv')

In [14]:
print(train.shape)
print(test.shape)
print(submission.shape)

(300000, 26)
(200000, 25)
(200000, 2)


In [15]:
train = train.iloc[:,1:]
test = test.iloc[:,1:]

In [16]:
encoder = preprocessing.LabelEncoder()
train_label = train.copy()
test_label = test.copy()

for i in range(10):
    train_label.iloc[:,i] = encoder.fit_transform(train.iloc[:,i])
    test_label.iloc[:,i] = encoder.transform(test.iloc[:,i])

In [18]:
X_label=train_label.drop('target',axis=1)
y=train['target']

## Modeling (Bayesian Optimization)

- First, use GridSearchCV to find the most influential parameters, `learning_rate` and `max_depth`.

- To save time, the search is kept small and `n_estimators` is fixed at a low value of 50.


In [26]:
learn_rate=[0.05,0.3,0.5,0.9]
max_depth=[5,10,15,20]

hyper={'learning_rate':learn_rate,'max_depth':max_depth}
gd=GridSearchCV(estimator=LGBMRegressor(n_estimators=50, n_jobs=-1),param_grid=hyper,verbose=True, n_jobs=-1, cv=3, scoring='neg_root_mean_squared_error')
gd.fit(X_label,y)
print(gd.best_score_)
print(gd.best_estimator_)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  1.5min finished


-0.7250238851106058
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.3, max_depth=20,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=50, n_jobs=-1, num_leaves=31, objective=None,
              random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
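
The counts in the log above ("16 candidates, totalling 48 fits") follow directly from the grid: every combination of the two lists is one candidate, and each candidate is fitted once per fold. A small sketch of that arithmetic, reusing the lists from the cell above (the variable names `candidates` and `total_fits` are only illustrative):

```python
from itertools import product

learn_rate = [0.05, 0.3, 0.5, 0.9]
max_depth = [5, 10, 15, 20]
cv_folds = 3

# Every (learning_rate, max_depth) pair is one candidate in the grid.
candidates = [
    {"learning_rate": lr, "max_depth": depth}
    for lr, depth in product(learn_rate, max_depth)
]
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 16 48
```

The number of fits grows multiplicatively with every parameter added to the grid, which is the motivation for switching to Bayesian optimization when more parameters are tuned.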


  <br/>

Next, we tune the model with Bayesian optimization, one of the hyperparameter optimization methods used in AutoML.

We will not ensemble models; the best result is obtained with a single model. The model to use is LGBM.

To use bayesian-optimization in a Python environment, you must first install the package. You can install it with the pip command.

- bayesian-optimization 1.2.0 [link](https://pypi.org/project/bayesian-optimization/)

In [27]:
!pip install bayesian-optimization

Collecting bayesian-optimization
  Downloading bayesian-optimization-1.2.0.tar.gz (14 kB)
Building wheels for collected packages: bayesian-optimization
  Building wheel for bayesian-optimization (setup.py) ... done
  Created wheel for bayesian-optimization: filename=bayesian_optimization-1.2.0-py3-none-any.whl size=11685 sha256=0c861c94b30437cf01203bc027053b3a5b665be6b4c9c5fe4d6aa8edc4adb63a
  Stored in directory: /root/.cache/pip/wheels/fd/9b/71/f127d694e02eb40bcf18c7ae9613b88a6be4470f57a8528c5b
Successfully built bayesian-optimization
Installing collected packages: bayesian-optimization
Successfully installed bayesian-optimization-1.2.0


The `bayes_opt` module is now installed in your environment.

<br/>

Import the required class.

In [28]:
from bayes_opt import BayesianOptimization

Next, we create the objective function. It takes a combination of model parameters as input and returns the cross-validated performance of LGBMRegressor. Since the evaluation metric is RMSE and the optimizer maximizes its target, the function returns the negative RMSE score.

In [29]:
# Define the objective function
def lgbm_cv(learning_rate, n_estimators, max_depth, num_leaves, subsample, min_child_weight, colsample_bytree, max_bin, reg_alpha, reg_lambda):
    model = LGBMRegressor(learning_rate = learning_rate,
                                n_estimators = int(n_estimators),
                                num_leaves = int(round(num_leaves)),
                                max_depth = int(round(max_depth)),
                                n_jobs = -1,
                                random_state = 0,
                                subsample = max(subsample, 0),
                                min_child_weight = int(round(min_child_weight)),
                                colsample_bytree = colsample_bytree,
                                max_bin = int(round(max_bin)),
                                reg_alpha = max(reg_alpha, 0),
                                reg_lambda = max(reg_lambda, 0)
                               )
    scores = cross_validate(model, X_label, y, cv=5, n_jobs = -1, scoring='neg_root_mean_squared_error')
    return np.mean(scores['test_score'])

The arguments of the function are the model parameters. List every parameter whose optimal value you want to search for.

I used a combination of 10 parameters in total.


`learning_rate`, `max_depth`, and `n_estimators` are given narrow ranges that do not deviate much from the values found earlier, while the remaining parameters are given wide ranges. Integer-valued parameters are wrapped in `int(round())`, and `max` (or `min`) clamps parameters that have a fixed lower (or upper) limit.
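
The casting and clamping step can be sketched as a small helper. This is an illustration only: `to_model_params` and its two key lists are hypothetical names, and it rounds every integer parameter, whereas the objective function above truncates `n_estimators` with a plain `int()`.

```python
def to_model_params(raw):
    """Convert continuous optimizer proposals into valid model arguments."""
    integer_keys = ("n_estimators", "num_leaves", "max_depth",
                    "min_child_weight", "max_bin")
    non_negative_keys = ("subsample", "reg_alpha", "reg_lambda")
    params = dict(raw)  # copy so the optimizer's dict is left untouched
    for key in integer_keys:
        if key in params:
            params[key] = int(round(params[key]))
    for key in non_negative_keys:
        if key in params:
            params[key] = max(params[key], 0)
    return params

# Example with made-up proposals:
print(to_model_params({"max_depth": 48.6, "num_leaves": 58.4, "reg_alpha": -0.5}))
# {'max_depth': 49, 'num_leaves': 58, 'reg_alpha': 0}
```

Keeping the conversion in one place makes it easier to apply exactly the same rules when the final model is built from the best parameters.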

`cross_validate` is a function that calculates cross-validated scores.
`cross_val_score` is commonly used when there is a single evaluation metric, while `cross_validate` also supports several metrics at once. We use `cross_validate` here with the competition's evaluation metric.

The `cv` parameter was set to 5.


Next, set the search range for each input parameter.

In [30]:
# Search range for each input parameter
pbounds = {'learning_rate': (0.005, 0.5),
           'n_estimators': (30, 80),
           'max_depth': (15, 50),
           'num_leaves': (0, 100),
           'subsample': (0, 0.99),
           'min_child_weight' : (0, 100),
           'colsample_bytree': (0, 0.99),
           'max_bin': (0, 1000),
           'reg_alpha': (0, 10),
           'reg_lambda' : (0, 10)
          }

The search ranges were chosen by referring to the model's documentation. A detailed description of each parameter is omitted here.


Next, create the optimizer object:


In [31]:
lgbmBO = BayesianOptimization(f = lgbm_cv, pbounds = pbounds, verbose = 2, random_state = 0)

The first argument, `f`, is the objective function, and `pbounds` is the search range of each input. The random seed is set to 0.

<br/>

Now let's run the Bayesian optimization.

`init_points` is the number of initial random evaluations: that many input points are sampled within the ranges set in `pbounds` before the surrogate model starts guiding the search. `n_iter` is the number of subsequent optimization steps. With `init_points=5` and `n_iter=20`, the objective function is therefore evaluated 25 times in total.
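
As a rough picture of the initial phase, the sketch below draws uniform random points inside a bounds dictionary. It is not the sampler used inside `bayes_opt`; `sample_initial_points` is a hypothetical helper, and only two of the ten parameters are listed to keep it short.

```python
import random

def sample_initial_points(pbounds, init_points, seed=0):
    """Draw init_points parameter sets uniformly within the given bounds."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in pbounds.items()}
        for _ in range(init_points)
    ]

pbounds = {"learning_rate": (0.005, 0.5), "n_estimators": (30, 80)}
initial = sample_initial_points(pbounds, init_points=5)

n_iter = 20
total_evaluations = len(initial) + n_iter  # init_points + n_iter
print(total_evaluations)  # 25
```

Each of these evaluations runs a full cross-validation, so the total of `init_points + n_iter` is what determines the running time.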

The acquisition function `acq` is set to `'ei'` (Expected Improvement). `xi` controls the exploration-exploitation trade-off; it is typically set to 0.01, and larger values push the search toward exploration.
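
For reference, Expected Improvement for a maximization problem is commonly written as below. The symbols are assumptions introduced here, not notation from the original post: $\mu(x)$ and $\sigma(x)$ are the surrogate model's predicted mean and standard deviation at $x$, $f^{+}$ is the best target observed so far, and $\Phi$ and $\varphi$ are the standard normal CDF and PDF.

$$
\mathrm{EI}(x) = \bigl(\mu(x) - f^{+} - \xi\bigr)\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad
z = \frac{\mu(x) - f^{+} - \xi}{\sigma(x)}, \quad \sigma(x) > 0
$$

A larger $\xi$ lowers the first term for points whose predicted mean is only slightly above $f^{+}$, so points with a large $\sigma(x)$, that is, unexplored regions, become relatively more attractive.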

The results of the operation are as follows:

In [32]:
lgbmBO.maximize(init_points=5, n_iter = 20, acq='ei', xi=0.01)

|   iter    |  target   | colsam... | learni... |  max_bin  | max_depth | min_ch... | n_esti... | num_le... | reg_alpha | reg_la... | subsample |
-------------------------------------------------------------------------------------------------------------------------------------------------
|  1        | -0.7223   |  0.5433   |  0.359    |  602.8    |  34.07    |  42.37    |  62.29    |  43.76    |  8.918    |  9.637    |  0.3796   |
|  2        | -0.7386   |  0.7838   |  0.2668   |  568.0    |  47.4     |  7.104    |  34.36    |  2.022    |  8.326    |  7.782    |  0.8613   |
|  3        | -0.7227   |  0.9688   |  0.4006   |  461.5    |  42.32    |  11.83    |  62.0     |  14.34    |  9.447    |  5.218    | ...

The `target` value is the return value of the objective function, i.e. the negative RMSE score. The best value was about -0.721, which corresponds to an RMSE of about 0.721.

Although this is not a significant improvement in performance, searching over fewer parameters or ensembling models may lead to better scores.
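
To put the size of the improvement in numbers, the two scores reported in this notebook can be compared directly. Both are negative RMSE values, so the sign is flipped first. The comparison is only indicative, because the grid search used `cv=3` and the objective function used `cv=5`; the variable names are illustrative.

```python
grid_search_score = -0.7250238851106058  # gd.best_score_ from the grid search
bayes_opt_score = -0.7205944791230456    # lgbmBO.max['target']

# scikit-learn's 'neg_root_mean_squared_error' scorer returns -RMSE.
grid_rmse = -grid_search_score
bayes_rmse = -bayes_opt_score
relative_gain = (grid_rmse - bayes_rmse) / grid_rmse

print(f"RMSE {grid_rmse:.4f} -> {bayes_rmse:.4f} ({relative_gain:.2%} lower)")
# RMSE 0.7250 -> 0.7206 (0.61% lower)
```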

<br/>

The best parameter values found are as follows:

In [33]:
lgbmBO.max

{'params': {'colsample_bytree': 0.21106978103490293,
  'learning_rate': 0.20370811972451633,
  'max_bin': 388.32866754216286,
  'max_depth': 48.637200464467874,
  'min_child_weight': 77.06806238233416,
  'n_estimators': 79.72581709966855,
  'num_leaves': 58.4002115286548,
  'reg_alpha': 9.356747820775828,
  'reg_lambda': 9.874741430024695,
  'subsample': 0.8165349055304533},
 'target': -0.7205944791230456}

In [38]:
# Apply the optimized parameters
fit_lgbm = LGBMRegressor(learning_rate=lgbmBO.max['params']['learning_rate'],
                         n_estimators = int(round(lgbmBO.max['params']['n_estimators'])),
                         num_leaves = int(round(lgbmBO.max['params']['num_leaves'])),
                         max_depth = int(round(lgbmBO.max['params']['max_depth'])),
                         max_bin = int(round(lgbmBO.max['params']['max_bin'])),
                         min_child_weight = int(round(lgbmBO.max['params']['min_child_weight'])),
                         colsample_bytree=lgbmBO.max['params']['colsample_bytree'],
                         subsample = lgbmBO.max['params']['subsample'],
                         reg_alpha = lgbmBO.max['params']['reg_alpha'],
                         reg_lambda = lgbmBO.max['params']['reg_lambda']
                         )

In [40]:
fit_lgbm

LGBMRegressor(boosting_type='gbdt', class_weight=None,
              colsample_bytree=0.21106978103490293, importance_type='split',
              learning_rate=0.20370811972451633, max_bin=388, max_depth=49,
              min_child_samples=20, min_child_weight=77, min_split_gain=0.0,
              n_estimators=80, n_jobs=-1, num_leaves=58, objective=None,
              random_state=None, reg_alpha=9.356747820775828,
              reg_lambda=9.874741430024695, silent=True,
              subsample=0.8165349055304533, subsample_for_bin=200000,
              subsample_freq=0)

In [39]:
model = fit_lgbm.fit(X_label,y)

Model training is complete.

## Model Fitting

In [42]:
pred_y = model.predict(test_label)

In [43]:
submission['target']=pred_y
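# index=False avoids writing the DataFrame index as an extra column
# (assumes the file should contain only the columns of sample_submission.csv)
submission.to_csv('submission_BO.csv', index=False)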

---
Hope this helps. Building a more refined baseline with GridSearchCV in the earlier step, searching over fewer parameters, and ensembling models with good scores will likely lead to greater performance improvements.


- References

https://www.dacon.io/competitions/official/235647/codeshare/1720

http://egloos.zum.com/incredible/v/7479039

https://en.wikipedia.org/wiki/Bayesian_optimization