# Introduction

Hyper-parameter tuning is a key factor in reaching high-performing models.

In this notebook, I introduce [Optuna](https://optuna.org/), a framework that performs hyper-parameter search using Bayesian methods.

## Part 1: LGB

LightGBM (LGB) is a popular gradient-boosted tree algorithm developed by Microsoft and used in many winning solutions to tabular competitions. In this notebook, we use Optuna's plug-and-play optimiser for LGB models, which quickly finds a good set of hyper-parameters even without particular knowledge of the subject.

## More coming...

In [None]:
!pip install optuna

In [None]:
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

from tqdm import tqdm

# Loading data

In [None]:
df_train = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/train.csv')
df_test = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/test.csv')
df_train.head()

# Basic reformatting

In [None]:
X = df_train.drop(['id','target'], axis=1)
Xtest = df_test.drop(['id'], axis=1)
y = df_train['target']

# Model 1: LGB

## Hyper-parameter tuning

Optuna can be used for highly customised hyper-parameter searches. Many examples are available on [their website](https://optuna.readthedocs.io/en/stable/tutorial/index.html).

They also developed dedicated modules that automatically optimise popular algorithms such as lightgbm.

The code below shows a quick data preparation and a call to the optimisation function.

In [None]:
train = int(len(X)*0.9)
Xtrain, Xval = X.iloc[:train], X.iloc[train:]
ytrain, yval = y.iloc[:train], y.iloc[train:]

In [None]:
import optuna.integration.lightgbm as lgb

dtrain = lgb.Dataset(Xtrain, label=ytrain)
dval = lgb.Dataset(Xval, label=yval)
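params = {"objective": "regression", "metric": "rmse", "verbosity": -1, "boosting_type": "gbdt"}
# Note: verbose_eval and early_stopping_rounds were removed from train() in LightGBM 4.0;
# with newer versions, the equivalent behaviour is requested through callbacks instead.
model = lgb.train(params, dtrain, valid_sets=[dval], verbose_eval=100, early_stopping_rounds=100)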

## Training and evaluation

Now that tuned parameters have been found, we evaluate them with K-fold cross-validation and prepare the predictions.
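As a reminder of how K-fold evaluation works, here is a small sketch on synthetic data. Every row receives a prediction from a model that never saw it during training (an out-of-fold prediction), and the score is computed on those predictions. The `_toy` names and the mean predictor are stand-ins for illustration, not the model used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
y_toy = rng.normal(size=20)  # hypothetical target values
n_folds = 5

# Contiguous folds, which is what KFold produces when shuffling is off.
all_idx = np.arange(len(y_toy))
folds = np.array_split(all_idx, n_folds)

oof_toy = np.zeros(len(y_toy))
for val_idx in folds:
    train_idx = np.setdiff1d(all_idx, val_idx)
    # Stand-in "model": predict the mean of the training targets.
    oof_toy[val_idx] = y_toy[train_idx].mean()

# RMSE computed only on predictions made for rows the model never saw.
rmse_toy = np.sqrt(np.mean((y_toy - oof_toy) ** 2))
print(round(rmse_toy, 5))
```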

In [None]:
params = model.params
params

## Tuning learning rate and num_iterations

The learning rate is another important parameter. It is not tuned by the method above, so we need to reduce it manually. We can set a high number of iterations and use the early_stopping_rounds parameter to stop training when the model starts overfitting.
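The idea behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration of the principle, not LightGBM's actual implementation; the function name `best_iteration` and the scores are hypothetical.

```python
def best_iteration(val_scores, patience):
    """Return the 1-based iteration with the lowest validation score,
    stopping once `patience` iterations in a row bring no improvement."""
    best_score = float("inf")
    best_iter = 0
    for i, score in enumerate(val_scores, start=1):
        if score < best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_iter


# Hypothetical validation RMSE after each boosting round.
scores = [0.80, 0.75, 0.72, 0.71, 0.715, 0.72, 0.73, 0.70]
print(best_iteration(scores, patience=3))  # prints 4
```

With a patience of 3, training stops at round 7 and round 4 is kept, so the later improvement at round 8 is never reached. A larger patience trades extra training time for a lower risk of stopping too early.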

In [None]:
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold

N_FOLDS = 5

kf = KFold(n_splits = N_FOLDS)
oof = np.zeros(len(y))
oof_vanilla = np.zeros(len(y))
preds = np.zeros(len(Xtest))
params['learning_rate'] = 0.005
params['num_iterations'] = 5000
for train_ind, test_ind in tqdm(kf.split(X)):
    Xtrain = X.iloc[train_ind]
    Xval = X.iloc[test_ind]
    ytrain = y.iloc[train_ind]
    yval = y.iloc[test_ind]

    model = LGBMRegressor(**params)
    vanilla_model = LGBMRegressor()
    
    # eval_set is passed as a list of (X, y) pairs
    model.fit(Xtrain, ytrain, eval_set=[(Xval, yval)], early_stopping_rounds=50, verbose=0)
    vanilla_model.fit(Xtrain, ytrain)
    p = model.predict(Xval)
    p_vanilla = vanilla_model.predict(Xval)
    oof[test_ind] = p
    oof_vanilla[test_ind] = p_vanilla
    
    preds += model.predict(Xtest)/N_FOLDS
    
# squared=False returns the root mean squared error, computed here on out-of-fold predictions
print(f'out-of-fold RMSE (vanilla model): {np.round(mean_squared_error(y, oof_vanilla, squared=False),5)}')
print(f'out-of-fold RMSE (with optuna tuning): {np.round(mean_squared_error(y, oof, squared=False),5)}')

# Submission

In [None]:
# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
submission = df_test[['id']].copy()
submission['target'] = preds
submission.to_csv('submission.csv', index = False)