# Extreme Fine Tuning of LGBM using Incremental training


While trying to climb the leaderboard, I stumbled across a small trick that improves predictions in the 4th to 5th decimal place using the same parameters and a single model. Essentially, it squeezes a little more out of your best parameter set. The trick works in the following steps:

* Find the best parameters for your LGBM, manually or using optimization methods of your choice.


* Train the model to the best RMSE you can get in a single training run, using a high early-stopping patience.


* Train the model for 1 or 2 more rounds with a reduced learning rate.


* Once those first rounds are over, start reducing the regularization params by a constant factor at each incremental training iteration. You will start to see improvements in the 5th decimal place of the validation RMSE, which is enough to gain in the 5th decimal place of your model's leaderboard score.
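
The parameter schedule from the steps above can be sketched on its own, without training anything. This is only a minimal illustration: the helper name `incremental_param_schedule` and the toy `base` values are hypothetical, while the 0.9 factor, the 0.003 learning rate and the two warm-up rounds are the values used in the training loop of this notebook.

```python
def incremental_param_schedule(base_params, n_rounds=7, warmup_rounds=2,
                               reg_factor=0.9, low_lr=0.003):
    """Yield (round_number, params) for each incremental training round.

    Hypothetical helper: the first `warmup_rounds` rounds only lower the
    learning rate; every later round also multiplies the L1/L2
    regularization terms by `reg_factor`.
    """
    params = dict(base_params)          # never mutate the caller's dict
    params['learning_rate'] = low_lr
    for round_number in range(1, n_rounds + 1):
        if round_number > warmup_rounds:
            params['reg_lambda'] *= reg_factor
            params['reg_alpha'] *= reg_factor
        yield round_number, dict(params)


# Toy starting values, chosen only to make the decay easy to read.
base = {'learning_rate': 0.01, 'reg_lambda': 10.0, 'reg_alpha': 20.0}
schedule = list(incremental_param_schedule(base))
for round_number, p in schedule:
    print(round_number, p['learning_rate'],
          round(p['reg_lambda'], 3), round(p['reg_alpha'], 3))
```

Each yielded dict would be passed to a fresh `LGBMRegressor` that continues from the previous model. The training loop in this notebook also grows `num_leaves` in those later rounds, which is left out here to keep the sketch short.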

At the top of the leaderboard this makes a huge difference: I moved from rank `39` at **0.84202** to my best position, `6th place` (17th Feb 2021), with **0.84193**.

Let's check it out.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

from lightgbm import LGBMRegressor

import optuna
from functools import partial

import warnings
warnings.filterwarnings('ignore')

In [None]:
train = pd.read_csv('../input/tabular-playground-series-feb-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-feb-2021/test.csv')

In [None]:
X_train = train.drop(['id', 'target'], axis=1)
y_train = train.target
X_test = test.drop(['id'], axis=1)

In [None]:
cat_cols = [feature for feature in train.columns if 'cat' in feature]

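def label_encoder(df):
    for feature in cat_cols:
        le = LabelEncoder()
        # fit on train and test together so both frames get the same integer codes,
        # even if a category is missing from one of them
        le.fit(pd.concat([train[feature], test[feature]]))
        df[feature] = le.transform(df[feature])
    return df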

In [None]:
X_train = label_encoder(X_train)
X_test = label_encoder(X_test)

In [None]:
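# random_state only has an effect with shuffle=True; recent scikit-learn versions
# raise an error if it is set without shuffling
split = KFold(n_splits=5, shuffle=True, random_state=2)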

In [None]:
def objective(trial, X, y, name='xgb'):
        
    params = {'max_depth':trial.suggest_int('max_depth', 5, 50),
              'n_estimators':200000,
              #'boosting':trial.suggest_categorical('boosting', ['gbdt', 'dart', 'goss']),
              'subsample': trial.suggest_uniform('subsample', 0.2, 1.0),
              'colsample_bytree':trial.suggest_uniform('colsample_bytree', 0.2, 1.0),
              'learning_rate':trial.suggest_uniform('learning_rate', 0.007, 0.02),
              'reg_lambda':trial.suggest_uniform('reg_lambda', 0.01, 50),
              'reg_alpha':trial.suggest_uniform('reg_alpha', 0.01, 50),
              'min_child_samples':trial.suggest_int('min_child_samples', 5, 100),
              'num_leaves':trial.suggest_int('num_leaves', 10, 200),
              'n_jobs' : -1,
              'metric':'rmse',
              'max_bin':trial.suggest_int('max_bin', 300, 1000),
              'cat_smooth':trial.suggest_int('cat_smooth', 5, 100),
              'cat_l2':trial.suggest_loguniform('cat_l2', 1e-3, 100)}

    model = LGBMRegressor(**params)
                  
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    

    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              eval_metric=['rmse'],
              early_stopping_rounds=250, 
              categorical_feature=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
              #callbacks=[optuna.integration.LightGBMPruningCallback(trial, metric='rmse')],
              verbose=0)

    train_score = np.round(np.sqrt(mean_squared_error(y_train, model.predict(X_train))), 5)
    test_score = np.round(np.sqrt(mean_squared_error(y_val, model.predict(X_val))), 5)
                  
    print(f'TRAIN RMSE : {train_score} || TEST RMSE : {test_score}')
                  
    return test_score

In [None]:
optimize = partial(objective, X=X_train, y=y_train)

study_lgbm = optuna.create_study(direction='minimize')
#study_lgbm.optimize(optimize, n_trials=300)

# I have commented out the trials to cut the notebook's execution time.

In [None]:
# From the Optuna trials above, these are the best parameters I could find.

lgbm_params = {'max_depth': 16, 
                'subsample': 0.8032697250789377, 
                'colsample_bytree': 0.21067140508531404, 
                'learning_rate': 0.009867383057779643,
                'reg_lambda': 10.987474846877767, 
                'reg_alpha': 17.335285595031994, 
                'min_child_samples': 31, 
                'num_leaves': 66, 
                'max_bin': 522, 
                'cat_smooth': 81, 
                'cat_l2': 0.029690334194270022, 
                'metric': 'rmse', 
                'n_jobs': -1, 
                'n_estimators': 20000}

In [None]:
preds_list_base = []
preds_list_final_iteration = []
preds_list_all = []

for train_idx, val_idx in split.split(X_train):
            X_tr = X_train.iloc[train_idx]
            X_val = X_train.iloc[val_idx]
            y_tr = y_train.iloc[train_idx]
            y_val = y_train.iloc[val_idx]
            
            Model = LGBMRegressor(**lgbm_params).fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
                          eval_metric=['rmse'],
                          early_stopping_rounds=250, 
                          categorical_feature=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                          #callbacks=[optuna.integration.LightGBMPruningCallback(trial, metric='rmse')],
                          verbose=0)
            
            preds_list_base.append(Model.predict(X_test))
            preds_list_all.append(Model.predict(X_test))
            print(f'RMSE for Base model is {np.sqrt(mean_squared_error(y_val, Model.predict(X_val)))}')
            first_rmse = np.sqrt(mean_squared_error(y_val, Model.predict(X_val)))
            params = lgbm_params.copy()
            
            for i in range(1, 8):
                if i >2:    
                    
                    # after the first two rounds, reduce the regularization params (and grow num_leaves)
                    
                    params['reg_lambda'] *= 0.9
                    params['reg_alpha'] *= 0.9
                    params['num_leaves'] += 40
                    
                params['learning_rate'] = 0.003
                Model = LGBMRegressor(**params).fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
                          eval_metric=['rmse'],
                          early_stopping_rounds=200, 
                          categorical_feature=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                          #callbacks=[optuna.integration.LightGBMPruningCallback(trial, metric='rmse')],
                          verbose=0,
                          init_model=Model)
                
                preds_list_all.append(Model.predict(X_test))
                print(f'RMSE for Incremental trial {i} model is {np.sqrt(mean_squared_error(y_val, Model.predict(X_val)))}')
            last_rmse = np.sqrt(mean_squared_error(y_val, Model.predict(X_val)))
            print('',end='\n\n')
            print(f'Improvement of : {first_rmse - last_rmse}')
            print('-' * 100)
            preds_list_final_iteration.append(Model.predict(X_test))

Great! We can see some further improvement in every fold. Let's point out a few findings:

* The first two iterations only use a very low `learning_rate`. After the 2nd iteration there are iterations with a very good improvement, obtained by reducing regularization.


* There are also later iterations where the loss increased slightly compared to the previous iteration, showing that in some folds the limit is reached a few iterations before the last one.


* If you set `verbose=1`, you will see that these improvements come only from the first few trees created in each iteration. After that the loss starts to increase, and LGBM keeps the best model. But reducing regularization does improve the loss for the first few trees!
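
The "keeps the best model" behaviour in the last point follows the usual early-stopping rule. Below is a minimal pure-Python sketch of that rule, not LightGBM's actual implementation; the function name and the loss values are made up for illustration.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_index, stop_index) under a simple early-stopping rule.

    Training stops once `patience` iterations in a row fail to beat the
    best validation loss seen so far; the best iteration is the one kept.
    """
    best_index = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, i
    return best_index, len(val_losses) - 1


# Made-up losses: a few new trees help, then the validation loss creeps up.
losses = [0.84200, 0.84198, 0.84197, 0.84199, 0.84201, 0.84203, 0.84205]
best, stopped = best_iteration_with_patience(losses, patience=3)
print(f'best iteration: {best}, stopped at: {stopped}')
```

With a large patience such as the 200 rounds used here, many trees are built after the best one before training halts, but only the trees up to the best iteration are used for prediction.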

I have 3 different sets of predictions: one from the base models only, one averaging every prediction made (base plus all incremental iterations), and one from the final iteration only. The scores for each:

* `y_preds_base` : **0.84196 - 0.84199** (varies between these from run to run)


* `y_preds_all` : **0.84195 - 0.84196**


* `y_preds_final_iteration` : **0.84193**

In [None]:
y_preds_base = np.array(preds_list_base).mean(axis=0)
y_preds_base

In [None]:
y_preds_all = np.array(preds_list_all).mean(axis=0)
y_preds_all

In [None]:
y_preds_final_iteration = np.array(preds_list_final_iteration).mean(axis=0)
y_preds_final_iteration

In [None]:
submission = pd.DataFrame({'id':test.id,
              'target':y_preds_final_iteration})

In [None]:
submission.to_csv('submission.csv', index=False)

In [None]:
pd.read_csv('submission.csv')

### Finding the right regularization reduction factors using Optuna

You can also try reducing or increasing a few params and search for the best mix of factors with Optuna. It may even be possible to improve on the results achieved above. An example of the technique is shown below.

In [None]:
# creating a pre trained model to use in objective.

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
lgbm = LGBMRegressor(**lgbm_params).fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
                          eval_metric=['rmse'],
                          early_stopping_rounds=250, 
                          categorical_feature=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                          verbose=0)

In [None]:
def objective(trial, model, X, y, iterations=5):

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    f1 = trial.suggest_uniform('f1', 0.1, 1.0)
    f2 = trial.suggest_uniform('f2', 0.1, 3)
    f3 = trial.suggest_int('f3', 20, 100)
    f4 = trial.suggest_int('f4', 20, 50)
    f5 = trial.suggest_int('f5', 1, 5)
    lr_factor = trial.suggest_uniform('lr_factor', 0.1, 0.7)

    params = lgbm_params.copy()

    # start from the pre-trained `lgbm` model (trained on the same split);
    # assigning it here avoids reading `Model` before it is defined
    Model = lgbm
    print(f'RMSE for base model is {np.sqrt(mean_squared_error(y_val, Model.predict(X_val)))}')

    for i in range(1, iterations):
        if i > 2:
            params['reg_lambda'] *= f1
            params['reg_alpha'] += f2
            params['num_leaves'] += f3
            # repeated subtraction could push min_child_samples below 1, so clamp it
            params['min_child_samples'] = max(1, params['min_child_samples'] - f4)
            params['cat_smooth'] -= f5
            params['learning_rate'] *= lr_factor
            #params['max_depth'] += f5

        # keep the learning rate from shrinking to an insignificant value
        params['learning_rate'] = max(params['learning_rate'], 0.0009)

        Model = model(**params).fit(X_train, y_train, eval_set=[(X_val, y_val)],
                          eval_metric=['rmse'],
                          early_stopping_rounds=200, 
                          categorical_feature=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                          verbose=1000,
                          init_model=Model)  # the first iteration continues from the pre-trained model

        print(f'RMSE for {i}th model is {np.sqrt(mean_squared_error(y_val, Model.predict(X_val)))}')

    # np.sqrt keeps this consistent with the rest of the notebook
    RMSE = np.sqrt(mean_squared_error(y_val, Model.predict(X_val)))
    return RMSE



In [None]:
study = optuna.create_study(direction='minimize')
optimize = partial(objective, X=X_train, y=y_train, model=LGBMRegressor)
#study.optimize(optimize, n_trials=100)

Finally, I am still experimenting to understand why this actually works. A few things I have found so far:

* Using only iterations with a reduced learning rate will not give results as good as also reducing regularization at each iteration. The loss improvement plateaus pretty quickly and starts worsening after a few iterations; reducing regularization as well forces some more loss improvement and helps get more iterations in.


* Reducing regularization slowly forces the loss to improve at each new iteration until a bottleneck is reached, where the trick just does not work anymore. A possible reason: for the first few trees added at a new iteration with reduced regularization, the minute changes they bring to the decision boundary, even though they push toward overfitting, may still help generalization. After those first few trees, overfitting increases and the loss shoots up within the iteration. It's great that LGBM internally keeps the best loss from each training run!
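
One way to reason about this is through the leaf values themselves. In LightGBM-style boosting, the raw output of a leaf (ignoring extras such as `max_delta_step` and path smoothing, and before the learning rate is applied) is `-sign(G) * max(|G| - reg_alpha, 0) / (H + reg_lambda)`, where `G` and `H` are the sums of gradients and hessians in the leaf. The sketch below is only an illustration of that formula: `G` and `H` are invented numbers, and the starting regularization values are rounded from `lgbm_params`.

```python
def leaf_output(grad_sum, hess_sum, reg_alpha, reg_lambda):
    """Raw leaf value under L1/L2 regularization (before the learning rate)."""
    # Soft-threshold the gradient sum by the L1 term.
    shrunk = max(abs(grad_sum) - reg_alpha, 0.0)
    sign = 1.0 if grad_sum > 0 else -1.0
    # The L2 term is added to the hessian sum in the denominator.
    return -sign * shrunk / (hess_sum + reg_lambda)


# Illustrative numbers only: G and H are invented for this example.
G, H = -40.0, 100.0
reg_alpha, reg_lambda = 17.3, 11.0
outputs = []
for step in range(5):
    outputs.append(leaf_output(G, H, reg_alpha, reg_lambda))
    print(f'step {step}: alpha={reg_alpha:.2f} lambda={reg_lambda:.2f} '
          f'leaf={outputs[-1]:.5f}')
    # Same 0.9 factor as in the incremental training loop.
    reg_alpha *= 0.9
    reg_lambda *= 0.9
```

For the same gradients, each reduction lets a new leaf take a slightly larger step. That is consistent with the small gains seen in the first few trees of each iteration, although it does not by itself explain why those gains carry over to the validation set.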

**Although it is a small trick, this has been a few days of hard work, so if you like it and find it useful, show your support by upvoting!**