# Learning and validation curves for GBDTs and parameter tuning using Optuna

Learning curves and validation curves are rarely discussed on Kaggle, so this notebook walks through both.
If you spot any mistakes, I would appreciate it if you could let me know.

The following are good references on learning curves and validation curves.  
https://scikit-learn.org/stable/modules/learning_curve.html#validation-curve  
https://www.dataquest.io/blog/learning-curves-machine-learning/  

I also show how to use Optuna, a library for tuning the hyperparameters of gradient boosting models such as XGBoost and LightGBM.  


# Import packages

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
import lightgbm as lgb

import optuna
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv')
test  = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv')
sub = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv')

In [None]:
train.shape

In [None]:
train.describe()

In [None]:
# Checking features and target columns
display(train.columns)
# Checking dtypes (info() prints its report directly and returns None)
train.info()

In [None]:
features = ['cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7',
       'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13', 'cont14']

# Notice
This notebook is mainly about drawing learning curves and validation curves; half of it is a memo to myself.  
To keep the run time manageable, the curves are computed on a 5% random sample of train.csv (15,000 of the 300,000 rows) rather than on the full data.

In [None]:
train_01 = train.sample(frac=0.05, replace=False, random_state=1)

In [None]:
X = train_01[features]
y = train_01['target']

# XGBoost and Learning/Validation curves

## Learning curve
The learning curve plots the number of training samples on the horizontal axis and a metric score (such as RMSE) on the vertical axis. It shows how the metric changes as the sample size changes. We can examine over-fitting and under-fitting by comparing the curves for the training set and the validation set.
It also lets us judge whether adding more samples to the current data is worthwhile, which is a useful guide for further data collection.  
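
As a minimal sketch of the idea, the snippet below calls scikit-learn's `learning_curve` on synthetic data and prints the mean RMSE at each training size. The data (`X_demo`, `y_demo`) and the `DecisionTreeRegressor` are stand-ins chosen only so the example runs quickly; they are not the competition data or the models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=500, n_features=5,
                                 noise=10.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeRegressor(max_depth=4, random_state=0),
    X_demo, y_demo,
    train_sizes=np.linspace(0.3, 1.0, 5),   # fractions of the training folds
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negative MSE with shape (n_sizes, n_folds): convert to RMSE
train_rmse = np.sqrt(-train_scores).mean(axis=1)
val_rmse = np.sqrt(-val_scores).mean(axis=1)

for n, tr, va in zip(sizes, train_rmse, val_rmse):
    print(f"n={int(n):4d}  train RMSE={tr:8.2f}  validation RMSE={va:8.2f}")
```

The gap between the two columns is what the plots in this notebook visualize.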


In [None]:
def learning_curves(estimator, title, X, y, cv= None, train_sizes=np.linspace(.3, 1.0, 5)):
    
    train_sizes, train_scores, validation_scores = \
        learning_curve(estimator, 
                       X,
                       y,
                       train_sizes = train_sizes,
                       cv = cv, 
                       scoring = 'neg_mean_squared_error')

    train_scores_mean = np.sqrt(-np.mean(train_scores, axis=1))
    # Spread in RMSE units: convert each fold to RMSE, then take the std across folds
    train_scores_std = np.std(np.sqrt(-train_scores), axis=1)
    validation_scores_mean = np.sqrt(-np.mean(validation_scores, axis=1))
    validation_scores_std = np.std(np.sqrt(-validation_scores), axis=1)
    
    plt.rcParams["font.size"] = 12
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label = 'Training error')
    plt.plot(train_sizes, validation_scores_mean, 'o-', color="g",label = 'Validation error')
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, validation_scores_mean - validation_scores_std,
                     validation_scores_mean + validation_scores_std, alpha=0.1,
                     color="g")
    
    plt.rcParams["font.size"] = 10
    plt.ylabel('RMSE', fontsize = 14)
    plt.xlabel('Training set size', fontsize = 14)
    plt.title(title, fontsize = 18, y = 1.03)
    plt.legend()
    plt.ylim(0.5,0.8)
    plt.show()
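
A note on the shaded bands: the scores come back as negative MSE, so one reasonable way to get a spread in RMSE units is to convert each fold to RMSE first and then take the standard deviation across folds. The sketch below uses made-up fold scores (`neg_mse` is hypothetical, not output from this notebook) to show that this is a different quantity from the square root of the MSE spread.

```python
import numpy as np

# Hypothetical per-fold scores as returned with scoring='neg_mean_squared_error'
# (rows = training sizes, columns = CV folds); the numbers are made up.
neg_mse = np.array([[-0.50, -0.52, -0.48],
                    [-0.49, -0.50, -0.51]])

rmse_per_fold = np.sqrt(-neg_mse)        # convert every fold to RMSE first
rmse_mean = rmse_per_fold.mean(axis=1)   # centre line of the curve
rmse_std = rmse_per_fold.std(axis=1)     # band width, in RMSE units

# For comparison: the square root of the MSE spread is not an RMSE spread
sqrt_of_mse_std = np.sqrt((-neg_mse).std(axis=1))

print("std of per-fold RMSE:", rmse_std)
print("sqrt of MSE std     :", sqrt_of_mse_std)
```

With these numbers the second quantity is roughly ten times larger than the first, which would draw a misleadingly wide band.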

In [None]:
params_xgb = {'lambda': 1,
 'alpha': 0,
 'colsample_bytree': 1,
 'subsample': 1,
 'learning_rate': 0.05,
 'max_depth': 6,
 'min_child_weight': 3,
 'random_state': 48}

In [None]:
model_xgb = xgb.XGBRegressor(**params_xgb)

In [None]:
title = 'Learning curve'
learning_curves(model_xgb, title, X, y, cv=5)

The training error (red line) rises with the training set size, while the validation error (green line) falls slightly. This is a typical learning curve.  
When the training set is small, the model can fit almost every training point, so the training error is low. However, because the model is fitted only to those few points, it generalizes poorly and the validation error is large (left side of the figure).  

As the size increases, the over-fitting to the training set is mitigated: the fit to the training set gets worse and the fit to the validation set gets better.   

If the two curves remain far apart even at the largest size, the model is still over-fitting.
In this figure the two curves are gradually converging, so more data should help. Since only 5% of the data was used here, we can expect a better fit with the full data set.
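
To make "getting closer" concrete, one could track the gap between validation and training RMSE across training sizes. The sketch below does this with a hypothetical helper, `generalisation_gap`. The training sizes match a 5-fold split of 15,000 rows, but the RMSE values are made up for illustration and are not results from this notebook.

```python
import numpy as np

def generalisation_gap(train_rmse, val_rmse):
    """Validation RMSE minus training RMSE at each training size."""
    train_rmse = np.asarray(train_rmse, dtype=float)
    val_rmse = np.asarray(val_rmse, dtype=float)
    return val_rmse - train_rmse

# Sizes for 5-fold CV on 15,000 rows; RMSE values are illustrative only
sizes = [3600, 5700, 7800, 9900, 12000]
train_rmse = [0.60, 0.63, 0.65, 0.66, 0.67]
val_rmse = [0.74, 0.73, 0.725, 0.72, 0.718]

gap = generalisation_gap(train_rmse, val_rmse)
# True if the gap shrinks at every step, i.e. the curves keep converging
still_closing = bool(np.all(np.diff(gap) < 0))

for n, g in zip(sizes, gap):
    print(f"n={n:5d}  gap={g:.3f}")
print("gap still shrinking:", still_closing)
```

A gap that is still shrinking at the largest size is the pattern that suggests more data would help.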



## Validation curve
A validation curve plots a hyperparameter (e.g., alpha, the L1 regularization parameter) on the horizontal axis and a metric score (such as RMSE) on the vertical axis. It shows how the training and validation scores respond as that parameter changes, which can serve as a basis for choosing the final value.  
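
As a small illustration, the sketch below runs scikit-learn's `validation_curve` on synthetic data, sweeping `max_depth` of a `DecisionTreeRegressor`. The data, the estimator, and the swept parameter are assumptions chosen so the example runs in seconds; the notebook itself sweeps XGBoost and LightGBM parameters.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=500, n_features=5,
                                 noise=10.0, random_state=0)
depths = [1, 2, 4, 8, 16]

train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0),
    X_demo, y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="neg_mean_squared_error",
)

# One row per parameter value, one column per fold
train_rmse = np.sqrt(-train_scores).mean(axis=1)
val_rmse = np.sqrt(-val_scores).mean(axis=1)
best_depth = depths[int(np.argmin(val_rmse))]

for d, tr, va in zip(depths, train_rmse, val_rmse):
    print(f"max_depth={d:2d}  train RMSE={tr:8.2f}  validation RMSE={va:8.2f}")
print("lowest validation RMSE at max_depth =", best_depth)
```

The training RMSE keeps falling as the tree gets deeper, while the validation RMSE stops improving at some point; the parameter value at that point is the natural candidate.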

**Note**: Ideally, the validation curve should be computed on all available data (300,000 rows here). To keep the run time short, the curves below use the 5% sample, so please treat them as reference values.  

With a semi-automatic tuner like Optuna (described below), there may be little need to inspect learning or validation curves by hand, which I suspect is why they are rarely discussed on Kaggle.

In [None]:
def validation_curves(estimator, title, X, y,
                      cv= None, param_name= None, param_range=None):
    
    train_scores, test_scores = \
        validation_curve(estimator, 
                         X, 
                         y, 
                         param_name=param_name, 
                         param_range=param_range,
                         cv = cv,
                         scoring='neg_mean_squared_error', #'roc_auc'
                         n_jobs=4)
    train_scores_mean = np.sqrt(-np.mean(train_scores, axis=1))
    # Spread in RMSE units: convert each fold to RMSE, then take the std across folds
    train_scores_std = np.std(np.sqrt(-train_scores), axis=1)
    test_scores_mean = np.sqrt(-np.mean(test_scores, axis=1))
    test_scores_std = np.std(np.sqrt(-test_scores), axis=1)

    plt.rcParams["font.size"] = 12
    plt.title(title, fontsize = 20)
    plt.xlabel(param_name, fontsize =14)
    plt.ylabel("RMSE", fontsize = 14)
    plt.ylim(0.5, 0.9)
    lw = 2
    plt.plot(param_range, train_scores_mean, label="Training score",
             color="darkorange", lw=lw)
    plt.fill_between(param_range, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.2,
                     color="darkorange", lw=lw)
    plt.plot(param_range, test_scores_mean, label="Cross-validation score",
             color="navy", lw=lw)
    plt.fill_between(param_range, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.2,
                     color="navy", lw=lw)
    plt.rcParams["font.size"] = 10
    plt.legend(loc="best")
    plt.show()

In [None]:
param_range = np.linspace(0, 1, 10)
param_range

In [None]:
param_name = "alpha"

In [None]:
title = "Validation Curves for alpha"
validation_curves(model_xgb, title, X, y, cv=5, 
                  param_name = param_name, param_range = param_range)

There seems to be no room for improvement on alpha.

In [None]:
param_name = "lambda"

In [None]:
title = "Validation Curves for lambda"
validation_curves(model_xgb, title, X, y, cv=5, 
                  param_name = param_name, param_range = param_range)

There seems to be no room for improvement in lambda either.

In [None]:
param_range = np.linspace(0.1, 1, 10)
param_range

In [None]:
param_name = 'colsample_bytree'

In [None]:
title = "Validation Curves for colsample"
validation_curves(model_xgb, title, X, y, cv=5, 
                  param_name = param_name, param_range = param_range)

Changing colsample_bytree also has little effect on the validation score.

In [None]:
param_name = 'subsample'

In [None]:
title = "Validation Curves for subsample"
validation_curves(model_xgb, title, X, y, cv=5, 
                  param_name = param_name, param_range = param_range)

No value of subsample stands out as clearly better.

In [None]:
param_name = 'n_estimators'

In [None]:
param_range = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

In [None]:
title = "Validation Curve for n_estimators"
validation_curves(model_xgb, title, X, y, cv=5, 
                  param_name = param_name, param_range = param_range)

## Hyper Parameter Tuning using Optuna
Looking at the parameters one by one as above is worthwhile, but it did not lead to an obvious conclusion. Also, since there are so many parameters, I would like to choose them automatically to some extent.  
For this, I use a library called Optuna.
The following notebook is a good reference.  
https://www.kaggle.com/hamzaghanmi/xgboost-hyperparameter-tuning-using-optuna  


The learning curve suggests that more data mitigates over-fitting and improves predictive performance, so the Optuna tuning below uses all 300,000 rows. Let's submit the results and see the scores.  
Since this takes a long time, training is set to run on the GPU.

In [None]:
X = train[features]
y = train['target']

In [None]:
def objective(trial,data=X,target=y):
    
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.15,random_state=42)
    param = {
        'tree_method':'gpu_hist',  # train on the GPU to speed things up (XGBoost >= 2.0 uses tree_method='hist' with device='cuda' instead)
        'lambda': trial.suggest_float('lambda', 1e-3, 1, log=True),
        'alpha': trial.suggest_float('alpha', 1e-3, 1, log=True),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.1, 0.2, 0.3,0.5,0.7,0.9]),
        'subsample': trial.suggest_categorical('subsample', [0.1, 0.2,0.3,0.4,0.5,0.8,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.0008, 0.01, 0.015, 0.02,0.03, 0.05,0.08,0.1]),
        'n_estimators': 4000,
        'max_depth': trial.suggest_categorical('max_depth', [5,7,9,11,13,15,17,20,23,25]),
        'random_state': 48,
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 400),
    }
    model = xgb.XGBRegressor(**param)  
    
    model.fit(train_x,train_y,eval_set=[(test_x,test_y)],early_stopping_rounds=100,verbose=False)
    
    preds = model.predict(test_x)
    
    rmse = np.sqrt(mean_squared_error(test_y, preds))  # RMSE on the hold-out split
    
    return rmse

In [None]:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

In [None]:
study.best_trial.params

In [None]:
Best_params_xgb = {'lambda': 0.0014311714230223992,
 'alpha': 0.008850567457271379,
 'colsample_bytree': 0.3,
 'subsample': 1.0,
 'learning_rate': 0.01,
 'max_depth': 20,
 'min_child_weight': 245,
 'n_estimators': 4000,
 'random_state': 48,
 'tree_method':'gpu_hist'}

In [None]:
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.15,random_state=42)
model_xgb = xgb.XGBRegressor(**Best_params_xgb)
model_xgb.fit(train_x,train_y,eval_set=[(test_x,test_y)],early_stopping_rounds=100,verbose=False)

### Feature importance

In [None]:
importances = pd.Series(model_xgb.feature_importances_, index = features)
importances = importances.sort_values()
importances.plot(kind = "barh")
plt.title("Feature importance in the XGBoost model", fontsize=18)
plt.show()

In [None]:
preds = model_xgb.predict(test_x)
rmse = np.sqrt(mean_squared_error(test_y, preds))  # RMSE on the hold-out split
rmse

### Create a file for submission


In [None]:
test_X = test[features]

In [None]:
preds = model_xgb.predict(test_X)

In [None]:
sub['target']=preds
sub.to_csv('submission.csv', index=False)

# LightGBM and Learning/Validation curves
As above, we first create the learning curve and the validation curve.  
Next, we will perform parameter tuning using Optuna.

## Learning curve

In [None]:
X = train_01[features]
y = train_01['target']

In [None]:
params_lgb = {'num_leaves': 31,
 'min_data_in_leaf': 20,
 'min_child_weight': 0.001,
 'max_depth': -1,
 'learning_rate': 0.005,
 'bagging_fraction': 1,
 'feature_fraction': 1,
 'lambda_l1': 0,
 'lambda_l2': 0,
 'random_state': 48}

In [None]:
model_lgb = lgb.LGBMRegressor(**params_lgb)

In [None]:
title = 'Learning curve'
learning_curves(model_lgb, title, X, y, cv=5)

## Validation curve
As above, we draw validation curves for lambda_l1 (the counterpart of XGBoost's alpha), lambda_l2, feature_fraction, bagging_fraction, and n_estimators.
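
The parameters swept here correspond to the XGBoost ones used earlier under different names. The sketch below records that correspondence in a small lookup table. `XGB_TO_LGB` and `translate_params` are hypothetical helpers written for this illustration, not part of either library, and parameters that share a name can still differ in defaults and exact semantics.

```python
# XGBoost name -> LightGBM name for the parameters swept in this notebook
XGB_TO_LGB = {
    "alpha": "lambda_l1",                    # L1 regularization
    "lambda": "lambda_l2",                   # L2 regularization
    "colsample_bytree": "feature_fraction",  # column subsampling per tree
    "subsample": "bagging_fraction",         # row subsampling
    "n_estimators": "n_estimators",          # number of boosting rounds
}

def translate_params(xgb_params):
    """Rename known XGBoost keys to LightGBM names; keep other keys unchanged."""
    return {XGB_TO_LGB.get(key, key): value for key, value in xgb_params.items()}

lgb_style = translate_params(
    {"alpha": 0, "lambda": 1, "subsample": 0.8, "learning_rate": 0.05}
)

# In LightGBM, row subsampling only takes effect when bagging_freq is non-zero
if lgb_style.get("bagging_fraction", 1.0) < 1.0:
    lgb_style.setdefault("bagging_freq", 1)

print(lgb_style)
```

The `bagging_freq` detail matters when reading a bagging_fraction curve: with the default `bagging_freq` of 0, changing bagging_fraction does not change the model.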

In [None]:
param_range = np.linspace(0, 1, 10)
param_range

In [None]:
param_name = 'lambda_l1'

In [None]:
title = "Validation Curves for lambda_l1"
validation_curves(model_lgb, title, X, y, cv=5, 
                  param_name = param_name, param_range = param_range)

In [None]:
param_name = 'lambda_l2'

In [None]:
title = "Validation Curves for lambda_l2"
validation_curves(model_lgb, title, X, y, cv=5, 
                  param_name = param_name, param_range = param_range)

In [None]:
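# feature_fraction and bagging_fraction must be greater than 0, so start the range at 0.1
param_range = np.linspace(0.1, 1, 10)
param_name = 'feature_fraction'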

In [None]:
title = "Validation Curves for feature_fraction"
validation_curves(model_lgb, title, X, y, cv=5, 
                  param_name = param_name, param_range = param_range)

In [None]:
param_name = 'bagging_fraction'

In [None]:
title = "Validation Curves for bagging_fraction"
validation_curves(model_lgb, title, X, y, cv=5, 
                  param_name = param_name, param_range = param_range)

In [None]:
param_name = 'n_estimators'


In [None]:
param_range = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

In [None]:
title = "Validation Curves for n_estimators"
validation_curves(model_lgb, title, X, y, cv=5, 
                  param_name = param_name, param_range = param_range)

## Hyper Parameter Tuning using Optuna (for LightGBM)

In [None]:
X = train[features]
y = train['target']

In [None]:
def objective_lgb(trial,data=X,target=y):
    
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.15,random_state=42)
    param = {
        'device': 'gpu',  # train on the GPU; needs a GPU-enabled LightGBM build ('tree_method' is an XGBoost parameter)
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-3, 1, log=True),
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-3, 1, log=True),
        'feature_fraction': trial.suggest_categorical('feature_fraction', [0.1, 0.2, 0.3,0.5,0.7,0.9]),
        'bagging_fraction': trial.suggest_categorical('bagging_fraction', [0.1, 0.2,0.3,0.4,0.5,0.8,1.0]),
        'bagging_freq': 1,  # bagging_fraction only takes effect when bagging_freq > 0
        'learning_rate': trial.suggest_categorical('learning_rate', [0.0008, 0.01, 0.015, 0.02,0.03, 0.05,0.08,0.1]),
        'n_estimators': 4000,
        'num_leaves': trial.suggest_categorical('num_leaves', [31,50,150,200,250,300,350]),
        'max_depth': trial.suggest_categorical('max_depth', [5,7,9,11,13,15,17,20,23,25]),
        'min_data_in_leaf': trial.suggest_categorical('min_data_in_leaf', [10,20,30]),
        'min_child_weight': trial.suggest_categorical('min_child_weight', [0.001,0.005, 0.01, 0.05, 0.1,0.5]),
        'random_state': 48
    }
    model = lgb.LGBMRegressor(**param)  
    
    model.fit(train_x,train_y,eval_set=[(test_x,test_y)],early_stopping_rounds=100,verbose=False)
    
    preds = model.predict(test_x)
    
    rmse = np.sqrt(mean_squared_error(test_y, preds))  # RMSE on the hold-out split
    
    return rmse

In [None]:
study = optuna.create_study(direction='minimize')
study.optimize(objective_lgb, n_trials=50)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

In [None]:
study.best_trial.params

In [None]:
Best_params_lgb = {'lambda_l2': 0.013616569506899653,
 'lambda_l1': 0.006495842188985166,
 'feature_fraction': 0.3,
 'bagging_fraction': 0.3,
 'learning_rate': 0.015,
 'num_leaves': 200,
 'max_depth': 25,
 'min_data_in_leaf': 30,
 'min_child_weight': 0.001,
 'n_estimators': 3000,
 'random_state': 48,
 'device': 'gpu'}  # needs a GPU-enabled LightGBM build; 'tree_method' is an XGBoost parameter

In [None]:
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.15,random_state=42)
model_lgb = lgb.LGBMRegressor(**Best_params_lgb)
model_lgb.fit(train_x,train_y,eval_set=[(test_x,test_y)],
              early_stopping_rounds=100,verbose=False)

### Feature importance

In [None]:
importances = pd.Series(model_lgb.feature_importances_, index = features)
importances = importances.sort_values()
importances.plot(kind = "barh")
plt.title("Feature importance in the LightGBM model", fontsize=18)
plt.show()

In [None]:
preds = model_lgb.predict(test_x)
rmse = np.sqrt(mean_squared_error(test_y, preds))  # RMSE on the hold-out split
rmse

In [None]:
test_X = test[features]
preds = model_lgb.predict(test_X)
sub['target']=preds
sub.to_csv('submission_lgb.csv', index=False)

## I hope this post has been helpful!
If you like it, please feel free to upvote!