In this notebook, I want to point out how the subsample hyperparameter is used.

We usually tune subsample (also called the bagging fraction) during the hyperparameter tuning step.

For XGBoost and CatBoost, setting subsample to a float between 0 and 1 is enough to enable subsampling.

For LightGBM, that is not enough. We also have to set subsample_freq.

Let's examine each of them.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import KFold, cross_val_score, train_test_split

bold = "\033[1m"

In [None]:
train = pd.read_csv("../input/tabular-playground-series-aug-2021/train.csv")
print(bold + "Training Set :\n")
display(train.head())
print(bold + str(train.shape))

print(bold + "\nTest Set :\n")
test = pd.read_csv("../input/tabular-playground-series-aug-2021/test.csv")
display(test.head())
print(bold + str(test.shape))

In [None]:
target = "loss"
predictors = [x for x in train.columns if x not in ["id", "loss"]]

In [None]:
X = train[predictors]
y = train[target]
test = test[predictors]

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

In [None]:
from sklearn.metrics import mean_squared_error, make_scorer

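def rmse(y_true, y_pred):
    # Take the square root explicitly: the `squared` argument of
    # mean_squared_error is deprecated in recent scikit-learn releases.
    return np.sqrt(mean_squared_error(y_true, y_pred))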

rmse_scorer = make_scorer(rmse)

In [None]:
kf = KFold(n_splits = 3, shuffle = True, random_state = 42)

def rmse_cv(model, X, y):    
    return cross_val_score(model, X, y, scoring = rmse_scorer, cv = kf).mean()

The first model is a default LightGBM regressor.

I only set random_state for reproducibility, 50 estimators, and a high learning rate for faster computation.

The second model is the same regressor with subsample = 0.5 added.

Let's look at the difference.

# LightGBM

In [None]:
import lightgbm as lgb

lgb_reg = lgb.LGBMRegressor(random_state = 42, n_jobs = -1, n_estimators = 50, learning_rate = 0.3)

lgb_w_subsample = lgb.LGBMRegressor(random_state = 42, n_jobs = -1, n_estimators = 50, learning_rate = 0.3, subsample = 0.5)

In [None]:
print(bold + "RMSE for default LightGBM model: \t\t\t" + str(rmse_cv(lgb_reg, X_train, y_train)))

print(bold + "RMSE for LightGBM model, with setting subsample = 0.5: \t" + str(rmse_cv(lgb_w_subsample, X_train, y_train)))

Is there a difference? Nope.

**We need to use subsample_freq to enable subsampling.**
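To make the rule concrete, here is a toy sketch of the behaviour described in the LightGBM parameter docs: rows are re-sampled only when subsample_freq is positive and subsample is below 1. The function `rows_per_iteration` and its arguments are hypothetical names for illustration, not LightGBM's actual implementation.

```python
import numpy as np

def rows_per_iteration(n_rows, subsample=1.0, subsample_freq=0, n_iter=4, seed=42):
    """Toy model of the documented bagging rule (not LightGBM's real code)."""
    rng = np.random.default_rng(seed)
    all_rows = np.arange(n_rows)
    current = all_rows  # start with the full training set
    used = []
    # Bagging is active only if BOTH conditions hold.
    bagging_on = subsample_freq > 0 and subsample < 1.0
    for i in range(n_iter):
        # Draw a fresh sample every `subsample_freq` iterations.
        if bagging_on and i % subsample_freq == 0:
            k = int(n_rows * subsample)
            current = np.sort(rng.choice(all_rows, size=k, replace=False))
        used.append(current)
    return used

no_freq = rows_per_iteration(1000, subsample=0.5, subsample_freq=0)
with_freq = rows_per_iteration(1000, subsample=0.5, subsample_freq=1)

print([len(r) for r in no_freq])    # each iteration uses all 1000 rows
print([len(r) for r in with_freq])  # each iteration uses 500 rows
```

With subsample_freq = 0 the subsample value is silently ignored, which is why the two LightGBM scores above are identical.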

In [None]:
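# subsample_freq = 1 re-draws the bagging sample at every iteration.
# n_estimators matches the two models above so the scores are comparable.
lgb_w_subsample_freq = lgb.LGBMRegressor(random_state = 42, n_jobs = -1, n_estimators = 50, learning_rate = 0.3, subsample = 0.5, subsample_freq = 1)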

rmse_cv(lgb_w_subsample_freq, X_train, y_train)

Nice, subsampling works.

**For XGBoost and CatBoost, setting subsample to a float number is enough to enable subsampling.**

# XGBoost

In [None]:
import xgboost as xgb

xgbreg = xgb.XGBRegressor(random_state = 42, n_jobs = -1, n_estimators = 50, learning_rate = 0.5)

rmse_cv(xgbreg, X_train, y_train)

In [None]:
xgbreg_w_subsample = xgb.XGBRegressor(random_state = 42, n_jobs = -1, n_estimators = 50, learning_rate = 0.5, subsample = 0.5)

rmse_cv(xgbreg_w_subsample, X_train, y_train)

The default XGBoost model gives an RMSE of 8.02. With subsample set to 0.5 the score changes to 8.18, so subsampling is active.

# CatBoost

In [None]:
from catboost import CatBoostRegressor

cbr = CatBoostRegressor(random_state = 42, thread_count = 4, verbose = 0, iterations = 50, learning_rate = 0.5)

rmse_cv(cbr, X_train, y_train)

In [None]:
cbr_w_subsample = CatBoostRegressor(random_state = 42, thread_count = 4, verbose = 0, iterations = 50, learning_rate = 0.5, subsample = 0.5)

rmse_cv(cbr_w_subsample, X_train, y_train)

# **Conclusion**

**To use subsampling:**

**LightGBM**: set subsample to a float number between 0 and 1 and set subsample_freq to a positive integer.

**XGBoost**: set subsample to a float number between 0 and 1.

**CatBoost**: set subsample to a float number between 0 and 1.
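As a quick reference, the three rules can be wrapped in a small helper. This is only a sketch: `subsample_kwargs` is a hypothetical function, not part of any of the three libraries, and it assumes the scikit-learn style parameter names used in this notebook.

```python
def subsample_kwargs(library, fraction, freq=1):
    """Return keyword arguments that switch on row subsampling (hypothetical helper)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if library == "lightgbm":
        if freq < 1:
            raise ValueError("freq must be a positive integer for LightGBM")
        # LightGBM ignores subsample unless subsample_freq is positive.
        return {"subsample": fraction, "subsample_freq": freq}
    if library in ("xgboost", "catboost"):
        # Assumes the library's default settings accept subsample, as in this notebook.
        return {"subsample": fraction}
    raise ValueError(f"unknown library: {library}")

print(subsample_kwargs("lightgbm", 0.5))
print(subsample_kwargs("xgboost", 0.5))
print(subsample_kwargs("catboost", 0.5))
```

The result can be unpacked into a constructor, for example `lgb.LGBMRegressor(random_state = 42, **subsample_kwargs("lightgbm", 0.5))`.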