# Boosting (XGBoost, LightGBM, CatBoost)

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [2]:
data = pd.read_csv('./data/titanic_train.csv')
data.drop(columns=['Name', 'Fare', 'PassengerId', 'Cabin', 'Ticket', 'Embarked'], axis = 1, inplace=True)
data.dropna(inplace=True)
categorical_features = ['Sex',]
numerical_features = ['Age', 'SibSp', 'Parch']
ct = make_column_transformer(
        (OneHotEncoder(), categorical_features),
        (StandardScaler(), numerical_features),
        remainder='drop', verbose_feature_names_out=True)
data_transformed = ct.fit_transform(data.iloc[:, 1:], y=data.Survived)
X = pd.DataFrame(data_transformed, columns=ct.get_feature_names_out())
y = data.Survived.reset_index(drop=True)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

In [3]:
test = pd.read_csv('./data/titanic_test.csv')
age_avg = test['Age'].mean()
test[['Age']] = test[['Age']].fillna(age_avg)
PassengerId = test['PassengerId']
test_transformed = ct.fit_transform(test)
test_transformed = pd.DataFrame(test_transformed, columns=ct.get_feature_names_out())

# Xgboost



### Boosting settings

**Explanation after lightgbm theory**<br>
`tree_method` – 'exact' (if you have time, you can try), 'approx', 'hist'(the best choise usually) <br>
`grow_policy` – 'depthwise', 'lossguide' (usually better)<br>
`objective` – default is quite good (sometimes "count:poisson")


### XGBoost parameters tuning

Usually we start tuning parameters with these first: <br>
`n_estimators` - number of boosting rounds, better to use `early_stopping_rounds` <br>
`eta` – learning rate, increasing lr reduces convergence time. (usually default value works good)

### Control Overfitting
When you observe high training accuracy, but low tests accuracy, it is likely that you encounter overfitting problem.

There are in general two ways that you can control overfitting:

* The first way is to directly control model complexity <p>
`max_depth` - maximum depth of a tree, increase of this value will make the model more complex; <br>
`gamma` - minimum loss reduction required to make a further partition on a leaf node of the tree.<br>
`min_child_weight` – minimum sum of instance weight (hessian) needed in a child.


* The second way is to add randomness to make training robust to noise <p>
`subsample` - subsample ratio of the training instance, <br>
`colsample_bytree` - subsample ratio of columns when constructing each tree. <br>


### Handle Imbalanced Dataset
There are two ways to improve it:

* If you care only about the ranking order (AUC) of your prediction
Balance the positive and negative weights, via `scale_pos_weight`
Use AUC for evaluation
* If you care about predicting the right probability
In such a case, you cannot re-balance the dataset. Set parameter `max_delta_step` to a finite number (say 1) will help to converge <br>

More about xgboost parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

Always use `early_stopping_round` and tune `n_estimators` on validation.

In [61]:
import xgboost as xgb


parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 4,
    "random_seed": 1,
    "eval_metric": "auc"
}


xgb_train = xgb.DMatrix(X, y, feature_names=X.columns)
xgb_test = xgb.DMatrix(test_transformed)

In [62]:
results = xgb.cv(parameters, xgb_train, num_boost_round=100,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.83747+0.00341	test-auc:0.81209+0.03049
[10]	train-auc:0.87947+0.00108	test-auc:0.82349+0.02807
[20]	train-auc:0.89554+0.00542	test-auc:0.82703+0.02785
[30]	train-auc:0.90408+0.00586	test-auc:0.83114+0.02533
[40]	train-auc:0.90965+0.00617	test-auc:0.83307+0.02287
[50]	train-auc:0.91442+0.00748	test-auc:0.83482+0.02247
[60]	train-auc:0.91800+0.00737	test-auc:0.83434+0.02324
[70]	train-auc:0.92048+0.00676	test-auc:0.83398+0.02407
[80]	train-auc:0.92254+0.00654	test-auc:0.83351+0.02472
[90]	train-auc:0.92439+0.00674	test-auc:0.83293+0.02597
[99]	train-auc:0.92577+0.00665	test-auc:0.83235+0.02598


In [63]:
# added early stopping
num_rounds = 10000
results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.83747+0.00341	test-auc:0.81209+0.03049
[10]	train-auc:0.87947+0.00108	test-auc:0.82349+0.02807
[20]	train-auc:0.89554+0.00542	test-auc:0.82703+0.02785
[30]	train-auc:0.90408+0.00586	test-auc:0.83114+0.02533
[40]	train-auc:0.90965+0.00617	test-auc:0.83307+0.02287
[50]	train-auc:0.91442+0.00748	test-auc:0.83482+0.02247
[60]	train-auc:0.91800+0.00737	test-auc:0.83434+0.02324
[63]	train-auc:0.91859+0.00692	test-auc:0.83376+0.02321


In [64]:
%%time
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.5,
    "colsample_bynode":0.7
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.80273+0.01105	test-auc:0.77108+0.03454
[10]	train-auc:0.85923+0.00565	test-auc:0.81371+0.02680
[20]	train-auc:0.85975+0.01087	test-auc:0.82289+0.02678
[30]	train-auc:0.86721+0.01116	test-auc:0.82567+0.02369
[40]	train-auc:0.87458+0.00665	test-auc:0.82568+0.02860
[50]	train-auc:0.87856+0.00715	test-auc:0.82680+0.02914
[60]	train-auc:0.88216+0.00623	test-auc:0.82613+0.02822
[64]	train-auc:0.88487+0.00774	test-auc:0.82668+0.02615
CPU times: user 1.75 s, sys: 305 ms, total: 2.05 s
Wall time: 1.44 s


In [65]:
%%time
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.5,
    "colsample_bynode":0.7,
    
    #lightgbm approach
    "tree_method": "hist"
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.80859+0.01149	test-auc:0.78135+0.03760
[10]	train-auc:0.84836+0.00424	test-auc:0.80553+0.02592
[20]	train-auc:0.85575+0.00719	test-auc:0.81900+0.02462
[30]	train-auc:0.87002+0.01302	test-auc:0.82383+0.02960
[40]	train-auc:0.87423+0.00821	test-auc:0.82558+0.02785
[50]	train-auc:0.87750+0.00932	test-auc:0.82755+0.02811
[60]	train-auc:0.88761+0.00800	test-auc:0.83122+0.02966
[70]	train-auc:0.89078+0.00746	test-auc:0.83139+0.02633
[75]	train-auc:0.89160+0.00790	test-auc:0.83235+0.02657
CPU times: user 2.05 s, sys: 424 ms, total: 2.47 s
Wall time: 1.63 s


In [66]:
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.5,
    "colsample_bynode":0.7,
    
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide"
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.80859+0.01149	test-auc:0.78135+0.03760
[10]	train-auc:0.84836+0.00424	test-auc:0.80553+0.02592
[20]	train-auc:0.85575+0.00719	test-auc:0.81900+0.02462
[30]	train-auc:0.87002+0.01302	test-auc:0.82383+0.02960
[40]	train-auc:0.87423+0.00821	test-auc:0.82558+0.02785
[50]	train-auc:0.87750+0.00932	test-auc:0.82755+0.02811
[60]	train-auc:0.88761+0.00800	test-auc:0.83122+0.02966
[70]	train-auc:0.89078+0.00746	test-auc:0.83139+0.02633
[75]	train-auc:0.89160+0.00790	test-auc:0.83235+0.02657


In [135]:
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.05,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_leaves": 3,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide"
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=50,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.78623+0.01418	test-auc:0.78143+0.02289
[10]	train-auc:0.79643+0.01385	test-auc:0.79081+0.02651
[20]	train-auc:0.80309+0.01147	test-auc:0.79234+0.02714
[30]	train-auc:0.81222+0.00631	test-auc:0.79027+0.02598
[40]	train-auc:0.81983+0.00850	test-auc:0.78657+0.02396
[50]	train-auc:0.82861+0.00783	test-auc:0.79679+0.02993
[60]	train-auc:0.83822+0.00757	test-auc:0.80820+0.02521
[70]	train-auc:0.84139+0.00509	test-auc:0.81295+0.02795
[80]	train-auc:0.84550+0.00651	test-auc:0.81364+0.02546
[90]	train-auc:0.84701+0.00844	test-auc:0.81625+0.02308
[100]	train-auc:0.84804+0.00638	test-auc:0.81940+0.02393
[110]	train-auc:0.85132+0.00868	test-auc:0.82219+0.02144
[120]	train-auc:0.85315+0.00969	test-auc:0.82252+0.02054
[130]	train-auc:0.85560+0.00900	test-auc:0.82257+0.02115
[140]	train-auc:0.85812+0.00897	test-auc:0.82520+0.02235
[150]	train-auc:0.85857+0.00921	test-auc:0.82616+0.02222
[160]	train-auc:0.85980+0.00948	test-auc:0.82741+0.02317
[170]	train-auc:0.86069+0.00909	test-auc:0

In [136]:
epochs = 15
clf = xgb.XGBClassifier(
    objective="binary:logistic",
    eta=0.05,
    verbosity=0,
    nthread=10,
    random_seed=1,
    eval_metric="auc",
    # regularization parameters
    max_leaves=3,
    subsample=0.7,
    colsample_bytree=0.7,
    #lightgbm approach
    tree_method="hist",
    grow_policy="lossguide")
bst = clf.fit(X, y)



In [137]:
y_pred = bst.predict(test_transformed)
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred
    })
submission.to_csv('submission_xgboost.csv', index=False)

### Some useful notes

1. Custom objective and evaluation functions – https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py
2. Never use timeseries validation with **xgb.cv**, it is broken :( 
3. Investigate your task and metrics. There are many objective functions that are worth to try.


# LightGBM

### Lightgbm Parameters Tunning


**Core parameters**

Begin with `learning_rate` parameter setting with default value <br>
`n_estimators`(number of boosting iterations) set 10k (some big number).<br>
`num_leaves` (number of leaves in one tree) set 155 as a starting point.<br>
`objective` –  the same as for xgboost `objective` (adding "regression_l1")

**Control Overfitting and Accuracy**

`max_depth` – max tree depth (`default=-1`). Start with tunning `num_leaves`, in the end try to change `max_depth`; <br>
`min_data_in_leaf` – very important parameter that helps to control overfitting (`default=20`);<br>
`colsample_bytree` – select part of features on each iteration (`default=1.0`). Always have to be tunned! <br>
`subsample` – select part of data without resampling (`default=1.0`). To enable it, set `subsample_freq` = 1 (other values always work worse)
`early_stopping_round` – as in xgboost



In [5]:
import lightgbm as lgb
from lightgbm.callback import log_evaluation, early_stopping
callbacks = [log_evaluation(10), early_stopping(10)]
parameters = {
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42
}
n_rounds = 1000

lgb_train = lgb.Dataset(X, label=y, free_raw_data=False)

In [115]:
result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 194, number of negative: 282
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 84
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 84
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 84
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.407563 -> initscore=-0.374049
[LightGBM] [Info] Start training from score -0.374049
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.405462 -> initscore=-0.382757
[LightGBM] [Info] Start training from score -

[40]	cv_agg's train auc: 0.89565 + 0.00876619	cv_agg's valid auc: 0.826206 + 0.0225788
[50]	cv_agg's train auc: 0.902719 + 0.0086773	cv_agg's valid auc: 0.82889 + 0.022946
[60]	cv_agg's train auc: 0.908019 + 0.00770689	cv_agg's valid auc: 0.829791 + 0.0222139


[70]	cv_agg's train auc: 0.912649 + 0.00802577	cv_agg's valid auc: 0.828828 + 0.0231523
Early stopping, best iteration is:
[62]	cv_agg's train auc: 0.909258 + 0.0075869	cv_agg's valid auc: 0.831098 + 0.0232589


In [6]:
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": 'auc',
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.9, 
    "subsample_freq": 1,
    "colsample_bytree": 0.8, 
}

result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 194, number of negative: 282
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 84
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 84
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 84
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.407563 -> initscore=-0.374049
[LightGBM] [Info] Start training from score -0.374049
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.405462 -> initscore=-0.382757
[LightGBM] [Info] Start training from score -

In [9]:
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.8,
    "subsample_freq": 1,
    "min_data_in_leaf": 10,
    "feature_pre_filter": False
}

result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 194, number of negative: 282
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 84
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 84
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 84
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.407563 -> initscore=-0.374049
[LightGBM] [Info] Start training from score -0.374049
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.405462 -> initscore=-0.382757
[LightGBM] [Info] Start training from score -

In [64]:
callbacks = [log_evaluation(50), early_stopping(100)]
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.01,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.8,
    "subsample_freq": 1,
    "min_data_in_leaf": 60,
    "num_leaves": 3,
    "max_depth": 3,
    'reg_alpha': 3,
    'reg_lambda': 10
}


result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 194, number of negative: 282
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 57
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 57
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 57
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.407563 -> initscore=-0.374049
[LightGBM] [Info] Start training from score -0.374049
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.405462 -> initscore=-0.382757
[LightGBM] [Info] Start training from score -



[50]	cv_agg's train auc: 0.778511 + 0.0127771	cv_agg's valid auc: 0.766356 + 0.0230026
[100]	cv_agg's train auc: 0.779329 + 0.0132641	cv_agg's valid auc: 0.765783 + 0.0216133


[150]	cv_agg's train auc: 0.786432 + 0.0132562	cv_agg's valid auc: 0.780931 + 0.0287779
[200]	cv_agg's train auc: 0.795314 + 0.0113047	cv_agg's valid auc: 0.780142 + 0.027068


Early stopping, best iteration is:
[149]	cv_agg's train auc: 0.786432 + 0.0132562	cv_agg's valid auc: 0.780931 + 0.0287779


In [11]:
print(f"Train auc:      {result['train auc-mean'][-1]:.4f}, std: {result['train auc-stdv'][-1]:.4f}")
print(f"Validation auc: {result['valid auc-mean'][-1]:.4f}, std: {result['valid auc-stdv'][-1]:.4f}")



Train auc:      0.8346, std: 0.0097
Validation auc: 0.7996, std: 0.0241


In [54]:
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.01,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    "max_depth":3,
    
    #regularization
    "colsample_bytree": 0.9,
    "subsample": 0.9,
    "subsample_freq": 1,
    "min_data_in_leaf": 70
}


result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)



[LightGBM] [Info] Number of positive: 194, number of negative: 282
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 86
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 86
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 86
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.407563 -> initscore=-0.374049
[LightGBM] [Info] Start training from score -0.374049
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.405462 -> initscore=-0.382757
[LightGBM] [Info] Start training from score -

[100]	cv_agg's train auc: 0.806889 + 0.0112411	cv_agg's valid auc: 0.77674 + 0.0285731


[150]	cv_agg's train auc: 0.809363 + 0.0098242	cv_agg's valid auc: 0.779613 + 0.0315766


[200]	cv_agg's train auc: 0.813967 + 0.0121214	cv_agg's valid auc: 0.781955 + 0.0291303
[250]	cv_agg's train auc: 0.820638 + 0.0116415	cv_agg's valid auc: 0.788623 + 0.0233111




[300]	cv_agg's train auc: 0.824214 + 0.0116753	cv_agg's valid auc: 0.796812 + 0.0281939
[350]	cv_agg's train auc: 0.828024 + 0.00950984	cv_agg's valid auc: 0.802315 + 0.0341832


[400]	cv_agg's train auc: 0.829296 + 0.0105878	cv_agg's valid auc: 0.803218 + 0.0335402


[450]	cv_agg's train auc: 0.83115 + 0.0111724	cv_agg's valid auc: 0.805122 + 0.03498


[500]	cv_agg's train auc: 0.833032 + 0.0113114	cv_agg's valid auc: 0.804853 + 0.0346371


[550]	cv_agg's train auc: 0.833852 + 0.01132	cv_agg's valid auc: 0.805367 + 0.0353991


Early stopping, best iteration is:
[485]	cv_agg's train auc: 0.832895 + 0.0115397	cv_agg's valid auc: 0.805756 + 0.0348414


In [15]:
print(f"Train auc:      {result['train auc-mean'][-1]:.4f}, std: {result['train auc-stdv'][-1]:.4f}")
print(f"Validation auc: {result['valid auc-mean'][-1]:.4f}, std: {result['valid auc-stdv'][-1]:.4f}")



Train auc:      0.8034, std: 0.0090
Validation auc: 0.7786, std: 0.0293


In [79]:
data.dropna(inplace=True)
numerical_features = ['Age', 'SibSp', 'Parch']
ct = make_column_transformer(
        (StandardScaler(), numerical_features),
        remainder='passthrough', verbose_feature_names_out=True)
data_transformed = ct.fit_transform(data.iloc[:, 1:], y=data.Survived)
X = pd.DataFrame(data_transformed, columns=ct.get_feature_names_out())
X['remainder__Sex'].replace(['male','female'], [0, 1], inplace=True)
y = data.Survived.reset_index(drop=True)

In [89]:
X = X.astype(np.float16)
X.dtypes

standardscaler__Age      float16
standardscaler__SibSp    float16
standardscaler__Parch    float16
remainder__Pclass        float16
remainder__Sex           float16
dtype: object

In [90]:
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.01,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.9,
    "subsample": 0.9,
    "subsample_freq": 1,
    "min_data_in_leaf": 50,
    
    #categorical features
    'cat_smooth': 1,
    'min_data_per_group': 50
}
lgb_train = lgb.Dataset(X, label=y, free_raw_data=False, categorical_feature=['remainder__Sex', 'remainder__Pclass'])
result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)



[LightGBM] [Info] Number of positive: 194, number of negative: 282
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 87
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 87
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] Number of positive: 193, number of negative: 283
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 87
[LightGBM] [Info] Number of data points in the train set: 476, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.407563 -> initsco

[50]	cv_agg's train auc: 0.862145 + 0.00784573	cv_agg's valid auc: 0.854398 + 0.014535


[100]	cv_agg's train auc: 0.861882 + 0.00550614	cv_agg's valid auc: 0.851588 + 0.0122786


[150]	cv_agg's train auc: 0.874277 + 0.00984866	cv_agg's valid auc: 0.855714 + 0.0132979


[200]	cv_agg's train auc: 0.879563 + 0.00769956	cv_agg's valid auc: 0.861564 + 0.0169997


[250]	cv_agg's train auc: 0.886686 + 0.00889589	cv_agg's valid auc: 0.863013 + 0.0148057


[300]	cv_agg's train auc: 0.891224 + 0.00885419	cv_agg's valid auc: 0.864789 + 0.0120226


[350]	cv_agg's train auc: 0.894981 + 0.00912954	cv_agg's valid auc: 0.866374 + 0.0115298


[400]	cv_agg's train auc: 0.89796 + 0.00766626	cv_agg's valid auc: 0.869585 + 0.0129346




[450]	cv_agg's train auc: 0.900409 + 0.00791849	cv_agg's valid auc: 0.870389 + 0.0121835


[500]	cv_agg's train auc: 0.90223 + 0.00774277	cv_agg's valid auc: 0.872011 + 0.0124178


[550]	cv_agg's train auc: 0.904169 + 0.00704391	cv_agg's valid auc: 0.873036 + 0.012385


[600]	cv_agg's train auc: 0.905807 + 0.006879	cv_agg's valid auc: 0.874463 + 0.0123387


[650]	cv_agg's train auc: 0.907331 + 0.00696224	cv_agg's valid auc: 0.875769 + 0.012476


[700]	cv_agg's train auc: 0.908667 + 0.00669876	cv_agg's valid auc: 0.876111 + 0.0127176


[750]	cv_agg's train auc: 0.910027 + 0.00644055	cv_agg's valid auc: 0.87616 + 0.0132726


[800]	cv_agg's train auc: 0.911186 + 0.00672115	cv_agg's valid auc: 0.876965 + 0.0131562


[850]	cv_agg's train auc: 0.912263 + 0.00666018	cv_agg's valid auc: 0.877209 + 0.0130129


[900]	cv_agg's train auc: 0.913037 + 0.00652035	cv_agg's valid auc: 0.877233 + 0.0126932


[950]	cv_agg's train auc: 0.914038 + 0.00684662	cv_agg's valid auc: 0.877111 + 0.0127844


Early stopping, best iteration is:
[876]	cv_agg's train auc: 0.912604 + 0.00663256	cv_agg's valid auc: 0.87755 + 0.0125134


In [91]:
X.columns

Index(['standardscaler__Age', 'standardscaler__SibSp', 'standardscaler__Parch',
       'remainder__Pclass', 'remainder__Sex'],
      dtype='object')

In [92]:
print(f"Train auc:      {result['train auc-mean'][-1]:.4f}, std: {result['train auc-stdv'][-1]:.4f}")
print(f"Validation auc: {result['valid auc-mean'][-1]:.4f}, std: {result['valid auc-stdv'][-1]:.4f}")



Train auc:      0.9126, std: 0.0066
Validation auc: 0.8775, std: 0.0125


In [93]:
lgbc = lgb.LGBMClassifier( max_depth = 3,
                           num_leaves = 3,
                           objective="binary",
                           learning_rate =  0.05, #0.01
                           num_iterations=4000,
                           n_estimators = 25,
                           colsample_bytree = 0.7, #0.8
                           subsample_freq=1,
                           subsample = 0.7, #0.8
                           min_data_in_leaf=60,
                           n_jobs = 5,
                           reg_alpha=3,
                           cat_smooth=1,
                           min_data_per_group=50)
lgbf = lgbc.fit(X,y)





In [94]:
y_pred = lgbf.predict(test_transformed)
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred
    })
submission.to_csv('submission_lightgbm.csv', index=False)

# CatBoost

### CatBoost parameters tunning

**`loss_function`** – The metric to use in training (the same as objective in xgb/lgb)<br>
**`eval_metric`** – The metric used for overfitting detection (if enabled) and the best model selection (if enabled).<br>
**`iterations`** – The maximum number of trees<br>
**`learning_rate`** – Learing rate :) (default 0.03)<br>
**`random_seed`** – is not set by default. always set to reproduce results.<br>
**`subsample`** – Sample rate for bagging (default 0.66)<br>
**`use_best_model`** – True if a validation set is input (the eval_set parameter is defined) and at least one of the values of objects in this set. False otherwise<br>
**`depth`** – the same as max depth earlier<br>
**`rsm`** – Random subspace method. The percentage of features that can be used at each split selection. (colsample_bylevel)<br>
**`class_weights`** – Classes weights<br>
**`od_type`** – The type of the overfitting detector to use. (better to use Iter)<br>
**`od_wait`** – The number of iterations to continue the training after the metric value.<br>






In [95]:
import catboost as ctb
parameters = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "iterations": 1000,
    "learning_rate": 0.03,
    "random_seed": 42,
    "od_wait": 30,
    "od_type": "Iter",
    "thread_count": 10
}

In [96]:
ctb_data = ctb.Pool(X,y)
result = ctb.cv(ctb_data, parameters, folds=skf, seed=42, verbose_eval=100, plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/3]
0:	test: 0.8063380	best: 0.8063380 (0)	total: 55.7ms	remaining: 55.6s
100:	test: 0.8992077	best: 0.9004548 (91)	total: 271ms	remaining: 2.42s

bestTest = 0.9004548122
bestIteration = 91

Training on fold [1/3]
0:	test: 0.7631425	best: 0.7631425 (0)	total: 1.29ms	remaining: 1.29s

bestTest = 0.8761424289
bestIteration = 3

Training on fold [2/3]
0:	test: 0.8638225	best: 0.8638225 (0)	total: 3.05ms	remaining: 3.05s

bestTest = 0.8718651751
bestIteration = 51



In [97]:
result.loc[result["test-AUC-mean"] == result["test-AUC-mean"].max()]

Unnamed: 0,iterations,test-AUC-mean,test-AUC-std,test-Logloss-mean,test-Logloss-std,train-Logloss-mean,train-Logloss-std
1,1,0.875277,0.011678,0.656741,0.004899,0.654857,0.003038


In [106]:
len(X.columns)
X = X.astype({"remainder__Sex": int, "remainder__Pclass": int})
categorical_features_indices = [X.columns.get_loc('remainder__Sex'), X.columns.get_loc('remainder__Pclass')]

In [107]:
X.dtypes

standardscaler__Age      float16
standardscaler__SibSp    float16
standardscaler__Parch    float16
remainder__Pclass          int64
remainder__Sex             int64
dtype: object

In [108]:
categorical_features_indices

[4, 3]

In [109]:
parameters = {
    #default
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "iterations": 1000,
    "learning_rate": 0.03,
    "random_seed": 42,
    "od_wait": 50,
    "od_type": "Iter",
    "thread_count": 10
}

ctb_data = ctb.Pool(X,y,cat_features=categorical_features_indices)
result = ctb.cv(ctb_data, parameters, folds=skf, seed=42, verbose_eval=100, plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/3]
0:	test: 0.8163879	best: 0.8163879 (0)	total: 6.42ms	remaining: 6.41s
100:	test: 0.8948063	best: 0.8977773 (68)	total: 366ms	remaining: 3.26s

bestTest = 0.907790493
bestIteration = 148

Training on fold [1/3]
0:	test: 0.7640930	best: 0.7640930 (0)	total: 9.16ms	remaining: 9.15s
100:	test: 0.8571324	best: 0.8584119 (84)	total: 392ms	remaining: 3.49s
200:	test: 0.8722308	best: 0.8722308 (199)	total: 802ms	remaining: 3.19s

bestTest = 0.8736930613
bestIteration = 218

Training on fold [2/3]
0:	test: 0.8133363	best: 0.8133363 (0)	total: 4.04ms	remaining: 4.04s
100:	test: 0.8669299	best: 0.8676610 (93)	total: 426ms	remaining: 3.79s

bestTest = 0.8682459604
bestIteration = 113



In [110]:
result.loc[result["test-AUC-mean"] == result["test-AUC-mean"].max()]

Unnamed: 0,iterations,test-AUC-mean,test-AUC-std,test-Logloss-mean,test-Logloss-std,train-Logloss-mean,train-Logloss-std
218,218,0.881325,0.019477,0.422106,0.027575,0.33924,0.017757


# Summary


- Boosting algorithms are the best for heterogeneous data.
- Lightgbm is the fastest and usually the most accurate;
- CatBoost doesn't need a lot of tuning;
- CatBoost shows the best result on data with many categorical variables;
- Always try three methods and ensemble them;
- Remember about regularization and comparing your training score with validation;
- Use LightGBM for experiments and in the end execute other algorithms;
- Firstly, set default parameters, do feature engineering and then come back to parameters tuning;
- More practice will give you more understanding and intuition;
- Use magic of boosting in real life :)



# Sources:

- Открытый курс машинного обучения. Тема 10. Градиентный бустинг. Часть 1 : https://habrahabr.ru/company/ods/blog/327250/
- Introduction to Boosted Trees: http://xgboost.readthedocs.io/en/latest/model.html
- XGBoost, a Top Machine Learning Method on Kaggle, Explained : https://www.kdnuggets.com/2017/10/xgboost-top-machine-learning-method-kaggle-explained.html
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree: https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
- CatBoost: A machine learning library to handle categorical (CAT) data automatically: https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/
- CatBoost: https://catboost.ai/docs/
- CatBoost tutorials: https://github.com/catboost/tutorials