# Boosting (XGBoost, LightGBM, CatBoost)

In [210]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import accuracy_score

In [211]:
data = pd.read_csv('./data/titanic_train.csv')
data.drop(columns=['Name', 'Fare', 'PassengerId', 'Cabin', 'Ticket'], axis = 1, inplace=True)
age_avg = data['Age'].mean()
data['Age'] = data['Age'].fillna(age_avg)
data.dropna(inplace=True)

y_cat = y = data.Survived.reset_index(drop=True)

In [212]:
data.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], dtype='object')

In [213]:
data.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Embarked     object
dtype: object

In [214]:
features = data.columns[1:]
features

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], dtype='object')

In [333]:
weight_not_surv = counts[0]/(counts[0]+counts[1])
weight_surv = counts[1]/(counts[0]+counts[1])
class_weights = {0: weight_not_surv, 1 : weight_surv}
class_weights

{0: 0.6175478065241845, 1: 0.38245219347581555}

In [216]:
categorical_features = ['Sex', 'Pclass', 'Embarked']
ct = make_column_transformer(
        (OneHotEncoder(), categorical_features),
        remainder='passthrough', verbose_feature_names_out=True)
data_transformed = ct.fit_transform(data.iloc[:, 1:], y=data.Survived)
X = pd.DataFrame(data_transformed, columns=ct.get_feature_names_out())
X = X.astype({"onehotencoder__Sex_female": int,
              "onehotencoder__Sex_male": int,
             "onehotencoder__Pclass_1": int,
             "onehotencoder__Pclass_2": int,
             "onehotencoder__Pclass_3": int,
             "onehotencoder__Embarked_C": int,
             "onehotencoder__Embarked_Q": int,
             "onehotencoder__Embarked_S": int,
             "remainder__SibSp":int,
             "remainder__Parch": int})
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
X.dtypes

onehotencoder__Sex_female      int64
onehotencoder__Sex_male        int64
onehotencoder__Pclass_1        int64
onehotencoder__Pclass_2        int64
onehotencoder__Pclass_3        int64
onehotencoder__Embarked_C      int64
onehotencoder__Embarked_Q      int64
onehotencoder__Embarked_S      int64
remainder__Age               float64
remainder__SibSp               int64
remainder__Parch               int64
dtype: object

In [217]:
test = pd.read_csv('./data/titanic_test.csv')
age_avg = test['Age'].mean()
test[['Age']] = test[['Age']].fillna(age_avg)
PassengerId = test['PassengerId']
test_transformed = ct.fit_transform(test[features])
test_transformed = pd.DataFrame(test_transformed, columns=ct.get_feature_names_out()).astype(np.float64)
test_transformed = test_transformed.astype({"onehotencoder__Sex_female": int,
              "onehotencoder__Sex_male": int,
             "onehotencoder__Pclass_1": int,
             "onehotencoder__Pclass_2": int,
             "onehotencoder__Pclass_3": int,
             "onehotencoder__Embarked_C": int,
             "onehotencoder__Embarked_Q": int,
             "onehotencoder__Embarked_S": int,
             "remainder__SibSp":int,
             "remainder__Parch": int})
test_transformed.dtypes

onehotencoder__Sex_female      int64
onehotencoder__Sex_male        int64
onehotencoder__Pclass_1        int64
onehotencoder__Pclass_2        int64
onehotencoder__Pclass_3        int64
onehotencoder__Embarked_C      int64
onehotencoder__Embarked_Q      int64
onehotencoder__Embarked_S      int64
remainder__Age               float64
remainder__SibSp               int64
remainder__Parch               int64
dtype: object

In [218]:
X_cat = data
X_cat.drop('Survived', inplace=True, axis=1)
X_cat['Sex'].replace(['male','female'], [0, 1], inplace=True)
X_cat['Embarked'].replace(['C','Q', 'S'], [0, 1, 2], inplace=True)
X_cat = X_cat.astype(np.float64)
X_cat = X_cat.astype({"Sex": int, "Pclass": int, "SibSp": int, 'Parch': int, 'Embarked':int})
X_cat.dtypes

Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Embarked      int64
dtype: object

In [219]:
test_transformed_categorical = test
test_transformed_categorical['Sex'].replace(['male','female'], [0, 1], inplace=True)
test_transformed_categorical['Embarked'].replace(['C','Q', 'S'], [0, 1, 2], inplace=True)
test_transformed_categorical = test_transformed_categorical[features]
test_transformed_categorical.dtypes

Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Embarked      int64
dtype: object

In [220]:
X_cat.dtypes

Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Embarked      int64
dtype: object

In [341]:
from sklearn.utils.class_weight import compute_class_weight
 
classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))
class_weights

{0: 0.8161764705882353, 1: 1.2906976744186047}

# Xgboost



### Boosting settings

**Explanation after lightgbm theory**<br>
`tree_method` – 'exact' (if you have time, you can try), 'approx', 'hist'(the best choise usually) <br>
`grow_policy` – 'depthwise', 'lossguide' (usually better)<br>
`objective` – default is quite good (sometimes "count:poisson")


### XGBoost parameters tuning

Usually we start tuning parameters with these first: <br>
`n_estimators` - number of boosting rounds, better to use `early_stopping_rounds` <br>
`eta` – learning rate, increasing lr reduces convergence time. (usually default value works good)

### Control Overfitting
When you observe high training accuracy, but low tests accuracy, it is likely that you encounter overfitting problem.

There are in general two ways that you can control overfitting:

* The first way is to directly control model complexity <p>
`max_depth` - maximum depth of a tree, increase of this value will make the model more complex; <br>
`gamma` - minimum loss reduction required to make a further partition on a leaf node of the tree.<br>
`min_child_weight` – minimum sum of instance weight (hessian) needed in a child.


* The second way is to add randomness to make training robust to noise <p>
`subsample` - subsample ratio of the training instance, <br>
`colsample_bytree` - subsample ratio of columns when constructing each tree. <br>


### Handle Imbalanced Dataset
There are two ways to improve it:

* If you care only about the ranking order (AUC) of your prediction
Balance the positive and negative weights, via `scale_pos_weight`
Use AUC for evaluation
* If you care about predicting the right probability
In such a case, you cannot re-balance the dataset. Set parameter `max_delta_step` to a finite number (say 1) will help to converge <br>

More about xgboost parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

Always use `early_stopping_round` and tune `n_estimators` on validation.

In [158]:
import xgboost as xgb


parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 4,
    "random_seed": 1,
    "eval_metric": "auc"
}


xgb_train = xgb.DMatrix(X, y, feature_names=X.columns)
xgb_test = xgb.DMatrix(test_transformed)

In [162]:
def get_kfold_accuracy(X, y, parameters, n_splits=5):
    kfold = StratifiedKFold(shuffle=True, random_state=42, n_splits=n_splits)
    scores = []
    for train_ix, test_ix in kfold.split(X, y):
        train_X, test_X = X.iloc[train_ix], X.iloc[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        xgb_clf = xgb.XGBClassifier(**parameters, use_label_encoder=False)
        xgb_clf.fit(X, y)
        yhat =  xgb_clf.predict(test_X, iteration_range = (0, xgb_clf.best_iteration + 1))
        acc = accuracy_score(test_y, yhat)
        # store score
        scores.append(acc)
    return np.mean(scores)

In [163]:
results = xgb.cv(parameters, xgb_train, num_boost_round=100,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.88124+0.00830	test-auc:0.85481+0.01655
[10]	train-auc:0.91025+0.00422	test-auc:0.86101+0.01081
[20]	train-auc:0.92289+0.00166	test-auc:0.86111+0.00592
[30]	train-auc:0.93096+0.00095	test-auc:0.85649+0.00806
[40]	train-auc:0.93715+0.00171	test-auc:0.85512+0.00745
[50]	train-auc:0.94187+0.00215	test-auc:0.85620+0.00689
[60]	train-auc:0.94590+0.00190	test-auc:0.85579+0.00657
[70]	train-auc:0.94870+0.00122	test-auc:0.85413+0.00662
[80]	train-auc:0.95157+0.00136	test-auc:0.85411+0.00591
[90]	train-auc:0.95398+0.00096	test-auc:0.85414+0.00537
[99]	train-auc:0.95504+0.00096	test-auc:0.85310+0.00460


In [164]:
get_kfold_accuracy(X, y, parameters)

0.8852853424744493

In [165]:
# added early stopping
num_rounds = 10000
results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.88124+0.00830	test-auc:0.85481+0.01655
[10]	train-auc:0.91025+0.00422	test-auc:0.86101+0.01081
[20]	train-auc:0.92289+0.00166	test-auc:0.86111+0.00592
[23]	train-auc:0.92587+0.00154	test-auc:0.86128+0.00528


In [166]:
%%time
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.5,
    "colsample_bynode":0.7
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.67484+0.02780	test-auc:0.66735+0.04739
[10]	train-auc:0.87624+0.01337	test-auc:0.86133+0.00938
[20]	train-auc:0.89226+0.01028	test-auc:0.86609+0.01438
[30]	train-auc:0.90063+0.00641	test-auc:0.86843+0.01397
[37]	train-auc:0.90230+0.00609	test-auc:0.86676+0.01569
CPU times: user 1.27 s, sys: 0 ns, total: 1.27 s
Wall time: 813 ms


In [167]:
get_kfold_accuracy(X, y, parameters)

0.8616644448676443

In [168]:
%%time
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.5,
    "colsample_bynode":0.7,
    
    #lightgbm approach
    "tree_method": "hist"
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.75046+0.09516	test-auc:0.72975+0.08344
[10]	train-auc:0.87708+0.01037	test-auc:0.85744+0.01414
[20]	train-auc:0.87960+0.01072	test-auc:0.86087+0.01492
[30]	train-auc:0.89193+0.01425	test-auc:0.86249+0.01411
[39]	train-auc:0.89870+0.00982	test-auc:0.86215+0.01375
CPU times: user 1.2 s, sys: 174 ms, total: 1.38 s
Wall time: 820 ms


In [169]:
get_kfold_accuracy(X, y, parameters)

0.8661524788929092

In [170]:
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.6,
    "colsample_bynode":0.7,
    
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide"
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.70762+0.04155	test-auc:0.67281+0.04116
[10]	train-auc:0.87986+0.01141	test-auc:0.85294+0.01983
[20]	train-auc:0.88701+0.00975	test-auc:0.85817+0.01344
[30]	train-auc:0.89653+0.00577	test-auc:0.86200+0.01501
[40]	train-auc:0.90243+0.00473	test-auc:0.86363+0.01459
[46]	train-auc:0.90715+0.00369	test-auc:0.86278+0.01447


In [171]:
get_kfold_accuracy(X, y, parameters)

0.8526566368310797

0.77990
parameters = parameters = {
    #default
    "objective": "binary:logistic",
    "n_estimators": 400,
    "eta": 0.03,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    # regularization parameters
    "max_depth": 3,
    "subsample": 0.9,
    "colsample_bytree": 0.8,
    "gamma": 2,
    "reg_lambda": 4,
    "scale_pos_weight": 0.6,
    
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide",
}
[0]	train-auc:0.79896+0.01104	test-auc:0.79356+0.01737
[63]	train-auc:0.81029+0.01105	test-auc:0.79764+0.02647

0.8215429037725188

In [172]:
# 0.78229
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.6,
    "colsample_bynode":0.7,
    
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide"
}
num_rounds = 500
results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=50,
                 folds=skf, verbose_eval=100)

[0]	train-auc:0.70762+0.04155	test-auc:0.67281+0.04116
[86]	train-auc:0.92203+0.00190	test-auc:0.85901+0.01380


In [173]:
get_kfold_accuracy(X, y, parameters)

0.8526566368310797

In [34]:
clf = xgb.XGBClassifier(**parameters, use_label_encoder=False)
bst = clf.fit(X, y)

In [35]:
y_077990 = pd.read_csv('submission_xgboost_077990.csv')

In [38]:
bst.best_iteration

99

In [39]:
y_077990

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [40]:
y_pred = bst.predict(test_transformed, iteration_range = (0, bst.best_iteration + 1))
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred
    })
submission.to_csv('submission_xgboost.csv', index=False)
(submission == y_077990).all()

PassengerId     True
Survived       False
dtype: bool

### Some useful notes

1. Custom objective and evaluation functions – https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py
2. Never use timeseries validation with **xgb.cv**, it is broken :( 
3. Investigate your task and metrics. There are many objective functions that are worth to try.


# LightGBM

### Lightgbm Parameters Tunning


**Core parameters**

Begin with `learning_rate` parameter setting with default value <br>
`n_estimators`(number of boosting iterations) set 10k (some big number).<br>
`num_leaves` (number of leaves in one tree) set 155 as a starting point.<br>
`objective` –  the same as for xgboost `objective` (adding "regression_l1")

**Control Overfitting and Accuracy**

`max_depth` – max tree depth (`default=-1`). Start with tunning `num_leaves`, in the end try to change `max_depth`; <br>
`min_data_in_leaf` – very important parameter that helps to control overfitting (`default=20`);<br>
`colsample_bytree` – select part of features on each iteration (`default=1.0`). Always have to be tunned! <br>
`subsample` – select part of data without resampling (`default=1.0`). To enable it, set `subsample_freq` = 1 (other values always work worse)
`early_stopping_round` – as in xgboost



In [42]:
import lightgbm as lgb
from lightgbm.callback import log_evaluation, early_stopping
callbacks = [log_evaluation(50), early_stopping(100)]
parameters = {
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42
}
n_rounds = 1000

lgb_train = lgb.Dataset(X, label=y, free_raw_data=False)

In [192]:
def l_get_kfold_accuracy(X, y, parameters, n_splits=5, num_iter = 30):
    kfold = StratifiedKFold(shuffle=True, random_state=42, n_splits=n_splits)
    scores = []
    for train_ix, test_ix in kfold.split(X, y):
        train_X, test_X = X.iloc[train_ix], X.iloc[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        lgbc = lgb.LGBMClassifier(**parameters)
        lgbc.fit(X, y)
        best = lgbc.best_iteration_ if lgbc.best_iteration_ else num_iter
        yhat =  lgbc.predict(test_X, num_iteration=best)
        acc = accuracy_score(test_y, yhat)
        # store score
        scores.append(acc)
    return np.mean(scores)

In [44]:
result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=False)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init

[50]	cv_agg's auc: 0.859546 + 0.0114899


[100]	cv_agg's auc: 0.848048 + 0.0143326
Early stopping, best iteration is:
[28]	cv_agg's auc: 0.867194 + 0.0100888


In [45]:
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.01,
    "num_threads": 10,
    "metric": 'auc',
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.9, 
    "subsample_freq": 1,
    "colsample_bytree": 0.8, 
}
result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=False)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> initscore=-0.482098
[LightGBM] [Info] Start training from score -0.4

[50]	cv_agg's auc: 0.857548 + 0.0138563


[100]	cv_agg's auc: 0.860579 + 0.0146317


Early stopping, best iteration is:
[28]	cv_agg's auc: 0.867194 + 0.0100888


In [46]:
l_get_kfold_accuracy(X, y, parameters)



0.8290166952326541

In [47]:
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.8,
    "subsample_freq": 1,
    "min_data_in_leaf": 10,
}

result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=False)

[LightGBM] [Fatal] Reducing `min_data_in_leaf` with `feature_pre_filter=true` may cause unexpected behaviour for features that were pre-filtered by the larger `min_data_in_leaf`.
You need to set `feature_pre_filter=false` to dynamically change the `min_data_in_leaf`.


[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> initscore=-0.482098
[LightGBM] [Info] Start training from score -0.4

In [48]:
l_get_kfold_accuracy(X, y, parameters)



0.916784104614994

In [49]:
callbacks = [log_evaluation(50), early_stopping(100)]
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.05,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.8,
    "subsample_freq": 1,
    "min_data_in_leaf": 10,
}


result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init

In [50]:
l_get_kfold_accuracy(X, y, parameters)



0.8819082079603886

In [51]:
print(f"Train auc:      {result['train auc-mean'][-1]:.4f}, std: {result['train auc-stdv'][-1]:.4f}")
print(f"Validation auc: {result['valid auc-mean'][-1]:.4f}, std: {result['valid auc-stdv'][-1]:.4f}")



Train auc:      0.9256, std: 0.0019
Validation auc: 0.8607, std: 0.0157


In [185]:
# 0.78229

parameters = {
    #default
    "n_estimators": 400,
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    "max_depth": 7,
    "num_leaves": 8,
    #regularization
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bynode":0.7,
    "subsample_freq": 1,
    "min_data_in_leaf": 20,
}


result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)



[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 91
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 6
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 91
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 6
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 91
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> initscore=-0.482098
[LightGBM] [Info] Start training from score -0.482098
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.382799 -> i

In [186]:
print(f"Train auc:      {result['train auc-mean'][-1]:.4f}, std: {result['train auc-stdv'][-1]:.4f}")
print(f"Validation auc: {result['valid auc-mean'][-1]:.4f}, std: {result['valid auc-stdv'][-1]:.4f}")



Train auc:      0.9221, std: 0.0062
Validation auc: 0.8637, std: 0.0141


In [187]:
l_get_kfold_accuracy(X, y, parameters)



0.8402716942804546

0.74880
parameters = {
    #default
    "n_estimators": 300,
    "objective": "binary",
    "learning_rate": 0.05,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    "max_depth":3,
    "num_leaves": 7,
    "max_bin": 512,
    "subsample_for_bin": 200,
    #regularization
    "colsample_bytree": 0.5,
    "subsample": 1,
    "subsample_freq": 1,
    "min_data_in_leaf": 10,
    "scale_pos_weight": scale_pos_weight
}


result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

In [188]:
parameters = {
    #default
    "n_estimators": 400,
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    "max_depth": 7,
    "num_leaves": 8,
    #regularization
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.6,
    "colsample_bynode":0.7,
    "subsample_freq": 1,
    "min_data_in_leaf": 20,
    #categorical features
    'cat_smooth': 1,
    'min_data_per_group': 50
}
lgb_train = lgb.Dataset(X_cat, label=y, free_raw_data=False, categorical_feature=['Sex', 'Pclass', 'Embarked'])
result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)



[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 91
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 6
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 91
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 6
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 91
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> initsco

In [194]:
l_get_kfold_accuracy(X_cat, y, parameters, num_iter=30)



0.8672824223957342

In [195]:
lgbc = lgb.LGBMClassifier(**parameters)
lgbf = lgbc.fit(X_cat,y,categorical_feature=['Sex', 'Pclass', 'Embarked'])





In [196]:
print(lgbf.best_iteration_)

None


In [197]:
y_pred_l = lgbf.predict(test_transformed_categorical, num_iteration=30, categorical_feature=['Sex', 'Pclass', 'Embarked'])
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred_l
    })
submission.to_csv('submission_lightgbm.csv', index=False)

# CatBoost

### CatBoost parameters tunning

**`loss_function`** – The metric to use in training (the same as objective in xgb/lgb)<br>
**`eval_metric`** – The metric used for overfitting detection (if enabled) and the best model selection (if enabled).<br>
**`iterations`** – The maximum number of trees<br>
**`learning_rate`** – Learing rate :) (default 0.03)<br>
**`random_seed`** – is not set by default. always set to reproduce results.<br>
**`subsample`** – Sample rate for bagging (default 0.66)<br>
**`use_best_model`** – True if a validation set is input (the eval_set parameter is defined) and at least one of the values of objects in this set. False otherwise<br>
**`depth`** – the same as max depth earlier<br>
**`rsm`** – Random subspace method. The percentage of features that can be used at each split selection. (colsample_bylevel)<br>
**`class_weights`** – Classes weights<br>
**`od_type`** – The type of the overfitting detector to use. (better to use Iter)<br>
**`od_wait`** – The number of iterations to continue the training after the metric value.<br>






In [199]:
import catboost as ctb
parameters = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "iterations": 1000,
    "learning_rate": 0.03,
    "random_seed": 42,
    "od_wait": 30,
    "od_type": "Iter",
    "thread_count": 10,
    "use_best_model": True 
}

In [283]:
from sklearn.model_selection import train_test_split

X_train, X_validation, y_train, y_validation = train_test_split(X_cat, y, train_size=0.75, random_state=42)
train_pool = ctb.Pool(X_train, y_train, cat_features=categorical_features_indices)
validate_pool = ctb.Pool(X_validation, y_validation, cat_features=categorical_features_indices)


In [284]:
def c_get_kfold_accuracy(X, y, parameters, n_splits=5, num_iter = 30):
    kfold = StratifiedKFold(shuffle=True, random_state=42, n_splits=n_splits)
    scores = []
    for train_ix, test_ix in kfold.split(X, y):
        train_X, test_X = X.iloc[train_ix], X.iloc[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        boost = ctb.CatBoostClassifier(**parameters)
        boost.fit(X, y, eval_set=(X_validation, y_validation))
        best = boost.best_iteration_ if boost.best_iteration_ else num_iter
        yhat =  boost.predict(test_X, ntree_end=best)
        acc = accuracy_score(test_y, yhat)
        # store score
        scores.append(acc)
    return np.mean(scores)

In [285]:
ctb_data = ctb.Pool(X_train,y_train)
result = ctb.cv(ctb_data, parameters, folds=skf, seed=42, verbose_eval=100, plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/3]
0:	test: 0.8060021	best: 0.8060021 (0)	total: 2.46ms	remaining: 2.46s
100:	test: 0.8747008	best: 0.8747008 (100)	total: 206ms	remaining: 1.84s
200:	test: 0.8791467	best: 0.8833362 (168)	total: 359ms	remaining: 1.43s

bestTest = 0.8833361833
bestIteration = 168

Training on fold [1/3]
0:	test: 0.8184850	best: 0.8184850 (0)	total: 2.45ms	remaining: 2.45s
100:	test: 0.8743588	best: 0.8764107 (89)	total: 183ms	remaining: 1.63s

bestTest = 0.8764107387
bestIteration = 89

Training on fold [2/3]
0:	test: 0.7937329	best: 0.7937329 (0)	total: 9.47ms	remaining: 9.46s

bestTest = 0.848495212
bestIteration = 10



In [286]:
result.loc[result["test-AUC-mean"] == result["test-AUC-mean"].max()]

Unnamed: 0,iterations,test-AUC-mean,test-AUC-std,test-Logloss-mean,test-Logloss-std,train-Logloss-mean,train-Logloss-std
123,123,0.863572,0.024997,0.425154,0.031357,0.390213,0.000815


In [287]:
categorical_features_indices = [X_cat.columns.get_loc('Sex'), X_cat.columns.get_loc('Pclass'), X_cat.columns.get_loc('Embarked')]

In [288]:
X_cat.dtypes

Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Embarked      int64
dtype: object

In [289]:
categorical_features_indices

[1, 0, 5]

In [352]:
parameters = {
    #default
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "iterations": 1000,
    "learning_rate": 0.3,
    "random_seed": 42,
    "od_wait": 30,
    "od_type": "Iter",
    "thread_count": 10,
    'depth': 8,
    'l2_leaf_reg': 4,
    "use_best_model": True,
    "subsample": 0.8,
}

ctb_data = ctb.Pool(X_train,y_train,cat_features=categorical_features_indices)
result = ctb.cv(ctb_data, parameters, folds=skf, seed=42, verbose_eval=100, plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/3]
0:	test: 0.7690236	best: 0.7690236 (0)	total: 8.61ms	remaining: 8.6s

bestTest = 0.8782062244
bestIteration = 16

Training on fold [1/3]
0:	test: 0.8428950	best: 0.8428950 (0)	total: 6.15ms	remaining: 6.14s

bestTest = 0.8782062244
bestIteration = 5

Training on fold [2/3]
0:	test: 0.7580797	best: 0.7580797 (0)	total: 5.4ms	remaining: 5.39s

bestTest = 0.8305831053
bestIteration = 54



In [353]:
result.loc[result["test-AUC-mean"] == result["test-AUC-mean"].max()]

Unnamed: 0,iterations,test-AUC-mean,test-AUC-std,test-Logloss-mean,test-Logloss-std,train-Logloss-mean,train-Logloss-std
6,6,0.855848,0.02873,0.435106,0.031918,0.373594,0.01145


In [354]:
c_get_kfold_accuracy(X_train, y_train.reset_index(drop=True), parameters, num_iter=20)

0:	test: 0.8544369	best: 0.8544369 (0)	total: 1.75ms	remaining: 1.75s
1:	test: 0.8679727	best: 0.8679727 (1)	total: 3.37ms	remaining: 1.68s
2:	test: 0.8575938	best: 0.8679727 (1)	total: 4.18ms	remaining: 1.39s
3:	test: 0.8589344	best: 0.8679727 (1)	total: 5.6ms	remaining: 1.39s
4:	test: 0.8656807	best: 0.8679727 (1)	total: 6.41ms	remaining: 1.27s
5:	test: 0.8577668	best: 0.8679727 (1)	total: 7.83ms	remaining: 1.3s
6:	test: 0.8610102	best: 0.8679727 (1)	total: 9.37ms	remaining: 1.33s
7:	test: 0.8593669	best: 0.8679727 (1)	total: 10.3ms	remaining: 1.28s
8:	test: 0.8557343	best: 0.8679727 (1)	total: 11.8ms	remaining: 1.3s
9:	test: 0.8556911	best: 0.8679727 (1)	total: 12.8ms	remaining: 1.27s
10:	test: 0.8537883	best: 0.8679727 (1)	total: 13.6ms	remaining: 1.22s
11:	test: 0.8560803	best: 0.8679727 (1)	total: 14.6ms	remaining: 1.2s
12:	test: 0.8551289	best: 0.8679727 (1)	total: 15.6ms	remaining: 1.19s
13:	test: 0.8561235	best: 0.8679727 (1)	total: 17.1ms	remaining: 1.2s
14:	test: 0.8551721	b

0.8348109078666817

In [355]:
clf = ctb.CatBoostClassifier(**parameters)

In [348]:
clf.fit(X_train, y_train, 
        cat_features=categorical_features_indices, 
        eval_set=(X_validation, y_validation),
        verbose=True
)

0:	test: 0.7830825	best: 0.7830825 (0)	total: 2.48ms	remaining: 2.48s
1:	test: 0.8319495	best: 0.8319495 (1)	total: 4.95ms	remaining: 2.47s
2:	test: 0.8375281	best: 0.8375281 (2)	total: 7.86ms	remaining: 2.61s
3:	test: 0.8505881	best: 0.8505881 (3)	total: 10.7ms	remaining: 2.67s
4:	test: 0.8527504	best: 0.8527504 (4)	total: 13.6ms	remaining: 2.72s
5:	test: 0.8475610	best: 0.8527504 (4)	total: 15.2ms	remaining: 2.52s
6:	test: 0.8476475	best: 0.8527504 (4)	total: 16.6ms	remaining: 2.35s
7:	test: 0.8483826	best: 0.8527504 (4)	total: 19.4ms	remaining: 2.4s
8:	test: 0.8479502	best: 0.8527504 (4)	total: 20.9ms	remaining: 2.31s
9:	test: 0.8485556	best: 0.8527504 (4)	total: 22.2ms	remaining: 2.2s
10:	test: 0.8485556	best: 0.8527504 (4)	total: 24.1ms	remaining: 2.16s
11:	test: 0.8483826	best: 0.8527504 (4)	total: 25.2ms	remaining: 2.07s
12:	test: 0.8495935	best: 0.8527504 (4)	total: 26.3ms	remaining: 2s
13:	test: 0.8495070	best: 0.8527504 (4)	total: 28.1ms	remaining: 1.98s
14:	test: 0.8500692	b

<catboost.core.CatBoostClassifier at 0x7fa59d8c45e0>

In [328]:
print(clf.get_best_iteration())

25


In [329]:
clf.get_all_params()

{'nan_mode': 'Min',
 'eval_metric': 'AUC',
 'combinations_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1',
  'Counter:CtrBorderCount=15:CtrBorderType=Uniform:Prior=0/1'],
 'iterations': 1000,
 'sampling_frequency': 'PerTree',
 'fold_permutation_block': 0,
 'leaf_estimation_method': 'Newton',
 'od_pval': 0,
 'counter_calc_method': 'SkipTest',
 'grow_policy': 'SymmetricTree',
 'penalties_coefficient': 1,
 'boosting_type': 'Plain',
 'model_shrink_mode': 'Constant',
 'feature_border_type': 'GreedyLogSum',
 'ctr_leaf_count_limit': 18446744073709551615,
 'bayesian_matrix_reg': 0.10000000149011612,
 'one_hot_max_size': 2,
 'force_unit_auto_pair_weights': False,
 'l2_leaf_reg': 4,
 'random_strength': 1,
 'od_type': 'Iter',
 'rsm': 1,
 'boost_from_average': False,
 'max_ctr_complexity': 4,
 'model_size_reg': 0.5,
 'simple_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBord

In [330]:
clf.get_best_iteration()

25

In [319]:
y_pred = clf.predict(test_transformed_categorical, ntree_end=clf.get_best_iteration())
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred
    })
submission.to_csv('submission_catboost.csv', index=False)

In [301]:
y_pred


array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

# Summary


- Boosting algorithms are the best for heterogeneous data.
- Lightgbm is the fastest and usually the most accurate;
- CatBoost doesn't need a lot of tuning;
- CatBoost shows the best result on data with many categorical variables;
- Always try three methods and ensemble them;
- Remember about regularization and comparing your training score with validation;
- Use LightGBM for experiments and in the end execute other algorithms;
- Firstly, set default parameters, do feature engineering and then come back to parameters tuning;
- More practice will give you more understanding and intuition;
- Use magic of boosting in real life :)



# Sources:

- Открытый курс машинного обучения. Тема 10. Градиентный бустинг. Часть 1 : https://habrahabr.ru/company/ods/blog/327250/
- Introduction to Boosted Trees: http://xgboost.readthedocs.io/en/latest/model.html
- XGBoost, a Top Machine Learning Method on Kaggle, Explained : https://www.kdnuggets.com/2017/10/xgboost-top-machine-learning-method-kaggle-explained.html
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree: https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
- CatBoost: A machine learning library to handle categorical (CAT) data automatically: https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/
- CatBoost: https://catboost.ai/docs/
- CatBoost tutorials: https://github.com/catboost/tutorials