# Boosting (XGBoost, LightGBM, CatBoost)

In [470]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score

In [471]:
data = pd.read_csv('./data/titanic_train.csv')
data.drop(columns=['Name', 'Fare', 'PassengerId', 'Cabin', 'Ticket'], inplace=True)
age_avg = data['Age'].mean()
data['Age'] = data['Age'].fillna(age_avg)
data.dropna(inplace=True)

y_cat = y = data.Survived.reset_index(drop=True)

In [472]:
data.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], dtype='object')

In [473]:
data.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Embarked     object
dtype: object

In [474]:
features = data.columns[1:]
features

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], dtype='object')

In [475]:
counts = y.value_counts()  # number of passengers per class (0 = died, 1 = survived)
weight_not_surv = counts[0] / (counts[0] + counts[1])
weight_surv = counts[1] / (counts[0] + counts[1])
class_weights = {0: weight_not_surv, 1: weight_surv}
class_weights

{0: 0.6175478065241845, 1: 0.38245219347581555}

In [476]:
categorical_features = ['Sex', 'Pclass', 'Embarked']
ct = make_column_transformer(
        (OneHotEncoder(), categorical_features),
        remainder='passthrough', verbose_feature_names_out=True)
data_transformed = ct.fit_transform(data.iloc[:, 1:], y=data.Survived)
X = pd.DataFrame(data_transformed, columns=ct.get_feature_names_out())
X = X.astype({"onehotencoder__Sex_female": int,
              "onehotencoder__Sex_male": int,
             "onehotencoder__Pclass_1": int,
             "onehotencoder__Pclass_2": int,
             "onehotencoder__Pclass_3": int,
             "onehotencoder__Embarked_C": int,
             "onehotencoder__Embarked_Q": int,
             "onehotencoder__Embarked_S": int,
             "remainder__SibSp":int,
             "remainder__Parch": int})
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
X.dtypes

onehotencoder__Sex_female      int64
onehotencoder__Sex_male        int64
onehotencoder__Pclass_1        int64
onehotencoder__Pclass_2        int64
onehotencoder__Pclass_3        int64
onehotencoder__Embarked_C      int64
onehotencoder__Embarked_Q      int64
onehotencoder__Embarked_S      int64
remainder__Age               float64
remainder__SibSp               int64
remainder__Parch               int64
dtype: object

In [477]:
test = pd.read_csv('./data/titanic_test.csv')
age_avg = test['Age'].mean()
test[['Age']] = test[['Age']].fillna(age_avg)
PassengerId = test['PassengerId']
test_transformed = ct.transform(test[features])  # reuse the encoder fitted on the training data
test_transformed = pd.DataFrame(test_transformed, columns=ct.get_feature_names_out()).astype(np.float64)
test_transformed = test_transformed.astype({"onehotencoder__Sex_female": int,
              "onehotencoder__Sex_male": int,
             "onehotencoder__Pclass_1": int,
             "onehotencoder__Pclass_2": int,
             "onehotencoder__Pclass_3": int,
             "onehotencoder__Embarked_C": int,
             "onehotencoder__Embarked_Q": int,
             "onehotencoder__Embarked_S": int,
             "remainder__SibSp":int,
             "remainder__Parch": int})
test_transformed.dtypes

onehotencoder__Sex_female      int64
onehotencoder__Sex_male        int64
onehotencoder__Pclass_1        int64
onehotencoder__Pclass_2        int64
onehotencoder__Pclass_3        int64
onehotencoder__Embarked_C      int64
onehotencoder__Embarked_Q      int64
onehotencoder__Embarked_S      int64
remainder__Age               float64
remainder__SibSp               int64
remainder__Parch               int64
dtype: object

In [478]:
X_cat = data.drop(columns='Survived')  # work on a copy instead of mutating `data`
# Encode the categorical columns as integer codes
X_cat['Sex'] = X_cat['Sex'].map({'male': 0, 'female': 1})
X_cat['Embarked'] = X_cat['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
X_cat = X_cat.astype({"Sex": int, "Pclass": int, "SibSp": int, 'Parch': int, 'Embarked': int})
X_cat.dtypes

Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Embarked      int64
dtype: object

In [479]:
test_transformed_categorical = test[features].copy()  # avoid mutating `test`
# Same integer encoding as for the training data
test_transformed_categorical['Sex'] = test_transformed_categorical['Sex'].map({'male': 0, 'female': 1})
test_transformed_categorical['Embarked'] = test_transformed_categorical['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
test_transformed_categorical.dtypes

Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Embarked      int64
dtype: object

In [481]:
from sklearn.utils.class_weight import compute_class_weight
 
classes = np.unique(y)
# y_train is created by the train_test_split cell in the CatBoost section; run that cell first
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))
class_weights

{0: 0.8161764705882353, 1: 1.2906976744186047}
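
The `'balanced'` heuristic used above weights each class by `n_samples / (n_classes * class_count)`. A minimal plain-Python sketch of that formula follows. The class counts (408 and 258) are an assumption, chosen because they reproduce the weights printed above; they are not read from the data here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}


# Hypothetical label vector: 408 negatives and 258 positives (assumed counts).
labels = [0] * 408 + [1] * 258
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so both classes contribute roughly equally to a weighted loss.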

# XGBoost



### Boosting settings

**These settings are explained after the LightGBM theory**<br>
`tree_method` – 'exact' (worth trying if you have time), 'approx', 'hist' (usually the best choice) <br>
`grow_policy` – 'depthwise', 'lossguide' (usually better)<br>
`objective` – the default is usually good enough (sometimes "count:poisson" is worth trying)
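
To make the `hist` option more concrete: histogram-based methods bucket each continuous feature into a small number of bins and only evaluate split thresholds at bin boundaries. The sketch below is a simplified plain-Python illustration of that idea, not XGBoost's actual implementation. The equal-width binning, the function names and the toy data are all assumptions.

```python
def build_histogram(values, grads, hess, n_bins):
    """Accumulate gradient and hessian sums per equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        g_hist[b] += g
        h_hist[b] += h
    edges = [lo + width * (i + 1) for i in range(n_bins - 1)]  # candidate thresholds
    return g_hist, h_hist, edges


def best_split(g_hist, h_hist, edges, lam=1.0):
    """Scan the bin boundaries once and return (best_gain, threshold)."""
    g_total, h_total = sum(g_hist), sum(h_hist)
    parent = g_total ** 2 / (h_total + lam)
    best = (0.0, None)
    g_left = h_left = 0.0
    for i, edge in enumerate(edges):
        g_left += g_hist[i]
        h_left += h_hist[i]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = 0.5 * (g_left ** 2 / (h_left + lam)
                      + g_right ** 2 / (h_right + lam) - parent)
        if gain > best[0]:
            best = (gain, edge)
    return best


# Toy data (assumed): young passengers survive, older ones do not.
ages = [2, 5, 8, 30, 40, 60]
labels = [1, 1, 1, 0, 0, 0]
grads = [0.5 - y for y in labels]  # p - y with initial probability p = 0.5
hess = [0.25] * len(labels)        # p * (1 - p) with p = 0.5
g_hist, h_hist, edges = build_histogram(ages, grads, hess, n_bins=4)
gain, threshold = best_split(g_hist, h_hist, edges)
print(threshold, round(gain, 4))
```

With only a handful of bins, the scan touches a few candidate thresholds instead of every distinct feature value, which is where the speed-up comes from.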


### XGBoost parameters tuning

We usually start tuning with these parameters: <br>
`n_estimators` - number of boosting rounds; it is better to let `early_stopping_rounds` choose it <br>
`eta` – learning rate; a higher learning rate shortens convergence time (the default value usually works well)
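
The advice to rely on `early_stopping_rounds` instead of guessing `n_estimators` can be illustrated with a small plain-Python sketch of the stopping rule itself: keep boosting until the validation metric has not improved for a given number of rounds, then report the best round. The error values below are made up for illustration, and the function name is hypothetical.

```python
def early_stopping_round(val_errors, patience):
    """Return (best_round, stop_round) for a metric that should be minimized."""
    best_round, best_err = 0, float("inf")
    for rnd, err in enumerate(val_errors):
        if err < best_err:
            best_round, best_err = rnd, err
        elif rnd - best_round >= patience:
            return best_round, rnd  # no improvement for `patience` rounds
    return best_round, len(val_errors) - 1  # ran out of rounds without stopping


# Hypothetical validation error per boosting round.
val_errors = [0.25, 0.21, 0.20, 0.19, 0.195, 0.20, 0.192, 0.21, 0.22]
best, stopped = early_stopping_round(val_errors, patience=3)
print(best, stopped)
```

The number of trees to keep is then `best + 1`, since rounds are counted from zero.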

### Control Overfitting
When you observe high training accuracy but low test accuracy, you are most likely dealing with overfitting.

There are in general two ways that you can control overfitting:

* The first way is to directly control model complexity <p>
`max_depth` - maximum depth of a tree; increasing this value makes the model more complex; <br>
`gamma` - minimum loss reduction required to make a further partition on a leaf node of the tree.<br>
`min_child_weight` – minimum sum of instance weight (hessian) needed in a child.


* The second way is to add randomness to make training robust to noise <p>
`subsample` - subsample ratio of the training instances, <br>
`colsample_bytree` - subsample ratio of columns when constructing each tree. <br>
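
To see how `gamma` and `min_child_weight` restrict tree growth, here is a minimal sketch of the split-gain formula from the XGBoost paper, where `G` and `H` are the sums of gradients and hessians in each child. The function names and the toy numbers are assumptions; this is an illustration, not the library's code.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split: 0.5 * (left + right - parent) - gamma."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Keep a split only if both children are heavy enough and the gain is positive."""
    if min(h_left, h_right) < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Toy statistics (assumed): the children have opposite gradient sums.
print(split_gain(-2.0, 1.0, 2.0, 1.0))                               # gain without penalty
print(split_allowed(-2.0, 1.0, 2.0, 1.0, gamma=3.0))                 # rejected by gamma
print(split_allowed(-2.0, 1.0, 2.0, 1.0, min_child_weight=2.0))      # rejected by child weight
```

Raising either parameter prunes splits that a less regularized model would accept, which is why both appear under complexity control.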


### Handle Imbalanced Dataset
There are two ways to handle an imbalanced dataset, depending on what you care about:

* If you care only about the ranking order (AUC) of your predictions:
    * balance the positive and negative weights via `scale_pos_weight`;
    * use AUC for evaluation.
* If you care about predicting the right probability:
    * you cannot re-balance the dataset in this case; setting `max_delta_step` to a finite number (say 1) helps convergence.
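
For the first option, the XGBoost documentation suggests `sum(negative instances) / sum(positive instances)` as a typical starting value for `scale_pos_weight`. A minimal sketch of that ratio follows. The counts 549 and 340 are an assumption inferred from the class shares printed earlier in this notebook (about 0.618 and 0.382), and the helper name is hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive samples, a common starting value."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples")
    return n_neg / n_pos


# Hypothetical label vector with the assumed class counts.
labels = [0] * 549 + [1] * 340
spw = suggested_scale_pos_weight(labels)
print(round(spw, 4))
```

Treat the result as a starting point for tuning rather than a final value.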

More about xgboost parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

Always use `early_stopping_rounds` and tune `n_estimators` on validation data.

In [482]:
import xgboost as xgb


parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 4,
    "random_seed": 1,
    "eval_metric": "error"
}


xgb_train = xgb.DMatrix(X, y, feature_names=X.columns)
xgb_test = xgb.DMatrix(test_transformed)

In [483]:
results = xgb.cv(parameters, xgb_train, num_boost_round=100,
                 folds=skf, verbose_eval=10)

[0]	train-error:0.14004+0.01089	test-error:0.19233+0.01700
[10]	train-error:0.12767+0.00648	test-error:0.19459+0.01126
[20]	train-error:0.12317+0.00596	test-error:0.19572+0.01258
[30]	train-error:0.12036+0.00790	test-error:0.19347+0.01385
[40]	train-error:0.11867+0.00790	test-error:0.19235+0.01259
[50]	train-error:0.11530+0.00676	test-error:0.19122+0.01409
[60]	train-error:0.11080+0.00692	test-error:0.19460+0.00967
[70]	train-error:0.10855+0.00552	test-error:0.20023+0.00177
[80]	train-error:0.10630+0.00597	test-error:0.20248+0.00304
[90]	train-error:0.10461+0.00234	test-error:0.20136+0.00596
[99]	train-error:0.10068+0.00484	test-error:0.20586+0.00760


In [484]:
# added early stopping
num_rounds = 10000
results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-error:0.14004+0.01089	test-error:0.19233+0.01700
[10]	train-error:0.12767+0.00648	test-error:0.19459+0.01126
[17]	train-error:0.12205+0.00705	test-error:0.19572+0.00987


In [485]:
%%time
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "error",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.5,
    "colsample_bynode":0.7
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-error:0.31889+0.00824	test-error:0.32507+0.01740
[10]	train-error:0.17323+0.02041	test-error:0.18785+0.01119
[20]	train-error:0.15748+0.00558	test-error:0.18334+0.01363
[25]	train-error:0.15242+0.00881	test-error:0.18672+0.00819
CPU times: user 938 ms, sys: 105 ms, total: 1.04 s
Wall time: 689 ms


In [486]:
%%time
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "error",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.5,
    "colsample_bynode":0.7,
    
    #lightgbm approach
    "tree_method": "hist"
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-error:0.26099+0.06304	test-error:0.26881+0.03737
[10]	train-error:0.17773+0.01003	test-error:0.18672+0.00819
[14]	train-error:0.17099+0.01063	test-error:0.18446+0.01393
CPU times: user 624 ms, sys: 103 ms, total: 728 ms
Wall time: 471 ms


In [487]:
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "error",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.6,
    "colsample_bynode":0.7,
    
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide"
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-error:0.28291+0.01604	test-error:0.31496+0.01826
[10]	train-error:0.16873+0.00864	test-error:0.18672+0.03094
[15]	train-error:0.16030+0.01081	test-error:0.18896+0.01911


0.77990
parameters = {
    #default
    "objective": "binary:logistic",
    "n_estimators": 400,
    "eta": 0.03,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    # regularization parameters
    "max_depth": 3,
    "subsample": 0.9,
    "colsample_bytree": 0.8,
    "gamma": 2,
    "reg_lambda": 4,
    "scale_pos_weight": 0.6,
    
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide",
}
[0]	train-auc:0.79896+0.01104	test-auc:0.79356+0.01737
[63]	train-auc:0.81029+0.01105	test-auc:0.79764+0.02647

0.8215429037725188

In [488]:
# 0.78229
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "error",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.6,
    "colsample_bynode":0.7,
    
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide"
}
num_rounds = 500
results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=50,
                 folds=skf, verbose_eval=100)

[0]	train-error:0.28291+0.01604	test-error:0.31496+0.01826
[76]	train-error:0.13330+0.00233	test-error:0.18448+0.00967


In [489]:
parameters['n_estimators'] = 77
clf = xgb.XGBClassifier(**parameters, use_label_encoder=False)
bst = clf.fit(X, y)

In [490]:
parameters

{'objective': 'binary:logistic',
 'eta': 0.1,
 'verbosity': 0,
 'nthread': 10,
 'random_seed': 1,
 'eval_metric': 'error',
 'max_depth': 7,
 'subsample': 0.8,
 'colsample_bytree': 0.6,
 'colsample_bylevel': 0.6,
 'colsample_bynode': 0.7,
 'tree_method': 'hist',
 'grow_policy': 'lossguide',
 'n_estimators': 77}

In [491]:
y_077990 = pd.read_csv('submission_xgboost_077990.csv')

In [492]:
bst.best_iteration

76

In [428]:
y_077990

     PassengerId  Survived
0            892         0
1            893         1
2            894         0
3            895         0
4            896         1
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0


In [429]:
y_pred = bst.predict(test_transformed, iteration_range = (0, bst.best_iteration + 1))
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred
    })
submission.to_csv('submission_xgboost.csv', index=False)
(submission == y_077990).all()

PassengerId     True
Survived       False
dtype: bool

### Some useful notes

1. Custom objective and evaluation functions – https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py
2. Never use timeseries validation with **xgb.cv**, it is broken :( 
3. Investigate your task and metrics. There are many objective functions that are worth trying.
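
As a companion to note 1: a custom objective returns the first and second derivatives of the loss with respect to the raw (pre-sigmoid) prediction. The plain-Python sketch below shows those derivatives for the logistic loss. The function name is hypothetical, and wiring it into XGBoost is left to the linked demo.

```python
import math


def logistic_grad_hess(raw_preds, labels):
    """Gradient and hessian of the log loss w.r.t. the raw score of each sample."""
    grads, hess = [], []
    for raw, y in zip(raw_preds, labels):
        p = 1.0 / (1.0 + math.exp(-raw))  # sigmoid turns the raw score into a probability
        grads.append(p - y)               # first derivative
        hess.append(p * (1.0 - p))        # second derivative
    return grads, hess


# Two toy samples: an undecided prediction for a positive, a confident one for a negative.
grads, hess = logistic_grad_hess([0.0, 2.0], [1, 0])
print(grads, hess)
```

A confident wrong prediction produces a large gradient, so the next tree focuses on correcting it.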


# LightGBM

### LightGBM parameters tuning


**Core parameters**

Start with `learning_rate` at its default value. <br>
`n_estimators` (number of boosting iterations) – set it to 10k (some big number) and rely on early stopping.<br>
`num_leaves` (number of leaves in one tree) – use 155 as a starting point.<br>
`objective` – the same as the xgboost `objective`, plus "regression_l1"
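
When choosing `num_leaves`, it helps to remember that a binary tree of depth `d` has at most `2**d` leaves, so `num_leaves` and `max_depth` constrain each other. A small sketch of that check follows; the helper name is hypothetical.

```python
def effective_leaf_limit(num_leaves, max_depth):
    """Upper bound on leaves per tree; max_depth <= 0 means no depth limit."""
    if max_depth <= 0:
        return num_leaves
    return min(num_leaves, 2 ** max_depth)


print(effective_leaf_limit(155, -1))  # 155: only num_leaves applies
print(effective_leaf_limit(155, 7))   # 128: the depth limit is the tighter bound
print(effective_leaf_limit(8, 7))     # 8: num_leaves is the tighter bound
```

If `num_leaves` exceeds `2**max_depth`, the depth limit is the one that actually binds.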

**Control Overfitting and Accuracy**

`max_depth` – maximum tree depth (`default=-1`). Start by tuning `num_leaves` and try changing `max_depth` at the end; <br>
`min_data_in_leaf` – a very important parameter that helps to control overfitting (`default=20`);<br>
`colsample_bytree` – fraction of features selected on each iteration (`default=1.0`). Always worth tuning! <br>
`subsample` – fraction of the data selected without resampling (`default=1.0`). To enable it, set `subsample_freq` = 1 (other values usually work worse) <br>
`early_stopping_round` – as in xgboost
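
The interaction between `subsample` and `subsample_freq` is easy to miss: rows are re-drawn only every `subsample_freq` iterations, and with a frequency of 0 no bagging happens at all. The sketch below is a simplified plain-Python illustration of that schedule, not LightGBM's implementation. The function name and the tiny sizes are assumptions.

```python
import random


def bagging_schedule(n_rows, n_iterations, subsample, subsample_freq, seed=42):
    """Yield the row indices used at each boosting iteration."""
    rng = random.Random(seed)
    current = list(range(n_rows))  # without bagging, every row is used
    bag_size = int(n_rows * subsample)
    for it in range(n_iterations):
        if subsample < 1.0 and subsample_freq > 0 and it % subsample_freq == 0:
            # Draw a fresh subset without replacement
            current = sorted(rng.sample(range(n_rows), bag_size))
        yield current


bags = list(bagging_schedule(n_rows=10, n_iterations=4, subsample=0.8, subsample_freq=2))
no_bagging = list(bagging_schedule(n_rows=10, n_iterations=2, subsample=0.8, subsample_freq=0))
for it, rows in enumerate(bags):
    print(it, rows)
```

With `subsample_freq=2` the same subset is reused for two consecutive iterations; with `subsample_freq=1` each tree sees a fresh subset.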



In [442]:
import lightgbm as lgb
from lightgbm.callback import log_evaluation, early_stopping
callbacks = [log_evaluation(50), early_stopping(100)]
parameters = {
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "binary_error",
    "seed": 42
}
n_rounds = 1000

lgb_train = lgb.Dataset(X, label=y, free_raw_data=False)

In [443]:
result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=False)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init

[50]	cv_agg's binary_error: 0.194592 + 0.0109907


[100]	cv_agg's binary_error: 0.204727 + 0.0111986


Early stopping, best iteration is:
[46]	cv_agg's binary_error: 0.18222 + 0.0171456


In [444]:
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.01,
    "num_threads": 10,
    "metric": 'binary_error',
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.9,
    "subsample_freq": 1,
}
result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=False)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init

[50]	cv_agg's binary_error: 0.197959 + 0.0229844


[100]	cv_agg's binary_error: 0.188965 + 0.0217929


Early stopping, best iteration is:
[46]	cv_agg's binary_error: 0.18222 + 0.0171456


In [445]:
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "binary_error",
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.8,
    "subsample_freq": 1,
    "min_data_in_leaf": 10,
}

result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=False)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init

[LightGBM] [Fatal] Reducing `min_data_in_leaf` with `feature_pre_filter=true` may cause unexpected behaviour for features that were pre-filtered by the larger `min_data_in_leaf`.
You need to set `feature_pre_filter=false` to dynamically change the `min_data_in_leaf`.


[50]	cv_agg's binary_error: 0.188977 + 0.00552504
[100]	cv_agg's binary_error: 0.197974 + 0.00416539
Early stopping, best iteration is:
[26]	cv_agg's binary_error: 0.179968 + 0.0150688


In [446]:
callbacks = [log_evaluation(50), early_stopping(100)]
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.05,
    "num_threads": 10,
    "metric": "binary_error",
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.8,
    "subsample_freq": 1,
    "min_data_in_leaf": 10,
}


result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init

In [447]:
# 0.78229

parameters = {
    #default
    "n_estimators": 400,
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "binary_error",
    "seed": 42,
    "max_depth": 7,
    "num_leaves": 8,
    #regularization
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bynode":0.7,
    "subsample_freq": 1,
    "min_data_in_leaf": 20,
}


result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init



[50]	cv_agg's train binary_error: 0.14454 + 0.00562693	cv_agg's valid binary_error: 0.182213 + 0.0124073
[100]	cv_agg's train binary_error: 0.133291 + 0.0062159	cv_agg's valid binary_error: 0.191214 + 0.00942462
Early stopping, best iteration is:
[17]	cv_agg's train binary_error: 0.16535 + 0.00619184	cv_agg's valid binary_error: 0.173219 + 0.0202439


0.74880
parameters = {
    #default
    "n_estimators": 300,
    "objective": "binary",
    "learning_rate": 0.05,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    "max_depth":3,
    "num_leaves": 7,
    "max_bin": 512,
    "subsample_for_bin": 200,
    #regularization
    "colsample_bytree": 0.5,
    "subsample": 1,
    "subsample_freq": 1,
    "min_data_in_leaf": 10,
    "scale_pos_weight": scale_pos_weight
}


result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

In [448]:
parameters = {
    #default
    "n_estimators": 400,
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "binary_error",
    "seed": 42,
    "max_depth": 7,
    "num_leaves": 8,
    #regularization
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.6,
    "colsample_bynode":0.7,
    "subsample_freq": 1,
    "min_data_in_leaf": 20,
    #categorical features
    'cat_smooth': 1,
    'min_data_per_group': 50
}
lgb_train = lgb.Dataset(X_cat, label=y, free_raw_data=False, categorical_feature=['Sex', 'Pclass', 'Embarked'])
result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 91
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 6
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 91
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 6
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 91
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> initscore=-0.482098
[LightGBM] [Info] Start training from score -0.482098
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.382799 -> i



[50]	cv_agg's train binary_error: 0.145665 + 0.00799777	cv_agg's valid binary_error: 0.190092 + 0.0102533
[100]	cv_agg's train binary_error: 0.132165 + 0.00773612	cv_agg's valid binary_error: 0.190095 + 0.00781205
Early stopping, best iteration is:
[17]	cv_agg's train binary_error: 0.16535 + 0.00619184	cv_agg's valid binary_error: 0.173219 + 0.0202439


In [449]:
parameters['n_estimators'] = 17
lgbc = lgb.LGBMClassifier(**parameters)
lgbf = lgbc.fit(X_cat,y,categorical_feature=['Sex', 'Pclass', 'Embarked'])





In [450]:
print(lgbf.best_iteration_)

None


In [197]:
# The model was fit with n_estimators=17, so all trees are used;
# the categorical features were already declared in fit()
y_pred_l = lgbf.predict(test_transformed_categorical)
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred_l
    })
submission.to_csv('submission_lightgbm.csv', index=False)

# CatBoost

### CatBoost parameters tuning

**`loss_function`** – The metric to optimize during training (the same as `objective` in xgb/lgb)<br>
**`eval_metric`** – The metric used for overfitting detection (if enabled) and best model selection (if enabled)<br>
**`iterations`** – The maximum number of trees<br>
**`learning_rate`** – Learning rate :) (default 0.03)<br>
**`random_seed`** – Not set by default; always set it to make results reproducible<br>
**`subsample`** – Sample rate for bagging (default 0.66)<br>
**`use_best_model`** – Keep only the trees up to the best iteration on the validation set. Defaults to True if a validation set is provided (the `eval_set` parameter is defined) and at least one label value in this set differs from the others; False otherwise<br>
**`depth`** – The same as `max_depth` earlier<br>
**`rsm`** – Random subspace method: the percentage of features that can be used at each split selection (like `colsample_bylevel`)<br>
**`class_weights`** – Class weights<br>
**`od_type`** – The type of overfitting detector to use (`Iter` is usually the better choice)<br>
**`od_wait`** – The number of iterations to continue training after the iteration with the optimal metric value<br>
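
To illustrate what `class_weights` does to the training loss, here is a plain-Python sketch of a class-weighted log loss, where each sample's contribution is scaled by the weight of its class and the total is normalized by the sum of weights. How CatBoost applies the weights internally is not shown here; the function name, the probabilities and the rounded weights are assumptions.

```python
import math


def weighted_logloss(probs, labels, class_weights):
    """Weighted average of the per-sample log loss, using each sample's class weight."""
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical predicted probabilities of class 1 and their true labels.
probs = [0.9, 0.2, 0.6, 0.3]
labels = [1, 0, 1, 0]
unweighted = weighted_logloss(probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_logloss(probs, labels, {0: 0.816, 1: 1.291})
print(round(unweighted, 4), round(weighted, 4))
```

Because the positive class has the larger weight, errors on positive samples count for more, so the optimizer is pushed harder to fix them.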






In [456]:
import catboost as ctb
parameters = {
    "loss_function": "Logloss",
    "eval_metric": "Accuracy",
    "iterations": 1000,
    "learning_rate": 0.03,
    "random_seed": 42,
    "od_wait": 30,
    "od_type": "Iter",
    "thread_count": 10,
    "use_best_model": True 
}

In [457]:
from sklearn.model_selection import train_test_split

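# Column positions of the categorical features in X_cat
categorical_features_indices = [X_cat.columns.get_loc(c) for c in ['Sex', 'Pclass', 'Embarked']]
X_train, X_validation, y_train, y_validation = train_test_split(X_cat, y, train_size=0.75, random_state=42)
train_pool = ctb.Pool(X_train, y_train, cat_features=categorical_features_indices)
validate_pool = ctb.Pool(X_validation, y_validation, cat_features=categorical_features_indices)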


In [458]:
ctb_data = ctb.Pool(X_train,y_train)
result = ctb.cv(ctb_data, parameters, folds=skf, seed=42, verbose_eval=100, plot=True)

Custom logger is already specified. Specify more than one logger at same time is not thread safe.

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/3]
0:	learn: 0.8198198	test: 0.7882883	best: 0.7882883 (0)	total: 869us	remaining: 868ms

bestTest = 0.8153153153
bestIteration = 61

Training on fold [1/3]
0:	learn: 0.8265766	test: 0.8198198	best: 0.8198198 (0)	total: 2.32ms	remaining: 2.32s

bestTest = 0.8513513514
bestIteration = 9

Training on fold [2/3]
0:	learn: 0.8063063	test: 0.7432432	best: 0.7432432 (0)	total: 2.27ms	remaining: 2.27s

bestTest = 0.8063063063
bestIteration = 3



In [286]:
result.loc[result["test-AUC-mean"] == result["test-AUC-mean"].max()]

     iterations  test-AUC-mean  test-AUC-std  test-Logloss-mean  test-Logloss-std  train-Logloss-mean  train-Logloss-std
123         123       0.863572      0.024997           0.425154          0.031357            0.390213           0.000815


In [287]:
categorical_features_indices = [X_cat.columns.get_loc('Sex'), X_cat.columns.get_loc('Pclass'), X_cat.columns.get_loc('Embarked')]

In [288]:
X_cat.dtypes

Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Embarked      int64
dtype: object

In [289]:
categorical_features_indices

[1, 0, 5]

In [459]:
parameters = {
    #default
    "loss_function": "Logloss",
    "eval_metric": "Accuracy",
    "iterations": 1000,
    "learning_rate": 0.3,
    "random_seed": 42,
    "od_wait": 30,
    "od_type": "Iter",
    "thread_count": 10,
    'depth': 8,
    'l2_leaf_reg': 4,
    "use_best_model": True,
    "subsample": 0.8,
}

ctb_data = ctb.Pool(X_train,y_train,cat_features=categorical_features_indices)
result = ctb.cv(ctb_data, parameters, folds=skf, seed=42, verbose_eval=100, plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/3]
0:	learn: 0.8040541	test: 0.7792793	best: 0.7792793 (0)	total: 5.6ms	remaining: 5.6s

bestTest = 0.8153153153
bestIteration = 17

Training on fold [1/3]
0:	learn: 0.8198198	test: 0.8423423	best: 0.8423423 (0)	total: 5.24ms	remaining: 5.24s

bestTest = 0.8423423423
bestIteration = 0

Training on fold [2/3]
0:	learn: 0.8085586	test: 0.7702703	best: 0.7702703 (0)	total: 3.39ms	remaining: 3.38s

bestTest = 0.8153153153
bestIteration = 20



In [462]:
result.loc[result["test-Accuracy-mean"] == result["test-Accuracy-mean"].max()]

    iterations  test-Accuracy-mean  test-Accuracy-std  train-Accuracy-mean  train-Accuracy-std  test-Logloss-mean  test-Logloss-std  train-Logloss-mean  train-Logloss-std
12          12            0.816817            0.01448             0.865616            0.005201           0.433354          0.035397            0.333538           0.001055
13          13            0.816817            0.01448             0.866366             0.00344           0.433656          0.035892            0.330914           0.003845


In [463]:
parameters['iterations'] = 54
clf = ctb.CatBoostClassifier(**parameters)

In [464]:
clf.fit(X_train, y_train, 
        cat_features=categorical_features_indices, 
        eval_set=(X_validation, y_validation),
        verbose=True
)

0:	learn: 0.8153153	test: 0.8026906	best: 0.8026906 (0)	total: 1.39ms	remaining: 73.8ms
1:	learn: 0.7852853	test: 0.7892377	best: 0.8026906 (0)	total: 2.32ms	remaining: 60.2ms
2:	learn: 0.7852853	test: 0.7892377	best: 0.8026906 (0)	total: 3.29ms	remaining: 55.9ms
3:	learn: 0.7852853	test: 0.7892377	best: 0.8026906 (0)	total: 4.32ms	remaining: 54ms
4:	learn: 0.7987988	test: 0.7982063	best: 0.8026906 (0)	total: 5.62ms	remaining: 55ms
5:	learn: 0.8093093	test: 0.8071749	best: 0.8071749 (5)	total: 6.78ms	remaining: 54.2ms
6:	learn: 0.8168168	test: 0.8116592	best: 0.8116592 (6)	total: 8.12ms	remaining: 54.5ms
7:	learn: 0.8198198	test: 0.8161435	best: 0.8161435 (7)	total: 9ms	remaining: 51.7ms
8:	learn: 0.8258258	test: 0.8206278	best: 0.8206278 (8)	total: 10.5ms	remaining: 52.5ms
9:	learn: 0.8258258	test: 0.8206278	best: 0.8206278 (8)	total: 11.4ms	remaining: 50ms
10:	learn: 0.8258258	test: 0.8206278	best: 0.8206278 (8)	total: 12.5ms	remaining: 48.7ms
11:	learn: 0.8258258	test: 0.8206278	bes

<catboost.core.CatBoostClassifier at 0x7fa59d8dacd0>

In [465]:
print(clf.get_best_iteration())

14


In [466]:
clf.get_all_params()

{'nan_mode': 'Min',
 'eval_metric': 'Accuracy',
 'combinations_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1',
  'Counter:CtrBorderCount=15:CtrBorderType=Uniform:Prior=0/1'],
 'iterations': 54,
 'sampling_frequency': 'PerTree',
 'fold_permutation_block': 0,
 'leaf_estimation_method': 'Newton',
 'od_pval': 0,
 'counter_calc_method': 'SkipTest',
 'grow_policy': 'SymmetricTree',
 'penalties_coefficient': 1,
 'boosting_type': 'Plain',
 'model_shrink_mode': 'Constant',
 'feature_border_type': 'GreedyLogSum',
 'ctr_leaf_count_limit': 18446744073709551615,
 'bayesian_matrix_reg': 0.10000000149011612,
 'one_hot_max_size': 2,
 'force_unit_auto_pair_weights': False,
 'l2_leaf_reg': 4,
 'random_strength': 1,
 'od_type': 'Iter',
 'rsm': 1,
 'boost_from_average': False,
 'max_ctr_complexity': 1,
 'model_size_reg': 0.5,
 'simple_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetB

In [467]:
clf.get_best_iteration()

14

In [468]:
# ntree_end is exclusive, so add 1 to include the best iteration itself
y_pred = clf.predict(test_transformed_categorical, ntree_end=clf.get_best_iteration() + 1)
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred
    })
submission.to_csv('submission_catboost.csv', index=False)

In [469]:
y_pred


array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

# Summary


- Boosting algorithms are usually the best choice for heterogeneous data;
- LightGBM is the fastest and usually the most accurate;
- CatBoost doesn't need a lot of tuning;
- CatBoost shows the best results on data with many categorical variables;
- Always try all three methods and ensemble them;
- Remember regularization, and compare your training score with your validation score;
- Use LightGBM for experiments and run the other algorithms at the end;
- First set default parameters and do feature engineering, then come back to parameter tuning;
- More practice will give you more understanding and intuition;
- Use the magic of boosting in real life :)
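
The "ensemble them" advice can be as simple as averaging the predicted probabilities of the three models before thresholding. A minimal plain-Python sketch follows. The probability values are made up, and the function names are hypothetical.

```python
def average_probabilities(*model_probs):
    """Average per-sample probabilities from several models (simple blending)."""
    n_models = len(model_probs)
    return [sum(ps) / n_models for ps in zip(*model_probs)]


def to_labels(probs, threshold=0.5):
    """Turn probabilities into 0/1 class labels."""
    return [int(p >= threshold) for p in probs]


# Hypothetical class-1 probabilities from the three models for three passengers.
xgb_probs = [0.9, 0.4, 0.2]
lgb_probs = [0.8, 0.6, 0.1]
ctb_probs = [0.7, 0.55, 0.3]
blended = average_probabilities(xgb_probs, lgb_probs, ctb_probs)
labels = to_labels(blended)
print(blended, labels)
```

For the second passenger the models disagree, and the averaged probability decides the final label.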



# Sources:

- Open Machine Learning Course. Topic 10. Gradient Boosting. Part 1: https://habrahabr.ru/company/ods/blog/327250/
- Introduction to Boosted Trees: http://xgboost.readthedocs.io/en/latest/model.html
- XGBoost, a Top Machine Learning Method on Kaggle, Explained : https://www.kdnuggets.com/2017/10/xgboost-top-machine-learning-method-kaggle-explained.html
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree: https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
- CatBoost: A machine learning library to handle categorical (CAT) data automatically: https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/
- CatBoost: https://catboost.ai/docs/
- CatBoost tutorials: https://github.com/catboost/tutorials