# Boosting (XGBoost, LightGBM, CatBoost)

In [592]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import accuracy_score

In [593]:
data = pd.read_csv('./data/titanic_train.csv')
data.drop(columns=['Name', 'Fare', 'PassengerId', 'Cabin', 'Ticket'], axis = 1, inplace=True)
age_avg = data['Age'].mean()
data['Age'] = data['Age'].fillna(age_avg)
data.dropna(inplace=True)

y_cat = y = data.Survived.reset_index(drop=True)

In [594]:
data.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], dtype='object')

In [595]:
data.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Embarked     object
dtype: object

In [596]:
features = data.columns[1:]
features

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], dtype='object')

In [597]:
counts = data.Survived.value_counts()
scale_pos_weight = counts[0]/counts[1]
scale_pos_weight

1.6147058823529412

In [598]:
categorical_features = ['Sex', 'Pclass', 'Embarked']
ct = make_column_transformer(
        (OneHotEncoder(), categorical_features),
        remainder='passthrough', verbose_feature_names_out=True)
data_transformed = ct.fit_transform(data.iloc[:, 1:], y=data.Survived)
X = pd.DataFrame(data_transformed, columns=ct.get_feature_names_out())
X = X.astype({"onehotencoder__Sex_female": int,
              "onehotencoder__Sex_male": int,
             "onehotencoder__Pclass_1": int,
             "onehotencoder__Pclass_2": int,
             "onehotencoder__Pclass_3": int,
             "onehotencoder__Embarked_C": int,
             "onehotencoder__Embarked_Q": int,
             "onehotencoder__Embarked_S": int,
             "remainder__SibSp":int,
             "remainder__Parch": int})
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
X.dtypes

onehotencoder__Sex_female      int64
onehotencoder__Sex_male        int64
onehotencoder__Pclass_1        int64
onehotencoder__Pclass_2        int64
onehotencoder__Pclass_3        int64
onehotencoder__Embarked_C      int64
onehotencoder__Embarked_Q      int64
onehotencoder__Embarked_S      int64
remainder__Age               float64
remainder__SibSp               int64
remainder__Parch               int64
dtype: object

In [599]:
test = pd.read_csv('./data/titanic_test.csv')
age_avg = test['Age'].mean()
test[['Age']] = test[['Age']].fillna(age_avg)
PassengerId = test['PassengerId']
test_transformed = ct.fit_transform(test[features])
test_transformed = pd.DataFrame(test_transformed, columns=ct.get_feature_names_out()).astype(np.float64)
test_transformed = test_transformed.astype({"onehotencoder__Sex_female": int,
              "onehotencoder__Sex_male": int,
             "onehotencoder__Pclass_1": int,
             "onehotencoder__Pclass_2": int,
             "onehotencoder__Pclass_3": int,
             "onehotencoder__Embarked_C": int,
             "onehotencoder__Embarked_Q": int,
             "onehotencoder__Embarked_S": int,
             "remainder__SibSp":int,
             "remainder__Parch": int})
test_transformed.dtypes

onehotencoder__Sex_female      int64
onehotencoder__Sex_male        int64
onehotencoder__Pclass_1        int64
onehotencoder__Pclass_2        int64
onehotencoder__Pclass_3        int64
onehotencoder__Embarked_C      int64
onehotencoder__Embarked_Q      int64
onehotencoder__Embarked_S      int64
remainder__Age               float64
remainder__SibSp               int64
remainder__Parch               int64
dtype: object

In [600]:
X_cat = data
X_cat.drop('Survived', inplace=True, axis=1)
X_cat['Sex'].replace(['male','female'], [0, 1], inplace=True)
X_cat['Embarked'].replace(['C','Q', 'S'], [0, 1, 2], inplace=True)
X_cat = X_cat.astype(np.float64)
X_cat = X_cat.astype({"Sex": int, "Pclass": int, "SibSp": int, 'Parch': int})
X_cat.dtypes

Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Embarked    float64
dtype: object

In [601]:
test_transformed_categorical = test
test_transformed_categorical['Sex'].replace(['male','female'], [0, 1], inplace=True)
test_transformed_categorical['Embarked'].replace(['C','Q', 'S'], [0, 1, 2], inplace=True)
test_transformed_categorical = test_transformed_categorical[features]
test_transformed_categorical.dtypes

Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Embarked      int64
dtype: object

In [602]:
X_cat.dtypes

Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Embarked    float64
dtype: object

# Xgboost



### Boosting settings

**Explanation after lightgbm theory**<br>
`tree_method` – 'exact' (if you have time, you can try), 'approx', 'hist'(the best choise usually) <br>
`grow_policy` – 'depthwise', 'lossguide' (usually better)<br>
`objective` – default is quite good (sometimes "count:poisson")


### XGBoost parameters tuning

Usually we start tuning parameters with these first: <br>
`n_estimators` - number of boosting rounds, better to use `early_stopping_rounds` <br>
`eta` – learning rate, increasing lr reduces convergence time. (usually default value works good)

### Control Overfitting
When you observe high training accuracy, but low tests accuracy, it is likely that you encounter overfitting problem.

There are in general two ways that you can control overfitting:

* The first way is to directly control model complexity <p>
`max_depth` - maximum depth of a tree, increase of this value will make the model more complex; <br>
`gamma` - minimum loss reduction required to make a further partition on a leaf node of the tree.<br>
`min_child_weight` – minimum sum of instance weight (hessian) needed in a child.


* The second way is to add randomness to make training robust to noise <p>
`subsample` - subsample ratio of the training instance, <br>
`colsample_bytree` - subsample ratio of columns when constructing each tree. <br>


### Handle Imbalanced Dataset
There are two ways to improve it:

* If you care only about the ranking order (AUC) of your prediction
Balance the positive and negative weights, via `scale_pos_weight`
Use AUC for evaluation
* If you care about predicting the right probability
In such a case, you cannot re-balance the dataset. Set parameter `max_delta_step` to a finite number (say 1) will help to converge <br>

More about xgboost parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

Always use `early_stopping_round` and tune `n_estimators` on validation.

In [176]:
import xgboost as xgb


parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 4,
    "random_seed": 1,
    "eval_metric": "auc"
}


xgb_train = xgb.DMatrix(X, y, feature_names=X.columns)
xgb_test = xgb.DMatrix(test_transformed)

In [603]:
def get_kfold_accuracy(X, y, parameters, n_splits=5):
    kfold = StratifiedKFold(shuffle=True, random_state=42, n_splits=n_splits)
    scores = []
    for train_ix, test_ix in kfold.split(X, y):
        train_X, test_X = X.iloc[train_ix], X.iloc[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        xgb_clf = xgb.XGBClassifier(**parameters, use_label_encoder=False)
        xgb_clf.fit(X, y)
        yhat =  xgb_clf.predict(test_X)
        acc = accuracy_score(test_y, yhat)
        # store score
        scores.append(acc)
    return np.mean(scores)

In [178]:
results = xgb.cv(parameters, xgb_train, num_boost_round=100,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.82492+0.01125	test-auc:0.80373+0.02034
[10]	train-auc:0.86628+0.00668	test-auc:0.81630+0.01153
[20]	train-auc:0.88233+0.00453	test-auc:0.82166+0.00825
[30]	train-auc:0.89304+0.00624	test-auc:0.82134+0.01044
[40]	train-auc:0.89720+0.00770	test-auc:0.82156+0.00838
[50]	train-auc:0.90133+0.00816	test-auc:0.81829+0.00893
[60]	train-auc:0.90550+0.00895	test-auc:0.81645+0.00710
[70]	train-auc:0.90681+0.00907	test-auc:0.81744+0.00664
[80]	train-auc:0.90911+0.00837	test-auc:0.81593+0.00472
[90]	train-auc:0.91139+0.00796	test-auc:0.81622+0.00491
[99]	train-auc:0.91331+0.00798	test-auc:0.81592+0.00478


In [179]:
get_kfold_accuracy(X, y, parameters)

0.8507250015692673

In [180]:
# added early stopping
num_rounds = 10000
results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.82492+0.01125	test-auc:0.80373+0.02034
[10]	train-auc:0.86628+0.00668	test-auc:0.81630+0.01153
[20]	train-auc:0.88233+0.00453	test-auc:0.82166+0.00825
[30]	train-auc:0.89304+0.00624	test-auc:0.82134+0.01044
[36]	train-auc:0.89507+0.00659	test-auc:0.82257+0.00922


In [181]:
%%time
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.5,
    "colsample_bynode":0.7
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.80209+0.01554	test-auc:0.77826+0.01973
[10]	train-auc:0.83464+0.00802	test-auc:0.80850+0.02258
[20]	train-auc:0.84793+0.00740	test-auc:0.81453+0.02053
[30]	train-auc:0.85439+0.00872	test-auc:0.82037+0.01910
[40]	train-auc:0.86181+0.00848	test-auc:0.82117+0.01810
[44]	train-auc:0.86483+0.00973	test-auc:0.82105+0.01970
CPU times: user 1.29 s, sys: 0 ns, total: 1.29 s
Wall time: 870 ms


In [182]:
get_kfold_accuracy(X, y, parameters)

0.8305191136777352

In [183]:
%%time
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.5,
    "colsample_bynode":0.7,
    
    #lightgbm approach
    "tree_method": "hist"
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.80444+0.01054	test-auc:0.79140+0.03241
[10]	train-auc:0.84366+0.01167	test-auc:0.81280+0.01651
[17]	train-auc:0.84579+0.01096	test-auc:0.81159+0.01659
CPU times: user 498 ms, sys: 80.6 ms, total: 579 ms
Wall time: 359 ms


In [184]:
get_kfold_accuracy(X, y, parameters)

0.8293955181721173

In [185]:
parameters = {
    #default
    "objective": "binary:logistic",
    "eta": 0.1,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.6,
    "colsample_bylevel":0.6,
    "colsample_bynode":0.7,
    
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide"
}

results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=10,
                 folds=skf, verbose_eval=10)

[0]	train-auc:0.80444+0.01054	test-auc:0.79140+0.03241
[10]	train-auc:0.84366+0.01167	test-auc:0.81280+0.01651
[16]	train-auc:0.84462+0.00905	test-auc:0.81109+0.01737


In [186]:
get_kfold_accuracy(X, y, parameters)

0.8293955181721173

0.77990
parameters = parameters = {
    #default
    "objective": "binary:logistic",
    "n_estimators": 400,
    "eta": 0.03,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    # regularization parameters
    "max_depth": 3,
    "subsample": 0.9,
    "colsample_bytree": 0.8,
    "gamma": 2,
    "reg_lambda": 4,
    "scale_pos_weight": 0.6,
    
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide",
}
[0]	train-auc:0.79896+0.01104	test-auc:0.79356+0.01737
[63]	train-auc:0.81029+0.01105	test-auc:0.79764+0.02647

0.8215429037725188

In [604]:
parameters = {
    #default
    "objective": "binary:logistic",
    "n_estimators": 11,
    "eta": 0.55,
    "verbosity": 0,
    "nthread": 10,
    "random_seed": 1,
    "eval_metric": "auc",
    # regularization parameters
    "max_depth": 7,
    "subsample": 0.9,
    "colsample_bytree": 0.8,
    'colsample_bylevel':0.8,
    "gamma": 2,
    "reg_lambda": 6,
    "scale_pos_weight": 0.6,
    "min_child_weight": 0.4,
    "max_delta_step": 10,
    #lightgbm approach
    "tree_method": "hist",
    "grow_policy": "lossguide",
}
num_rounds = 500
results = xgb.cv(parameters, xgb_train, num_rounds, early_stopping_rounds=50,
                 folds=skf, verbose_eval=100)

[0]	train-auc:0.79661+0.01351	test-auc:0.78335+0.02134
[86]	train-auc:0.84883+0.01054	test-auc:0.81992+0.01971


In [605]:
get_kfold_accuracy(X, y, parameters)

0.8256332127213865

In [610]:
clf = xgb.XGBClassifier(**parameters, use_label_encoder=False)
bst = clf.fit(X, y)

In [611]:
y_077990 = pd.read_csv('submission_xgboost_077990.csv')

In [612]:
y_pred = bst.predict(test_transformed)
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred
    })
submission.to_csv('submission_xgboost.csv', index=False)
(submission == y_077990).all()

PassengerId     True
Survived       False
dtype: bool

In [613]:
pd.concat([y_077990,submission]).drop_duplicates(keep=False)

Unnamed: 0,PassengerId,Survived
1,893,1
4,896,1
18,910,1
21,913,1
32,924,1
...,...,...
376,1268,0
382,1274,0
383,1275,0
409,1301,0


### Some useful notes

1. Custom objective and evaluation functions – https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py
2. Never use timeseries validation with **xgb.cv**, it is broken :( 
3. Investigate your task and metrics. There are many objective functions that are worth to try.


# LightGBM

### Lightgbm Parameters Tunning


**Core parameters**

Begin with `learning_rate` parameter setting with default value <br>
`n_estimators`(number of boosting iterations) set 10k (some big number).<br>
`num_leaves` (number of leaves in one tree) set 155 as a starting point.<br>
`objective` –  the same as for xgboost `objective` (adding "regression_l1")

**Control Overfitting and Accuracy**

`max_depth` – max tree depth (`default=-1`). Start with tunning `num_leaves`, in the end try to change `max_depth`; <br>
`min_data_in_leaf` – very important parameter that helps to control overfitting (`default=20`);<br>
`colsample_bytree` – select part of features on each iteration (`default=1.0`). Always have to be tunned! <br>
`subsample` – select part of data without resampling (`default=1.0`). To enable it, set `subsample_freq` = 1 (other values always work worse)
`early_stopping_round` – as in xgboost



In [614]:
import lightgbm as lgb
from lightgbm.callback import log_evaluation, early_stopping
callbacks = [early_stopping(50)]
parameters = {
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42
}
n_rounds = 1000

lgb_train = lgb.Dataset(X, label=y, free_raw_data=False)

In [615]:
def l_get_kfold_accuracy(X, y, parameters, n_splits=5):
    kfold = StratifiedKFold(shuffle=True, random_state=42, n_splits=n_splits)
    scores = []
    for train_ix, test_ix in kfold.split(X, y):
        train_X, test_X = X.iloc[train_ix], X.iloc[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        lgbc = lgb.LGBMClassifier(**parameters)
        lgbc.fit(X, y)
        yhat =  lgbc.predict(test_X)
        acc = accuracy_score(test_y, yhat)
        # store score
        scores.append(acc)
    return np.mean(scores)

In [616]:
result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=False)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init

Early stopping, best iteration is:
[28]	cv_agg's auc: 0.867194 + 0.0100888


In [617]:
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": 'auc',
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.9, 
    "subsample_freq": 1,
    "colsample_bytree": 0.8, 
}

result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init

IndexError: list index out of range

In [618]:
l_get_kfold_accuracy(X, y, parameters)



0.8875325334856852

In [619]:
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.1,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.8,
    "subsample_freq": 1,
    "min_data_in_leaf": 10,
}

result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init

[LightGBM] [Fatal] Reducing `min_data_in_leaf` with `feature_pre_filter=true` may cause unexpected behaviour for features that were pre-filtered by the larger `min_data_in_leaf`.
You need to set `feature_pre_filter=false` to dynamically change the `min_data_in_leaf`.


IndexError: list index out of range

In [620]:
l_get_kfold_accuracy(X, y, parameters)



0.916784104614994

In [621]:
callbacks = [log_evaluation(50), early_stopping(100)]
parameters = {
    #default
    "objective": "binary",
    "learning_rate": 0.01,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    
    #regularization
    "colsample_bytree": 0.8,
    "subsample": 0.8,
    "subsample_freq": 1,
    "min_data_in_leaf": 60,
}


result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 96
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> initscore=-0.482098
[LightGBM] [Info] Start training from score -0.4

[50]	cv_agg's train auc: 0.853422 + 0.00573384	cv_agg's valid auc: 0.842269 + 0.0148561


[100]	cv_agg's train auc: 0.856617 + 0.007087	cv_agg's valid auc: 0.842456 + 0.0134122
Early stopping, best iteration is:
[32]	cv_agg's train auc: 0.851532 + 0.00825112	cv_agg's valid auc: 0.843589 + 0.0138112


In [622]:
l_get_kfold_accuracy(X, y, parameters)



0.7862756300387228

In [623]:
print(f"Train auc:      {result['train auc-mean'][-1]:.4f}, std: {result['train auc-stdv'][-1]:.4f}")
print(f"Validation auc: {result['valid auc-mean'][-1]:.4f}, std: {result['valid auc-stdv'][-1]:.4f}")



Train auc:      0.8515, std: 0.0083
Validation auc: 0.8436, std: 0.0138


In [624]:
parameters = {
    #default
    "n_estimators": 300,
    "objective": "binary",
    "learning_rate": 0.01,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    "max_depth":3,
    "num_leaves": 7,
    "max_bin": 512,
    "subsample_for_bin": 200,
    #regularization
    "colsample_bytree": 0.5,
    "subsample": 1,
    "subsample_freq": 1,
    "min_data_in_leaf": 50,
    "scale_pos_weight": scale_pos_weight,
}


result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)

[LightGBM] [Fatal] Cannot change max_bin after constructed Dataset handle.


[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 64
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 64
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 64
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> init

[100]	cv_agg's train auc: 0.860403 + 0.00881894	cv_agg's valid auc: 0.848272 + 0.0163086


[150]	cv_agg's train auc: 0.862964 + 0.00766102	cv_agg's valid auc: 0.850044 + 0.0164212


[200]	cv_agg's train auc: 0.864374 + 0.0079366	cv_agg's valid auc: 0.852421 + 0.0172089
[250]	cv_agg's train auc: 0.867445 + 0.00832917	cv_agg's valid auc: 0.853733 + 0.0169494




[300]	cv_agg's train auc: 0.871898 + 0.00763559	cv_agg's valid auc: 0.857296 + 0.0169383
Did not meet early stopping. Best iteration is:
[300]	cv_agg's train auc: 0.871898 + 0.00763559	cv_agg's valid auc: 0.857296 + 0.0169383


In [625]:
print(f"Train auc:      {result['train auc-mean'][-1]:.4f}, std: {result['train auc-stdv'][-1]:.4f}")
print(f"Validation auc: {result['valid auc-mean'][-1]:.4f}, std: {result['valid auc-stdv'][-1]:.4f}")



Train auc:      0.8719, std: 0.0076
Validation auc: 0.8573, std: 0.0169


In [626]:
l_get_kfold_accuracy(X, y, parameters)



0.8290293912270679

In [628]:
parameters = {
    #default
    "n_estimators": 300,
    "objective": "binary",
    "learning_rate": 0.01,
    "num_threads": 10,
    "metric": "auc",
    "seed": 42,
    "max_depth":3,
    "num_leaves": 7,
    "max_bin": 512,
    "subsample_for_bin": 200,
    #regularization
    "colsample_bytree": 0.5,
    "subsample": 1,
    "subsample_freq": 1,
    "min_data_in_leaf": 50,
    "scale_pos_weight": scale_pos_weight,
    #categorical features
    'cat_smooth': 1,
    'min_data_per_group': 50
}
lgb_train = lgb.Dataset(X_cat, label=y, free_raw_data=False, categorical_feature=['Sex', 'Pclass', 'Embarked'])
result = lgb.cv(parameters, lgb_train, n_rounds, folds=skf, callbacks=callbacks, eval_train_metric=True)



[LightGBM] [Info] Number of positive: 226, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 59
[LightGBM] [Info] Number of data points in the train set: 592, number of used features: 6
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 59
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 6
[LightGBM] [Info] Number of positive: 227, number of negative: 366
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 59
[LightGBM] [Info] Number of data points in the train set: 593, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381757 -> initsco

[100]	cv_agg's train auc: 0.864473 + 0.00687423	cv_agg's valid auc: 0.852166 + 0.0190967


[150]	cv_agg's train auc: 0.865939 + 0.00663507	cv_agg's valid auc: 0.852669 + 0.0187544
[200]	cv_agg's train auc: 0.867381 + 0.00606166	cv_agg's valid auc: 0.853856 + 0.0173208


[250]	cv_agg's train auc: 0.869212 + 0.00656206	cv_agg's valid auc: 0.854378 + 0.0169122


[300]	cv_agg's train auc: 0.871082 + 0.00630595	cv_agg's valid auc: 0.854845 + 0.0172143
Did not meet early stopping. Best iteration is:
[300]	cv_agg's train auc: 0.871898 + 0.00763559	cv_agg's valid auc: 0.857296 + 0.0169383


In [629]:
l_get_kfold_accuracy(X_cat, y, parameters)



0.8166634926680633

In [632]:
lgbc = lgb.LGBMClassifier(**parameters)
lgbf = lgbc.fit(X_cat,y,categorical_feature=['Sex', 'Pclass', 'Embarked'])





In [633]:
X_cat

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked
0,3,0,22.000000,1,0,2.0
1,1,1,38.000000,1,0,0.0
2,3,1,26.000000,0,0,2.0
3,1,1,35.000000,1,0,2.0
4,3,0,35.000000,0,0,2.0
...,...,...,...,...,...,...
886,2,0,27.000000,0,0,2.0
887,1,1,19.000000,0,0,2.0
888,3,1,29.699118,1,2,2.0
889,1,0,26.000000,0,0,0.0


In [634]:
test_transformed_categorical

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked
0,3,0,34.50000,0,0,1
1,3,1,47.00000,1,0,2
2,2,0,62.00000,0,0,1
3,3,0,27.00000,0,0,2
4,3,1,22.00000,1,1,2
...,...,...,...,...,...,...
413,3,0,30.27259,0,0,2
414,1,1,39.00000,0,0,0
415,3,0,38.50000,0,0,2
416,3,0,30.27259,0,0,2


In [637]:
y_pred = lgbf.predict(test_transformed_categorical, num_iteration=200)
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred
    })
submission.to_csv('submission_lightgbm.csv', index=False)

In [638]:
y_pred

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

# CatBoost

### CatBoost parameters tunning

**`loss_function`** – The metric to use in training (the same as objective in xgb/lgb)<br>
**`eval_metric`** – The metric used for overfitting detection (if enabled) and the best model selection (if enabled).<br>
**`iterations`** – The maximum number of trees<br>
**`learning_rate`** – Learing rate :) (default 0.03)<br>
**`random_seed`** – is not set by default. always set to reproduce results.<br>
**`subsample`** – Sample rate for bagging (default 0.66)<br>
**`use_best_model`** – True if a validation set is input (the eval_set parameter is defined) and at least one of the values of objects in this set. False otherwise<br>
**`depth`** – the same as max depth earlier<br>
**`rsm`** – Random subspace method. The percentage of features that can be used at each split selection. (colsample_bylevel)<br>
**`class_weights`** – Classes weights<br>
**`od_type`** – The type of the overfitting detector to use. (better to use Iter)<br>
**`od_wait`** – The number of iterations to continue the training after the metric value.<br>






In [33]:
import catboost as ctb
parameters = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "iterations": 1000,
    "learning_rate": 0.03,
    "random_seed": 42,
    "od_wait": 30,
    "od_type": "Iter",
    "thread_count": 10
}

In [34]:
ctb_data = ctb.Pool(X,y)
result = ctb.cv(ctb_data, parameters, folds=skf, seed=42, verbose_eval=100, plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/3]
0:	test: 0.8063380	best: 0.8063380 (0)	total: 48.4ms	remaining: 48.3s
100:	test: 0.8992077	best: 0.9004548 (91)	total: 243ms	remaining: 2.16s

bestTest = 0.9004548122
bestIteration = 91

Training on fold [1/3]
0:	test: 0.7631425	best: 0.7631425 (0)	total: 2.22ms	remaining: 2.22s

bestTest = 0.8761424289
bestIteration = 3

Training on fold [2/3]
0:	test: 0.8638225	best: 0.8638225 (0)	total: 2.23ms	remaining: 2.23s

bestTest = 0.8718651751
bestIteration = 51



In [35]:
result.loc[result["test-AUC-mean"] == result["test-AUC-mean"].max()]

Unnamed: 0,iterations,test-AUC-mean,test-AUC-std,test-Logloss-mean,test-Logloss-std,train-Logloss-mean,train-Logloss-std
1,1,0.875277,0.011678,0.656741,0.004899,0.654857,0.003038


In [64]:
len(X.columns)
X = X.astype({"remainder__Sex": int, "remainder__Pclass": int})
categorical_features_indices = [X.columns.get_loc('remainder__Sex'), X.columns.get_loc('remainder__Pclass')]

In [65]:
X.dtypes

standardscaler__Age      float16
standardscaler__SibSp    float16
standardscaler__Parch    float16
remainder__Pclass          int64
remainder__Sex             int64
dtype: object

In [66]:
categorical_features_indices

[4, 3]

In [77]:
parameters = {
    #default
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "iterations": 1000,
    "learning_rate": 0.03,
    "random_seed": 42,
    "od_wait": 50,
    "od_type": "Iter",
    "thread_count": 10,
    'depth': 3,
    'l2_leaf_reg': 1.0
}

ctb_data = ctb.Pool(X,y,cat_features=categorical_features_indices)
result = ctb.cv(ctb_data, parameters, folds=skf, seed=42, verbose_eval=100, plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/3]
0:	test: 0.8209360	best: 0.8209360 (0)	total: 3.51ms	remaining: 3.51s
100:	test: 0.8901849	best: 0.8917987 (92)	total: 232ms	remaining: 2.06s
200:	test: 0.8970804	best: 0.8974472 (194)	total: 438ms	remaining: 1.74s

bestTest = 0.9003080986
bestIteration = 238

Training on fold [1/3]
0:	test: 0.7791182	best: 0.7791182 (0)	total: 2.96ms	remaining: 2.96s

bestTest = 0.8492359436
bestIteration = 10

Training on fold [2/3]
0:	test: 0.7981648	best: 0.7981648 (0)	total: 2.82ms	remaining: 2.82s
100:	test: 0.8614462	best: 0.8644074 (58)	total: 219ms	remaining: 1.95s

bestTest = 0.8644073993
bestIteration = 58



In [78]:
result.loc[result["test-AUC-mean"] == result["test-AUC-mean"].max()]

Unnamed: 0,iterations,test-AUC-mean,test-AUC-std,test-Logloss-mean,test-Logloss-std,train-Logloss-mean,train-Logloss-std
238,238,0.869246,0.028093,0.42655,0.033748,0.40158,0.014218
239,239,0.869246,0.028093,0.426548,0.03375,0.401574,0.014228
240,240,0.869246,0.028093,0.426548,0.03375,0.401566,0.014239


In [87]:
clf = ctb.CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="AUC",
    iterations=3000,
    learning_rate=0.01,
    random_seed=42,
    #od_wait=50,
    #od_type="Iter",
    thread_count=10,
    depth=3,
    l2_leaf_reg=1.0)

In [88]:
clf.fit(X, y, 
        cat_features=categorical_features_indices, 
        verbose=True
)

0:	total: 1.03ms	remaining: 3.1s
1:	total: 2.34ms	remaining: 3.5s
2:	total: 3.63ms	remaining: 3.63s
3:	total: 5.2ms	remaining: 3.9s
4:	total: 6.79ms	remaining: 4.07s
5:	total: 7.98ms	remaining: 3.98s
6:	total: 9.17ms	remaining: 3.92s
7:	total: 10.4ms	remaining: 3.88s
8:	total: 11.6ms	remaining: 3.84s
9:	total: 12.4ms	remaining: 3.71s
10:	total: 13.2ms	remaining: 3.57s
11:	total: 14.3ms	remaining: 3.56s
12:	total: 15.3ms	remaining: 3.52s
13:	total: 16.4ms	remaining: 3.49s
14:	total: 17.1ms	remaining: 3.4s
15:	total: 18.2ms	remaining: 3.39s
16:	total: 19.5ms	remaining: 3.42s
17:	total: 20.5ms	remaining: 3.39s
18:	total: 21.8ms	remaining: 3.42s
19:	total: 22.9ms	remaining: 3.41s
20:	total: 23.7ms	remaining: 3.37s
21:	total: 24.3ms	remaining: 3.29s
22:	total: 25.3ms	remaining: 3.28s
23:	total: 26.1ms	remaining: 3.24s
24:	total: 27.1ms	remaining: 3.22s
25:	total: 28.2ms	remaining: 3.22s
26:	total: 29.3ms	remaining: 3.22s
27:	total: 30.4ms	remaining: 3.23s
28:	total: 31.1ms	remaining: 3.18s


397:	total: 375ms	remaining: 2.45s
398:	total: 376ms	remaining: 2.45s
399:	total: 377ms	remaining: 2.45s
400:	total: 378ms	remaining: 2.45s
401:	total: 379ms	remaining: 2.45s
402:	total: 380ms	remaining: 2.44s
403:	total: 380ms	remaining: 2.44s
404:	total: 381ms	remaining: 2.44s
405:	total: 382ms	remaining: 2.44s
406:	total: 383ms	remaining: 2.44s
407:	total: 384ms	remaining: 2.44s
408:	total: 384ms	remaining: 2.44s
409:	total: 385ms	remaining: 2.43s
410:	total: 386ms	remaining: 2.43s
411:	total: 387ms	remaining: 2.43s
412:	total: 388ms	remaining: 2.43s
413:	total: 389ms	remaining: 2.43s
414:	total: 390ms	remaining: 2.43s
415:	total: 391ms	remaining: 2.43s
416:	total: 392ms	remaining: 2.43s
417:	total: 393ms	remaining: 2.43s
418:	total: 394ms	remaining: 2.43s
419:	total: 395ms	remaining: 2.42s
420:	total: 396ms	remaining: 2.42s
421:	total: 397ms	remaining: 2.42s
422:	total: 398ms	remaining: 2.42s
423:	total: 399ms	remaining: 2.42s
424:	total: 400ms	remaining: 2.42s
425:	total: 400ms	re

831:	total: 751ms	remaining: 1.96s
832:	total: 752ms	remaining: 1.96s
833:	total: 753ms	remaining: 1.96s
834:	total: 754ms	remaining: 1.96s
835:	total: 755ms	remaining: 1.95s
836:	total: 756ms	remaining: 1.95s
837:	total: 756ms	remaining: 1.95s
838:	total: 758ms	remaining: 1.95s
839:	total: 759ms	remaining: 1.95s
840:	total: 759ms	remaining: 1.95s
841:	total: 760ms	remaining: 1.95s
842:	total: 761ms	remaining: 1.95s
843:	total: 762ms	remaining: 1.95s
844:	total: 763ms	remaining: 1.95s
845:	total: 764ms	remaining: 1.94s
846:	total: 765ms	remaining: 1.94s
847:	total: 765ms	remaining: 1.94s
848:	total: 766ms	remaining: 1.94s
849:	total: 767ms	remaining: 1.94s
850:	total: 768ms	remaining: 1.94s
851:	total: 769ms	remaining: 1.94s
852:	total: 770ms	remaining: 1.94s
853:	total: 771ms	remaining: 1.94s
854:	total: 772ms	remaining: 1.94s
855:	total: 773ms	remaining: 1.94s
856:	total: 774ms	remaining: 1.93s
857:	total: 775ms	remaining: 1.93s
858:	total: 775ms	remaining: 1.93s
859:	total: 776ms	re

1259:	total: 1.13s	remaining: 1.56s
1260:	total: 1.13s	remaining: 1.55s
1261:	total: 1.13s	remaining: 1.55s
1262:	total: 1.13s	remaining: 1.55s
1263:	total: 1.13s	remaining: 1.55s
1264:	total: 1.13s	remaining: 1.55s
1265:	total: 1.13s	remaining: 1.55s
1266:	total: 1.13s	remaining: 1.55s
1267:	total: 1.13s	remaining: 1.55s
1268:	total: 1.13s	remaining: 1.55s
1269:	total: 1.14s	remaining: 1.55s
1270:	total: 1.14s	remaining: 1.54s
1271:	total: 1.14s	remaining: 1.54s
1272:	total: 1.14s	remaining: 1.54s
1273:	total: 1.14s	remaining: 1.54s
1274:	total: 1.14s	remaining: 1.54s
1275:	total: 1.14s	remaining: 1.54s
1276:	total: 1.14s	remaining: 1.54s
1277:	total: 1.14s	remaining: 1.54s
1278:	total: 1.14s	remaining: 1.54s
1279:	total: 1.15s	remaining: 1.54s
1280:	total: 1.15s	remaining: 1.54s
1281:	total: 1.15s	remaining: 1.54s
1282:	total: 1.15s	remaining: 1.54s
1283:	total: 1.15s	remaining: 1.53s
1284:	total: 1.15s	remaining: 1.53s
1285:	total: 1.15s	remaining: 1.53s
1286:	total: 1.15s	remaining

1698:	total: 1.5s	remaining: 1.15s
1699:	total: 1.5s	remaining: 1.15s
1700:	total: 1.5s	remaining: 1.15s
1701:	total: 1.5s	remaining: 1.15s
1702:	total: 1.51s	remaining: 1.15s
1703:	total: 1.51s	remaining: 1.15s
1704:	total: 1.51s	remaining: 1.15s
1705:	total: 1.51s	remaining: 1.14s
1706:	total: 1.51s	remaining: 1.14s
1707:	total: 1.51s	remaining: 1.14s
1708:	total: 1.51s	remaining: 1.14s
1709:	total: 1.51s	remaining: 1.14s
1710:	total: 1.51s	remaining: 1.14s
1711:	total: 1.51s	remaining: 1.14s
1712:	total: 1.51s	remaining: 1.14s
1713:	total: 1.51s	remaining: 1.14s
1714:	total: 1.52s	remaining: 1.14s
1715:	total: 1.52s	remaining: 1.13s
1716:	total: 1.52s	remaining: 1.13s
1717:	total: 1.52s	remaining: 1.13s
1718:	total: 1.52s	remaining: 1.13s
1719:	total: 1.52s	remaining: 1.13s
1720:	total: 1.52s	remaining: 1.13s
1721:	total: 1.52s	remaining: 1.13s
1722:	total: 1.52s	remaining: 1.13s
1723:	total: 1.52s	remaining: 1.13s
1724:	total: 1.52s	remaining: 1.13s
1725:	total: 1.52s	remaining: 1.

2169:	total: 1.89s	remaining: 723ms
2170:	total: 1.89s	remaining: 723ms
2171:	total: 1.89s	remaining: 722ms
2172:	total: 1.89s	remaining: 721ms
2173:	total: 1.89s	remaining: 720ms
2174:	total: 1.9s	remaining: 719ms
2175:	total: 1.9s	remaining: 718ms
2176:	total: 1.9s	remaining: 717ms
2177:	total: 1.9s	remaining: 716ms
2178:	total: 1.9s	remaining: 715ms
2179:	total: 1.9s	remaining: 714ms
2180:	total: 1.9s	remaining: 714ms
2181:	total: 1.9s	remaining: 713ms
2182:	total: 1.9s	remaining: 712ms
2183:	total: 1.9s	remaining: 711ms
2184:	total: 1.9s	remaining: 710ms
2185:	total: 1.9s	remaining: 709ms
2186:	total: 1.91s	remaining: 708ms
2187:	total: 1.91s	remaining: 707ms
2188:	total: 1.91s	remaining: 707ms
2189:	total: 1.91s	remaining: 707ms
2190:	total: 1.91s	remaining: 706ms
2191:	total: 1.91s	remaining: 705ms
2192:	total: 1.91s	remaining: 704ms
2193:	total: 1.91s	remaining: 703ms
2194:	total: 1.91s	remaining: 702ms
2195:	total: 1.92s	remaining: 701ms
2196:	total: 1.92s	remaining: 700ms
2197

2624:	total: 2.27s	remaining: 324ms
2625:	total: 2.27s	remaining: 323ms
2626:	total: 2.27s	remaining: 322ms
2627:	total: 2.27s	remaining: 321ms
2628:	total: 2.27s	remaining: 320ms
2629:	total: 2.27s	remaining: 319ms
2630:	total: 2.27s	remaining: 318ms
2631:	total: 2.27s	remaining: 318ms
2632:	total: 2.27s	remaining: 317ms
2633:	total: 2.27s	remaining: 316ms
2634:	total: 2.27s	remaining: 315ms
2635:	total: 2.27s	remaining: 314ms
2636:	total: 2.27s	remaining: 313ms
2637:	total: 2.28s	remaining: 312ms
2638:	total: 2.28s	remaining: 312ms
2639:	total: 2.28s	remaining: 311ms
2640:	total: 2.28s	remaining: 310ms
2641:	total: 2.28s	remaining: 309ms
2642:	total: 2.28s	remaining: 308ms
2643:	total: 2.28s	remaining: 307ms
2644:	total: 2.28s	remaining: 306ms
2645:	total: 2.28s	remaining: 306ms
2646:	total: 2.28s	remaining: 305ms
2647:	total: 2.29s	remaining: 304ms
2648:	total: 2.29s	remaining: 303ms
2649:	total: 2.29s	remaining: 302ms
2650:	total: 2.29s	remaining: 301ms
2651:	total: 2.29s	remaining

<catboost.core.CatBoostClassifier at 0x7ff0710dd0d0>

In [89]:
test_transformed = test_transformed.astype({"remainder__Sex": int, "remainder__Pclass": int})
y_pred = clf.predict(test_transformed)
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": y_pred
    })
submission.to_csv('submission_catboost.csv', index=False)

In [90]:
y_pred


array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,

# Summary


- Boosting algorithms are the best for heterogeneous data.
- Lightgbm is the fastest and usually the most accurate;
- CatBoost doesn't need a lot of tuning;
- CatBoost shows the best result on data with many categorical variables;
- Always try three methods and ensemble them;
- Remember about regularization and comparing your training score with validation;
- Use LightGBM for experiments and in the end execute other algorithms;
- Firstly, set default parameters, do feature engineering and then come back to parameters tuning;
- More practice will give you more understanding and intuition;
- Use magic of boosting in real life :)



# Sources:

- Открытый курс машинного обучения. Тема 10. Градиентный бустинг. Часть 1 : https://habrahabr.ru/company/ods/blog/327250/
- Introduction to Boosted Trees: http://xgboost.readthedocs.io/en/latest/model.html
- XGBoost, a Top Machine Learning Method on Kaggle, Explained : https://www.kdnuggets.com/2017/10/xgboost-top-machine-learning-method-kaggle-explained.html
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree: https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
- CatBoost: A machine learning library to handle categorical (CAT) data automatically: https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/
- CatBoost: https://catboost.ai/docs/
- CatBoost tutorials: https://github.com/catboost/tutorials