![](https://i.imgur.com/vdca02S.png)

# Introduction

I've recently been playing around with CatBoost and thought it has some pretty cool features that would be fun to try on this dataset. CatBoost is an open-source ML library that uses gradient boosting on decision trees and has a rich toolkit for things like GPU-based training, model analysis, live metrics and visualization. In this notebook we'll look at CatBoost using the Estonia dataset.

About the data: The dataset contains details such as name, age, gender and fate of the 989 passengers aboard the MS Estonia on the night of the sinking, 28 September 1994. The MS Estonia was a cruise ferry built in 1979/80 and used to the capricious weather of the Baltic Sea, but on the night of the 28th a mixture of mechanical failures and poor decision-making collided, which led to her sinking. The disaster claimed 852 lives, all of which are accounted for in the data.

Note: Unfortunately, CatBoost plots do not render on Kaggle, but they do if you run the notebook yourself.

Find CatBoost at: https://catboost.ai/

Technical introduction: https://catboost.ai/news/catboost-enables-fast-gradient-boosting-on-decision-trees-using-gpus

Also check out Félix Revert's excellent article on CatBoost: https://towardsdatascience.com/why-you-should-learn-catboost-now-390fb3895f76

In [None]:
import numpy as np
import pandas as pd

pd.reset_option('^display.', silent=True)
pd.set_option('mode.chained_assignment', None)

# Load the full dataset
df = pd.read_csv('/kaggle/input/passenger-list-for-the-estonia-ferry-disaster/estonia-passenger-list.csv')

# List the first five passengers and their fate
df.head()

# Feature preparation

First let's check for any null values in the data.

In [None]:
# Check for null values
null_value_stats = df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

Lucky for us, there weren't any. Next let's see what kind of data types we're dealing with. We should either exclude or encode the categorical columns. We'll also delete some of the variables that are not usable for modelling, like name, passenger id and country. These features should have no say in predicting the probability of survival. If they do, some bias has likely crept in.

In [None]:
# Print the data types
print(df.dtypes)

# Delete unused variables
df = df.drop(['PassengerId', 'Country', 'Firstname', 'Lastname'], axis=1)

# Save indices of categorical features (Sex, Category)
categorical_features_indices = np.where(df.dtypes != np.int64)[0]

# Show the final dataframe
df.head()

Lastly we'll split the dataset into two parts: a training and a validation set.

In [None]:
from sklearn.model_selection import train_test_split

# Split X and y
X = df.drop('Survived', axis=1)
y = df.Survived

# Make a train and validation set of the data
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, stratify=y, random_state=0)
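
The stratify=y argument above keeps the share of survivors roughly the same in the training and validation parts. Here is a minimal pure-Python sketch of that idea on hypothetical toy labels. The stratified_split helper is made up for illustration and is not how scikit-learn implements it.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_size=0.8, seed=0):
    """Hypothetical helper: split indices so each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        cut = round(len(indices) * train_size)    # same fraction taken from every class
        train_idx.extend(indices[:cut])
        val_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels (not the real data): 90 non-survivors and 10 survivors
toy_labels = [0] * 90 + [1] * 10
train_idx, val_idx = stratified_split(toy_labels)

train_rate = sum(toy_labels[i] for i in train_idx) / len(train_idx)
val_rate = sum(toy_labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), train_rate, val_rate)
```

Without stratification, a small minority class could by chance end up over- or under-represented in the validation set, which would make the validation accuracy harder to trust.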

# Model training

Now let's create the model itself. We'll go with the default parameters here, as they provide a really good baseline almost all the time. The only thing we should specify is the custom_loss parameter, which lets us track accuracy, the metric we care about, alongside logloss, which is smoother on a dataset of this size.

We'll make the model and fit it on the data. We should see some cool plots from the process.

In [None]:
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

model = CatBoostClassifier(custom_loss=['Accuracy'],
                           random_seed=0,
                           verbose=200)

model.fit(X_train,
          y_train,
          cat_features=categorical_features_indices,
          eval_set=(X_val, y_val),
          plot=True);

As you can see, it is possible to watch our model learn through some nice plots. You can enable logging_level=Verbose for fit() if you'd like more output. We're presented with logloss first for the train and validation sets: the training loss decreases over all iterations, while the validation loss starts increasing after iteration 100.

If we look at accuracy, with a default CatBoost configuration we can see that the best accuracy value of 0.88 (on the training set) was achieved at about iteration 400. The validation accuracy peaks at about iteration 250, after which the model starts overfitting the data.
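
To see why logloss gives a smoother curve than accuracy, here is a small sketch with hypothetical probabilities. The two toy "models" classify every passenger identically, so accuracy cannot tell them apart, but logloss rewards the more confident one. The helper functions are written out by hand for illustration and assume a 0.5 decision threshold.

```python
import math

def logloss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def accuracy(y_true, probs, threshold=0.5):
    """Share of correct class predictions at the given threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    return sum(int(pred == y) for pred, y in zip(preds, y_true)) / len(y_true)

# Toy labels and predicted survival probabilities (not from the real model)
y_true = [0, 0, 1, 1]
hesitant = [0.4, 0.4, 0.6, 0.6]
confident = [0.1, 0.1, 0.9, 0.9]

acc_hesitant, acc_confident = accuracy(y_true, hesitant), accuracy(y_true, confident)
ll_hesitant, ll_confident = logloss(y_true, hesitant), logloss(y_true, confident)
print(acc_hesitant, acc_confident)
print(round(ll_hesitant, 4), round(ll_confident, 4))
```

Accuracy only moves when a prediction crosses the threshold, so on a small validation set it changes in visible jumps, while logloss reacts to every small shift in the predicted probabilities.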

# Model cross-validation

We can run cross-validation as well with some nice plots. We set the loss function to Logloss, which tells CatBoost to record the log loss values at each iteration. We use a fold count of 5 here with shuffling on. You can set verbose here too to get more details. When you run the block, it will plot the graphs live, which is pretty cool. We use the dict update() method here to change the parameters retrieved from the previous model with get_params().

In [None]:
# Set log loss as the loss function
cv_params = model.get_params()
cv_params.update({ 'loss_function': 'Logloss' })

cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    fold_count=5,
    plot=True
)

The cv_data variable contains the result of the CV process. Let's view the mean test accuracy and the best iteration. Overfitting sets in when using CV as well: the best iteration when measuring validation accuracy was iteration 162. This means that we should ideally have stopped training there, since the validation error does not improve beyond this point. Another approach is early stopping, which stops training once validation accuracy no longer improves. We will see that next.

In [None]:
# Print the best validation accuracy and its iteration
print('Best validation accuracy score: {:.2f}±{:.2f} at step {}'.format(
    np.max(cv_data['test-Accuracy-mean']),
    cv_data['test-Accuracy-std'][np.argmax(cv_data['test-Accuracy-mean'])],
    np.argmax(cv_data['test-Accuracy-mean'])
))
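
The mean and std columns used above are aggregated across folds for each iteration. The sketch below reproduces that bookkeeping on hypothetical per-fold accuracy curves. The numbers are made up, and using the population standard deviation is an assumption for illustration rather than a statement about what CatBoost computes internally.

```python
from statistics import mean, pstdev

# Toy validation accuracy per fold (rows) and iteration (columns), not real results
fold_accuracy = [
    [0.80, 0.84, 0.86, 0.85],
    [0.82, 0.85, 0.88, 0.86],
    [0.78, 0.83, 0.84, 0.85],
]

# Transpose so each entry holds all fold scores for one iteration
per_iteration = list(zip(*fold_accuracy))
acc_mean = [mean(scores) for scores in per_iteration]
acc_std = [pstdev(scores) for scores in per_iteration]  # assumption: population std

# Best iteration = the one with the highest mean accuracy across folds
best_iteration = max(range(len(acc_mean)), key=acc_mean.__getitem__)
print('Best: {:.2f}±{:.2f} at step {}'.format(
    acc_mean[best_iteration], acc_std[best_iteration], best_iteration))
```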

# Explicit model params

For convenience, you can define the params explicitly and easily change them before modelling. The code below sets some useful parameters and fits the model without plotting. It uses a learning rate of 0.1 and 500 iterations. If you have a validation set, you can set *use_best_model* to True during training. This way, the resulting tree ensemble is shrunk to the best iteration. Note that the code below leaves it at False, so every trained tree is kept.

In [None]:
from time import time
params = {
    'iterations': 500,
    'learning_rate': 0.1,
    'eval_metric': 'Accuracy',
    'random_seed': 0,
    'verbose': 200,
    'use_best_model': False
}

train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
validate_pool = Pool(X_val, y_val, cat_features=categorical_features_indices)

no_stop_model = CatBoostClassifier(**params)
t0 = time()
no_stop_model.fit(train_pool, eval_set=validate_pool)
print("Training time no stopping:", round(time()-t0, 4), "s")

# Early stopping

CatBoost offers an **overfitting detector** (early stopping) feature that stops training when validation accuracy stops improving. Here we use the Iter detector type and wait a maximum of 40 iterations after the best one before stopping.

In [None]:
params.update({
    'od_type': 'Iter',
    'od_wait': 40
})
early_stop_model = CatBoostClassifier(**params)
t0 = time()
early_stop_model.fit(train_pool, eval_set=validate_pool);
print("Training time early stopping:", round(time()-t0, 4), "s")

Notice we cut the training time down by 1/10 by enabling early stopping! Let's see how the two models compare in terms of complexity (tree count) and performance (accuracy). It turns out that you actually get more for less in this case.

In [None]:
print(f'No stop model tree count: {no_stop_model.tree_count_}')
print(f'No stop model validation accuracy: {accuracy_score(y_val, no_stop_model.predict(X_val))}')
print()
print(f'Early stop model tree count: {early_stop_model.tree_count_}')
print(f'Early stop model validation accuracy: {accuracy_score(y_val, early_stop_model.predict(X_val))}')
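
The stopping rule behind od_type='Iter' can be pictured as a simple patience counter. The sketch below is a simplified illustration of that idea on a made-up accuracy curve, not CatBoost's actual implementation. The early_stop_iteration helper is hypothetical.

```python
def early_stop_iteration(scores, wait):
    """Hypothetical helper: return (stop_iteration, best_iteration) for a higher-is-better metric.

    Training stops once `wait` iterations have passed without beating the best score.
    """
    best_score = float('-inf')
    best_iter = -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_iter = score, i   # new best: reset the patience window
        elif i - best_iter >= wait:
            return i, best_iter                # waited long enough, stop here
    return len(scores) - 1, best_iter          # never triggered: ran to the end

# Toy validation accuracy per iteration (not real results)
toy_scores = [0.80, 0.84, 0.86, 0.85, 0.86, 0.85, 0.84, 0.83]
stop_iter, best_iter = early_stop_iteration(toy_scores, wait=3)
print(f'Stopped at iteration {stop_iter}, best iteration was {best_iter}')
```

In this toy run the last two iterations are never computed, which is where the time saving comes from. Shrinking the ensemble back to the best iteration is a separate step, controlled by *use_best_model*.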

# The snapshot feature

It's possible to save snapshots during training with CatBoost. You can use them for recovering training after an interruption or for starting training with previous results. During the training, CatBoost makes these snapshots, which are backup copies of intermediate results. The next time you train, it will simply pick up where it left off. In this case, the completed iterations of building trees don't need to be repeated.

In [None]:
params = {
    'iterations': 5,
    'eval_metric': 'Accuracy',
    'random_seed': 0,
    'verbose': 200
}
model_snapshot = CatBoostClassifier(**params)
model_snapshot.fit(train_pool, eval_set=validate_pool, save_snapshot=True)

params.update({
    'iterations': 10,
    'learning_rate': 0.1,
})
model_snapshot = CatBoostClassifier(**params)
model_snapshot.fit(train_pool, eval_set=validate_pool, save_snapshot=True)
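
The resume behaviour can be sketched as a generic checkpoint pattern. The toy below only counts iterations and stores its progress in a JSON file. The file name, the state layout and the train_with_snapshot helper are all hypothetical and say nothing about CatBoost's real snapshot format.

```python
import json
import os
import tempfile

def train_with_snapshot(total_iterations, snapshot_path):
    """Hypothetical helper: run up to total_iterations, resuming from a snapshot if one exists."""
    if os.path.exists(snapshot_path):
        with open(snapshot_path) as f:
            state = json.load(f)           # pick up where the last run left off
    else:
        state = {'completed': 0}           # fresh start

    newly_run = 0
    for _ in range(state['completed'], total_iterations):
        state['completed'] += 1            # stand-in for building one tree
        newly_run += 1
        with open(snapshot_path, 'w') as f:
            json.dump(state, f)            # back up progress after each iteration
    return state['completed'], newly_run

snapshot_path = os.path.join(tempfile.mkdtemp(), 'toy_snapshot.json')
first_run = train_with_snapshot(5, snapshot_path)    # trains iterations 1-5
second_run = train_with_snapshot(10, snapshot_path)  # only trains iterations 6-10
print(first_run, second_run)
```

This mirrors the two fit() calls above: the second call asks for 10 iterations but only has to do the 5 that are missing.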

# Feature importance

Since CatBoost is a tree-based library, it comes with a feature importance attribute out of the box. This is great for understanding why the model predicts the way it does and can also help you select the best features, if you were to re-train the model. Apparently **Age** is slightly more important than **Sex**, whereas **Category** is of less significance.

In [None]:
model = CatBoostClassifier(iterations=50, random_seed=0, logging_level='Silent').fit(train_pool)
feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))
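
For some intuition about what an importance score measures, here is a sketch of permutation importance, which is a different and simpler technique than the one CatBoost uses by default. It shuffles one feature at a time and checks how much accuracy drops. The rows and the toy_predict rule are made up for illustration and stand in for a trained model.

```python
import random

def toy_predict(row):
    # Hypothetical rule standing in for a trained model; it ignores Category
    return 1 if row['Sex'] == 'M' and row['Age'] < 40 else 0

def accuracy(rows, labels):
    return sum(toy_predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(rows, labels, feature, seed=0):
    """Accuracy drop after shuffling one feature column across the rows."""
    rng = random.Random(seed)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
    return accuracy(rows, labels) - accuracy(permuted, labels)

# Toy passengers (not from the real data)
rows = [
    {'Sex': 'M', 'Age': 25, 'Category': 'P'},
    {'Sex': 'F', 'Age': 31, 'Category': 'P'},
    {'Sex': 'M', 'Age': 62, 'Category': 'C'},
    {'Sex': 'F', 'Age': 58, 'Category': 'P'},
    {'Sex': 'M', 'Age': 33, 'Category': 'C'},
    {'Sex': 'F', 'Age': 22, 'Category': 'C'},
    {'Sex': 'M', 'Age': 47, 'Category': 'P'},
    {'Sex': 'M', 'Age': 19, 'Category': 'P'},
]
labels = [toy_predict(r) for r in rows]  # labels follow the toy rule exactly

importances = {f: permutation_importance(rows, labels, f) for f in ('Sex', 'Age', 'Category')}
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print('{}: {}'.format(name, score))
```

A feature the model never looks at, like Category in this toy rule, always ends up with an importance of zero.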

# Model predictions

Let's find the survivors. CatBoost offers the well-known predict() and predict_proba() methods to make class and probability predictions. Apparently, our model thinks that the first 10 passengers in the validation set all died. Here, passenger 10 came closest to surviving with a probability of 32%.

In [None]:
predictions = model.predict(X_val)
predictions_probs = model.predict_proba(X_val)
print(f'Predictions of classes: {predictions[:10]}')
print(f'Prediction of probs: {predictions_probs[:10]}')
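
The link between the two outputs is easy to sketch: each row of probabilities holds one value per class, and the predicted class is the more likely one. The snippet below assumes a binary problem with a 0.5 decision threshold and uses made-up probability rows, apart from the 32% survival chance mentioned above.

```python
# Toy rows in the layout [P(died), P(survived)]; values are illustrative only
toy_probs = [
    [0.91, 0.09],
    [0.85, 0.15],
    [0.68, 0.32],
]

# Class prediction: 1 (survived) only if the survival probability exceeds 0.5
toy_classes = [1 if p_survived > 0.5 else 0 for _, p_survived in toy_probs]

# The passenger who came closest to surviving has the highest P(survived)
closest = max(range(len(toy_probs)), key=lambda i: toy_probs[i][1])

print(f'Classes: {toy_classes}')
print(f'Closest to surviving: row {closest} with {toy_probs[closest][1]:.0%}')
```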