# Table of Contents

* [Target Exploration](#1)
* [Numerical Features](#2)
* [Categorical Features / Feature Engineering](#3)
* [Test Set vs Train Set](#4)
* [Target vs Features](#5)
* [Build GBM Model](#6)
* [Build Random Forest Model](#7)
* [Predict on Test Set & Submission](#8)
* [Explanations for GBM model](#9)

In [None]:
# packages

# standard
import numpy as np
import pandas as pd
import time

# plots
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.mosaicplot import mosaic

# missing values visualization
import missingno as msno

# machine learning tools
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator, H2ORandomForestEstimator, H2OGradientBoostingEstimator

In [None]:
# load data + first glance
df_train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
df_test = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')
df_sub = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv')

# first glance (training data)
df_train.head()

In [None]:
# dimensions
print('Train Set:', df_train.shape)
print('Test Set :', df_test.shape)

In [None]:
# structure
df_train.info()

#### We have quite a few missing values here!

In [None]:
# show structure of missing values
msno.matrix(df_train)
plt.show()

In [None]:
# fill missing values in the Cabin feature with a placeholder level (dummy imputation)
df_train.Cabin = df_train.Cabin.fillna('0000')
df_test.Cabin = df_test.Cabin.fillna('0000')
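
To put numbers on what the matrix plot shows, the share of missing values per column can be tabulated before imputing. This is a minimal sketch on a toy frame: the values are made up for illustration and `missing_share` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np
import pandas as pd

# toy stand-in for df_train (values are illustrative only)
df_toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 35.0, np.nan],
    'Cabin': [np.nan, 'C123', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 8.05, 53.10],
})

def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

share = missing_share(df_toy)
print(share)

# dummy imputation as in the cell above: a placeholder level marks "unknown"
df_toy['Cabin'] = df_toy['Cabin'].fillna('0000')
```

On the real data the call would be `missing_share(df_train)`. Note that the placeholder turns "missing" into its own category instead of guessing a value.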

<a id='1'></a>
# Target Exploration

In [None]:
# basic stats
print(df_train.Survived.value_counts())
df_train.Survived.value_counts().plot(kind='bar')
plt.grid()
plt.show()

#### Nice, the two classes are reasonably balanced, so there is no balancing issue here.

<a id='2'></a>
# Numerical Features

In [None]:
features_num = ['Age', 'SibSp', 'Parch', 'Fare']

In [None]:
# basic summary stats
df_train[features_num].describe(percentiles=[0.01,0.1,0.25,0.5,0.75,0.9,0.99])

In [None]:
# plot distribution of numerical features
for f in features_num:
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10,6), sharex=True)
    ax1.hist(df_train[f], bins=30)
    ax1.grid()
    ax1.set_title(f)
    # the boxplot cannot handle NaNs, so drop them first
    feature_wo_nan = df_train[f].dropna()
    ax2.boxplot(feature_wo_nan, vert=False)
    ax2.grid()
    ax2.set_title(f + ' - boxplot')
    plt.show()

### Correlations

In [None]:
corr_pearson = df_train[features_num].corr(method='pearson')
corr_spearman = df_train[features_num].corr(method='spearman')

plt.figure(figsize=(15,5))
ax1 = plt.subplot(1,2,1)
sns.heatmap(corr_pearson, annot=True, cmap='RdYlGn', vmin=-1, vmax=+1)
plt.title('Pearson Correlation')

ax2 = plt.subplot(1,2,2, sharex=ax1)
sns.heatmap(corr_spearman, annot=True, cmap='RdYlGn', vmin=-1, vmax=+1)
plt.title('Spearman Correlation')
plt.show()

In [None]:
# pairwise scatter plot of numerical features
t1 = time.time()
sns.pairplot(df_train[features_num],
             diag_kws = {'alpha': 1.0},
             plot_kws = {'alpha': 0.1})
plt.show()
t2 = time.time()
print('Elapsed time:', np.round(t2-t1,2))

<a id='3'></a>
# Categorical Features / Feature Engineering

In [None]:
features_cat = ['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [None]:
# explicit conversions (applied to both sets to keep them consistent)
df_train.Pclass = df_train.Pclass.astype('object')
df_test.Pclass = df_test.Pclass.astype('object')

In [None]:
# summary stats
df_train[features_cat].describe(include='all')

#### Name, Ticket and Cabin have too many levels. We have to look at them separately...

In [None]:
features_cat_4plot = ['Pclass', 'Sex', 'Embarked']

In [None]:
# plot distribution of categorical features
for f in features_cat_4plot:
    plt.figure(figsize=(8,4))
    df_train[f].value_counts().plot(kind='bar')
    plt.title(f)
    plt.grid()
    plt.show()

In [None]:
features_cat_too_many = ['Name', 'Ticket', 'Cabin']

In [None]:
# show frequency counts for the features with many levels
for f in features_cat_too_many:
    print('FEATURE', f, ':')
    print(df_train[f].value_counts())
    print()

### Let's try some feature engineering

In [None]:
# prefix of cabin could be useful
df_train['CabinPrefix'] = df_train.Cabin.apply(lambda x : x[0])
df_test['CabinPrefix'] = df_test.Cabin.apply(lambda x : x[0])

df_train.CabinPrefix.value_counts()

In [None]:
# check for test set as well
df_test.CabinPrefix.value_counts()

In [None]:
df_train['FirstName'] = df_train.Name.map(lambda x: x.split(', ')[1])
df_test['FirstName'] = df_test.Name.map(lambda x: x.split(', ')[1])

In [None]:
df_train.FirstName.value_counts()

In [None]:
df_test.FirstName.value_counts()

In [None]:
df_train['LastName'] = df_train.Name.map(lambda x: x.split(',')[0])
df_test['LastName'] = df_test.Name.map(lambda x: x.split(',')[0])

In [None]:
df_train.LastName.value_counts()

In [None]:
df_test.LastName.value_counts()
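
The `split(', ')[1]` lookups above raise an `IndexError` for any name without a comma. A vectorised variant returns a missing value for such rows instead. The sketch below uses made-up names and assumes they follow the `LastName, FirstName` pattern; `split_name` is a hypothetical helper.

```python
import pandas as pd

def split_name(names):
    """Split 'Last, First' strings into two columns; rows without a comma get a missing FirstName."""
    parts = names.str.split(', ', n=1, expand=True)
    parts = parts.reindex(columns=[0, 1])  # guarantee that both columns exist
    parts.columns = ['LastName', 'FirstName']
    return parts

names = pd.Series(['Oconnor, Frankie', 'Bryan, Drew', 'NoCommaHere'])
name_parts = split_name(names)
print(name_parts)

# cabin prefix without apply/lambda: first character of each string
cabins = pd.Series(['C12239', '0000', 'A7253'])
print(cabins.str[0].tolist())
```

On the notebook's data this would read `df_train[['LastName', 'FirstName']] = split_name(df_train.Name)`, and likewise for `df_test`.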

<a id='4'></a>
# Test Set vs Train Set

In [None]:
# basic stats for numerical features for training
df_train[features_num].describe()

In [None]:
# and the same for test set
df_test[features_num].describe()

#### We observe quite a few differences between train and test set, e.g. lower age and higher SibSp in the test set.

#### Let's explore the age feature a little bit more:

In [None]:
# compare age distributions
plt.figure(figsize=(10,4))
plt.hist(df_train.Age, bins=20, alpha=0.5, label='Train')
plt.hist(df_test.Age, bins=20, alpha=0.5, label='Test')
plt.title('Age - Train vs Test')
plt.legend()
plt.grid()
plt.show()

#### => The age distributions differ markedly. In particular, the test set contains many more individuals aged 20-30, a group with a low survival probability, as we will see later. Ceteris paribus, we would therefore expect a lower survival rate on the test set.
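
The visual impression from the histograms can be backed by a single number. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) with NumPy only. The data is synthetic and `ks_statistic` is a hypothetical helper, so the printed values say nothing about the competition data.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs (NaNs are dropped)."""
    a = np.sort(a[~np.isnan(a)])
    b = np.sort(b[~np.isnan(b)])
    grid = np.concatenate([a, b])                       # evaluate both CDFs at every observed value
    cdf_a = np.searchsorted(a, grid, side='right') / a.size
    cdf_b = np.searchsorted(b, grid, side='right') / b.size
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
age_train = rng.normal(38, 18, 5000).clip(0, 90)        # synthetic, not the competition data
age_test = rng.normal(30, 14, 5000).clip(0, 90)

print('KS statistic (shifted):', round(ks_statistic(age_train, age_test), 3))
print('KS statistic (same)   :', round(ks_statistic(age_train, age_train), 3))
```

For the notebook's data the call would be `ks_statistic(df_train.Age.to_numpy(), df_test.Age.to_numpy())`. A value near 0 means the distributions agree, a value near 1 means they barely overlap. `scipy.stats.ks_2samp` reports the same statistic together with a p-value.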

In [None]:
# compare gender distribution
print('Train Set:')
print(df_train.Sex.value_counts(normalize=True))
print()
print('Test Set:')
print(df_test.Sex.value_counts(normalize=True))

#### => The share of males is considerably higher in the test set (69.8% vs 56.1%). Males have a much lower probability of survival, so, as with age, we can expect the survival rate to differ between test and train set, especially because sex is the most important feature (see below).
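
The ceteris paribus argument can be written out as arithmetic: hold the survival rate per group fixed at its training value and reweight it by the group shares in the test set. The sketch below uses toy frames with invented values, so the resulting numbers are illustrative only.

```python
import pandas as pd

# toy training data (illustrative values, not the competition data)
train = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 1, 1, 0],
})
# toy test data: the target is unknown, but the sex mix is observable
test = pd.DataFrame({'Sex': ['male'] * 7 + ['female'] * 3})

rate_by_sex = train.groupby('Sex')['Survived'].mean()     # P(survived | sex) in train
share_test = test['Sex'].value_counts(normalize=True)     # P(sex) in test

# expected test survival rate if only the sex mix changes
expected_rate_test = (rate_by_sex * share_test).sum()

print('Train survival rate        :', train['Survived'].mean())
print('Expected test survival rate:', round(expected_rate_test, 4))
```

In this toy case the rate drops from 0.5 to 0.4 purely because males make up 70% of the test rows. The same two lines applied to `df_train` and `df_test` would give the corresponding estimate for the real data, under the assumption that the per-group rates carry over.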

#### Let's also check the **combined** effect of sex and age:

In [None]:
# let's first add a binned version of age
df_train['Age_bin10'] = pd.cut(df_train.Age, [0,10,20,30,40,50,60,70,80,90])
df_test['Age_bin10'] = pd.cut(df_test.Age, [0,10,20,30,40,50,60,70,80,90])

plt.figure(figsize=(16,4))
ax1 = plt.subplot(1,2,1)
foo = df_train.Age_bin10.value_counts().sort_index()
plt.bar(x=foo.index.astype(str), height=foo.values)
plt.grid()
plt.title('Age binned - Train Set')

ax2 = plt.subplot(1,2,2, sharex=ax1, sharey=ax1)
foo = df_test.Age_bin10.value_counts().sort_index()
plt.bar(x=foo.index.astype(str), height=foo.values)
plt.grid()
plt.title('Age binned - Test Set')
plt.show()

In [None]:
# calc cross tables Sex/Age[binned]
tab_sex_age_train = pd.crosstab(df_train.Sex, df_train.Age_bin10)
tab_sex_age_test = pd.crosstab(df_test.Sex, df_test.Age_bin10)

# and visualize
plt.figure(figsize=(14,7))
ax1 = plt.subplot(2,1,1)
sns.heatmap(tab_sex_age_train, cmap='Blues', 
            annot=True, fmt='d',
            vmin=0, vmax=30000,
            linecolor='black',
            linewidths=0.1)
plt.title('Age/Sex - Train Set')

ax2 = plt.subplot(2,1,2)
plt.subplots_adjust(hspace=0.35)
sns.heatmap(tab_sex_age_test, cmap='Blues',
            annot=True, fmt='d',
            vmin=0, vmax=30000,
            linecolor='black',
            linewidths=0.1)
plt.title('Age/Sex - Test Set')
plt.show()

#### The Pclass feature also shows a clearly different distribution between train and test set:

In [None]:
plt.figure(figsize=(14,4))
ax1 = plt.subplot(1,2,1)
foo = df_train.Pclass.value_counts().sort_index()
plt.bar(x=foo.index.astype(str), height=foo.values)
plt.grid()
plt.title('Pclass - Train Set')

ax2 = plt.subplot(1,2,2, sharex=ax1, sharey=ax1)
foo = df_test.Pclass.value_counts().sort_index()
plt.bar(x=foo.index.astype(str), height=foo.values)
plt.grid()
plt.title('Pclass - Test Set')
plt.show()

<a id='5'></a>
# Target vs Features

### Numerical Features

In [None]:
# plot target vs BINNED numerical features using mosaic plot
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

for f in ['Age', 'Fare']:
    # add binned version of each numerical feature first
    new_var = f + '_bin'
    df_train[new_var] = pd.qcut(df_train[f], 10)
    # then create mosaic plot
    plt.rcParams['figure.figsize'] = (16,6) # increase plot size for mosaics
    mosaic(df_train, [new_var, 'Survived'], title='Target vs ' + f + ' [binned]')
    plt.show()
    
# reset plot size again
plt.rcParams['figure.figsize'] = plt_para_save

In [None]:
# plot target vs (discrete) numerical features using mosaic plot
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

for f in ['SibSp', 'Parch']:
    plt.rcParams['figure.figsize'] = (16,6) # increase plot size for mosaics
    mosaic(df_train, [f, 'Survived'], title='Target vs ' + f)
    plt.show()
    
# reset plot size again
plt.rcParams['figure.figsize'] = plt_para_save

### Categorical Features

In [None]:
# plot target vs features using mosaic plot
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

for f in features_cat_4plot:
    plt.rcParams['figure.figsize'] = (16,6) # increase plot size for mosaics
    mosaic(df_train, [f, 'Survived'], title='Target vs ' + f)
    plt.show()
    
# reset plot size again
plt.rcParams['figure.figsize'] = plt_para_save

#### Sex has a strong impact. We will see later that it is our most important feature.

In [None]:
# check our new cabin prefix feature as well
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

plt.rcParams['figure.figsize'] = (16,6) # increase plot size for mosaics
mosaic(df_train, ['CabinPrefix', 'Survived'], title='Target vs CabinPrefix')
plt.show()
    
# reset plot size again
plt.rcParams['figure.figsize'] = plt_para_save

#### OK, the cabin prefix also seems to make a measurable difference!

#### Let's check the names (most frequent only):

In [None]:
# plot target vs features using mosaic plot
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

plt.rcParams['figure.figsize'] = (16,6) # increase plot size for mosaics

name_list = df_train.FirstName.value_counts()[0:15].index.tolist()
df_temp = df_train[df_train.FirstName.isin(name_list)]
mosaic(df_temp, ['FirstName', 'Survived'], title='Target vs FirstName (Top 15)')
plt.show()

name_list = df_train.LastName.value_counts()[0:15].index.tolist()
df_temp = df_train[df_train.LastName.isin(name_list)]
mosaic(df_temp, ['LastName', 'Survived'], title='Target vs LastName (Top 15)')
plt.show()

# reset plot size again
plt.rcParams['figure.figsize'] = plt_para_save

<a id='6'></a>
# Build GBM Model

In [None]:
# select predictors
predictors = features_num + features_cat_4plot
predictors = predictors + ['Ticket', 'CabinPrefix', 'FirstName', 'LastName']
print('Number of predictors: ', len(predictors))
print(predictors)

In [None]:
# start H2O
h2o.init(max_mem_size='12G', nthreads=4) # Use maximum of 12 GB RAM and 4 cores

In [None]:
# upload data frames in H2O environment
t1 = time.time()
train_hex = h2o.H2OFrame(df_train)
test_hex = h2o.H2OFrame(df_test)
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

# force categorical target
train_hex['Survived'] = train_hex['Survived'].asfactor()

In [None]:
# fit Gradient Boosting model
n_cv = 5

fit_GBM = H2OGradientBoostingEstimator(ntrees=100,
                                       max_depth=7,
                                       min_rows=15,
                                       learn_rate=0.1, # default: 0.1
                                       sample_rate=0.8,
                                       col_sample_rate=0.4,
                                       nfolds=n_cv,
                                       score_each_iteration=True,
                                       stopping_metric='auc',
                                       stopping_rounds=5,
                                       stopping_tolerance=0.0001,
                                       seed=999)
# train model
t1 = time.time()
fit_GBM.train(x=predictors,
              y='Survived',
              training_frame=train_hex)
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

In [None]:
# show cross validation metrics
fit_GBM.cross_validation_metrics_summary()

In [None]:
# show scoring history - training vs cross validations
for i in range(n_cv):
    cv_model_temp = fit_GBM.cross_validation_models()[i]
    df_cv_score_history = cv_model_temp.scoring_history() # score_history() is the deprecated alias
    my_title = 'CV ' + str(1+i) + ' - Scoring History [AUC]'
    plt.scatter(df_cv_score_history.number_of_trees,
                y=df_cv_score_history.training_auc, 
                c='blue', label='training')
    plt.scatter(df_cv_score_history.number_of_trees,
                y=df_cv_score_history.validation_auc, 
                c='darkorange', label='validation')
    plt.title(my_title)
    plt.xlabel('Number of Trees')
    plt.ylabel('AUC')
    plt.ylim(0.8,0.9)
    plt.legend()
    plt.grid()
    plt.show()

In [None]:
# variable importance
fit_GBM.varimp_plot()

In [None]:
# alternative variable importance using SHAP => see direction as well as severity of feature impact
t1 = time.time()
fit_GBM.shap_summary_plot(train_hex);
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

### Check performance on training data / cross validations

In [None]:
# training performance
perf_train = fit_GBM.model_performance(train=True)
perf_train.plot()

In [None]:
# cross validation performance
perf_cv = fit_GBM.model_performance(xval=True)
perf_cv.plot()

In [None]:
# predict on train set (extract probabilities only)
pred_train_GBM = fit_GBM.predict(train_hex)['p1']
pred_train_GBM = pred_train_GBM.as_data_frame().p1

# plot train set predictions (probabilities)
plt.figure(figsize=(8,4))
plt.hist(pred_train_GBM, bins=100)
plt.title('Predictions on Train Set - GBM')
plt.grid()
plt.show()

In [None]:
# calibration
n_actual = sum(df_train.Survived)
n_pred_GBM = sum(pred_train_GBM)

print('Actual Frequency    :', n_actual)
print('Predicted Frequency :', n_pred_GBM)
print('Calibration Ratio   :', n_pred_GBM / n_actual)
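
The ratio above checks calibration only in aggregate: over- and under-prediction in different probability ranges can cancel out. A binned comparison of mean predicted probability and observed rate is a common complement. The sketch below runs on synthetic data that is calibrated by construction; `calibration_table` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def calibration_table(y_true, p_pred, n_bins=10):
    """Compare mean predicted probability and observed rate within equal-width probability bins."""
    df = pd.DataFrame({'y': y_true, 'p': p_pred})
    df['bin'] = pd.cut(df['p'], bins=np.linspace(0, 1, n_bins + 1), include_lowest=True)
    return df.groupby('bin', observed=True).agg(n=('y', 'size'),
                                                mean_pred=('p', 'mean'),
                                                observed_rate=('y', 'mean'))

rng = np.random.default_rng(1)
p = rng.uniform(0, 1, 20000)     # synthetic probabilities
y = rng.binomial(1, p)           # outcomes drawn from them, i.e. calibrated by construction

tab = calibration_table(y, p)
print(tab)
print('Global calibration ratio:', round(p.sum() / y.sum(), 3))
```

With the notebook's objects the call would be `calibration_table(df_train.Survived, pred_train_GBM)`. Keep in mind that predictions on the training rows tend to look better calibrated than predictions on unseen data.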

In [None]:
# convert to 0/1
binary_threshold_GBM = 0.485945 # chosen so that the actual frequency is (approximately) met
pred_train_GBM_binary = np.where(pred_train_GBM > binary_threshold_GBM, 1, 0)
print('Actual Frequency      :', n_actual)
print('Calibrated Prediction :', sum(pred_train_GBM_binary))
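
The hard-coded threshold could also be derived from the scores: sort the probabilities and cut between the k-th and (k+1)-th largest value, where k is the number of positives to reproduce. A minimal sketch with made-up scores follows; `threshold_for_count` is a hypothetical helper and not how the value above was necessarily obtained.

```python
import numpy as np

def threshold_for_count(probs, n_positive):
    """Return a cut-off t such that (probs > t) flags n_positive rows (exact when scores are untied)."""
    probs = np.asarray(probs, dtype=float)
    n_positive = int(n_positive)
    if not 0 < n_positive < probs.size:
        raise ValueError('n_positive must be between 1 and len(probs) - 1')
    ordered = np.sort(probs)[::-1]     # descending
    # midpoint between the last score kept and the first score dropped
    return (ordered[n_positive - 1] + ordered[n_positive]) / 2

probs = np.array([0.91, 0.15, 0.62, 0.48, 0.33, 0.77, 0.05, 0.52])   # made-up scores
t = threshold_for_count(probs, n_positive=3)
print('Threshold:', t, '-> flagged:', int((probs > t).sum()))
```

For the training set this would be `threshold_for_count(pred_train_GBM, n_actual)`. On the test set, where the target is unknown, the rounded sum of predicted probabilities can serve as the target count.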

In [None]:
# confusion matrix at selected threshold
pd.crosstab(df_train.Survived, pred_train_GBM_binary)

<a id='7'></a>
# Build Random Forest Model

In [None]:
# Random Forest model
n_cv = 5

fit_DRF = H2ORandomForestEstimator(nfolds=n_cv,
                                  distribution='bernoulli',
                                  ntrees=100,
                                  mtries=-1, # automatic selection
                                  max_depth=20,
                                  score_each_iteration=True,
                                  stopping_metric='auc',
                                  stopping_rounds=5,
                                  stopping_tolerance=0.0001,
                                  seed=999)

# train model
t1 = time.time()
fit_DRF.train(x=predictors,
            y='Survived',
            training_frame=train_hex)
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

In [None]:
# show cross validation metrics
fit_DRF.cross_validation_metrics_summary()

In [None]:
# show scoring history - training vs cross validations
for i in range(n_cv):
    cv_model_temp = fit_DRF.cross_validation_models()[i]
    df_cv_score_history = cv_model_temp.scoring_history() # score_history() is the deprecated alias
    my_title = 'CV ' + str(1+i) + ' - Scoring History [AUC]'
    plt.scatter(df_cv_score_history.number_of_trees,
                y=df_cv_score_history.training_auc, 
                c='blue', label='training')
    plt.scatter(df_cv_score_history.number_of_trees,
                y=df_cv_score_history.validation_auc, 
                c='darkorange', label='validation')
    plt.title(my_title)
    plt.xlabel('Number of Trees')
    plt.ylabel('AUC')
    plt.ylim(0.7,0.9)
    plt.legend()
    plt.grid()
    plt.show()

In [None]:
# variable importance
fit_DRF.varimp_plot()

In [None]:
# alternative variable importance using SHAP => see direction as well as severity of feature impact
t1 = time.time()
fit_DRF.shap_summary_plot(train_hex);
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

In [None]:
# training performance
perf_train = fit_DRF.model_performance(train=True)
perf_train.plot()

In [None]:
# cross validation performance
perf_cv = fit_DRF.model_performance(xval=True)
perf_cv.plot()

In [None]:
# predict on train set (extract probabilities only)
pred_train_DRF = fit_DRF.predict(train_hex)['p1']
pred_train_DRF = pred_train_DRF.as_data_frame().p1

# plot train set predictions (probabilities)
plt.figure(figsize=(6,4))
plt.hist(pred_train_DRF, bins=100)
plt.title('Predictions on Train Set - Random Forest')
plt.grid()
plt.show()

In [None]:
# calibration
n_actual = sum(df_train.Survived)
n_pred_DRF = sum(pred_train_DRF)

print('Actual Frequency    :', n_actual)
print('Predicted Frequency :', n_pred_DRF)
print('Calibration Ratio   :', n_pred_DRF / n_actual)

In [None]:
# convert to 0/1
binary_threshold_DRF = 0.4709 # chosen so that the actual frequency is (approximately) met
pred_train_DRF_binary = np.where(pred_train_DRF > binary_threshold_DRF, 1, 0)
print('Actual Frequency      :', n_actual)
print('Calibrated Prediction :', sum(pred_train_DRF_binary))

In [None]:
# confusion matrix at selected threshold
pd.crosstab(df_train.Survived, pred_train_DRF_binary)

<a id='8'></a>
# Predict on Test Set & Submission

### GBM

In [None]:
# predict on test set (extract probabilities only)
pred_test_GBM = fit_GBM.predict(test_hex)['p1']
pred_test_GBM = pred_test_GBM.as_data_frame().p1

# plot test set predictions (probabilities)
plt.figure(figsize=(8,4))
plt.hist(pred_test_GBM, bins=100)
plt.title('Predictions on Test Set - GBM')
plt.grid()
plt.show()

In [None]:
# convert to binary again - aggregate probabilities first
pred_test_GBM_sum = pred_test_GBM.sum()
print('GBM - Sum of probs:', np.round(pred_test_GBM_sum,2))

In [None]:
# we select threshold such that counts are approximately equal to sum of probs (expected frequency)
binary_threshold_GBM_test = 0.44319
pred_test_GBM_binary = np.where(pred_test_GBM > binary_threshold_GBM_test, 1, 0)
print(pd.Series(pred_test_GBM_binary).value_counts()) # print explicitly, otherwise the counts are not shown
print('GBM - Number of Survived - Test Set (binary):', sum(pred_test_GBM_binary))

In [None]:
# GBM submission
df_sub_GBM = df_sub.copy()
df_sub_GBM.Survived = pred_test_GBM_binary
display(df_sub_GBM.head())
# save to file
df_sub_GBM.to_csv('submission_GBM.csv', index=False)

In [None]:
# save probabilities as well
pred_test_GBM.to_csv('probs_GBM.csv', index=False)

### Random Forest

In [None]:
# predict on test set (extract probabilities only)
pred_test_DRF = fit_DRF.predict(test_hex)['p1']
pred_test_DRF = pred_test_DRF.as_data_frame().p1

# plot test set predictions (probabilities)
plt.figure(figsize=(8,4))
plt.hist(pred_test_DRF, bins=100)
plt.title('Predictions on Test Set - Random Forest')
plt.grid()
plt.show()

In [None]:
# convert to binary again - aggregate probabilities first
pred_test_DRF_sum = pred_test_DRF.sum()
print('DRF - Sum of probs:', np.round(pred_test_DRF_sum,2))

In [None]:
# we select threshold such that counts are approximately equal to sum of probs (expected frequency)
binary_threshold_DRF_test = 0.44357
pred_test_DRF_binary = np.where(pred_test_DRF > binary_threshold_DRF_test, 1, 0)
print(pd.Series(pred_test_DRF_binary).value_counts()) # print explicitly, otherwise the counts are not shown
print('DRF - Number of Survived - Test Set (binary):', sum(pred_test_DRF_binary))

In [None]:
# DRF submission
df_sub_DRF = df_sub.copy()
df_sub_DRF.Survived = pred_test_DRF_binary
display(df_sub_DRF.head())
# save to file
df_sub_DRF.to_csv('submission_DRF.csv', index=False)

In [None]:
# save probabilities as well
pred_test_DRF.to_csv('probs_DRF.csv', index=False)

### Blend

In [None]:
# combine predictions in one data frame
df_preds_train = pd.DataFrame({'GBM': pred_train_GBM.values, 'DRF': pred_train_DRF.values})
df_preds_test = pd.DataFrame({'GBM': pred_test_GBM.values, 'DRF': pred_test_DRF.values})

In [None]:
# scatter plot of two prediction sets - TRAIN set
sns.jointplot(data=df_preds_train, x='GBM', y='DRF',
              joint_kws={'s' : 2},
              alpha=0.25)
plt.show()

In [None]:
# scatter plot of two prediction sets - TEST set
sns.jointplot(data=df_preds_test, x='GBM', y='DRF',
              joint_kws={'s' : 2},
              alpha=0.25)
plt.show()

In [None]:
# correlation (on test set)
df_preds_test.corr(method='pearson')

In [None]:
# blend two model results on probability level
w_GBM = 0.8
w_DRF = 1-w_GBM
df_preds_train['blend'] = w_GBM*df_preds_train.GBM + w_DRF*df_preds_train.DRF
df_preds_test['blend'] = w_GBM*df_preds_test.GBM + w_DRF*df_preds_test.DRF
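
The weight of 0.8 is fixed by hand. One way to check such a choice is a small grid search over the blend weight, scored by AUC on predictions that were not used for fitting (hold-out or out-of-fold). The sketch below uses synthetic scores; `auc_rank`, `pred_a` and `pred_b` are hypothetical names, and the result says nothing about the best weight for the models in this notebook.

```python
import numpy as np

def auc_rank(y_true, score):
    """AUC via the Mann-Whitney rank-sum formula (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    ranks = np.empty(len(score))
    ranks[np.argsort(score)] = np.arange(1, len(score) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# synthetic hold-out data: two noisy views of the same latent signal
rng = np.random.default_rng(42)
signal = rng.normal(size=5000)
y = (signal + rng.normal(size=5000) > 0).astype(int)
pred_a = signal + rng.normal(scale=0.5, size=5000)   # stands in for the GBM scores
pred_b = signal + rng.normal(scale=1.0, size=5000)   # stands in for the DRF scores

weights = np.linspace(0, 1, 11)
aucs = [auc_rank(y, w * pred_a + (1 - w) * pred_b) for w in weights]
best_w = weights[int(np.argmax(aucs))]
print('Best weight for model A:', best_w, '| AUC:', round(max(aucs), 4))
```

Tuning the weight on in-sample training predictions would favour whichever model overfits more, which is why the sketch assumes hold-out scores.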

In [None]:
# plot test set predictions (probabilities)
plt.figure(figsize=(8,4))
plt.hist(df_preds_test.blend, bins=100)
plt.title('Predictions on Test Set - Blend')
plt.grid()
plt.show()

In [None]:
# convert to 0/1
print('Actual Frequency:', n_actual)
# recalc threshold (for training)
binary_threshold_BLEND = 0.488345 # chosen so that the actual frequency is (approximately) met
pred_train_BLEND_binary = np.where(df_preds_train.blend > binary_threshold_BLEND, 1, 0)
print('Number of Survived (binary):', sum(pred_train_BLEND_binary))

In [None]:
# confusion matrix at selected threshold
pd.crosstab(df_train.Survived, pred_train_BLEND_binary)

In [None]:
# convert to binary again - aggregate probabilities first
pred_test_BLEND_sum = df_preds_test.blend.sum()
print('Blend - Sum of probs:', np.round(pred_test_BLEND_sum,2))

In [None]:
# we select threshold such that counts are approximately equal to sum of probs (expected frequency)
binary_threshold_BLEND_test = 0.44189
pred_test_BLEND_binary = np.where(df_preds_test.blend > binary_threshold_BLEND_test, 1, 0)
print(pd.Series(pred_test_BLEND_binary).value_counts()) # print explicitly, otherwise the counts are not shown
print('Blend - Number of Survived - Test Set (binary):', sum(pred_test_BLEND_binary))

In [None]:
# blend submission
df_sub_BLEND = df_sub.copy()
df_sub_BLEND.Survived = pred_test_BLEND_binary
display(df_sub_BLEND.head())
# save to file
df_sub_BLEND.to_csv('submission_BLEND.csv', index=False)

<a id='9'></a>
# Explanations for GBM model

### Let's look a little bit behind the scenes of our GBM model predictions:

In [None]:
# pick an example (from training data)
my_row = 8
train_hex[my_row,:]

In [None]:
# what did we predict?
print('Prediction (binary):', pred_train_GBM_binary[my_row])
print('Prediction (prob.) :', pred_train_GBM[my_row])

In [None]:
# explain prediction by decomposing it into individual contributions
fit_GBM.shap_explain_row_plot(row_index=my_row, frame=train_hex);

#### Interpretation: We have already seen that males had a much lower chance of survival, and sex is the most important factor for this prediction as well. Positive contributions come from Parch=1 (number of parents/children aboard) and Pclass=1 (1st class ticket). The impact of the name features is negligible.
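
How the bars in such a plot relate to the predicted probability can be shown with a little arithmetic. The sketch assumes, as is usual for SHAP on binomial tree models, that contributions are expressed on the log-odds scale and add up, together with a bias term, to the raw prediction. All numbers are invented for illustration and are not output of `fit_GBM`.

```python
import math

# hypothetical per-feature contributions on the log-odds scale for one row
# (invented numbers, not taken from the model above)
contributions = {
    'Sex':       -1.20,
    'Pclass':     0.45,
    'Parch':      0.30,
    'Fare':       0.10,
    'FirstName':  0.01,
}
bias_term = -0.25   # hypothetical baseline log-odds

# additive decomposition: raw prediction = bias + sum of contributions
log_odds = bias_term + sum(contributions.values())
probability = 1.0 / (1.0 + math.exp(-log_odds))   # inverse logit

# rank features by absolute contribution, largest first
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    print(f'{name:10s} {value:+.2f}')
print('log-odds:', round(log_odds, 2), '| probability:', round(probability, 4))
```

Negative contributions push the probability down and positive ones push it up. A contribution close to zero, like the name feature here, leaves the prediction essentially unchanged.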