In this notebook we'll explore feature importance using SHAP values. SHAP values are the most mathematically consistent way of obtaining feature importances, and they work particularly well with tree-based models. Unfortunately, calculating SHAP values is an **extremely** resource-intensive process. However, starting with XGBoost 1.3 it is possible to calculate these values on GPUs, which speeds up the process by a factor of 20X - 50X compared to calculating them on a CPU. Furthermore, it is also possible to calculate SHAP values for feature interactions. The GPU speedup for those is even more dramatic: it takes a few minutes, as opposed to days or even longer on a CPU.
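
To make the consistency claim a bit more concrete, here is a minimal sketch of the exact Shapley computation on a made-up three-feature function. The function `toy_value`, the input `x`, and the zero baseline are all illustrative assumptions, and this brute-force enumeration is not the algorithm XGBoost uses (it relies on a tree-specific method). It only shows the defining property: the per-feature contributions add up to the difference between the prediction and the baseline.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                with_i = value_fn(set(coalition) | {i})
                without_i = value_fn(set(coalition))
                phi[i] += weight * (with_i - without_i)
    return phi


# Hypothetical toy model: f(x) = 2*x0 + x1*x2, evaluated at x = (1, 2, 3).
# Features missing from a coalition are replaced by a baseline of 0.
x = (1.0, 2.0, 3.0)
baseline = (0.0, 0.0, 0.0)


def toy_value(coalition):
    z = [x[j] if j in coalition else baseline[j] for j in range(3)]
    return 2 * z[0] + z[1] * z[2]


phi = shapley_values(toy_value, 3)
print(phi)

# Contributions sum to f(x) - f(baseline)
print(sum(phi), toy_value({0, 1, 2}) - toy_value(set()))
```

The number of coalitions doubles with every added feature, which gives a feel for why the general computation is so expensive.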

In [None]:
%matplotlib inline

At this point the Kaggle Docker environment does not support XGBoost 1.3+, so we'll have to install it manually.

In [None]:
!pip install --upgrade xgboost
import xgboost as xgb
xgb.__version__


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import gc
import matplotlib.pyplot as plt
import shap

# load JS visualization code to notebook
shap.initjs()

In [None]:


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv')
test  = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv')
sub = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv')

In [None]:
test.shape

In [None]:
columns = test.columns[1:]
target = train['target'].values

In [None]:
# One out-of-fold prediction slot per training row
train_oof = np.zeros(len(train))
test_preds = 0
train_oof.shape

In [None]:
## Best hyperparameters from the following notebook: https://www.kaggle.com/hamzaghanmi/xgboost-hyperparameter-tuning-using-optuna

Best_trial= {'lambda': 0.0030282073258141168, 
             'alpha': 0.01563845128469084, 
             'colsample_bytree': 0.55,
             'subsample': 0.7,
             # 'n_estimators': 4000, 
             'learning_rate': 0.01,
             'max_depth': 15,
             'random_state': 2020, 
             'min_child_weight': 257,
             'tree_method':'gpu_hist',
             'predictor': 'gpu_predictor'}

In [None]:
test = xgb.DMatrix(test[columns])

In [None]:
NUM_FOLDS = 8
kf = KFold(n_splits=NUM_FOLDS, shuffle=True, random_state=0)

for f, (train_ind, val_ind) in tqdm(enumerate(kf.split(train, target))):
        #print(f'Fold {f}')
        train_df, val_df = train.iloc[train_ind][columns], train.iloc[val_ind][columns]
        train_target, val_target = target[train_ind], target[val_ind]
        
        train_df = xgb.DMatrix(train_df, label=train_target)
        val_df = xgb.DMatrix(val_df, label=val_target)
        
        model =  xgb.train(Best_trial, train_df, 1500)
        temp_oof = model.predict(val_df)
        temp_test = model.predict(test)

        train_oof[val_ind] = temp_oof
        test_preds += temp_test/NUM_FOLDS
        
        # Fold RMSE; np.sqrt avoids the `squared=` argument, which newer scikit-learn releases no longer accept
        print(np.sqrt(mean_squared_error(val_target, temp_oof)))

In [None]:
0.6959799893005467


In [None]:
# Overall out-of-fold RMSE
np.sqrt(mean_squared_error(target, train_oof))


In [None]:
np.save('train_oof', train_oof)
np.save('test_preds', test_preds)

Next, we calculate the SHAP values for the test set, using the model trained on the last fold.

In [None]:
%%time
shap_preds = model.predict(test, pred_contribs=True)

As you can see, it took less than two minutes to calculate these values. On a CPU in the Kaggle environment it would take many hours.

Now let's do some plots of these values.

In [None]:
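# Reload the raw test DataFrame: `test` was replaced by a DMatrix above, and the plots need the feature values
test  = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv')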


In [None]:
# summarize the effects of all the features
shap.summary_plot(shap_preds[:,:-1], test[columns])


In [None]:
shap.summary_plot(shap_preds[:,:-1], test[columns], plot_type="bar")
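
As a rough sketch of what the two plots above summarize: the contribution array has one column per feature plus a final bias column (hence the `[:, :-1]` slice), and the bar plot ranks features by their mean absolute contribution. The array `fake_shap` and the names in `feature_names` below are made up purely for illustration.

```python
import numpy as np

# Synthetic stand-in for `shap_preds`: 4 rows, 3 features plus a final bias column.
# These numbers are invented for illustration only.
fake_shap = np.array([
    [ 0.5, -0.1,  0.0, 7.9],
    [-0.3,  0.2,  0.1, 7.9],
    [ 0.4, -0.2,  0.0, 7.9],
    [-0.6,  0.1, -0.1, 7.9],
])
feature_names = ["f0", "f1", "f2"]  # hypothetical names

# Drop the bias column, then average the absolute contributions per feature
mean_abs = np.abs(fake_shap[:, :-1]).mean(axis=0)

# Rank features from most to least important
order = np.argsort(mean_abs)[::-1]
for idx in order:
    print(feature_names[idx], round(float(mean_abs[idx]), 3))

# Each row (bias included) sums to the model's raw prediction for that row
row_predictions = fake_shap.sum(axis=1)
print(row_predictions)
```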


Next, we'll calculate SHAP values for feature interactions. There will be 15x15x200,000 + 200,000 numbers that need to be computed.
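
For a sense of scale, the arithmetic can be spelled out as follows. The row and matrix sizes are taken from the text above; the 4-byte size per value is an assumption (single-precision floats).

```python
# Numbers taken from the text above: a 15 x 15 matrix per row, 200,000 test rows
n_rows = 200_000
matrix_side = 15

interaction_values = matrix_side * matrix_side * n_rows
total_values = interaction_values + n_rows  # the "+ 200,000" term from the text

# Assumption: each value is stored as a 4-byte float32
bytes_per_value = 4
size_mb = total_values * bytes_per_value / 1e6

print(f"{interaction_values:,} interaction values")
print(f"{total_values:,} values in total, about {size_mb:.1f} MB")
```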

In [None]:
test = xgb.DMatrix(test[columns])

In [None]:
%%time
shap_interactions = model.predict(test, pred_interactions=True)

It took less than 10 minutes to calculate these values. On a CPU this would take a few days to compute.

Now let's take a look at the top interactions in this dataset.

In [None]:
def plot_top_k_interactions(feature_names, shap_interactions, k):
    # Get the mean absolute contribution for each feature interaction
    aggregate_interactions = np.mean(np.abs(shap_interactions[:, :-1, :-1]), axis=0)
    interactions = []
    for i in range(aggregate_interactions.shape[0]):
        for j in range(aggregate_interactions.shape[1]):
            if j < i:
                interactions.append(
                    (feature_names[i] + "-" + feature_names[j], aggregate_interactions[i][j] * 2))
    # sort by magnitude
    interactions.sort(key=lambda x: x[1], reverse=True)
    interaction_features, interaction_values = map(tuple, zip(*interactions))
    plt.bar(interaction_features[:k], interaction_values[:k])
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()

In [None]:
plot_top_k_interactions(columns, shap_interactions, 10)


Interestingly, it seems that the cont13 feature interacts a lot with the others in the dataset.
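
The plotting function above multiplies each off-diagonal entry by 2. A small sketch of why, assuming the usual SHAP interaction layout: the matrix for each row is symmetric, the diagonal holds main effects, and each pairwise effect is split evenly between entries `[i, j]` and `[j, i]`. The matrix `phi_int` below is invented for illustration.

```python
import numpy as np

# Made-up interaction matrix for one row and 3 features (bias row/column already dropped).
# Diagonal = main effects; off-diagonal [i, j] = half of the i-j interaction effect.
phi_int = np.array([
    [ 0.30, 0.05, -0.02],
    [ 0.05, 0.10,  0.04],
    [-0.02, 0.04,  0.20],
])

# The matrix is symmetric, so [i, j] and [j, i] carry the same half-share
assert np.allclose(phi_int, phi_int.T)

# Summing a row recovers the ordinary SHAP value of that feature
shap_per_feature = phi_int.sum(axis=1)

# Total effect attributed to each pair: both halves added together
i_idx, j_idx = np.tril_indices(3, k=-1)  # pairs with j < i, as in the function above
pair_totals = {(int(i), int(j)): 2 * float(phi_int[i, j]) for i, j in zip(i_idx, j_idx)}

print(shap_per_feature)
print(pair_totals)
```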



We'll now try to create additional features on the basis of these feature interactions, and see how well the new model performs.

In [None]:
del shap_interactions, shap_preds
gc.collect()
gc.collect()

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv')
test  = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv')
sub = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv')

In [None]:
# Multiply cont13 by each of its top interaction partners, in both train and test
for partner in ['cont4', 'cont11', 'cont7', 'cont2', 'cont10']:
    name = f'cont13_{partner}'
    train[name] = train['cont13'] * train[partner]
    test[name] = test['cont13'] * test[partner]

In [None]:
train.shape

In [None]:
columns = test.columns[1:]
target = train['target'].values

In [None]:
# One out-of-fold prediction slot per training row
train_oof_2 = np.zeros(len(train))
test_preds_2 = 0
train_oof_2.shape

In [None]:
test = xgb.DMatrix(test[columns])

In [None]:
Best_trial= {'lambda': 0.0030282073258141168, 
             'alpha': 0.01563845128469084, 
             'colsample_bytree': 0.55,
             'subsample': 0.7,
             # 'n_estimators': 4000, 
             'learning_rate': 0.01,
             'max_depth': 15,
             'random_state': 2020, 
             'min_child_weight': 257,
             'tree_method':'gpu_hist',
             'predictor': 'gpu_predictor'}

In [None]:
kf = KFold(n_splits=NUM_FOLDS, shuffle=True, random_state=0)

for f, (train_ind, val_ind) in tqdm(enumerate(kf.split(train, target))):
        #print(f'Fold {f}')
        train_df, val_df = train.iloc[train_ind][columns], train.iloc[val_ind][columns]
        train_target, val_target = target[train_ind], target[val_ind]
        
        train_df = xgb.DMatrix(train_df, label=train_target)
        val_df = xgb.DMatrix(val_df, label=val_target)
        
        model =  xgb.train(Best_trial, train_df, 1600)
        temp_oof = model.predict(val_df)
        temp_test = model.predict(test)

        train_oof_2[val_ind] = temp_oof
        test_preds_2 += temp_test/NUM_FOLDS
        
        # Fold RMSE; np.sqrt avoids the `squared=` argument, which newer scikit-learn releases no longer accept
        print(np.sqrt(mean_squared_error(val_target, temp_oof)))

In [None]:
0.6961520115102843

In [None]:
# Overall out-of-fold RMSE for the model with interaction features
np.sqrt(mean_squared_error(target, train_oof_2))


In [None]:
# Out-of-fold RMSE of a 60/40 blend of the two models
np.sqrt(mean_squared_error(target, 0.6*train_oof + 0.4*train_oof_2))


In [None]:
np.save('train_oof_2', train_oof_2)
np.save('test_preds_2', test_preds_2)

In [None]:
0.6959956268557288

In [None]:
sub['target'] = test_preds
sub.to_csv('submission.csv', index=False)

In [None]:
sub['target'] = test_preds_2
sub.to_csv('submission_2.csv', index=False)

In [None]:
sub['target'] = 0.6*test_preds+0.4*test_preds_2
sub.to_csv('submission_average.csv', index=False)
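
The 0.6/0.4 weights above are set by hand. One possible way to choose such a weight, sketched here on synthetic data rather than the real out-of-fold arrays, is to scan a grid of weights and keep the one with the lowest out-of-fold RMSE. The arrays `y`, `oof_a`, and `oof_b` are hypothetical stand-ins for `target`, `train_oof`, and `train_oof_2`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for `target`, `train_oof`, and `train_oof_2`
y = rng.normal(size=1000)
oof_a = y + rng.normal(scale=0.5, size=1000)
oof_b = y + rng.normal(scale=0.6, size=1000)


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Try blend weights 0.00, 0.05, ..., 1.00 for the first model
weights = np.linspace(0.0, 1.0, 21)
scores = [rmse(y, w * oof_a + (1 - w) * oof_b) for w in weights]

best_w = float(weights[int(np.argmin(scores))])
print(f"best weight for model A: {best_w:.2f}, blended RMSE: {min(scores):.4f}")
```

Because the weight is tuned on the same out-of-fold predictions it is scored on, the resulting RMSE is slightly optimistic.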