# SHAP

SHAP (SHapley Additive exPlanations) is an open-source library that applies Shapley values from cooperative game theory to machine learning. Computing Shapley values exactly requires evaluating every combination of features, so the amount of computation grows exponentially with the number of features. SHAP uses specialized algorithms that make it possible to compute Shapley values for machine learning models in a realistic amount of time.
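
To make the combinatorial cost concrete, here is a minimal sketch of the exact Shapley computation by brute-force enumeration. It is not how the SHAP library works internally; the function `exact_shapley` and the toy payoff function `toy_value` are hypothetical names invented for this illustration.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_players):
    """Exact Shapley values by enumerating every coalition.

    Each player needs 2**(n_players - 1) coalition evaluations,
    which is why this approach does not scale to many features.
    """
    players = range(n_players)
    phi = [0.0] * n_players
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Shapley weight: |S|! * (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_players - size - 1)
                      / factorial(n_players))
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of player i to coalition S
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(coalition):
    """Hypothetical game: individual payoffs plus a bonus when 0 and 1 cooperate."""
    payoff = sum({0: 10.0, 1: 20.0, 2: 30.0}[p] for p in coalition)
    if 0 in coalition and 1 in coalition:
        payoff += 6.0
    return payoff


phi = exact_shapley(toy_value, 3)
print(phi)
```

In this toy game the bonus is split evenly between players 0 and 1, and the three values sum to the payoff of the full coalition. With roughly 300 features, as in this dataset, the same enumeration would be infeasible.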

Please note that only a very small subset of the data is used, because this notebook is for practicing SHAP.

In [None]:
import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

import lightgbm as lgb
import seaborn as sns

# Read the training data (Parquet format)

https://www.kaggle.com/robikscube/fast-data-loading-and-low-mem-with-parquet-files/notebook

In [None]:
%%time
train_df = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')
train_df

# Reduce Memory Usage

In [None]:
%%time
#https://www.kaggle.com/hasanbasriakcay/ubiquan-market-preds-pycaret-model-comparisons

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

train_df = reduce_mem_usage(train_df)
train_df

# Reduce data

In [None]:
train=train_df[:93054] #time id:0-40
valid=train_df[93054:116009]  #time id:40-50

In [None]:
X_train=train.drop(['row_id','time_id','investment_id','target'],axis=1)
y_train=train['target']
X_valid=valid.drop(['row_id','time_id','investment_id','target'],axis=1)
y_valid=valid['target']

# Training

In [None]:
%%time
# hyperparams from: https://www.kaggle.com/valleyzw/ubiquant-lgbm-optimization
params = {
    'learning_rate':0.1,
    "objective": "regression",
    "metric": "rmse",
    'boosting_type': "gbdt",
    'verbosity': -1,
    'n_jobs': -1, 
    'seed': 42,
    'lambda_l1': 9.667875964358046, 
    'lambda_l2': 0.00020351345924829076, 
    'num_leaves': 114, 
    'feature_fraction': 0.6255828560615961, 
    'bagging_fraction': 0.9993039198507003, 
    'bagging_freq': 5, 
    'max_depth': 14, 
    'max_bin': 241, 
    'min_data_in_leaf': 219,
    'n_estimators': 1000, 
}



lgb_model=lgb.LGBMRegressor(**params)
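lgb_model.fit(X_train,
          y_train,
          eval_set=[(X_train,y_train),(X_valid,y_valid)],
          # callbacks replace the verbose / early_stopping_rounds fit arguments,
          # which were removed in LightGBM 4.0
          callbacks=[lgb.early_stopping(stopping_rounds=15),
                     lgb.log_evaluation(period=1000)])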
pred=lgb_model.predict(X_valid)

In [None]:
fi=lgb_model.feature_importances_

lgb_imp = pd.DataFrame() 
lgb_imp['feature'] = X_train.columns
lgb_imp['importance'] = fi

plt.figure(figsize=(10,50))
sns.barplot(x="importance", y="feature",data=lgb_imp.sort_values(by="importance",ascending=False))
plt.title('LightGBM feature importance')
plt.tight_layout()

In [None]:
import gc

del train_df,train,valid
gc.collect()

### Let's get to the point.

# SHAP

In [None]:
!pip install shap

In [None]:
import shap

TreeExplainer is a class for efficiently computing SHAP values for tree-based models such as LightGBM.

In [None]:
%%time
explainer=shap.TreeExplainer(lgb_model)
explainer

In [None]:
%%time
shap_values=explainer.shap_values(X_train)

## Sample data check

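Let's look at the SHAP values for a sample row.

The SHAP value array has the same shape as the input data: one value per feature for each row.

The larger the absolute SHAP value, the greater that feature's impact on the prediction. The sign shows whether the feature pushes the prediction up or down.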

In [None]:
shap_values[0][:10]
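
SHAP values are also additive: for each row, the base value plus the sum of that row's SHAP values gives the model's prediction. The sketch below illustrates this with a hypothetical linear model standing in for the LightGBM model, because for a linear model with features treated as independent the SHAP value of feature i is simply `w_i * (x_i - mean_i)`. The data, `weights`, and `bias` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # 5 rows, 3 hypothetical features
weights = np.array([2.0, -1.0, 0.5])   # hypothetical linear model
bias = 0.3


def predict(data):
    """Linear stand-in for the trained model."""
    return data @ weights + bias


# Base value: the average prediction over the background data
base_value = predict(X).mean()

# Linear-model SHAP values (features assumed independent)
shap_like = weights * (X - X.mean(axis=0))

# Additivity: base value + row-wise sum of SHAP values = prediction
reconstructed = base_value + shap_like.sum(axis=1)

print(shap_like.shape == X.shape)
print(np.allclose(reconstructed, predict(X)))
```

With the objects in this notebook, the analogous check would compare `explainer.expected_value + shap_values.sum(axis=1)` against `lgb_model.predict(X_train)`; that comparison was not run here.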

# Visualization

### summary_plot

'summary_plot' shows which explanatory variables have a large impact on the predictions.

The resulting order is broadly similar to that of feature_importances_ in LightGBM.

In [None]:
%%time
shap.summary_plot(shap_values=shap_values,
                  features=X_train,
                  feature_names=X_train.columns,
                  plot_type='bar')
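
The bar version of the summary plot ranks features by their mean absolute SHAP value. As a rough sketch of that ranking, computed by hand on a small made-up SHAP matrix (the helper `rank_by_mean_abs_shap` and the numbers are hypothetical):

```python
import numpy as np


def rank_by_mean_abs_shap(shap_matrix, feature_names):
    """Return (feature, mean |SHAP|) pairs, most important first."""
    mean_abs = np.abs(shap_matrix).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]
    return [(feature_names[i], float(mean_abs[i])) for i in order]


# Made-up SHAP values: 3 rows, 3 features
toy_shap = np.array([
    [0.1, -0.5, 0.00],
    [-0.3, 0.4, 0.05],
    [0.2, -0.6, -0.05],
])

ranking = rank_by_mean_abs_shap(toy_shap, ['f_0', 'f_1', 'f_2'])
for name, score in ranking:
    print(name, round(score, 4))
```

Applied to this notebook, the same function could be called with `shap_values` and `list(X_train.columns)` to get the ranking as numbers rather than as a plot.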

In [None]:
%%time
shap.summary_plot(shap_values=shap_values,
                  features=X_train,
                  feature_names=X_train.columns)

### force_plot

'force_plot' can be used to visualize how each explanatory variable contributes to individual predictions.

In [None]:
# Not run here: plotting every row of X_train runs out of memory (OOM)
#shap.force_plot(base_value=explainer.expected_value,
#                shap_values=shap_values,
#                features=X_train,
#                feature_names=X_train.columns)
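
One possible way around the memory problem is to plot only a random subset of rows. The sketch below only picks the row indices; the helper `sample_row_indices` and the sample size of 500 are assumptions for illustration, and whether 500 rows fit in memory on Kaggle was not tested.

```python
import numpy as np


def sample_row_indices(n_rows, n_sample, seed=0):
    """Pick n_sample distinct row positions (sorted) out of n_rows."""
    rng = np.random.default_rng(seed)
    n_sample = min(n_sample, n_rows)
    return np.sort(rng.choice(n_rows, size=n_sample, replace=False))


# 93054 is the number of training rows used in this notebook
idx = sample_row_indices(n_rows=93054, n_sample=500)
print(len(idx), idx[:5])
```

The subset could then be passed as `shap_values[idx]` and `X_train.iloc[idx]` in place of the full arrays in the `shap.force_plot` call. In a notebook, `shap.initjs()` usually needs to be called first for the interactive plot to render.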

### dependence_plot

'dependence_plot' draws a scatter plot of a particular feature's values against the SHAP values for that feature.

In [None]:
%%time
shap.dependence_plot(ind='f_0',
                     interaction_index=None,
                     shap_values=shap_values,
                     features=X_train,
                     feature_names=X_train.columns)

'dependence_plot' also allows you to specify a second feature for interaction_index; the points are then colored by that feature's value.

For example, you can specify 'f_1'.

In [None]:
%%time
shap.dependence_plot(ind='f_0',
                     interaction_index='f_1',
                     shap_values=shap_values,
                     features=X_train,
                     feature_names=X_train.columns)

### Waterfall Plot

Unlike the previous plots, the waterfall plot focuses on a single prediction. For example, let's look at the first row of the training data.

In [None]:
# This call raises an error in the Kaggle notebook, so it is not run.

#shap.waterfall_plot(expected_value=explainer.expected_value,
#                    shap_values=shap_values[0],
#                    features=X_train.iloc[0],
#                    feature_names=X_train.columns)
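
Since the plot itself is not run, here is a simplified, table-style sketch of what a waterfall plot conveys: starting from the base value, each feature's SHAP value is added in turn until the prediction for that row is reached. The helper `waterfall_table` and the numbers are hypothetical, and the real plot orders and groups features somewhat differently.

```python
import numpy as np


def waterfall_table(base_value, row_shap, feature_names):
    """List (feature, SHAP value, running total), largest |SHAP| first."""
    order = np.argsort(-np.abs(row_shap))
    running = float(base_value)
    rows = []
    for i in order:
        running += float(row_shap[i])
        rows.append((feature_names[i], float(row_shap[i]), running))
    return rows


# Made-up base value and SHAP values for a single row
rows = waterfall_table(0.5, np.array([0.2, -0.7, 0.1]), ['f_0', 'f_1', 'f_2'])
for name, contribution, total in rows:
    print(f"{name}: {contribution:+.2f} -> {total:.2f}")
```

As for the error mentioned in the comment above: recent versions of shap expect `shap.plots.waterfall` to receive a `shap.Explanation` object rather than separate arrays, which may be the cause. Building one with `shap.Explanation(values=shap_values[0], base_values=explainer.expected_value, data=X_train.iloc[0].values, feature_names=list(X_train.columns))` and passing it to `shap.plots.waterfall` might work, but this was not verified in this notebook.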

## References (including Japanese articles)

https://blog.amedama.jp/entry/shap-lightgbm

https://www.kaggle.com/lucamassaron/feature-selection-by-boruta-shap-for-ubiquant

https://qiita.com/shin_mura/items/cde01198552eda9146b7