# Individual SHAP Values

This notebook presents the usage of the `estimate_shap_values()` function. It is based on the algorithm described in the [interpretable-ml-book](https://christophm.github.io/interpretable-ml-book/shapley.html#estimating-the-shapley-value) and the implementation presented [here](https://medium.com/mlearning-ai/machine-learning-interpretability-shapley-values-with-pyspark-16ffd87227e3).
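To make the referenced algorithm concrete, the following is a minimal NumPy sketch of a Monte Carlo Shapley estimate for a single feature of a single instance, in the spirit of the sampling procedure described in the book linked above. It is not the toolbox's implementation: the function name `estimate_shapley_value`, its parameters, and the toy linear model are all hypothetical and exist only for illustration.

```python
import numpy as np


def estimate_shapley_value(predict, X, x, j, n_iter=1000, seed=0):
    """Monte Carlo estimate of the Shapley value of feature j for instance x.

    Hypothetical helper for illustration only. `predict` maps a 2D array of
    rows to a 1D array of predictions; `X` is the background data set.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    contributions = np.empty(n_iter)

    for m in range(n_iter):
        z = X[rng.integers(n_rows)]            # random background instance
        order = rng.permutation(n_features)    # random feature ordering
        pos = int(np.where(order == j)[0][0])  # position of j in that ordering

        # Features up to and including j come from x; the rest come from z.
        x_plus = z.copy()
        x_plus[order[:pos + 1]] = x[order[:pos + 1]]

        # Same as above, except that feature j also comes from z.
        x_minus = z.copy()
        x_minus[order[:pos]] = x[order[:pos]]

        # Marginal contribution of feature j in this draw.
        contributions[m] = predict(x_plus[None, :])[0] - predict(x_minus[None, :])[0]

    return contributions.mean()


# Toy check with a linear model, where the Shapley value of feature j
# is weights[j] * (x[j] - mean of feature j in the background data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
weights = np.array([2.0, -1.0, 0.5])


def predict(data):
    return data @ weights


x = np.array([1.0, 1.0, 1.0])
phi_0 = estimate_shapley_value(predict, X, x, j=0, n_iter=2000)
expected_0 = weights[0] * (x[0] - X[:, 0].mean())
print(phi_0, expected_0)
```

For a linear model the estimate should land close to the closed-form value, which makes this a convenient sanity check before applying the same idea to a black-box model.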

## Session Setup

In [1]:
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

from pyspark_ds_toolbox.ml.feature_importance.shap_values import estimate_shap_values



## Reading the Base Dataset

In [2]:
def read_data(file):
    """Load a Stata file from the Mixtape repository into a pandas DataFrame."""
    return pd.read_stata("https://raw.github.com/scunning1975/mixtape/master/" + file)

# Stack the NSW and CPS samples into a single DataFrame.
df = read_data('nsw_mixtape.dta')
df = pd.concat((df, read_data('cps_mixtape.dta')))
df['treat'] = df['treat'].astype(int)

# Add a unique row identifier; it is passed as `id_col` further down.
df['id'] = np.arange(len(df))

df.head()

|   | data_id | treat | age | educ | black | hisp | marr | nodegree | re74 | re75 | re78 | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dehejia-Wahba Sample | 1 | 37.0 | 11.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 9930.045898 | 0 |
| 1 | Dehejia-Wahba Sample | 1 | 22.0 | 9.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3595.894043 | 1 |
| 2 | Dehejia-Wahba Sample | 1 | 30.0 | 12.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24909.449219 | 2 |
| 3 | Dehejia-Wahba Sample | 1 | 27.0 | 11.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 7506.145996 | 3 |
| 4 | Dehejia-Wahba Sample | 1 | 33.0 | 8.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 289.789886 | 4 |


In [3]:
spark = SparkSession.builder\
                .appName('Spark-Toolbox') \
                .master('local[1]') \
                .config('spark.executor.memory', '3G') \
                .config('spark.driver.memory', '3G') \
                .config('spark.memory.offHeap.enabled', 'true') \
                .config('spark.memory.offHeap.size', '3G') \
                .getOrCreate()

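# Silence everything below ERROR to keep the notebook output readable.
spark.sparkContext.setLogLevel("ERROR")
# Enable Arrow for pandas <-> Spark conversion. On Spark 3.x this key is
# deprecated in favor of "spark.sql.execution.arrow.pyspark.enabled".
spark.conf.set("spark.sql.execution.arrow.enabled", "true")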



In [4]:
sdf = spark.createDataFrame(df)
sdf.show(3)



+--------------------+-----+----+----+-----+----+----+--------+----+----+--------+---+
|             data_id|treat| age|educ|black|hisp|marr|nodegree|re74|re75|    re78| id|
+--------------------+-----+----+----+-----+----+----+--------+----+----+--------+---+
|Dehejia-Wahba Sample|    1|37.0|11.0|  1.0| 0.0| 1.0|     1.0| 0.0| 0.0|9930.046|  0|
|Dehejia-Wahba Sample|    1|22.0| 9.0|  0.0| 1.0| 0.0|     1.0| 0.0| 0.0|3595.894|  1|
|Dehejia-Wahba Sample|    1|30.0|12.0|  1.0| 0.0| 0.0|     0.0| 0.0| 0.0|24909.45|  2|
+--------------------+-----+----+----+-----+----+----+--------+----+----+--------+---+
only showing top 3 rows



## Regression

In [5]:
shap_values = estimate_shap_values(
    sdf=sdf,
    id_col='id',
    target_col='re78',
    cat_features=['data_id'],
    sort_metric='rmse',
    problem_type='regression',
    subset_size=1000,
    max_mem_size='2G',
    max_models=8,
    max_runtime_secs=15,
    nfolds=5,
    seed=90
)

In [6]:
#shap_values.show(20)

## Classification

In [7]:
shap_values_classification = estimate_shap_values(
    sdf=sdf,
    id_col='id',
    target_col='treat',
    cat_features=['data_id'],
    sort_metric='aucpr',
    problem_type='classification',
    subset_size=1000,
    max_mem_size='2G',
    max_models=8,
    max_runtime_secs=15,
    nfolds=5,
    seed=90
)

In [8]:
#shap_values_classification.show(20)