# Individual Shapley Values

This notebook shows how to use the `estimate_individual_shapley_values()` function, which estimates the Shapley value of each feature for a single row of interest. It is based on the algorithm described in the [interpretable-ml-book](https://christophm.github.io/interpretable-ml-book/shapley.html#estimating-the-shapley-value) and the implementation presented [here](https://medium.com/mlearning-ai/machine-learning-interpretability-shapley-values-with-pyspark-16ffd87227e3).
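To make the referenced algorithm concrete, here is a minimal NumPy sketch of the Monte Carlo approximation described in the interpretable-ml-book. It is not the toolbox's implementation: the function name `monte_carlo_shapley`, its parameters, and the linear toy model are all assumptions made for illustration.

```python
import numpy as np

def monte_carlo_shapley(predict, X, x, j, n_iter=1000, seed=0):
    """Approximate the Shapley value of feature j for instance x (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    contributions = np.empty(n_iter)
    for m in range(n_iter):
        z = X[rng.integers(n_rows)]          # random background row
        perm = rng.permutation(n_features)   # random feature ordering
        pos = int(np.where(perm == j)[0][0]) # position of feature j in that ordering
        # x_plus: features up to and including j (in permutation order) come from x,
        # the remaining features come from the background row z.
        x_plus = z.copy()
        x_plus[perm[:pos + 1]] = x[perm[:pos + 1]]
        # x_minus: same as x_plus, except feature j also comes from z.
        x_minus = z.copy()
        x_minus[perm[:pos]] = x[perm[:pos]]
        contributions[m] = predict(x_plus[None, :])[0] - predict(x_minus[None, :])[0]
    # The estimate is the average marginal contribution of feature j.
    return contributions.mean()

# Toy setup (hypothetical data and model, used only to exercise the sketch).
rng = np.random.default_rng(42)
X_background = rng.normal(size=(500, 3))
weights = np.array([2.0, -1.0, 0.5])

def linear_predict(data):
    return data @ weights

x_interest = np.array([1.0, 2.0, 3.0])
phi = [monte_carlo_shapley(linear_predict, X_background, x_interest, j, n_iter=2000)
       for j in range(3)]
print(phi)
```

For a linear model the Shapley value of feature `j` is `weights[j] * (x[j] - mean(X[:, j]))`, so the estimates printed here should land close to that closed form, which gives a quick sanity check on the sampling logic.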

## Session Setup

In [1]:
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.regression import GBTRegressor

In [2]:
from pyspark_ds_toolbox.ml.data_prep import get_features_vector
from pyspark_ds_toolbox.ml.eval import get_p1, estimate_individual_shapley_values

In [3]:
spark = SparkSession.builder\
                .appName('Ml-Pipes') \
                .master('local[1]') \
                .config('spark.executor.memory', '3G') \
                .config('spark.driver.memory', '3G') \
                .config('spark.memory.offHeap.enabled', 'true') \
                .config('spark.memory.offHeap.size', '3G') \
                .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

21/12/05 00:55:31 WARN Utils: Your hostname, matrix.local resolves to a loopback address: 127.0.0.1; using 10.0.0.105 instead (on interface en0)
21/12/05 00:55:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/12/05 00:55:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [4]:
def read_data(file): 
    return pd.read_stata("https://raw.github.com/scunning1975/mixtape/master/" + file)

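df = read_data('nsw_mixtape.dta')
df = pd.concat((df, read_data('cps_mixtape.dta')))
# Expose the pandas index as an 'index' column, used below as the row id.
# Note: concat keeps each file's own index, so 'index' values repeat across the two sources.
df.reset_index(level=0, inplace=True)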

df = spark.createDataFrame(df.drop(columns=['data_id']))\
    .withColumn('age2', F.col('age')**2)\
    .withColumn('age3', F.col('age')**3)\
    .withColumn('educ2', F.col('educ')**2)\
    .withColumn('educ_re74', F.col('educ')*F.col('re74'))\
    .withColumn('u74', F.when(F.col('re74')==0, 1).otherwise(0))\
    .withColumn('u75', F.when(F.col('re75')==0, 1).otherwise(0))

features=['age', 'age2', 'age3', 'educ', 'educ2', 'marr', 'nodegree', 'black', 'hisp', 're74', 're75', 'u74', 'u75', 'educ_re74']
df_assembled = get_features_vector(df=df, num_features=features)

In [5]:
train_size=0.8
train, test = df_assembled.randomSplit([train_size, (1-train_size)], seed=12345)

row_of_interest = df_assembled.filter(F.col('index')==3).first()

## Using in a Regression Problem

In [6]:
# Regression
model_regressor = GBTRegressor(labelCol='re78')
p_regression = Pipeline(stages=[model_regressor]).fit(train)

In [7]:
sdf_shap_regression = estimate_individual_shapley_values(
    spark=spark,
    df = df_assembled,
    id_col='index',
    model = p_regression,
    problem_type='regression',
    row_of_interest = row_of_interest,
    feature_names = features,
    features_col='features',
    print_shap_values=False
)

sdf_shap_regression.show(5)



+-----+-------+------------------+
|index|feature|              shap|
+-----+-------+------------------+
|    3|    age| 1531.361699634618|
|    3|   age2|101.16142960996297|
|    3|   age3| 102.3255325098817|
|    3|   educ|-745.7361496945412|
|    3|  educ2|109.55455718459126|
+-----+-------+------------------+
only showing top 5 rows



In [8]:
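print('Estimated re78 from shap values decomposition:')
# Base value (mean observed re78 over all rows) plus the sum of the row's Shapley values.
print(df_assembled.select('re78').toPandas().re78.mean() + sdf_shap_regression.select(F.sum('shap')).collect()[0][0])

print('Observed re78:')
# Observed (not model-predicted) re78 for the row of interest.
v = df_assembled.filter(F.col('index')==3).select('re78').collect()[0][0]
print(v)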

Estimated re78 from shap values decomposition:
10996.32116570435
Observed re78:
7506.14599609375
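The check above adds a base value to the sum of the row's Shapley values. By the efficiency property, Shapley values sum to the model's prediction for the row minus the average prediction, so the reconstructed number tracks the model's prediction rather than the observed label, and a gap to the observed `re78` is expected. The sketch below illustrates the property with exact Shapley values on a toy model. The function `exact_shapley_values`, the toy model, and the data are assumptions for illustration, and the Monte Carlo estimator used by the toolbox would satisfy the identity only approximately.

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shapley_values(predict, X, x):
    """Exact Shapley values by enumerating all coalitions (feasible only for few features)."""
    n_features = X.shape[1]

    def value(coalition):
        # Average prediction when the coalition's features are fixed to x
        # and the remaining features keep their background values.
        data = X.copy()
        cols = list(coalition)
        if cols:
            data[:, cols] = x[cols]
        return predict(data).mean()

    phi = np.zeros(n_features)
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        for size in range(len(others) + 1):
            # Shapley weight: |S|! (p - |S| - 1)! / p!
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                phi[j] += weight * (value(subset + (j,)) - value(subset))
    return phi

# Hypothetical background data and a small nonlinear model.
rng = np.random.default_rng(0)
X_background = rng.normal(size=(200, 3))
x_interest = np.array([1.0, -2.0, 0.5])

def toy_predict(data):
    return data[:, 0] * data[:, 1] + 2.0 * data[:, 2]

phi = exact_shapley_values(toy_predict, X_background, x_interest)
base_value = toy_predict(X_background).mean()      # average prediction
prediction = toy_predict(x_interest[None, :])[0]   # prediction for the row of interest

# Efficiency: base value + sum of Shapley values equals the row's prediction.
print(base_value + phi.sum(), prediction)
```

Here the base value is the average model prediction over the background data, which is the quantity the efficiency property refers to.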


## Using in a Classification Problem

In [9]:
# Classification
model_classifier = GBTClassifier(labelCol='treat')
p_classification = Pipeline(stages=[model_classifier]).fit(train)

df_predicted = p_classification.transform(test)\
    .withColumn('p1', get_p1(F.col('probability')))

In [10]:
sdf_shap_classification = estimate_individual_shapley_values(
    spark=spark,
    df = df_predicted,
    id_col='index',
    model = p_classification,
    problem_type='classification',
    row_of_interest = row_of_interest,
    feature_names = features,
    features_col='features',
    print_shap_values=False
)

sdf_shap_classification.show(5)



+-----+-------+--------------------+
|index|feature|                shap|
+-----+-------+--------------------+
|    3|    age|-0.06047205754473...|
|    3|   age2|-3.97297344350999...|
|    3|   age3|4.368036381176134E-5|
|    3|   educ|-0.00462527661687...|
|    3|  educ2|-1.27617599632531...|
+-----+-------+--------------------+
only showing top 5 rows



In [11]:
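print('Estimated treat prob from shap values decomposition:')
# Base value (mean predicted p1 over df_predicted) plus the sum of the row's Shapley values.
print(df_predicted.select('p1').toPandas().p1.mean() + sdf_shap_classification.select(F.sum('shap')).collect()[0][0])

print('Estimated treat prob directly from model:')
# Model-predicted probability of treat=1 for the row of interest.
v = df_predicted.filter(F.col('index')==3).select('p1').collect()[0][0]
print(v)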

Estimated treat prob from shap values decomposition:
0.0523881502510601
Estimated treat prob directly from model:
0.043640851974487305
