# Individual Shapley Values

This notebook demonstrates the usage of the `estimate_individual_shapley_values()` function, which estimates the Shapley values of a single row of interest. It is based on the approximation algorithm described in the [interpretable-ml-book](https://christophm.github.io/interpretable-ml-book/shapley.html#estimating-the-shapley-value) and the implementation presented [here](https://medium.com/mlearning-ai/machine-learning-interpretability-shapley-values-with-pyspark-16ffd87227e3).
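To make the idea behind the approximation concrete, here is a minimal pure-Python sketch of the Monte Carlo sampling scheme described in the interpretable-ml-book. It is not the code of `estimate_individual_shapley_values()`; the names `estimate_shapley`, `predict`, `background` and `toy_predict` are hypothetical and exist only for this illustration. For each iteration it draws a random background row and a random feature ordering, then measures how much the prediction changes when the feature of interest switches from the background value to the value of the row being explained.

```python
import random


def estimate_shapley(predict, x, background, feature_idx, n_iter=1000, seed=0):
    """Monte Carlo estimate of one feature's Shapley value for row x.

    predict: callable taking a list of feature values and returning a number
    x: the row to explain (list of feature values)
    background: list of rows used to fill in the "absent" features
    feature_idx: position of the feature whose contribution is estimated
    """
    rng = random.Random(seed)
    n_features = len(x)
    total = 0.0
    for _ in range(n_iter):
        z = rng.choice(background)        # random reference row
        order = list(range(n_features))
        rng.shuffle(order)                # random feature ordering
        pos = order.index(feature_idx)
        from_x = set(order[:pos])         # features that precede feature_idx
        # Row where the preceding features AND feature_idx come from x
        x_plus = [x[k] if (k in from_x or k == feature_idx) else z[k]
                  for k in range(n_features)]
        # Same row, except feature_idx comes from the reference row z
        x_minus = [x[k] if k in from_x else z[k] for k in range(n_features)]
        total += predict(x_plus) - predict(x_minus)
    return total / n_iter


def toy_predict(row):
    # Hypothetical linear model used only for this illustration
    return 2.0 * row[0] + 3.0 * row[1]


if __name__ == "__main__":
    background_rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
    for j in range(2):
        print(j, estimate_shapley(toy_predict, [1.0, 2.0], background_rows, j))
```

For a linear model each sampled contribution reduces to the coefficient times the difference between the row's value and the background value, which makes the toy model a convenient sanity check. More iterations reduce the sampling noise at the cost of more model evaluations.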

## Session Setup

In [1]:
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.regression import GBTRegressor

In [2]:
from pyspark_ds_toolbox.ml.data_prep import get_features_vector
from pyspark_ds_toolbox.ml.eval import get_p1, estimate_individual_shapley_values

In [3]:
spark = SparkSession.builder\
                .appName('Ml-Pipes') \
                .master('local[1]') \
                .config('spark.executor.memory', '3G') \
                .config('spark.driver.memory', '3G') \
                .config('spark.memory.offHeap.enabled', 'true') \
                .config('spark.memory.offHeap.size', '3G') \
                .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

In [4]:
def read_data(file): 
    return pd.read_stata("https://raw.github.com/scunning1975/mixtape/master/" + file)

df = read_data('nsw_mixtape.dta')
df = pd.concat((df, read_data('cps_mixtape.dta')))
df.reset_index(level=0, inplace=True)

df = spark.createDataFrame(df.drop(columns=['data_id']))\
    .withColumn('age2', F.col('age')**2)\
    .withColumn('age3', F.col('age')**3)\
    .withColumn('educ2', F.col('educ')**2)\
    .withColumn('educ_re74', F.col('educ')*F.col('re74'))\
    .withColumn('u74', F.when(F.col('re74')==0, 1).otherwise(0))\
    .withColumn('u75', F.when(F.col('re75')==0, 1).otherwise(0))

features=['age', 'age2', 'age3', 'educ', 'educ2', 'marr', 'nodegree', 'black', 'hisp', 're74', 're75', 'u74', 'u75', 'educ_re74']
df_assembled = get_features_vector(df=df, num_features=features)

In [5]:
train_size=0.8
train, test = df_assembled.randomSplit([train_size, (1-train_size)], seed=12345)

## Usage in a Regression Problem

In [6]:
# Fit a gradient-boosted tree regressor on re78 and score the test set
model_regressor = GBTRegressor(labelCol='re78')
p_regression = Pipeline(stages=[model_regressor]).fit(train)
sdf_prediction_regression = p_regression.transform(test)

# Explain the test row with the highest predicted re78
row_of_interest_reg = sdf_prediction_regression.orderBy(F.col('prediction').desc()).first()



In [7]:
sdf_shap_regression = estimate_individual_shapley_values(
    spark=spark,
    df = sdf_prediction_regression,
    id_col='index',
    model = p_regression,
    column_of_interest='prediction',
    problem_type='regression',
    row_of_interest = row_of_interest_reg,
    feature_names = features,
    features_col='features',
    print_shap_values=False
)

sdf_shap_regression.show(5)



+-----+-------+-------------------+
|index|feature|               shap|
+-----+-------+-------------------+
| 6197|    age|-7356.5805590092805|
| 6197|   age2| 18.757125808299836|
| 6197|   age3| 11.650084636037183|
| 6197|   educ|  3990.985778910689|
| 6197|  educ2| -33.71631236170338|
+-----+-------+-------------------+
only showing top 5 rows



In [8]:
print('Estimated re78 from shap values decomposition:')
v = sdf_prediction_regression.select('re78').toPandas().re78.mean() + sdf_shap_regression.select(F.sum('shap')).collect()[0][0]
print(v)

print('Observed re78:')
v = row_of_interest_reg['re78']
print(v)

Estimated re78 from shap values decomposition:
24747.969288207067
Observed re78:
25564.669921875




## Usage in a Classification Problem

In [9]:
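# Fit a gradient-boosted tree classifier on treat and score the test set
model_classifier = GBTClassifier(labelCol='treat')
p_classification = Pipeline(stages=[model_classifier]).fit(train)

# p1 holds the predicted treatment probability extracted from the probability vector
sdf_prediction_classification = p_classification.transform(test)\
    .withColumn('p1', get_p1(F.col('probability')))

# Explain the test row with the highest predicted treatment probability
row_of_interest_clas = sdf_prediction_classification.orderBy(F.col('p1').desc()).first()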



In [10]:
sdf_shap_classification = estimate_individual_shapley_values(
    spark=spark,
    df = sdf_prediction_classification,
    id_col='index',
    model = p_classification,
    column_of_interest='p1',
    problem_type='classification',
    row_of_interest = row_of_interest_clas,
    feature_names = features,
    features_col='features',
    print_shap_values=False
)

sdf_shap_classification.show(5)



+-----+-------+--------------------+
|index|feature|                shap|
+-----+-------+--------------------+
|13922|    age| -27.899380519168098|
|13922|   age2| -0.1111173935016898|
|13922|   age3|0.021388909934701925|
|13922|   educ|  4.1131150480755405|
|13922|  educ2|5.151388818447742E-4|
+-----+-------+--------------------+
only showing top 5 rows



In [11]:
print('Estimated treat prob from shap values decomposition:')
print(sdf_prediction_classification.select('p1').toPandas().p1.mean() + sdf_shap_classification.select(F.sum('shap')).collect()[0][0])


print('Estimated treat prob directly from model:')
v = row_of_interest_clas['p1']
print(v)

Estimated treat prob from shap values decomposition:
0.6212793648137955
Estimated treat prob directly from model:
0.6212793588638306
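
The check above relies on the efficiency property of Shapley values: the average prediction plus the sum of a row's Shapley values equals the model's prediction for that row. The property is stated in terms of predictions, so a comparison against an observed target would generally also include the model's error. The following pure-Python sketch illustrates the property on a toy model by computing exact Shapley values over all feature orderings. The names `exact_shapley`, `toy_model` and the toy data are hypothetical and are not part of `pyspark_ds_toolbox`.

```python
from itertools import permutations


def exact_shapley(predict, x, background):
    """Exact Shapley values of row x, averaging over all feature orderings.

    Features outside the current coalition are filled with values from each
    background row, and the resulting predictions are averaged.
    """
    n = len(x)

    def value(coalition):
        preds = [
            predict([x[k] if k in coalition else z[k] for k in range(n)])
            for z in background
        ]
        return sum(preds) / len(preds)

    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        coalition = set()
        for j in order:
            before = value(coalition)
            coalition.add(j)
            # Marginal contribution of feature j in this ordering
            phi[j] += value(coalition) - before
    return [p / len(orders) for p in phi]


def toy_model(row):
    # Hypothetical model with an interaction term, for illustration only
    return 2.0 * row[0] + row[0] * row[1] - 3.0 * row[2]


toy_background = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 2.0, 0.0]]
toy_row = [3.0, 1.0, 2.0]

toy_phi = exact_shapley(toy_model, toy_row, toy_background)
toy_mean_prediction = sum(toy_model(z) for z in toy_background) / len(toy_background)

# Efficiency: mean prediction + sum of Shapley values == prediction for the row
assert abs(toy_mean_prediction + sum(toy_phi) - toy_model(toy_row)) < 1e-9
```

With a sampling-based estimate the equality holds only approximately, so a small gap between the two printed numbers is expected.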


