# 0.Introduction

**The goal of this notebook is to discuss Shapley values and the resulting SHAP method.** SHAP addresses what is arguably the most annoying shortcoming of machine learning: the lack of interpretability. I will show how SHAP can be used and provide some theoretical background. Since the focus is on Shapley values, I will not go into detail on data preparation or modeling.

**Content:**
1. Data preparation
2. Model: Sequential Model Assembling + Gradient Boosting
3. Shapley values

The main sources of knowledge for this notebook are:
1. [https://christophm.github.io/interpretable-ml-book/](https://christophm.github.io/interpretable-ml-book/)
2. [https://en.wikipedia.org/wiki/Shapley_value](https://en.wikipedia.org/wiki/Shapley_value)

**For readability I have hidden some lines of code; please unhide them if desired.**

# 1.Data preparation

This is the regular step performed before modeling: loading, checking and cleaning the data.

**1.1.Libraries**

First I choose some libraries which can be useful:

In [None]:
import numpy as np
from numpy.random import seed
import pandas as pd 
import matplotlib
import seaborn as sns
import holoviews as hv
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
import xgboost as xgb

from keras import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
import tensorflow as tf
import os
import collections
import string  # needed by count_words below

import shap

shap.initjs()

RandomState = 123
seed(RandomState)
tf.random.set_seed(RandomState)

print('Libraries correctly loaded')

**1.2.Data cleaning**

Second, load the data:

In [None]:
df_path = "../input/us-airbnb-open-data/AB_US_2020.csv"
usecols=['id','name','latitude','longitude','room_type','price','minimum_nights','number_of_reviews',
         'last_review','reviews_per_month','calculated_host_listings_count','availability_365','city']

df = pd.read_csv(df_path,usecols=usecols,index_col='id')

print('Number of rows: '+ format(df.shape[0]) +', number of features: '+ format(df.shape[1]))

In [None]:
df.head(5)

In [None]:
# np.product is deprecated (removed in NumPy 2.0); np.prod is the supported name
Missing_Percentage = df.isnull().sum().sum() / np.prod(df.shape) * 100
print("The share of missing entries before cleaning: " + str(round(Missing_Percentage, 5)) + " %")

In [None]:
df.info()

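Some corrections: missing names are replaced with an empty string, missing review dates with a placeholder date, and missing monthly review counts with 0: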

In [None]:
df['name'] = df['name'].fillna('')
df['last_review'] = df['last_review'].fillna('01/01/00')
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

In [None]:
df.info()

The data is now clean. Let's define a state variable:

In [None]:
states_dic = {'Asheville':'NC','Austin':'TX','Boston':'MA','Broward County':'FL','Cambridge':'MA','Chicago':'IL','Clark County':'NV','Columbus':'OH','Denver':'CO','Hawaii':'HI','Jersey City':'NJ',
             'Los Angeles':'CA','Nashville':'TN','New Orleans':'LA','New York City':'NY','Oakland':'CA','Pacific Grove':'CA','Portland':'OR','Rhode Island':'RI','Salem':'MA','San Clara Country':'CA',
             'Santa Cruz County':'CA','San Diego':'CA','San Francisco':'CA','San Mateo County':'CA','Seattle':'WA','Twin Cities MSA':'MN','Washington D.C.':'DC'}

df['state'] = df['city'].apply(lambda x : states_dic[x])

# 2.Model: Sequential Model Assembling + Gradient Boosting

In this chapter I follow the approach from [this notebook](https://www.kaggle.com/thomaskonstantin/u-s-airbnb-analysis-and-price-prediction#notebook-container). Please upvote it if you like it. My approach differs in two ways:
* I cap extreme prices at the 0.95 quantile, but I do not remove other extreme values, as I find them relevant for the analysis
* The data is divided into train and test sets, and the results are reported on the test set

The capping mentioned above:

In [None]:
upper_bound = 0.95
df.loc[df['price'] >= df['price'].quantile(upper_bound), ['price']] = df['price'].quantile(upper_bound)

In [None]:
df_final = df.drop(columns= ['price'])
Target = df.price

**2.1.NLP cleaning**

Some cleaning functions that I defined in [another NLP notebook of mine](https://www.kaggle.com/jjmewtw/total-bible-text-study-eda-cluster-bert-nlp). They cover lowercasing, punctuation removal, stemming, lemmatization, tokenization and n-gram generation:

In [None]:
ps = PorterStemmer()

def lower_column_t(data):
    values = data['name']
    values = values.lower()
    data['name'] = values
    return data

def clean_interpunction(data):
    values = data['name']
    values = values.replace('.','')
    values = values.replace(';','')
    values = values.replace(':','')
    values = values.replace(',','')
    values = values.replace("'","")
    values = values.replace('"','')
    values = values.replace('/',' ')
    values = values.replace('-',' ')
    values = values.replace('+',' ')
    values = values.replace('#',' ')
    values = values.replace('!','')
    values = values.replace('(',' ')
    values = values.replace(')',' ')
    values = values.replace('*',' ')
    values = values.replace('|',' ')
    values = values.replace('&',' and ')
    values = values.replace('@',' at ')
    data['name'] = values
    return data

def stem(a):
    # Stem every space-separated word of each line with the Porter stemmer.
    p = nltk.PorterStemmer()
    return [' '.join(p.stem(word) for word in line.split(' ')) for line in a]

def lem(a):
    # Lemmatize every space-separated word of each line, treating it as a verb.
    p = nltk.WordNetLemmatizer()
    return [' '.join(p.lemmatize(word, pos="v") for word in line.split(' ')) for line in a]

def tokenize(a):  
    b = []
    for line in a:
        b.append(word_tokenize(line))
                 
    return b

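def flatten(a):
    # Join each tokenized line back into a single string
    # (the original overwrote b on every iteration and returned only the last line).
    b = []
    for line in a:
        b.append(' '.join(line))
    
    return b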

def count_words(a):
    b=0
    for line in a:
        b = b + sum([i.strip(string.punctuation).isalpha() for i in line.split()])
        
    return b

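def generate_ngrams(text, n_gram=1, sw=None):
    # sw: collection of stop words to drop. It was undefined in the original;
    # defaulting to the NLTK English list is an assumption and requires the
    # 'stopwords' corpus to be available.
    if sw is None:
        sw = set(stopwords.words('english'))
    token = [token for token in text.lower().split(' ') if token != '' and token not in sw]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]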
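
The `zip` construction in `generate_ngrams` can be hard to read at first. The sketch below isolates it on a made-up listing title; the helper name `ngrams_from_tokens` and the sample text are illustrative and not part of the original notebook.

```python
def ngrams_from_tokens(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    # tokens[i:] is the token list shifted left by i positions. Zipping the
    # n shifted copies pairs each token with its n - 1 successors and stops
    # as soon as the shortest copy is exhausted.
    shifted = [tokens[i:] for i in range(n)]
    return [' '.join(group) for group in zip(*shifted)]


tokens = 'cozy room near downtown austin'.split(' ')  # made-up title
bigrams = ngrams_from_tokens(tokens, 2)
trigrams = ngrams_from_tokens(tokens, 3)
print(bigrams)
print(trigrams)
```

A title with fewer than `n` tokens simply yields an empty list, so no special case is needed for short names.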

Cleaning is applied to the variable 'name'. The results:

In [None]:
df_final_prep = df_final

df_final_prep = df_final_prep.apply(lower_column_t, axis=1)
df_final_prep = df_final_prep.apply(clean_interpunction, axis=1)
df_final_prep['name']=stem(df_final_prep.name)

df_final_prep

**2.2.Sequential Model Assembling**

As mentioned, the data is divided into train and test sets:

In [None]:
x_train,x_test,y_train,y_test = train_test_split(df_final_prep,Target,test_size=0.2,random_state=RandomState)

Applying a vocabulary counter to the train set gives the top ten words:

In [None]:
vocab = collections.Counter(' '.join(x_train['name']).split(' '))

vocab.most_common(10)
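
To make the counting step concrete, here is a minimal sketch on three made-up, already-stemmed names (the sample strings and the variable names `toy_vocab`, `top_two` are illustrative only). It also shows one thing to keep in mind: `split(' ')` keeps empty strings when two spaces are adjacent, which can happen after punctuation has been replaced by spaces, whereas `split()` without an argument does not.

```python
import collections

names = ['cozi room near downtown', 'sunni room with view', 'downtown loft']

# Same pattern as in the notebook: join all names, split on spaces, count.
toy_vocab = collections.Counter(' '.join(names).split(' '))
top_two = toy_vocab.most_common(2)
print(top_two)

# A double space produces an empty-string token with split(' ') ...
with_gap = 'loft  downtown'.split(' ')
# ... while split() collapses any run of whitespace.
without_gap = 'loft  downtown'.split()
print(with_gap, without_gap)
```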

And the number of distinct words in the vocabulary:

In [None]:
MAX_LENGTH = max(x_train['name'].apply(lambda x: len(x)))
VOCAB_SIZE = len(vocab.keys())
VECTOR_SPACE = 100
VOCAB_SIZE

I apply the sequential model as in the reference notebook, but fit it on the train set only, keeping the same evaluation set size (the first 1000 training rows).

In [None]:
encoded_docs = [tf.keras.preprocessing.text.one_hot(d,VOCAB_SIZE) for d in x_train.name]

padded_docs = tf.keras.preprocessing.sequence.pad_sequences(encoded_docs,maxlen=MAX_LENGTH,padding='post')

n = 1000

padded_docs_eval = padded_docs[0:n]
padded_docs = padded_docs[n:]
Y = y_train[n:]
Y_eval = y_train[:n]

FCNN_MODEL = Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE,VECTOR_SPACE,input_length=MAX_LENGTH),
    tf.keras.layers.Flatten(),
    Dense(activation='relu',units=5),
    Dense(activation='relu',units=1)
])

FCNN_MODEL.compile(optimizer='adam', loss='mse', metrics=['mae'])

tf.keras.utils.plot_model(FCNN_MODEL,show_shapes=True)
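
Conceptually, the encoding step turns each name into a list of integers obtained by hashing its words, and then pads every list with zeros to a common length so that the embedding layer receives a rectangular input. The sketch below imitates this in plain Python under simplifying assumptions: `hash_encode` and `pad_post` are hypothetical helpers, MD5 is used only to get a reproducible hash, and the code is not meant to reproduce the exact integers that the Keras utilities produce.

```python
import hashlib


def hash_encode(text, n_buckets):
    """Map each whitespace-separated word to an integer in 1..n_buckets-1."""
    codes = []
    for word in text.split():
        digest = hashlib.md5(word.encode('utf-8')).hexdigest()
        # Index 0 is kept free so that it can serve as the padding value.
        codes.append(int(digest, 16) % (n_buckets - 1) + 1)
    return codes


def pad_post(sequences, maxlen):
    """Keep at most maxlen entries per sequence and right-pad with zeros."""
    padded = []
    for seq in sequences:
        kept = seq[:maxlen]
        padded.append(kept + [0] * (maxlen - len(kept)))
    return padded


toy_names = ['cozi room', 'sunni room with view']  # made-up examples
encoded = [hash_encode(name, 50) for name in toy_names]
padded = pad_post(encoded, 6)
print(padded)
```

Because words are hashed into a fixed number of buckets, two different words can share the same integer. The same word always receives the same integer, which is what lets the embedding layer learn something about it.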

The model is fitted with 4 epochs and a batch size of 75. This can be tuned further in your own version of this notebook.

In [None]:
history = FCNN_MODEL.fit(padded_docs, Y,validation_data=(padded_docs_eval,Y_eval), epochs=4, batch_size=75)

And the predictions for the train data set:

In [None]:
encoded_docs = [tf.keras.preprocessing.text.one_hot(d,VOCAB_SIZE) for d in x_train.name]

dosc_prep = tf.keras.preprocessing.sequence.pad_sequences(encoded_docs,maxlen=MAX_LENGTH,padding='post')

predictions = FCNN_MODEL.predict(dosc_prep)
predictions = predictions.reshape(-1)
predictions

This is the enhanced train data set. A dummy transformation was applied to the state and room type variables, and all numeric variables were kept. I think the most interesting factor here is the variable 'Name predicted': the value predicted by the Sequential model from the string variable 'name' alone. I added it to the data set:

In [None]:
df_train_2 = x_train

FactorsToDrop = ['name','last_review','city']

df_train_2 = df_train_2.drop(columns = FactorsToDrop)
df_train_2 = pd.get_dummies(df_train_2)
df_train_2.insert(0, "Actual value", y_train, True)
df_train_2.insert(1, "Name predicted", predictions, True)
df_train_2

**2.3.Gradient Boosting Machine**

The next step is to fit two Gradient Boosting regressors: one on the data set including 'Name predicted', the other without it. I will compare how they perform:

In [None]:
RF_withName = xgb.XGBRegressor(random_state=RandomState)
RF_withoutName = xgb.XGBRegressor(random_state=RandomState)

RF_withName_fit = RF_withName.fit(df_train_2.drop(columns = 'Actual value'),df_train_2['Actual value'])
RF_withoutName_fit = RF_withoutName.fit(df_train_2.drop(columns = ['Actual value','Name predicted']),df_train_2['Actual value'])

The table with resulting predictions on the test data set:

In [None]:
encoded_docs = [tf.keras.preprocessing.text.one_hot(d,VOCAB_SIZE) for d in x_test.name]

dosc_prep = tf.keras.preprocessing.sequence.pad_sequences(encoded_docs,maxlen=MAX_LENGTH,padding='post')

predictions = FCNN_MODEL.predict(dosc_prep)
predictions = predictions.reshape(-1)

x_test_2 = x_test

FactorsToDrop = ['name','last_review','city']

x_test_2 = x_test_2.drop(columns = FactorsToDrop)
x_test_2 = pd.get_dummies(x_test_2)
x_test_2.insert(0, "Name predicted", predictions, True)

Results = pd.DataFrame({'Prediction only with Name':predictions,'Prediction RF with Name':RF_withName_fit.predict(x_test_2),'Prediction RF without Name':RF_withoutName_fit.predict(x_test_2.drop(columns = 'Name predicted'))})
Results.insert(0, "Actual value", y_test.values, True)
Results

**2.4.Root Mean Square Error**

Let's look at RMSE:

In [None]:
def f_rmse(predictions, targets):
    return np.sqrt(np.mean((predictions-targets)**2))

Model_Average_RMSE =     f_rmse(Results['Actual value'], y_train.mean())   
OnlyName_RMSE =      f_rmse(Results['Actual value'].values, Results['Prediction only with Name'].values)    
RF_withName_RMSE =      f_rmse(Results['Actual value'].values, Results['Prediction RF with Name'].values)    
RF_withoutName_RMSE =      f_rmse(Results['Actual value'].values, Results['Prediction RF without Name'].values)  

Final = pd.DataFrame({'RMSE': [Model_Average_RMSE,OnlyName_RMSE,RF_withName_RMSE,RF_withoutName_RMSE],'Name': ['Model_Average','OnlyName','GBM_withName','GBM_withoutName']})

plt.plot(Final['Name'],Final['RMSE'])
plt.ylabel('RMSE results')
plt.show()

The model average (predicting the mean training price for every flat) is the worst, which is no surprise but a good sanity check. If that were not the case, our predictions would be worse than a constant. Using only the name factor already improves the RMSE by approximately 30%, and GBM without 'name' performs slightly better than 'name' alone. The best is GBM with name, which leads to more than 40% improvement. This is quite okay, but as I said, both the Sequential model and the GBM can be improved a lot by tuning, as neither of them was calibrated. You can try it yourself.
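
The improvement percentages above can be read as the relative reduction in RMSE against the mean baseline. A minimal sketch of that arithmetic, with made-up prices and a hypothetical helper `relative_improvement` (the numbers are not taken from the notebook's results):

```python
import math


def rmse(predictions, targets):
    """Root mean square error of two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse, model_rmse):
    """Percentage by which model_rmse is lower than baseline_rmse."""
    return (baseline_rmse - model_rmse) / baseline_rmse * 100


actual = [100.0, 200.0, 300.0]          # made-up prices
mean_baseline = [200.0, 200.0, 200.0]   # constant prediction
model = [130.0, 190.0, 260.0]           # made-up model output

baseline_rmse = rmse(mean_baseline, actual)
model_rmse = rmse(model, actual)
improvement = relative_improvement(baseline_rmse, model_rmse)
print(round(baseline_rmse, 2), round(model_rmse, 2), round(improvement, 1))
```

A value of 0 means the model is no better than the constant baseline, and a negative value means it is worse.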

# 3.Shapley values

I will now focus on the central point of this notebook: the Shapley approach. If you are looking for some theoretical background, it follows right away. Otherwise, just jump to the results.

**3.1.Theoretical background**

What are Shapley values? The Shapley value is a solution concept in cooperative game theory. To each cooperative game it assigns a unique distribution (among the players) of a total surplus generated by the coalition of all players. The Shapley value is characterized by a collection of desirable properties.

Let's assume that a coalitional game is defined by:
* a set of players $N$
* a function $v$ which maps subsets of players to real numbers, $v: 2^N \rightarrow \mathbb{R}$, with $v(\emptyset) = 0$
* a coalition of players $S$ (a subset of $N$)

The value $v(S)$, called the worth of coalition $S$, describes the total expected sum of payoffs the members of $S$ can obtain by cooperation. The Shapley value is one way to distribute the total gains to the players, assuming that they all collaborate.

It is given by the following formula:

$\phi_i(v)=\frac1{|N|!}\sum_{S\subseteq N\setminus\{i\}}|S|!(|N|-|S|-1)!\left(v(S\cup\{i\})-v(S)\right)$

which can be better understood as:

$\phi_i(v) = \frac{1}{\text{number of players}} \sum_{\text{coalitions excluding } i} \frac{\text{marginal contribution of } i \text{ to coalition}}{\text{number of coalitions excluding } i \text{ of this size}}$
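
As a small worked case, take two players, $N = \{1, 2\}$. Only two coalitions exclude player 1, namely $\emptyset$ and $\{2\}$, and each gets the weight $\frac{0!\,1!}{2!} = \frac{1!\,0!}{2!} = \frac{1}{2}$:

$\phi_1(v) = \frac{1}{2}\left(v(\{1\}) - v(\emptyset)\right) + \frac{1}{2}\left(v(\{1,2\}) - v(\{2\})\right)$

$\phi_2(v) = \frac{1}{2}\left(v(\{2\}) - v(\emptyset)\right) + \frac{1}{2}\left(v(\{1,2\}) - v(\{1\})\right)$

Adding the two lines, the single-player terms cancel and $\phi_1(v) + \phi_2(v) = v(\{1,2\}) - v(\emptyset)$, so the whole surplus is distributed. With worths made up purely for illustration, $v(\emptyset) = 0$, $v(\{1\}) = 10$, $v(\{2\}) = 20$, $v(\{1,2\}) = 50$:

$\phi_1(v) = \frac{1}{2} \cdot 10 + \frac{1}{2} \cdot 30 = 20, \qquad \phi_2(v) = \frac{1}{2} \cdot 20 + \frac{1}{2} \cdot 40 = 30$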

Let's break it down:

For example, take an apartment located in Texas, 50 m², with a low number of reviews. Let's assume the predicted price equals 100 US dollars, while the average prediction for all apartments in the market is 120. We would like to understand how much each feature contributed to this difference.

This is where the Shapley approach applies directly. The "game" is the prediction task for a single instance of the dataset. The "gain" is the actual prediction for this instance minus the average prediction for all instances. The "players" are the feature values of the instance that collaborate to receive the gain (= predict a certain value).

The algorithm then evaluates the prediction with and without each feature value across the possible coalitions. It might, for example, single out the fact that the flat is in Texas instead of New York as the biggest driver of the difference. It can also attribute the difference to the flat's size or to the low number of reviews.

How to calculate it in practice?

The Shapley value is the average marginal contribution of a feature value across all possible coalitions. [More granular description](https://christophm.github.io/interpretable-ml-book/shapley.html).
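
As a minimal sketch of this calculation, the code below evaluates the Shapley formula by brute force for a toy three-player game. The function `shapley_values`, the "glove game" worth function and the player names are illustrative assumptions, not part of the shap package: players A and B each own a left glove, player C owns a right glove, and a coalition is worth 1 only if it can form a pair.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, worth):
    """Exact Shapley values; worth maps a frozenset of players to a number."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|N| - |S| - 1)! / |N|! for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (worth(s | {i}) - worth(s))
        values[i] = total
    return values


def glove_worth(coalition):
    """1.0 if the coalition holds a left glove (A or B) and the right glove (C)."""
    has_left = 'A' in coalition or 'B' in coalition
    return 1.0 if has_left and 'C' in coalition else 0.0


phi = shapley_values(['A', 'B', 'C'], glove_worth)
print(phi)
```

Player C, who is needed in every winning coalition, receives 2/3 and A and B receive 1/6 each. The inner loops visit every coalition that excludes the player, so the cost doubles with each additional player, which is why exact computation is only practical for a handful of players.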

What is SHAP?

The goal of SHAP is to explain the prediction of an instance x by computing the contribution of each feature to the prediction. The SHAP explanation method computes Shapley values from coalitional game theory. The feature values of a data instance act as players in a coalition. Shapley values tell us how to fairly distribute the "payout" (= the prediction) among the features. A player can be an individual feature value, e.g. for tabular data. 
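
To connect this with the apartment example, here is a minimal sketch that treats feature values as players. Everything in it is a labeled assumption: `toy_model` is a made-up pricing rule, the reference flat in `baseline` stands in for "the average prediction", and a missing feature is simulated by swapping in the baseline value. The shap package handles missing features in a more careful way, so this only illustrates the idea. The Shapley value is computed here in its equivalent "average over all orderings" form.

```python
from itertools import permutations


def toy_model(state, size_m2, few_reviews):
    """Made-up pricing rule with one interaction term."""
    price = 120.0
    if state == 'TX':
        price -= 10.0
    price += 0.5 * (size_m2 - 70)
    if few_reviews:
        price -= 4.0
    if state == 'TX' and few_reviews:
        price += 4.0  # interaction: few reviews matter less in TX
    return price


instance = {'state': 'TX', 'size_m2': 50, 'few_reviews': True}
baseline = {'state': 'NY', 'size_m2': 70, 'few_reviews': False}


def coalition_value(present):
    """Prediction when only the features in `present` take the instance's values."""
    mixed = {name: (instance[name] if name in present else baseline[name])
             for name in instance}
    return toy_model(**mixed)


features = list(instance)
orderings = list(permutations(features))
contributions = {name: 0.0 for name in features}
for order in orderings:
    present = set()
    for name in order:
        before = coalition_value(present)
        present.add(name)
        # Marginal contribution of `name` given the features added before it.
        contributions[name] += (coalition_value(present) - before) / len(orderings)

gap = toy_model(**instance) - toy_model(**baseline)
print(contributions, gap)
```

Under these made-up numbers the attributions come out at about -8 for the state, -10 for the size and -2 for the review count. They add up to the gap of -20 between the flat's prediction (100) and the baseline prediction (120), which is the "fair distribution of the payout" described above.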

**3.2.Results**

I will look at some flats from our Airbnb data set and analyze what the algorithm thinks about them. As a reminder in case you skipped chapters 1 and 2, we have two models:
* Sequential Model Assembling only on 'name' + Gradient Boosting Machine on all data
* Gradient Boosting Machine on all data (without the name prediction)

Functions from the shap package are quite heavy computation-wise, so I take just 5000 flats from the test set with their predictions:

In [None]:
shp_df = x_test_2.sample(n=5000, replace=False, weights=None, random_state=RandomState)
shp_df_2 = shp_df.drop(columns = ['Name predicted'])

Here, I define:
* the tree explainer (TreeSHAP): Lundberg et al. proposed TreeSHAP in 2018 as a variant of SHAP for tree-based machine learning models such as decision trees, random forests and gradient boosted trees. TreeSHAP was introduced as a fast, model-specific alternative to KernelSHAP, but it turned out that it can produce unintuitive feature attributions.
* the SHAP values computed with this explainer

In [None]:
# The `feature_dependence` argument was renamed to `feature_perturbation` in shap;
# "tree_path_dependent" is the option that needs no background data set.
explainer_withName = shap.TreeExplainer(RF_withName_fit, feature_perturbation="tree_path_dependent")
shap_values_withName = explainer_withName.shap_values(shp_df)

explainer_withoutName = shap.TreeExplainer(RF_withoutName_fit, feature_perturbation="tree_path_dependent")
shap_values_withoutName = explainer_withoutName.shap_values(shp_df_2)

Let's look at some statistics from the test data set. It will be interesting to see how the records I chose compare against the overall values.

In [None]:
x_test_2[['Name predicted','latitude','longitude', 'minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']].describe()

I choose flat 505 from the randomly drawn sample of the test set.

In [None]:
i = 505

id_505 = shp_df.index[i]  # listing id at position i of the sample

print("The index of this first guy: " + format(id_505))

We have the flat '20 minutes' from Manhattan. The room is in New Jersey. It is just a room, with a small number of reviews (the average is 34). But the flat is trending, as it receives around 2.5 reviews monthly versus a 75% quantile of 1.62. The price is very low: 60 US dollars.

In [None]:
df.loc[id_505]

The second flat is located in Austin, Texas.

In [None]:
j = 515

id_515 = shp_df.index[j]  # listing id at position j of the sample

print("The index of the second guy: " + format(id_515))

As mentioned in the title, it is a 'new listing', also called a 'peaceful retreat'. It is an entire home/apartment, with not many reviews overall and not many recent ones. The price is high: 378 $.

In [None]:
df.loc[id_515]

Results for New Jersey guy using GBM with Name:

In [None]:
shap.force_plot(explainer_withName.expected_value, shap_values_withName[i], features=shp_df.iloc[i], feature_names=shp_df.columns)

Results for New Jersey guy using GBM without Name:

In [None]:
shap.force_plot(explainer_withoutName.expected_value, shap_values_withoutName[i], features=shp_df_2.iloc[i], feature_names=shp_df_2.columns)

The price estimates differ by only 5$. In the first model the 'name' factor plays a great role, lowering the price by more than 50 dollars! It is interesting that a title with 'Manhattan' led to a lower price. Probably the word 'room' lowers it a lot. Indeed, room type plays a big role in both estimations, but in the first case it lowers the price and in the second it increases it. This seems to confirm my assumption. In contrast, reviews per month lower the price in both models.

Time for Texas guy:

First, the model with name on board. There are multiple increasing factors. 'Name' leads to an increase of approx. 320 dollars, a vast amount:

In [None]:
shap.force_plot(explainer_withName.expected_value, shap_values_withName[j], features=shp_df.iloc[j], feature_names=shp_df.columns)

Without name, the value is significantly lower (110$ lower), probably because the information from the name variable is missing:

In [None]:
shap.force_plot(explainer_withoutName.expected_value, shap_values_withoutName[j], features=shp_df_2.iloc[j], feature_names=shp_df_2.columns)

Let's look at 5 consecutive records (positions 505 to 509) for the model with name:

In [None]:
i = 505
j = 510

shap.force_plot(explainer_withName.expected_value, shap_values_withName[i:j], features=shp_df.iloc[i:j], feature_names=shp_df.columns)

And the same 5 records for the model without name:

In [None]:
i = 505
j = 510

shap.force_plot(explainer_withoutName.expected_value, shap_values_withoutName[i:j], features=shp_df_2.iloc[i:j], feature_names=shp_df_2.columns)

Another interesting graph is the SHAP summary plot of the most important factors. By far the most relevant is our 'name' factor:

In [None]:
shap.summary_plot(shap_values_withName, features=shp_df, feature_names=shp_df.columns)

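The SHAP dependence plot shows two factors at once: the chosen feature on the x-axis and, by color, the feature it appears to interact with most: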

In [None]:
shap.dependence_plot('Name predicted', shap_values_withName, shp_df)

In [None]:
shap.dependence_plot('number_of_reviews', shap_values_withName, shp_df)

If you think that this notebook helped a bit in SHAPing your knowledge, please upvote it.