# Tabular data: post-hoc explanations with the German Credit dataset

In [91]:
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from xailib.data_loaders.dataframe_loader import prepare_dataframe

from xailib.explainers.lime_explainer import LimeXAITabularExplainer
from xailib.explainers.lore_explainer import LoreTabularExplainer
from xailib.explainers.shap_explainer_tab import ShapXAITabularExplainer

from xailib.models.sklearn_classifier_wrapper import sklearn_classifier_wrapper

# Learning and explaining the German Credit dataset

## Loading and preparation of data

In this notebook we use the German Credit dataset to train a machine learning model, and then explain its predictions with different explainers.

The German Credit dataset classifies people as high or low credit risk. It is a small dataset that describes each person through several features, such as age and job; most of these features are categorical.

We start by reading the dataset to analyze from a CSV file. The table is loaded into a ```DataFrame``` from the ```pandas``` library.

Among all the attributes of the table, we select the ```default``` column, which contains the observed class for each row.

In [92]:
source_file = 'datasets/german_credit.csv'
class_field = 'default'
# Load and transform dataset 
df = pd.read_csv(source_file, skipinitialspace=True, na_values='?', keep_default_na=True)
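
To see what the ```read_csv``` options above do, here is a small self-contained sketch on an in-memory CSV. The three rows are made up for illustration and are not taken from the German Credit file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; not real rows from german_credit.csv.
raw = io.StringIO(
    "default, age, savings\n"
    "0, 35, ... < 100 DM\n"
    "1, ?, unknown/ no savings account\n"
    "0, 52, ?\n"
)

# skipinitialspace drops the blank after each comma;
# na_values='?' marks '?' entries as missing.
sample = pd.read_csv(raw, skipinitialspace=True, na_values='?', keep_default_na=True)

print(list(sample.columns))   # column names without leading spaces
print(sample.isna().sum())    # number of missing values per column
```

If the real file used ```?``` as a placeholder anywhere, those cells would show up as missing values in the same way.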

The first step is to inspect the dataset. We can do so with the ```.info()``` method, as well as with plots that visualize the distributions.

In [93]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   default                     1000 non-null   int64 
 1   account_check_status        1000 non-null   object
 2   duration_in_month           1000 non-null   int64 
 3   credit_history              1000 non-null   object
 4   purpose                     1000 non-null   object
 5   credit_amount               1000 non-null   int64 
 6   savings                     1000 non-null   object
 7   present_emp_since           1000 non-null   object
 8   installment_as_income_perc  1000 non-null   int64 
 9   personal_status_sex         1000 non-null   object
 10  other_debtors               1000 non-null   object
 11  present_res_since           1000 non-null   int64 
 12  property                    1000 non-null   object
 13  age                         1000 non-null   int64

The output of ```.info()``` shows that this dataset does not contain any null values. We can also see that it has many categorical variables, which we have to handle.

After the data is loaded in memory, we need to extract metadata so that the content of the table can be handled automatically.

The method ```prepare_dataframe``` scans the table and extracts the following information:
 * ```df```: a transformed version of the original dataframe, where discrete attributes are converted into numerical attributes with a one-hot encoding strategy;
 * ```feature_names```: a list containing the names of the features after the transformation;
 * ```class_values```: the list of all the possible values for the ```class_field``` column;
 * ```numeric_columns```: a list of the original features that contain numeric (i.e. continuous) values;
 * ```rdf```: the original dataframe, before the transformation;
 * ```real_feature_names```: the list of the features of the dataframe before the transformation;
 * ```features_map```: a dictionary that maps each feature to the original one before the transformation.
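
The kind of transformation described above can be sketched with plain ```pandas```. This is only an illustration of one-hot encoding and of a feature map. It assumes a ```column=value``` naming scheme like the one visible in the LIME output later in this notebook, and it is not the actual implementation of ```prepare_dataframe```.

```python
import pandas as pd

# Toy table with one numeric and one categorical feature (illustrative values).
toy = pd.DataFrame({
    'default': [0, 1, 0],
    'age': [35, 22, 52],
    'purpose': ['car (used)', 'education', 'car (used)'],
})

class_field = 'default'
# Treat every non-numeric column (except the class) as categorical.
categorical = [c for c in toy.columns
               if c != class_field and not pd.api.types.is_numeric_dtype(toy[c])]

# One-hot encode the categorical columns, naming new columns "<column>=<value>".
encoded = pd.get_dummies(toy, columns=categorical, prefix_sep='=', dtype=int)

feature_names = [c for c in encoded.columns if c != class_field]
# Map each encoded feature back to the original column it came from.
features_map = {name: name.split('=', 1)[0] for name in feature_names}

print(feature_names)
print(features_map)
```

Here the single ```purpose``` column becomes one 0/1 column per observed value, which is why the encoded table has more features than the original one.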

In [94]:
df, feature_names, class_values, numeric_columns, rdf, real_feature_names, features_map = prepare_dataframe(df, class_field)

### Learning a Random Forest classifier

We train a Random Forest (RF) classifier with the ```sklearn``` library. We start by splitting the dataset into training and test subsets.

In [95]:
test_size = 0.3
random_state = 42
X_train, X_test, Y_train, Y_test = train_test_split(df[feature_names], df[class_field],
                                                        test_size=test_size,
                                                        random_state=random_state,
                                                        stratify=df[class_field])
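
The effect of the ```stratify``` argument can be checked on synthetic labels. The sketch below assumes a 70/30 class balance purely for illustration; the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/30 class balance (illustrative only).
y = np.array([0] * 700 + [1] * 300)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# With stratify=y, both subsets keep the original share of class 1.
print(y_tr.mean(), y_te.mean())
```

Without stratification the class shares of the two subsets would vary with the random seed, which matters for a small, imbalanced dataset.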



Then we train the model on the training set.
Once the model has been trained, we use a wrapper class to make it accessible to ```XAI lib```.

In [96]:
bb = RandomForestClassifier(n_estimators=20, random_state=random_state)
bb.fit(X_train.values, Y_train.values)
bbox = sklearn_classifier_wrapper(bb)   

In [97]:
# the model was fitted on a NumPy array, so predict on one as well
Y_pred = bb.predict(X_test.values)
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.76      0.90      0.82       210
           1       0.57      0.32      0.41        90

    accuracy                           0.72       300
   macro avg       0.66      0.61      0.62       300
weighted avg       0.70      0.72      0.70       300
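
As a quick aid for reading the report, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it for class 1 from the rounded values printed above, so in general small rounding differences from the report are possible.

```python
# Precision and recall for class 1, copied from the report above.
precision, recall = 0.57, 0.32

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.41
```

A recall of 0.32 on a support of 90 means that the model recovers only about 29 of the class 1 test instances, which is worth keeping in mind when reading the explanations that follow.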



Select an instance to be classified by the model (here, a row of the training set) and print the predicted class.

In [98]:
inst = X_train.iloc[147].values
print('Instance ',inst)
print('True class ',Y_train.iloc[147])  # same row index as inst
print('Predicted class ',bb.predict(inst.reshape(1, -1)))

Instance  [ 15 975   2   3  25   2   1   0   1   0   0   0   1   0   0   0   0   0
   0   0   0   0   0   1   0   0   0   1   0   0   0   0   0   1   0   0
   0   1   0   0   0   0   1   1   0   0   0   0   1   0   0   1   0   0
   1   0   0   1   0   0   1]
True class  0
Predicted class  [0]


## Explaining the prediction
We use the explainers of ```XAI lib``` to provide an explanation for the classified instance ```inst```.
Every explainer of ```XAI lib``` takes as input the black box to be explained, together with the corresponding feature names, and a configuration object that initializes the explainer.

### SHAP explainer
SHAP provides several explanation methods: only one of them (kernel) is model-agnostic, while the others are tailored to specific kinds of machine learning models.
In ```config``` you can set the explainer to use (linear, kernel, tree, etc.) and the background data to pass to it (the whole training set, a subset of it, or the centroids of a clustering algorithm).
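
As a sketch of the clustering option, the background data can be summarized by k-means centroids before it is passed in ```config```. The random matrix below stands in for ```X_train.values```, and the choice of 10 clusters is arbitrary. How ```XAI lib``` consumes the centroids internally is not shown here, so treat the ```config``` line as an assumption based on the description above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_background = rng.random((200, 5))   # stand-in for X_train.values

# Summarize the background data with 10 cluster centroids.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_background)
centroids = kmeans.cluster_centers_

# Same shape as the config used in this notebook; 'kernel' is the
# model-agnostic explainer mentioned in the text.
config = {'explainer': 'kernel', 'X_train': centroids}
print(centroids.shape)
```

A small set of centroids keeps the model-agnostic explainer affordable, since its cost grows with the number of background rows.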

In [99]:
explainer = ShapXAITabularExplainer(bbox, feature_names)
config = {'explainer' : 'tree', 'X_train' : X_train.iloc[0:100].values}
explainer.fit(config)

In [100]:
exp = explainer.explain(inst)
# print(exp.exp)

In [101]:
exp.plot_features_importance()

### LORE explainer
As with SHAP, the parameters of the explanation method are defined in ```config```. For LORE, you can set:
1. the neighborhood generation strategy ('rndgen', 'random', 'genetic', 'geneticp', 'rndgenp');
2. the size of the neighborhood to generate;
3. the number of neighborhood generations to run;
4. other parameters, specific to the different kinds of neighborhood generation.
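
To give an intuition for what the neighborhood is used for, the following is a simplified sketch of a local surrogate: sample points around the instance, label them with the black box, and fit a decision tree whose root-to-leaf paths can be read as rules. The synthetic data, the Gaussian perturbation and the names ```black_box```, ```neigh``` and ```surrogate``` are assumptions made for illustration. This is neither LORE's genetic generation nor the ```XAI lib``` implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Synthetic data and black box (illustrative only).
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
black_box = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

x = X[0]      # instance to explain
size = 1000   # neighborhood size, analogous to 'size' in the config

# Random neighborhood: Gaussian perturbations of the instance, plus the instance itself.
neigh = np.vstack([x, x + rng.normal(scale=0.5, size=(size, 3))])
neigh_labels = black_box.predict(neigh)

# Interpretable surrogate trained on the neighborhood labeled by the black box.
surrogate = DecisionTreeClassifier(random_state=0).fit(neigh, neigh_labels)
print(export_text(surrogate, feature_names=['f0', 'f1', 'f2'], max_depth=2))
```

The path that the instance follows in the surrogate tree plays the role of a factual rule, while paths that lead to the other class suggest counterfactual changes.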

In [111]:
explainer = LoreTabularExplainer(bbox)
config = {'neigh_type':'rndgen', 'size':1000, 'ocr':0.1, 'ngen':15}
explainer.fit(df, class_field, config)
exp = explainer.explain(inst)
print(exp)

<xailib.explainers.lore_explainer.LoreTabularExplanation object at 0x7f983b214d10>


In [112]:
exp.plotRules()

In [113]:
exp.plotCounterfactualRules()

### LIME explainer

In [114]:
limeExplainer = LimeXAITabularExplainer(bbox)
config = {'feature_selection': 'lasso_path'}
limeExplainer.fit(df, class_field, config)
lime_exp = limeExplainer.explain(inst)
print(lime_exp.exp.as_list())

[('duration_in_month', 0.037705948024633694), ('account_check_status=< 0 DM', 0.03582536256475303), ('account_check_status=no checking account', -0.03510634218570146), ('credit_history=critical account/ other credits existing (not at this bank)', -0.03035451539595378), ('age', -0.02482164713361578), ('savings=... < 100 DM', 0.024151438610680913), ('other_installment_plans=none', -0.019875790012019456), ('present_emp_since=... < 1 year ', 0.01867940374862871), ('account_check_status=0 <= ... < 200 DM', 0.01785842620633365), ('credit_amount', 0.0153082061555472)]
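
The list above is easier to read once it is split by sign and ranked by absolute weight. The sketch below does this for a few entries copied from the output. The sign is relative to the class being explained, and which class that is depends on how the explainer was configured.

```python
# Selected (feature, weight) pairs copied from the LIME output above.
lime_weights = [
    ('duration_in_month', 0.037705948024633694),
    ('account_check_status=< 0 DM', 0.03582536256475303),
    ('account_check_status=no checking account', -0.03510634218570146),
    ('age', -0.02482164713361578),
]

# Split by the sign of the weight in the local surrogate model.
positive = [name for name, w in lime_weights if w > 0]
negative = [name for name, w in lime_weights if w < 0]

# Rank by absolute weight to see which features matter most locally.
ranked = sorted(lime_weights, key=lambda fw: abs(fw[1]), reverse=True)
for name, w in ranked:
    print(f'{name:45s} {w:+.4f}')
```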


In [115]:
# limeExplainer.plot_lime_values(lime_exp.as_list(), 5, 10)
lime_exp.plot_features_importance()

## Learning a different model

### Learning a Logistic Regression classifier

We train a Logistic Regression model with the ```sklearn``` library. We first transform the dataset with a ```MinMaxScaler```, which rescales every attribute to the [0, 1] range.
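
As a minimal worked example of min-max scaling, each value is mapped to (x - min) / (max - min), where min and max are learned per column from the data passed to ```fit```. The numbers below are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values for two numeric features.
train = np.array([[19.0, 250.0], [47.0, 1250.0], [75.0, 2250.0]])
new_row = np.array([[33.0, 750.0]])

scaler = MinMaxScaler().fit(train)        # learns per-column min and max
scaled_train = scaler.transform(train)    # (x - min) / (max - min)
scaled_new = scaler.transform(new_row)    # reuses the training min and max

print(scaled_train)
print(scaled_new)
```

Any record passed to the trained model afterwards has to go through the same fitted scaler. One-hot columns in which both 0 and 1 occur are left unchanged by this transformation.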


In [116]:
scaler = preprocessing.MinMaxScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

bb = LogisticRegression(C=0.5, penalty='l2', solver='sag', max_iter=500, intercept_scaling=0.4 )
bb.fit(X_scaled, Y_train.values)
# pass the model to the wrapper to use it in the XAI lib
bbox = sklearn_classifier_wrapper(bb)

In [117]:
# select a record to explain
inst = X_scaled[182]
print('Instance ',inst)
print('Predicted class ',bb.predict(inst.reshape(1, -1)))

Instance  [0.78571429 0.75805728 1.         1.         0.30357143 0.
 0.         1.         0.         0.         0.         1.
 0.         0.         0.         0.         0.         0.
 1.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 1.         0.         0.         0.         0.         1.
 0.         0.         0.         1.         1.         0.
 0.         0.         0.         0.         1.         0.
 1.         0.         1.         0.         0.         1.
 0.         0.         0.         0.         1.         0.
 1.        ]
Predicted class  [1]


## Explaining the prediction
We use the same explainers as for the previous model. In this case, a few adjustments are needed when initializing them. For example, SHAP needs a specific configuration for the linear model we are using.
### SHAP Explainer

In [118]:
explainer = ShapXAITabularExplainer(bbox, feature_names)
config = {'explainer' : 'linear', 'X_train' : X_scaled[0:100], 'feature_pert' : 'interventional'}
explainer.fit(config)
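
For a linear model with features treated as independent, the SHAP value of a feature reduces to its coefficient times the distance of the feature value from the background mean, expressed in log-odds for logistic regression. The sketch below checks this on synthetic data with plain ```sklearn``` and ```numpy```. It illustrates the formula and makes no claim about what ```XAI lib``` computes internally.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 4))                     # synthetic features in [0, 1]
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # synthetic labels

model = LogisticRegression().fit(X, y)
background = X[:100]                         # background sample
x = X[150]                                   # instance to explain

# Attribution of each feature in log-odds space, assuming independent features.
phi = model.coef_[0] * (x - background.mean(axis=0))
# Base value: average model output (log-odds) over the background data.
base_value = model.decision_function(background).mean()

# Attributions plus base value reconstruct the model output for x.
print(base_value + phi.sum(), model.decision_function(x.reshape(1, -1))[0])
```

The two printed numbers agree up to floating-point error, which is the additivity property that the feature importance plot relies on.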

In [119]:
exp = explainer.explain(inst)
print(exp)

<xailib.explainers.shap_explainer_tab.ShapXAITabularExplanation object at 0x7f983b837a50>


In [120]:
exp.plot_features_importance()

### LORE explainer

In [121]:
explainer = LoreTabularExplainer(bbox)
config = {'neigh_type':'geneticp', 'size':1000, 'ocr':0.1, 'ngen':10}
explainer.fit(df, class_field, config)
exp = explainer.explain(inst)
print(exp)

<xailib.explainers.lore_explainer.LoreTabularExplanation object at 0x7f983b82d910>


In [122]:
exp.plotRules()

In [123]:
exp.plotCounterfactualRules()

### LIME explainer
LIME also has some parameters you can set. In particular, ```feature_selection``` can be 'lasso_path', 'forward_selection', 'auto' or 'none'.

In [124]:
limeExplainer = LimeXAITabularExplainer(bbox)
config = {'feature_selection': 'lasso_path'}
limeExplainer.fit(df, class_field, config)
lime_exp = limeExplainer.explain(inst)
print(lime_exp.exp.as_list())

[('credit_amount', 0.22754138576339772), ('people_under_maintenance', -0.027117635540170257), ('duration_in_month', 0.022674246194392393), ('personal_status_sex=male : single', 0.020241483432918494), ('present_res_since', 0.016735059112425545), ('foreign_worker=yes', -0.016647768623923757), ('purpose=(vacation - does not exist?)', 0.01619588340323919), ('job=management/ self-employed/ highly qualified employee/ officer', 0.014349312012875655), ('purpose=car (used)', -0.01317583542986617), ('job=unskilled - resident', 0.01293136897938979)]


In [125]:
lime_exp.plot_features_importance()