# Getting Started with tabular data!
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/understandable-machine-intelligence-lab/Quantus/main?labpath=tutorials%2FTutorial_Getting_Started_with_Tabular_Data.ipynb)


This notebook shows how to get started with Quantus using tabular data. For this purpose, we use the classic Titanic tabular dataset (Frank E. Harrell Jr., Thomas Cason):

https://www.openml.org/d/40945

The model in this notebook is taken from "Getting started with Captum - Titanic Data Analysis" provided by Captum:

https://captum.ai/tutorials/Titanic_Basic_Interpret

In [1]:
from IPython.display import clear_output

In [36]:
!pip install quantus torch captum tensorflow-datasets
 
clear_output()

In [2]:
import pathlib
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
from sklearn.model_selection import train_test_split

import quantus
from captum.attr import IntegratedGradients

import torch
import torch.nn as nn
torch.manual_seed(27)

clear_output()

np.random.seed(27)

## 1) Preliminaries

### 1.1 Load datasets

We load the dataset using the tensorflow-datasets library. Alternatively, it can be downloaded directly from the OpenML website: https://www.openml.org/d/40945

In [3]:
# Load datasets
(ds, _), ds_info = tfds.load(
    'titanic',
    split=["train", "train[:1]"],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
df = tfds.as_dataframe(ds, ds_info)



In [4]:
df = df[['features/age', 'features/embarked', 'features/fare', 'features/parch', 'features/pclass', 'features/sex',
       'features/sibsp', 'survived']]

In [5]:
# Data statistics
df.describe()

| | features/age | features/embarked | features/fare | features/parch | features/pclass | features/sex | features/sibsp | survived |
|---|---|---|---|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 |
| mean | 23.6766 | 1.495034 | 33.269279 | 0.385027 | 1.294882 | 0.355997 | 0.498854 | 0.381971 |
| std | 17.86619 | 0.81613 | 51.747562 | 0.86556 | 0.837836 | 0.478997 | 1.041658 | 0.486055 |
| min | -1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7.0 | 1.0 | 7.8958 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50% | 24.0 | 2.0 | 14.4542 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 75% | 35.0 | 2.0 | 31.275 | 0.0 | 2.0 | 1.0 | 1.0 | 1.0 |
| max | 80.0 | 3.0 | 512.329224 | 9.0 | 2.0 | 1.0 | 8.0 | 1.0 |


In [24]:
# One-hot encode categorical variables
df_enc = pd.get_dummies(df, columns = ['features/embarked', 'features/pclass', 'features/sex']).sample(frac=1)
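To make the encoding step concrete, here is a minimal plain-Python sketch of what one-hot encoding does to a column of integer category codes. The `one_hot` helper and the sample `embarked` values are hypothetical and only illustrate the idea; the notebook itself relies on `pd.get_dummies`. The width check at the end assumes that every integer code between the `min` and `max` reported by `df.describe()` actually occurs in the data.

```python
def one_hot(codes):
    """One-hot encode a list of integer category codes.

    Returns the sorted categories and one row of 0/1 indicators per input value.
    """
    categories = sorted(set(codes))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for code in codes:
        row = [0] * len(categories)
        row[index[code]] = 1  # exactly one indicator is set per sample
        rows.append(row)
    return categories, rows


# Hypothetical sample of 'features/embarked' codes.
embarked = [2, 0, 1, 2]
categories, rows = one_hot(embarked)

# Width of the encoded feature matrix, under the assumption stated above:
# 4 numeric columns (age, fare, parch, sibsp) plus one indicator column
# per category of embarked (0..3), pclass (0..2) and sex (0..1).
n_numeric = 4
n_embarked, n_pclass, n_sex = 4, 3, 2
n_features = n_numeric + n_embarked + n_pclass + n_sex
```

Under that assumption the encoded matrix has 13 columns, which is consistent with the input size of the first layer, `nn.Linear(13, 12)`, in the model defined in section 1.2.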

In [25]:
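# Pandas dataframes to numpy arrays
# Cast to float32: recent pandas versions emit boolean dummy columns, which
# would otherwise yield an object array that torch.from_numpy cannot convert.
X = df_enc.drop(['survived'], axis=1).values.astype(np.float32)
Y = df_enc["survived"].values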

In [26]:
# Create train and test set
train_features, test_features, train_labels, test_labels = train_test_split(X, Y, test_size = 0.3)

### 1.2 Train a model

The model is based on "Getting started with Captum - Titanic Data Analysis" provided by Captum:

https://captum.ai/tutorials/Titanic_Basic_Interpret

In [27]:
class TitanicSimpleNNModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(13, 12)
        self.sigmoid1 = nn.Sigmoid()
        self.linear2 = nn.Linear(12, 8)
        self.sigmoid2 = nn.Sigmoid()
        self.linear3 = nn.Linear(8, 2)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        lin1_out = self.linear1(x)
        sigmoid_out1 = self.sigmoid1(lin1_out)
        sigmoid_out2 = self.sigmoid2(self.linear2(sigmoid_out1))
        return self.softmax(self.linear3(sigmoid_out2))

In [28]:
net = TitanicSimpleNNModel()

criterion = nn.CrossEntropyLoss()
num_epochs = 200

optimizer = torch.optim.Adam(net.parameters(), lr=0.1)
input_tensor = torch.from_numpy(train_features).type(torch.FloatTensor)
label_tensor = torch.from_numpy(train_labels)
for epoch in range(num_epochs):    
    output = net(input_tensor)
    loss = criterion(output, label_tensor)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 20 == 0:
        print(f"Epoch {epoch + 1}/{num_epochs} => Loss: {loss.item():.2f}")

Epoch 1/200 => Loss: 0.70
Epoch 21/200 => Loss: 0.56
Epoch 41/200 => Loss: 0.51
Epoch 61/200 => Loss: 0.49
Epoch 81/200 => Loss: 0.48
Epoch 101/200 => Loss: 0.48
Epoch 121/200 => Loss: 0.47
Epoch 141/200 => Loss: 0.46
Epoch 161/200 => Loss: 0.47
Epoch 181/200 => Loss: 0.46


In [29]:
out_probs = net(input_tensor).detach().numpy()
out_classes = np.argmax(out_probs, axis=1)
print("Train Accuracy:", sum(out_classes == train_labels) / len(train_labels))

Train Accuracy: 0.8482532751091703


In [30]:
test_input_tensor = torch.from_numpy(test_features).type(torch.FloatTensor)
out_probs = net(test_input_tensor).detach().numpy()
out_classes = np.argmax(out_probs, axis=1)
print("Test Accuracy:", sum(out_classes == test_labels) / len(test_labels))

Test Accuracy: 0.7786259541984732


### 1.3 Generate explanations

In this example, we rely on the `captum` library. We use the Integrated Gradients method.

In [31]:
ig = IntegratedGradients(net)

In [32]:
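# Enable gradients on the inputs and attribute class index 1 (survived).
test_input_tensor.requires_grad_()
attr, delta = ig.attribute(test_input_tensor, target=1, return_convergence_delta=True)
# Convert the attributions to a NumPy array.
attr = attr.detach().numpy()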
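For intuition about what Integrated Gradients computes, the following is a minimal sketch in plain Python on a tiny hand-written logistic model. It is not Captum's implementation: the weights, the all-zero baseline, the midpoint rule, and the choice of 50 steps are all assumptions made for illustration. The sketch accumulates the gradient along the straight path from the baseline to the input and scales it by the input difference. It then measures the gap between the summed attributions and the change in model output, which we understand to be the kind of quantity reported by `delta` above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy logistic model: sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to the input x."""
    s = model(x, w, b)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, n_steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the distance from the baseline.
    return [(xi - bi) * t / n_steps for xi, bi, t in zip(x, baseline, total)]


# Hypothetical weights, input and baseline.
w, b = [0.8, -0.5, 0.3], 0.1
x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)

# Completeness: the attributions should sum to f(x) - f(baseline),
# up to the numerical integration error.
completeness_gap = sum(attributions) - (model(x, w, b) - model(baseline, w, b))
```

A gap close to zero indicates that the numerical integration has converged; a large gap would suggest using more integration steps.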

## 2) Quantitative evaluation using Quantus

We can evaluate our explanations on a variety of quantitative criteria, but as a motivating example we compute the ModelParameterRandomisation scores of Adebayo et al., 2018. This metric measures the distance between the original attribution and a newly computed attribution while the model parameters are randomised one layer at a time, either in a cascading fashion or independently for each layer.
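The comparison at the heart of this metric can be sketched in plain Python. The example below is an illustration only, not Quantus code: the toy linear model, its weights, and the gradient-times-input attribution are hypothetical stand-ins, whereas Quantus re-runs the chosen explanation function on the randomised network. The sketch computes the Spearman rank correlation between the attribution of a "trained" model and the attribution obtained after replacing its weights with random values.

```python
import random


def rank(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(a), rank(b))


def attribution(weights, x):
    """Absolute gradient-times-input attribution of a linear model w . x."""
    return [abs(w * xi) for w, xi in zip(weights, x)]


# Hypothetical input and "trained" weights.
x = [0.5, -1.2, 3.0, 0.1]
trained_weights = [2.0, 0.3, -1.0, 0.05]

# Randomise the parameters, as the metric does for one layer at a time.
rng = random.Random(27)
random_weights = [rng.gauss(0.0, 1.0) for _ in trained_weights]

original = attribution(trained_weights, x)
randomised = attribution(random_weights, x)
score = spearman(original, randomised)
```

A score near 1 means the attribution barely changed after randomisation, which suggests the explanation is insensitive to the model parameters. Lower scores indicate that the explanation depends on what the model has learned.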

In [33]:
# Define metric for evaluation.
metric_init = quantus.ModelParameterRandomisation(
    similarity_func=quantus.similarity_func.correlation_spearman,
    return_sample_correlation=True,
    return_aggregate=True,
    aggregate_func=np.mean,
    layer_order="independent",
    disable_warnings=True,
    normalise=True,
    abs=True,)

In [34]:
# Return ModelParameterRandomisation scores for Integrated Gradients.
scores_intgrad = metric_init(model=net, 
                            x_batch=test_features,
                            y_batch=test_labels,
                            a_batch=None,
                            explain_func=quantus.explain,
                            explain_func_kwargs={"method": "IntegratedGradients", "reduce_axes": ()})

ValueError: a_batch and x_batch must have same number of batches (1 != 393)

In [None]:
print("ModelParameterRandomisation scores by Adebayo et al., 2018\n"
      "\n • Integrated Gradients =", scores_intgrad)