# Counterfactuals guided by prototypes on Boston housing dataset

This notebook goes through an example of [prototypical counterfactuals](../methods/CFProto.ipynb) using [k-d trees](https://en.wikipedia.org/wiki/K-d_tree) to build the prototypes. Please check out [this notebook](./cfproto_mnist.ipynb) for a more in-depth application of the method on MNIST using (auto-)encoders and trust scores.

In this example, we will train a simple neural net to predict whether house prices in the Boston area are above the median value or not. We can then find a counterfactual to see which variables need to change to move a house's price from below the median to above it, or vice versa.
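
As a rough illustration of the binary target described above, the sketch below labels a few made-up prices as above the median (1) or not (0). The prices and the `label_above_median` helper are hypothetical and are not part of the notebook.

```python
from statistics import median

def label_above_median(values):
    """Return 1 for each value strictly above the median, else 0."""
    m = median(values)
    return [1 if v > m else 0 for v in values]

prices = [18.0, 21.5, 24.0, 30.1, 50.0]  # made-up house prices
labels = label_above_median(prices)
print(labels)
```

A counterfactual for an instance labelled 0 is then a nearby instance that the model labels 1, and the other way round.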

In [1]:
import tensorflow as tf
tf.get_logger().setLevel(40) # suppress deprecation messages
tf.compat.v1.disable_v2_behavior() # disable TF2 behaviour as alibi code still relies on TF1 constructs 
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import random
from sklearn.datasets import load_boston
from alibi.explainers.cfproto import CounterfactualProto
from sklearn.preprocessing import StandardScaler, MinMaxScaler

print('TF version: ', tf.__version__)
print('Eager execution enabled: ', tf.executing_eagerly()) # False

TF version:  2.5.1
Eager execution enabled:  False


## Load and prepare the GMSC dataset

In [2]:
def load_GMSC(filename):

    df = pd.read_csv(filename)
    df = df.drop("Unnamed: 0", axis=1) # drop id column

    # filter out extreme values
    df = df.loc[df["DebtRatio"] <= df["DebtRatio"].quantile(0.975)]
    df = df.loc[(df["RevolvingUtilizationOfUnsecuredLines"] >= 0) & (df["RevolvingUtilizationOfUnsecuredLines"] < 13)]
    df = df.loc[df["NumberOfTimes90DaysLate"] <= 17].copy() # copy so the column assignments below act on df itself

    # impute missing values: mode for dependents, median for income
    dependents_mode = df["NumberOfDependents"].mode()[0]
    df["NumberOfDependents"] = df["NumberOfDependents"].fillna(dependents_mode)
    income_median = df["MonthlyIncome"].median()
    df["MonthlyIncome"] = df["MonthlyIncome"].fillna(income_median)

    # balance the classes: keep every positive and an equal number of negatives
    y = df['SeriousDlqin2yrs']
    idx1 = np.flatnonzero(y.values == 0) # negatives
    idx2 = np.flatnonzero(y.values == 1) # positives
    idx3 = idx1[0:len(idx2)]
    idx4 = np.concatenate((idx2, idx3))
    np.random.seed(0)
    np.random.shuffle(idx4)
    idx5 = np.setdiff1d(np.arange(len(df)), idx4) # rows left out of the balanced set

    data1 = df.iloc[idx4] # balanced and shuffled
    data2 = df.iloc[idx5] # remaining rows

    return data1, data2
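
The class-balancing step at the end of `load_GMSC` can be sketched in plain Python as follows. This is a minimal illustration that assumes class 1 is the minority class. The helper name `balance_by_undersampling` and the toy labels are hypothetical, and the standard-library `random` module stands in for NumPy's shuffling.

```python
import random

def balance_by_undersampling(labels, seed=0):
    """Return (balanced, remaining) index lists for a list of 0/1 labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    kept_neg = neg[:len(pos)]  # keep only as many negatives as there are positives
    balanced = pos + kept_neg
    random.Random(seed).shuffle(balanced)  # shuffle so the classes are interleaved
    chosen = set(balanced)
    remaining = [i for i in range(len(labels)) if i not in chosen]
    return balanced, remaining

labels = [0, 0, 1, 0, 0, 1, 0, 0]  # toy labels: two positives, six negatives
balanced, remaining = balance_by_undersampling(labels)
```

The balanced indices play the role of `data1` and the remaining indices the role of `data2`.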


In [3]:
def train_test_split(data):

    # the first column holds the label, the remaining columns are the features
    x = data.iloc[:, 1:]
    y = data.iloc[:, 0]

    x_ = x.to_numpy().astype(np.float32)
    y_ = y.to_numpy().astype(np.float32)
    y_ = y_[:, np.newaxis]

    # positional 50% / 25% / 25% split into train, query and test sets
    n_train = int(len(x_) * 0.5)
    n_query = int(len(x_) * 0.75)
    x_train, y_train = x_[:n_train], y_[:n_train]
    x_query, y_query = x_[n_train:n_query], y_[n_train:n_query]
    x_test, y_test = x_[n_query:], y_[n_query:]

    return x_train, y_train, x_query, y_query, x_test, y_test
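
To make the cut points of the split concrete, here is a small sketch of the same 50% / 25% / 25% positional slicing on a toy list. The function name `sequential_split` and its parameters are hypothetical.

```python
def sequential_split(rows, train_frac=0.5, query_frac=0.25):
    """Split rows by position into train, query and test parts."""
    n = len(rows)
    a = int(n * train_frac)                 # end of the train part
    b = int(n * (train_frac + query_frac))  # end of the query part
    return rows[:a], rows[a:b], rows[b:]

rows = list(range(10))
train, query, test = sequential_split(rows)
```

Because the cut points are positional, the rows need to be shuffled beforehand, which `load_GMSC` does with `np.random.shuffle`.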

In [5]:
data1, data2 = load_GMSC('../cs-training.csv')
train_x, train_y, query_x, query_y, test_x, test_y = train_test_split(data1)
scaler = StandardScaler()
strain_x = scaler.fit_transform(train_x)
squery_x = scaler.transform(query_x)
stest_x = scaler.transform(test_x)
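
The cell above fits the scaler on the training split only and reuses those statistics for the query and test splits. A minimal single-column sketch of that idea, using the standard library instead of scikit-learn, might look like this. The helper names and toy values are hypothetical. The population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Return the mean and population standard deviation of a column."""
    return fmean(column), pstdev(column)

def transform(column, mu, sigma):
    """Standardize a column with previously fitted statistics."""
    return [(v - mu) / sigma for v in column]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
query = [3.0, 6.0]

mu, sigma = fit_scaler(train)         # statistics come from the training data only
strain = transform(train, mu, sigma)
squery = transform(query, mu, sigma)  # query data reuses the training statistics
```

Reusing the training statistics keeps information from the query and test splits out of the preprocessing step.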

In [6]:
otrain_y = to_categorical(train_y)
oquery_y = to_categorical(query_y)
otest_y = to_categorical(test_y)
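
`to_categorical` turns each class index into a one-hot row. The sketch below shows the same transformation in plain Python for a flat list of labels. The `one_hot` helper is hypothetical and, like `to_categorical`, it infers the number of classes as the largest label plus one when none is given.

```python
def one_hot(labels, num_classes=None):
    """One-hot encode a flat list of class indices."""
    labels = [int(y) for y in labels]
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if k == y else 0.0 for k in range(num_classes)] for y in labels]

encoded = one_hot([0.0, 1.0, 1.0, 0.0])
print(encoded)
```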

In [7]:
nn = load_model('nn_GMSC.h5')
print(nn.evaluate(squery_x, oquery_y))
print(nn.evaluate(stest_x, otest_y))

`Model.state_updates` will be removed in a future version. This property should not be used in TensorFlow 2.0, as `updates` are applied automatically.


[0.4701878186175948, 0.7844971]
[0.49762366521687906, 0.76854354]


In [8]:
# predicted class = index of the largest network output in each row
query_p = np.argmax(nn.predict(squery_x), axis=1)
test_p = np.argmax(nn.predict(stest_x), axis=1)
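
The row-wise argmax used above can be illustrated without NumPy. The probabilities below are made up and `argmax_rows` is a hypothetical helper. On ties it returns the first maximal index, which matches `np.argmax`.

```python
def argmax_rows(probabilities):
    """Return the index of the largest entry in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]

probs = [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]  # made-up network outputs
preds = argmax_rows(probs)
print(preds)
```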

In [9]:
print(test_p)

[1 1 1 ... 0 0 1]


In [10]:
# prepend the predicted label as the first column of the unscaled features
query_tmp = np.concatenate((query_p[:, np.newaxis], query_x), axis=1)
test_tmp = np.concatenate((test_p[:, np.newaxis], test_x), axis=1)
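
The resulting layout has the predicted label in column 0 and the features in the remaining columns. A small plain-Python sketch of building that layout and splitting it again is shown below. Both helpers and the toy rows are hypothetical. The labels are stored as floats here because a NumPy array holds a single dtype, so concatenating integer labels with float features produces floats.

```python
def prepend_labels(preds, features):
    """Build rows of the form [label, feature_1, ..., feature_n]."""
    return [[float(p)] + list(row) for p, row in zip(preds, features)]

def split_labels(rows):
    """Recover integer labels and feature rows from the combined layout."""
    return [int(r[0]) for r in rows], [r[1:] for r in rows]

preds = [1, 0]
features = [[0.5, 42.0], [0.1, 30.0]]  # made-up feature rows
combined = prepend_labels(preds, features)
labels_back, features_back = split_labels(combined)
```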

In [11]:
np.save("gmsc_query.npy", query_tmp)
np.save("gmsc_test.npy", test_tmp)