# Counterfactuals guided by prototypes on the HELOC dataset

This notebook goes through an example of [prototypical counterfactuals](../methods/CFProto.ipynb) using [k-d trees](https://en.wikipedia.org/wiki/K-d_tree) to build the prototypes. Please check out [this notebook](./cfproto_mnist.ipynb) for a more in-depth application of the method on MNIST using (auto-)encoders and trust scores.

In this example, we use a simple neural net to predict whether the `RiskPerformance` of a HELOC applicant is Good or Bad. We can then find a counterfactual to see which variables need to be changed to flip that prediction.
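
To make the idea of a prototype concrete, here is a minimal sketch on toy data. It assumes the prototype for an instance is simply the nearest training point that the model assigns to a different class. The brute-force search stands in for a k-d tree query, which would return the same neighbour faster. The helper `nearest_other_class_prototype` and the toy arrays are hypothetical and not part of alibi.

```python
import numpy as np

def nearest_other_class_prototype(x, train_x, train_pred, x_pred):
    """Return the closest training point whose predicted class differs from x_pred.

    Hypothetical helper: a brute-force stand-in for a per-class k-d tree query.
    """
    other = train_x[train_pred != x_pred]        # candidates from the other class(es)
    dists = np.linalg.norm(other - x, axis=1)    # Euclidean distance to each candidate
    return other[np.argmin(dists)]

# Toy data: two features, two predicted classes
toy_train_x = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
toy_train_pred = np.array([0, 0, 1, 1])

x = np.array([0.1, 0.1])                         # instance predicted as class 0
proto = nearest_other_class_prototype(x, toy_train_x, toy_train_pred, x_pred=0)
print(proto)
```

A counterfactual search guided by such a prototype would then be pulled towards `proto` while trying to change the predicted class.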

In [4]:
import tensorflow as tf
tf.get_logger().setLevel(40) # suppress deprecation messages
tf.compat.v1.disable_v2_behavior() # disable TF2 behaviour as alibi code still relies on TF1 constructs 
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import random
from alibi.explainers.cfproto import CounterfactualProto
from sklearn.preprocessing import StandardScaler, MinMaxScaler

print('TF version: ', tf.__version__)
print('Eager execution enabled: ', tf.executing_eagerly()) # False

TF version:  2.5.1
Eager execution enabled:  False


## Load and prepare the HELOC dataset

In [5]:
df = pd.read_csv("./heloc_dataset_v1.csv")
x_cols = list(df.columns.values)
# -7, -8 and -9 are special codes in the HELOC data, not measurements; set them to 0.
# Assigning through .loc avoids chained assignment on a copy.
for col in x_cols:
    df.loc[df[col].isin([-7, -8, -9]), col] = 0
# Keep only rows that have at least one non-zero value
df = df[(df[x_cols].T != 0).any()].copy()
# Encode the label as 1 (Good) / 0 (Bad)
df['RiskPerformance'] = df['RiskPerformance'].map({'Good':1, 'Bad':0})
df = df.astype(np.float32)
# Label first, followed by the ten features used by the model
columns = ['RiskPerformance', 'MSinceMostRecentInqexcl7days', 'ExternalRiskEstimate', 'NetFractionRevolvingBurden', 'NumSatisfactoryTrades', 'NumInqLast6M', 
        'NumBank2NatlTradesWHighUtilization', 'AverageMInFile', 'NumRevolvingTradesWBalance', 'MaxDelq2PublicRecLast12M', 'PercentInstallTrades']

df = df[columns]
# Shuffle the row positions reproducibly
random.seed(0)
a = list(range(len(df)))
random.shuffle(a)
length = len(a)
print(a[0:10])


[8301, 8446, 810, 5497, 2904, 10099, 9183, 2535, 9545, 8724]
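
The special-value handling in the cell above can also be written without a loop. The sketch below shows one way to do it on a small made-up frame. The toy values and the name `feature_cols` are hypothetical, and only the column names and the codes -7, -8 and -9 are taken from the cell.

```python
import pandas as pd

# Hypothetical toy frame mimicking the HELOC layout
toy = pd.DataFrame({
    "RiskPerformance": ["Good", "Bad", "Good"],
    "ExternalRiskEstimate": [70, -9, 55],
    "NumInqLast6M": [-7, 2, -8],
})

feature_cols = [c for c in toy.columns if c != "RiskPerformance"]

# Boolean mask of special codes, then overwrite them with 0 in one step
special = toy[feature_cols].isin([-7, -8, -9])
toy[feature_cols] = toy[feature_cols].mask(special, 0)

# Encode the label as 1 (Good) / 0 (Bad)
toy["RiskPerformance"] = toy["RiskPerformance"].map({"Good": 1, "Bad": 0})
print(toy)
```

Restricting the mask to the feature columns keeps the string label out of the comparison.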


In [6]:
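# Split the shuffled row positions: 50% train, 25% query, 25% test
n_train, n_query = int(len(a) * 0.5), int(len(a) * 0.75)
train_idx, query_idx, test_idx = a[:n_train], a[n_train:n_query], a[n_query:]

# Column 0 is the label (RiskPerformance); the remaining columns are the features
train_x, train_y = df.iloc[train_idx, 1:].values, df.iloc[train_idx, 0].values
query_x, query_y = df.iloc[query_idx, 1:].values, df.iloc[query_idx, 0].values
test_x, test_y = df.iloc[test_idx, 1:].values, df.iloc[test_idx, 0].values

# Fit the scaler on the training split only, then apply it to the other splits
scaler = StandardScaler()
strain_x = scaler.fit_transform(train_x)
squery_x = scaler.transform(query_x)
stest_x = scaler.transform(test_x)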
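
As a rough illustration of what the scaling step does, the sketch below standardizes toy data with plain NumPy. The arrays are randomly generated stand-ins, not HELOC data. The mean and standard deviation come from the training split alone and are reused for the other split, which mirrors calling `fit_transform` on the training data and `transform` on the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
toy_query = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Statistics come from the training split only
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)        # population std (ddof=0), as StandardScaler uses

s_train = (toy_train - mu) / sigma
s_query = (toy_query - mu) / sigma   # reuse the training statistics, do not refit

print(s_train.mean(axis=0).round(6), s_train.std(axis=0).round(6))
```

The query split will generally not have exactly zero mean and unit variance after this, which is expected.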

In [7]:
otrain_y = to_categorical(train_y)
oquery_y = to_categorical(query_y)
otest_y = to_categorical(test_y)
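
For readers unfamiliar with `to_categorical`, this is roughly what the one-hot encoding of a binary label looks like. The helper `one_hot` is a hypothetical NumPy equivalent written for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical NumPy equivalent of one-hot encoding integer class labels."""
    labels = np.asarray(labels, dtype=int)            # float labels such as 1.0 become 1
    return np.eye(num_classes, dtype=np.float32)[labels]

toy_y = np.array([1.0, 0.0, 1.0], dtype=np.float32)   # float labels, as in train_y
encoded = one_hot(toy_y, num_classes=2)
print(encoded)
```

Each row has a single 1 in the column of its class, so column 1 corresponds to Good and column 0 to Bad.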

In [8]:
nn = load_model('nn_heloc_10.h5')
print(nn.evaluate(squery_x, oquery_y))
print(nn.evaluate(stest_x, otest_y))

`Model.state_updates` will be removed in a future version. This property should not be used in TensorFlow 2.0, as `updates` are applied automatically.


[0.5546283500828205, 0.7200765]
[0.5688649906369058, 0.71931165]
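
The two numbers in each printed list are most likely the loss followed by the accuracy, although the compile settings of the saved model are not shown here. Assuming a categorical cross-entropy loss, the sketch below computes both quantities by hand on made-up probabilities. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def cross_entropy_and_accuracy(probs, onehot_y, eps=1e-7):
    """Mean categorical cross-entropy and accuracy from predicted probabilities."""
    probs = np.clip(probs, eps, 1.0 - eps)                       # avoid log(0)
    loss = -np.mean(np.sum(onehot_y * np.log(probs), axis=1))
    acc = np.mean(np.argmax(probs, axis=1) == np.argmax(onehot_y, axis=1))
    return loss, acc

# Made-up predictions for four instances and their one-hot labels
toy_probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]])
toy_onehot = np.array([[1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)

loss, acc = cross_entropy_and_accuracy(toy_probs, toy_onehot)
print(loss, acc)
```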


In [9]:
query_p = np.argmax(nn.predict(squery_x), axis = 1)
test_p = np.argmax(nn.predict(stest_x), axis = 1)

In [10]:
print(test_p)

[1 1 1 ... 1 1 0]


In [11]:
# Append the predicted class as the last column of the unscaled features
query_tmp = np.concatenate((query_x, query_p[:, np.newaxis]), axis = 1)
test_tmp = np.concatenate((test_x, test_p[:, np.newaxis]), axis = 1)

In [13]:
# Save features plus predicted class for later use
np.save("heloc_query.npy", query_tmp)
np.save("heloc_test.npy", test_tmp)
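
The saved arrays hold the unscaled features with the predicted class in the last column. A minimal sketch of reading such a file back and splitting it again is shown below. It uses a temporary file and toy values so that it does not touch the real `.npy` files, and the variable names are hypothetical.

```python
import os
import tempfile

import numpy as np

toy_x = np.array([[70.0, 2.0], [55.0, 0.0]], dtype=np.float32)
toy_p = np.array([1, 0])

# Same layout as above: features first, predicted class in the last column
toy_tmp = np.concatenate((toy_x, toy_p[:, np.newaxis]), axis=1)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_query.npy")
    np.save(path, toy_tmp)
    loaded = np.load(path)

# Split the columns back into features and integer class predictions
features, preds = loaded[:, :-1], loaded[:, -1].astype(int)
print(features, preds)
```

Because the features are stored unscaled, any consumer that feeds them to the model would need to apply the same scaler first.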