In [92]:
import pandas as pd
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


from face import FACE
from agnostic_face import FACE as AGNOSTIC_FACE


import pickle
import matplotlib.pyplot as plt

## Notes

Edge weight = feasibility + actionability

where,

feasibility: statistical - how easy it is to get from data point A to data point B. Modelled by some distance measure (and density estimation).

actionability: something subjective - how easy it is for individual A to become individual B
(modelled by user-specified constraints).

distance measure: feasibility; start with categorical variables, then apply to continuous ones.

### Counterfactual Generation: The Importance of Distance Measure

In its simplest form, a counterfactual generation algorithm receives as input an instance to be explained, the label of that instance, and the set of n available training examples (instances) alongside their associated labels. The algorithm uses a distance function to determine how close the instance to be explained is to each training instance, and searches for the *optimal* instance with the opposite label.
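
The simplest form described above can be sketched as a nearest-neighbour search restricted to training instances with a different label. This is a minimal illustration, not the FACE algorithm used later; the function name `nearest_counterfactual`, the toy data and the default Euclidean distance are assumptions made for the example, and "optimal" is taken to mean "closest".

```python
import numpy as np

def nearest_counterfactual(x, X_train, y_train, x_label, dist=None):
    """Return (index, row) of the closest training instance whose label differs from x_label."""
    if dist is None:
        # Assumption: Euclidean distance as the default measure
        dist = lambda a, b: float(np.linalg.norm(a - b))
    candidates = np.flatnonzero(y_train != x_label)
    if candidates.size == 0:
        raise ValueError("no training instance carries a different label")
    distances = np.array([dist(x, X_train[i]) for i in candidates])
    best = candidates[np.argmin(distances)]
    return int(best), X_train[best]

# Toy data: columns are (age, education level); labels are 0 (<=50K) and 1 (>50K)
X_train = np.array([[25.0, 8.0], [48.0, 8.0], [43.0, 14.0], [37.0, 8.0]])
y_train = np.array([0, 1, 1, 0])

idx, cf = nearest_counterfactual(X_train[1], X_train, y_train, x_label=1)
print(idx, cf)
```

Swapping the `dist` argument is the only change needed to try a different distance measure, which is why the choice of measure directly determines which counterfactual is returned.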

In recent explainable AI literature, counterfactual generation algorithms have been extended to explain model behaviour. In this case, the counterfactual generation algorithm is initialised as above, with the additional input of the machine learning model trained on the training instances. Because the counterfactuals are generated to explain the behaviour of the machine learning model, the model predictions are used as instance labels instead of the ground truth labels.

The importance of distance measure selection has been widely studied across many machine learning domains where it has been shown that the choice of measure drastically impacts the bias and generalisation capability of a machine learning algorithm.

TODO: WHY IS DISTANCE MEASURE IMPORTANT FOR COUNTERFACTUALS

### Distance Measures and Feature Types

The input instances are represented by a vector of features which can include:

<ul>
    <li> Categorical features - describe variables that can take a fixed number of values. These can be further subdivided into nominal and ordinal features, where ordinal features differ from nominal features by having a natural ordering over the set of values. </li>
    <li> Continuous features - describe variables that can take an infinite number of values with a natural ordering. </li>
</ul>

Popular distance measures on continuous features, for feature vectors $x, y \in \mathbb{R}^m$:
<ul>
    <li> Euclidean distance (L2 norm):
        $d(x,y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$ </li>
    <li> Manhattan distance (L1 norm):
        $d(x,y) = \sum_{i=1}^{m} |x_i - y_i|$ </li>
    <li> Minkowski distance, which generalises both ($p = 1$ gives Manhattan, $p = 2$ gives Euclidean):
        $d(x,y) = \left(\sum_{i=1}^{m} |x_i - y_i|^p\right)^{\frac{1}{p}}$ </li>
    <li> Chebyshev distance, the limit of the Minkowski distance as $p \to \infty$:
        $d(x,y) = \max_{i} |x_i - y_i|$ </li> </ul>
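
As a quick numerical check of the relationships between these measures, the sketch below implements the Minkowski and Chebyshev distances with NumPy. The function names and the two example vectors are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.max(np.abs(x - y)))

x = [48.0, 8.0]   # e.g. (age, education level)
y = [43.0, 14.0]

manhattan = minkowski(x, y, p=1)   # |5| + |6|
euclidean = minkowski(x, y, p=2)   # sqrt(5^2 + 6^2)
large_p = minkowski(x, y, p=50)    # close to the Chebyshev distance
print(manhattan, euclidean, large_p, chebyshev(x, y))
```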

The $L_p$ norms each have desirable characteristics when applied as similarity measures on continuous features, owing to the metric space implicitly defined by numerical features. When objects are described by a set of numerical attributes, there are natural definitions of distance based on geometric analogies. Categorical data, however, has no inherent metric space and no single ordering of the categorical values; consider, for example, how to define the distance between different occupations. By defining a metric space over discrete features we implicitly encode a specific set of assumptions. Distance measures applied to categorical features include:

<ul>
    <li> One-hot encoding of categorical variables to treat as a binary (numerical) variable </li>
    <li> Value Difference Metric (uses probabilities over features) </li>
    
</ul>
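
To make the second item concrete, the following sketch computes a Value Difference Metric for a single categorical feature as $\sum_c |P(c \mid u) - P(c \mid v)|^q$, where $c$ ranges over output classes and $u$, $v$ are two values of the feature. The helper name `value_difference`, the exponent `q=2` and the toy data are assumptions chosen for illustration.

```python
import pandas as pd

def value_difference(values, labels, u, v, q=2):
    """VDM between two values u and v of one categorical feature."""
    # Each row of the table holds P(class | feature value)
    table = pd.crosstab(values, labels, normalize="index")
    return float((table.loc[u] - table.loc[v]).abs().pow(q).sum())

occupation = pd.Series(["clerical", "clerical", "manager", "manager", "service", "service"])
income = pd.Series([0, 0, 1, 1, 0, 1])

d_clerical_manager = value_difference(occupation, income, "clerical", "manager")
d_clerical_service = value_difference(occupation, income, "clerical", "service")
print(d_clerical_manager, d_clerical_service)
```

Two values are considered close when they induce similar class distributions, so the distance depends on the labels and not only on the feature itself.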


Multiple distance measures with varying properties have been proposed in the literature, the most popular being the Euclidean distance.

The most popular distance measures target continuous features. Considerably fewer distance measures have been proposed to handle categorical features, and fewer still to handle a mixture of both data types. Many real-world applications have both nominal and linear attributes, including, for example, over half of the datasets in the UCI Machine Learning Database Repository. This motivates the need for distance measures that can appropriately handle different kinds of features.


TODO Counterfactual mixed features


### Approaches to Mixed Feature Distances

The most common approach to calculating distances between mixed feature types in counterfactual generation is to first transform categorical features into binary features via one-hot encoding, so that they can be handled by continuous distance measures, principally the Euclidean distance.
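
A small sketch of what this transformation implies for distances, using three occupation values from the adult dataset as an assumed toy input: after one-hot encoding, every pair of distinct categories ends up exactly $\sqrt{2}$ apart under the Euclidean distance, regardless of how similar the occupations are in practice.

```python
import numpy as np
import pandas as pd

occupations = pd.Series(["Adm-clerical", "Machine-op-inspct", "Exec-managerial"])
one_hot = pd.get_dummies(occupations).to_numpy(dtype=float)

# Pairwise Euclidean distances between the one-hot rows
diff = one_hot[:, None, :] - one_hot[None, :, :]
pairwise = np.sqrt((diff ** 2).sum(axis=-1))
print(pairwise)

# For comparison: an unscaled five-year age difference contributes 5 to the
# same distance, which is larger than any change of occupation.
age_gap = abs(48 - 43)
print(age_gap > pairwise.max())
```

When continuous features are left unscaled, they can therefore dominate the one-hot block entirely.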

TODO what is one-hot encoding. 

### Vanilla Counterfactual Generation: The Effect of One-Hot-Encoding and Euclidean Distance on FACE Counterfactuals

We run through an example of counterfactual generation for the adult dataset, which is widely used to evaluate counterfactuals. We use the FACE algorithm with the Euclidean distance measure. The adult dataset is preprocessed according to existing approaches to counterfactual generation, which includes one-hot-encoding categorical features.

In [93]:
#### Import Data

In [94]:
data = pd.read_csv("adult.csv")
data = data[(data != '?').all(axis=1)].reset_index()
data

Unnamed: 0,index,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45217,48837,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
45218,48838,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
45219,48839,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
45220,48840,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


#### Handling different types of feature

How do we take into account the dependency between features? For example, is a woman who is 45 and pregnant more similar to a 25-year-old who is also pregnant, or to a 45-year-old woman who is not pregnant?


#### Select subset of 20 instances from dataset

In [95]:
subset_data = data.iloc[10:30]
subset_data

Unnamed: 0,index,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
10,12,26,Private,82091,HS-grad,9,Never-married,Adm-clerical,Not-in-family,White,Female,0,0,39,United-States,<=50K
11,14,48,Private,279724,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,3103,0,48,United-States,>50K
12,15,43,Private,346189,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,50,United-States,>50K
13,16,20,State-gov,444554,Some-college,10,Never-married,Other-service,Own-child,White,Male,0,0,25,United-States,<=50K
14,17,43,Private,128354,HS-grad,9,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,30,United-States,<=50K
15,18,37,Private,60548,HS-grad,9,Widowed,Machine-op-inspct,Unmarried,White,Female,0,0,20,United-States,<=50K
16,20,34,Private,107914,Bachelors,13,Married-civ-spouse,Tech-support,Husband,White,Male,0,0,47,United-States,>50K
17,21,34,Private,238588,Some-college,10,Never-married,Other-service,Own-child,Black,Female,0,0,35,United-States,<=50K
18,23,25,Private,220931,Bachelors,13,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,43,Peru,<=50K
19,24,25,Private,205947,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States,<=50K


### Feature Selection

To keep the counterfactuals easy to understand, we include only the following features:

<ul>
    <li> age (Continuous) </li>
    <li> education (Ordinal) </li>
    <li> occupation (Nominal) </li>

</ul>


In [96]:
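# Column positions in the categorical feature matrix (age, occupation, education)
ordinal_indexes = [0, 2]  # age and education
nominal_indexes = [1]     # occupation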

In [113]:
data_selected = subset_data[['age','education','occupation','income']]
data_selected = data_selected.reset_index().drop("index",axis="columns")
data_selected

Unnamed: 0,age,education,occupation,income
0,26,HS-grad,Adm-clerical,<=50K
1,48,HS-grad,Machine-op-inspct,>50K
2,43,Masters,Exec-managerial,>50K
3,20,Some-college,Other-service,<=50K
4,43,HS-grad,Adm-clerical,<=50K
5,37,HS-grad,Machine-op-inspct,<=50K
6,34,Bachelors,Tech-support,>50K
7,34,Some-college,Other-service,<=50K
8,25,Bachelors,Prof-specialty,<=50K
9,25,Bachelors,Prof-specialty,<=50K


#### Preprocess data

Treat age and education as continuous and one-hot-encode occupation

In [98]:
data_encoded = pd.get_dummies(data_selected.occupation, prefix='occupation')
data_encoded["age"] = data_selected["age"]
data_encoded["education"] = data_selected["education"]
data_encoded["outcome"] = data_selected["income"]

education_dummies = {"education": {
    'Preschool' : 0,
    '1st-4th': 1,
    '5th-6th': 2,
    '7th-8th':3,
    '9th':4,
    '10th':5,
    '11th':6,
    '12th':7,
    'HS-grad':8,
    'Some-college':9,
    'Prof-school':10,
    'Assoc-acdm':11,
    'Assoc-voc':12,
    'Bachelors':13,
    'Masters':14,
    'Doctorate':15,
}} 
data_encoded = data_encoded.replace(education_dummies)
data_encoded["outcome"] = data_encoded["outcome"].astype('category')
data_encoded["outcome"] = data_encoded["outcome"].cat.codes
data_encoded

Unnamed: 0,occupation_Adm-clerical,occupation_Craft-repair,occupation_Exec-managerial,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,age,education,outcome
0,1,0,0,0,0,0,0,0,0,26,8,0
1,0,0,0,1,0,0,0,0,0,48,8,1
2,0,0,1,0,0,0,0,0,0,43,14,1
3,0,0,0,0,1,0,0,0,0,20,9,0
4,1,0,0,0,0,0,0,0,0,43,8,0
5,0,0,0,1,0,0,0,0,0,37,8,0
6,0,0,0,0,0,0,0,0,1,34,13,1
7,0,0,0,0,1,0,0,0,0,34,9,0
8,0,0,0,0,0,1,0,0,0,25,13,0
9,0,0,0,0,0,1,0,0,0,25,13,0


In [99]:
data_encoded_Y = data_encoded['outcome']
data_encoded_X = data_encoded.drop(["outcome"], axis="columns")

#### Run the FACE Algorithm to generate the counterfactuals for the following instance

In [100]:
example = data_encoded_X.iloc[[1]]
example

Unnamed: 0,occupation_Adm-clerical,occupation_Craft-repair,occupation_Exec-managerial,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,age,education
1,0,0,0,1,0,0,0,0,0,48,8


#### Generate counterfactual graph and print the path from example to nearest counterfactual
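As a rough picture of what a graph-based search of this kind does, the sketch below connects points that lie within a fixed radius of each other and follows the shortest path from the example to the closest reachable point of the target class. It is an illustration only and makes no claim about how `agnostic_face` builds or weights its graph; the function name, the `radius` threshold, the plain Euclidean edge weights and the toy data are all assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def shortest_path_counterfactual(X, y, start, target_label, radius):
    """Return node indices on the shortest path from `start` to the nearest target-class node."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances; keep an edge only if two points are within `radius`
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    graph = np.where(dists <= radius, dists, 0.0)  # 0 means "no edge" for dense input
    cost, pred = dijkstra(graph, directed=False, indices=start, return_predecessors=True)
    targets = np.flatnonzero(np.asarray(y) == target_label)
    end = targets[np.argmin(cost[targets])]
    if not np.isfinite(cost[end]):
        return None  # no target-class node is reachable
    path = [end]
    while path[-1] != start:
        path.append(pred[path[-1]])
    return [int(i) for i in path[::-1]]

# Toy data: columns are (age, education level)
X_toy = [[48, 8], [45, 8], [43, 8], [26, 8]]
y_toy = [1, 1, 0, 0]
toy_path = shortest_path_counterfactual(X_toy, y_toy, start=0, target_label=0, radius=4.0)
print(toy_path)
```

Because zero entries are treated as missing edges, exact duplicate points are not connected in this sketch.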

In [101]:
ce = AGNOSTIC_FACE(data_encoded_X, data_encoded_Y, ordinal_indexes, nominal_indexes, dist_metric="euclidean")
eg = example.values.reshape(1, -1)
path = ce.generate_counterfactual(eg,1)
path

euclidean
20 nodes have been added to graph.
78 edges have been added to graph.


Unnamed: 0,occupation_Adm-clerical,occupation_Craft-repair,occupation_Exec-managerial,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,age,education
0,0,0,0,1,0,0,0,0,0,48,8
1,1,0,0,0,0,0,0,0,0,43,8


#### Evaluate Counterfactual

In [102]:
### alternative counterfactual instances

alternative_counterfactuals = data_encoded.loc[data_encoded.outcome == 0]
alternative_counterfactuals

Unnamed: 0,occupation_Adm-clerical,occupation_Craft-repair,occupation_Exec-managerial,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,age,education,outcome
0,1,0,0,0,0,0,0,0,0,26,8,0
3,0,0,0,0,1,0,0,0,0,20,9,0
4,1,0,0,0,0,0,0,0,0,43,8,0
5,0,0,0,1,0,0,0,0,0,37,8,0
7,0,0,0,0,1,0,0,0,0,34,9,0
8,0,0,0,0,0,1,0,0,0,25,13,0
9,0,0,0,0,0,1,0,0,0,25,13,0
11,1,0,0,0,0,0,0,0,0,22,8,0
12,0,0,0,1,0,0,0,0,0,23,8,0
13,0,1,0,0,0,0,0,0,0,54,8,0


In [103]:
alternative_counterfactuals.loc[alternative_counterfactuals.occupation_Sales == 1]

Unnamed: 0,occupation_Adm-clerical,occupation_Craft-repair,occupation_Exec-managerial,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,age,education,outcome
17,0,0,0,0,0,0,0,1,0,24,13,0


In [104]:
path

Unnamed: 0,occupation_Adm-clerical,occupation_Craft-repair,occupation_Exec-managerial,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,age,education
0,0,0,0,1,0,0,0,0,0,48,8
1,1,0,0,0,0,0,0,0,0,43,8


## Our Proposed Counterfactual Distance Measure 

Following Wilson and Martinez (https://arxiv.org/pdf/cs/9701101.pdf), who argue "that any value stored in a computer is discrete at some level", we adopt an alternative distance measure in which continuous features are first transformed into categorical features via binning. This yields an input feature vector consisting only of nominal and ordinal variables.



Defining a reasonable distance for mixed categorical data is challenging because the relationships among the categories of ordinal and nominal attributes take different forms, which yields different types of intercategory distance.
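
One simple way to combine the two kinds of intercategory distance is sketched below: ordinal features contribute a rank difference scaled to $[0, 1]$, and nominal features contribute an overlap term (0 if equal, 1 otherwise). This is an assumed illustration and not necessarily what the `"categorical"` metric in `agnostic_face` computes; the function name and the `n_levels` argument are hypothetical.

```python
import numpy as np

def mixed_categorical_distance(a, b, ordinal_idx, nominal_idx, n_levels):
    """Sum of scaled rank differences (ordinal) and mismatches (nominal)."""
    a, b = np.asarray(a), np.asarray(b)
    total = 0.0
    for i in ordinal_idx:
        # Rank difference scaled by the number of steps between lowest and highest level
        total += abs(int(a[i]) - int(b[i])) / (n_levels[i] - 1)
    for i in nominal_idx:
        # Overlap measure: categories either match or they do not
        total += float(a[i] != b[i])
    return total

# Columns: age bin (5 levels), occupation code, education level (16 levels)
n_levels = {0: 5, 2: 16}
d_same_age = mixed_categorical_distance([4, 3, 8], [4, 1, 8], [0, 2], [1], n_levels)
d_other_age = mixed_categorical_distance([4, 3, 8], [3, 0, 8], [0, 2], [1], n_levels)
print(d_same_age, d_other_age)
```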

#### Transform original data into categorical features

In [105]:
data_categorised = pd.DataFrame()
data_categorised['age'] = pd.qcut(data_selected['age'], 5, labels=False).values
data_categorised['education'] = data_encoded['education'].values
data_categorised["raw_occupation"] = data_selected["occupation"].astype('category').values
data_categorised["occupation"] = data_categorised["raw_occupation"].cat.codes
data_categorised["outcome"] = data_encoded["outcome"].values

data_categorised


Unnamed: 0,age,education,raw_occupation,occupation,outcome
0,1,8,Adm-clerical,0,0
1,4,8,Machine-op-inspct,3,1
2,3,14,Exec-managerial,2,1
3,0,9,Other-service,4,0
4,3,8,Adm-clerical,0,0
5,3,8,Machine-op-inspct,3,0
6,2,13,Tech-support,8,1
7,2,9,Other-service,4,0
8,1,13,Prof-specialty,5,0
9,1,13,Prof-specialty,5,0


In [106]:
education_order_dict = {
    'Preschool' : 0,
    '1st-4th': 1,
    '5th-6th': 2,
    '7th-8th':3,
    '9th':4,
    '10th':5,
    '11th':6,
    '12th':7,
    'HS-grad':8,
    'Some-college':9,
    'Prof-school':10,
    'Assoc-acdm':11,
    'Assoc-voc':12,
    'Bachelors':13,
    'Masters':14,
    'Doctorate':15,
} 

In [107]:
data_categorised_X = data_categorised[['age','occupation','education']]

data_categorised_Y = data_categorised['outcome']

In [108]:
### Generate Counterfactuals for example 

In [109]:
example = data_categorised_X.iloc[[1]]

In [110]:
from agnostic_face import FACE as AGNOSTIC_FACE
ce = AGNOSTIC_FACE(data_categorised_X, data_categorised_Y, ordinal_indexes, nominal_indexes, "categorical")

eg = example.values.reshape(1, -1)
path = ce.generate_counterfactual(eg, 1)

path

categorical
20 nodes have been added to graph.
190 edges have been added to graph.


Unnamed: 0,age,occupation,education
0,4,3,8
1,4,1,8


In [111]:
alternative_counterfactuals = data_categorised.loc[data_categorised.outcome == 0]
alternative_counterfactuals

Unnamed: 0,age,education,raw_occupation,occupation,outcome
0,1,8,Adm-clerical,0,0
3,0,9,Other-service,4,0
4,3,8,Adm-clerical,0,0
5,3,8,Machine-op-inspct,3,0
7,2,9,Other-service,4,0
8,1,13,Prof-specialty,5,0
9,1,13,Prof-specialty,5,0
11,0,8,Adm-clerical,0,0
12,0,8,Machine-op-inspct,3,0
13,4,8,Craft-repair,1,0


In [116]:
evaluate = data_selected.reset_index()
evaluate.iloc[[1,4,13]]

Unnamed: 0,index,age,education,occupation,income
1,1,48,HS-grad,Machine-op-inspct,>50K
4,4,43,HS-grad,Adm-clerical,<=50K
13,13,54,HS-grad,Craft-repair,<=50K


### Improving Distance Measure with Constraint Based Rules

#### Distance Metrics - The Case for breaking Metric Properties

A metric on a set $X$ is a function:
$$ d : X \times X \to \mathbb{R}_{\geq 0} $$
For all $x, y, z$ in $X$, this function is required to satisfy the following conditions:
<ul>
    <li> 1. $d(x, y) \geq 0$ </li>
    <li> 2. $d(x, y) = 0$ iff $x = y$ </li>
    <li> 3. $d(x, y) = d(y, x)$ </li>
    <li> 4. $d(x, z) \leq d(x, y) + d(y, z)$ </li>
</ul>
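
The four conditions can be checked empirically on a finite sample of points. The sketch below does this by brute force; `check_metric_axioms` is a hypothetical helper, and passing the check on a sample does not prove that a function is a metric in general. The squared Euclidean distance is included as a familiar example that fails the triangle inequality.

```python
import itertools
import numpy as np

def check_metric_axioms(points, d, tol=1e-12):
    """Report which of the four metric conditions hold on a finite sample of points."""
    pairs = list(itertools.product(points, repeat=2))
    triples = itertools.product(points, repeat=3)
    return {
        "non_negativity": all(d(x, y) >= -tol for x, y in pairs),
        "identity": all((d(x, y) <= tol) == (x == y) for x, y in pairs),
        "symmetry": all(abs(d(x, y) - d(y, x)) <= tol for x, y in pairs),
        "triangle": all(d(x, z) <= d(x, y) + d(y, z) + tol for x, y, z in triples),
    }

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
euclid = lambda x, y: float(np.linalg.norm(np.subtract(x, y)))
squared = lambda x, y: float(np.sum(np.subtract(x, y) ** 2))

euclid_report = check_metric_axioms(points, euclid)
squared_report = check_metric_axioms(points, squared)
print(euclid_report)
print(squared_report)
```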
    
    
Why might these be violated when generating feasible counterfactual explanations?
<ul>
    <li> 1. Non-negativity </li>
    <li> 2. Identity of indiscernibles </li>
    <li> 3. Symmetry - it can be easy to increase a feature but impossible to decrease it (e.g. age) </li>
    <li> 4. Triangle inequality - it can be easier to go "the long way round" than to make a direct jump (e.g. education level - it is impossible to go directly from kindergarten to a PhD) </li>
</ul>

The conjunction of 1 and 2 produces positive definiteness, which is important for convexity.
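
The symmetry and triangle cases can be illustrated with two toy cost functions. Both are assumptions made for this example: `age_cost` treats getting younger as impossible, and `education_step_cost` forbids single steps of more than one education level.

```python
import math

def age_cost(a, b):
    """Cost of moving from age a to age b: ageing is possible, getting younger is not."""
    return float(b - a) if b >= a else math.inf

def education_step_cost(a, b, max_jump=1):
    """Cost of moving from level a to level b in one step; larger jumps are disallowed."""
    return float(abs(b - a)) if abs(b - a) <= max_jump else math.inf

# Symmetry is broken: 43 -> 48 is feasible, 48 -> 43 is not
forward = age_cost(43, 48)
backward = age_cost(48, 43)

# The triangle inequality is broken: the direct jump costs more than the detour
direct = education_step_cost(0, 2)
via_intermediate = education_step_cost(0, 1) + education_step_cost(1, 2)
print(forward, backward, direct, via_intermediate)
```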

Extensions to Distance Metrics:
    <ul>
        <li> Pseudometrics satisfy all properties apart from 2: $d(x,x) = 0$ always holds, but <i>possibly</i> $d(x,y) = 0$ for $x \neq y$. Similarly, metametrics satisfy all properties other than 2, in that $d(x,x)$ is not necessarily 0. </li>
        <li> Quasimetrics obey all properties other than 3. </li>
        <li> Semimetrics obey all properties other than 4. </li>
        