## Notes

Edge weight = feasibility + actionability

where,

feasibility: statistical - how easy it is to get from data point A to data point B. Modelled by some distance measure (and density estimation)


actionability: something subjective - how easy it is for individual A to become individual B
(modelled by user-specified constraints)

distance measure: feasibility, start with categorical variables and then extend to continuous
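
As a rough sketch of how the two terms could be combined, the snippet below adds a feasibility term and an actionability term into one edge weight. The function names, the dictionary-based records and the "age cannot decrease" constraint are all assumptions made for illustration; they are not taken from the FACE implementation.

```python
import math

def edge_weight(a, b, feasibility, actionability):
    """Combine the two terms from the note above into one edge weight."""
    return feasibility(a, b) + actionability(a, b)

# Hypothetical feasibility term: absolute difference in age.
def age_distance(a, b):
    return abs(a["age"] - b["age"])

# Hypothetical actionability term: a user constraint that age cannot decrease.
def age_cannot_decrease(a, b):
    return 0.0 if b["age"] >= a["age"] else math.inf

younger = {"age": 25}
older = {"age": 30}
print(edge_weight(younger, older, age_distance, age_cannot_decrease))  # 5.0
print(edge_weight(older, younger, age_distance, age_cannot_decrease))  # inf
```

An infinite weight is one simple way to express that an edge is forbidden by a user constraint; a large finite penalty would be an alternative.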

In [1]:
import pandas as pd
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("adult.csv")

In [3]:
data = data[(data != '?').all(axis=1)].reset_index()

In [4]:
data

Unnamed: 0,index,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45217,48837,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
45218,48838,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
45219,48839,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
45220,48840,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


## Feasibility: A feature considered distance function

### On Single Features:

#### Distance Metrics - The Case for breaking Metric Properties

A metric on a set $X$ is a function:
$$ d : X \times X \to \mathbb{R}_{\geq 0} $$
For all $x, y, z$ in $X$, this function is required to satisfy the following conditions:
<ul>
    <li> 1. $ d(x, y) ≥ 0 $ </li>
    <li> 2. $d(x, y) = 0$  iff   $x = y$  </li>
    <li> 3. $ d(x, y) = d(y, x) $   </li>
    <li> 4. $ d(x, z) ≤ d(x, y) + d(y, z) $ </li>
</ul>
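
The four conditions can be checked by brute force on a small, finite set of points. The helper below is only an illustrative sketch (the name `violated_metric_properties` and the tolerance are assumptions); it reports which of the numbered properties a candidate function breaks on the points it is given, which is evidence rather than a proof.

```python
from itertools import product

def violated_metric_properties(d, points, tol=1e-12):
    """Return the set of metric axioms (numbered as above) that d breaks on points."""
    violated = set()
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:                        # 1. non-negativity
            violated.add(1)
        if (abs(d(x, y)) <= tol) != (x == y):     # 2. identity of indiscernibles
            violated.add(2)
        if abs(d(x, y) - d(y, x)) > tol:          # 3. symmetry
            violated.add(3)
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:     # 4. triangle inequality
            violated.add(4)
    return violated

points = [0.0, 1.0, 2.5, 4.0]
print(violated_metric_properties(lambda x, y: abs(x - y), points))    # set()
print(violated_metric_properties(lambda x, y: (x - y) ** 2, points))  # {4}
```

The squared difference fails only the triangle inequality: going from 0 to 4 directly costs 16, while going via 1 costs 1 + 9 = 10.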
    
    
Why might these be violated when generating feasible counterfactual explanations?
<ul> 
    <li> 1. Non-negativity </li>
    <li> 2. Identity of indiscernibles </li>
    <li> 3. Symmetry - It can be easy to increase a feature but impossible to decrease it (e.g. age) </li>
    <li> 4. Triangle inequality - It can be easier to go "the long way round" than to make a direct jump (e.g. education level - it is impossible to go directly from kindergarten to a PhD) </li>
</ul>

Together, properties 1 and 2 give positive definiteness, which is important for convexity.
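
The symmetry and triangle-inequality violations described above can be made concrete with two toy cost functions. Both functions, the four education levels and the unit step cost are hypothetical choices for illustration only.

```python
import math

def age_cost(a, b):
    """Asymmetric cost: ageing is possible, getting younger is not."""
    return float(b - a) if b >= a else math.inf

# Hypothetical ordinal education levels, lowest to highest.
LEVELS = ["HS-grad", "Bachelors", "Masters", "Doctorate"]

def education_cost(a, b):
    """Moving up one level costs 1; skipping levels or moving down is not allowed."""
    i, j = LEVELS.index(a), LEVELS.index(b)
    if i == j:
        return 0.0
    return 1.0 if j == i + 1 else math.inf

# Symmetry is broken: 25 -> 30 is cheap, 30 -> 25 is impossible.
print(age_cost(25, 30), age_cost(30, 25))  # 5.0 inf

# The triangle inequality is broken: the direct jump costs more than the detour.
direct = education_cost("HS-grad", "Masters")
stepwise = education_cost("HS-grad", "Bachelors") + education_cost("Bachelors", "Masters")
print(direct, stepwise)  # inf 2.0
```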

Extensions to Distance Metrics:
    <ul>
        <li> Pseudometrics satisfy all properties apart from 2: $d(x,x) = 0$, but <i>possibly</i> $d(x,y)=0$ for $x \neq y$. Similarly, metametrics satisfy all properties other than 2, in that $d(x,x)$ is not necessarily 0. </li> 
        <li> Quasimetrics obey all properties other than 3. </li>
        <li> Semimetrics obey all properties other than 4. </li>
    </ul>

#### Handling different types of feature

Features may be discrete or continuous. 

Popular distance measures on continuous data, for feature vectors $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$:
<ul>
    <li> Euclidean distance (L2 norm):
        $ d(x,y) = \sqrt{\sum_{i} (x_{i}-y_{i})^2} $ </li>
    <li> Manhattan distance (L1 norm): 
        $ d(x,y) = \sum_{i} |x_{i}-y_{i}| $ </li>
    Both can be generalised by the Minkowski distance:
    <li> Minkowski distance:
        $d(x,y) = \left(\sum_{i} |x_{i}-y_{i}|^p\right)^\frac{1}{p} $ </li>
    which, in the limit as $p \to \infty$, gives the Chebyshev metric:
    <li> Chebyshev distance:
        $d(x,y) = \max_{i} |x_{i}-y_{i}|$ </li> </ul>
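
The four measures above are all members of the Minkowski family, which a short NumPy sketch makes explicit. The function name `minkowski` and the example vectors are illustrative assumptions.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p; p = np.inf gives the Chebyshev distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))       # 7.0  (Manhattan)
print(minkowski(x, y, 2))       # 5.0  (Euclidean)
print(minkowski(x, y, np.inf))  # 4.0  (Chebyshev)
```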

L-norms each have desirable characteristics when applied as similarity measures on continuous features, because numerical features implicitly define a metric space. When objects are described by a set of numerical attributes, there are natural definitions of distance based on geometric analogies. For categorical data, however, there is no such metric space and no single ordering of the categorical values. For example, consider how to define the distance between different occupations. By defining a metric space over discrete features we implicitly encode a specific set of assumptions. Distance measures applied to categorical features include: 

<ul>
        <li> One-hot encoding of categorical variables to treat as a binary (numerical) variable </li>
    <li> Value Difference Metric (uses probabilities over features) </li>
    
</ul>
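
To illustrate the second item, the sketch below computes a Value Difference Metric between two values of one nominal feature as $\sum_c |P(c \mid u) - P(c \mid v)|^q$, where $c$ ranges over the output classes. The toy data, the function name and the choice $q = 2$ are assumptions for illustration; published variants differ in weighting and normalisation.

```python
import pandas as pd

def value_difference(feature, target, u, v, q=2):
    """VDM between two values u and v of one nominal feature."""
    # Rows: feature values; columns: classes; entries: P(class | value).
    cond = pd.crosstab(feature, target, normalize="index")
    return float((cond.loc[u] - cond.loc[v]).abs().pow(q).sum())

# Toy data, invented purely for illustration.
toy = pd.DataFrame({
    "occupation": ["Sales", "Sales", "Tech", "Tech", "Farming", "Farming"],
    "income": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})
print(value_difference(toy["occupation"], toy["income"], "Tech", "Farming"))  # 2.0
print(value_difference(toy["occupation"], toy["income"], "Sales", "Tech"))    # 0.5
```

Two occupations are close under this measure when they are associated with similar class distributions, regardless of how the occupations are labelled.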

How to compute similarity between data points characterised by heterogeneous features? 

<ul> 
    <li> Current Approaches: literature </li>
    <li> Naive Approach: Could one-hot encode the categorical features and then consider a different distance measure for each feature? We may wish to use Euclidean distance for age and Minkowski distance for all encoded features representing occupation. </li>
    <li> Better approaches?
        <ul>
            <li> Heterogeneous Euclidean-Overlap Metric (HEOM): if discrete, returns 0 if same class, 1 otherwise. If continuous, uses a range-normalised difference; the per-feature distances are then combined as in the Euclidean distance. </li>
            <li> Heterogeneous Value Difference Metric (HVDM): alternative approach that uses a different algorithm for discrete and continuous data. </li>
        </ul>
    </li>
</ul>
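
A minimal sketch of the Heterogeneous Euclidean-Overlap Metric is given below: overlap (0 or 1) on nominal features, a range-normalised absolute difference on numeric features, and a Euclidean-style combination of the per-feature terms. The record layout, the `ranges` dictionary and the 40-year age range are assumptions for illustration, and missing values are not handled.

```python
import math

def heom(x, y, nominal, ranges):
    """HEOM between two records x and y.

    nominal: set of indexes of nominal features.
    ranges: per-feature (max - min) for the numeric features.
    """
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if a in nominal:
            d = 0.0 if xa == ya else 1.0      # overlap distance
        else:
            d = abs(xa - ya) / ranges[a]      # range-normalised difference
        total += d ** 2
    return math.sqrt(total)

# Records are (age, occupation); an age range of 40 years is an assumption.
ranges = {0: 40.0}
print(heom((25, "Sales"), (45, "Sales"), nominal={1}, ranges=ranges))  # 0.5
print(heom((25, "Sales"), (25, "Tech"), nominal={1}, ranges=ranges))   # 1.0
```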

How do we take into account the dependency between features - is a woman who is 45 and pregnant more similar to a 25-year-old who is also pregnant or to a 45-year-old woman who is not pregnant? 


#### Feasibility and density 

Should similarity take into account the density of the experimental data? Is a data point more similar to another if there are lots of similar data points in that area? 
How does this work in higher dimensions?
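
One possible way to make the first question concrete is to scale the distance between two points by the estimated density at their midpoint, so that moving through a well-populated region is cheaper than moving the same distance through a sparse one. The sketch below uses scikit-learn's `KernelDensity`; the toy data, the bandwidth and the "distance divided by density" form are assumptions for illustration and are not claimed to match what FACE does.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Toy 1-D data: a dense cluster near 0 and two isolated points near 10.
X = np.array([[-0.2], [-0.1], [0.0], [0.1], [0.2], [9.9], [10.1]])
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

def density_weighted_cost(a, b):
    """Distance between a and b, divided by the estimated density at their midpoint."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    midpoint = ((a + b) / 2.0).reshape(1, -1)
    density = np.exp(kde.score_samples(midpoint))[0]  # score_samples returns log-density
    return float(np.linalg.norm(a - b) / density)

dense = density_weighted_cost([-0.1], [0.1])   # both points inside the cluster
sparse = density_weighted_cost([9.9], [10.1])  # same distance, sparse region
print(dense < sparse)  # True
```

Kernel density estimates degrade quickly as the number of dimensions grows, which is one reason the higher-dimensional question above is not trivial.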

### Naive implementation - one distance measure for discrete features and Euclidean distance for continuous features

### Select subset of 50 points from dataset

In [5]:
subset_data = data.sample(n=50)
subset_data

Unnamed: 0,index,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
10716,11571,43,Private,395997,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,55,United-States,<=50K
17946,19423,47,Local-gov,328610,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
36649,39577,34,Private,27153,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,40,United-States,<=50K
14933,16149,47,Private,175600,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
23804,25744,23,State-gov,298871,Some-college,10,Married-civ-spouse,Adm-clerical,Husband,Asian-Pac-Islander,Male,0,0,40,Vietnam,<=50K
1699,1848,29,State-gov,214881,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,Honduras,>50K
37548,40545,58,Local-gov,259216,9th,5,Divorced,Other-service,Not-in-family,White,Female,0,0,40,United-States,<=50K
42161,45522,33,Private,149184,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,1977,50,United-States,>50K
21107,22852,26,Private,165510,Bachelors,13,Never-married,Farming-fishing,Own-child,White,Male,0,0,40,United-States,<=50K
12815,13839,43,Self-emp-not-inc,83411,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,0,2415,40,United-States,>50K


### Test FACE on adult

In [6]:
import pickle
from face import FACE
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

## Feature Selection

Just use 3 for now 

<ul>
    <li> age (Continuous) </li>
    <li> education (Ordinal) </li>
    <li> occupation (Nominal) </li>

</ul>


## Entropy Based Distance Measure Background 

In [7]:
data = data[(data != '?').all(axis=1)].reset_index()

In [8]:
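# Take an explicit copy so that the column assignments below modify an
# independent DataFrame rather than a slice of subset_data.
data_selected = subset_data[['age','education','occupation','income']].copy()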

In [9]:
### Discretise (Bin) Age

In [10]:
# Split age into 5 equal-frequency bins, labelled 0 (youngest) to 4 (oldest).
data_selected['discretised_age'] = pd.qcut(data_selected['age'], 5, labels=False)



In [11]:
## Encode Nominal variables

In [12]:
data_selected["occupation"] = data_selected["occupation"].astype('category')
data_selected["encoded_occupation"] = data_selected["occupation"].cat.codes
data_selected["income"] = data_selected["income"].astype('category')
data_selected["encoded_income"] = data_selected["income"].cat.codes


In [13]:
data_selected

Unnamed: 0,age,education,occupation,income,discretised_age,encoded_occupation,encoded_income
10716,43,HS-grad,Transport-moving,<=50K,2,10,0
17946,47,HS-grad,Craft-repair,<=50K,3,1,0
36649,34,HS-grad,Transport-moving,<=50K,1,10,0
14933,47,HS-grad,Craft-repair,<=50K,3,1,0
23804,23,Some-college,Adm-clerical,<=50K,0,0,0
1699,29,Bachelors,Prof-specialty,>50K,1,7,1
37548,58,9th,Other-service,<=50K,4,6,0
42161,33,Prof-school,Prof-specialty,>50K,1,7,1
21107,26,Bachelors,Farming-fishing,<=50K,0,3,0
12815,43,Bachelors,Sales,>50K,2,8,1


In [14]:
### Encode Ordinal variables

In [15]:
data_selected["education"] = data_selected["education"].astype('category')
data_selected["education"].value_counts()


HS-grad         13
Some-college    13
Bachelors        9
Masters          4
11th             2
12th             2
Assoc-acdm       2
5th-6th          1
9th              1
Assoc-voc        1
Doctorate        1
Prof-school      1
Name: education, dtype: int64

In [16]:
education_order_dict = {
    'Preschool' : 0,
    '1st-4th': 1,
    '5th-6th': 2,
    '7th-8th':3,
    '9th':4,
    '10th':5,
    '11th':6,
    '12th':7,
    'HS-grad':8,
    'Some-college':9,
    'Prof-school':10,
    'Assoc-acdm':11,
    'Assoc-voc':12,
    'Bachelors':13,
    'Masters':14,
    'Doctorate':15,
} 

In [17]:
# Map each education level to its position in the ordering defined above.
education_order_list = [education_order_dict[x] for x in data_selected['education']]

In [18]:
data_selected['encoded_education'] = education_order_list


In [19]:
data_selected

Unnamed: 0,age,education,occupation,income,discretised_age,encoded_occupation,encoded_income,encoded_education
10716,43,HS-grad,Transport-moving,<=50K,2,10,0,8
17946,47,HS-grad,Craft-repair,<=50K,3,1,0,8
36649,34,HS-grad,Transport-moving,<=50K,1,10,0,8
14933,47,HS-grad,Craft-repair,<=50K,3,1,0,8
23804,23,Some-college,Adm-clerical,<=50K,0,0,0,9
1699,29,Bachelors,Prof-specialty,>50K,1,7,1,13
37548,58,9th,Other-service,<=50K,4,6,0,4
42161,33,Prof-school,Prof-specialty,>50K,1,7,1,10
21107,26,Bachelors,Farming-fishing,<=50K,0,3,0,13
12815,43,Bachelors,Sales,>50K,2,8,1,13


In [20]:
# Keep only the encoded features and replace the sampled row labels with 0..n-1.
data_df = data_selected[['discretised_age','encoded_occupation','encoded_education']].reset_index(drop=True)
y = data_selected['encoded_income'].reset_index()
y['indexes'] = y.index

In [21]:
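# The SVC below is disabled; a 3-nearest-neighbour classifier is used instead.
# The variable keeps the name `svm` because later cells refer to it.
#svm = SVC(probability=True)
from sklearn.neighbors import KNeighborsClassifier

svm = KNeighborsClassifier(3)
svm.fit(data_df.values, y['encoded_income'].values)
# Accuracy on the training data itself (no held-out test set is used here).
print(svm.score(data_df.values, y['encoded_income'].values))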

0.82


In [22]:
print(svm.predict(data_df.values))

[0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 1 0 1 0 0 1 0 0 0
 0 0 0 0 0 0 1 1 0 1 0 0 0]


In [23]:
ordinal_indexes = [0,2]
nominal_indexes = [1]

In [24]:
ce = FACE(data_df, svm, ordinal_indexes, nominal_indexes, "categorial")

In [25]:
eg = data_df.iloc[0].values.reshape(1, -1)
path, prob = ce.generate_counterfactual(eg)

50 nodes have been added to graph.
435 edges have been added to graph.


In [26]:
path

Unnamed: 0,discretised_age,encoded_occupation,encoded_education
0,2,10,8
1,2,9,9


In [27]:
#### Importance of Distance Measure in Counterfactual Generation 

In its simplest form, a counterfactual generation algorithm receives as input: an instance to be explained, the set of n available training examples (instances) alongside their associated labels, and the label of the instance to be explained. The algorithm uses a distance function to determine how close the instance to be explained is to each training instance, and searches for the *optimal* instance with the opposite label.
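
The simplest form described above can be sketched as a linear scan over the training instances. Here "optimal" is taken to mean "closest under the supplied distance function", which is an assumption; the function name and toy data are also illustrative.

```python
import numpy as np

def nearest_counterfactual(x, X_train, labels, x_label, distance):
    """Return (index, distance) of the closest training instance whose label differs from x_label."""
    best_index, best_distance = None, np.inf
    for i, (candidate, label) in enumerate(zip(X_train, labels)):
        if label == x_label:
            continue  # only instances with the opposite label are candidates
        d = distance(x, candidate)
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index, best_distance

X_train = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0], [3.0, 3.0]])
labels = np.array([0, 0, 1, 1])
euclidean = lambda a, b: float(np.linalg.norm(a - b))

index, dist = nearest_counterfactual(np.array([1.0, 1.0]), X_train, labels, 0, euclidean)
print(index, round(dist, 3))  # 3 2.828
```

Swapping `euclidean` for a different distance function changes which instance is returned, which is why the choice of measure matters.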

In recent explainable AI literature, counterfactual generation algorithms have been expanded to explain model behaviour. In this case, the algorithm is initialised as above, with the additional input of the machine learning model trained on the training instances. Because the counterfactuals are generated to explain the behaviour of the machine learning model, the model predictions are used as instance labels instead of the ground truth labels. 

The importance of distance measure selection has been widely studied across many machine learning domains where it has been shown that the choice of measure drastically impacts the bias and generalisation capability of a machine learning algorithm.

WHY IS DISTANCE MEASURE IMPORTANT FOR COUNTERFACTUALS

The 


In [None]:
#### Distance Measures and Feature Types 

The input instances are represented by a vector of features which can include:

<ul>
    <li> Categorical features - describe variables that can take a fixed number of values. These can be further subdivided into nominal and ordinal features; ordinal features differ from nominal features by having a natural ordering over the set of values. </li>
    <li> Continuous features - describe variables that can take an infinite number of values with a natural ordering. </li>
</ul>

Multiple distance measures with varying properties have been proposed in the literature, the most popular being the Euclidean distance.  

The most popular distance measures are targeted towards continuous features. Considerably fewer distance measures have been proposed to handle categorical features, and fewer still to handle a mixture of both types of feature. Many real-world applications have both nominal and linear attributes, including, for example, over half of the datasets in the UCI Machine Learning Database Repository. This motivates the need for distance measures that are capable of appropriately handling different kinds of features. 

 

In [28]:
### One hot encoding & Euclidean: Why is this a problem 

The most common approach to calculating distances between mixed feature types in counterfactual generation is to first transform categorical features into binary features via one-hot encoding, so that they can be handled by continuous distance measures, principally the Euclidean distance.  

TODO what is one-hot encoding. 
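
As a minimal sketch of one-hot encoding, each category becomes its own binary column, with a 1 in the column of the category that is present. The three occupations below are picked from the Adult data purely for illustration. Under the Euclidean distance, every pair of distinct categories then ends up exactly the same distance apart.

```python
import numpy as np
import pandas as pd

occupations = pd.Series(["Sales", "Tech-support", "Farming-fishing"], name="occupation")
one_hot = pd.get_dummies(occupations).astype(float)  # one binary column per occupation

vectors = {name: one_hot.loc[i].to_numpy() for i, name in occupations.items()}
pairs = [("Sales", "Tech-support"),
         ("Sales", "Farming-fishing"),
         ("Tech-support", "Farming-fishing")]
distances = [float(np.linalg.norm(vectors[a] - vectors[b])) for a, b in pairs]
for (a, b), d in zip(pairs, distances):
    print(a, b, round(d, 3))  # every pair prints 1.414, i.e. sqrt(2)
```

The encoding therefore cannot express that some occupations are closer to each other than others.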

In [None]:
#### One Hot Encoding and the Adult dataset 

In [2]:
import pandas as pd  # re-imported so that this cell also runs after a kernel restart

encoded_data = pd.read_csv("encoded_adult.csv")
encoded_data.head()

In [None]:
#### Select subset of data to generate counterfactuals for 

In [1]:
subset_data = encoded_data.sample(n=10)


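Treating categorical variables as continuous causes several problems. One-hot encoding places every pair of distinct categories at the same distance from each other, while integer codes impose an arbitrary ordering and spacing on nominal values.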

In [None]:
### Introducing Our Approach 

We follow Wilson et al. (https://arxiv.org/pdf/cs/9701101.pdf), who argue "that any value stored in a computer is discrete at some level". We adopt an alternative distance measure where we first transform continuous features into categorical features via binning. This gives an input feature vector made up of nominal and ordinal variables only. 
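
The binning step can be sketched with `pandas.qcut`, which splits a continuous feature into equal-frequency bins and returns ordinal bin labels. The ten ages and the choice of five bins are assumptions for illustration.

```python
import pandas as pd

ages = pd.Series([25, 38, 28, 44, 34, 27, 40, 58, 22, 63], name="age")

# Five equal-frequency bins, labelled 0 (youngest) to 4 (oldest).
age_bins = pd.qcut(ages, 5, labels=False)
print(pd.DataFrame({"age": ages, "age_bin": age_bins}).sort_values("age"))
```

With ten distinct ages and five bins, each bin holds two ages, and the bin labels keep the ordering of the original values, so the binned feature can be treated as ordinal.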

In [None]:
#### Motivating our method 

Defining a reasonable distance for mixed categorical data is challenging because the categories of ordinal and nominal attributes relate to one another in different ways, which yields different types of inter-category distance.

In [None]:
#### Example showing success of our method 