## Notes

Edge weight = feasibility + actionability

where,

feasibility: statistical - how easy it is to get from data point A to data point B. Modelled by some distance measure (and density estimation)


actionability: something subjective - how easy it is for individual A to become individual B
(modelled by user-specified constraints)

distance measure: feasibility; start with categorical variables, then apply to continuous ones
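
A minimal sketch of how the two terms could be combined into one edge weight. It assumes that actionability is expressed as a list of user-specified constraint functions and that a violated constraint makes the edge unusable (infinite cost). The names `edge_weight` and `age_cannot_decrease` are hypothetical and are not part of FACE.

```python
import math

def edge_weight(feasibility_cost, source, target, constraints):
    """Feasibility cost plus an actionability cost (0 if allowed, inf if not)."""
    actionable = all(rule(source, target) for rule in constraints)
    actionability_cost = 0.0 if actionable else math.inf
    return feasibility_cost + actionability_cost

def age_cannot_decrease(source, target):
    """Example user constraint: an individual cannot become younger."""
    return target["age"] >= source["age"]

person_a = {"age": 30}
person_b = {"age": 35}
forward = edge_weight(0.7, person_a, person_b, [age_cannot_decrease])
backward = edge_weight(0.7, person_b, person_a, [age_cannot_decrease])
print(forward, backward)
```

A softer variant could return a finite penalty instead of infinity, so that discouraged moves stay possible but expensive.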

In [7]:
import pandas as pd
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("adult.csv")

In [3]:
data = data[(data != '?').all(axis=1)].reset_index()

In [4]:
data

Unnamed: 0,index,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45217,48837,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
45218,48838,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
45219,48839,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
45220,48840,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


## Feasibility: A feature-considered distance function

### On Single Features:

#### Distance Metrics - The Case for breaking Metric Properties

A metric on a set $X$ is a function:
$$ d : X \times X \to \mathbb{R}_{\geq 0} $$
For all $x, y, z$ in $X$, this function is required to satisfy the following conditions:
<ul>
    <li> 1. $ d(x, y) ≥ 0 $ </li>
    <li> 2. $d(x, y) = 0$  iff   $x = y$  </li>
    <li> 3. $ d(x, y) = d(y, x) $   </li>
    <li> 4. $ d(x, z) ≤ d(x, y) + d(y, z) $ </li>
</ul>
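
The four conditions can be checked by brute force on a small finite set of points. The sketch below is only illustrative; `check_metric_axioms` and the two example distances are hypothetical helpers.

```python
from itertools import product

def check_metric_axioms(points, d, tol=1e-12):
    """Report which of the four metric conditions hold for d on a finite set."""
    pairs = list(product(points, repeat=2))
    triples = product(points, repeat=3)
    return {
        "non_negativity": all(d(x, y) >= 0 for x, y in pairs),
        "identity": all((abs(d(x, y)) <= tol) == (x == y) for x, y in pairs),
        "symmetry": all(abs(d(x, y) - d(y, x)) <= tol for x, y in pairs),
        "triangle": all(d(x, z) <= d(x, y) + d(y, z) + tol for x, y, z in triples),
    }

def absolute_difference(a, b):
    return abs(a - b)

def squared_difference(a, b):
    return (a - b) ** 2

ages = [20, 35, 50]
print(check_metric_axioms(ages, absolute_difference))
print(check_metric_axioms(ages, squared_difference))
```

On these three ages the absolute difference passes every check. The squared difference fails only the triangle inequality, since 30² > 15² + 15².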
    
    
Why might these be violated when generating feasible counterfactual explanations?
<ul> 
    <li> 1. Non-negativity </li>
    <li> 2. Identity of indiscernibles </li>
    <li> 3. Symmetry - It can be easy to increase a feature but impossible to decrease it (e.g. age) </li>
    <li> 4. Triangle inequality - It can be easier to go "the long way round" than to make a direct jump (e.g. education level: it is impossible to go straight from kindergarten to PhD) </li>
</ul>

Together, properties 1 and 2 give positive definiteness, which is important for convexity.
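
Two toy cost functions make the symmetry and triangle violations concrete. Both are made up for illustration: `age_cost` assumes that getting younger is impossible, and `education_cost` assumes that skipping levels is penalised quadratically.

```python
import math

def age_cost(age_from, age_to):
    """Directed cost: ageing costs the gap in years, getting younger is infeasible."""
    if age_to < age_from:
        return math.inf
    return float(age_to - age_from)

def education_cost(level_from, level_to):
    """Directed cost: moving up k levels in one jump costs k squared."""
    if level_to < level_from:
        return math.inf
    return float((level_to - level_from) ** 2)

# Symmetry is broken: 30 -> 40 is possible, 40 -> 30 is not.
print(age_cost(30, 40), age_cost(40, 30))

# The triangle inequality is broken: two single steps are cheaper than one jump.
direct = education_cost(0, 2)
stepwise = education_cost(0, 1) + education_cost(1, 2)
print(direct, stepwise)
```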

Extensions to Distance Metrics:
    <ul>
        <li> Pseudometrics satisfy all properties apart from 2: $d(x,x) = 0$ always holds, but <i>possibly</i> $d(x,y)=0$ for $x \neq y$ as well. Similarly, metametrics satisfy all properties other than 2, in that $d(x,x)$ is not necessarily 0. </li> 
        <li> Quasimetrics obey all properties other than 3. </li>
        <li> Semimetrics obey all properties other than 4. </li>
    </ul>

#### Handling different types of feature

Features may be discrete or continuous. 

Popular distance measures on continuous data:
<ul>
    <li> Euclidean distance (L2 norm):
        $ d(x,y) = \sqrt{\sum_i (x_i - y_i)^2} $ </li>
    <li> Manhattan distance (L1 norm): 
        $ d(x,y) = \sum_i |x_i - y_i| $ </li>
    <li> Minkowski distance, which generalises both:
        $ d(x,y) = \left(\sum_i |x_i - y_i|^p\right)^{\frac{1}{p}} $ </li>
    <li> Chebyshev distance, the limit of the Minkowski distance as $p \to \infty$:
        $ d(x,y) = \max_i |x_i - y_i| $ </li>
</ul>
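
The relationship between these measures can be checked numerically. The sketch below uses plain Python and two arbitrary 2-D points: the Minkowski distance gives Manhattan at p = 1 and Euclidean at p = 2, and approaches Chebyshev as p grows.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))    # Manhattan
print(minkowski(x, y, 2))    # Euclidean
print(minkowski(x, y, 50))   # close to the Chebyshev value
print(chebyshev(x, y))
```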

L-norms have desirable characteristics as similarity measures on continuous features because numerical features implicitly define a metric space. When objects are described by a set of numerical attributes, there are natural definitions of distance based on geometric analogies. Categorical data, however, has no natural metric space and no single ordering of the categorical values; consider, for example, how to define the distance between different occupations. By defining a metric space over discrete features we implicitly encode a specific set of assumptions. Distance measures applied to categorical features include: 

<ul>
    <li> One-hot encoding of categorical variables, so that each category is treated as a binary (numerical) variable </li>
    <li> Value Difference Metric (compares the class probabilities conditioned on each feature value) </li>
</ul>
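
A minimal sketch of the Value Difference Metric for one nominal feature, in its simplified unweighted form: the distance between two values is the sum over classes of the difference in conditional class probabilities, raised to a power `q`. The toy occupations and labels are made up, and the function assumes that both values occur in the data.

```python
from collections import Counter, defaultdict

def value_difference(values, labels, a, b, q=2):
    """Sum over classes c of |P(c | a) - P(c | b)| ** q."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    n_a = sum(counts[a].values())
    n_b = sum(counts[b].values())
    return sum(
        abs(counts[a][c] / n_a - counts[b][c] / n_b) ** q
        for c in set(labels)
    )

occupation = ["clerk", "clerk", "manager", "manager", "sales", "sales"]
income = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]

print(value_difference(occupation, income, "clerk", "manager"))
print(value_difference(occupation, income, "clerk", "sales"))
```

Two values end up close when they predict the class in a similar way, regardless of what the values themselves are.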

How to compute similarity between data points characterised by heterogeneous features? 

<ul> 
    <li> Current Approaches: literature </li>
    <li> Naive Approach: Could we one-hot encode the categorical features and then use a different distance measure for each feature? We may wish to use Euclidean distance for age and Minkowski distance for all encoded features representing occupation. </li>
    <li> Better approaches?
        <ul>
            <li> Heterogeneous Euclidean-Overlap Metric (HEOM): if discrete, returns 0 if same class, 1 otherwise. If continuous, Euclidean distance. </li>
            <li> Heterogeneous Value Difference Metric (HVDM): alternative approach that uses a different algorithm for discrete and continuous data. </li>
        </ul>
    </li>
</ul>
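
A minimal sketch of a HEOM-style distance, assuming the usual formulation in which each continuous feature is divided by its range so that it is comparable with the 0/1 overlap term. The feature ranges and the two example rows are made up for illustration.

```python
import math

def heom(x, y, nominal, ranges):
    """Overlap (0/1) on nominal features, range-normalised difference on
    continuous ones, combined with a Euclidean norm."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if i in nominal:
            d = 0.0 if a == b else 1.0
        else:
            d = abs(a - b) / ranges[i]
        total += d * d
    return math.sqrt(total)

# Columns: age, workclass, hours-per-week (ranges are assumed, not measured)
row_a = (25, "Private", 40)
row_b = (45, "Local-gov", 40)
print(heom(row_a, row_b, nominal={1}, ranges={0: 50.0, 2: 80.0}))
```

Because the overlap term is always 0 or 1, a single mismatched category can dominate the distance. This is one motivation for value-difference style measures on nominal features.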

How do we take into account the dependency between features? For example, is a woman who is 45 and pregnant more similar to a 25-year-old who is also pregnant, or to a 45-year-old woman who is not pregnant? 


#### Feasibility and density 

Should similarity take into account the density of the experimental data? Is a data point more similar to another if there are lots of similar data points in that area? 
How does this work in higher dimensions?
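
One possible way to make distance density-aware, sketched in one dimension: estimate the density with a Gaussian kernel and divide the raw distance by the density at the midpoint, so that moves through sparse regions cost more. The weighting `distance / density` is an assumption chosen for simplicity, not the formula used by FACE, and the sample ages are made up.

```python
import math

def gaussian_kde_1d(samples, bandwidth):
    """Return a function that estimates the density at x with Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )
    return density

def density_weighted_distance(a, b, density):
    """Raw distance divided by the estimated density at the midpoint."""
    midpoint = (a + b) / 2
    return abs(a - b) / density(midpoint)

ages = [20, 21, 22, 23, 24, 60]
kde = gaussian_kde_1d(ages, bandwidth=2.0)

dense_move = density_weighted_distance(21, 23, kde)   # through a crowded region
sparse_move = density_weighted_distance(40, 42, kde)  # through an empty region
print(dense_move < sparse_move)
```

Both moves cover two years, but the one through the empty region is far more expensive. In higher dimensions the density estimate itself becomes unreliable, which is part of the question raised above.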

### Naive implementation - one distance measure for discrete features and one for continuous features (Euclidean)

### Select subset of 100 points from dataset

In [5]:
subset_data = data.sample(n=100)
subset_data

Unnamed: 0,index,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
11882,12831,35,Private,301911,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Asian-Pac-Islander,Male,0,0,50,Japan,>50K
29824,32219,53,Self-emp-not-inc,158284,HS-grad,9,Married-civ-spouse,Sales,Husband,White,Male,0,0,70,United-States,<=50K
16623,17982,18,Private,46247,Some-college,10,Never-married,Other-service,Own-child,White,Female,0,0,15,United-States,<=50K
20548,22259,32,Private,153471,HS-grad,9,Never-married,Farming-fishing,Not-in-family,White,Male,0,0,35,United-States,<=50K
28354,30635,25,Private,159603,HS-grad,9,Never-married,Other-service,Unmarried,White,Female,0,0,34,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40149,43354,51,Private,25031,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,10,United-States,>50K
9504,10294,40,Private,211938,10th,6,Never-married,Transport-moving,Not-in-family,White,Male,0,0,40,United-States,<=50K
28624,30932,43,Local-gov,174491,HS-grad,9,Divorced,Tech-support,Not-in-family,Black,Female,0,0,40,United-States,<=50K
31642,34176,44,Self-emp-not-inc,75065,12th,8,Married-civ-spouse,Exec-managerial,Husband,Asian-Pac-Islander,Male,0,0,60,Vietnam,<=50K


### Test FACE on adult

In [6]:
import pickle
from face import FACE
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

In [7]:
adult = pd.read_csv('../Tom/adult.csv', index_col=0, nrows=1e3)
x_adult = adult[['age', 'education', 'sex', 'weekly-hours']].values
y_adult = adult[['compensation']].values.squeeze()
scaler = StandardScaler().fit(x_adult)
x_adult = scaler.transform(x_adult)
svm = SVC(probability=True)
svm.fit(x_adult, y_adult)
print(svm.score(x_adult, y_adult))

0.799


In [8]:
data = pd.DataFrame(np.load('adult/data.npy'), columns=['age', 'education', 'sex', 'weekly-hours'])
with open('adult/model.pkl', 'rb') as f:
    clf = pickle.load(f)
with open('adult/scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

ce = FACE(data, clf, dist_threshold=0.5, density_threshold=0.005, pred_threshold=0.6)
ce.graph.number_of_nodes()

  w_ij[q] = -np.log(self.kde((self.data.values[node_from] + self.data.values[node_to]) / 2) * dist)


785

In [9]:
eg = data.iloc[0].values.reshape(1, -1)
path, prob = ce.generate_counterfactual(eg)
pd.DataFrame(scaler.inverse_transform(path), columns=['age', 'education', 'sex', 'weekly-hours'])

Unnamed: 0,age,education,sex,weekly-hours
0,39.0,1.0,1.0,40.0
1,42.0,1.0,1.0,45.0


In [10]:
pd.DataFrame(scaler.inverse_transform(eg), columns=['age', 'education', 'sex', 'weekly-hours'])

Unnamed: 0,age,education,sex,weekly-hours
0,39.0,1.0,1.0,40.0


## Feature Selection

Just use 3 for now 

<ul>
    <li> age (Continuous) </li>
    <li> education (Ordinal) </li>
    <li> occupation (Nominal) </li>

</ul>


## Entropy Based Distance Measure Background 

In [11]:
data = data[(data != '?').all(axis=1)].reset_index()

In [12]:
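# Take an explicit copy so that later column assignments do not write to a slice of subset_data
data_selected = subset_data[['age','education','occupation','income']].copy()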

In [13]:
### Discretise (Bin) Age

In [14]:
data_selected['discretised_age'] = pd.qcut(data_selected['age'], 5, labels=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [15]:
## Encode Nominal variables

In [16]:
data_selected["occupation"] = data_selected["occupation"].astype('category')
data_selected["encoded_occupation"] = data_selected["occupation"].cat.codes
data_selected["income"] = data_selected["income"].astype('category')
data_selected["encoded_income"] = data_selected["income"].cat.codes

In [17]:
data_selected

Unnamed: 0,age,education,occupation,income,discretised_age,encoded_occupation,encoded_income
11882,35,Bachelors,Exec-managerial,>50K,2,2,1
29824,53,HS-grad,Sales,<=50K,4,10,0
16623,18,Some-college,Other-service,<=50K,0,6,0
20548,32,HS-grad,Farming-fishing,<=50K,1,3,0
28354,25,HS-grad,Other-service,<=50K,0,6,0
...,...,...,...,...,...,...,...
40149,51,Some-college,Exec-managerial,>50K,4,2,1
9504,40,10th,Transport-moving,<=50K,3,12,0
28624,43,HS-grad,Tech-support,<=50K,3,11,0
31642,44,12th,Exec-managerial,<=50K,3,2,0


In [18]:
### Encode Ordinal variables

In [19]:
data_selected["education"] = data_selected["education"].astype('category')
data_selected["education"].value_counts()


HS-grad         37
Some-college    21
Bachelors       12
Masters          6
10th             4
Assoc-acdm       4
Assoc-voc        4
12th             3
7th-8th          3
Prof-school      3
9th              2
Doctorate        1
Name: education, dtype: int64

In [20]:
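# Hand-specified rank for each education level (0 = lowest).
# Note: this places 'Prof-school' below the associate degrees, which may
# not match the ordering in the dataset's own educational-num column.
education_order_dict = {
    'Preschool': 0,
    '1st-4th': 1,
    '5th-6th': 2,
    '7th-8th': 3,
    '9th': 4,
    '10th': 5,
    '11th': 6,
    '12th': 7,
    'HS-grad': 8,
    'Some-college': 9,
    'Prof-school': 10,
    'Assoc-acdm': 11,
    'Assoc-voc': 12,
    'Bachelors': 13,
    'Masters': 14,
    'Doctorate': 15,
}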

In [21]:
education_order_list = [education_order_dict[x] for x in data_selected['education']]

In [22]:
data_selected['encoded_education'] = education_order_list


In [23]:
data_selected

Unnamed: 0,age,education,occupation,income,discretised_age,encoded_occupation,encoded_income,encoded_education
11882,35,Bachelors,Exec-managerial,>50K,2,2,1,13
29824,53,HS-grad,Sales,<=50K,4,10,0,8
16623,18,Some-college,Other-service,<=50K,0,6,0,9
20548,32,HS-grad,Farming-fishing,<=50K,1,3,0,8
28354,25,HS-grad,Other-service,<=50K,0,6,0,8
...,...,...,...,...,...,...,...,...
40149,51,Some-college,Exec-managerial,>50K,4,2,1,9
9504,40,10th,Transport-moving,<=50K,3,12,0,5
28624,43,HS-grad,Tech-support,<=50K,3,11,0,8
31642,44,12th,Exec-managerial,<=50K,3,2,0,7


In [336]:
import itertools

def R_ordinal(X, Y):
    """Concordance between two ordinal features over all unordered pairs of rows."""
    X, Y = list(X), list(Y)  # positional access, independent of the Series index
    pairs = list(itertools.combinations(range(len(X)), 2))
    positive_concordant = 0  # both features decrease from row i to row j
    negative_concordant = 0  # both features increase from row i to row j
    equal_concordant = 0     # both features are tied
    for i, j in pairs:
        if X[i] == X[j] and Y[i] == Y[j]:
            equal_concordant += 1
        elif X[i] > X[j] and Y[i] > Y[j]:
            positive_concordant += 1
        elif X[i] < X[j] and Y[i] < Y[j]:
            negative_concordant += 1

    # Ties plus the net number of pairs ordered in the dominant direction
    return (equal_concordant + abs(positive_concordant - negative_concordant)) / len(pairs)
    

In [337]:
import itertools

def R_nominal(X, Y):
    """Fraction of unordered row pairs that are tied on X or on Y."""
    X, Y = list(X), list(Y)  # positional access, independent of the Series index
    pairs = list(itertools.combinations(range(len(X)), 2))
    # Despite the name, this counts pairs tied on at least one of the two features
    non_concordant = sum(1 for i, j in pairs if X[i] == X[j] or Y[i] == Y[j])
    return non_concordant / len(pairs)
    

In [338]:
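def calculate_entropy(df, X, Y, h, t, v_s):
    """Entropy over the values of column Y for the rows where column X is h or t.

    Assumes Y is coded 0..v_s-1 and that h != t.
    """
    N = len(df[X])
    entropy = 0
    for u in range(0, v_s):
        # Joint relative frequencies of (X == h, Y == u) and (X == t, Y == u)
        prob_h = len(df.loc[(df[X] == h) & (df[Y] == u)]) / N
        prob_t = len(df.loc[(df[X] == t) & (df[Y] == u)]) / N

        prob_th = prob_h + prob_t
        if prob_th == 0:
            # log2(0) is undefined, so fall back to a small positive value
            print("oh dear")
            prob_th = 0.00001
        entropy = entropy + (prob_th * np.log2(prob_th))

    entropy = -entropy
    return entropy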
    
    #what happens when probability is 0 
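
One common answer to the zero-probability question is the convention 0 · log 0 = 0, which simply skips empty cells instead of substituting a small constant. The sketch below shows that alternative in plain Python; `pair_entropy` is a hypothetical helper, not the function used in the cells of this notebook, and it gives a slightly different value from the 0.00001 fallback.

```python
import math

def pair_entropy(x_values, y_values, h, t):
    """Entropy over y for the rows where x is h or t, skipping empty cells."""
    n = len(x_values)
    entropy = 0.0
    for u in set(y_values):
        p = sum(
            1 for x, y in zip(x_values, y_values) if x in (h, t) and y == u
        ) / n
        if p > 0:  # convention: 0 * log2(0) contributes nothing
            entropy -= p * math.log2(p)
    return entropy

A3 = [0, 1, 2, 1, 0]  # nominal feature
A2 = [0, 1, 1, 2, 2]  # feature the entropy is measured against
entropy = pair_entropy(A3, A2, h=2, t=1)
print(entropy / math.log2(3))  # normalised by the maximum entropy of A2
```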

In [339]:
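# X and Y are column positions in df; h and t are category codes of column X

def calculate_distance(df, X, Y, h, t):
    """Entropy-based distance between values h and t of column X, measured against column Y.

    Relies on the global lists ordinal_indexes and nominal_indexes.
    """
    Y_name = df.columns[Y]
    X_name = df.columns[X]
    v_s = len(df[Y_name].value_counts())  # number of distinct values in Y

    if t == h:
        dist = 0

    elif X in ordinal_indexes:
        # Ordinal: sum the entropies of every adjacent pair of categories between h and t
        lower_bound = min(t, h)
        upper_bound = max(t, h)
        entropy = 0
        for category_index in range(lower_bound, upper_bound):
            entropy = entropy + calculate_entropy(df, X_name, Y_name, category_index, category_index + 1, v_s)
        S_Y = -np.log2(1 / v_s)  # maximum entropy of Y, used to normalise
        dist = entropy / S_Y

    elif X in nominal_indexes:
        # Nominal: a single entropy term for the pair (h, t)
        entropy = calculate_entropy(df, X_name, Y_name, h, t, v_s)
        S_Y = -np.log2(1 / v_s)  # maximum entropy of Y, used to normalise
        dist = entropy / S_Y

    else:
        print("index not recognised")
        return -1

    return dist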
        
    
        

In [9]:
A1 = [0,0,1,2,2]
A2 = [0,1,1,2,2]
A3 = [0,1,2,1,0]

synthetic_df = pd.DataFrame()
synthetic_df['A1'] = A1 #ordinal
synthetic_df['A2'] = A2 #ordinal
synthetic_df['A3'] = A3 #nominal

In [341]:
R_nominal(synthetic_df['A3'],synthetic_df['A1'])

0.4

In [342]:
### turn each feature into a categorical
for column in synthetic_df.columns:
    synthetic_df[column] = synthetic_df[column].astype('category')
    #data_selected["education"].value_counts()
ordinal_indexes = [0,1]
nominal_indexes = [2]

In [343]:
calculate_distance(synthetic_df,2,1,2,1)

oh dear


0.6267170061658877

In [344]:
synthetic_df


Unnamed: 0,A1,A2,A3
0,0,0,0
1,0,1,1
2,1,1,2
3,2,2,1
4,2,2,0


In [364]:
def distance_algorithm(df, point1, point2):
    """Entropy-based distance between two points given as lists of category codes.

    Relies on the global lists ordinal_indexes and nominal_indexes.
    """
    columns = df.columns
    number_of_columns = len(columns)

    # Pairwise relationship R between every pair of features
    R_dict = {}
    for outer_column_index in range(number_of_columns):
        R_dict[outer_column_index] = {}
        for inner_column_index in range(number_of_columns):
            both_ordinal = outer_column_index in ordinal_indexes and inner_column_index in ordinal_indexes
            R = R_ordinal if both_ordinal else R_nominal
            R_dict[outer_column_index][inner_column_index] = R(df[columns[outer_column_index]], df[columns[inner_column_index]])

    print(R_dict)

    # Per-feature distance: average over all features s of distance(r, s) weighted by R[r][s]
    distance_dict = {}
    for outer_column_index in range(number_of_columns):  # r
        h = point1[outer_column_index]
        t = point2[outer_column_index]
        temp_distances = []
        for inner_column_index in range(number_of_columns):  # s
            temp_distance = calculate_distance(df, outer_column_index, inner_column_index, h, t)
            temp_distances.append(temp_distance * R_dict[outer_column_index][inner_column_index])

        distance_dict[outer_column_index] = sum(temp_distances) / number_of_columns

    print(distance_dict)
    # Combine the per-feature distances with a Euclidean norm
    overall_dist = np.sqrt(np.sum([x**2 for x in distance_dict.values()]))
    return overall_dist
    
                
    
    
    

In [369]:
distance_algorithm(synthetic_df,[2,2,0],[2,2,2])

{0: {0: 1.0, 1: 0.8, 2: 0.4}, 1: {0: 0.8, 1: 1.0, 2: 0.4}, 2: {0: 0.4, 1: 0.4, 2: 0.2}}
oh dear
{0: 0.0, 1: 0.0, 2: 0.27617689705926085}


0.27617689705926085

In [1]:
import pandas as pd
from face import FACE

In [2]:
y = [0,0,1,0,1,1,1,0,0,0,1,1,1,0]
ordinal_indexes = [0,1]
nominal_indexes = [2]

In [3]:
A1 = [0,0,1,2,2,0,1,1,1,2,2,0,1,2]
A2 = [0,1,1,2,2,0,1,1,2,2,0,0,1,1]
A3 = [0,1,2,1,0,2,2,0,1,1,0,1,2,2]

synthetic_df = pd.DataFrame()
synthetic_df['A1'] = A1 #ordinal
synthetic_df['A2'] = A2 #ordinal
synthetic_df['A3'] = A3 #nominal

In [None]:
from sklearn.svm import SVC
svm = SVC(probability=True)
svm.fit(synthetic_df.values, y)
print(svm.score(synthetic_df.values, y))

In [None]:
ce = FACE(synthetic_df, svm, ordinal_indexes, nominal_indexes)

In [None]:
eg = synthetic_df.iloc[4].values.reshape(1, -1)
path, prob = ce.generate_counterfactual(eg)
# No scaler was fitted on the synthetic data, so the path is shown as returned
pd.DataFrame(path, columns=['A1', 'A2', 'A3'])

In [19]:
distance_algorithm(synthetic_df,[2,2,0],[2,2,2])

NameError: name 'distance_algorithm' is not defined