## What is a Front-Door Causal Bootstrap Algorithm?

- It is a resampling algorithm: it draws a new dataset from the original one, with sampling weights derived from the front-door criterion
- The motivation is that a real problem rarely has a single confounder; it usually has several,
- and some of them may not be observed, so they cannot be adjusted for directly
- which is why we look at variables that sit on the causal path X -> Z -> Y, known as **mediators**
- The front-door criterion applies when an observed mediator Z intercepts every directed path from X to Y, there is no unblocked back-door path from X to Z, and every back-door path from Z to Y is blocked by X
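
For reference, with X the cause, Z the mediator and Y the effect, the standard front-door adjustment formula (stated here without derivation) can be written as:

$$
p(y \mid do(x)) = \sum_{z} p(z \mid x) \sum_{x'} p(y \mid x', z)\, p(x')
$$

The inner sum averages over values x' of the cause drawn from its marginal distribution, which is what separates this quantity from the plain conditional p(y | x).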

**Example**
- Imagine you want to assess whether studying more hours (the “cause” X) leads to improved exam performance (the “effect” Y).
- However, confounders such as "prior knowledge" and "innate ability" influence both and may not be observed, so back-door causal bootstrapping, which needs the confounders to be measured, isn't possible
- However, let's say we have an observed variable, "number of practice problems done" (the mediator Z)
- We know that people who study more do more practice problems, and that people who do more practice problems get better grades
- If we know that no unobserved variable influences both study hours and the number of practice problems done,
- that no variable other than study hours itself influences both the number of practice problems done and exam performance,
- and that studying affects exam performance only through the practice problems,
- then we can apply front-door causal bootstrapping to estimate the causal relation between the input and output variables
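
The example can be made concrete with a small simulation. This is only a sketch under strong assumptions: the world is linear, every coefficient is made up, and `ols` is a hypothetical helper rather than anything used later in this notebook. It compares a naive regression of Y on X with a front-door estimate built from the two links X -> Z and Z -> Y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical linear world (all coefficients are made up for illustration).
u = rng.normal(size=n)                      # unobserved confounder, e.g. innate ability
x = u + rng.normal(size=n)                  # study hours, influenced by u
z = 0.8 * x + rng.normal(size=n)            # practice problems, driven only by x
y = 0.5 * z + 1.0 * u + rng.normal(size=n)  # exam score, driven by z and u
true_effect = 0.8 * 0.5                     # effect of x on y, which flows through z


def ols(target, *regressors):
    """Least-squares slopes of target on the regressors (intercept fitted, then dropped)."""
    A = np.column_stack([np.ones(len(target)), *regressors])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef[1:]


naive = ols(y, x)[0]      # confounded: also picks up the u -> y path
x_to_z = ols(z, x)[0]     # effect of x on z (this link is unconfounded)
z_to_y = ols(y, z, x)[0]  # effect of z on y, with x blocking the back-door path
front_door = x_to_z * z_to_y

print(f"true {true_effect:.2f}  naive {naive:.2f}  front-door {front_door:.2f}")
```

Under these assumptions the naive slope is 0.9 in expectation, more than double the true effect of 0.4, while the front-door product recovers 0.4 without ever observing `u`.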

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from scipy.spatial.distance import cdist

df = pd.read_csv('heart_disease_preprocessed.csv')
df.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,sex_Female,sex_Male,cp_Asymptomatic,cp_AtypicalAngina,...,restecg_STTAbnormality,exang_NoExAngina,exang_YesExAngina,slope_Downsloping,slope_Flat,slope_Upsloping,thal_FixedDefect,thal_Normal,thal_ReversibleDefect,heartdiseasepresence
0,0.950883,0.743598,-0.289108,0.040935,1.180495,-0.740979,0,1,0,0,...,0,1,0,1,0,0,1,0,0,0
1,1.397584,1.593663,0.78534,-1.757678,0.647625,2.527338,0,1,1,0,...,0,0,1,0,1,0,0,1,0,1
2,1.397584,-0.673176,-0.370199,-0.858371,1.3475,1.437899,0,1,1,0,...,0,0,1,0,1,0,0,0,1,1
3,-1.952676,-0.106466,0.055526,1.625427,1.77579,-0.740979,0,1,0,0,...,0,1,0,1,0,0,0,1,0,0
4,-1.505975,-0.106466,-0.877014,0.983065,0.569273,-0.740979,1,0,0,1,...,0,1,0,0,0,1,0,1,0,0


In [2]:
mediators = ['thalach', 'oldpeak', 'slope_Downsloping', 'slope_Flat', 'slope_Upsloping', 'cp_Asymptomatic', 'cp_AtypicalAngina', 'cp_NonAnginalPain', 'cp_TypicalAngina']

effect = 'heartdiseasepresence'  # binary label; the variable the resampling intervenes on, do(Y = y_star)

causes = [c for c in df.columns if c not in mediators + [effect]]

print('effect variable:', effect)
print('front-door/mediator set:', mediators)
print('cause variables:', causes)

effect variable: heartdiseasepresence
front-door/mediator set: ['thalach', 'oldpeak', 'slope_Downsloping', 'slope_Flat', 'slope_Upsloping', 'cp_Asymptomatic', 'cp_AtypicalAngina', 'cp_NonAnginalPain', 'cp_TypicalAngina']
cause variables: ['age', 'trestbps', 'chol', 'ca', 'sex_Female', 'sex_Male', 'fbs_<=120', 'fbs_>120', 'restecg_LVHypertrophy', 'restecg_NormalECG', 'restecg_STTAbnormality', 'exang_NoExAngina', 'exang_YesExAngina', 'thal_FixedDefect', 'thal_Normal', 'thal_ReversibleDefect']


**Important functions**

- gaussian_kernel_matrix computes a Gaussian kernel between row vectors (in this notebook, the mediator and effect values of each row) to measure how similar two rows are; these similarities set the probability that a given row appears in the resampled dataset
- ensure_2d converts the selected dataframe columns to a 2D NumPy array, which is the input shape the Gaussian kernel function expects

In [3]:
def gaussian_kernel_matrix(A, B=None, bandwidth=1.0):
    """Return Gaussian kernel matrix K_ij = exp(-0.5 * ||a_i - b_j||^2 / h^2)."""
    if B is None:
        B = A
    dists = cdist(A, B, metric='euclidean')
    K = np.exp(-0.5 * (dists / bandwidth) ** 2)
    return K

def ensure_2d(df, cols):
    return df[cols].to_numpy(dtype=float).reshape(len(df), -1)  # shape (n_rows, n_cols)
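
How a Gaussian kernel turns distances into sampling probabilities can be seen on three toy points. This is a sketch with made-up values; it uses plain NumPy broadcasting instead of cdist, and `kernel_weights` and `toy_points` are hypothetical names that are not part of the notebook.

```python
import numpy as np


def kernel_weights(query, points, bandwidth=1.0):
    """Gaussian kernel of one query against each row of points, normalised to sum to 1."""
    sq_dists = ((points - query) ** 2).sum(axis=1)
    k = np.exp(-0.5 * sq_dists / bandwidth ** 2)
    return k / k.sum()


toy_points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
query = np.array([0.0, 0.0])

wide = kernel_weights(query, toy_points, bandwidth=2.0)    # far points keep some weight
narrow = kernel_weights(query, toy_points, bandwidth=0.5)  # weight piles onto the nearest point
print(wide.round(3), narrow.round(3))
```

A smaller bandwidth makes the resampling behave more like picking the nearest row, while a larger one spreads probability over more distant rows.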

In [4]:
N = len(df)  # number of samples
y = df[effect].to_numpy()  # target column as a NumPy array
unique_y = np.unique(y)  # distinct values of y; one resampled dataset is built per value

In [5]:
mediator_matrix = ensure_2d(df, mediators) # mediator columns as 2D array
fd_dfs = [] # this will hold multiple dataframes
bandwidth_Z = 1.0 # bandwidth for kernel on (Z, Y)
mediator_effect_matrix = np.hstack([mediator_matrix, y.reshape(-1, 1)])

print(mediator_matrix.shape) # this is basically a subset of the df with the mediators in the form of a 2d array
print(mediator_effect_matrix.shape) # this is a subset of the df with the mediators and effect variable in the form of a 2d array

(272, 9)
(272, 10)


In [6]:
kernel_mediators_effect = gaussian_kernel_matrix(mediator_effect_matrix, bandwidth=bandwidth_Z)
kernel_mediators_effect # this essentially gives a matrix of similarity scores between all rows based on the mediators and effect variable

array([[1.        , 0.01412966, 0.05402431, ..., 0.06543395, 0.01972073,
        0.00350431],
       [0.01412966, 1.        , 0.52241715, ..., 0.20449579, 0.92771519,
        0.00144345],
       [0.05402431, 0.52241715, 1.        , ..., 0.81365915, 0.53448351,
        0.00279739],
       ...,
       [0.06543395, 0.20449579, 0.81365915, ..., 1.        , 0.22205593,
        0.00237703],
       [0.01972073, 0.92771519, 0.53448351, ..., 0.22205593, 1.        ,
        0.00481037],
       [0.00350431, 0.00144345, 0.00279739, ..., 0.00237703, 0.00481037,
        1.        ]])
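
The structure visible in the printed matrix (ones on the diagonal, mirrored entries, values between 0 and 1) follows from the kernel definition. A minimal check on made-up data, assuming a bandwidth of 1.0 and using broadcasting rather than cdist:

```python
import numpy as np

rng = np.random.default_rng(1)
toy = rng.normal(size=(5, 3))  # 5 made-up rows with 3 features each

# Pairwise squared Euclidean distances via broadcasting, then the Gaussian kernel.
diff = toy[:, None, :] - toy[None, :, :]
K = np.exp(-0.5 * (diff ** 2).sum(axis=2) / 1.0 ** 2)

is_symmetric = np.allclose(K, K.T)           # similarity of (i, j) equals that of (j, i)
has_unit_diag = np.allclose(np.diag(K), 1.0) # every row is identical to itself
in_unit_range = bool(K.min() > 0 and K.max() <= 1.0)
print(is_symmetric, has_unit_diag, in_unit_range)
```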

In [7]:
rng = np.random.default_rng(0)
for y_star in unique_y:
    # 1) Draw Z from p(z | do(y_star)) = p(z | y_star): the rows where y == y_star provide the Z values.
    mask_y_star = (y == y_star)  # boolean mask where y equals y_star
    if mask_y_star.sum() == 0:
        continue
    mediator_matrix_unique_effect = mediator_matrix[mask_y_star]
    # To keep things simple, resample N of those Z rows uniformly with replacement,
    # so each resampled dataset has the same size as the original one.
    random_row_indexes = rng.choice(np.arange(mediator_matrix_unique_effect.shape[0]), size=N, replace=True)
    Z_draws = mediator_matrix_unique_effect[random_row_indexes]

    # 2) For each drawn (z, y_star), weight the original rows by the kernel on (Z, Y) and draw one row.
    rows = []
    for z_val in Z_draws:
        # Build the query (z_val, y_star) as a 1 x d array.
        query = np.hstack([z_val, [y_star]])[None, :]
        # Kernel similarity between the query and every original row.
        K_q = gaussian_kernel_matrix(query, mediator_effect_matrix, bandwidth=bandwidth_Z).ravel()
        # Convert the similarities to sampling probabilities.
        p = K_q / np.maximum(K_q.sum(), 1e-8)
        # Pick one original row index according to these probabilities.
        idx = rng.choice(np.arange(N), p=p)
        row = df.iloc[[idx]].copy()
        # Record the intervention; in the interventional world, Y is set to y_star.
        row['do_' + effect] = y_star
        row[effect] = y_star
        rows.append(row)
    fd_df = pd.concat(rows, ignore_index=True)
    fd_dfs.append(fd_df)
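
For comparison, the front-door adjustment formula p(x | do(y_star)) = sum_z p(z | y_star) sum_y' p(x | z, y') p(y') averages over values y' drawn from the marginal of Y, whereas the loop above builds every query with y_star itself. The sketch below shows what that marginalised variant could look like. It is an assumption-laden illustration, not this notebook's method: `frontdoor_draw_indices` is a hypothetical helper, the data are made up, and because Y is discrete it conditions on y' exactly (by filtering rows) and applies the kernel to Z only.

```python
import numpy as np


def frontdoor_draw_indices(Z, y, y_star, n_draws, bandwidth=1.0, rng=None):
    """Hypothetical helper: row indices drawn following the front-door formula.

    For each draw: z ~ p(z | y_star), y_prime ~ p(y), then one row among those
    with y == y_prime, weighted by a Gaussian kernel on Z centred at z.
    """
    rng = np.random.default_rng() if rng is None else rng
    pool = np.flatnonzero(y == y_star)
    out = np.empty(n_draws, dtype=int)
    for i in range(n_draws):
        z = Z[rng.choice(pool)]              # z ~ p(z | y_star)
        y_prime = y[rng.integers(len(y))]    # y' ~ p(y), the empirical marginal
        cand = np.flatnonzero(y == y_prime)  # rows that share the drawn y'
        k = np.exp(-0.5 * ((Z[cand] - z) ** 2).sum(axis=1) / bandwidth ** 2)
        total = k.sum()
        p = k / total if total > 0 else None  # uniform fallback if every weight underflows
        out[i] = rng.choice(cand, p=p)
    return out


# Made-up data: 200 rows, binary label, two mediator columns shifted by the label.
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
Z_toy = rng.normal(size=(200, 2)) + y_toy[:, None]

idx = frontdoor_draw_indices(Z_toy, y_toy, y_star=1, n_draws=50, rng=rng)
print(idx[:10])
```

The returned indices can be used to select rows from the original data, after which the label column would be overwritten with y_star as in the loop above.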

In [8]:
if fd_dfs:
    df_frontdoor = pd.concat(fd_dfs, ignore_index=True)
else:
    df_frontdoor = pd.DataFrame()

df_frontdoor

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,sex_Female,sex_Male,cp_Asymptomatic,cp_AtypicalAngina,...,exang_NoExAngina,exang_YesExAngina,slope_Downsloping,slope_Flat,slope_Upsloping,thal_FixedDefect,thal_Normal,thal_ReversibleDefect,heartdiseasepresence,do_heartdiseasepresence
0,0.839207,-0.219808,-0.795924,-0.387306,-1.111053,-0.740979,0,1,0,1,...,1,0,0,0,1,0,1,0,0,0
1,-1.059273,0.346901,-0.228291,0.126583,-0.761115,-0.740979,1,0,1,0,...,0,1,0,1,0,0,1,0,0,0
2,0.392506,-0.106466,-1.018922,-0.772723,-0.208954,-0.740979,1,0,1,0,...,1,0,0,1,0,0,1,0,0,0
3,-2.176026,0.346901,-1.302739,1.411306,0.569273,-0.740979,1,0,1,0,...,1,0,0,0,1,0,1,0,0,0
4,-1.394299,-1.693254,0.359615,-1.158140,-0.208954,-0.740979,1,0,1,0,...,1,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
539,1.174233,0.176888,0.136617,-0.944020,1.451274,0.348460,0,1,1,0,...,1,0,0,1,0,0,0,1,1,1
540,1.509259,0.686927,-1.100013,-0.344482,1.732656,1.437899,0,1,1,0,...,1,0,0,1,0,0,0,1,1,1
541,0.169155,-0.673176,-0.228291,1.240010,0.017112,-0.740979,0,1,0,1,...,1,0,0,0,1,0,1,0,1,1
542,0.727532,-0.673176,0.258252,-0.387306,1.817975,0.348460,0,1,1,0,...,0,1,0,1,0,0,0,1,1,1


In [9]:
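# Keep half of the rows of each class so the result has N rows again, the size of the original dataset.
df_frontdoor = df_frontdoor.groupby("heartdiseasepresence", group_keys=False).sample(frac=0.5, random_state=42)
print(df_frontdoor["heartdiseasepresence"].value_counts())
# The intervention marker duplicates the label column, so drop it before saving.
df_frontdoor = df_frontdoor.drop(columns=["do_heartdiseasepresence"])
df_frontdoor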

0    136
1    136
Name: heartdiseasepresence, dtype: int64


Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,sex_Female,sex_Male,cp_Asymptomatic,cp_AtypicalAngina,...,restecg_STTAbnormality,exang_NoExAngina,exang_YesExAngina,slope_Downsloping,slope_Flat,slope_Upsloping,thal_FixedDefect,thal_Normal,thal_ReversibleDefect,heartdiseasepresence
30,1.844285,-1.239886,0.359615,-0.815547,-1.111053,0.348460,1,0,0,0,...,0,1,0,0,0,1,0,1,0,0
116,-1.505975,-0.673176,-1.829827,1.411306,-1.111053,-0.740979,0,1,0,1,...,0,1,0,0,0,1,0,1,0,0
79,0.839207,-0.446492,-0.775651,0.597648,-1.111053,-0.740979,1,0,1,0,...,0,1,0,0,0,1,0,1,0,0
127,-0.277546,-0.219808,0.156889,0.512000,-1.111053,0.348460,0,1,1,0,...,0,0,1,0,0,1,0,0,1,0
196,1.509259,-0.786518,0.602887,0.083759,0.219335,0.348460,0,1,0,0,...,0,1,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
379,1.174233,0.346901,0.704250,1.068713,0.569273,0.348460,0,1,0,0,...,0,1,0,0,1,0,0,1,0,1
433,-0.054195,-0.389821,0.521796,0.126583,-0.332826,0.348460,0,1,0,0,...,0,1,0,1,0,0,0,1,0,1
324,0.057480,-0.106466,0.298797,0.255055,-1.111053,-0.740979,0,1,0,1,...,0,1,0,0,0,1,0,1,0,1
427,-0.054195,-1.239886,-0.167473,-0.986844,1.451274,0.348460,0,1,1,0,...,0,0,1,0,1,0,0,0,1,1


In [10]:
df_frontdoor.to_csv('heart_disease_preprocessed_frontdoor.csv', header=True, index=False)