## What is a FrontDoor Causal Bootstrap Algorithm?

- It is an algorithm that resamples a dataset into a new dataset using the front-door criterion.
- The motivation is that a problem rarely has a single confounder. It usually has several, and some of them may not be observable.
- When confounders are unobserved, we instead look at variables that sit on the causal path X -> Z -> Y, known as **mediators**.
- The front-door criterion applies when an observed mediator Z intercepts every directed path from X to Y, there is no unblocked confounding between X and Z, and every back-door path from Z to Y is blocked by X.
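
For reference, when these conditions hold the interventional distribution can be written purely in terms of observed quantities. This is the standard front-door adjustment formula, stated here for discrete variables; the symbol x' (a summation variable ranging over the values of X) is notation introduced for this block:

```latex
P(y \mid do(x)) = \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\, P(x')
```

The outer sum describes how X moves the mediator, and the inner sum describes how the mediator moves Y once X is averaged out.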

**Example**
- Imagine you want to assess whether studying more hours (the “cause” X) leads to improved exam performance (the “effect” Y). 
- However, there are many confounders, such as "prior knowledge" and "innate ability", which may not be observed, so back-door causal bootstrapping isn't possible.
- Suppose, though, that we have an observed variable: "number of practice problems done".
- We know that people who study more do more practice problems, and people who do more practice problems get better grades.
- If we know there are no unobserved variables that influence both studying more and doing more practice problems,
- and we also know there are no unobserved variables, other than study hours itself, that influence both doing more practice problems and doing better in exams,
- then we can apply front-door causal bootstrapping to estimate the causal effect of study hours on exam performance.
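
The example above can be checked on simulated data. The sketch below is not part of the heart-disease analysis. Every variable name and probability in it is an illustrative assumption, chosen so that an unobserved `ability` confounds `study` and `grade`, while `practice` is the only route from `study` to `grade`. Under these assumptions the true effect is 0.4 by construction.

```python
import numpy as np

sim_rng = np.random.default_rng(0)
n_sim = 200_000

# Hypothetical data-generating process (all probabilities are assumptions).
ability = sim_rng.random(n_sim) < 0.5                          # unobserved confounder
study = sim_rng.random(n_sim) < np.where(ability, 0.8, 0.2)    # X: studies many hours
practice = sim_rng.random(n_sim) < np.where(study, 0.9, 0.1)   # Z: many practice problems
grade = sim_rng.random(n_sim) < 0.1 + 0.5 * practice + 0.3 * ability  # Y: good result

def front_door(x, z, y, x_val):
    """Estimate P(y=1 | do(X=x_val)) from the observed x, z, y only."""
    total = 0.0
    for z_val in (False, True):
        p_z_given_x = np.mean(z[x == x_val] == z_val)
        # Average P(y | x', z) over the marginal distribution of x'.
        inner = sum(
            np.mean(y[(x == xp) & (z == z_val)]) * np.mean(x == xp)
            for xp in (False, True)
        )
        total += p_z_given_x * inner
    return total

naive_effect = grade[study].mean() - grade[~study].mean()
fd_effect = front_door(study, practice, grade, True) - front_door(study, practice, grade, False)
true_effect = 0.5 * (0.9 - 0.1)  # X changes P(Z) by 0.8, Z changes P(Y) by 0.5

print(f"naive difference : {naive_effect:.3f}")
print(f"front-door       : {fd_effect:.3f}")
print(f"true effect      : {true_effect:.3f}")
```

The naive difference is biased upward because `ability` raises both studying and grades. The front-door estimate never uses `ability`, yet it should land close to the true value.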

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from scipy.spatial.distance import cdist

df = pd.read_csv('heart_disease_preprocessed.csv')
df.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,sex_Female,sex_Male,cp_Asymptomatic,cp_AtypicalAngina,...,restecg_STTAbnormality,exang_NoExAngina,exang_YesExAngina,slope_Downsloping,slope_Flat,slope_Upsloping,thal_FixedDefect,thal_Normal,thal_ReversibleDefect,heartdiseasepresence
0,0.950883,0.743598,-0.289108,0.040935,1.180495,-0.740979,0,1,0,0,...,0,1,0,1,0,0,1,0,0,0
1,1.397584,1.593663,0.78534,-1.757678,0.647625,2.527338,0,1,1,0,...,0,0,1,0,1,0,0,1,0,1
2,1.397584,-0.673176,-0.370199,-0.858371,1.3475,1.437899,0,1,1,0,...,0,0,1,0,1,0,0,0,1,1
3,-1.952676,-0.106466,0.055526,1.625427,1.77579,-0.740979,0,1,0,0,...,0,1,0,1,0,0,0,1,0,0
4,-1.505975,-0.106466,-0.877014,0.983065,0.569273,-0.740979,1,0,0,1,...,0,1,0,0,0,1,0,1,0,0


In [2]:
# === DAG-aligned variable sets ===
effect = 'heartdiseasepresence'  # binary label column; the resampling loop intervenes on it (do_<effect>)
mediators = ['ca']

# observed confounders that drive both mediator and outcome; keep if present
confounders = [c for c in ['age', 'sex_Female', 'sex_Male'] if c in df.columns]

# optional: descendants of mediator to exclude from X so we don't leak effects of Z back into causes
mediator_descendants = []  # fill with downstream-of-ca feature names if known

excluded = set(mediators + [effect] + mediator_descendants)
causes = [c for c in df.columns if c not in excluded]

print('effect variable:', effect)
print('mediator set:', mediators)
print('confounders (used to block backdoors Z->Y):', confounders)
print('cause variables (excluding mediator descendants):', causes)


effect variable: heartdiseasepresence
front-door/mediator set: ['ca']
cause variables: ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'sex_Female', 'sex_Male', 'cp_Asymptomatic', 'cp_AtypicalAngina', 'cp_NonAnginalPain', 'cp_TypicalAngina', 'fbs_<=120', 'fbs_>120', 'restecg_LVHypertrophy', 'restecg_NormalECG', 'restecg_STTAbnormality', 'exang_NoExAngina', 'exang_YesExAngina', 'slope_Downsloping', 'slope_Flat', 'slope_Upsloping', 'thal_FixedDefect', 'thal_Normal', 'thal_ReversibleDefect']


**Important functions**

- gaussian_kernel_matrix is a kernel function applied to row vectors built from the dataframe (here the confounder, mediator, and effect values) to measure how similar two rows are. The similarity determines the probability that a given row is drawn into the resampled dataset.
- ensure_2d converts the selected dataframe columns to a 2D NumPy array, so that operations such as the Gaussian kernel above can be applied to them.

In [3]:
def gaussian_kernel_matrix(A, B=None, bandwidth=1.0):
    """Return Gaussian kernel matrix K_ij = exp(-0.5 * ||a_i - b_j||^2 / h^2)."""
    if B is None:
        B = A
    dists = cdist(A, B, metric='euclidean')
    K = np.exp(-0.5 * (dists / bandwidth) ** 2)
    return K

def ensure_2d(df, cols):
    """Return df[cols] as a float NumPy array of shape (n_rows, n_cols)."""
    return df[cols].to_numpy(dtype=float).reshape(len(df), -1)
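
To get a feel for what the kernel returns, here is a small sketch on a three-row toy frame. The helper `toy_gaussian_kernel` and the toy values are assumptions made for illustration. The helper uses the same formula as gaussian_kernel_matrix but is written with NumPy broadcasting, so the block runs on its own.

```python
import numpy as np
import pandas as pd

def toy_gaussian_kernel(A, B=None, bandwidth=1.0):
    """K_ij = exp(-0.5 * ||a_i - b_j||^2 / h^2), via NumPy broadcasting."""
    B = A if B is None else B
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / bandwidth ** 2)

# Hypothetical toy data: rows 0 and 1 are close, row 2 is far away.
toy = pd.DataFrame({'age': [0.0, 1.0, 3.0], 'ca': [0.0, 0.0, 0.0]})

# Same conversion that ensure_2d performs: selected columns -> (n_rows, n_cols) floats.
toy_matrix = toy[['age', 'ca']].to_numpy(dtype=float).reshape(len(toy), -1)

K_narrow = toy_gaussian_kernel(toy_matrix, bandwidth=1.0)
K_wide = toy_gaussian_kernel(toy_matrix, bandwidth=3.0)

print(np.round(K_narrow, 3))
print(np.round(K_wide, 3))
```

With bandwidth 1.0, rows 0 and 1 (distance 1) get similarity exp(-0.5), roughly 0.607, while rows 0 and 2 (distance 3) are nearly dissimilar. A larger bandwidth flattens these differences, so distant rows become more likely to be drawn.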

In [4]:
N = len(df)  # number of samples
y = df[effect].to_numpy()  # target column as a NumPy array
unique_y = np.unique(y)  # distinct label values to intervene on

In [5]:
mediator_matrix = ensure_2d(df, mediators)
confounder_matrix = ensure_2d(df, confounders) if confounders else np.zeros((len(df), 0))
condition_matrix = np.hstack([confounder_matrix, mediator_matrix, y.reshape(-1, 1)])

fd_dfs = []  # this will hold multiple dataframes
bandwidth_condition = 1.0  # bandwidth for kernel on (confounders, mediator, effect)

print(mediator_matrix.shape)  # mediator values
print(condition_matrix.shape)  # confounders + mediator + effect


(272, 1)
(272, 2)


In [6]:
condition_kernel = gaussian_kernel_matrix(condition_matrix, bandwidth=bandwidth_condition)
condition_kernel  # similarity across (confounders, mediator, effect)


array([[1.        , 0.00290609, 0.05648644, ..., 0.05648644, 0.33506234,
        0.33506234],
       [0.00290609, 1.        , 0.55242441, ..., 0.55242441, 0.09313039,
        0.09313039],
       [0.05648644, 0.55242441, 1.        , ..., 1.        , 0.55242441,
        0.55242441],
       ...,
       [0.05648644, 0.55242441, 1.        , ..., 1.        , 0.55242441,
        0.55242441],
       [0.33506234, 0.09313039, 0.55242441, ..., 0.55242441, 1.        ,
        1.        ],
       [0.33506234, 0.09313039, 0.55242441, ..., 0.55242441, 1.        ,
        1.        ]])
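
Each row of a kernel matrix like this one becomes a sampling distribution once it is normalised to sum to 1. The sketch below shows that step on a hand-written 3x3 kernel. The values and the `toy_` names are illustrative assumptions, not taken from the dataset.

```python
import numpy as np

# Hypothetical similarity matrix for three rows.
toy_kernel = np.array([
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.5],
    [0.1, 0.5, 1.0],
])

# Normalise each row; row i then holds P(pick row j | query is row i).
# The floor on the denominator guards against a row of all-zero similarities.
toy_probs = toy_kernel / np.maximum(toy_kernel.sum(axis=1, keepdims=True), 1e-8)

toy_rng = np.random.default_rng(0)
picks = toy_rng.choice(3, size=10_000, p=toy_probs[0])
freq = np.bincount(picks, minlength=3) / picks.size

print("probabilities for query row 0:", toy_probs[0])
print("empirical pick frequencies   :", np.round(freq, 3))
```

For query row 0 the probabilities are 1/1.6, 0.5/1.6, and 0.1/1.6, so the row itself and its near neighbours dominate the draws.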

In [7]:
rng = np.random.default_rng(0)
fd_dfs = []  # reset so re-running this cell does not append duplicate frames
for y_star in unique_y:
    mask_y_star = (y == y_star)  # boolean mask where y equals y_star
    if mask_y_star.sum() == 0:
        continue
    # draw (confounders, mediator, effect) combos from rows with y == y_star
    pool = condition_matrix[mask_y_star]
    draw_idx = rng.choice(np.arange(pool.shape[0]), size=N, replace=True)
    draws = pool[draw_idx]

    rows = []
    for draw in draws:
        conf_part = draw[:len(confounders)] if confounders else np.array([])
        z_part = draw[len(conf_part):-1] if mediators else np.array([])
        query = np.hstack([conf_part, z_part, [y_star]])[None, :]
        # kernel similarity to all original rows on (confounders, mediator, effect)
        K_q = gaussian_kernel_matrix(query, B=condition_matrix, bandwidth=bandwidth_condition).ravel()
        p = K_q / np.maximum(K_q.sum(), 1e-8)  # normalise to sampling probabilities
        idx = rng.choice(np.arange(N), p=p)  # pick one original row by similarity
        row = df.iloc[[idx]].copy()
        # add intervention info (label-level intervention)
        row['do_' + effect] = y_star
        row[effect] = y_star
        rows.append(row)
    fd_df = pd.concat(rows, ignore_index=True)
    fd_dfs.append(fd_df)
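
The per-row loop above builds one single-row DataFrame per draw, which gets slow as N grows. A possible vectorised variant is sketched below under stated assumptions. `kernel_resample_indices` is a hypothetical helper, the data is synthetic, and because it consumes random numbers in a different order it will not reproduce the exact rows drawn by the loop. It returns row indices only, so the frame could then be sliced once with `df.iloc[...]`.

```python
import numpy as np

def kernel_resample_indices(condition, labels, rng, bandwidth=1.0):
    """For each label value, draw len(labels) row indices via kernel weights."""
    n = len(labels)
    out = {}
    for label in np.unique(labels):
        pool = condition[labels == label]
        # One query per output row, drawn uniformly from rows with this label.
        queries = pool[rng.integers(0, len(pool), size=n)]
        # Squared distances from every query to every original row: (n, n).
        sq = ((queries[:, None, :] - condition[None, :, :]) ** 2).sum(axis=-1)
        weights = np.exp(-0.5 * sq / bandwidth ** 2)
        probs = weights / np.maximum(weights.sum(axis=1, keepdims=True), 1e-8)
        # Inverse-CDF sampling: one uniform draw per query row.
        cdf = np.cumsum(probs, axis=1)
        u = rng.random((n, 1))
        idx = (u > cdf).sum(axis=1)
        out[label.item()] = np.minimum(idx, n - 1)  # guard against round-off
    return out

# Synthetic stand-in for condition_matrix: one feature plus the label column.
demo_rng = np.random.default_rng(0)
demo_labels = np.repeat([0, 1], 50)
demo_feature = demo_rng.normal(loc=3.0 * demo_labels, scale=1.0)
demo_condition = np.column_stack([demo_feature, demo_labels])

demo_idx = kernel_resample_indices(demo_condition, demo_labels, demo_rng, bandwidth=0.5)
for label, idx in demo_idx.items():
    same = np.mean(demo_labels[idx] == label)
    print(f"label {label}: {len(idx)} draws, {same:.0%} from rows with the same label")
```

The intermediate distance array has shape (N, N, number of condition columns), which is fine for a few hundred rows but would need chunking for much larger datasets.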


In [8]:
if fd_dfs:
    df_frontdoor = pd.concat(fd_dfs, ignore_index=True)
else:
    df_frontdoor = pd.DataFrame()

df_frontdoor

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,sex_Female,sex_Male,cp_Asymptomatic,cp_AtypicalAngina,...,exang_NoExAngina,exang_YesExAngina,slope_Downsloping,slope_Flat,slope_Upsloping,thal_FixedDefect,thal_Normal,thal_ReversibleDefect,heartdiseasepresence,do_heartdiseasepresence
0,-0.612572,-0.106466,0.440706,0.597648,-1.111053,-0.740979,1,0,1,0,...,1,0,0,0,1,0,1,0,0,0
1,-0.389221,0.460243,1.028612,-1.158140,2.053291,2.527338,0,1,1,0,...,0,1,0,1,0,0,0,1,0,0
2,0.169155,-0.673176,-0.147200,0.854593,-1.111053,-0.740979,0,1,0,1,...,1,0,1,0,0,0,1,0,0,0
3,0.727532,-0.673176,0.258252,-0.387306,1.817975,0.348460,0,1,1,0,...,0,1,0,1,0,0,0,1,0,0
4,-0.947598,-1.749925,-1.018922,0.297879,-1.111053,-0.740979,0,1,0,1,...,1,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
539,0.615857,0.460243,0.927249,0.897417,0.402268,1.437899,0,1,1,0,...,1,0,0,1,0,0,0,1,1,1
540,0.280831,-0.106466,-2.356915,-1.457909,0.402268,0.348460,0,1,1,0,...,0,1,0,1,0,0,0,1,1,1
541,0.169155,-0.673176,-0.228291,1.240010,0.017112,-0.740979,0,1,0,1,...,1,0,0,0,1,0,1,0,1,1
542,0.727532,0.743598,1.211065,-0.130362,0.219335,-0.740979,1,0,1,0,...,0,1,0,1,0,0,0,1,1,1


In [9]:
# keep half of each intervened class: 2 classes x N/2 rows brings the frame back to N rows
df_frontdoor = df_frontdoor.groupby(effect, group_keys=False).sample(frac=0.5, random_state=42)
print(df_frontdoor[effect].value_counts())
# the do_ column duplicates the label, so drop it before saving
df_frontdoor = df_frontdoor.drop(columns=['do_' + effect])
df_frontdoor

0    136
1    136
Name: heartdiseasepresence, dtype: int64


Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,sex_Female,sex_Male,cp_Asymptomatic,cp_AtypicalAngina,...,restecg_STTAbnormality,exang_NoExAngina,exang_YesExAngina,slope_Downsloping,slope_Flat,slope_Upsloping,thal_FixedDefect,thal_Normal,thal_ReversibleDefect,heartdiseasepresence
30,-1.059273,-1.579912,-0.795924,-0.044713,1.549723,-0.740979,0,1,1,0,...,0,0,1,0,1,0,0,1,0,0
116,-1.059273,-1.239886,0.339343,-0.729899,0.402268,-0.740979,0,1,0,0,...,0,1,0,0,1,0,0,0,1,0
79,-2.176026,0.346901,-1.302739,1.411306,0.569273,-0.740979,1,0,1,0,...,0,1,0,0,0,1,0,1,0,0
127,-0.054195,1.026953,-0.309381,0.683296,0.722903,-0.740979,0,1,0,0,...,0,1,0,0,0,1,0,0,1,0
196,0.057480,1.593663,0.846158,-0.173186,0.017112,0.348460,0,1,1,0,...,0,0,1,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
379,0.839207,-0.673176,0.683977,-1.971798,0.569273,0.348460,0,1,0,1,...,0,1,0,0,1,0,0,0,1,1
433,1.397584,-0.389821,0.136617,0.597648,-0.761115,1.437899,0,1,1,0,...,0,1,0,0,1,0,0,0,1,1
324,0.950883,-1.353228,0.440706,0.854593,0.865141,1.437899,1,0,1,0,...,0,0,1,0,1,0,0,1,0,1
427,-0.054195,-1.239886,-0.167473,-0.986844,1.451274,0.348460,0,1,1,0,...,0,0,1,0,1,0,0,0,1,1


In [10]:
df_frontdoor.to_csv('heart_disease_preprocessed_frontdoor.csv', header=True, index=False)