## What is a Backdoor Causal Bootstrap Algorithm?

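- It is an algorithm that resamples a dataset, using weights, to create a new dataset.
- The weights are based on the back-door criterion. A set of variables Z satisfies the criterion relative to a 'cause' X and an 'effect' Y if no variable in Z is a descendant of X and Z blocks every back-door path between X and Y, i.e. every path between them that starts with an arrow pointing into X.
- Common causes of X and Y that open such paths are known as **confounders**.
- e.g. alongside the direct path X -> Y there is the back-door path X <- Z -> Y.
- The goal is that ML models trained on the resampled data learn causal relationships rather than just correlations.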
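
For reference, when a set Z satisfies the back-door criterion, the standard back-door adjustment formula expresses the effect of intervening on X using only observable quantities. The notation below is the usual textbook notation and is not defined elsewhere in this notebook:

```latex
P\bigl(Y = y \mid \operatorname{do}(X = x)\bigr)
  = \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z)
```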

**Example**
- Imagine you want to assess whether drinking coffee (the “cause” X) leads to improved concentration at work (the “effect” Y). 
- But there’s a confounder: people who are sleep-deprived tend to drink more coffee and also tend to have worse concentration. 
- If you simply compare coffee drinkers vs non-drinkers, the effect of sleep-deprivation might bias your result. The back-door approach says: to get at the true effect of coffee on concentration, you need to adjust for (control for) the variable “sleep-deprivation” (call it Z) so that the path coffee ← sleep-deprivation → concentration is blocked. 
- After you control for how much sleep-deprivation someone has, you can compare coffee drinkers and non-drinkers on a “level playing field” of sleep-deprivation and infer more reliably how coffee itself affects concentration.
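
To make this concrete, here is a small simulation sketch. Everything in it is invented for illustration: the true coffee effect of +1.0, the sleep-deprivation effect of -3.0, the probabilities of drinking coffee, and the variable names `sleep_deprived`, `coffee` and `concentration`. None of it comes from the heart-disease data used below.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Z: sleep-deprived (1) or not (0)
sleep_deprived = rng.binomial(1, 0.5, size=n)
# X: sleep-deprived people drink coffee more often (assumed 80% vs 20%)
coffee = rng.binomial(1, np.where(sleep_deprived == 1, 0.8, 0.2))
# Y: assumed true effect of coffee is +1.0, of sleep deprivation -3.0
concentration = 1.0 * coffee - 3.0 * sleep_deprived + rng.normal(0.0, 1.0, size=n)

# Naive comparison ignores Z, so it is biased by the back-door path
naive_effect = concentration[coffee == 1].mean() - concentration[coffee == 0].mean()

# Back-door adjustment: compare within each level of Z, then average over P(Z)
adjusted_effect = 0.0
for z in (0, 1):
    in_stratum = sleep_deprived == z
    diff = (concentration[in_stratum & (coffee == 1)].mean()
            - concentration[in_stratum & (coffee == 0)].mean())
    adjusted_effect += diff * in_stratum.mean()

print(f"naive: {naive_effect:.2f}, adjusted: {adjusted_effect:.2f}")
```

Under these assumptions the naive difference comes out negative, so coffee looks harmful, while the adjusted estimate lands close to the simulated effect of +1.0.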

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from scipy.spatial.distance import cdist

df = pd.read_csv('heart_disease_preprocessed.csv')
df.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,sex_Female,sex_Male,cp_Asymptomatic,cp_AtypicalAngina,...,restecg_STTAbnormality,exang_NoExAngina,exang_YesExAngina,slope_Downsloping,slope_Flat,slope_Upsloping,thal_FixedDefect,thal_Normal,thal_ReversibleDefect,heartdiseasepresence
0,0.950883,0.743598,-0.289108,0.040935,1.180495,-0.740979,0,1,0,0,...,0,1,0,1,0,0,1,0,0,0
1,1.397584,1.593663,0.78534,-1.757678,0.647625,2.527338,0,1,1,0,...,0,0,1,0,1,0,0,1,0,1
2,1.397584,-0.673176,-0.370199,-0.858371,1.3475,1.437899,0,1,1,0,...,0,0,1,0,1,0,0,0,1,1
3,-1.952676,-0.106466,0.055526,1.625427,1.77579,-0.740979,0,1,0,0,...,0,1,0,1,0,0,0,1,0,0
4,-1.505975,-0.106466,-0.877014,0.983065,0.569273,-0.740979,1,0,0,1,...,0,1,0,0,0,1,0,1,0,0


**Below I identify some potential confounders**
- My initial thought is that **age** and **sex** could be confounders.
- Age, because getting older could increase the risk of high blood pressure and high cholesterol.
- Sex, because baseline levels of certain measured values can differ between the sexes.

In [None]:
# === EDIT THESE IF NEEDED ===
# Binary column that this notebook calls the effect; it is also the column
# that the resampling step intervenes on (see the 'do_' column in cell 9)
effect = 'heartdiseasepresence'

# plausible confounders, kept for reference (raw column names, before one-hot encoding)
# candidate_S = [
#     'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
#     'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'
# ]

# confounder set actually used; 'sex' appears as its two one-hot columns
confounders = [
    'age', 'sex_Female', 'sex_Male'
]

# every remaining column is treated as a cause variable
causes = [c for c in df.columns if c not in confounders + [effect]]

print('Y (effect variable):', effect)
print('S (back-door/confounder set):', confounders)
print('X (cause variable):', causes)

Y (effect variable): heartdiseasepresence
S (back-door/confounder set): ['age', 'sex_Female', 'sex_Male']
X (cause variable): ['trestbps', 'chol', 'thalach', 'oldpeak', 'ca', 'cp_Asymptomatic', 'cp_AtypicalAngina', 'cp_NonAnginalPain', 'cp_TypicalAngina', 'fbs_<=120', 'fbs_>120', 'restecg_LVHypertrophy', 'restecg_NormalECG', 'restecg_STTAbnormality', 'exang_NoExAngina', 'exang_YesExAngina', 'slope_Downsloping', 'slope_Flat', 'slope_Upsloping', 'thal_FixedDefect', 'thal_Normal', 'thal_ReversibleDefect']


**Important functions**

- `gaussian_kernel_matrix` applies a Gaussian kernel to the vectors of confounder values taken from the dataframe. It measures how similar each pair of rows is, and that similarity influences the probability of a row occurring in the resampled dataset.
- `ensure_2d` converts the selected dataframe columns into a 2D NumPy array, which is the input format the Gaussian kernel function needs.
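
As a small illustration of what the kernel measures, the sketch below computes Gaussian similarities between three made-up confounder vectors, laid out like `age`, `sex_Female`, `sex_Male`, for two bandwidths. The values and the names `toy`, `K_narrow` and `K_wide` are invented. The formula is the same as in `gaussian_kernel_matrix`, written inline so the example runs on its own.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Three toy rows: [age (standardised), sex_Female, sex_Male], invented values
toy = np.array([
    [0.0, 0.0, 1.0],
    [0.1, 0.0, 1.0],   # very close to row 0
    [2.0, 1.0, 0.0],   # far from row 0
])

# Pairwise Euclidean distances between rows
dists = cdist(toy, toy, metric="euclidean")

# Same kernel formula with a narrow and a wide bandwidth
K_narrow = np.exp(-0.5 * (dists / 0.5) ** 2)
K_wide = np.exp(-0.5 * (dists / 2.0) ** 2)

print("bandwidth=0.5")
print(np.round(K_narrow, 3))
print("bandwidth=2.0")
print(np.round(K_wide, 3))
```

Each row has similarity 1 with itself, nearby rows score close to 1, and a larger bandwidth makes distant rows count as more similar.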

In [3]:
def gaussian_kernel_matrix(A, B=None, bandwidth=1.0):
    """Return Gaussian kernel matrix K_ij = exp(-0.5 * ||a_i - b_j||^2 / h^2)."""
    if B is None:
        B = A
    dists = cdist(A, B, metric='euclidean')
    K = np.exp(-0.5 * (dists / bandwidth) ** 2)
    return K

def ensure_2d(df, cols):
    return df[cols].to_numpy(dtype=float).reshape(len(df), -1) # ensure 2D array

In [4]:
# Extract relevant information such as data size and unique target values

N = len(df) #number of samples
y = df[effect].to_numpy() #converting target column to numpy array
unique_y = np.unique(y) #finding unique values in target column

print(N)
print(y)
print(unique_y)

272
[0 1 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 1 0 1 0
 0 1 1 0 1 0 0 1 0 1 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 0 0 1 0 0
 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 0 0 1 1 1 1 1 1
 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 0
 0 0 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 0 0 1 1 1 1 1 0 1 0 0 0 1 0 0 0 0 0 1 1
 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 1 1 0 1 1 1 0 0 0 0 1 0 1
 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 1 1 0 1 0 0 1 0 1 0 1 0 1 1 1
 0 1 0 1 1 1 0 1 1 1 1 1 1]
[0 1]


In [5]:
# converting confounder columns to a 2D numpy array

confounder_matrix = ensure_2d(df, confounders) # shape: (number of samples, number of confounder columns)
print(confounder_matrix.shape)
confounder_matrix

(272, 3)


array([[ 0.9508825 ,  0.        ,  1.        ],
       [ 1.39758378,  0.        ,  1.        ],
       [ 1.39758378,  0.        ,  1.        ],
       [-1.95267581,  0.        ,  1.        ],
       [-1.50597453,  1.        ,  0.        ],
       [ 0.16915526,  0.        ,  1.        ],
       [ 0.83920718,  1.        ,  0.        ],
       [ 0.28083058,  1.        ,  0.        ],
       [ 0.9508825 ,  0.        ,  1.        ],
       [-0.1658707 ,  0.        ,  1.        ],
       [ 0.28083058,  0.        ,  1.        ],
       [ 0.16915526,  1.        ,  0.        ],
       [ 0.16915526,  0.        ,  1.        ],
       [-1.17094857,  0.        ,  1.        ],
       [-0.27754602,  0.        ,  1.        ],
       [ 0.28083058,  0.        ,  1.        ],
       [-0.72424729,  0.        ,  1.        ],
       [-0.05419538,  0.        ,  1.        ],
       [-0.72424729,  1.        ,  0.        ],
       [ 1.06255782,  0.        ,  1.        ],
       [ 0.3925059 ,  1.        ,  0.   

In [6]:
# computing confounder similarity matrix using Gaussian kernel
# to get similarities between all pairs of samples based on confounders

bandwidth_S = 1.0 # kernel length scale h: similarity decays as exp(-0.5 * (distance / h)^2), so a larger h treats more distant rows as "close"
confounder_similarity_matrix = gaussian_kernel_matrix(confounder_matrix, bandwidth=bandwidth_S)


print(confounder_similarity_matrix.shape)
confounder_similarity_matrix

(272, 272)


array([[1.        , 0.90504463, 0.90504463, ..., 0.85565141, 0.79892773,
        0.29390909],
       [0.90504463, 1.        , 1.        , ..., 0.99378371, 0.53602801,
        0.19719369],
       [0.90504463, 1.        , 1.        , ..., 0.99378371, 0.53602801,
        0.19719369],
       ...,
       [0.85565141, 0.99378371, 0.99378371, ..., 1.        , 0.47023707,
        0.17299055],
       [0.79892773, 0.53602801, 0.53602801, ..., 0.47023707, 1.        ,
        0.36787944],
       [0.29390909, 0.19719369, 0.19719369, ..., 0.17299055, 0.36787944,
        1.        ]])

In [7]:
def p_hat_y_given_S_backdoor(target_y):
    """Kernel-smoothed estimate of P(y == target_y | confounders) for every row."""
    indicator = (y == target_y).astype(float) # array of 1s and 0s marking which samples have y == target_y
    print(indicator)
    numer = confounder_similarity_matrix @ indicator  # for each sample, total similarity to samples with y == target_y
    print(numer)
    denom = confounder_similarity_matrix.sum(axis=1)  # for each sample, total similarity to all samples
    print(denom)
    return numer / np.maximum(denom, 1e-8) # similarity-weighted fraction of samples with y == target_y

p_hat_backdoor = {val: p_hat_y_given_S_backdoor(val) for val in unique_y}
print(p_hat_backdoor)

[1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0.
 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0.
 0. 0. 1. 1. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0.
 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1.
 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1.
 1. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0.
 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0.
 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1.
 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0.
 0. 1. 0. 0. 0. 0. 0. 0.]
[45.12116    33.57688153 33.57688153 37.51028147 41.17286828 61.1980202
 48.20472876 56.03405663 45.12116    64.64635359 59.49897957 56.93189075
 61.1980202  57.6510469  6

- The idea below is that a sample is upweighted when its y value is rare among samples with similar confounder values, i.e. when the confounders would not have predicted it. This simulates the confounders having no influence on which y value a sample gets.
- Likewise, a sample whose y value is common among samples with similar confounder values is downweighted, so that the pattern explained by the confounders shows up as little as possible in the resampled data.
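
The weighting can be traced by hand on a tiny example. The sketch below uses six invented rows with a single confounder `s` and a binary outcome `y_toy`; the names, values and bandwidth of 0.5 are assumptions for illustration only. It follows the same two steps as the notebook: a kernel-smoothed estimate of P(y | s), then weights of 1 / (N * p_hat) for the rows that have the target value.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Invented data: low s mostly goes with y = 0, high s mostly with y = 1
s = np.array([[-1.0], [-1.0], [-1.0], [1.0], [1.0], [1.0]])
y_toy = np.array([0, 0, 1, 1, 1, 0])
n_rows = len(y_toy)

# Gaussian kernel similarity between rows, bandwidth 0.5
K = np.exp(-0.5 * (cdist(s, s, metric="euclidean") / 0.5) ** 2)

target = 1
indicator = (y_toy == target).astype(float)
p_hat = (K @ indicator) / K.sum(axis=1)          # estimate of P(y == 1 | s) per row
weights = indicator / (n_rows * np.maximum(p_hat, 1e-8))

for i in range(n_rows):
    print(f"row {i}: s={s[i, 0]:+.0f} y={y_toy[i]} "
          f"p_hat={p_hat[i]:.3f} weight={weights[i]:.3f}")
```

Row 2 has y = 1 even though its neighbours in `s` mostly have y = 0, so it receives roughly twice the weight of rows 3 and 4, whose y = 1 is typical for their value of `s`. Rows with y = 0 get zero weight for this target.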

In [8]:
weights_backdoor = {} # dictionary to hold weights for each unique y value
for y_star in unique_y:
    p_hat = p_hat_backdoor[y_star] # estimated P(y == y_star | confounders) for every row
    indicator = (y == y_star).astype(float) # [1.0, 0.0, ...] marking which samples have y == y_star
    w = indicator / (N * np.maximum(p_hat, 1e-8)) # inverse-probability weight: if p_hat is high the weight is low, and vice versa
    # dividing by N keeps the weights for each y_star summing to roughly 1; they are normalised exactly before sampling
    # indicator is the numerator because only samples with y == y_star get a non-zero weight; rows with a different outcome do not contribute to the y_star group
    weights_backdoor[y_star] = w

print(weights_backdoor)

{0: array([0.01031089, 0.        , 0.        , 0.00609717, 0.00524391,
       0.0090429 , 0.        , 0.00703214, 0.        , 0.        ,
       0.00924763, 0.00692329, 0.        , 0.00680896, 0.0082096 ,
       0.00924763, 0.        , 0.00862532, 0.00592842, 0.01044857,
       0.00713226, 0.        , 0.        , 0.        , 0.00751595,
       0.        , 0.        , 0.        , 0.00963973, 0.00680896,
       0.0065573 , 0.        , 0.        , 0.        , 0.00999764,
       0.        , 0.00739184, 0.00963973, 0.        , 0.        ,
       0.00800696, 0.        , 0.00751103, 0.01057302, 0.        ,
       0.00680896, 0.        , 0.        , 0.        , 0.        ,
       0.00862532, 0.00800696, 0.        , 0.00570033, 0.        ,
       0.00668539, 0.        , 0.        , 0.        , 0.00862532,
       0.        , 0.        , 0.00751103, 0.        , 0.        ,
       0.        , 0.        , 0.00751103, 0.        , 0.00630296,
       0.00743827, 0.        , 0.00695068, 0.00655963, 0.0

- One dataframe is then created for each unique y value.
- Its rows are drawn with replacement, with probabilities given by that y value's weights normalised to sum to 1.
- Finally, those dataframes are concatenated.
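
The resampling step on its own can be sketched as follows, using a hypothetical four-row dataframe and invented weights. Two rows have zero weight because they belong to the other y value. The names `toy_df`, `toy_weights` and `resampled` are illustrative only.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"feature": [10, 20, 30, 40], "y": [1, 0, 1, 0]})
toy_weights = np.array([0.6, 0.0, 0.2, 0.0])   # invented weights for y == 1

rng = np.random.default_rng(0)
probs = toy_weights / toy_weights.sum()         # normalise so the probabilities sum to 1

# Draw as many row positions as there are rows, with replacement
idx = rng.choice(len(toy_df), size=len(toy_df), replace=True, p=probs)

resampled = toy_df.iloc[idx].reset_index(drop=True)
print(resampled)
```

Rows with zero weight can never be drawn, so the resampled dataframe contains only rows with y == 1, and the row with the larger weight tends to appear more often.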

In [9]:
rng = np.random.default_rng(0) # random number generator with a fixed seed, for reproducibility
bd_dfs = [] # this will hold one resampled dataframe per y value
for y_star in unique_y: # for each unique value in y
    w = weights_backdoor[y_star] # get the weights for that value
    if w.sum() == 0: # no row has a non-zero weight, so there is nothing to sample
        continue
    probs = w / w.sum() # normalise the weights so they sum to 1
    sample_idx = rng.choice(np.arange(N), size=N, replace=True, p=probs)
    # pick N random row indexes (same size as df) based on probs for each row, with replacement (can pick same row multiple times)
    sampled = df.iloc[sample_idx].copy()
    # takes the rows of df at the sampled indexes and makes a new dataframe
    sampled['do_' + effect] = y_star
    # we create an additional column 'do_' + effect to record the intervention value
    sampled[effect] = y_star
    # every sampled row already has y == y_star (all other rows have zero weight), so this leaves the target column unchanged
    bd_dfs.append(sampled) # collect the resampled dataframe for this y value

In [10]:
df_backdoor = pd.concat(bd_dfs, ignore_index=True) # concatenate all sampled dataframes into one
df_backdoor.head() 

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,sex_Female,sex_Male,cp_Asymptomatic,cp_AtypicalAngina,...,exang_NoExAngina,exang_YesExAngina,slope_Downsloping,slope_Flat,slope_Upsloping,thal_FixedDefect,thal_Normal,thal_ReversibleDefect,heartdiseasepresence,do_heartdiseasepresence
0,1.509259,-0.673176,-0.735106,-1.457909,0.647625,-0.740979,1,0,0,0,...,1,0,0,1,0,0,1,0,0,0
1,-1.059273,-1.579912,-0.795924,-0.044713,1.549723,-0.740979,0,1,1,0,...,0,1,0,1,0,0,1,0,0,0
2,0.280831,0.460243,-1.120286,-0.044713,-0.465247,-0.740979,0,1,1,0,...,1,0,0,1,0,1,0,0,0,0
3,-1.505975,-0.106466,-0.877014,0.983065,0.569273,-0.740979,1,0,0,1,...,1,0,0,0,1,0,1,0,0,0
4,-1.394299,-0.673176,0.967794,0.554824,-1.111053,-0.740979,0,1,0,1,...,1,0,0,0,1,0,1,0,0,0


In [11]:
df_backdoor

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,sex_Female,sex_Male,cp_Asymptomatic,cp_AtypicalAngina,...,exang_NoExAngina,exang_YesExAngina,slope_Downsloping,slope_Flat,slope_Upsloping,thal_FixedDefect,thal_Normal,thal_ReversibleDefect,heartdiseasepresence,do_heartdiseasepresence
0,1.509259,-0.673176,-0.735106,-1.457909,0.647625,-0.740979,1,0,0,0,...,1,0,0,1,0,0,1,0,0,0
1,-1.059273,-1.579912,-0.795924,-0.044713,1.549723,-0.740979,0,1,1,0,...,0,1,0,1,0,0,1,0,0,0
2,0.280831,0.460243,-1.120286,-0.044713,-0.465247,-0.740979,0,1,1,0,...,1,0,0,1,0,1,0,0,0,0
3,-1.505975,-0.106466,-0.877014,0.983065,0.569273,-0.740979,1,0,0,1,...,1,0,0,0,1,0,1,0,0,0
4,-1.394299,-0.673176,0.967794,0.554824,-1.111053,-0.740979,0,1,0,1,...,1,0,0,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
539,1.397584,-0.673176,-0.208018,-3.342170,0.219335,-0.740979,0,1,1,0,...,1,0,0,1,0,0,1,0,1,1
540,-0.947598,0.460243,1.292156,-1.243788,0.865141,1.437899,0,1,1,0,...,0,1,0,1,0,0,0,1,1,1
541,0.504181,0.460243,-1.424375,0.554824,-1.111053,0.348460,0,1,1,0,...,0,1,0,0,1,0,0,1,1,1
542,-0.389221,0.460243,1.048884,1.025889,0.722903,-0.740979,0,1,1,0,...,0,1,0,0,1,0,0,1,1,1
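
For reuse, the steps from cells 3 to 10 can be collected into one function. The following is a sketch under assumptions rather than a definitive implementation: the function name `backdoor_bootstrap`, its parameters and the synthetic demo data are all hypothetical, and the logic simply mirrors the cells above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def backdoor_bootstrap(df, target_col, confounder_cols, bandwidth=1.0, seed=0):
    """Resample df once per unique value of target_col using kernel-based weights."""
    rng = np.random.default_rng(seed)
    n = len(df)
    y = df[target_col].to_numpy()
    S = df[confounder_cols].to_numpy(dtype=float).reshape(n, -1)

    # Gaussian kernel similarity between every pair of rows, based on the confounders
    K = np.exp(-0.5 * (cdist(S, S, metric="euclidean") / bandwidth) ** 2)
    denom = np.maximum(K.sum(axis=1), 1e-8)

    parts = []
    for y_star in np.unique(y):
        indicator = (y == y_star).astype(float)
        p_hat = (K @ indicator) / denom                 # estimate of P(y == y_star | S)
        w = indicator / (n * np.maximum(p_hat, 1e-8))   # inverse-probability weights
        if w.sum() == 0:
            continue
        idx = rng.choice(n, size=n, replace=True, p=w / w.sum())
        sampled = df.iloc[idx].copy()
        sampled["do_" + target_col] = y_star            # record the intervention value
        parts.append(sampled)
    return pd.concat(parts, ignore_index=True)


# Synthetic demo data (invented): the confounder s shifts both y and the feature
demo_rng = np.random.default_rng(1)
s = demo_rng.normal(size=200)
demo = pd.DataFrame({
    "s": s,
    "feature": s + demo_rng.normal(size=200),
    "y": (s + demo_rng.normal(size=200) > 0).astype(int),
})
demo_backdoor = backdoor_bootstrap(demo, target_col="y", confounder_cols=["s"])
print(demo_backdoor.shape)
```

Keeping the bandwidth and seed as parameters makes it easier to check how sensitive the resampled data is to those choices.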
