## What is a Backdoor Causal Bootstrap Algorithm?

- It is an algorithm which resamples the dataset to create a new dataset.
- It uses the back-door criterion, which is satisfied if a variable Z is not a descendant of a 'cause' variable X and blocks every back-door path between the 'cause' X and the 'effect' Y.
- Variables like Z, which influence both X and Y, are known as **confounders**.
- e.g. in addition to X -> Y, there is also X <- Z -> Y (a back-door path).
- The goal is that ML models trained on the resampled data can learn causal relationships rather than just correlations.

**Example**
- Imagine you want to assess whether drinking coffee (the “cause” X) leads to improved concentration at work (the “effect” Y). 
- But there’s a confounder: people who are sleep-deprived tend to drink more coffee and also tend to have worse concentration. 
- If you simply compare coffee drinkers vs non-drinkers, the effect of sleep-deprivation might bias your result. The back-door approach says: to get at the true effect of coffee on concentration, you need to adjust for (control for) the variable “sleep-deprivation” (call it Z) so that the path coffee ← sleep-deprivation → concentration is blocked. 
- After you control for how much sleep-deprivation someone has, you can compare coffee drinkers and non-drinkers on a “level playing field” of sleep-deprivation and infer more reliably how coffee itself affects concentration.
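
The coffee example can be simulated as a rough sketch. All numbers here are made up for illustration: the true effect of coffee is set to +1, sleep-deprivation costs 3 points of concentration, and sleep-deprived people are assumed to drink coffee more often. The naive comparison comes out with the wrong sign, while averaging the within-stratum differences over Z recovers roughly +1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Z: sleep-deprived (1) or not (0); assumed 50/50 for illustration
z = rng.binomial(1, 0.5, size=n)
# X: drinks coffee (1) or not (0); more likely when sleep-deprived
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
# Y: concentration score; assumed true effect of coffee is +1, sleep-deprivation is -3
y = 1.0 * x - 3.0 * z + rng.normal(0.0, 1.0, size=n)

# Naive comparison: coffee drinkers vs non-drinkers, ignoring Z
naive = y[x == 1].mean() - y[x == 0].mean()

# Back-door adjustment: compare within each level of Z, then average over p(z)
adjusted = 0.0
for z_val in (0, 1):
    in_stratum = z == z_val
    diff = y[in_stratum & (x == 1)].mean() - y[in_stratum & (x == 0)].mean()
    adjusted += in_stratum.mean() * diff

print("naive difference:   ", round(naive, 2))
print("adjusted difference:", round(adjusted, 2))
```

The naive estimate is biased because coffee drinkers are mostly sleep-deprived in this simulation, so their lower concentration gets attributed to coffee.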

## Algorithm Intuition

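- Each row i gets the weight wᵢ = K_Y(yᵢ, y*) / (N × p̂(y* | Sᵢ)), where y* is the target value being simulated, K_Y compares yᵢ with y*, N is the number of rows, and p̂(y* | Sᵢ) is the estimated probability of y* given row i's confounders Sᵢ.
- Because p̂(y* | Sᵢ) sits in the denominator, rows whose confounders make y* unlikely get a higher weight, and rows whose confounders already predict y* well get a lower weight.
- Resampling rows with these weights simulates deconfounding.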
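
A tiny worked example of the weight formula, assuming a binary confounder and a binary target so that p̂(y* | Sᵢ) can be estimated by simple frequencies instead of kernels. The eight rows below are invented for illustration.

```python
import numpy as np

# Toy data (invented): S is a binary confounder, Y a binary target
S = np.array([0, 0, 0, 0, 1, 1, 1, 1])
Y = np.array([1, 1, 1, 0, 1, 0, 0, 0])
N = len(Y)
y_star = 1

# p_hat(y* | S_i): share of rows with the same S value that have Y == y*
p_hat = np.array([(Y[S == s] == y_star).mean() for s in S])

# K_Y(y_i, y*): Kronecker delta for a discrete target
K_Y = (Y == y_star).astype(float)

# w_i = K_Y(y_i, y*) / (N * p_hat(y* | S_i))
w = K_Y / (N * p_hat)
print(w)
```

Row 4 is the only row with S = 1 and Y = 1, so it has p̂ = 0.25 and gets weight 0.5, three times the weight of each S = 0, Y = 1 row (1/6 each). Rows with Y ≠ y* get weight 0.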

In [None]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def _ensure_2d(arr):
    arr = np.asarray(arr)
    if arr.ndim == 1:
        arr = arr.reshape(-1, 1)
    return arr

# _ensure_2d converts the input into a 2-D NumPy array (a 1-D input becomes a single column)

def gaussian_kernel_matrix(A, B=None, bandwidth=1.0):
    """
    Gaussian/RBF kernel matrix: K_ij = exp(-0.5 * ||a_i - b_j||^2 / h^2)
    A: (n,d), B: (m,d)
    """
    A = _ensure_2d(A)
    if B is None:
        B = A
    else:
        B = _ensure_2d(B)
    dists = cdist(A, B, metric="euclidean")
    return np.exp(-0.5 * (dists / float(bandwidth)) ** 2)

# measures the similarity between every row of A and every row of B with a Gaussian kernel (1 = identical, near 0 = far apart)

def kernel_y_vector(y_data, y_star, *, discrete=True, bandwidth_y=1.0):
    """
    K[y_i - y*] used in the paper.
    - discrete=True -> Kronecker delta (1 if equal else 0)
    - discrete=False -> Gaussian kernel on (y_i - y*)
    """
    y_data = np.asarray(y_data)
    if discrete:
        return (y_data == y_star).astype(float)
    return np.exp(-0.5 * ((y_data - y_star) / float(bandwidth_y)) ** 2)

# creates a vector of similarity values between each y value and the target value y*

def phat_y_given_S(y_data, S_data, y_star, *, discrete_y=True, bandwidth_S=1.0, bandwidth_y=1.0, eps=1e-12):
    """
    Nonparametric estimate of p_hat(y* | S_i) for each i, using kernel regression:
      p_hat(y*|S_i) = sum_j K_S(S_i, S_j) * K_Y(y_j, y*) / sum_j K_S(S_i, S_j)
    Returns: (N,) vector over i.
    """
    K_S = gaussian_kernel_matrix(S_data, bandwidth=bandwidth_S)  # (N,N)
    K_Y = kernel_y_vector(y_data, y_star, discrete=discrete_y, bandwidth_y=bandwidth_y)  # (N,)
    numer = K_S @ K_Y
    denom = K_S.sum(axis=1)
    return numer / np.maximum(denom, eps)

# K_S is the (N, N) similarity matrix between confounder rows: K_S[i, j] = similarity of S_i and S_j
# K_Y is the (N,) vector of similarity scores between each y_j and the target value y*
# numer = K_S @ K_Y, so numer[i] = sum_j K_S[i, j] * K_Y[j]:
#   the y*-similarities of all rows, weighted by how similar each row's confounders are to S_i
# denom = K_S.sum(axis=1), so denom[i] = total similarity of confounder row i to all rows
# result[i] = numer[i] / denom[i]:
#   among rows whose confounders are similar to S_i, the (similarity-weighted) share whose y is similar to y*
# high score: rows with similar confounders mostly have y close to y*
# low score: rows with similar confounders rarely have y close to y*
# the return value looks like [score for confounder row 1, score for confounder row 2, ...]

## Choose columns

- `y_col`: intervention / prediction target **Y**
- `s_cols`: back-door admissible adjustment set **S** (measured confounders)
- `x_cols`: features **X** used to train your model (typically exclude `s_cols` to avoid the model exploiting confounders)

The code works for discrete **Y** (classification). For continuous **Y** (regression), set `discrete_y=False` and provide `bandwidth_y`.
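
If you want `x_cols` to leave out the adjustment set, one possible filter is sketched below. The `toy_*` names and the two rows of values are hypothetical and only there to make the snippet runnable on its own; the column names are taken from the heart disease dataset used in this notebook.

```python
import pandas as pd

# Hypothetical two-row frame, only for illustrating the column filter
toy_df = pd.DataFrame({
    "age": [0.5, -1.2],
    "sex_Male": [1, 0],
    "chol": [0.3, -0.7],
    "heartdiseasepresence": [1, 0],
})

toy_y_col = "heartdiseasepresence"
toy_s_cols = ["age", "sex_Male"]

# Keep every column that is neither the target Y nor part of the adjustment set S
toy_x_cols = [c for c in toy_df.columns if c != toy_y_col and c not in toy_s_cols]
print(toy_x_cols)
```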


In [15]:
# --- USER INPUTS ---
# Load your data:
df = pd.read_csv("heart_disease_preprocessed.csv")  # change if needed

y_col = "heartdiseasepresence"
s_cols = [c for c in ["age", "sex_Female", "sex_Male"] if c in df.columns]

# By default, use all columns except Y as X.
# Note: this keeps the S columns in X; add `and c not in s_cols` to the filter to exclude them.
x_cols = [c for c in df.columns if c not in [y_col]]

# KDE / kernel settings
discrete_y = True      # True for classification targets
bandwidth_S = 1.0      # kernel bandwidth on S
bandwidth_y = 1.0      # only used if discrete_y=False

random_seed = 0

print("y_col:", y_col)
print("s_cols:", s_cols)
print("x_cols:", x_cols)
print("N:", len(df))

y_col: heartdiseasepresence
s_cols: ['age', 'sex_Female', 'sex_Male']
x_cols: ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'ca', 'sex_Female', 'sex_Male', 'cp_Asymptomatic', 'cp_AtypicalAngina', 'cp_NonAnginalPain', 'cp_TypicalAngina', 'fbs_<=120', 'fbs_>120', 'restecg_LVHypertrophy', 'restecg_NormalECG', 'restecg_STTAbnormality', 'exang_NoExAngina', 'exang_YesExAngina', 'slope_Downsloping', 'slope_Flat', 'slope_Upsloping', 'thal_FixedDefect', 'thal_Normal', 'thal_ReversibleDefect']
N: 272


In [None]:
def causal_bootstrap_backdoor(df, *, x_cols, y_col, s_cols,
                              discrete_y=True, bandwidth_S=1.0, bandwidth_y=1.0,
                              random_seed=0, eps=1e-12):
    """Back-door causal bootstrap (Algorithm 1).

    Returns a dataframe with columns x_cols + [y_col] representing samples from p(x|do(y)).
    Number of returned rows equals len(df).
    """
    N = len(df)
    y = df[y_col].to_numpy()
    S = df[s_cols].to_numpy(dtype=float) if len(s_cols) else np.zeros((N, 0))

    rng = np.random.default_rng(random_seed)

    # discrete Y: group by y* and generate n_y* = count(y=y*) samples (preserves marginal p(y))
    y_vals = np.unique(y) if discrete_y else y

    out_rows = []
    for y_star in y_vals:
        if discrete_y:
            n_star = int((y == y_star).sum())
            if n_star == 0:
                continue
        else:
            # regression: one sample per observed y value
            n_star = 1

        # p_hat(y* | S_i) for each i
        phat = phat_y_given_S(y, S, y_star, discrete_y=discrete_y,
                              bandwidth_S=bandwidth_S, bandwidth_y=bandwidth_y, eps=eps)

        # K[y_i - y*]
        Ky = kernel_y_vector(y, y_star, discrete=discrete_y, bandwidth_y=bandwidth_y)
        # Ky[i] measures how similar y_i is to the target value y*;
        # for discrete Y it is 1 for rows with y_i == y* and 0 otherwise

        # weights: w_i = Ky / (N * phat)
        w = Ky / (float(N) * np.maximum(phat, eps))
        # w is a vector of weights: w[i] is largest when y_i matches y* (high Ky[i]) but p_hat(y* | S_i) is low
        # in other words, we put more emphasis on rows that have target y* even though their confounders make y* unlikely

        # sample indices with prob proportional to w
        w_sum = w.sum()
        if w_sum <= 0:
            continue
        p = w / w_sum
        idx = rng.choice(np.arange(N), size=n_star, replace=True, p=p)

        block = df.iloc[idx][x_cols].copy()
        block[y_col] = y_star
        out_rows.append(block)

    df_star = pd.concat(out_rows, ignore_index=True)

    # sanity: ensure same size as original
    if len(df_star) != N and discrete_y:
        # fallback: if rounding/counting mismatch, resample to N
        df_star = df_star.sample(n=N, replace=True, random_state=random_seed).reset_index(drop=True)

    return df_star

In [17]:
df_backdoor = causal_bootstrap_backdoor(
    df,
    x_cols=x_cols,
    y_col=y_col,
    s_cols=s_cols,
    discrete_y=discrete_y,
    bandwidth_S=bandwidth_S,
    bandwidth_y=bandwidth_y,
    random_seed=random_seed,
)

df_backdoor.shape

(272, 26)

In [18]:
df_backdoor

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,sex_Female,sex_Male,cp_Asymptomatic,cp_AtypicalAngina,...,restecg_STTAbnormality,exang_NoExAngina,exang_YesExAngina,slope_Downsloping,slope_Flat,slope_Upsloping,thal_FixedDefect,thal_Normal,thal_ReversibleDefect,heartdiseasepresence
0,1.509259,-0.673176,-0.735106,-1.457909,0.647625,-0.740979,1,0,0,0,...,0,1,0,0,1,0,0,1,0,0
1,-1.059273,-1.579912,-0.795924,-0.044713,1.549723,-0.740979,0,1,1,0,...,0,0,1,0,1,0,0,1,0,0
2,0.280831,0.460243,-1.120286,-0.044713,-0.465247,-0.740979,0,1,1,0,...,0,1,0,0,1,0,1,0,0,0
3,-1.505975,-0.106466,-0.877014,0.983065,0.569273,-0.740979,1,0,0,1,...,0,1,0,0,0,1,0,1,0,0
4,-1.394299,-0.673176,0.967794,0.554824,-1.111053,-0.740979,0,1,0,1,...,0,1,0,0,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
267,0.057480,0.006876,2.143605,-0.729899,0.402268,0.348460,0,1,1,0,...,0,0,1,0,1,0,0,0,1,1
268,0.057480,0.460243,-0.613470,-1.629205,2.510884,-0.740979,0,1,1,0,...,0,0,1,1,0,0,0,0,1,1
269,1.397584,1.593663,0.785340,-1.757678,0.647625,2.527338,0,1,1,0,...,0,0,1,0,1,0,0,1,0,1
270,-0.612572,-0.786518,-1.992008,-0.986844,0.017112,2.527338,0,1,0,0,...,0,1,0,0,0,1,0,1,0,1


In [19]:
# Save
out_path = "heart_disease_preprocessed_backdoor.csv"
df_backdoor.to_csv(out_path, index=False)
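
One way to sanity-check the idea is a small synthetic experiment. The sketch below does not call `causal_bootstrap_backdoor`; it is a simplified re-implementation that assumes a binary confounder and a binary target, and estimates p̂(y* | S) by frequencies instead of kernels. The data are simulated with an assumed strong dependence of Y on S. After resampling, the correlation between S and Y should drop to roughly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated data: binary confounder S, binary target Y that depends on S
s = rng.binomial(1, 0.5, size=n)
y = rng.binomial(1, np.where(s == 1, 0.8, 0.2))

# Frequency estimate of p_hat(y | s), stored as p_table[s_value, y_value]
p_table = np.array([[(y[s == sv] == yv).mean() for yv in (0, 1)] for sv in (0, 1)])

idx_blocks, y_blocks = [], []
for y_star in (0, 1):
    K_Y = (y == y_star).astype(float)        # Kronecker delta on y
    w = K_Y / (n * p_table[s, y_star])       # w_i = K_Y / (N * p_hat(y* | S_i))
    p = w / w.sum()                          # normalise into sampling probabilities
    n_star = int(K_Y.sum())                  # keep the original count of y* rows
    idx_blocks.append(rng.choice(n, size=n_star, replace=True, p=p))
    y_blocks.append(np.full(n_star, y_star))

idx = np.concatenate(idx_blocks)
s_boot, y_boot = s[idx], np.concatenate(y_blocks)

corr_before = np.corrcoef(s, y)[0, 1]
corr_after = np.corrcoef(s_boot, y_boot)[0, 1]
print("corr(S, Y) before:", round(corr_before, 3))
print("corr(S, Y) after: ", round(corr_after, 3))
```

In this discrete setting the weights make S follow its marginal distribution inside every y* block, which is why the association between S and Y disappears in the resampled data.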