In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

import causal_assistant as ca

Suppose we are a patient, and we want to pick the best hospital we can.

We have access to data from two nearby hospitals: Hospital 0 and Hospital 1.

In [None]:
patients = [
    # hospital 0, condition 0: 50% positive outcomes
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],

    # hospital 1, condition 0: 33% positive outcomes
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],

    # hospital 0, condition 1: 100% positive outcomes
    [0, 1, 1, 1],
    [0, 1, 0, 1],

    # hospital 1, condition 1: 90% positive outcomes
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 0]
]

df = pd.DataFrame(patients, columns=["hospital", "condition", "group", "outcome"])
df.head()

So, which hospital is better?

We could compute a basic summary statistic to try to answer this question: which hospital has the higher percentage of positive outcomes?

In [None]:
# overall rate of positive outcomes per hospital:
df.groupby("hospital").outcome.mean()

Great! Hospital 1 looks better (about 77% positive outcomes versus about 58% for Hospital 0). Let's go there then :)

However, **this isn't actually the case** - look what happens when we split our analysis by condition:

In [None]:
# rate of positive outcomes per hospital, split by condition:
df.groupby(["hospital", "condition"]).outcome.mean().unstack()

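In this subgroup analysis, we can see that Hospital 0 is actually better than Hospital 1 for both types of condition we have data for (50% vs 33% for Condition 0, and 100% vs 90% for Condition 1)! This kind of reversal is known as Simpson's paradox.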

In [None]:
# unweighted mean of the per-condition rates, i.e. treating both conditions as equally common:
df.groupby(["hospital", "condition"]).outcome.mean().unstack().mean(axis="columns")


One potential cause of this reversal is that the condition is a **confounder**: it influences both the choice of hospital and the patient outcome. Patients with the more serious 'Condition 0' may elect to pick Hospital 0 (e.g. because it may be better suited to serious conditions), but they are also naturally less likely to have a positive outcome regardless of which hospital they go to.

So alongside the obvious causal link from cause (hospital choice) to effect (outcome), there is also a pair of links through the condition: condition affects both hospital choice and outcome!
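The condition -> hospital link can be checked directly by tabulating each hospital's case mix. The sketch below is only an illustration: it rebuilds the patient table from its per-cell counts under the hypothetical name `toy` (so it does not overwrite `df`, and it omits the `group` column) and uses `pd.crosstab`.

```python
import pandas as pd

# Rebuild the patient table from its per-cell counts:
# (hospital, condition, positive outcomes, negative outcomes)
cells = [(0, 0, 5, 5), (1, 0, 1, 2), (0, 1, 2, 0), (1, 1, 9, 1)]
rows = [
    (h, c, o)
    for h, c, pos, neg in cells
    for o in [1] * pos + [0] * neg
]
toy = pd.DataFrame(rows, columns=["hospital", "condition", "outcome"])

# Share of each hospital's patients that have each condition (rows sum to 1)
mix = pd.crosstab(toy["hospital"], toy["condition"], normalize="index")
print(mix.round(2))
# Hospital 0 treats mostly Condition 0 (10 of 12 patients),
# Hospital 1 treats mostly Condition 1 (10 of 13 patients).
```

Because the two hospitals see such different mixes of patients, their overall success rates are not directly comparable.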

We can use **causal inference** to analyse this problem:

In [None]:
# [h]ospital choice affects [o]utcome
# [c]ondition affects [h]ospital choice
# [c]ondition affects [o]utcome
causal_graph = """
    o;h;c;
    h->o;
    c->h;
    c->o;
"""

Causal inference lets us ask a more nuanced question: what would the average outcome at each hospital be if the choice of hospital did not depend on the condition? This is answered with an *interventional equation*, which we can compute below using the `causal_assistant` package:

In [None]:
ca.analyse_graph(causal_graph, cause_var="h", effect_var="o")
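For a graph of this shape, where the condition is the only common cause of hospital choice and outcome, the textbook backdoor adjustment is $P(o \mid do(h)) = \sum_c P(o \mid h, c)\,P(c)$. The sketch below evaluates that formula by hand with pandas. It is an independent illustration, not the package's output, and the name `toy` is a hypothetical stand-in for the patient table rebuilt from its per-cell counts.

```python
import pandas as pd

# Rebuild the patient table from its per-cell counts:
# (hospital, condition, positive outcomes, negative outcomes)
cells = [(0, 0, 5, 5), (1, 0, 1, 2), (0, 1, 2, 0), (1, 1, 9, 1)]
rows = [
    (h, c, o)
    for h, c, pos, neg in cells
    for o in [1] * pos + [0] * neg
]
toy = pd.DataFrame(rows, columns=["hospital", "condition", "outcome"])

# P(c): how common each condition is across all patients
p_c = toy["condition"].value_counts(normalize=True)

# P(o=1 | h, c): success rate per hospital (rows) and condition (columns)
p_o_given_hc = toy.groupby(["hospital", "condition"])["outcome"].mean().unstack()

# Backdoor adjustment: weight each per-condition rate by P(c), then sum over c
adjusted = p_o_given_hc.mul(p_c, axis="columns").sum(axis="columns")
print(adjusted.round(3))
# hospital 0: 0.74, hospital 1: about 0.605
```

Unlike the unweighted mean of the subgroup rates, this weights each condition by how often it occurs in the whole population, and it ranks Hospital 0 above Hospital 1.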

In [None]:
# simulate hospital trips in an RCT, using the above interventional equation (do-calculus)
cb_df, _ = ca.bootstrap(
    causal_graph, cause_var="h", effect_var="o",
    h=df, o=df["outcome"], c=df["condition"]
)

In [None]:
cb_df.value_counts()

In [None]:
# totals:
cb_df.groupby("hospital").outcome.mean()

In [None]:
cb_df.groupby(["hospital", "condition"]).outcome.mean().unstack()

By resampling, causal bootstrapping has preserved the underlying per-condition outcome rates (the subgroup analysis) while correcting the causal issue by breaking the condition -> hospital link.
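One way to picture this kind of resampling is inverse-probability weighting: each patient is weighted by 1 / P(hospital | condition), so that over-represented combinations are drawn less often. The sketch below is a hand-rolled illustration with pandas under that assumption, not the `causal_assistant` implementation; `toy`, `weight`, and the sample size of 1000 are hypothetical choices.

```python
import pandas as pd

# Rebuild the patient table from its per-cell counts:
# (hospital, condition, positive outcomes, negative outcomes)
cells = [(0, 0, 5, 5), (1, 0, 1, 2), (0, 1, 2, 0), (1, 1, 9, 1)]
rows = [
    (h, c, o)
    for h, c, pos, neg in cells
    for o in [1] * pos + [0] * neg
]
toy = pd.DataFrame(rows, columns=["hospital", "condition", "outcome"])

# Weight = 1 / P(h | c) = (patients with this condition) / (those also at this hospital)
n_c = toy.groupby("condition")["outcome"].transform("count")
n_hc = toy.groupby(["hospital", "condition"])["outcome"].transform("count")
toy["weight"] = n_c / n_hc

# Weighted success rate per hospital (deterministic, no sampling needed)
weighted_sum = (toy["outcome"] * toy["weight"]).groupby(toy["hospital"]).sum()
weighted = weighted_sum / toy["weight"].groupby(toy["hospital"]).sum()
print(weighted.round(3))

# Draw a weighted bootstrap sample for each hospital (hypothetical size: 1000 each)
resampled = pd.concat([
    group.sample(n=1000, replace=True, weights="weight", random_state=0)
    for _, group in toy.groupby("hospital")
])
print(resampled.groupby("hospital")["outcome"].mean())
```

The weighted rates match the backdoor-adjusted values for this data (0.74 for Hospital 0 and about 0.605 for Hospital 1), and the resampled rates should land close to them, varying with the random seed.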