In [1]:
import pandas as pd
data = pd.read_csv('hospital_data.csv')
print(data.shape)
data.head(10)

(10000, 3)


,Hospital,Treatment,Recovery
0,0,0,0.81
1,1,1,0.64
2,1,0,0.5
3,0,0,1.0
4,0,0,0.9
5,0,1,1.0
6,0,0,0.66
7,1,1,0.32
8,1,1,1.0
9,1,1,0.0


### Q1

In [2]:
average_recovery = data.groupby('Treatment')['Recovery'].mean()
print(average_recovery)
print('Based on the above results, treatment [0] which is the old method has a better recovery rate compared to treatment [1] which is the new method')

Treatment
0    0.772246
1    0.707093
Name: Recovery, dtype: float64
Based on the above results, treatment [0] which is the old method has a better recovery rate compared to treatment [1] which is the new method


### Q2

In [3]:
average_recovery_hospital = data.groupby(['Treatment', 'Hospital'])['Recovery'].mean()
print(average_recovery_hospital)

Treatment  Hospital
0          0           0.804058
           1           0.570209
1          0           0.848616
           1           0.650819
Name: Recovery, dtype: float64


In [4]:
# Treatment effect difference in Hospital 0
treatment_effect_hospital_0 = average_recovery_hospital[1, 0] - average_recovery_hospital[0, 0]

print(f"Treatment effect difference in Hospital 0: {treatment_effect_hospital_0:.3f}")
if treatment_effect_hospital_0 > 0:
    print("This shows that the new treatment (T = 1) is better than the old treatment (T = 0) in Hospital 0.")
else:
    print("This shows that the old treatment (T = 0) is better than the new treatment (T = 1) in Hospital 0.")


Treatment effect difference in Hospital 0: 0.045
This shows that the new treatment (T = 1) is better than the old treatment (T = 0) in Hospital 0.


In [5]:
# Treatment effect difference in Hospital 1
treatment_effect_hospital_1 = average_recovery_hospital[1, 1] - average_recovery_hospital[0, 1]

print(f"Treatment effect difference in Hospital 1: {treatment_effect_hospital_1:.3f}")
if treatment_effect_hospital_1 > 0:
    print("This shows that the new treatment (T = 1) is better than the old treatment (T = 0) in Hospital 1.")
else:
    print("This shows that the old treatment (T = 0) is better than the new treatment (T = 1) in Hospital 1.")

Treatment effect difference in Hospital 1: 0.081
This shows that the new treatment (T = 1) is better than the old treatment (T = 0) in Hospital 1.


#### Comparison
<div style="font-size:16px">

In both **Hospital 0** and **Hospital 1**, the treatment effect difference shows that the **new treatment (T = 1)** achieves better recovery outcomes compared to the **old treatment (T = 0)**.

However, the **overall results** from Q1 show that the **old treatment (T = 0)** has a higher average recovery rate when patients from both hospitals are pooled.  
<br>
So we are facing **Simpson’s Paradox**! =)

</div>


### Q3

**Backdoor Criterion and Valid Adjustment Set**

Using the provided DAG, we apply the **backdoor criterion** to estimate the causal effect of **T (Treatment)** on **R (Recovery)**.

In the graph, **H (Hospital)** is a common cause (confounder) of both T and R, which opens a **backdoor path** from T to R through H. To block this path and satisfy the backdoor criterion, we must condition on H.

Thus, **H is a valid adjustment set**.

The causal effect of T on R can be computed by adjusting for H using the following formula:

$$
P(R \mid \text{do}(T)) = \sum_{h} P(R \mid T, H = h) \cdot P(H = h)
$$

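This formula marginalizes over the hospital variable $H$, which controls for the confounding through $H$. Provided the DAG is correct and $H$ is the only confounder, it yields an unbiased estimate of the causal effect. Because Recovery is a score in $[0, 1]$ rather than a binary outcome, the code below works with the expectation form, $E[R \mid \text{do}(T)] = \sum_{h} E[R \mid T, H = h] \cdot P(H = h)$.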
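The adjustment formula can also be written as a small reusable function. The following is only a sketch: the helper name `backdoor_adjusted_mean` and the toy DataFrame are hypothetical and are not part of the assignment data. It assumes a single discrete confounder and that every stratum contains patients under the requested treatment.

```python
import pandas as pd


def backdoor_adjusted_mean(df, treatment, outcome, confounder, t):
    """Estimate E[outcome | do(treatment = t)] by adjusting for one discrete confounder."""
    # P(H = h): marginal distribution of the confounder over all rows
    p_h = df[confounder].value_counts(normalize=True)
    # E[R | T = t, H = h]: mean outcome within each stratum among rows with treatment t
    stratum_means = df[df[treatment] == t].groupby(confounder)[outcome].mean()
    # Every stratum must contain treatment t, otherwise the weighted sum is undefined
    missing = p_h.index.difference(stratum_means.index)
    if len(missing) > 0:
        raise ValueError(f"No rows with {treatment} = {t} in strata: {list(missing)}")
    # Weighted sum over strata (the two Series align on the confounder values)
    return float((stratum_means * p_h).sum())


# Hypothetical toy data, used only to illustrate the call
toy = pd.DataFrame({
    "Hospital":  [0, 0, 0, 1, 1, 1],
    "Treatment": [0, 0, 1, 0, 1, 1],
    "Recovery":  [0.8, 0.6, 0.9, 0.4, 0.5, 0.7],
})

do_t0 = backdoor_adjusted_mean(toy, "Treatment", "Recovery", "Hospital", t=0)
do_t1 = backdoor_adjusted_mean(toy, "Treatment", "Recovery", "Hospital", t=1)
print(f"Adjusted difference on the toy data: {do_t1 - do_t0:.3f}")
```

On the real data the same call would presumably take `data` in place of `toy`, with the column names unchanged.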


In [6]:
average_recovery_hospital

Treatment  Hospital
0          0           0.804058
           1           0.570209
1          0           0.848616
           1           0.650819
Name: Recovery, dtype: float64

In [7]:
H_1 = data['Hospital'].mean()  # P(H = 1): share of patients in Hospital 1 (Hospital is coded 0/1)
H_0 = 1 - H_1                  # P(H = 0)

print(f"Percentage of patients in Hospital 0: {H_0*100:.2f} %")
print(f"Percentage of patients in Hospital 1: {H_1*100:.2f} %")

Percentage of patients in Hospital 0: 61.08 %
Percentage of patients in Hospital 1: 38.92 %


In [8]:
# Compute E[R | do(T=0)] = Σ_h E[R | T=0, H=h] * P(H=h)
# (Recovery is a score in [0, 1], so we average it rather than take P(R=1))

recovery_do_T0 = average_recovery_hospital[0, 0] * H_0 + average_recovery_hospital[0, 1] * H_1

# Compute E[R | do(T=1)] = Σ_h E[R | T=1, H=h] * P(H=h)

recovery_do_T1 = average_recovery_hospital[1, 0] * H_0 + average_recovery_hospital[1, 1] * H_1

In [9]:
print(f"\nEstimated recovery under old treatment (T = 0): {recovery_do_T0:.3f}")
print(f"Estimated recovery under new treatment (T = 1): {recovery_do_T1:.3f}")

causal_effect = recovery_do_T1 - recovery_do_T0
print(f"\nThe estimated causal effect of switching from the old treatment (T = 0) to the new treatment (T = 1) is: {causal_effect:.3f}")


Estimated recovery under old treatment (T = 0): 0.713
Estimated recovery under new treatment (T = 1): 0.772

The estimated causal effect of switching from the old treatment (T = 0) to the new treatment (T = 1) is: 0.059
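
As a check, the adjusted values can be reproduced by hand from the stratum means in Q2 and the hospital shares printed above ($P(H = 0) = 0.6108$, $P(H = 1) = 0.3892$). The intermediate products are rounded to four decimals, so the last digit may differ slightly from the code output:

$$
\begin{aligned}
E[R \mid \text{do}(T = 0)] &= 0.804058 \cdot 0.6108 + 0.570209 \cdot 0.3892 \approx 0.4911 + 0.2219 \approx 0.713 \\
E[R \mid \text{do}(T = 1)] &= 0.848616 \cdot 0.6108 + 0.650819 \cdot 0.3892 \approx 0.5183 + 0.2533 \approx 0.772 \\
E[R \mid \text{do}(T = 1)] - E[R \mid \text{do}(T = 0)] &\approx 0.772 - 0.713 = 0.059
\end{aligned}
$$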


In [10]:
# Create table of counts for Hospital and Treatment
table = data.groupby(['Hospital', 'Treatment']).size().unstack()

table.index.name = 'Hospital'
table.columns.name = 'Treatment'
table.columns = ['T = 0', 'T = 1']
display(table)


Hospital,T = 0,T = 1
0,4865,1243
1,766,3126


#### Conclusion

<div style="font-size:16px">

- The **overall recovery rate** is higher for the **old treatment** (0.772) than for the **new treatment** (0.707), which, taken at face value, would suggest that the old treatment is more effective.
- However, looking at the breakdown by **hospital**, we see a different story:
  - In **Hospital 0**, the new treatment has a higher recovery rate (0.849 > 0.804).
  - In **Hospital 1**, the new treatment again performs better (0.651 > 0.570).
- This shows that the **new treatment is more effective within both hospitals**.
- The misleading overall comparison is due to a **confounding variable**: Hospital. The old treatment was used much more often in Hospital 0, which has higher recovery rates overall.
- Hospital influences both the treatment assigned and recovery.
- By adjusting for hospital using the **backdoor criterion**, we estimate the causal effect of the new treatment at about +0.059 (0.772 vs. 0.713 expected recovery). Under the assumed DAG, the **new treatment is causally better** than the old one, despite what the raw averages initially suggest.

</div>

#### Table 1: Number of Patients by Treatment and Hospital

| Treatment       | Hospital 0 | Hospital 1 |
|------------------|------------|------------|
| T = 0 (Old)      | 4865       | 766        |
| T = 1 (New)      | 1243       | 3126       |

#### Table 2: Average Recovery by Treatment and Hospital

| Treatment       | Overall     | Hospital 0 | Hospital 1 |
|------------------|-------------|-------------|-------------|
| T = 0 (Old)      | 0.772       | 0.804       | 0.570       |
| T = 1 (New)      | 0.707       | 0.849       | 0.651       |
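
The two tables can be tied together: each "Overall" value in Table 2 is a count-weighted average of the two hospital columns, with weights taken from Table 1. The sketch below hard-codes the counts and the six-decimal stratum means printed earlier in the notebook. The helper names are hypothetical, and the results match the Q1 output only up to rounding.

```python
# (treatment, hospital) -> patient count, from Table 1
counts = {(0, 0): 4865, (0, 1): 766, (1, 0): 1243, (1, 1): 3126}
# (treatment, hospital) -> average recovery, from the Q2 output
means = {(0, 0): 0.804058, (0, 1): 0.570209, (1, 0): 0.848616, (1, 1): 0.650819}


def share_in_hospital_0(t):
    """Fraction of patients on treatment t who were treated in Hospital 0."""
    return counts[(t, 0)] / (counts[(t, 0)] + counts[(t, 1)])


def naive_mean(t):
    """Pooled average recovery for treatment t as a weighted average of hospital means."""
    w0 = share_in_hospital_0(t)
    return w0 * means[(t, 0)] + (1 - w0) * means[(t, 1)]


for t in (0, 1):
    print(f"T = {t}: share in Hospital 0 = {share_in_hospital_0(t):.3f}, "
          f"pooled mean = {naive_mean(t):.3f}")
```

The old treatment draws most of its weight from Hospital 0, where recovery is high under either treatment, while the new treatment draws most of its weight from Hospital 1. That difference in weights, not the treatments themselves, is what reverses the pooled comparison.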