# HW: Confounding and Intermediate Variables

## Question 1

**Read chapter 8 of Understanding Uncertainty.**

## Question 2

**Explain what confounding variables and intermediate variables are.**

*   **Confounding Variable:** A third variable that influences both the independent and the dependent variable. In the Chapter 8 medical trial, "Sex" was a confounder because it influenced who received the treatment and the inherent recovery rate.
*   **Intermediate Variable (Mediator):** A variable that lies on the causal path between the independent variable and the dependent variable. The independent variable causes the intermediate variable, which in turn causes the dependent variable. In the Chapter 8 agricultural trial, "Height" was intermediate because the plant variety determined the height, which in turn determined the yield.

## Question 3

**Identify 2 examples (X, Y, Z) where Z is a confounding variable and 2 examples where Z is an intermediate variable.**

**Confounding Variables:**
1.  **X (Ice Cream Sales), Y (Drowning Incidents), Z (Temperature):** Higher temperatures cause more people to buy ice cream and more people to swim (increasing drowning risk). Controlling for temperature removes the association.
2.  **X (Carrying a Lighter), Y (Lung Cancer), Z (Smoking):** Smoking causes people to carry lighters and causes lung cancer. Lighters do not cause cancer.
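
The fork structure in these examples can be checked with a small simulation. The sketch below is illustrative only: the function names `simulate_confounder` and `risk_difference`, the probabilities, and the sample size are all assumptions chosen for the demonstration, not values from the chapter. Z raises the chance of both X and Y, and X has no effect on Y.

```python
import numpy as np
import pandas as pd

def simulate_confounder(n, seed=0):
    """Simulate a fork Z -> X, Z -> Y with no arrow from X to Y.

    All probabilities are made up for illustration.
    """
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, 0.5, n)                       # Z: e.g. hot day (1) or not (0)
    x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))   # X depends only on Z
    y = rng.binomial(1, np.where(z == 1, 0.6, 0.1))   # Y depends only on Z
    return pd.DataFrame({"z": z, "x": x, "y": y})

def risk_difference(df):
    """Return P(Y=1 | X=1) - P(Y=1 | X=0)."""
    means = df.groupby("x")["y"].mean()
    return means.loc[1] - means.loc[0]

sim = simulate_confounder(100_000)
overall = risk_difference(sim)
by_z = {z: risk_difference(group) for z, group in sim.groupby("z")}

print(f"overall: {overall:.3f}")
for z, d in by_z.items():
    print(f"z={z}: {d:.3f}")
```

Under these assumed probabilities the overall difference is 0.30 in expectation, while the difference within each level of Z is 0 in expectation, which is what "controlling for Z removes the association" means.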

**Intermediate Variables:**
1.  **X (Studying), Y (Exam Score), Z (Knowledge of Material):** Studying increases knowledge, which increases exam scores. Knowledge is the mechanism.
2.  **X (Taking Aspirin), Y (Headache Relief), Z (Reduction in Inflammation):** Aspirin reduces inflammation, which relieves the headache. Inflammation reduction is the pathway.

## Question 4

**Using seed = 20251028, generate 1000 draws from the function. Calculate a summary of the effect of vaccination.**

```python
import numpy as np
import pandas as pd

def vax_data(R, seed=None):
    """Simulate R individuals: vaccination status -> disease severity -> recovery time."""
    if seed is not None:
        np.random.seed(seed)
    # vs: 1 = vaccinated, 0 = unvaccinated, each with probability 0.5
    vs = np.random.binomial(1, 0.5, R)
    # ds depends on vs: P(mild) = 0.25 if vaccinated, 0.75 if unvaccinated
    ds = np.random.binomial(1, 0.25 * (vs == 1) + 0.75 * (vs == 0))
    # rt depends on ds: P(short) = 0.7 if mild, 0.5 if severe
    rt = np.random.binomial(1, 0.7 * (ds == 1) + 0.5 * (ds == 0))

    df = pd.DataFrame({
        "vaccination_status": np.where(vs == 1, "vaccinated", "unvaccinated"),
        "disease_severity": np.where(ds == 1, "mild", "severe"),
        "recovery_time": np.where(rt == 1, "short", "long")
    })

    return df

df = vax_data(1000, seed=20251028)

# Proportion with a short recovery in each vaccination group
is_short = df.recovery_time == "short"
p_short_vax = is_short[df.vaccination_status == "vaccinated"].mean()
p_short_unvax = is_short[df.vaccination_status == "unvaccinated"].mean()

delta = p_short_vax - p_short_unvax
print(f"P(Short | Vax): {p_short_vax:.3f}")
print(f"P(Short | Unvax): {p_short_unvax:.3f}")
print(f"Delta (Overall): {delta:.3f}")
```

```text
P(Short | Vax): 0.544
P(Short | Unvax): 0.628
Delta (Overall): -0.084
```


## Question 5

**Calculate the treatment effect separately for the mild and severe populations.**

```python
def calc_delta(sub_df):
    """Return P(short | vaccinated) - P(short | unvaccinated) within sub_df."""
    is_vax = sub_df.vaccination_status == "vaccinated"
    is_unvax = sub_df.vaccination_status == "unvaccinated"
    is_short = sub_df.recovery_time == "short"

    # With an empty group the difference is undefined. Return NaN rather than 0,
    # which would read as "no effect".
    if is_vax.sum() == 0 or is_unvax.sum() == 0:
        return float("nan")

    return is_short[is_vax].mean() - is_short[is_unvax].mean()

delta_mild = calc_delta(df[df.disease_severity == "mild"])
delta_severe = calc_delta(df[df.disease_severity == "severe"])

print(f"Delta (Mild): {delta_mild:.3f}")
print(f"Delta (Severe): {delta_severe:.3f}")
```

```text
Delta (Mild): 0.031
Delta (Severe): -0.027
```
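
With subgroups this small, some sampling noise in the stratified differences is expected. The sketch below gives a rough sense of scale using the usual standard error for a difference of two independent proportions. The helper `se_diff` is hypothetical, and the group sizes are the expected sizes implied by the generating probabilities in `vax_data`, not the realized counts from this sample.

```python
import math

def se_diff(p1, n1, p2, n2):
    """Standard error of p1 - p2 for two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Expected group sizes for 1000 draws: P(vax) = 0.5, P(mild | vax) = 0.25,
# P(mild | unvax) = 0.75. These are assumptions taken from the generating
# probabilities, not counts read off the sample.
n_mild_vax, n_mild_unvax = 125, 375
n_severe_vax, n_severe_unvax = 375, 125

se_mild = se_diff(0.7, n_mild_vax, 0.7, n_mild_unvax)        # P(short | mild) = 0.7
se_severe = se_diff(0.5, n_severe_vax, 0.5, n_severe_unvax)  # P(short | severe) = 0.5

print(f"SE (Mild): {se_mild:.3f}")
print(f"SE (Severe): {se_severe:.3f}")
```

Both standard errors come out near 0.05, so the observed differences of 0.031 and -0.027 each sit within one standard error of zero.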


## Question 6

**Is disease severity a confounding variable or an intermediate variable? Explain your answer.**

Disease severity is an **intermediate variable**.

Looking at the data generation code:
1. `vs` (Vaccine Status) is determined first.
2. `ds` (Disease Severity) is calculated based on `vs` (`0.25 * (vs == 1)...`).
3. `rt` (Recovery Time) is calculated based on `ds`.

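The causal path is: Vaccination Status $\rightarrow$ Disease Severity $\rightarrow$ Recovery Time. Since severity lies on the path between vaccination and recovery, it is an intermediate variable, similar to the "Height" variable in the Chapter 8 agricultural example. It cannot be a confounder, because `vs` is drawn first and independently of everything else, so severity has no influence on who is vaccinated.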
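
The same structure can be written as a conditional independence statement. As a sketch, using the probabilities hard-coded in `vax_data` and writing $V$ for vaccination status, $D$ for disease severity, and $R$ for recovery time (these symbols are introduced here, not in the assignment):

$$
P(R = \text{short} \mid V, D) = P(R = \text{short} \mid D) =
\begin{cases}
0.7 & \text{if } D = \text{mild} \\
0.5 & \text{if } D = \text{severe}
\end{cases}
$$

Once severity is fixed, vaccination status carries no further information about recovery, so the expected difference between vaccinated and unvaccinated individuals within either severity group is zero. This matches the small stratified differences found in Question 5.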

## Question 7

**Determine if the vaccine is effective. Justify your answer.**

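No, the vaccine is **ineffective**. In this simulated data it is in fact associated with worse outcomes.

Because disease severity is an intermediate variable (part of the mechanism through which vaccination acts on recovery), stratifying on it blocks the very effect we want to measure. The right comparison is therefore the aggregate one from Question 4, not the stratified one from Question 5. The overall delta of about -0.08 means vaccinated individuals are roughly 8 percentage points less likely to have a short recovery. The generating code explains why: vaccinated individuals have mild disease with probability 0.25, versus 0.75 for the unvaccinated, and mild cases recover quickly more often than severe ones.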
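
As a rough check on the sample estimate, the population values can be worked out from the probabilities in `vax_data` with the law of total probability. The sketch below is only arithmetic on those hard-coded parameters; the variable and function names are introduced here for illustration.

```python
# Parameters copied from vax_data
p_mild = {"vaccinated": 0.25, "unvaccinated": 0.75}  # P(mild | vaccination status)
p_short = {"mild": 0.7, "severe": 0.5}               # P(short | severity)

def p_short_given(status):
    """P(short | status), summing over the two severity levels."""
    return (p_mild[status] * p_short["mild"]
            + (1 - p_mild[status]) * p_short["severe"])

expected_vax = p_short_given("vaccinated")
expected_unvax = p_short_given("unvaccinated")
expected_delta = expected_vax - expected_unvax

print(f"Expected P(Short | Vax): {expected_vax:.2f}")
print(f"Expected P(Short | Unvax): {expected_unvax:.2f}")
print(f"Expected Delta: {expected_delta:.2f}")
```

Under these parameters the expected proportions are 0.55 and 0.65, a difference of -0.10, so the sample estimate of about -0.08 is in line with the generating process.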

## Question 8

**Suppose you want to know if going to office hours improves performance on an exam. Identify 2 variables that might be confounding and 2 variables that might be intermediate.**

**Confounding Variables:**
1.  **Student Motivation:** Highly motivated students are more likely to attend office hours (X) and more likely to study effectively and score well (Y). This could make office hours look effective even if they are not.
2.  **Outside Work Commitments:** Students who work full-time may be unable to attend office hours (X) and may also have less time to study, leading to lower scores (Y). This produces a spurious association between attendance and scores.

**Intermediate Variables:**
1.  **Clarification of Difficult Concepts:** Going to office hours (X) leads to clearing up confusion on specific topics (Z), which leads to higher scores (Y).
2.  **Exam Strategy/Tips:** Instructors in office hours (X) might give specific tips on how to format answers (Z), which leads to getting more points (Y).