### Helsinki: A Case Study

- **Helsinki** is the capital of **Finland** and a pretty cold place.
- Despite the cold, the Finnish people are the **happiest in the world**.
- Among the many great things about Finland is **metal music**:
  - Finland has more heavy metal bands per person than any other country.
  - When you plot **happiness** and the **logarithm of the number of metal bands per million people** on the same graph, Finland ranks number one in both.

<img src="images/image1.jpeg" width="600" height="400" />

---

### Correlations vs. Causation

- **Spurious correlations** are common. Examples include:
  - The number of **Waffle Houses** and the **divorce rate** in the Southern United States.
  - There is a strong statistical association between the number of Waffle Houses per million people and the divorce rate, but it's not plausible that Waffle House causes divorce any more than it's plausible that heavy metal makes nations happy.

 <img src="images/image2.jpeg" width="600" height="400" />

- **Time Series Correlations**:
  - Example: The divorce rate in **Maine** from 2000 to 2009 correlated with the per capita consumption of **margarine** in the same years, with a correlation of 0.99.

 <img src="images/image3.jpeg" width="600" height="400" />
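A minimal sketch of why short time series correlate so easily: two series generated independently, each with its own steady downward drift, end up almost perfectly correlated. The numbers below are made up for illustration and are not the Maine or margarine data.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(2000, 2010)
t = years - years[0]

# Two series generated independently of each other (illustrative values only);
# the only thing they have in common is that both drift downward over time.
series_a = 5.0 - 0.10 * t + rng.normal(0, 0.03, size=t.size)
series_b = 8.0 - 0.40 * t + rng.normal(0, 0.10, size=t.size)

r_raw = np.corrcoef(series_a, series_b)[0, 1]
print(f"correlation between two unrelated trending series: {r_raw:.2f}")
```

Any pair of variables that trend over the same decade will tend to look like this, which is why a high correlation between time series is weak evidence of a causal link.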
 
---

## Key Takeaways

- **Correlation is common in nature; causation is sparse**.
- We must be sophisticated about how we think about threats to validity to distinguish associations that reflect cause from those that don't.

---


As a reminder of where this course is going: in every example, we start with a scientific question and an estimand, which is our goal. Here is a cooking metaphor:

- **Goal**: Make a hedgehog cake (our estimand).
- **Recipe**: Our estimator (a set of instructions to assemble ingredients like data and code).
- **Result**: Our estimate (the outcome, which ideally resembles the estimand).

<img src="images/image4.jpeg" width="600" height="400" />

Often, things can go wrong in either the design of the estimator or in its usage, resulting in an estimate that may not be what we hoped for. Today, we'll discuss these kinds of threats, specifically those mismatches between the estimator and the process we're studying that can lead us astray. This topic is sometimes called **confounding**.

<img src="images/image5.jpeg" width="600" height="400" />

## Understanding Confounding

The term **confound** means many things in statistics, but we'll use it in its plain English sense: something that misleads us. When confounded, you're confused. The causal sources of confounds are diverse, and today's goal is to introduce you to **the four elemental confounds**. Despite the diversity of causes, they are all built up from four fundamental relationships between triplets of variables: the fork, the pipe, the collider, and the descendant.

<img src="images/image6.jpeg" width="600" height="400" />

### The Fork

The fork is the most fundamental relationship, often introduced first to statistics students. It involves three variables: \(X\), \(Y\), and \(Z\). We're interested in the association between \(X\) and \(Y\), and in how \(Z\) influences that association.

- **Common Cause**: \(Z\) is a common cause of \(X\) and \(Y\). \(X\) and \(Y\) will be associated because they share \(Z\), their common cause.
- **Notation**: \(Y \not\!\perp\!\!\!\perp X\) means \(Y\) is not independent of \(X\). Knowing \(Y\) or \(X\) tells us something about the other due to their shared information from \(Z\).
- **Stratification**: When stratifying the sample by \(Z\), \(X\) and \(Y\) become independent at each level of \(Z\). We write this as \(Y \perp\!\!\!\perp X \mid Z\).

<img src="images/image7.jpeg" width="600" height="400" />

#### Graphical Representation

- **DAG (Directed Acyclic Graph)**: \(Z\) influences both \(X\) and \(Y\).
- **Particles of Influence**: Represented by filled and empty circles, indicating different values.
- **Extra Influences**: \(X\) and \(Y\) also have their own unique influences, not drawn in the DAG.

#### Simulation Example

Here's a simple simulation of the fork:

1. **Simulate \(Z\)**: Generate \(Z\) first.
2. **Simulate \(X\) and \(Y\)**: Generate \(X\) and \(Y\) based on \(Z\) with additional random variation.


<img src="images/image8.jpeg" width="600" height="400" />
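The two steps above can be sketched as follows. This is a minimal illustration, assuming binary variables and the probabilities 0.1 and 0.9, which are choices made here rather than values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Step 1: simulate Z, the common cause.
z = rng.binomial(1, 0.5, size=n)

# Step 2: simulate X and Y. Each depends on Z (plus its own randomness),
# but neither depends on the other.
p_given_z = np.where(z == 1, 0.9, 0.1)  # assumed probabilities
x = rng.binomial(1, p_given_z)
y = rng.binomial(1, p_given_z)

# Marginally, X and Y are associated because they share Z ...
cor_all = np.corrcoef(x, y)[0, 1]

# ... but within each level of Z the association disappears.
cor_z0 = np.corrcoef(x[z == 0], y[z == 0])[0, 1]
cor_z1 = np.corrcoef(x[z == 1], y[z == 1])[0, 1]

print(f"cor(X, Y)         = {cor_all:.2f}")
print(f"cor(X, Y | Z = 0) = {cor_z0:.2f}")
print(f"cor(X, Y | Z = 1) = {cor_z1:.2f}")
```

The overall correlation is clearly positive, while the correlation inside each stratum of \(Z\) is close to zero, which is the pattern \(Y \perp\!\!\!\perp X \mid Z\) describes.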

---

### Data Analysis Example: Marriage Rates and Divorce Rates

Simply looking at a scatter plot often doesn't reveal the underlying causal structure. We use causal models to design estimators that help us infer what's truly happening.

#### Example: Marriage Rate and Divorce Rate

Consider the relationship between marriage rates and divorce rates across the states of the United States. Higher marriage rates are statistically associated with higher divorce rates. This could imply:
- A causal relationship.
- Cultural factors influencing both.
- The simple fact that people can only get divorced if they get married first.

However, there might be more to this relationship. Our goal is to investigate the causal effect of marriage rates on divorce rates across regions in the U.S.

#### Introducing Another Variable: Median Age at Marriage

Another variable strongly related to divorce rates is the median age at which people get married. The relationship works in the opposite direction:
- Higher marriage rates are associated with higher divorce rates.
- Lower median ages at marriage are associated with higher divorce rates.

#### Constructing a Causal Model

We can model these relationships using a Directed Acyclic Graph (DAG):
- **Median Age at Marriage**: A common cause (fork) affecting both marriage rates and divorce rates.
  - Younger populations tend to have higher marriage rates.
  - We need to assess whether the association between marriage rate and divorce rate is solely due to this common cause.

#### Developing an Estimator

We aim to develop a scientific model and statistical estimator to analyze the data. This approach helps us determine if the observed association between marriage and divorce rates results from their shared common cause (median age at marriage) or if there is a direct causal effect.

#### Visualizing the Relationships

Scatter plots alone cannot reveal the causal structure. The true causal relationships require knowledge of the variables and their interactions:
- We need to think about the directions of the arrows in the DAG.
- Multiple DAGs could fit the same scatter plots if the variables were anonymous.

<img src="images/image10.jpeg" width="800" height="500" />

Understanding and modeling the causal relationships are essential for accurate data analysis and interpretation.

### Breaking the Fork: Estimating Causal Effects

To understand causal effects in our model, we need to generate a synthetic simulation and develop an estimator. In this example, we'll skip the testing stage to focus on more complex aspects.

#### Stratifying to Break the Fork

To estimate the causal effect of \( M \) (marriage rate), we need to "break the fork" by stratifying by the common cause \( A \) (age at marriage). By stratifying, we remove the association between \( M \) and \( D \) (divorce rate) caused by \( A \). Here's how:

- **Stratify by \( A \)**: Within each level of \( A \), the association between \( M \) and \( D \) caused by \( A \) is removed.
- **Average Across Levels of \( A \)**: This helps isolate the direct influence of \( M \) on \( D \).

<img src="images/image11.jpeg" width="600" height="400" />

#### Stratifying by a Continuous Variable

When \( A \) is continuous (like age at marriage), stratifying means examining the association between \( M \) and \( D \) for every value of \( A \). We use a function to capture these relationships:

- **Linear Regression**: A linear regression model can effectively stratify by incorporating \( A \) into the intercept. For each value of \( A \), the model gives a different line relating \( M \) and \( D \): the intercept shifts with \( A \), while the slope on \( M \) stays the same.

<img src="images/image12.jpeg" width="600" height="400" />
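To see what stratifying by a continuous variable does, here is a sketch on synthetic data in which \( A \) causes both \( M \) and \( D \), and \( M \) has no effect on \( D \) at all. It uses ordinary least squares rather than the Bayesian model developed in these notes, and the coefficients are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Synthetic fork: A -> M and A -> D, with no arrow from M to D.
a = rng.normal(size=n)
m = -0.7 * a + rng.normal(scale=0.7, size=n)
d = -0.6 * a + rng.normal(scale=0.8, size=n)

def ols(outcome, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    design = np.column_stack([np.ones_like(outcome), *predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef

slope_unadjusted = ols(d, m)[1]    # D ~ M: picks up the association created by A
slope_adjusted = ols(d, m, a)[1]   # D ~ M + A: stratifies by A

print(f"slope on M without A: {slope_unadjusted:.2f}")
print(f"slope on M with A:    {slope_adjusted:.2f}")
```

Without \( A \) in the model, the slope on \( M \) is far from zero even though \( M \) does nothing; once \( A \) is included, the slope falls to roughly zero.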

#### Example Linear Model

Consider a regression model where the expected divorce rate (\( E[D] \)) is:

<img src="images/image13.jpeg" width="600" height="400" /> <img src="images/image14.jpeg" width="600" height="400" />

- \( \alpha \): Intercept
- \( \beta_M \): Slope for marriage rate
- \( \beta_A \): Slope for age at marriage

This model stratifies by \( A \), making \( \alpha + \beta_A A \) the intercept, and allows us to measure \( \beta_M \) against this stratified baseline.
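Written out in symbols, a linear model consistent with the parameters listed above would look like this. The subscript \( i \), indexing states, is notation assumed here, since the original equation appears only in the figure.

```latex
\begin{aligned}
E[D_i] &= \alpha + \beta_M M_i + \beta_A A_i \\
       &= \underbrace{(\alpha + \beta_A A_i)}_{\text{intercept at age } A_i} + \beta_M M_i
\end{aligned}
```

The second line only regroups the terms, making explicit that each value of \( A \) gets its own intercept while \( \beta_M \) is shared across all of them.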

#### Standardizing Variables

Standardizing variables makes the mean zero and the standard deviation one, facilitating the development of priors and improving computational efficiency. Standardizing involves:

- Subtracting the mean and dividing by the standard deviation.

<img src="images/image15.jpeg" width="600" height="400" /> <img src="images/image16.jpeg" width="600" height="400" />
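A minimal sketch of standardization. The function name `standardize` and the example values are made up for illustration, and the sample standard deviation (`ddof=1`) is one reasonable choice among others.

```python
import numpy as np

def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

# Hypothetical median ages at marriage, for illustration only.
median_age = np.array([25.3, 26.1, 24.8, 27.0, 25.9])
z_scores = standardize(median_age)

print(z_scores.round(2))
print(f"mean = {z_scores.mean():.2f}, sd = {z_scores.std(ddof=1):.2f}")
```

After the transformation, a value of 1 means "one standard deviation above the average", which is what makes priors on the slopes easier to reason about.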

#### Developing Priors

For Bayesian models, selecting appropriate priors is crucial. Here’s a guideline for standardized variables:

- **Intercept (\( \alpha \))**: Centered on zero, because after standardization the average divorce rate is zero.
- **Slopes (\( \beta_M \), \( \beta_A \))**: Should be loose enough to allow for realistic variations but not so extreme that they imply implausible relationships.

We simulate prior predictive distributions to ensure these priors make sense.

#### Simulation Example

Simulating from \( \text{Normal}(0, 10) \) priors can result in extreme slopes, which are unrealistic. More realistic priors might be \( \text{Normal}(0, 1) \), allowing for both strong and weak associations without implausible extremes.

<img src="images/image18.jpeg" width="600" height="400" /> <img src="images/image19.jpeg" width="600" height="400" />
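One rough way to compare the two priors is to draw slopes from each and count how many are extreme. In this sketch, "extreme" is taken to mean that a one-standard-deviation change in a predictor shifts the outcome by more than two standard deviations; that threshold is an assumption made here, not one stated in the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def share_extreme_slopes(prior_sd, n_draws=10_000, limit=2.0):
    """Fraction of slopes drawn from Normal(0, prior_sd) with |slope| > limit."""
    beta = rng.normal(0.0, prior_sd, size=n_draws)
    return np.mean(np.abs(beta) > limit)

wide = share_extreme_slopes(10.0)   # Normal(0, 10)
tight = share_extreme_slopes(1.0)   # Normal(0, 1)

print(f"Normal(0, 10): {wide:.0%} of prior slopes exceed 2 in magnitude")
print(f"Normal(0, 1):  {tight:.0%} of prior slopes exceed 2 in magnitude")
```

With standardized variables, most slopes drawn from the wide prior are implausibly steep, while the tighter prior still allows strong associations but makes such slopes rare.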

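```python
# Open and read the transcript, lowercased for case-insensitive matching
with open("./transcript.txt") as f:
    txt = f.read().lower()

# Phrases marking the start and end of the excerpt to extract
# (the start phrase keeps the transcript's own spelling so that it matches)
start_phrase = "what we need to do surprize is"
end_phrase = "only one unknown in the posterior distribution"

start_index = txt.find(start_phrase)
end_index = txt.find(end_phrase)

# str.find returns -1 when a phrase is missing, which would silently
# produce a wrong slice, so fail loudly instead
if start_index == -1 or end_index == -1:
    raise ValueError("start or end phrase not found in transcript")

# Slice from the start phrase up to and including the end phrase
sliced_txt = txt[start_index:end_index + len(end_phrase)]

print(sliced_txt)
```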

#### Reading the Posterior Summary

The summary table already gives a sense of what is going on: the model has an intercept and two slopes. The accompanying plot (a precis plot, sometimes called a forest plot or caterpillar plot) shows:

- **Open circles**: The posterior mean of each unknown in the posterior distribution.
- **Bars**: 89% percentile intervals, sometimes called compatibility intervals.

Unsurprisingly, the intercept \( \alpha \) is centered on zero. It has to be, because standardizing the variables induces that. The two slopes are the focus of our interest. \( \beta_M \) is close to zero, and its interval spans both sides of zero. This does **not** mean that the causal effect of marriage rate is zero just because the interval includes zero.
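The posterior mean and 89% percentile interval described above can be computed directly from posterior samples. The draws below are synthetic stand-ins generated for illustration, not output from the fitted divorce model, and the name `beta_m_draws` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in posterior draws for one slope (synthetic, NOT the fitted model).
beta_m_draws = rng.normal(-0.05, 0.15, size=4_000)

def percentile_interval(draws, mass=0.89):
    """Central interval containing `mass` of the draws."""
    tail = (1.0 - mass) / 2.0 * 100.0
    lower, upper = np.percentile(draws, [tail, 100.0 - tail])
    return lower, upper

post_mean = beta_m_draws.mean()
lower, upper = percentile_interval(beta_m_draws)

print(f"mean = {post_mean:.2f}, 89% interval = [{lower:.2f}, {upper:.2f}]")
print("interval includes zero:", lower < 0.0 < upper)
```

An interval that includes zero still leaves room for small effects in either direction, so it is not evidence that the effect is exactly zero.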