<a href="https://colab.research.google.com/github/vanislekahuna/Statistical-Rethinking-PyMC/blob/main/Chp_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Statistical Rethinking: A Bayesian Course with Examples in Python**

# **Chapter 6 - The Haunted DAG & The Causal Terror**


## Key Takeaways

Jump to [*Section 6.5*](#scrollTo=sQ8T-TlwU9Ux)

<img src="https://media.architecturaldigest.com/photos/638a6d38cd02cbdd42979aaa/master/w_1600,c_limit/Wednesday_S1_E1_00_17_18_04R.jpg" width=1000 height=494>

[Source](https://www.architecturaldigest.com/story/wednesday-on-netflix-inside-the-castle-used-for-nevermore-academy)

## Intro

We all know that the most sensational scientific studies, the ones that make the biggest headlines, often turn out to be the least reliable. On the other hand, the more boring the topic, the more rigorous the result often seems to be. Why is that? You'd think that the things people care about most would be studied with more care, not less.

The reason for this is something called **Berkson's Paradox** (also called the **selection-distortion effect**), a phenomenon that occurs when a selection process, like peer review for a scientific journal, weighs multiple factors at once. In the case of science, these factors are trustworthiness (rigor and accuracy) and newsworthiness (how interesting or impactful a study is).

Imagine a group of reviewers who are told to pick the best proposals. They value trustworthiness and newsworthiness equally. The only way for a proposal to get picked is to score well overall. This means that a proposal that is not very trustworthy can still get chosen if it's extremely newsworthy. Likewise, a proposal that isn't very newsworthy can still be selected if it's very trustworthy. The code for **[Figure 6.1](#scrollTo=skHm5JwQHM83)** simulates this: it generates the proposals, marks the ones that got selected, and shows the negative correlation between trustworthiness and newsworthiness among the selected ones.

This selective process creates a fake negative relationship between the two factors that didn't exist in the beginning. Among the chosen studies, the most newsworthy ones tend to be the least trustworthy, and vice versa. It's like how a restaurant in a great location can have bad food and still stay in business. The good location "compensates" for the poor food.

As another illustration, consider musical ability and athletic ability. In the general population, there might be no correlation between the two. However, if you only study people who have been accepted into a highly competitive university that values both skills, you will find a negative correlation: the students who are not very musical must be exceptional athletes to have been accepted, and vice versa. This is **Berkson's paradox**, which is an example of **collider bias** caused by a **selection-distortion effect**.


This effect is a major problem for multiple regression, a statistical technique we use to understand relationships between variables. We might be tempted to just throw all our data into a regression model and expect it to figure out the important relationships, but this is a mistake. *Regression will not figure it out. Regression is indeed an oracle, but a cruel one that speaks in riddles and punishes you for asking bad questions.* The selection-distortion effect can happen inside a regression model when we add certain variables, which can lead us to believe that a negative relationship exists when it doesn't.

This chapter will teach you about the <u>three different pitfalls</u> of multiple regression:

1. **Multicollinearity**: When two or more of the variables you're using to predict something are highly correlated with each other. For example, suppose you wanted to predict a person's height using both their left and right leg lengths. Since a person's left and right legs are very similar in length, the model can't tell which one is more important, so it will suggest that neither one is a reliable predictor on its own. The model will still be good for making predictions, but you won't be able to understand which variables are actually important.

2. **Post-treatment Bias**: We'll cover this a little later.

3. **Collider Bias** (also known as **Selection-Distortion Effect**): This is a problem that can mislead you into thinking that there is a relationship between two variables when it is really just a consequence of how you set up your model.


At the end of this chapter, we will learn a framework that will help us choose which variables to include in our models to avoid these problems and make valid conclusions. However, remember that this framework can't replace the most important step: having a good understanding of what you are trying to model.

In [None]:
import warnings

import arviz as az
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns

from scipy import stats
from scipy.optimize import curve_fit

warnings.simplefilter(action="ignore", category=FutureWarning)

In [None]:
%config InlineBackend.figure_format = 'retina'
az.style.use("arviz-darkgrid")
az.rcParams["stats.hdi_prob"] = 0.89  # sets default credible interval used by arviz
np.random.seed(0)

#### Figure 6.1. Simulated science distortion.

#### Code 6.1

In [None]:
np.random.seed(3)
N = 200  # num grant proposals
p = 0.1  # proportion to select

# uncorrelated newsworthiness and trustworthiness
nw = np.random.normal(size=N)
tw = np.random.normal(size=N)

# select top 10% of combined scores
s = nw + tw  # total score
q = np.quantile(s, 1 - p)  # top 10% threshold
selected = s >= q
cor = np.corrcoef(tw[selected], nw[selected])
cor

In [None]:
# Figure 6.1
plt.scatter(nw[~selected], tw[~selected], lw=1, edgecolor="k", color=(0, 0, 0, 0))
plt.scatter(nw[selected], tw[selected], color="C0")
plt.text(0.8, 2.5, "selected", color="C0")

# correlation line
xn = np.array([-2, 3])
plt.plot(xn, tw[selected].mean() + cor[0, 1] * (xn - nw[selected].mean()))

plt.xlabel("newsworthiness")
plt.ylabel("trustworthiness")

#####################
### CODE ADDITION ###
#####################
# Adding the graph explanations from the textbook
plt.suptitle(
    x=1.32,
    y=.75,
    t="Figure 6.1. Why the most newsworthy studies \n \
    might be the least trustworthy. 200 research \n \
    proposals are ranked by combined \n \
    trustworthiness and newsworthiness. The top \n \
    10% are selected for funding. While there is \n \
    no correlation before selection, the two \n \
    criteria are strongly negatively correlated \n \
    after selection. The correlation here is -0.77",
    ma="left"
  )

## *Section 6.1* - Multicollinearity

It's a common temptation when building a regression model to include every possible predictor variable you can think of. But there are a few dangers to doing this. Here, we'll focus on a problem called multicollinearity.

**Multicollinearity** is a simple idea: it means that two or more of your predictor variables are strongly correlated with each other. For example, suppose you wanted to predict a person's height using the length of both their left and right legs. Since a person's legs are almost the same length, these two variables are highly correlated.

When you put both of these variables into a regression model, it can't really tell which one is more important. The model gets confused and will make it seem as though neither of the variables is a good predictor on its own, even if both are clearly related to the outcome.

The good news is that this doesn't break the model. It'll still work just fine for making predictions. The bad news is that it will be very frustrating for you because you won't be able to easily understand what's going on or which variables are actually driving the result.

#### **6.1.1. Multicollinear legs.**

To make this concept clear, we'll use a simple simulation to predict an individual's height using the length of both their left and right legs. Our simulation will create a group of 100 individuals, each with a simulated height and two slightly different leg lengths that are a proportion of their simulated height ranging from 0.4 to 0.5 plus a bit of random error. We'll show you how the model gets "vexed" when we try to use both leg lengths to predict height.

#### Code 6.2

In [None]:
N = 100  # number of individuals
height = np.random.normal(10, 2, N)  # sim total height of each
leg_prop = np.random.uniform(0.4, 0.5, N)  # leg as proportion of height
leg_left = leg_prop * height + np.random.normal(0, 0.02, N)  # sim left leg as proportion + error
leg_right = leg_prop * height + np.random.normal(0, 0.02, N)  # sim right leg as proportion + error

d = pd.DataFrame(
    np.vstack([height, leg_left, leg_right]).T,
    columns=["height", "leg_left", "leg_right"],
)  # combine into data frame

d.head()

#### Code 6.3

Now let's analyze our simulated data to predict height using the lengths of a person's left and right legs.

Before we do, let's think about what we expect to happen. In our simulation, a person's legs are about $45\%$ of their height, so with an average height of $10$ the average leg length is about $4.5$. If we were to use a single leg to predict height, we would expect the coefficient to be around $10 \div 4.5 \approx 2.2$. This means that for every one-unit increase in leg length, we would expect height to increase by about $2.2$ units.
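As a rough sanity check on that back-of-envelope number, the sketch below simulates heights and a single leg length with the same distributions as Code 6.2 (but without measurement error and with a larger sample, both of which are assumptions made here for stability) and fits an ordinary least-squares line with `np.polyfit`. It is only an illustration, not part of the book's code.

```python
import numpy as np

rng = np.random.default_rng(42)  # local generator; the seed is arbitrary
n = 10_000                       # larger than the notebook's N = 100

height = rng.normal(10, 2, n)                 # same height distribution as Code 6.2
leg = rng.uniform(0.4, 0.5, n) * height       # one leg, as a proportion of height

back_of_envelope = 10 / 4.5                   # average height / average leg length
slope, intercept = np.polyfit(leg, height, deg=1)  # least-squares line height ~ leg

print(f"back-of-envelope ratio: {back_of_envelope:.2f}")
print(f"fitted single-leg slope: {slope:.2f}")
```

The fitted slope should land in the neighbourhood of 2, which is the scale to keep in mind when reading the two-leg model.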

However, when we include both leg lengths in the model, things will get weird. We're going to use very vague priors (our initial beliefs about the parameters) so we can be sure that our results aren't being influenced by them. This will allow us to clearly see how multicollinearity messes with our ability to interpret the model.

In [None]:
with pm.Model() as m_6_1:
    a = pm.Normal("a", 10, 100)
    bl = pm.Normal("bl", 2, 10)
    br = pm.Normal("br", 2, 10)

    mu = a + bl * d["leg_left"].values + br * d["leg_right"].values
    sigma = pm.Exponential("sigma", 1)

    height = pm.Normal("height", mu=mu, sigma=sigma, observed=d["height"].values)

    m_6_1_trace = pm.sample(draws=100)
    idata_6_1 = az.extract(m_6_1_trace)  # az.extract supersedes the deprecated az.extract_dataset

In [None]:
az.summary(m_6_1_trace, round_to=2)

#### Code 6.4

Look at how strange the posterior means and 89% intervals are: the estimates for the left and right leg can land far apart from each other, and each one comes with a very wide interval, even though the two legs are nearly identical.

In [None]:
_ = az.plot_forest(m_6_1_trace, var_names=["a", "br", "bl", "sigma"], combined=True, figsize=[5, 2])

#### Figure 6.2. The posterior distribution of the highly correlated left and right leg parameters.

#### Code 6.5 & 6.6

(Because we used MCMC rather than `quap`, the posterior samples are already in `m_6_1_trace`.)

<br>

<br>

Even though our model seems "weird" and unsure of itself, it's actually giving us the correct answer. A multiple regression model answers a very specific question: *what is the value of knowing one predictor after already knowing the others?*

Since our two predictors (left and right leg length) are nearly identical, knowing one tells us almost everything about the other. The model gets confused and can't confidently assign a specific value to each leg's coefficient. Instead, it finds that many different combinations of values for the two legs are equally plausible. The **bivariate posterior distribution** will show this, revealing a range of possible answers rather than a single, clear one.

In [None]:
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=[7, 3])

br_post = m_6_1_trace.posterior["br"].values
bl_post = m_6_1_trace.posterior["bl"].values

# code 6.5
ax1.scatter(br_post, bl_post, alpha=0.05, s=20)
ax1.set_xlabel("br")
ax1.set_ylabel("bl")

# code 6.6
az.plot_kde(br_post + bl_post, ax=ax2)
ax2.set_ylabel("Density")
ax2.set_xlabel("sum of bl and br")

#####################
### CODE ADDITION ###
#####################
plt.suptitle(
    x=0.5,
    y=-0.06,
    t="Figure 6.2. Left: Posterior distribution of the association of each leg with \n \
    height, from model 6.1. Since both variables contain almost identical \n \
    information, the posterior is a narrow ridge of negatively correlated values. \n \
    Right: The posterior distribution of the sum of the two parameters is \n \
    centered on the proper association of either leg.",
    ma="left"
  );

#### Code 6.7

The plot from **[Figure 6.2](#scrollTo=m9Lw-SaBgz3C)** in our simulation shows that the plausible values for the left and right leg coefficients lie along a narrow, diagonal ridge (left). This means that if the coefficient for the left leg is large, the coefficient for the right leg must be small, and vice-versa. Since both leg variables contain almost the same information, there are countless combinations of these two coefficients that give the same predictions.

Think about it like this: your model is essentially trying to solve a problem with two variables, where those variables are almost identical. It's like having the equation:

<br>

$ y_i \sim \text{Normal}(\mu_i, \sigma) $

$ \mu_i = \alpha + \beta_1 x_i + \beta_2 x_i$

<br>

Since we can't tell the difference between $\beta_1$ and $\beta_2$, the model combines them. So, the equation effectively becomes:

<br>

$ y_i \sim \text{Normal}(\mu_i, \sigma) $

$ \mu_i = \alpha + ( \beta_1 + \beta_2 ) x_i$

<br>

The model can't figure out the individual values for $\beta_1$ and $\beta_2$, but it does a great job of figuring out their sum which is displayed in the right-hand graph of **[Figure 6.2](#scrollTo=m9Lw-SaBgz3C)**. The model correctly estimates that the sum of the two coefficients is a little over 2, which is what we expected. If you were to fit a model with only one of the legs, you'd get the same result. The takeaway here is that when multicollinearity is present, the model can still predict accurately, but it gets confused when it tries to separate the effects of the individual predictors.
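The same pattern can be seen without MCMC. The sketch below is an ordinary least-squares analogue (an assumption made here for speed, not the notebook's Bayesian model): it re-simulates the leg data, fits both legs at once, and uses the classical coefficient covariance $\hat\sigma^2 (X^\top X)^{-1}$ to compare the uncertainty of each slope with the uncertainty of their sum.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this illustration
n = 100

# Same data-generating process as Code 6.2
height = rng.normal(10, 2, n)
prop = rng.uniform(0.4, 0.5, n)
leg_left = prop * height + rng.normal(0, 0.02, n)
leg_right = prop * height + rng.normal(0, 0.02, n)

# Least-squares fit of height on both legs
X = np.column_stack([np.ones(n), leg_left, leg_right])
beta, *_ = np.linalg.lstsq(X, height, rcond=None)

# Classical covariance of the coefficient estimates
resid = height - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)

sd_left, sd_right = np.sqrt(cov[1, 1]), np.sqrt(cov[2, 2])
corr_lr = cov[1, 2] / (sd_left * sd_right)                   # correlation between the two slopes
sd_sum = np.sqrt(cov[1, 1] + cov[2, 2] + 2 * cov[1, 2])      # sd of (b_left + b_right)

print(f"sd of each slope: {sd_left:.2f}, {sd_right:.2f}")
print(f"correlation between slopes: {corr_lr:.3f}")
print(f"sum of slopes: {beta[1] + beta[2]:.2f} (sd {sd_sum:.2f})")
```

The individual slopes are very uncertain and almost perfectly negatively correlated, while their sum is pinned down tightly, which mirrors the ridge and the narrow density in Figure 6.2.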

In model `m_6_2` below, which uses only the left leg as a predictor, the posterior mean of `bl` is almost identical to the mean of the sum `bl + br` from `m_6_1`, which is plotted on the right-hand side of **[Figure 6.2](#scrollTo=m9Lw-SaBgz3C)**.

Even with random variations in our simulations, the key lesson of multicollinearity remains the same: when you include two highly correlated predictor variables in a model, it can get confused. The model's output isn't wrong; it's simply telling you that the question you asked (which variable is more important?) can't be answered with the given data. This is a good thing for a model to do.

Multicollinearity doesn't stop a model from making accurate predictions. It just prevents you from understanding the individual contribution of each variable.


In [None]:
with pm.Model() as m_6_2:
    a = pm.Normal("a", 10, 100)
    bl = pm.Normal("bl", 2, 10)

    mu = a + bl * d.leg_left.values
    sigma = pm.Exponential("sigma", 1)

    height = pm.Normal("height", mu=mu, sigma=sigma, observed=d.height.values)

    m_6_2_trace = pm.sample()

az.summary(m_6_2_trace, round_to=2)

#### **6.1.2. Multicollinear milk.**

The leg length example is a simple, clear-cut case of multicollinearity. In real-world data, the problem is more subtle. We might not realize that two variables are highly correlated and mistakenly think that our model's output means that neither variable is important.

#### Code 6.8

We'll now apply this concept to a real dataset: primate milk. We'll be using the `perc.fat` and `perc.lactose` variables to model the total energy content, `kcal.per.g`. Both of these variables are used to determine the total energy content of the milk, so it's a natural case where multicollinearity might arise.

In [None]:
d = pd.read_csv("https://raw.githubusercontent.com/vanislekahuna/Statistical-Rethinking-PyMC/refs/heads/main/Data/milk.csv", sep=";")


def standardise(series):
    """Standardize a pandas series"""
    return (series - series.mean()) / series.std()


d.loc[:, "K"] = standardise(d["kcal.per.g"])
d.loc[:, "F"] = standardise(d["perc.fat"])
d.loc[:, "L"] = standardise(d["perc.lactose"])

d.head()

#### Code 6.9.1

We'll start by modelling energy content (`kcal.per.g`) using the percentage of fat in the milk (`perc.fat`) as the sole predictor variable in a simple Bayesian linear regression model:

In [None]:
# kcal.per.g regressed on perc.fat
with pm.Model() as m_6_3:
    a = pm.Normal("a", 0, 0.2)
    bF = pm.Normal("bF", 0, 0.5)

    mu = a + bF * d.F.values
    sigma = pm.Exponential("sigma", 1)

    K = pm.Normal("K", mu, sigma, observed=d.K.values)

    m_6_3_trace = pm.sample()

az.summary(m_6_3_trace, round_to=2)

#### Code 6.9.2

Next, we'll similarly use the percentage of lactose (`perc.lactose`) as the predictor for energy content in another bivariate regression.

In [None]:
# kcal.per.g regressed on perc.lactose
with pm.Model() as m_6_4:
    a = pm.Normal("a", 0, 0.2)
    bL = pm.Normal("bL", 0, 0.5)

    mu = a + bL * d.L.values
    sigma = pm.Exponential("sigma", 1)

    K = pm.Normal("K", mu, sigma, observed=d.K.values)

    m_6_4_trace = pm.sample()

az.summary(m_6_4_trace, round_to=2)

#### Code 6.10

Notice how similar the two models are: the intercepts are nearly identical, and the slopes have similar magnitudes but opposite signs. The fat percentage slope (`bF`) is positive, while the lactose percentage slope (`bL`) is negative, so the two posteriors are essentially mirror images of one another.

Given how strongly each predictor is associated with the outcome on its own, we might conclude that both variables are reliable (and hopefully uncorrelated) predictors of energy content. However, watch what happens when we include both in the same **multiple Bayesian linear regression model**:

In [None]:
with pm.Model() as m_6_5:
    a = pm.Normal("a", 0, 0.2)
    bF = pm.Normal("bF", 0, 0.5)
    bL = pm.Normal("bL", 0, 0.5)

    mu = a + bF * d.F.values + bL * d.L.values
    sigma = pm.Exponential("sigma", 1)

    K = pm.Normal("K", mu, sigma, observed=d.K.values)

    m_6_5_trace = pm.sample()

az.summary(m_6_5_trace, round_to=2)

### Figure 6.3. A pairwise plot of fat and lactose percentage.

#### Code 6.11

When we look at the primate milk data, we see that fat percentage and lactose percentage are strongly and negatively correlated with each other. This means that they are nearly redundant as predictors similar to the leg-as-a-predictor-of-height example. While both are individually good at predicting the total energy, a model that includes both variables will have a difficult time separating their individual effects.

In [None]:
sns.pairplot(d.loc[:, ["kcal.per.g", "perc.fat", "perc.lactose"]])

#####################
### CODE ADDITION ###
#####################
plt.suptitle(
    x=1.3,
    y=.65,
    t="Figure 6.3. A pairwise plot of the total energy, \n \
    percent fat, and percent lactose variables \n \
    from the primate milk data. Percent fat and \n \
    percent lactose are strongly negatively \n \
    correlated with one another, providing \n \
    mostly the same information.",
    ma="left"
  );

The pairwise plot in **[Figure 6.3](#scrollTo=saYISxXI_3Yy)** shows how fat and lactose percentage are related. We can observe the following relationships among our variables:

1. Fat percentage is <u>positively correlated</u> with energy content, our target variable;

2. Lactose percentage is <u>negatively correlated</u> with energy content;

3. Our predictors, fat and lactose, are strongly <u>negatively correlated</u> with one another, meaning the two variables are almost redundant.

This strong correlation is a sign of **multicollinearity**, where a model's predictor variables contain almost the same information. This is why, when you use both variables, the model is unable to tell you the individual effect of each one. In this case, either predictor helps in predicting energy content, but neither helps much once you already know the other.

 >In the scientific literature, you might encounter a variety of suspect ways of coping with multicollinearity. Few of them take a causal perspective. Some fields actually teach students to inspect pairwise correlations before fitting a model, to identify and <u>mistakenly</u> drop highly correlated predictors. However, pairwise correlations are not the problem. It is the conditional associations—not correlations—that matter. And even then, the right thing to do will depend upon what is causing the collinearity. The associations within the data alone are not enough to decide what to do (McElreath, 2020, p.173)

Instead of just checking for correlations, we should think about what's actually causing this relationship. In this case, the negative correlation between fat and lactose is likely due to a biological tradeoff:
1. A species' milk can be either watery and high in sugar (lactose);
2. Or fatty and high in energy, but not both.  

This suggests a causal model where a hidden variable, like milk density (D), influences both fat (F) and lactose (L), which in turn affect total kilocalories (K). Since we can't measure milk density, we're stuck with fat and lactose which are strongly correlated.

This problem of not being able to estimate a parameter is called **non-identifiability**. It means that even with a correctly coded model, the data might not contain enough information to give us a clear answer about a specific parameter. When this happens, a Bayesian model will give you a posterior distribution that looks very similar to your prior, which is a sign that *the model didn't learn much from the data*.
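The extreme case makes the idea concrete. In the sketch below (an illustrative toy example, not the milk data), the same predictor is entered twice, so the design matrix loses a rank and any two coefficient vectors with the same slope sum produce identical predictions. The specific coefficient values are arbitrary choices made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
x = rng.normal(size=50)

# Intercept plus the same predictor entered twice
X = np.column_stack([np.ones_like(x), x, x])

# Only two linearly independent columns, although there are three
rank = np.linalg.matrix_rank(X)

# Two very different coefficient vectors whose slopes both sum to 2
beta_a = np.array([1.0, 2.0, 0.0])
beta_b = np.array([1.0, -3.0, 5.0])

# They imply exactly the same fitted values, so the data cannot tell them apart
same_predictions = np.allclose(X @ beta_a, X @ beta_b)

print(f"columns: {X.shape[1]}, rank: {rank}")
print(f"identical predictions: {same_predictions}")
```

With real data such as fat and lactose the predictors are only nearly redundant, so the parameters are weakly rather than completely unidentified.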

In [None]:
# Create a directed graph
G = nx.DiGraph()

# Add nodes
nodes = ['L', 'D', 'F', 'K']
G.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [('D', 'L'), ('D', 'F'), ('L', 'K'), ('F', 'K')]
G.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    'L': (0, 1),
    'D': (1, 1),
    'F': (2, 1),
    'K': (1, 0)
}

# Draw the graph
plt.figure(figsize=(8, 6))
nx.draw(G, pos, with_labels=True, node_color='lightblue',
        node_size=1500, font_size=16, font_weight='bold',
        arrows=True, arrowsize=20, edge_color='black')

# plt.title("Directed Graph: D → L, D → F, L → K, F → K")
plt.axis('off')
plt.show()

**Rethinking: Identification guaranteed; comprehension up to you.** We don't worry about **non-identifiability** as much in Bayesian modelling as other statistical approaches do. As long as the posterior distribution is "proper" (meaning its probabilities add up to 1), all the parameters are technically "identified."

However, this technicality doesn't mean the results are easy to understand. When parameters are weakly identified, the model has trouble separating their effects, and the resulting posterior distribution will be too spread out to be useful. So it's more accurate to talk about *weakly identified parameters* in a Bayesian context, where the model works, but the answer it gives is too vague to be helpful.

#### Code 6.12

**Overthinking: Simulating collinearity.**

The code below simulates a scenario to demonstrate how increasing the correlation between two predictor variables dramatically increases the **uncertainty** of their estimated effects, which is the core problem of **multicollinearity** in a linear model.

**Methodology**

1. `sim_coll(r)`: This function simulates a new predictor variable (let's call it $X_{new}$) that is correlated with your existing `perc.fat` variable by a specific correlation value $r$.
    - It then fits a linear model to predict the outcome (`kcal.per.g`) using both `perc.fat` and the new, correlated predictor $X_{new}$.
    - The function returns the **standard deviation** of the posterior distribution for the slope relating `perc.fat` to the outcome. This standard deviation is a direct measure of the model's **uncertainty** about that slope.

2. `rep_sim_coll(r, n=100)`: This function repeats the `sim_coll` process 100 times for a given correlation r and calculates the average standard deviation across those 100 runs. Averaging helps produce a stable estimate of the uncertainty.

3. Final Plot: The main part of the script uses `r_seq` to test correlation values from 0 (no correlation) up to 0.99 (just short of perfect correlation).

    - It collects the average standard deviation for each correlation level.
    - The final scatter plot visualizes the relationship between the degree of correlation on the x-axis and the resulting model uncertainty (**standard deviation**) on the y-axis.


<br>

**Expected Outcome**

When you run this code, the plot should show that as the correlation between the two predictors approaches 1.0, the standard deviation (model uncertainty) *increases sharply*, forming a curve that shoots up dramatically. This illustrates that even a small increase in correlation near the high end can make your individual slope estimates highly unreliable.

Because the code uses implicit flat priors, the result will exaggerate this effect. Using more realistic priors would slow down the inflation of the standard deviation.
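For a complementary closed-form view, classical least squares with two predictors gives a standard result: holding the residual variance fixed, the standard error of a slope is inflated by a factor of $1 / \sqrt{1 - r^2}$ when the other predictor has correlation $r$ with it. The sketch below just evaluates that factor; the helper name `sd_inflation` is made up for this illustration, and the formula applies to the flat-prior setting, not to models with informative priors.

```python
import numpy as np


def sd_inflation(r):
    """Factor by which a slope's standard error grows when a second
    predictor with correlation r is added (two-predictor least squares)."""
    r = np.asarray(r, dtype=float)
    return 1.0 / np.sqrt(1.0 - r**2)


r_values = np.array([0.0, 0.5, 0.9, 0.99])
factors = sd_inflation(r_values)

for r, f in zip(r_values, factors):
    print(f"r = {r:.2f}  ->  slope sd inflated by {f:.2f}x")
```

The factor stays close to 1 for moderate correlations and only takes off as $r$ approaches 1, which is the shape to look for in the simulated curve.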

In [None]:
def mv(x, a, b, c):
    return a + x[0] * b + x[1] * c


def sim_coll(r=0.9):
    x = np.random.normal(loc=r * d["perc.fat"], scale=np.sqrt((1 - r**2) * np.var(d["perc.fat"])))
    _, cov = curve_fit(mv, (d["perc.fat"], x), d["kcal.per.g"])
    return np.sqrt(np.diag(cov))[1]  # index 1 is b, the slope for perc.fat (index -1 would be the simulated x)


def rep_sim_coll(r=0.9, n=100):
    return np.mean([sim_coll(r) for i in range(n)])


r_seq = np.arange(0, 1, 0.01)
stdev = list(map(rep_sim_coll, r_seq))

plt.scatter(r_seq, stdev)
plt.xlabel("correlation")
plt.ylabel("standard deviation of slope");

## *Section 6.2* - Post-treatment Bias

Researchers routinely worry about **omitted variable bias**, the mistaken inferences that arise when you leave out an important predictor variable. However, we also need to worry about **post-treatment bias**, the mistaken inferences that arise from including a variable that is a consequence of another variable in the model. This kind of bias can ruin any study, whether it's a carefully controlled experiment or an observational study. You should never blindly add variables to a model.

To illustrate this issue, consider a plant experiment where we're studying whether applying an *anti-fungal soil treatment* (predictor) increases a plant's *final height* (target). In this experiment, we'll first measure the plant's initial height ($t = 0$), apply the treatment, check for the presence of the fungus, then measure the final height ($t = 1$).

In this example, the fungus is a **post-treatment effect** because it's a result of whether the treatment worked or not. If your goal is to determine the causal effect of the treatment on final height, you must not include the presence of fungus in your model. By including the fungus, you would be blocking the pathway through which the treatment exerts its effect.


#### Code 6.13

In [None]:
# number of plants
N = 100
# simulate initial heights
h0 = np.random.normal(10, 2, N)
# assign treatments and simulate fungus and growth
treatment = np.repeat([0, 1], N // 2)  # integer division: np.repeat needs an integer repeat count
fungus = np.random.binomial(n=1, p=0.5 - treatment * 0.4, size=N)
h1 = h0 + np.random.normal(5 - 3 * fungus, size=N)
# compose a clean data frame
d = pd.DataFrame.from_dict({"h0": h0, "h1": h1, "treatment": treatment, "fungus": fungus})

az.summary(d.to_dict(orient="list"), kind="stats", round_to=2)

#### **6.2.1. A prior is born.**

When we build a model, we should use our scientific knowledge to set up our parameters. For example, when looking at height change, we might want to model the *final height as a proportion of the initial height*.

$ h_{1,i} \sim \text{Normal}( \mu_i, \sigma ) $

$ \mu_i = h_{0, i} \times p $

In the linear model above, $h_{0,i}$ is the initial height and $h_{1,i}$ is the final height, so we model the expected final height as the initial height multiplied by a proportion parameter, $p$. Since plants usually grow, we expect $p$ to be greater than 1, but the prior must also allow $p$ to be less than 1 in case a plant dies, while ensuring that $p$ is always positive. A **Log-Normal distribution** is a good choice for this parameter because it naturally keeps the values above zero.
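To see what a log-normal prior implies on the proportion scale, note that if $\log p \sim \text{Normal}(0, 0.25)$ then any quantile of $p$ is just the exponential of the matching normal quantile. The sketch below uses only the standard library to compute the median and the central 89% interval; the 5.5% and 94.5% cut-offs are chosen here to match the notebook's `hdi_prob = 0.89` setting.

```python
from math import exp
from statistics import NormalDist

# Distribution of log(p) under the Log-Normal(0, 0.25) prior
log_p = NormalDist(mu=0.0, sigma=0.25)

# Quantiles of p are exponentiated quantiles of log(p)
lower = exp(log_p.inv_cdf(0.055))
upper = exp(log_p.inv_cdf(0.945))
median = exp(log_p.mean)

print(f"median of p: {median:.2f}")
print(f"central 89% interval of p: {lower:.2f} to {upper:.2f}")
```

Because the exponential of any real number is positive, every value of $p$ drawn from this prior is above zero, and the interval straddles 1 so both shrinkage and growth remain plausible.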

#### Code 6.14

With the prior we've specified, we expect anything from roughly 40% shrinkage (a plant ending at about 60% of its original height) up to about 50% growth (a plant reaching 150% of its original height).

In [None]:
sim_p = np.random.lognormal(0, 0.25, int(1e4))

az.summary(sim_p, kind="stats", round_to=2)

#### Code 6.15

Let's now use this prior in our model `m_6_6` to estimate the average growth in this experiment:

In [None]:
with pm.Model() as m_6_6:
    p = pm.Lognormal("p", 0, 0.25)

    mu = p * d.h0.values
    sigma = pm.Exponential("sigma", 1)

    h1 = pm.Normal("h1", mu=mu, sigma=sigma, observed=d.h1.values)

    m_6_6_trace = pm.sample()

az.summary(m_6_6_trace, round_to=2)

#### Code 6.16

So with our previous model showing about 40% growth on average, let's now add our treatment ($T_i$) and fungus ($F_i$) variables to measure their impact based on the *change in proportion growth*. In our new `m_6_7` model below, the proportion of growth ($p$) is now a function of our predictor values similar to linear models we've seen before:

<br>

$ h_{1,i} \sim \text{Normal}( \mu_i, \sigma ) $

$ \mu_i = h_{0, i} \times p $

$ p = \alpha + \beta_T T_i + \beta_F F_i $

$ \alpha \sim \text{Log-Normal}(0, 0.25) $

$ \beta_T \sim \text{Normal}(0, 0.5) $

$ \beta_F \sim \text{Normal}(0, 0.5) $

$ \sigma \sim \text{Exponential}(1) $

In [None]:
with pm.Model() as m_6_7:
    a = pm.Lognormal("a", 0, 0.2)  # log-normal, as in the model definition: keeps the baseline proportion positive
    bt = pm.Normal("bt", 0, 0.5)
    bf = pm.Normal("bf", 0, 0.5)

    p = a + bt * d.treatment.values + bf * d.fungus.values

    mu = p * d.h0.values
    sigma = pm.Exponential("sigma", 1)

    h1 = pm.Normal("h1", mu=mu, sigma=sigma, observed=d.h1.values)

    m_6_7_trace = pm.sample()

az.summary(m_6_7_trace, round_to=2)

Now that we've exposed our priors to the data, what do the results tell us?

1. The intercept ($\alpha$) has a mean of about 1.45, which means a plant with no treatment and no fungus is expected to reach about 1.45 times its initial height, i.e. to grow by about **45%**.

2. The coefficient for treatment ($\beta_T$) is almost zero (mean of 0.04), and its 89% interval includes zero (it goes from -0.01 to 0.10). Taken at face value, this result suggests that the anti-fungal treatment DID NOT affect plant growth.

3. The coefficient for fungus ($\beta_F$) is negative (mean of -0.27), and its interval is entirely negative (-0.32 to -0.23). This result suggests that the presence of fungus hurt growth.

But here's the problem: We know the treatment actually works, because we designed the simulation to make it so! So what was the issue with our model?

#### **6.2.2. Blocked by consequence.**

What happened is that we committed **post-treatment bias**. By including the fungus variable in the model, we "blocked" the path through which the treatment exerts its effect. The treatment ($T$) works by reducing the fungus ($F$), which then allows the plant to grow. In other words, the fungus is a consequence of the treatment, which is why it's considered a "**post-treatment variable**": conditioning on it prevented the model from seeing the treatment's true causal effect on height.

When we included fungus in the model, we made the model answer the wrong question: "*Does the soil treatment matter, once we already know if the plant developed fungus?*" The answer was "no" because the treatment had already done its job by reducing the fungus. This is why the model's estimate for the treatment effect ($\beta_T$) ended up near zero: we blocked the causal path.

#### Code 6.17

To correctly measure the true causal impact of the treatment on final growth, we must omit the post-treatment variable, fungus ($F$), and with it the coefficient $\beta_F$, from the model:

In [None]:
with pm.Model() as m_6_8:
    a = pm.Lognormal("a", 0, 0.2)  # log-normal keeps the baseline proportion positive, matching the model definition
    bt = pm.Normal("bt", 0, 0.5)

    p = a + bt * d.treatment.values

    mu = p * d.h0.values
    sigma = pm.Exponential("sigma", 1)

    h1 = pm.Normal("h1", mu=mu, sigma=sigma, observed=d.h1.values)

    m_6_8_trace = pm.sample()

az.summary(m_6_8_trace, round_to=2)

When we fit the new model without the fungus variable, the impact of treatment is clearly positive, which is what we expected! It makes sense to control for pre-treatment things like initial height, but including variables that happen *after* the treatment can hide the very effect you're trying to measure. You can still use the model that includes both variables to learn how the treatment works (its **mechanism**), but for a correct measure of the treatment's **causal effect**, you must leave the post-treatment variable out.
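The contrast between the two models can also be checked quickly without PyMC. The sketch below is a rough ordinary-least-squares analogue, not the notebook's model: it regresses raw growth (`h1 - h0`) on the predictors instead of modelling a proportion of `h0`. The data-generating values (fungus probability `0.5 - 0.4 * treatment`, growth `5 - 3 * fungus`) and the helper `ols` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(71)
n = 10_000  # deliberately large so the estimates are stable

# Assumed data-generating process (illustrative values):
# treatment lowers the chance of fungus, and fungus lowers growth.
h0 = rng.normal(10, 2, n)
treatment = np.repeat([0, 1], n // 2)
fungus = rng.binomial(1, 0.5 - 0.4 * treatment)
h1 = h0 + rng.normal(5 - 3 * fungus, 1.0)
growth = h1 - h0


def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


bt_total = ols(growth, treatment)[1]                 # fungus omitted
bt_given_fungus = ols(growth, treatment, fungus)[1]  # fungus included

print(f"treatment coefficient, fungus omitted : {bt_total:.2f}")
print(f"treatment coefficient, fungus included: {bt_given_fungus:.2f}")
```

Under these assumptions the treatment coefficient is clearly positive when fungus is left out and close to zero once fungus is included, which is the same pattern the two PyMC models show.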

#### **6.2.3. Fungus and *d*-separation.**

#### Code 6.18

This code follows the example of [ksachdeva's Tensorflow version of Rethinking](https://ksachdeva.github.io/rethinking-tensorflow-probability/), which uses [`causalgraphicalmodels`](https://github.com/ijmbarr/causalgraphicalmodels) for graph drawing and analysis instead of `dagitty`. The cells marked `MY CODE` below draw the DAGs with `networkx`.

To visualize this causal relationship, it helps to structure it in terms of a DAG:

In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
plant_dag = nx.DiGraph()

# Add nodes
nodes = ['H0', 'H1', 'F', 'T']
plant_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [('H0', 'H1'), ('F', 'H1'), ('T', 'F')]
plant_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    'H0': (0, 1),
    'H1': (1, 1),
    'F': (2, 1),
    'T': (3, 1)
}

# Draw the graph
plt.figure(figsize=(8, 2))
nx.draw(plant_dag, pos, with_labels=True, node_color='white',
        node_size=1500, font_size=16, font_weight='bold',
        arrows=True, arrowsize=20, edge_color='black')

plt.axis('off')
plt.show()

Our plant experiment has a simple causal path (i.e. DAG): the Treatment ($T$) influences the Fungus ($F$), which in turn influences the final Height ($H_1$). Initial Height ($H_0$) also influences final Height ($H_1$).

When we include the post-treatment variable Fungus ($F$) in our model, we essentially block the causal path from the Treatment ($T$) to the final Height ($H_1$). The technical term for this blocking is **d-separation** (where "d" stands for directional).

**D-separation** means that two variables in a causal graph are independent of each other once we condition on some set of other variables, because every path connecting them is blocked. In our case:
- When we condition on $F$ (by including it in the model), we d-separate the Treatment ($T$) from the final Height ($H_1$).
- Conditioning on $F$ effectively blocks the directed path: $T \rightarrow F \rightarrow H_1$

Why does this happen? Because all the information the Treatment has about the final Height is channeled *through* the Fungus. Once we already know the Fungus status ($F$), learning the Treatment ($T$) provides no additional, independent information about the final Height ($H_1$).

The DAG below shows what the model that leaves the post-treatment variable out estimates: the total causal effect of the treatment on final height.

In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
tx_dag = nx.DiGraph()

# Add nodes
nodes = ['H0', 'H1', 'T']
tx_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [('H0', 'H1'), ('T', 'H1')]
tx_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    'H0': (0, 1),
    'H1': (1, 1),
    'T': (2, 1)
}

# Draw the graph
plt.figure(figsize=(8, 1))
nx.draw(tx_dag, pos, with_labels=True, node_color='white',
        node_size=1500, font_size=16, font_weight='bold',
        arrows=True, arrowsize=20, edge_color='black')

plt.title("The DAG that correctly measures the treatment's causal effect")
plt.axis('off')
plt.show()

#### Code 6.19

Credit [ksachdeva](https://ksachdeva.github.io/rethinking-tensorflow-probability/)

In [None]:
# all_independencies = plant_dag.get_all_independence_relationships()
# for s in all_independencies:
#     if all(
#         t[0] != s[0] or t[1] != s[1] or not t[2].issubset(s[2])
#         for t in all_independencies
#         if t != s
#     ):
#         print(s)

When we query the conditional independencies implied by our initial plant experiment DAG, we'll find a few implications. Two of the implications are ways to test if our original DAG is accurate:

1. Fungus is *independent* of the Initial Height $( F \perp\perp H_0 )$; and

2. Initial Height is *independent* of the Treatment $(H_0 \perp\perp T )$.

These say that the plant's initial height shouldn't be associated with the treatment or the fungus status, which makes sense because $H_0$ was measured before anything else happened.
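These implications can be verified mechanically. The sketch below is a minimal d-separation test written with plain Python dictionaries, so it does not depend on `causalgraphicalmodels` or on a particular `networkx` version. It uses the standard moral-graph criterion: keep only the ancestors of the variables involved, connect every pair of parents that share a child, drop the arrow directions, remove the conditioning set, and check whether any path is left. The names `parents`, `ancestors_of`, and `d_separated` are hypothetical helpers introduced here.

```python
from itertools import combinations

# Parents of each node in the plant DAG: H0 -> H1 <- F <- T
parents = {"H0": set(), "T": set(), "F": {"T"}, "H1": {"H0", "F"}}


def ancestors_of(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def d_separated(x, y, given, parents):
    """True if x and y are d-separated by the set `given` (moral-graph test)."""
    keep = ancestors_of({x, y} | set(given), parents)
    # Moralize: link each node to its parents and "marry" co-parents.
    neighbours = {node: set() for node in keep}
    for child in keep:
        pars = parents[child] & keep
        for p in pars:
            neighbours[child].add(p)
            neighbours[p].add(child)
        for p, q in combinations(pars, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)
    # Remove the conditioning set, then search for any remaining path.
    blocked = set(given)
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in blocked:
            continue
        seen.add(node)
        stack.extend(neighbours[node] - blocked)
    return True


print(d_separated("F", "H0", set(), parents))   # True: F is independent of H0
print(d_separated("H0", "T", set(), parents))   # True: H0 is independent of T
print(d_separated("H1", "T", set(), parents))   # False: T reaches H1 through F
print(d_separated("H1", "T", {"F"}, parents))   # True: conditioning on F blocks the path
```

The last two lines restate the post-treatment problem in graph terms: $T$ and $H_1$ are connected until we condition on $F$.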

The rule about not including post-treatment variables applies everywhere but it's easier to follow in experiments where you know the order of events. However, experiments have their own traps. For instance, conditioning on a post-treatment variable can not only mask a true effect (as we saw before), it can also *fool you into thinking an effect exists when it doesn't*.

Consider a new scenario with this DAG which features an unobserved common cause $(M)$:

In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
moist_dag = nx.DiGraph()

# Add nodes
nodes = ['H0', 'H1', 'M', 'F', 'T']
moist_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [('H0', 'H1'), ('M', 'H1'), ('M', 'F'), ('T', 'F')]
moist_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    'H0': (0, 1),
    'H1': (1, 1),
    'M': (2, 0),
    'F': (3, 1),
    'T': (4, 1)
}


# Drawing the graph
plt.figure(figsize=(8, 3))

# Adding the original nodes only first
standard_nodes = ['H0', 'H1', 'F', 'T']
nx.draw(
    moist_dag,
    pos,
    nodelist=standard_nodes,
    with_labels=True,
    node_color='white',
    node_size=1500,
    font_size=16,
    font_weight='bold',
    arrows=True,
    arrowsize=20,
    edge_color='black'
  )

# Adding the 'unobserved' node
latent_node = ['M']
nx.draw(
    moist_dag,
    pos,
    nodelist=latent_node,
    with_labels=True,
    node_color='grey',
    node_size=1500,
    font_size=16,
    font_weight='bold',
    arrows=True,
    arrowsize=20,
    edge_color='black'
  )

plt.axis('off')
plt.show()

In this new setup:

- The Treatment $(T)$ still influences the Fungus $(F)$.
- However, the Fungus $(F)$ DOES NOT affect the final Height $(H_1)$.
- A new, unobserved variable, **Moisture** $(M)$, affects both the Fungus $(F)$ and the final Height $(H_1)$.

If we run a model of $H_1$ on $T$ (without including $F$), we correctly find *no association* between the treatment and growth. But if we *include the Fungus $(F)$* in the model, suddenly it will look like the treatment has an effect on height!

This happens because the Fungus $(F)$ and the final Height $(H_1)$ share a common cause: Moisture $(M)$. When you include $F$ in the model, you create a misleading statistical connection between $T$ and $H_1$, making you believe the treatment works when it doesn't.

#### Code 6.20

We'll now run a simulation that is designed this way to prove this false association by generating 1,000 fictional plants and simulating their growth and conditions according to the new DAG where Fungus $(F)$ does not directly affect Final Height $(H_1)$:

In [None]:
# The number of plants
N = 1000

# Generates initial plant heights centered at 10, with a standard deviation of 2.
h0 = np.random.normal(10, 2, N)

# Assigns half the plants to the "No Treatment" group (0) and the other half to the "Treatment" group (1).
treatment = np.repeat([0, 1], N // 2)  # integer division: np.repeat needs an integer count

# Simulates Moisture as a binary variable (0 or 1), where 50% of the plants have high moisture (M=1).
M = np.random.binomial(1, 0.5, size=N)  # assumed probability 0.5 here, as not given in book

# Simulates the presence of fungus (1 for yes, 0 for no).
# Logic: The probability (p) of getting fungus is decreased by the treatment (−0.4) and increased by high M (moisture, +0.4).
# This implements the T→F and M→F causal links.
fungus = np.random.binomial(n=1, p=0.5 - treatment * 0.4 + 0.4 * M, size=N)

# Simulates the final height.
# Logic: H1 is determined by H0 plus some growth. The growth is heavily influenced by Moisture (+3 × M).
# This implements the H0→H1 and M→H1 causal links.
# Crucially, the treatment (T) and fungus (F) variables are intentionally excluded here
# Note: 5 and 3 were arbitrarily chosen in the simulation code.
h1 = h0 + np.random.normal(5 + 3 * M, size=N)

d2 = pd.DataFrame.from_dict({"h0": h0, "h1": h1, "treatment": treatment, "fungus": fungus})
az.summary(d2.to_dict(orient="list"), kind="stats", round_to=2)

Now let's re-run `m_6_6` and `m_6_7` on this `d2` dataset. We'll see that including fungus confounds inference about the treatment by making it seem that it helped the plants even though it had no effect. This result raises the question: *why should conditioning on $F$ have this effect when only $M$ influences growth?* We'll examine this issue more closely in the next section when we cover collider bias.

In [None]:
with pm.Model() as m_6_6_v2:
    p = pm.Lognormal("p", 0, 0.25)

    mu = p * d2.h0.values
    sigma = pm.Exponential("sigma", 1)

    h1 = pm.Normal("h1", mu=mu, sigma=sigma, observed=d2.h1.values)

    m_6_6_v2_trace = pm.sample()

az.summary(m_6_6_v2_trace, round_to=2)

In [None]:
with pm.Model() as m_6_7_v2:
    a = pm.Normal("a", 0, 0.2)
    bt = pm.Normal("bt", 0, 0.5)
    bf = pm.Normal("bf", 0, 0.5)

    p = a + bt * d2.treatment.values + bf * d2.fungus.values

    mu = p * d2.h0.values
    sigma = pm.Exponential("sigma", 1)

    h1 = pm.Normal("h1", mu=mu, sigma=sigma, observed=d2.h1.values)

    m_6_7_v2_trace = pm.sample()

az.summary(m_6_7_v2_trace, round_to=2)
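As a cross-check that does not rely on sampling, the sketch below repeats the same data-generating process with a larger sample and fits three least-squares regressions of growth (`h1 - h0`) on the predictors. This is a simplified stand-in for the PyMC models above, not a replacement for them; the sample size, the seed, and the helper `ols` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 20_000  # larger than N = 1000 above, to reduce simulation noise

treatment = np.repeat([0, 1], n // 2)
moisture = rng.binomial(1, 0.5, n)
fungus = rng.binomial(1, 0.5 - 0.4 * treatment + 0.4 * moisture)
growth = rng.normal(5 + 3 * moisture, 1.0)  # h1 - h0; no treatment or fungus term


def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


bt_alone = ols(growth, treatment)[1]
bt_with_fungus = ols(growth, treatment, fungus)[1]
bt_with_moisture = ols(growth, treatment, fungus, moisture)[1]

print(f"treatment only               : {bt_alone:.2f}")
print(f"treatment + fungus           : {bt_with_fungus:.2f}")
print(f"treatment + fungus + moisture: {bt_with_moisture:.2f}")
```

With treatment as the only predictor the coefficient sits near zero, as it should. Adding fungus makes the treatment look beneficial, and the illusion disappears again once moisture is also included, which is only possible here because the simulation lets us observe $M$.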

## *Section 6.3* - Collider Bias

We started this chapter by noting that selection processes, like journal review, can make trustworthy studies look less newsworthy (and vice-versa) just by caring about both factors. This same problem, known as **collider bias**, can happen inside a statistical model and seriously distort your conclusions.

Let's look at the grant funding process again. **Trustworthiness** $(T)$ and **Newsworthiness** $(N)$ are independent in the general population of submitted proposals, but both influence **Selection** $(S)$ for funding.

The node $S$ is a **collider** because two arrows point into it. When you *condition on a collider* (by analyzing only the funded proposals), *it creates a misleading association between its causes*. For instance, if you see a selected proposal has low Trustworthiness $(T)$, you must infer it had high Newsworthiness $(N)$—otherwise, it wouldn't have met the funding threshold. This is how a fake negative correlation between $T$ and $N$ is created.

This same selection phenomenon happens when you include a collider variable as a predictor in a regression. Let's illustrate this collider bias using a simple DAG to model the relationship:

In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
funding_dag = nx.DiGraph()

# Add nodes
nodes = ['T', 'S', 'N']
funding_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [('T', 'S'), ('N', 'S')]
funding_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    'T': (0, 1),
    'S': (1, 1),
    'N': (2, 1)
}

# Draw the graph
plt.figure(figsize=(8, 1))
nx.draw(funding_dag, pos, with_labels=True, node_color='white',
        node_size=1500, font_size=16, font_weight='bold',
        arrows=True, arrowsize=20, edge_color='black')

plt.axis('off')
plt.show()
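To see that adding a collider as a predictor has the same effect as selecting on it, the sketch below simulates independent trustworthiness and newsworthiness scores and regresses one on the other, with and without the selection indicator. The selection rule (combined score above 1), the sample size, and the helper `ols` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1914)
n = 20_000

trust = rng.normal(size=n)  # trustworthiness, T
news = rng.normal(size=n)   # newsworthiness, N (independent of T by construction)
# Hypothetical selection rule: fund proposals whose combined score exceeds 1.
selected = (trust + news > 1).astype(float)


def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


slope_marginal = ols(news, trust)[1]           # N ~ T
slope_given_s = ols(news, trust, selected)[1]  # N ~ T + S

print(f"slope of N on T, S omitted : {slope_marginal:.2f}")
print(f"slope of N on T, S included: {slope_given_s:.2f}")
```

The first slope is close to zero because the two scores really are independent. The second is negative: once the model knows whether a proposal was selected, a higher trustworthiness score implies a lower newsworthiness score.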

#### **6.3.1. Collider of false sorrow.**

Let's now turn our attention to the question of whether aging influences happiness. If we conducted a large survey of people's happiness, would we find that age is associated with happiness? If so, is this a causal association? Here we're trying to demonstrate how controlling for a plausible confound of happiness, marriage status, can actually bias our model's inference about the influence of age.

Suppose, in reality, your average **Happiness** $(H)$ is set at birth and *never changes* with **Age** $(A)$. However, let's add the influence of marriage rate into the mix by adding the following clauses:

1. **Happier** people $(H)$ are more likely to get **Married** $(M)$.

2. The older you get in **Age** $(A)$, the more likely you are to get **Married** $(M)$.


This makes **Married** $(M)$ a collider. Even though Age and Happiness are not causally related, if we *condition on marriage* (by including $M$ in our regression), it will create a fake statistical link between $A$ and $H$. This could fool us into believing that happiness changes with age when it does not.

In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
happiness_dag = nx.DiGraph()

n1 = 'H'
n2 = 'M'
n3 = 'A'

# Add nodes
nodes = [n1, n2, n3]
happiness_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [(n1, n2), (n3, n2)]
happiness_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    n1: (0, 1),
    n2: (1, 1),
    n3: (2, 1)
}

# Draw the graph
plt.figure(figsize=(8, 1))
nx.draw(happiness_dag, pos, with_labels=True, node_color='white',
        node_size=1500, font_size=16, font_weight='bold',
        arrows=True, arrowsize=20, edge_color='black')

plt.axis('off')
plt.show()

#### Code 6.21

To prove this, we can run a simulation where we know the truth: happiness is constant. The simulation we'll call `sim_happiness` is designed to do the following:

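1. Each year, 20 people are born with happiness values spread uniformly from -2 to 2, and each person's happiness never changes;

2. From age 18 onward, an unmarried person can marry each year, with odds that increase with their happiness. Because those yearly chances accumulate, older people are more likely to be married;

3. Once married, an individual stays married (no divorce);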

4. Once an individual hits the age of 65, they move to Spain and leave the sample;

5. We'll be running this simulation for 1000 years.

If our statistical procedure can't find the truth in this controlled environment, we definitely shouldn't trust it with real-world data.

In [None]:
def inv_logit(x):
    return np.exp(x) / (1 + np.exp(x))


def sim_happiness(N_years=100, seed=1234):
    np.random.seed(seed)

    popn = pd.DataFrame(np.zeros((20 * 65, 3)), columns=["age", "happiness", "married"])
    popn["age"] = np.repeat(np.arange(65), 20)
    # give every age cohort the full set of 20 happiness values
    popn["happiness"] = np.tile(np.linspace(-2, 2, 20), 65)
    # store marriage status as booleans so it can be used as a mask when plotting
    popn["married"] = False

    for i in range(N_years):
        # age population
        popn.loc[:, "age"] += 1
        # replace old folk with new folk
        ind = popn.age == 65
        popn.loc[ind, "age"] = 0
        popn.loc[ind, "married"] = False
        popn.loc[ind, "happiness"] = np.linspace(-2, 2, 20)

        # do the work
        eligible = (popn.married == 0) & (popn.age >= 18)
        marry = np.random.binomial(1, inv_logit(popn.loc[eligible, "happiness"] - 4)) == 1
        popn.loc[eligible, "married"] = marry

    popn.sort_values("age", inplace=True, ignore_index=True)

    return popn



popn = sim_happiness(N_years=1000, seed=1977)

popn_summ = popn.copy()
popn_summ["married"] = popn_summ["married"].astype(int)
# this is necessary before using az.summary, which doesn't work with boolean columns.
az.summary(popn_summ.to_dict(orient="list"), kind="stats", round_to=2)

The data comprises 1,300 people aged 0 to 64 (everyone leaves the sample on turning 65), and the variables correspond to our `happiness_dag`.

Here are some key insights about the data, which Figure 6.4 below visualizes:

1. Marriage and Age: Individuals cannot get married until age 18. After that age, the number of married individuals (dark blue dots) gradually increases as people get older.

2. Marriage and Happiness: At every age, the people with higher happiness (top of the plot) are much more likely to be married.

The overall plot confirms the simulation's design: both age and happiness are clearly associated with the marriage status of the individuals in the sample.

### Figure 6.4. The happiness simulation.

In [None]:
# Figure 6.4
fig, ax = plt.subplots(figsize=[10, 3.4])

colors = np.array(["w"] * popn.shape[0])
colors[popn.married] = "b"
ax.scatter(popn.age, popn.happiness, edgecolor="k", color=colors)

ax.scatter([], [], edgecolor="k", color="w", label="unmarried")
ax.scatter([], [], edgecolor="k", color="b", label="married")
ax.legend(loc="upper left", framealpha=1, frameon=True)

ax.set_xlabel("age")
ax.set_ylabel("happiness")



#####################
### CODE ADDITION ###
#####################
plt.suptitle(
    x=0.5,
    y=-0.06,
    t="Figure 6.4. Simulated data, assuming that happiness is uniformly distributed and never changes. \n \
    Each point is a person. Married individuals are shown with filled blue points. At each age after 18, \n \
    the happiest individuals are more likely to be married. At later ages, more individuals tend to be married. \n \
    Marriage status is a collider of age and happiness: A → M ← H. If we condition on marriage in a regression, \n \
    it will mislead us to believe that happiness declines with age.",
    ma="left"
  );

#### Code 6.22

Imagine you have the simulated data and want to figure out if *age is related to happiness*. You don't know the true causal relationships, but you reasonably suspect that **marriage status** is an important confounder. If married people are generally happier or sadder than single people, you'd need to control for marriage status to isolate the true relationship between age and happiness.

Therefore, we'll use a multiple regression model that aims to infer the influence of age while controlling for marriage status:


$ \mu_i = \alpha_{MID[i]} + \beta_A A_i $

Where:
- $\mu_i$ is the predicted happiness for individual $i$;
- $\alpha_{MID[i]}$ is a separate intercept for each marriage status category, where $MID[i]$ is the marriage status index of individual $i$ (1 meaning single and 2 meaning married in the book; the Python code below uses 0 and 1 instead). This is just the categorical variable strategy from Chapter 4;
- $\beta_A$ is the slope, representing the influence of age;
- $ A_i $ is the age of individual $i$.


To make slope $\beta_A$ easier to interpret when setting up our priors, we'll focus only on the *adult sample (age 18 and over)* and *rescale age* so that the entire range from 18 to 65 corresponds to a single unit. This helps us set sensible prior expectations for how much happiness might change over an adult lifespan.



In [None]:
adults = popn.loc[popn.age > 17].copy()
adults.loc[:, "A"] = (adults["age"].copy() - 18) / (65 - 18)
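A small numeric check of the rescaling, and of what it implies for the `Normal(0, 2)` prior placed on `bA` in the model that follows. Because happiness runs from -2 to 2 and rescaled age runs from 0 to 1, the steepest slope the data could support is (2 - (-2)) / 1 = 4. The example ages and the number of prior draws are arbitrary choices for illustration.

```python
import numpy as np

# Rescale age so that 18 maps to 0 and 65 maps to 1.
ages = np.array([18, 41.5, 65])
A = (ages - 18) / (65 - 18)
print(A)  # 0 at age 18, 0.5 at the midpoint, 1 at age 65

# Steepest possible slope: happiness goes from one extreme to the other
# over a single unit of rescaled age.
max_slope = (2 - (-2)) / 1

# Share of Normal(0, 2) prior draws that fall inside that plausible range.
rng = np.random.default_rng(0)
prior_bA = rng.normal(0, 2, 100_000)
share_plausible = np.mean(np.abs(prior_bA) <= max_slope)
print(round(share_plausible, 2))
```

Roughly 95% of the prior mass lands within plus or minus 4, two standard deviations of the prior, so the prior is weak but still rules out slopes that the happiness scale could never produce.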

#### Code 6.23

In [None]:
mid = adults.loc[:, "married"].astype(int).values

with pm.Model() as m_6_9:
    a = pm.Normal("a", 0, 1, shape=2)
    bA = pm.Normal("bA", 0, 2)

    mu = a[mid] + bA * adults.A.values
    sigma = pm.Exponential("sigma", 1)

    happiness = pm.Normal("happiness", mu, sigma, observed=adults.happiness.values)

    m_6_9_trace = pm.sample(1000)

az.summary(m_6_9_trace, round_to=2)

From the output of the `m_6_9` model, age appears to be negatively associated with happiness.

#### Code 6.24

Let's now try running a model below that omits the marriage status variable `mid` from the model. Based on the results, it seems that model `m6_10` finds no association between age and happiness.

In [None]:
with pm.Model() as m6_10:
    a = pm.Normal("a", 0, 1)
    bA = pm.Normal("bA", 0, 2)

    mu = a + bA * adults.A.values
    sigma = pm.Exponential("sigma", 1)

    happiness = pm.Normal("happiness", mu, sigma, observed=adults.happiness.values)

    trace_6_10 = pm.sample(1000)

az.summary(trace_6_10, round_to=2)

The results from `m_6_9` are exactly what we expect when we condition on a **collider**, which in this case is marriage status. Because marriage is a common consequence of both age and happiness, including it in the model induces a fake, or **spurious**, association between age and happiness. That model incorrectly suggests that age is *negatively associated* with happiness, even though we know from our simulation that age has no causal effect on happiness. Model `m6_10`, which leaves marriage out, gets it right.

You can actually see this false association in Figure 6.4. If you look only at the *married individuals* (the blue dots), the older people are, on average, less happy. At 18 only the very happiest people are married; as more people marry over time, the mean happiness of the married group drifts down toward the population average. Similarly, if you look only at the *unmarried people* (the open dots), their average happiness also declines with age, because the happiest people keep leaving the single group to join the married group. This negative relationship exists within both subgroups, but it is purely a statistical artifact, not a causal truth.
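The within-group slopes can be computed directly. The sketch below does not reuse `popn`; it draws a fresh snapshot of adults using a shortcut that should match the yearly loop in distribution: a person aged `age` has had `age - 17` independent yearly chances to marry, each with probability `inv_logit(happiness - 4)`. Stacking `reps` snapshots, the seed, and the least-squares helper `ols` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1977)
reps = 20  # stack several independent snapshots to cut simulation noise

# 47 adult ages (18..64), 20 people per age, happiness on a fixed uniform grid.
age = np.tile(np.repeat(np.arange(18, 65), 20), reps)
happiness = np.tile(np.tile(np.linspace(-2, 2, 20), 47), reps)

# Probability of being married by the current age (closed-form shortcut).
p_year = 1 / (1 + np.exp(-(happiness - 4)))
p_married = 1 - (1 - p_year) ** (age - 17)
married = rng.binomial(1, p_married)

A = (age - 18) / (65 - 18)


def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


bA_alone = ols(happiness, A)[1]
bA_given_m = ols(happiness, A, married)[1]
slope_married = ols(happiness[married == 1], A[married == 1])[1]
slope_single = ols(happiness[married == 0], A[married == 0])[1]

print(f"age slope, marriage ignored   : {bA_alone:.2f}")
print(f"age slope, marriage controlled: {bA_given_m:.2f}")
print(f"age slope among the married   : {slope_married:.2f}")
print(f"age slope among the unmarried : {slope_single:.2f}")
```

Ignoring marriage gives a slope of zero, since every age has the same happiness grid. Every slope that conditions on marriage status comes out negative.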

The crucial lesson here is that *you cannot make reliable **causal inferences** from a multiple regression unless you first have a **causal model** (a DAG) to guide you*. The regression itself won't provide the science needed to justify the model; you need to bring that scientific understanding yourself.

#### **6.3.2. The haunted DAG.**

Collider bias isn't always easy to spot, because the cause that creates it may be unmeasured. An unmeasured cause can still induce collider bias, which leads to the concept of a "**Haunted DAG**."

For instance, let's say we were interested in the effects that both **parents** $(P)$ and **grandparents** $(G)$ have on a **child's education** $(C)$. Since grandparents presumably also affect their own kids' education (i.e. the parents'), we can assume there's a link between the two: $G \rightarrow P$. In creating this DAG, we need to be careful of unmeasured variables that could turn one of our observed variables into a collider:

In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
child_educ_dag = nx.DiGraph()

n1 = 'G'
n2 = 'P'
n3 = 'C'

# Add nodes
nodes = [n1, n2, n3]
child_educ_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [(n1, n2), (n2, n3), (n1, n3)]
child_educ_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    n1: (0, 1),
    n2: (1, 1),
    n3: (1, 0)
}

# Draw the graph
plt.figure(figsize=(6, 4))
nx.draw(child_educ_dag, pos, with_labels=True, node_color='white',
        node_size=1500, font_size=16, font_weight='bold',
        arrows=True, arrowsize=20, edge_color='black')

plt.axis('off')
plt.show()

However, suppose there are unmeasured, common influences on parents and their children that are *not shared* by the grandparents, such as the neighbourhood they live in or the income level of the immediate family. In this case, our new DAG becomes haunted by the unobserved variable $U$:

In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
haunted_dag = nx.DiGraph()

n1 = 'G'
n2 = 'P'
n3 = 'C'
n4 = 'U'

# Add nodes
nodes = [n1, n2, n3, n4]
haunted_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [(n1, n2), (n2, n3), (n1, n3), (n4, n2), (n4, n3)]
haunted_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    n1: (0, 1),
    n2: (1, 1),
    n3: (1, 0),
    n4: (2, 0.5)
}

# Draw the graph
plt.figure(figsize=(6, 4))

# Adding the original nodes only first
standard_nodes = [n1, n2, n3]  # observed nodes only; U is drawn separately below
nx.draw(
    haunted_dag,
    pos,
    nodelist=standard_nodes,
    with_labels=True,
    node_color='white',
    node_size=1500,
    font_size=16,
    font_weight='bold',
    arrows=True,
    arrowsize=20,
    edge_color='black'
  )

# Adding the 'unobserved' node
latent_node = ['U']
nx.draw(
    haunted_dag,
    pos,
    nodelist=latent_node,
    with_labels=True,
    node_color='grey',
    node_size=1500,
    font_size=16,
    font_weight='bold',
    arrows=True,
    arrowsize=20,
    edge_color='black'
  )

# Draw the graph
# plt.figure(figsize=(6, 4))
# nx.draw(haunted_dag, pos, with_labels=True, node_color='white',
#         node_size=1500, font_size=16, font_weight='bold',
#         arrows=True, arrowsize=20, edge_color='black')

plt.axis('off')
plt.show()

#### Code 6.25

The core idea here is that **Parents' achievement** $(P)$ is a common consequence of both **Grandparents' achievement** $(G)$ and some **unmeasured variable** $(U)$, which makes $P$ a **collider**. The problem is that if we include $P$ in our regression model, it will bias our inference about the direct link between the grandparents' achievement $(G)$ and the **children's achievement** $(C)$. This bias happens even though we never measure $U$.

To show exactly how this works, we will simulate 200 family units (grandparent, parent, child). Our simulation will follow the rules of the DAG, treating the relationships as simple functional relationships:

1. Parents' achievement $(P)$ is based on Grandparents' $(G)$ and the Unmeasured factor $(U)$.

2. Children's achievement $(C)$ is based on Grandparents' $(G)$, Parents' $(P)$, and the Unmeasured factor $(U)$.

3. The Grandparents' achievement $(G)$ and the Unmeasured factor $(U)$ are not caused by anything else in the model.

Next, we will crawl through a quantitative example to show how the unmeasured cause $(U)$ distorts the statistical relationship between $G$ and $C$ when we include the collider $P$ in the model.

In [None]:
N = 200  # number of grandparent-parent-child triads
b_GP = 1  # direct effect of G on P
b_GC = 0  # direct effect of G on C
b_PC = 1  # direct effect of P on C
b_U = 2  # direct effect of U on P and C

#### Code 6.26

These parameters act like slopes in a regression model. Notice that we've assumed that grandparents $(G)$ have *zero direct effect* on their grandkids $(C)$:

`b_GC = 0` (third line of Code 6.25)

The example doesn't depend on that effect being exactly zero, but it makes the point easier to see. We've also made the unmeasured neighbourhood factor $(U)$ binary so it's easier to understand, although the example doesn't depend on that assumption either. Let's now generate the data for our simulation:


In [None]:
U = 2 * np.random.binomial(1, 0.5, N) - 1
G = np.random.normal(size=N)
P = np.random.normal(b_GP * G + b_U * U)
C = np.random.normal(b_PC * P + b_GC * G + b_U * U)
d = pd.DataFrame.from_dict({"C": C, "P": P, "G": G, "U": U})

#### Code 6.27

We've now created a simulation where Parents' achievement $(P)$ acts as a collider, being caused by both Grandparents' achievement $(G)$ and an Unmeasured factor $(U)$.

Now we want to figure out the influence of the grandparents $(G)$ on the children's achievement $(C)$. Since we know some of the effect from $G$ to $C$ runs through the parents $(P)$, we realize we need to control for the parents' achievement by including it in our regression model.

We're going to run a simple multiple regression of $C$ on $P$ and $G$. Normally, we'd standardize these variables to set better priors but in this case, let's just keep them on their original scale. This allows us to see if the regression can accurately recover the original, known values from the simulation. Because we're focused on demonstrating the bias, we're setting very vague priors just to quickly move forward with the example.

In [None]:
with pm.Model() as m_6_11:
    a = pm.Normal("a", 0, 1)
    p_PC = pm.Normal("b_PC", 0, 1)
    p_GC = pm.Normal("b_GC", 0, 1)

    mu = a + p_PC * d.P.values + p_GC * d.G.values
    sigma = pm.Exponential("sigma", 1)

    pC = pm.Normal("C", mu, sigma, observed=d.C.values)

    m_6_11_trace = pm.sample()

az.summary(m_6_11_trace, round_to=2)

The regression found two surprising things:

1. The effect of Parents' achievement (i.e. the mean of the `b_PC` variable) looks too large compared with the value of 1 used in the simulation. This is ordinary confounding: the unmeasured factor $U$ influences both $P$ and $C$;
2. The direct effect of Grandparents' achievement (i.e. the mean of the `b_GC` variable) is inferred to be **negative**, which would mean that $G$ hurts the child's education $(C)$, even though we simulated it as zero. *This negative finding is a statistical artifact, not a causal truth*.

This bias arises because Parents' achievement $(P)$ is a ***collider** whose other cause, $U$, is unmeasured*. The best way to see how this happens is by looking at the plot. **[Figure 6.5](#scrollTo=yDEs0w60y44w)** separates the data by the unmeasured factor, **neighborhood quality** $(U)$: **good neighborhoods (blue points)** and **bad neighborhoods (black points)**. Notice that within each neighborhood type, the association between $G$ and $C$ is **positive**, meaning that educated grandparents still have more educated grandkids. In this simulation that association arises entirely through the parents ($G \rightarrow P \rightarrow C$), since the direct effect of $G$ on $C$ was set to zero.

The negative association arises when we condition on $P$, which means we statistically compare families with similar levels of parental education. Look at the **filled-in points** in **[Figure 6.5](#scrollTo=yDEs0w60y44w)**, which represent parents who all fall in a narrow, similar range of achievement. If you draw a line through only these points, the slope of $C$ on $G$ is clearly **negative**.

This negative slope exists because, once you fix $P$, learning about $G$ actually tells you about the unseen neighborhood $(U)$. Two parents with the same education can get there in different ways: one has highly educated parents but lives in a 'bad' neighbourhood, the other has less educated parents but lives in a 'good' neighbourhood. Since the neighborhood $(U)$ strongly influences the child's achievement $(C)$, the first parent ends up with less educated children and the second with more educated children in our simulation. Thus, when holding $P$ constant, $G$ falsely predicts lower $C$. The only way to fix this collider bias is to ***measure the unobserved factor $U$** and include it in the model*.


### Figure 6.5. Unobserved confounds and collider bias.

In [None]:
# grandparent education
bad = U < 0
good = ~bad
plt.scatter(G[good], C[good], color="w", lw=1, edgecolor="C0")
plt.scatter(G[bad], C[bad], color="w", lw=1, edgecolor="k")

# parents with similar education
eP = (P > -1) & (P < 1)
plt.scatter(G[good & eP], C[good & eP], color="C0", lw=1, edgecolor="C0")
plt.scatter(G[bad & eP], C[bad & eP], color="k", lw=1, edgecolor="k")


p = np.polyfit(G[eP], C[eP], 1)
xn = np.array([-2, 3])
plt.plot(xn, np.polyval(p, xn))

plt.xlabel("grandparent education (G)")
plt.ylabel("grandchild education (C)")

#####################
### CODE ADDITION ###
#####################
plt.suptitle(
    x=1.35,
    y=.65,
    t="Figure 6.5. Unobserved confounds and collider bias. \n \
    In this example, grandparents influence grandkids \n \
    only indirectly, through parents. However, \n \
    unobserved neighbourhood effects on parents and \n \
    their children create the illusion that \n \
    grandparents harm their grandkids education. \n \
    Parental education is a collider: Once we condition \n \
    on it, grandparental education becomes negatively \n \
    associated with grandchild education.",
    ma="left"
  )

plt.text(
    x=-2,
    y=8,
    s="good neighbourhoods",
    ma="left",
    color="C0"
  )

plt.text(
    x=2,
    y=-8,
    s="bad neighbourhoods",
    ma="left",
    color="k"
  );

#### Code 6.28

Since the unmeasured $U$ makes $P$ a collider, the only way to fix this collider bias is to measure the unobserved factor $U$. Here's the regression model that also conditions on $U$:

In [None]:
with pm.Model() as m_6_12:
    a = pm.Normal("a", 0, 1)
    p_PC = pm.Normal("b_PC", 0, 1)
    p_GC = pm.Normal("b_GC", 0, 1)
    p_U = pm.Normal("b_U", 0, 1)

    mu = a + p_PC * d.P.values + p_GC * d.G.values + p_U * d.U.values
    sigma = pm.Exponential("sigma", 1)

    pC = pm.Normal("C", mu, sigma, observed=d.C.values)

    m_6_12_trace = pm.sample()

az.summary(m_6_12_trace, round_to=2)
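The two regressions can be cross-checked with ordinary least squares on a larger simulated sample. The sketch below reuses the slopes from Code 6.25 but draws its own data, so the variable names do not clash with the notebook's `d`. The sample size, the seed, and the helper `ols` are assumptions made for illustration, and least squares is only a rough stand-in for the posterior means.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5_000  # more triads than the 200 used above, so the pattern is stable
b_GP, b_GC, b_PC, b_U = 1, 0, 1, 2  # same values as Code 6.25

U = 2 * rng.binomial(1, 0.5, n) - 1
G = rng.normal(size=n)
P = rng.normal(b_GP * G + b_U * U)
C = rng.normal(b_PC * P + b_GC * G + b_U * U)


def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


_, bP_no_u, bG_no_u = ols(C, P, G)           # analogue of m_6_11
_, bP_with_u, bG_with_u, bU = ols(C, P, G, U)  # analogue of m_6_12

print(f"without U: b_PC = {bP_no_u:.2f}, b_GC = {bG_no_u:.2f}")
print(f"with U   : b_PC = {bP_with_u:.2f}, b_GC = {bG_with_u:.2f}, b_U = {bU:.2f}")
```

Without $U$, the parent slope is inflated and the grandparent slope is negative. With $U$ in the regression, the estimates return to roughly the values the simulation was built with (1, 0, and 2).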

### **Rethinking: Statistical paradoxes and causal explanations.**

The misleading result we saw in the grandparents example is a case of **Simpson's paradox**. This statistical phenomenon occurs when the association between two variables reverses direction once you include a third variable (the predictor $P$ reverses the association between $G$ and $C$).

Usually, adding a variable helps clarify the true relationship, but in this case, it *misleads* us into thinking grandparents harm their grandkids' education. Simpson's paradox is purely a statistical observation. To know whether the direction reversal in the regression is a real causal truth or just a statistical accident, we need to rely on something beyond the statistical model itself. We need a **clear causal model (a DAG)** based on scientific understanding.

## *Section 6.4* - Confronting confounding

Throughout this chapter, we've learned that while **multiple regression** can help us handle **confounding variables**, it can also <u>cause confounding</u> if we control for the wrong variables, such as a collider. The main lesson is to never blindly throw every variable into a model hoping for a clear answer. Effective inference is possible only if we are careful and knowledgeable about causation.

To bring coherence to these issues, let's first define **confounding** as a bias where the observed association between an outcome $(Y)$ and a predictor of interest $(X)$ in your data differs from the **true causal effect of the predictor on the outcome**. The most familiar case: confounding happens when an invisible or **unmeasured third variable $(U)$** influences both your predictor and your outcome.

For example, suppose we examine the relationship between **Education** $(E)$ and **Wages** $(W)$. The problem is that many **unobserved variables** $(U)$, such as neighborhood, family background, or natural talent, influence both education level and wages:






In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
wage_dag = nx.DiGraph()

n1 = 'E'
n2 = 'W'
n3 = 'U'

# Add nodes
nodes = [n1, n2, n3]
wage_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [(n1, n2), (n3, n1), (n3, n2)]
wage_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    n1: (0, 0),
    n2: (1, 0),
    n3: (0.5, 1)
}

# Draw the graph
plt.figure(figsize=(6, 4))
nx.draw(wage_dag, pos, with_labels=True, node_color='white',
        node_size=1500, font_size=16, font_weight='bold',
        arrows=True, arrowsize=20, edge_color='black')

plt.axis('off')
plt.show()

If we regress $W$ages on $E$ducation, our result is confounded by $U$nknown because there are two paths connecting $E$ and $W$:

1. A **Causal Path** where education directly increases wages: $E→W$.

2. A **Non-Causal Path** where an unobserved factor causes both higher education and higher wages: $E←U→W$

A "**path**" is any way to walk from one variable to another, ignoring arrow direction. Both paths create a statistical association but only the first path (1) is truly causal. The second path (2) is non-causal because if we magically changed only $E$, it wouldn't affect $W$ through $U$.
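As a small illustration of that definition, the sketch below lists every path between $E$ and $W$ by searching the undirected version of the DAG. It rebuilds the same three edges as the DAG above and uses `nx.all_simple_paths`; the variable name `paths` is only for this example.

```python
import networkx as nx

# Same edges as the education/wages DAG above
wage_dag = nx.DiGraph([("E", "W"), ("U", "E"), ("U", "W")])

# A path ignores arrow direction, so we search the undirected version of the graph
paths = sorted(nx.all_simple_paths(wage_dag.to_undirected(), "E", "W"), key=len)

for path in paths:
    print(" - ".join(path))
```

The search only tells us which variables are connected. Deciding which path is causal still requires looking at the arrow directions in the DAG.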

To truly isolate that causal path, the most famous solution is to run a randomized experiment, which fundamentally changes the causal graph:

In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
isolated_dag = nx.DiGraph()

n1 = 'E'
n2 = 'W'
n3 = 'U'

# Add nodes
nodes = [n1, n2, n3]
isolated_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [(n1, n2), (n3, n2)]
isolated_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    n1: (0, 0),
    n2: (1, 0),
    n3: (0.5, 1)
}

# Draw the graph
plt.figure(figsize=(6, 4))
nx.draw(isolated_dag, pos, with_labels=True, node_color='white',
        node_size=1500, font_size=16, font_weight='bold',
        arrows=True, arrowsize=20, edge_color='black')

plt.axis('off')
plt.show()

The problem of confounding, like the one between Education $(E)$ and Wages $(W)$ caused by Unobserved factors $(U)$, exists because there are two paths connecting $E$ and $W$: the causal path $(E→W)$ and the non-causal path $(E←U→W)$. To get a true measure of the causal effect, we must block the non-causal path, which can be achieved in two ways:

1. **Experimental Manipulation**. When you run an experiment and **randomly assign** education levels, you manipulate $E$ directly and remove the influence of $U$ on $E$. Since $U$ can no longer influence $E$, the $E←U→W$ path is broken. With that path blocked, the only remaining link between $E$ and $W$ is the true causal path, and the measured association will reflect the causal influence of $E$ on $W$.

2. **Statistical Control**. We can also achieve the same result statistically by adding $U$ to the model (i.e. conditioning on $U$). This also blocks the flow of information along the non-causal path $E←U→W$.

Think of it this way: consider the path $E←U→W$ in isolation. If $U$ is the average wealth in a region, and high wealth causes both high education and high wages, then $E$ and $W$ are correlated across regions. However, once you control for $U$ by holding the region's wealth constant in your model, a person's education $(E)$ gives you no additional information about their wages $(W)$ *through $U$*. Conditioning on $U$ makes $E$ and $W$ independent along this non-causal path, so any association that remains reflects the causal path $E→W$.
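Both solutions can be checked on simulated data. The sketch below is not from the book: it assumes a linear world with an arbitrary true effect `b_EW = 0.5`, pretends that $U$ has been measured, and uses ordinary least squares (`np.linalg.lstsq`) instead of a PyMC model to keep it short. The helper `slope_on_first` is a name made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
b_EW = 0.5  # assumed true causal effect of E on W (illustrative value)

def slope_on_first(y, *predictors):
    """Least-squares coefficient on the first predictor (intercept included)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

# Observational world: U influences both E and W
U = rng.normal(size=n)
E = U + rng.normal(size=n)
W = b_EW * E + U + rng.normal(size=n)

naive = slope_on_first(W, E)        # confounded by the open path E <- U -> W
adjusted = slope_on_first(W, E, U)  # conditioning on U blocks that path

# Experimental world: E is assigned at random, so the arrow U -> E is removed
E_rand = rng.normal(size=n)
W_rand = b_EW * E_rand + U + rng.normal(size=n)
randomized = slope_on_first(W_rand, E_rand)

print(f"naive: {naive:.2f}, adjusted for U: {adjusted:.2f}, randomized E: {randomized:.2f}")
```

Under these assumptions the naive slope overstates the effect, while both the adjusted and the randomized estimates land close to `b_EW`.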

### Figure 6.6. The four elemental confounds.

In [None]:
# Define the four causal structures (edges)
# ----------------------------------------
# 1. The Fork (Confounder): X <- Z -> Y
fork_edges = [('Z', 'X'), ('Z', 'Y')]

# 2. The Pipe (Mediator/Chain): X -> Z -> Y
pipe_edges = [('X', 'Z'), ('Z', 'Y')]

# 3. The Collider: X -> Z <- Y
collider_edges = [('X', 'Z'), ('Y', 'Z')]

# 4. The Descendent (of a Collider): X -> Z <- Y, Z -> D
descendent_edges = [('X', 'Z'), ('Y', 'Z'), ('Z', 'D')]

# Define the list of all structures to iterate over
structures = [
    (fork_edges, ['X', 'Z', 'Y'], "The Fork"),
    (pipe_edges, ['X', 'Z', 'Y'], "The Pipe"),
    (collider_edges, ['X', 'Z', 'Y'], "The Collider"),
    (descendent_edges, ['X', 'Z', 'Y', 'D'], "The Descendent")
]

# --- Create the combined plot ---
plt.figure(figsize=(12, 4))

for i, (edges, nodes, title) in enumerate(structures):
    # 1. Create a subplot for the current DAG
    ax = plt.subplot(1, 4, i + 1)

    # 2. Create the graph object
    G = nx.DiGraph()
    G.add_edges_from(edges)

    # 3. Define the positions for each structure explicitly, so that no layout
    #    depends on the `pos` left over from a previous loop iteration
    if 'D' in nodes:
        # Descendent: collider Z on top, descendent D below
        pos = {'X': (0, 0), 'Y': (2, 0), 'Z': (1, 1), 'D': (1, -1)}
    elif title == "The Pipe":
        # Pipe: X, Z and Y sit on one straight (diagonal) line
        pos = {'X': (0, 1), 'Z': (1, 0.5), 'Y': (2, 0)}
    elif title == "The Fork":
        # Fork: the common cause Z sits below X and Y
        pos = {'X': (0, 1), 'Y': (2, 1), 'Z': (1, 0)}
    else:
        # Collider: X and Y both point up into Z
        pos = {'X': (0, 0), 'Y': (2, 0), 'Z': (1, 1)}


    # 4. Draw the graph
    nx.draw(
        G, pos,
        with_labels=True,
        node_color='white',
        node_size=2500,
        font_size=16,
        font_weight='bold',
        arrows=True,
        arrowsize=25,
        edge_color='black',
        linewidths=2,  # Add border to nodes
        ax=ax
    )

    # 5. Add the title
    ax.set_title(title, fontsize=18)
    ax.axis('off')

# 6. Add the subheading
plt.suptitle(
    x=0.5,
    y=-0.06,
    t="Figure 6.6. The four elemental confounds. Any directed acyclic graph is built from these elementary relationships.",
    ma="left"
  );

plt.tight_layout()
plt.show()

### **6.4.1. Shutting the backdoor.**

The goal of causal inference is to "**shut the backdoor**," which means statistically blocking all non-causal paths (spurious correlations) between our predictor of interest $(X)$ and the outcome $(Y)$. *In other words, we want the association we measure between $X$ and $Y$ to reflect only the influence of $X$ on $Y$, by accounting for any confounding variable $(Z)$.* A backdoor path is any non-causal path that enters the predictor $X$ with an arrow. The good news is that, given a causal DAG, we can always work out which variables to control for and which ones to avoid.

All complex DAGs are built from just four **elemental causal relations**. Understanding how information flows in each is the key to shutting the backdoor:

1. **Fork:** $(X←Z→Y)$. This is the classic **confounder** $(Z)$. In a fork, $Z$ is a common cause of $X$ and $Y$, so information flows between them through $Z$. Conditioning on $Z$, as we did with the DAG on education, regional wealth, and wages, blocks this non-causal path (i.e. it isolates the true effect). In other words, we can shut this backdoor by including $Z$ as a predictor alongside $X$ in our multiple regression model.

2. **Pipe**: $(X→Z→Y)$. This is a **mediator** (or **post-treatment variable**) $(Z)$. Information flows from $X$ to $Y$ through $Z$. Conditioning on $Z$ blocks the effect that the predictor has on the outcome along this path, so if we want the total effect of $X$ on $Y$ it's best to leave $Z$ out of the model.

3. **Collider**: $(X→Z←Y)$. This is the **selection trap** $(Z)$. Unlike the others, there is no association between $X$ and $Y$ unless you do something to $Z$. Conditioning on $Z$ opens the path, creating a spurious association between $X$ and $Y$.

4. **Descendent**: A **descendent** $(D)$ is a variable influenced by another variable $(Z)$. Conditioning on a descendent $(D)$ is like weakly conditioning on $Z$ itself. If $Z$ is a collider, conditioning on its descendent $D$ will partially open the spurious path between $X$ and $Y$.
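The behaviour of the four structures can be checked with a quick simulation. This is a sketch under simplifying assumptions (linear effects of size 1, standard normal noise), and "conditioning" is approximated by correlating the residuals left after regressing each variable on the conditioning variable. The helpers `residual`, `corr`, and `partial_corr` are written for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

def residual(y, *controls):
    """Part of y left over after a linear regression on the controls (plus intercept)."""
    design = np.column_stack([np.ones(len(y)), *controls])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def partial_corr(x, y, *controls):
    """Correlation between x and y after linearly removing the controls from both."""
    return corr(residual(x, *controls), residual(y, *controls))

results = {}

# Fork: X <- Z -> Y
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)
Y = Z + rng.normal(size=n)
results["fork"] = (corr(X, Y), partial_corr(X, Y, Z))

# Pipe: X -> Z -> Y
X = rng.normal(size=n)
Z = X + rng.normal(size=n)
Y = Z + rng.normal(size=n)
results["pipe"] = (corr(X, Y), partial_corr(X, Y, Z))

# Collider: X -> Z <- Y
X = rng.normal(size=n)
Y = rng.normal(size=n)
Z = X + Y + rng.normal(size=n)
results["collider"] = (corr(X, Y), partial_corr(X, Y, Z))

# Descendent: same collider, but we condition on D, a noisy child of Z
D = Z + rng.normal(size=n)
results["descendent"] = (corr(X, Y), partial_corr(X, Y, D))

for name, (raw, conditioned) in results.items():
    print(f"{name:<11} corr(X, Y) = {raw:+.2f}   after conditioning = {conditioned:+.2f}")
```

In the fork and the pipe, the association disappears once we condition on $Z$. In the collider it appears only after conditioning, and conditioning on the descendent $D$ produces a weaker version of the same spurious association.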

Here's the recipe for shutting backdoors to correctly isolate a causal effect using a DAG:

1. List all paths connecting your predictor $(X)$ to your outcome $(Y)$.

2. Classify each path as either *open (no collider)* or *closed (contains a collider)*.

3. Identify backdoor paths (any path with an arrow entering $X$).

4. Close all open backdoor paths by deciding which variable(s) to condition on (usually the middle variable of a fork).

### **6.4.2. Two roads.**

The DAG below contains a predictor $(X)$, a target variable $(Y)$, an unobserved confounding variable $(U)$, and three observed covariates ($A$, $B$, and $C$):


In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
road_dag = nx.DiGraph()

n1 = 'X'
n2 = 'Y'
n3 = 'U'
n4 = 'A'
n5 = 'B'
n6 = 'C'

# Add nodes
nodes = [n1, n2, n3, n4, n5, n6]
road_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [(n1, n2), (n3, n1), (n6, n2), (n4, n3), (n4, n6), (n3, n5), (n6, n5)]
road_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    n1: (0, 0),
    n2: (1, 0),
    n3: (0, 1),
    n4: (0.5, 1.5),
    n5: (0.5, 0.5),
    n6: (1, 1)
}

# Draw the graph
plt.figure(figsize=(6, 4))

# Draw only the unobserved node U as a grey circle;
# labels and edges are still drawn for every node in the graph
latent_node = ['U']

nx.draw(
    road_dag,
    pos,
    nodelist=latent_node,
    with_labels=True,
    node_color='grey',
    node_size=1500,
    font_size=16,
    font_weight='bold',
    arrows=True,
    arrowsize=20,
    edge_color='black'
  )

plt.axis('off')
plt.show()

If we're interested in the causal effect of $X$ on $Y$, we need to shut down all non-causal "backdoor" paths. The DAG above reveals two such indirect paths that could cause confounding:

1. Path 1 (Open): $X←U←A→C→Y$
    - We classify Path 1 as open because it contains no colliders, so information flows freely along it and confounds our estimate of $X→Y$. To shut this backdoor, we must condition on a variable within it. Since $U$ is unobserved and cannot be used, we must use $A$ or $C$. Controlling for $C$ is generally preferable because it may also increase the precision (efficiency) of our estimate for $X→Y$.


2. Path 2 (Closed): $X←U→B←C→Y$
    - Path 2 is already closed because it contains a collider at $B$ $(U→B←C)$. We must not control for $B$. If we condition on $B$, we open the path and introduce collider bias, which would ruin the inference about $X→Y$. A change in the $X→Y$ coefficient after adding a variable like $B$ doesn't mean the estimate improved; it may mean you've just conditioned on a collider.

Therefore, to correctly infer the causal effect of $X$ on $Y$, you must control for $A$ or $C$, and you must not control for $B$.
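To see this conclusion in numbers, here is a hedged simulation of the two-roads DAG. Everything quantitative is an assumption of the sketch: all arrows have strength 1, noise is standard normal, the true effect `b_XY` is set to 0.5, and least squares stands in for a full Bayesian model. $U$ is simulated but never used as a predictor, since it is unobserved.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
b_XY = 0.5  # assumed true causal effect of X on Y (illustrative value)

A = rng.normal(size=n)
U = A + rng.normal(size=n)      # unobserved: simulated, never used as a predictor
C = A + rng.normal(size=n)
B = U + C + rng.normal(size=n)  # collider on the path X <- U -> B <- C -> Y
X = U + rng.normal(size=n)
Y = b_XY * X + C + rng.normal(size=n)

def slope_on_x(*controls):
    """Least-squares coefficient on X when regressing Y on X plus the controls."""
    design = np.column_stack([np.ones(n), X, *controls])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef[1]

estimates = {
    "no controls": slope_on_x(),
    "control A": slope_on_x(A),
    "control C": slope_on_x(C),
    "control A and B": slope_on_x(A, B),
}

for label, value in estimates.items():
    print(f"{label:<16} estimated X -> Y effect = {value:.2f}")
```

With these settings, controlling for $A$ or for $C$ recovers a value near `b_XY`, the unadjusted model overstates it, and adding the collider $B$ on top of $A$ pushes the estimate away from the truth again.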

#### Code 6.29

Credit [ksachdeva](https://ksachdeva.github.io/rethinking-tensorflow-probability/)

In [None]:
# dag_6_1 = CausalGraphicalModel(
#     nodes=["X", "Y", "C", "U", "B", "A"],
#     edges=[
#         ("X", "Y"),
#         ("U", "X"),
#         ("A", "U"),
#         ("A", "C"),
#         ("C", "Y"),
#         ("U", "B"),
#         ("C", "B"),
#     ],
# )
# all_adjustment_sets = dag_6_1.get_all_backdoor_adjustment_sets("X", "Y")
# for s in all_adjustment_sets:
#     if all(not t.issubset(s) for t in all_adjustment_sets if t != s):
#         if s != {"U"}:
#             print(s)

### **6.4.3. Backdoor waffles.**

In a final example, let's return to the correlation we explored in Chapter 5 between the **number of Waffle Houses $(W)$** and the **divorce rate $(D)$** in each state. Here we can build a DAG to illustrate how to find a minimal adjustment set and derive testable implications. Remember, while data can never prove a DAG is correct, it can certainly prove a DAG is wrong.

Along with $D$ and $W$, we'll also include the following variables in our proposed causal model:
- $S$ is whether a state is considered a "Southern" state in the US;
- $A$ is the median age at marriage in the state;
- $M$ is the marriage rate in the state.


In [None]:
###############
### MY CODE ###
###############
# Create a directed graph
waffle_dag = nx.DiGraph()

n1 = 'A'
n2 = 'M'
n3 = 'D'
n4 = 'S'
n5 = 'W'

# Add nodes
nodes = [n1, n2, n3, n4, n5]
waffle_dag.add_nodes_from(nodes)

# Add edges based on the diagram
edges = [(n1, n2), (n2, n3), (n1, n3), (n4, n1), (n4, n2), (n4, n5), (n5, n3)]
waffle_dag.add_edges_from(edges)

# Set up positions to match the diagram layout
pos = {
    n1: (0, 0),
    n2: (0.5, 0.5),
    n3: (1, 0),
    n4: (0, 1),
    n5: (1, 1)
}

# Draw the graph
plt.figure(figsize=(6, 4))
nx.draw(waffle_dag, pos, with_labels=True, node_color='white',
        node_size=1500, font_size=16, font_weight='bold',
        arrows=True, arrowsize=20, edge_color='black')

plt.axis('off')
plt.show()

#### Code 6.30

Credit [ksachdeva](https://ksachdeva.github.io/rethinking-tensorflow-probability/)

This DAG assumes that being a Southern state $(S)$ is a common cause of several other factors: $S$ influences the median age at marriage $(S→A)$, the marriage rate ($S→M$ and $S→A→M$), and the number of Waffle Houses $(S→W)$. Both $A$ and $M$ then influence the divorce rate $(D)$, alongside the $W→D$ arrow whose effect we want to estimate.

Our goal is to find the true causal effect of $W$ on $D$. We do this by finding all backdoor paths between $W$ and $D$.

In this graph, there are three open backdoor paths from $W$ to $D$, but every single one of them passes through the Southern-state variable $(S)$ first. Therefore, the minimal adjustment set required to close all three confounding paths is simply to condition on $S$. That's all we need to do to block the spurious correlation and isolate the potential causal effect of Waffle Houses.
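The path tracing described above can also be done with plain `networkx`, following the recipe from Section 6.4.1. The helper `backdoor_paths` below is a hypothetical function written for this notebook, not part of any library: it lists every path whose first edge points into the predictor and marks a path as closed if it contains a collider (before conditioning on anything).

```python
import networkx as nx

# Same edges as the waffle DAG above
waffle_dag = nx.DiGraph([
    ("S", "A"), ("S", "M"), ("S", "W"),
    ("A", "M"), ("A", "D"), ("M", "D"), ("W", "D"),
])

def backdoor_paths(dag, x, y):
    """Yield (path, is_open) for every path from x to y whose first edge points into x.

    `is_open` describes the path before conditioning on anything:
    the path is closed if any interior node is a collider on it.
    """
    for path in nx.all_simple_paths(dag.to_undirected(), x, y):
        if not dag.has_edge(path[1], x):
            continue  # the first edge leaves x, so this is not a backdoor path
        has_collider = any(
            dag.has_edge(prev, node) and dag.has_edge(nxt, node)
            for prev, node, nxt in zip(path, path[1:], path[2:])
        )
        yield path, not has_collider

found = sorted(backdoor_paths(waffle_dag, "W", "D"), key=lambda item: item[0])
for path, is_open in found:
    print(" - ".join(path), "| open" if is_open else "| closed")
```

The listing shows one additional backdoor path, $W←S→M←A→D$, which is already closed by the collider at $M$. Every listed path has $S$ as its second node, which is why conditioning on $S$ alone is enough.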

In [None]:
# dag_6_2 = CausalGraphicalModel(
#     nodes=["S", "A", "D", "M", "W"],
#     edges=[
#         ("S", "A"),
#         ("A", "D"),
#         ("S", "M"),
#         ("M", "D"),
#         ("S", "W"),
#         ("W", "D"),
#         ("A", "M"),
#     ],
# )
# all_adjustment_sets = dag_6_2.get_all_backdoor_adjustment_sets("W", "D")
# for s in all_adjustment_sets:
#     if all(not t.issubset(s) for t in all_adjustment_sets if t != s):
#         print(s)

#### Code 6.31

Credit [ksachdeva](https://ksachdeva.github.io/rethinking-tensorflow-probability/)

<br>

We've found that we could control for either:
1. **Median Age at Marriage $(A)$** <u>and</u> **Marriage Rate $(M)$**;
2. Or for **Southern State $(S)$** <u>alone</u>.

The rule of thumb in modeling is simple: **if you don't have to add a variable to the model, then don't.** Therefore, controlling only for $S$ is the preferred approach.

While this DAG is likely too simple for real-world data because it ignores unobserved confounding factors, we can still learn something useful by analyzing it. Although data alone can never prove a DAG is correct, it can often suggest where a DAG is wrong.

This is done by deriving the model's **testable implications**, which are the **conditional independencies**. These are pairs of variables that the DAG predicts will have no association *once* we control for a specific set of other variables.

You can derive these conditional independencies using the exact same path-logic you learned for finding and closing backdoors. For any pair of variables, you find all paths connecting them and then determine the minimal set of variables you would need to condition on to close all those paths. While this is a lot of work in a complex graph, it provides a crucial way to test if your hypothesized causal model is consistent with the evidence.
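As a rough illustration of what testing an implication looks like, the sketch below simulates data that obeys the waffle DAG and checks three conditional independencies that follow from the path logic: $A$ and $W$ given $S$, $M$ and $W$ given $S$, and $D$ and $S$ given $A$, $M$, and $W$. The data are simulated, not the real divorce data, every coefficient is made up, and a residual-based partial correlation is used as a simple linear stand-in for conditioning.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Simulated data that follows the DAG's arrows; every coefficient is made up
S = rng.binomial(1, 0.3, size=n).astype(float)
A = -1.0 * S + rng.normal(size=n)
M = 0.5 * S - 0.5 * A + rng.normal(size=n)
W = 2.0 * S + rng.normal(size=n)
D = -0.5 * A + 0.3 * M + 0.2 * W + rng.normal(size=n)

def residual(y, *controls):
    """Part of y left over after a linear regression on the controls (plus intercept)."""
    design = np.column_stack([np.ones(len(y)), *controls])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

def partial_corr(x, y, *controls):
    """Correlation between x and y after linearly removing the controls from both."""
    return np.corrcoef(residual(x, *controls), residual(y, *controls))[0, 1]

checks = {
    "A _||_ W | S": partial_corr(A, W, S),
    "M _||_ W | S": partial_corr(M, W, S),
    "D _||_ S | A, M, W": partial_corr(D, S, A, M, W),
}
marginal_A_W = np.corrcoef(A, W)[0, 1]  # without conditioning, A and W are associated

print(f"corr(A, W) with no conditioning: {marginal_A_W:+.2f}")
for label, value in checks.items():
    print(f"{label:<20} partial correlation = {value:+.2f}")
```

With real data, a partial correlation clearly different from zero for one of these pairs would be evidence against the DAG. Values near zero are consistent with it but do not prove it.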

In [None]:
# all_independencies = dag_6_2.get_all_independence_relationships()
# for s in all_independencies:
#     if all(
#         t[0] != s[0] or t[1] != s[1] or not t[2].issubset(s[2])
#         for t in all_independencies
#         if t != s
#     ):
#         print(s)

### **Rethinking: DAGs are not enough.**

**DAGs (Directed Acyclic Graphs)** are excellent tools, especially when you lack a detailed, **mechanistic model** of the system you're studying. They serve as a *crucial caution against the common mistake of treating multiple regression as a substitute for scientific theory*.

However, DAGs aren't the ultimate goal. If you successfully develop a comprehensive **dynamical model** of your system, you no longer need a DAG. In fact, many complex dynamical systems, which feature complex behavior sensitive to initial conditions, cannot be accurately or usefully represented by a simple DAG. These systems can still be analyzed, and causal interventions can be designed from them. Although DAGs have limitations, like all theoretical tools, they remain the best available method for clearly teaching the mechanics and common obstacles of causal inference.

## *Section 6.5* - Summary

1. When building a multiple regression model, it's tempting to add as many predictors as you can ("throw the kitchen sink at the problem") to improve our understanding of the outcome. However, this is the first common pitfall with multiple regression, as it can lead to **multicollinearity**, where two or more strongly correlated variables are used as predictors and the model can no longer tell their separate contributions to the target variable apart.
   - Let's take an extreme example of multicollinearity where we include the lengths of both of a person's legs to predict their height. The posterior for each leg's coefficient becomes very wide, because the model can split the credit between the two legs in almost any way, when in reality we only need one leg to predict height. Since both leg variables contain almost the same information, it's almost as if the model is trying to solve a problem with two identical variables by combining them:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $ y_i \sim \text{Normal}(\mu_i, \sigma) $

$\require{enclose}$
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $ \enclose{horizontalstrike}{\mu_i = \alpha + \beta_L x_i + \beta_R x_i} $

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $ \mu_i = \alpha + ( \beta_L + \beta_R ) x_i$


- Multicollinearity belongs to a broader family of problems called **non-identifiability**, where the structure of the data and model makes it impossible to estimate a parameter's value. It often arises when a third variable that isn't in our data influences the correlated predictors; leaving out such a variable is what leads to **omitted variable bias**. When a parameter is non-identifiable, its posterior distribution may end up looking very similar to its prior, signalling that the model didn't learn much from the data. *The way to deal with multicollinearity is to remove one of the highly correlated variables in order to understand the true effect of the other.*

2. A second common pitfall is **post-treatment bias**, which arises when we include two or more predictors and one of them $(Z)$ is a consequence of another $(X)$, so that conditioning on $Z$ blocks the effect that $X$ has on the target $(Y)$. The *elemental causal graph* associated with this issue is the **pipe** $(X → Z → Y)$.
    - For example, let's say we were trying to predict a plant's height based on the use of a new anti-fungal soil treatment AND the presence of fungus in the plant. The issue with including both is that the fungus is a *consequence* of the treatment: if the treatment was effective, then we would see less of the fungus and potentially more plant growth. By using fungus as a predictor along with soil treatment, the fungus is actually *blocking* the causal effect that the treatment would have had on the plant's growth. *Therefore, we need to be sure to remove the fungus (and other blocking variables) from our analysis.*

3. The third common pitfall when building multiple regression models is **collider bias**, where two independent variables both influence a third variable (the collider), and conditioning on that collider creates a misleading association between them ($ T → S ← N $). In the DAG earlier, $S$election is the collider because it's the common consequence of article $T$rustworthiness and $N$ewsworthiness, which appear negatively correlated among selected articles when they're really independent.
    - There could also be the issue of a "**haunted DAG**", where an unmeasured variable turns an observed variable into a collider. [Figure 6.5](#scrollTo=yDEs0w60y44w) is the best example of this: after conditioning on $P$arents, we may think that $G$randparents' education is negatively associated with their grand$C$hildren's education, but this is actually due to the neighbourhoods $(U)$ shared by parents and children. Unfortunately, the only way to fix this collider bias is to recognize the unobserved variable $(U)$ and include it in our model.
    - In this sense, haunted DAGs are almost like the opposite of multicollinearity.



In [None]:
%load_ext watermark
%watermark -n -u -v -iv -w