A simple example of a causal association is one that is *linear*, meaning that it follows a line. For example, consider the equation for a line, $y=2x$. Here, changes in $x$ directly *cause* changes in $y$. We can see this association by plotting $x$ against $y$ in a scatterplot, as we saw in Chapter (add visualization chapter number).

The following code can be used to make such a scatterplot. First, we import our usual modules and set a random seed, as you saw in Chapter 9. Also in Chapter 9, you were introduced to the `random.choice` function from the `numpy` module, which chooses a value at random from a list of potential choices. For our simulation below, we use a similar function, `random.normal`, which chooses values at random from a normal distribution; we will discuss this distribution in more detail in later chapters. We use this function to choose 100 random values to make up our `x` array. Then we define `y` to be $2x$ plus random noise, also generated with `random.normal`, which simulates the error and natural variation that exist in the real world. Finally, we make a scatterplot.

In [None]:
import numpy as np
from matplotlib import pyplot as plt

np.random.seed(1890)

x = np.random.normal(size=100)
y = 2 * x + np.random.normal(size=100)

plt.scatter(x, y)
plt.title("Scatterplot showing true association between x and y such that y = 2x")
plt.ylabel("y")
plt.xlabel("x")
plt.show()

This scatterplot depicts a true association between `x` and `y`, which is present because we defined `y` to depend on `x`.
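
The strength of this association can also be summarized with numbers. The following is a minimal sketch, not part of the original simulation: it regenerates data of the same form with a local random generator (the seed `0` is an arbitrary choice, and a local generator is used so the notebook's global seed is left untouched), then computes the correlation coefficient and the slope of a least-squares line. If the association is as defined, the fitted slope should land near 2.

```python
import numpy as np

# Local generator with an arbitrary, hypothetical seed; it does not
# affect the global seed set with np.random.seed elsewhere.
rng = np.random.default_rng(0)

x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

# Pearson correlation: strength of the linear association, between -1 and 1
r = np.corrcoef(x, y)[0, 1]

# Least-squares line through the points; polyfit returns slope, then intercept
slope, intercept = np.polyfit(x, y, 1)

print(f"correlation = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```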

Let's use scatterplots as we did earlier to better understand confounding. We create our confounding variable `z` as an array of 100 random values. Next, we define both `x` and `y` to depend explicitly on `z`. When we make the scatterplot, we see an association appear between `x` and `y`. However, when we look at the equations for `x` and `y`, there is no true association: the equation for `x` does not depend on `y`, and vice versa. The association in the scatterplot is not causal; it is due to the confounding variable `z`.

In [None]:
z = np.random.normal(size=100)
x = 2 * z + np.random.normal(size=100)
y = z + 4 + np.random.normal(size=100)

plt.scatter(x, y)
plt.title(
    "Scatterplot showing false association between x and y caused by confounder z"
)
plt.ylabel("y")
plt.xlabel("x")
plt.show()
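
One way to check that `z` is responsible for this association is to remove the part of `x` and the part of `y` that `z` explains and then see whether anything remains. The sketch below is an illustration under assumptions, not a method taken from this chapter: the helper `residuals` is a hypothetical name, the seed is arbitrary, and the sample size is raised to 10,000 so the estimates are stable. It compares the raw correlation between `x` and `y` with the correlation after adjusting both for `z`.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, local to this sketch
n = 10_000  # larger than the 100 points above, to reduce sampling noise

z = rng.normal(size=n)
x = 2 * z + rng.normal(size=n)
y = z + 4 + rng.normal(size=n)


def residuals(v, z):
    """Return what is left of v after subtracting its least-squares line on z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)


# Correlation in the raw data, which mixes in the influence of z
raw_r = np.corrcoef(x, y)[0, 1]

# Correlation once the linear effect of z is removed from both variables
adjusted_r = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f"raw correlation      = {raw_r:.2f}")
print(f"adjusted correlation = {adjusted_r:.2f}")
```

If `z` fully accounts for the association, the adjusted correlation should be close to zero while the raw correlation is clearly positive.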

We can use a scatterplot to understand colliding as well. We create `x` as an array of 100 random values between 0 and 1, this time from a uniform distribution. Imagine these to be probabilities of getting Disease X. We consider those with a probability greater than 50% to have developed the disease. We define `y` to be related to `x` such that those who have Disease X ($x > 0.5$) score, on average, twice as high on Measure Y. Next, we define our collider `z` to depend on both `x` and `y`. When we make a scatterplot of `x` and `y`, we see the true association between them.

In [None]:
np.random.seed(312)

x = np.random.uniform(0, 1, size=100)
y = (np.round(x + 1)) * np.random.uniform(0, 1, size=100)
z = x + y

plt.scatter(x, y)
plt.title("Scatterplot showing true association between x and y")
plt.ylabel("y")
plt.xlabel("x")
c = np.polyfit(x, y, 1)
p = np.poly1d(c)
plt.plot(x, p(x), "-")
plt.show()

However, when we condition on values of our collider by plotting only the values of `x` and `y` for which `z` is greater than 2 (which here means a high probability of disease together with a high score on Measure Y), we see the association between `x` and `y` change direction and appear to be negative.

In [None]:
import pandas as pd

dat = pd.DataFrame({"X": x, "Y": y, "Z": z})

# Condition on the collider: keep only the rows where z is greater than 2
highz = dat[dat["Z"] > 2]

plt.scatter(highz.X, highz.Y)
plt.title(
    "Scatterplot showing false association between x and y caused by conditioning on collider z"
)
plt.ylabel("y")
plt.xlabel("x")
c = np.polyfit(highz.X, highz.Y, 1)
p = np.poly1d(c)
plt.plot(highz.X, p(highz.X), "-")
plt.show()
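
The change in direction can also be checked numerically by comparing the slope of the fitted line before and after conditioning. The sketch below is a hedged illustration rather than part of the original analysis: it uses a local generator with an arbitrary seed and a much larger sample (100,000 points) so that the two slopes are not driven by chance.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, local to this sketch
n = 100_000  # far more than the 100 points above, for stable slopes

x = rng.uniform(0, 1, size=n)
y = np.round(x + 1) * rng.uniform(0, 1, size=n)
z = x + y

# Boolean mask selecting the rows where the collider is greater than 2
keep = z > 2

# Slope of the least-squares line in all of the data
slope_all = np.polyfit(x, y, 1)[0]

# Slope of the least-squares line after conditioning on the collider
slope_cond = np.polyfit(x[keep], y[keep], 1)[0]

print(f"slope in all data:          {slope_all:.2f}")
print(f"slope where z > 2:          {slope_cond:.2f}")
print(f"share of rows with z > 2:   {keep.mean():.2f}")
```

A positive slope in the full data alongside a negative slope in the conditioned subset is the pattern described above.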


Let's simulate data that suffers from non-response bias. First, we create a ground truth dataset. Imagine we are studying student satisfaction at UChicago and we send surveys to 400 students using stratified random sampling by year in school. Students are asked to rate their satisfaction on a scale of 1 to 5, with 5 being most satisfied and 1 being least satisfied. We also ask students to report their average letter grade. We assume that students with lower grades are more likely to be less satisfied and those with higher grades are more likely to be more satisfied.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt


np.random.seed(92)

grades = ["A", "B", "C", "D", "F"]
satisfaction = [1, 2, 3, 4, 5]


def get_satisfaction(grade):
    # Draw a single satisfaction score; high grades skew toward satisfied
    if grade in ("A", "B"):
        return np.random.choice(satisfaction, p=[0.05, 0.05, 0.2, 0.3, 0.4])

    # Low grades skew toward unsatisfied
    if grade in ("D", "F"):
        return np.random.choice(satisfaction, p=[0.25, 0.35, 0.25, 0.1, 0.05])

    # Remaining grade (C)
    return np.random.choice(satisfaction, p=[0.05, 0.1, 0.25, 0.35, 0.25])


get_satisfaction_vec = np.vectorize(get_satisfaction)

# Draw 400 grades with the given probabilities for A through F
student_grades = np.random.choice(grades, 400, p=[0.35, 0.3, 0.25, 0.05, 0.05])

student_satisfaction = get_satisfaction_vec(student_grades)

student_survey = pd.DataFrame(
    {
        "student_grades": student_grades,
        "student_satisfaction": student_satisfaction,
    }
)

plt.hist(student_survey.student_satisfaction, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5])
plt.show()

Overall, there are more students who are satisfied with UChicago than students who are unsatisfied. However, it is unlikely that all students who are sent the survey will complete it. Assume those with stronger opinions on UChicago are more likely to respond to the survey: students who rate their satisfaction 1 respond with 95% likelihood, and those who rate it 5 respond with 80% likelihood. Students are busy with schoolwork, so those with less strong opinions respond with only 50% likelihood.

In [None]:
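def get_response(sat):
    # Students with the strongest opinions are the most likely to respond
    if sat == 1:
        return np.random.choice([True, False], p=[0.95, 0.05])

    if sat == 5:
        return np.random.choice([True, False], p=[0.8, 0.2])

    # Everyone else responds about half of the time
    return np.random.choice([True, False], p=[0.5, 0.5])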


get_response_vec = np.vectorize(get_response)

student_survey["response"] = get_response_vec(student_survey.student_satisfaction)

student_survey_biased = student_survey[student_survey["response"]]


plt.hist(
    student_survey_biased.student_satisfaction,
    bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
)

plt.show()

Non-response bias changes the distribution of satisfaction scores. Because students who would answer 1 or 5 are more likely to respond, those scores make up a larger share of the responses than they do among all of the students surveyed.
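
The size of this distortion can be worked out without any random draws. The sketch below is an illustration under stated assumptions: `true_share` is a hypothetical distribution of satisfaction scores chosen for the example (it is not the one produced by the simulation), while `response_rate` uses the response probabilities from the simulation above. Multiplying the two and rescaling gives the share of each score we would expect to see among respondents.

```python
import numpy as np

scores = np.array([1, 2, 3, 4, 5])

# Hypothetical true share of students giving each score (an assumption)
true_share = np.array([0.10, 0.15, 0.25, 0.25, 0.25])

# Response probabilities from the simulation: 95% for 1, 80% for 5, 50% otherwise
response_rate = np.array([0.95, 0.50, 0.50, 0.50, 0.80])

# Fraction of all surveyed students who both give a score and respond
responding = true_share * response_rate

# Rescale so the shares among respondents sum to 1
observed_share = responding / responding.sum()

for s, t, o in zip(scores, true_share, observed_share):
    print(f"score {s}: true share {t:.3f}, share among respondents {o:.3f}")
```

Under these assumptions, scores of 1 and 5 take up a larger share among respondents than in the full group, and the middle scores take up a smaller share.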

