
<div align="center">

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangtgao/DS-UA_201-Causal-Inference-Spring-2025/blob/main/labs/1-Introduction.ipynb)

</div>






$$
\begin{array}{c}
\textbf{CAUSAL INFERENCE}\\\\
\textbf{Zichao Zhang} \\
\textit{Center for Data Science, New York University} \\\\
\textit{January 30, 2026}\\\\\\
\text{Materials prepared by: Daniela Pinto Veizaga, Xiang Pan, Xiang Gao, and Zichao Zhang}
\end{array}
$$

---

# Introduction

![Causality](https://simons.berkeley.edu/sites/default/files/styles/hero_xxl_1x/public/2023-01/Causality_hi-res.jpg?h=6dcb57f1&itok=5R0Da6OT.png?raw=true)




## Goals for today

- Review of setup and tools for running code.
- Differentiate descriptive from causal questions.
- Introduce the all-causes model.
- Identify and estimate causal parameters with simulated data.
---



## Review of setup and tools for running code.

+ Tutorial to install Python, available [here](https://realpython.com/installing-python/).

+ Tutorial to install Jupyter [here](https://jupyter.org/install).

+ Tutorial to work with [NYU HPC](https://sites.google.com/nyu.edu/nyu-hpc).


+ Easiest way for now: [Google Colab](https://colab.research.google.com/drive/16pBJQePbqkz3QFV54L4NIkOn1kwpuRrj).



## The effect of class attendance on grades

In [8]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

A professor is on a mission to find out whether attending class regularly really gives students that extra grade boost. He's not just interested in *describing* what's happening; he wants to figure out the *effect* of attendance on grades. **How do we go from simply observing grades to truly understanding *what causes* better performance?**


### Notation
Let

$$Y_i(s, u)$$

represent the outcome of interest: grades for student $i$ under potential state $s$ and given all other factors $u$.
- **$s$**: A potential state of the world, such as regularly attending classes ($s = 1$) or attending irregularly ($s = 0$).
- **$u$**: All other factors that might influence grades, like study habits, prior knowledge, or motivation.
- **$Y_i(1, u)$** would be student $i$'s grade if they attend class regularly (with all other personal factors being $u$), and **$Y_i(0, u)$** would be their grade if they don't attend regularly. Each student has these two potential outcomes: the grade if they attend, and the grade if they don't.
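
As a minimal sketch of this notation, the function below treats the grade as a deterministic function of the state $s$ and the other factors $u$. The function name `potential_grade`, the additive form, the single-number summary of $u$, and the 10-point boost are all illustrative assumptions, not part of the model itself.

```python
def potential_grade(s, u, boost=10.0):
    """Hypothetical all-causes model: grade as a function of state s and factors u.

    s     -- 1 for regular attendance, 0 for irregular attendance
    u     -- a single number summarizing all other factors (an assumption here)
    boost -- assumed effect of regular attendance, in grade points
    """
    return u + boost * s


# One student with u = 72 has two potential outcomes.
u_i = 72.0
y1 = potential_grade(1, u_i)  # grade if the student attends regularly
y0 = potential_grade(0, u_i)  # grade if the student attends irregularly
print(f"Y_i(1, u) = {y1}, Y_i(0, u) = {y0}, individual effect = {y1 - y0}")
```

Holding $u$ fixed while switching $s$ is what makes the difference between the two values a causal effect for that student.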






### Descriptive Analysis


Let us simulate grades based on a normal distribution. This allows us to later analyze how attendance impacts these grades.



In [9]:
# Setting a random seed for reproducibility
np.random.seed(42)

# Parameters
students = 1000
mean_grade = 75
std_grade = 12

In [None]:
# Simulate attendance and grades
grades = np.random.normal(mean_grade, std_grade, students)
attendance = np.random.binomial(1, 0.6, students)
grades = pd.DataFrame({
    'Grade': grades,
    'Attendance': attendance
})

# Plot the distribution of simulated grades
plt.figure(figsize=(6, 6))
sns.kdeplot(grades['Grade'])
plt.xlabel('Grade')
plt.show()


Let’s start by examining the current state of the world, focusing on a straightforward descriptive question: **What is the average grade in this course?** Notice, here we are not distinguishing between different attendance patterns — our goal is to simply describe the grades as they are:

- **_Current State ($s$):_** This represents the actual, observed state of the world where students have varied levels of attendance, but we are not yet separating them into “regular” or “irregular” attendance categories.
  
- **_Outcome ($Y_i(s, u)$):_** The grade $Y_i$ for student $i$ given their current attendance pattern $s$ and other influencing factors $u$.

To calculate the average grade, we observe:

$$
\mathbb{E}[Y] = \frac{1}{N} \sum_{i=1}^N Y_i(s, u)
$$


where:
- $N$ is the number of students.
- $s$ is the current attendance status for each student (which varies but is not yet isolated into specific patterns).
- $u$ includes other relevant factors like study habits, prior knowledge, and motivation.

In [None]:
grades.head(5)

In [None]:
print(f"Average grade in the course is {grades['Grade'].mean():.2f}.")

**Can we now answer what is the effect of attendance on grades? Not yet! Why?** Even though we've got some insights from our descriptive analysis, there's a big issue we can't overlook: we're only seeing each student in **one attendance scenario** — either regular or irregular.

We have no idea what would happen to their grades if they switched their attendance pattern! This missing information, **the counterfactual**, is exactly what makes causal inference so tricky. Never being able to observe both outcomes for the same student is known as **the fundamental problem of causal inference**.
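
The missing-data nature of the problem can be sketched with a tiny simulation. The block below is self-contained and separate from the cells above; the names `y0`, `y1`, `d`, the seed, and the +10 effect are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
n = 5

# Inside a simulation, both potential outcomes exist...
y0 = rng.normal(75, 12, n)   # grade under irregular attendance
y1 = y0 + 10                 # grade under regular attendance (assumed +10)
d = rng.binomial(1, 0.5, n)  # attendance that actually happened

# ...but an analyst only sees the outcome for the state that occurred.
observed = pd.DataFrame({
    "Attendance": d,
    "Observed_Grade": np.where(d == 1, y1, y0),
    "Y1_seen": np.where(d == 1, y1, np.nan),  # NaN marks the missing counterfactual
    "Y0_seen": np.where(d == 0, y0, np.nan),
})
print(observed.round(1))
```

Every row has exactly one `NaN` among `Y1_seen` and `Y0_seen`: that empty entry is the counterfactual.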


### From Descriptive to Causal Analysis in a Perfect World (For Now)

Assumptions:

1. **Perfect Knowledge of Both States:** We can observe every student's grade under both regular attendance, $Y_i(1, u)$, and irregular attendance, $Y_i(0, u)$, something that's impossible in the real world.
2. **No Confounding Factors:** No other factor influences both attendance and grades, which may not be true in reality.
3. **Random Assignment:** Attendance can be simulated as being randomly assigned, avoiding biases often seen in real-world settings.

Given our “perfect world”, **Average Treatment Effect (ATE)** is defined as:

$$
\text{ATE} = \mathbb{E}[Y_i(1, u)] - \mathbb{E}[Y_i(0, u)].
$$
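
Why random assignment matters can be sketched in two lines. Here $D_i$ denotes the attendance actually observed and $Y_i = D_i\,Y_i(1, u) + (1 - D_i)\,Y_i(0, u)$ the observed grade; both symbols are introduced only for this illustration. If $D_i$ is independent of the potential outcomes, then:

$$
\begin{aligned}
\mathbb{E}[Y_i \mid D_i = 1] - \mathbb{E}[Y_i \mid D_i = 0]
&= \mathbb{E}[Y_i(1, u) \mid D_i = 1] - \mathbb{E}[Y_i(0, u) \mid D_i = 0] \\
&= \mathbb{E}[Y_i(1, u)] - \mathbb{E}[Y_i(0, u)] = \text{ATE}.
\end{aligned}
$$

The first equality uses the definition of the observed grade; the second uses independence, which lets the conditioning on $D_i$ drop out.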


In [13]:
# Simulations
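attend_effect = 10  # attending class gives a real boost of +10 points to a student's grade
attendance = np.random.binomial(1, 0.5, students)
baseline = grades['Grade']  # keep the `grades` DataFrame intact so this cell can be re-run
# In the perfect world we know both potential outcomes for every student,
# so the boost is added for everyone, not only for those who attended.
grades_attend = baseline + attend_effect  # Y_i(1, u)
grades_nattend = baseline                 # Y_i(0, u)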

# Create DataFrame
attend = pd.DataFrame({
    'Student_ID': np.arange(1, students + 1),
    'Attendance': attendance,
    'Grade_With_Attendance': grades_attend,
    'Grade_Without_Attendance': grades_nattend
})


In [None]:
attend.head(5)

In [None]:
attend.describe()

In [None]:
# Calculate average grades for each state
avg_reg = attend['Grade_With_Attendance'].mean()
avg_irr = attend['Grade_Without_Attendance'].mean()

# Calculate the causal effect
ate = avg_reg - avg_irr

# Output results
print(f"Average grade with regular attendance: {avg_reg:.2f}")
print(f"Average grade with irregular attendance: {avg_irr:.2f}")
print(f"Regular attendance improves grades by {ate:.2f} points.")



In [None]:
# Plotting
labels = ['Regular Attendance', 'Irregular Attendance']
values = [avg_reg, avg_irr]
plt.figure(figsize=(6, 6))
sns.barplot(
    x=labels, y=values, hue=labels, palette = ['#c98389','#5c9ba4'], legend=False
    )
plt.title('Effect of Regular Attendance on Grades', fontsize=16)
plt.ylabel('Average Grade', fontsize=14)
plt.ylim(0, 100)
plt.annotate(
    f'ATE: {ate:.2f} points', xy=(0.95, avg_reg), xytext=(1.15, avg_reg + 10)
    )
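
Outside the perfect world, only one grade per student is observed. As a hedged sketch, the block below builds its own small simulation (the names `y0`, `y1`, `d`, `y_obs`, the sample size, and the seed are assumptions for illustration, separate from the cells above). It shows that, when attendance is randomly assigned, the difference in observed group means lands close to the true effect of 10 points.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
true_effect = 10

y0 = rng.normal(75, 12, n)        # potential grade without regular attendance
y1 = y0 + true_effect             # potential grade with regular attendance
d = rng.binomial(1, 0.5, n)       # random assignment, independent of y0 and y1
y_obs = np.where(d == 1, y1, y0)  # only one potential outcome is observed

# Difference in observed means between attendees and non-attendees
ate_hat = y_obs[d == 1].mean() - y_obs[d == 0].mean()
print(f"Estimated ATE from observed data: {ate_hat:.2f}")
```

The estimate is not exactly 10 because the two groups differ by chance in their baseline grades; the gap shrinks as the number of students grows.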


## Probability

### Probability Mass and Density Functions

In [18]:
def pmf_die(x):
    """
    Probability Mass Function
    for a fair six-sided die.
    """
    if x in [1, 2, 3, 4, 5, 6]:
        return 1/6
    else:
        return 0

def pdf_gaussian(x, mu, sigma):
    """Probability Density Function
    for a Gaussian distribution.
    """
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

In [None]:
x = np.linspace(-5, 5, 1000)
y = pdf_gaussian(x, 0, 1)
plt.plot(x, y)
plt.title('Gaussian Probability Density Function')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.show()
x = np.linspace(-5, 5, 1000)

In [None]:
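y = pdf_gaussian(x, 0, 1)
# A raw cumulative sum of density values is not a probability:
# each value still has to be weighted by the grid spacing, so the
# final value here is far from 1.
y = np.cumsum(y, axis=0)
print(f"Final cumulative value: {y[-1]}")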

In [None]:
# Calculate PDF
y = pdf_gaussian(x, 0, 1)

# Calculate CDF by cumulative sum and normalize
y_cdf = np.cumsum(y) / np.sum(y)

print(f"Final cumulative value: {y_cdf[-1]}")

# Plot CDF
plt.plot(x, y_cdf)
plt.title('Gaussian Cumulative Distribution Function')
plt.xlabel('Value')
plt.ylabel('Cumulative Probability')
plt.show()
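
The normalization above forces the last value to 1. An alternative sketch approximates the integral directly with a Riemann sum, multiplying each density value by the grid spacing `dx`, and compares the result with the closed-form standard normal CDF written with `math.erf`. The helper name `cdf_exact` is an assumption for illustration.

```python
import math
import numpy as np

x = np.linspace(-5, 5, 1000)
dx = x[1] - x[0]                                  # grid spacing
pdf = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)  # standard normal density

# Riemann-sum approximation of the CDF: running sum of density * width
cdf_riemann = np.cumsum(pdf) * dx


def cdf_exact(value):
    """Closed-form standard normal CDF, using the error function."""
    return 0.5 * (1 + math.erf(value / math.sqrt(2)))


exact = np.array([cdf_exact(v) for v in x])
max_gap = np.max(np.abs(cdf_riemann - exact))
print(f"Total area under the curve: {cdf_riemann[-1]:.4f}")
print(f"Largest gap to the exact CDF: {max_gap:.4f}")
```

Weighting by `dx` keeps the units right (density times width is probability), so no after-the-fact normalization is needed.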

### Functions of random variables

For a continuous random variable $X$ with density $f$, the expected value of $g(X)$ is:

\begin{align}
E[g(X)]=\int_{-\infty}^{\infty} g(x) f(x) d x
\end{align}
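
As a quick numerical illustration of this formula, the sketch below takes $g(x) = x^2$ and a standard normal density $f$, for which $E[g(X)] = 1$. It approximates the integral on a grid and, separately, by averaging $g$ over simulated draws; the grid range, sample size, and seed are arbitrary choices.

```python
import numpy as np


def g(x):
    return x ** 2


# Numerical integration of g(x) * f(x) on a grid
x = np.linspace(-10, 10, 20_001)
dx = x[1] - x[0]
f = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)  # standard normal density
integral = np.sum(g(x) * f) * dx

# Monte Carlo: average g over draws of X
rng = np.random.default_rng(0)
draws = rng.standard_normal(200_000)
monte_carlo = g(draws).mean()

print(f"Grid integral: {integral:.4f}")
print(f"Monte Carlo:   {monte_carlo:.4f}")
```

Both numbers should sit close to 1; the Monte Carlo value carries sampling noise, while the grid value is limited only by the grid resolution.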

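### Variance

The variance of $X$ is $E\left[(X-E[X])^2\right]$. Expanding the square and using linearity of expectation gives: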

\begin{align}
E\left[(X-E[X])^2\right] & =E\left[X^2-2 X E[X]+E[X]^2\right] \\
& =E\left[X^2\right]-E[2 X E[X]]+E[X]^2 \\
& =E\left[X^2\right]-2 E[X] E[X]+E[X]^2 \\
& =E\left[X^2\right]-E[X]^2
\end{align}
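
The identity can be checked exactly on a fair six-sided die, where every face has probability $1/6$. This sketch uses Python's `fractions.Fraction` to avoid rounding; the variable names are illustrative.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face

mean = sum(p * x for x in faces)                 # E[X]
second_moment = sum(p * x ** 2 for x in faces)   # E[X^2]

var_definition = sum(p * (x - mean) ** 2 for x in faces)  # E[(X - E[X])^2]
var_shortcut = second_moment - mean ** 2                  # E[X^2] - E[X]^2

print(mean, second_moment, var_definition, var_shortcut)
# prints: 7/2 91/6 35/12 35/12
```

Both routes give $35/12$, matching the first and last lines of the derivation.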

## Path Forward

1. Dive deeper into the assumptions we've made and understand how to relax or verify them.
2. Explore statistical methods to estimate causal effects in more complex, real-world data settings.
3. Investigate how to identify and handle confounding variables using techniques like matching, regression, and instrumental variables.

Moving from simple descriptive analysis to robust causal inference!
