# Week 6 Overview
Understanding **treatment effects** is central to causal inference. 

A **treatment effect** captures the difference an intervention makes—such as how much a medication lowers blood pressure or how access to a program affects college admissions. But, because we can never observe both the treated and untreated outcomes for the same individual, we rely on models like regression or matching to estimate what would have happened in the counterfactual scenario.  

This week introduces the concept of the **average treatment effect (ATE)** and its variations, including: 
- the **average treatment effect on the treated (ATT)**, 
- **conditional average treatment effect (CATE)**, and others. 

You'll also explore how treatment effects can be adapted for specific subgroups, prioritized based on variance, or interpreted in light of real-world complexities like imperfect compliance.

## Learning Objectives
At the end of this week, you will be able to: 
- Calculate treatment effects such as ATE, ATT, and ATUT (average treatment effect on the untreated). 
- Explain why treatment effects are defined in a particular way. 
- Choose the right treatment effect for a given situation. 


## Topic Overview: Foundations of Treatment Effects 
This section lays the groundwork for understanding treatment effects by focusing on three foundational measures: 
- The **average treatment effect (ATE)**, 
- The **average treatment effect on the treated (ATT)**, and
- The **average treatment effect on the untreated (ATUT)**

Each of these captures how an outcome would change if a treatment were applied or withheld, but they differ in whose outcomes are being averaged. 

To estimate these effects, we use techniques like regression or matching to model the counterfactual — what would have happened to each individual under the opposite treatment condition. 

By comparing observed outcomes to these modeled counterfactuals, we can estimate the effect of a treatment across the full sample (ATE), just those who were treated (ATT), or just those who were not (ATUT). 

Understanding the distinctions between these measures is critical for designing studies and interpreting causal results in real-world applications.

### Learning Objectives 
- Calculate treatment effects such as ATE, ATT, and ATUT. 
- Explain why treatment effects are defined in a particular way. 

## 1.1 Lesson: Foundations of Treatment Effects: ATE, ATT, and ATUT
In this video, you’ll learn how ATE, ATT, and ATUT help us estimate the impact of a treatment by comparing actual outcomes to modeled counterfactuals across different groups.

### Average Treatment Effect
The analysis estimates the causal impact of deer presence on flower growth using three key measures:

⸻

1. Average Treatment Effect (ATE):
- Formula: $\text{ATE} \; = \; E[Y(1)] - E[Y(0)]$
    - $E[Y(1)]$ is the expected outcome if every sample received the treatment.
    - $E[Y(0)]$ is the expected outcome if no sample received the treatment.
    - In simpler terms, it's the average difference between each sample's treated and untreated outcomes, where one of the two is observed and the other is a modeled counterfactual.
- Compares flower growth across all samples, using both observed and modeled (counterfactual) outcomes.
- With deer (Deer = 1): average flower growth = 0.25
- Without deer (Deer = 0): average flower growth = 2.5
- ATE = 0.25 - 2.5 = -2.25
- Interpretation: On average, the presence of deer reduces flower growth by 2.25 flowers.

⸻

2. Average Treatment Effect on the Treated (ATT):
- Formula: $\text{ATT} \; = \; E[Y(1) - Y(0) | D=1]$
    - $Y(1)$ is the potential outcome if an individual receives the treatment.
    - $Y(0)$ is the potential outcome if an individual does not receive the treatment.
    - $D = 1$ indicates that the individual actually received the treatment.
    - In simpler terms: The ATT measures the average impact of the treatment on the specific group of people who actually received it. 
- Focuses only on environments where deer were present.
- Actual outcomes: average flower growth = 1
- Estimated counterfactual (if no deer): average = 3.5
- ATT = 1 - 3.5 = -2.5
- Interpretation: For places that had deer, flower growth was reduced by 2.5 flowers on average compared to if deer hadn’t shown up.

⸻

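3. Average Treatment Effect on the Untreated (ATUT):
- Formula: $\text{ATUT} \; = \; E[Y(1) - Y(0) | D=0]$, the same quantity as the ATT but averaged over individuals who did not receive the treatment ($D = 0$).
- Focuses only on environments where deer were absent.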
- Actual outcomes: average flower growth = 1.5
- Estimated counterfactual (if deer had been present): average = -0.5
- ATUT = -0.5 - 1.5 = -2
- Interpretation: For places that didn’t have deer, having them would have reduced flower growth by 2 flowers on average.
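
The three numbers above fit together: the ATE is the average of the ATT and ATUT, weighted by the share of treated and untreated samples. The short check below is only a sketch. It assumes that half of the environments had deer (`share_treated = 0.5`), which the lesson does not state but which is the share under which the quoted averages reconcile.

```python
# Averages quoted in the lesson
att = -2.5    # effect among environments with deer
atut = -2.0   # effect among environments without deer

# ASSUMPTION (not stated in the lesson): half of the environments had deer.
share_treated = 0.5

# The ATE is the group-size-weighted average of the ATT and the ATUT.
ate = share_treated * att + (1 - share_treated) * atut

# The "with deer" and "without deer" averages combine factual and counterfactual means.
mean_with_deer = share_treated * 1.0 + (1 - share_treated) * (-0.5)
mean_without_deer = share_treated * 3.5 + (1 - share_treated) * 1.5

print(ate, mean_with_deer, mean_without_deer)
```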

⸻

### ATE, ATT, and ATUT
This section builds on previous ATE/ATT/ATUT estimates by introducing heterogeneous treatment effects using an interaction term between pesticides (P) and deer (D) in the model.

⸻

1. Why Interaction Terms Matter
- In the earlier model, the treatment effect of deer was constant across all samples.
- Now, a more realistic scenario is modeled: the impact of deer depends on pesticide levels.
- For example, pesticides may reduce the damage deer cause, acting as a buffer.
- To capture this dependency, the model adds an interaction term $F = 5 - 2D - P + 0.25(D \times P)$
- Where F is flower growth, D is presence of deer, P is pesticide level.
- The D × P term adds a small positive effect only when both deer and pesticides are present.
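
As a minimal sketch, the model can be written as a function and used to compute the effect of deer at different pesticide levels. The function names and the pesticide levels `0`, `2`, and `4` are illustrative choices, not values from the lesson.

```python
def flower_growth(deer, pesticide):
    """Model from the lesson: F = 5 - 2D - P + 0.25(D x P)."""
    return 5 - 2 * deer - pesticide + 0.25 * deer * pesticide


def deer_effect(pesticide):
    """Effect of deer at a given pesticide level: F(D=1) - F(D=0)."""
    return flower_growth(1, pesticide) - flower_growth(0, pesticide)


# Illustrative pesticide levels (hypothetical, not the lesson's data)
for p in [0, 2, 4]:
    print(f"P = {p}: effect of deer = {deer_effect(p)}")
```

Algebraically, the effect is $-2 + 0.25P$, so it becomes less negative as the pesticide level rises.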

⸻

2. Average Treatment Effect (ATE) with Interaction
- Using the updated model:
- $ATE = (3.75 - 9) / 4 = -1.3125$
- This is less negative than before because the interaction term mitigates deer damage when pesticide levels are higher.
- Interpretation: On average, deer still reduce flower growth, but the harm is slightly softened in high-pesticide conditions.

⸻

3. Average Treatment Effect on the Treated (ATT)
- $ATT = (3 - 7) / 2 = -2$
- More negative than the ATE. Why?
- The treated samples had lower pesticide levels, so they benefited less from the interaction term.
- Lower D × P ⇒ weaker protection ⇒ greater flower loss.

⸻

4. Average Treatment Effect on the Untreated (ATUT)
- ATUT = -0.625, much less negative than ATT.
- Untreated environments had higher pesticide levels.
- When we imagine adding deer, the D × P interaction term offsets more of the damage.
- Interpretation: The presence of deer would have hurt untreated environments less, because those environments had more pesticide protection.

⸻

Key Insight:
- Interaction terms like D × P introduce heterogeneous treatment effects:
- ATT, ATE, and ATUT can diverge significantly.
- The treatment effect varies depending on background conditions (like pesticide levels).
- This model shows how real-world causal effects are often context-dependent—not all groups are affected the same way.

A treatment effect tells us the effect of a particular treatment on an outcome. If the treatment variable is binary, this typically takes the form of (the outcome variable when the treatment = 1) - (the outcome variable when the treatment = 0). However, it’s complicated — there can be many ways to interpret this equation. 
___
### Average Treatment Effect
Here, we must build a model that is capable of predicting the outcome for a given sample “as if” it was treated or untreated. 

For example: 

Suppose we use linear regression to model a blood pressure outcome ($Y$) based on whether the person received a medication treatment ($X$), as well as two other variables:
- Their age ($Z$), and 
- Their initial blood pressure ($W$). 

We need the confounders $Z$ and $W$ because they have a causal relationship with both $X$ and $Y$. Then, for a given sample with $X = 0$ and a given $Z$ and $W$, we can predict $Y$ for $X = 1$ and the same $Z$ and $W$. Or, if $X = 1$, we can predict $Y$ for $X = 0$ and the same $Z$ and $W$. So, we know (or predict) what “would have happened” if an untreated item had been treated or if a treated item had been untreated.
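
The sketch below illustrates this with NumPy. All data are synthetic and the coefficients used to generate them (for example, a true medication effect of `-8`) are made up for illustration. The helper name `design` is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic (made-up) data: age Z, initial blood pressure W, treatment X, outcome Y.
Z = rng.uniform(30, 70, n)
W = rng.uniform(110, 160, n)
X = (rng.uniform(0, 1, n) < (Z - 30) / 40).astype(float)  # older patients are treated more often
Y = 20 - 8 * X + 0.3 * Z + 0.7 * W + rng.normal(0, 2, n)


def design(x, z, w):
    """Design matrix with an intercept column followed by X, Z, and W."""
    return np.column_stack([np.ones_like(z), x, z, w])


# Fit Y ~ 1 + X + Z + W by ordinary least squares on the whole sample.
beta, *_ = np.linalg.lstsq(design(X, Z, W), Y, rcond=None)

# Counterfactual prediction: flip each sample's treatment, keep its Z and W.
Y_counterfactual = design(1 - X, Z, W) @ beta

print("Estimated coefficient of X:", round(beta[1], 2))
print("Sample 0: X =", X[0], "| factual Y =", round(Y[0], 1),
      "| predicted counterfactual Y =", round(Y_counterfactual[0], 1))
```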
___
#### Suppose we use matching:
- A given sample with $X = 0$ and a given $Z$ and $W$ is matched to another sample with $X = 1$ and a similar $Z$ and $W$. 

Then, we can estimate "what would have happened" if an untreated item had been treated or if a treated item had been untreated. We can then estimate the difference in $Y$ values if the "same" item were treated vs. untreated (where, in practice, "same item" means "matched item").
___
#### In the case of regression:
- We use the whole sample to build the regression model.
- We then use the model to predict the counterfactual outcome if the whole sample were treated or not treated.
- In the case of matching, by contrast, we use a matching relationship between treated and untreated samples to get the counterfactual outcome. 
- If we want to know "what would have happened" (a counterfactual) if a treated item were not treated, we simply assume that it would be like the nearest untreated item.
___
#### Digression: 
(We should probably standardize the data in order to define “nearest” properly. If the range of one covariate, say $Z$, is $0$ to $1$, and the range of $W$ is $0$ to $100$, we do not want to treat them the same; otherwise, only $W$ will matter for our matching. We want to standardize the matching covariates $Z$ and $W$ first, then match.)

The difference between the treated sample’s $Y$ and the untreated sample’s $Y$ — where one is real and the other counterfactual — is then averaged over all samples. This gives us the ATE. 

For treated samples, we use the treated (factual) minus the untreated (counterfactual). For untreated samples, we use the treated (counterfactual) minus the untreated (factual).
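
A minimal sketch of this matching procedure follows. The six samples are hypothetical, nearest-neighbor matching on Euclidean distance is just one possible choice of matching rule, and the helper name `matched_outcome` is an assumption.

```python
import numpy as np

# Hypothetical data: treatment X, covariates Z (age) and W (initial blood pressure), outcome Y.
X = np.array([1, 1, 1, 0, 0, 0])
Z = np.array([50.0, 60.0, 40.0, 52.0, 61.0, 38.0])
W = np.array([140.0, 150.0, 130.0, 142.0, 149.0, 128.0])
Y = np.array([120.0, 128.0, 115.0, 131.0, 139.0, 122.0])

# Standardize the matching covariates so that neither dominates the distance.
covariates = np.column_stack([Z, W])
standardized = (covariates - covariates.mean(axis=0)) / covariates.std(axis=0)


def matched_outcome(i):
    """Outcome of the nearest sample with the opposite treatment status."""
    candidates = np.flatnonzero(X != X[i])
    distances = np.linalg.norm(standardized[candidates] - standardized[i], axis=1)
    return Y[candidates[np.argmin(distances)]]


counterfactual = np.array([matched_outcome(i) for i in range(len(Y))])

# Treated: factual minus counterfactual. Untreated: counterfactual minus factual.
effects = np.where(X == 1, Y - counterfactual, counterfactual - Y)
ate = effects.mean()
print("Per-sample effects:", effects)
print("ATE:", ate)
```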

Thus, if sample $i$ is in the treatment group, we know $Y(\text{i, treated})$, but we want to use a regression model or matching to predict $Y(\text{i, untreated})$ in order to find the difference. If sample $i$ is in the untreated group, we know $Y(\text{i, untreated})$, but we want to use a regression model or matching to predict $Y(\text{i, treated})$.

We could also use a regression model to find a counterfactual for both, which would have the advantage of being consistent in using the same model for both cases. More commonly, we would use the true value in one of the cases, which would be more accurate in that case. (The former, model-only treatment effect would likely be interesting only if a regression model has polynomial or interaction terms. If the model has only linear terms, then the model-only effect is trivially the same for all data points; it is just the coefficient of $X$ in the model.) 
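
To see why the model-only effect is constant for a purely linear model, here is a short derivation. The coefficient names $\beta_0, \dots, \beta_4$ are notation introduced for this sketch. With only linear terms:

$$\hat{Y}(X, Z, W) = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 W \quad\Rightarrow\quad \hat{Y}(1, Z, W) - \hat{Y}(0, Z, W) = \beta_1$$

If we add an interaction term $\beta_4 (X \times Z)$, the model-only effect depends on the sample's covariates:

$$\hat{Y}(1, Z, W) - \hat{Y}(0, Z, W) = \beta_1 + \beta_4 Z$$

This is the same pattern as in the deer and pesticide model, where the effect of deer is $-2 + 0.25P$.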

But whatever we do, we certainly can’t use a factual value for both because we never know for sure what would happen if a given treated sample was untreated or if an untreated sample was treated. 

This idea, like the others here, assumes that treatment is binary so that it is meaningful to talk about the sample being “treated” or “not treated” as opposed to treatment being a continuous number. Thus, if the treatment is that a patient takes a dose of medication between 10 mg and 100 mg, where all patients get a dose, then the average treatment effect cannot be calculated in this way. We can’t identify which patients were treated and which were untreated because they were all treated. 

___
### Average Treatment Effect on the Treated (ATT)
This is the same as the average treatment effect, except that we only average over the treated group. We still need the untreated group to help construct the model, but we only use $Y(\text{i, treated}) - Y(\text{i, untreated})$ for $i$ in the treatment group, using the regression or matching model to compute $Y(\text{i, untreated})$. We already know $Y(\text{i, treated})$ for the treatment group. 

### Average Treatment Effect on the Untreated (ATUT)
This is as above, but we only use the untreated group. We again compute $Y(\text{i, treated}) - Y(\text{i, untreated})$ for $i$ in the untreated group, using the model to compute $Y(\text{i, treated})$. We already know $Y(\text{i, untreated})$ for the untreated group. 
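
Once each sample has both a treated and an untreated outcome (one factual, one counterfactual), the three effects differ only in which samples are averaged. The sketch below uses a small illustrative data set of two treated and two untreated items; the array names are assumptions.

```python
import numpy as np

# Two treated items followed by two untreated items (illustrative values).
treated = np.array([True, True, False, False])

# Y(i, treated): factual for treated items, counterfactual for untreated items.
y_treated = np.array([3.0, 4.0, -3.0, 4.0])
# Y(i, untreated): counterfactual for treated items, factual for untreated items.
y_untreated = np.array([0.0, 2.0, 1.0, 2.0])

effects = y_treated - y_untreated

ate = effects.mean()             # average over all samples
att = effects[treated].mean()    # average over treated samples only
atut = effects[~treated].mean()  # average over untreated samples only

print(f"ATE = {ate}, ATT = {att}, ATUT = {atut}")
```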

## Topic 2: Extensions and Special Cases of Treatment Effects
Once you understand the basic framework of average treatment effects, you can begin to ask more nuanced questions: 
- Does the treatment work better for certain subgroups? 
- Who should be treated next? 
- Can we trust our estimates equally across all contexts? 

This section explores those deeper layers through a set of extended treatment effect measures. 

- You’ll learn how the **conditional average treatment effect (CATE)** allows you to examine effects within specific subpopulations. 
- The **marginal treatment effect** helps prioritize treatment decisions based on potential benefit. 

You’ll also explore how weighting treatment effects — by variance or distribution — can improve interpretability and robustness and how the **local average treatment effect (LATE)** isolates effects among individuals who comply with their treatment assignment. 

These special cases add flexibility and precision to your causal analysis toolkit. 


### Learning Objectives 
- Choose the right treatment effect for a given situation.

## 2.1 Lesson: Alternative Treatment Effects: CATE, Marginal, Weighted, and Local Treatment Effect

### Conditional Average Treatment Effect (CATE)
This uses some other subgroup — not the treatment or untreated group. It could be any group. 

For example: 
- If your samples are managers, you could use managers with five years of experience or less as your group. 
    - You’d then compute $Y(\text{i, treated}) - Y(\text{i, untreated})$ for samples $i$ in this group of managers. 
    - As before, you must use some model, such as regression or matching, to compute the counterfactuals. 
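
As a minimal sketch of the manager example, suppose the treated and untreated outcomes (factual or modeled) are already available for six managers. All numbers and variable names below are hypothetical.

```python
import numpy as np

# Hypothetical managers: years of experience and outcomes under each treatment status.
years_experience = np.array([2, 4, 5, 8, 12, 20])
y_treated = np.array([10.0, 12.0, 11.0, 9.0, 8.0, 7.0])   # factual or modeled
y_untreated = np.array([6.0, 9.0, 8.0, 8.0, 8.0, 7.5])    # factual or modeled

effects = y_treated - y_untreated

# CATE: average the per-sample effects within the subgroup of interest.
junior = years_experience <= 5
cate_junior = effects[junior].mean()
cate_senior = effects[~junior].mean()

print("CATE (five years or less):", cate_junior)
print("CATE (more than five years):", cate_senior)
```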

### Marginal Average Treatment Effect
This is the treatment effect for the next sample you’d want to treat. Here, we’re imagining that you are treating samples in a certain order. You rank the samples in terms of importance:
- The importance of a sample has to do with the benefit of treating it, which is $Y(\text{i, treated}) - Y(\text{i, untreated})$.
- Assuming you want to get the maximum possible effect, you want to choose the next sample with the maximum treatment effect $Y(\text{i, treated}) - Y(\text{i, untreated})$ taken over all untreated samples $i$.
- (There’s no point in considering treated samples; they’re already treated, so they can’t be the “next” samples to be treated.) 

### Variance Weighted Treatment Effect
This terminology is not standard, but is the term used by Huntington-Klein (2022). It means that we are going to find two treatment effects for different groups (via CATE). Then, we are going to find the variance of the treatment (0 vs. 1) in each group and weight the treatment effects by that variance. Of course, we could do this with more than two groups as well.

High-variance groups are those with an even distribution of treated and untreated individuals. Low-variance groups have a preponderance on one side or another, mostly treated or mostly untreated. The advantage of the VWTE approach is partly that high-variance groups are giving us more genuine information about the effect. We may not even trust our model about low-variance groups, as the model is based on limited data. Moreover, it may be hard, for some reason, to adjust the treatment for low-variance groups — that’s why they are low variance! But if we can’t adjust the treatment, we don’t care about the size of the effect.

We should also probably include a weight for the sample size of each group; otherwise, we might end up counting a very large group the same as a very small one. Thus, the overall effect is:

$$ \frac{(Y(\text{i, treated}) - Y(\text{i, untreated}))_1 \cdot N_1 \cdot \text{Var}_1 + (Y(\text{i, treated}) - Y(\text{i, untreated}))_2 \cdot N_2 \cdot \text{Var}_2}{N_1 \cdot \text{Var}_1 + N_2 \cdot \text{Var}_2} $$
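
The sketch below evaluates this weighted average for two hypothetical groups. It uses the fact that the variance of a binary (0 vs. 1) treatment with treated share $p$ is $p(1 - p)$, which is largest when the group is evenly split. The function name `vwte` and all input numbers are assumptions for illustration.

```python
import numpy as np


def vwte(effects, sizes, treated_shares):
    """Weight each group's treatment effect by N times the variance of treatment in that group."""
    effects = np.asarray(effects, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    shares = np.asarray(treated_shares, dtype=float)
    variances = shares * (1 - shares)  # variance of a 0/1 treatment variable
    weights = sizes * variances
    return np.sum(effects * weights) / np.sum(weights)


# Hypothetical groups:
#   group 1: effect 2.0, N = 100, 50% treated (variance 0.25)
#   group 2: effect 6.0, N = 100, 10% treated (variance 0.09)
result = vwte(effects=[2.0, 6.0], sizes=[100, 100], treated_shares=[0.5, 0.1])
print("Variance-weighted effect:", round(result, 3))
print("Unweighted average:", np.mean([2.0, 6.0]))
```

In this made-up example the weighted effect sits closer to group 1's effect than the unweighted average does, because group 1 is evenly split between treated and untreated samples.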

### Distribution Weighted Treatment Effect
When we do an average treatment effect on the treated or untreated (ATT or ATUT), this can be viewed as a distribution-weighted effect, in the sense that some values of the covariates are more likely than others. That is, if treated items have large values of $Z$ while untreated items have smaller values of $Z$, then ATT is favoring a distribution with large values of $Z$, while ATUT is favoring a distribution with small values of $Z$. To the extent that the “true” distribution (whatever that means) is something else, then these effects favor a particular distribution of covariate values that isn’t the “true” one. 

### Intent-to-Treat Effect
This is the effect of assigning treatment rather than the effect of delivering treatment. That is, it’s the effect of giving medication (which the patient may or may not consume) rather than the effect of the patient consuming the medication. Let $X_1 = 1$ when the patient takes the medication and $X_1 = 0$ when they do not, and let $X_2 = 1$ when the patient is given the medication (whether or not they take it) and $X_2 = 0$ when they are not. Then, for the purpose of intent-to-treat, we are interested in $X_2$, not $X_1$. 
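
As a rough sketch, the intent-to-treat comparison groups patients by assignment ($X_2$) and ignores whether they actually took the medication ($X_1$). The eight patients below are hypothetical, and the simple difference in means assumes that assignment was randomized; otherwise confounders would need to be modeled as in the earlier sections.

```python
import numpy as np

# Hypothetical patients. X2 = assigned the medication, X1 = actually took it.
X2_assigned = np.array([1, 1, 1, 1, 0, 0, 0, 0])
X1_took = np.array([1, 1, 0, 0, 0, 0, 0, 0])  # recorded, but not used for intent-to-treat
Y = np.array([120.0, 122.0, 135.0, 133.0, 134.0, 136.0, 135.0, 137.0])

# ASSUMPTION: assignment was randomized, so a difference in means is a fair comparison.
itt = Y[X2_assigned == 1].mean() - Y[X2_assigned == 0].mean()
print("Intent-to-treat effect:", itt)
```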

### Local Average Treatment Effect (LATE)
This is the effect counting only compliers. Compliers are those whose treatment follows their assignment: they get treated if they are assigned to treatment, and they don’t get treated if they are not assigned to treatment.

## Knowledge Check: Treatment Effect
1. The average treatment effect is most commonly found by:
- Correct: Finding the treated and untreated outcome for each sample, using counterfactuals when needed. Then, take the average over all samples.
- You cannot get the actual treated and untreated outcome for the same sample. You can get the counterfactuals for the same sample, but usually, we use one actual and one counterfactual for a given sample. Subtraction, not division, is the correct operation here. 
2. The average treatment effect on the treated (ATT) is most commonly found by:
- Correct: Finding the average treatment effect but using only treated samples in the average.
- The ATT requires applying the same approach as the ATE but specifically with treated samples.
3. The average treatment effect on the untreated (ATUT) typically uses the:
- Correct: Factuals for the untreated, counterfactuals for the treated.
- With the ATUT, we have access to the actual untreated samples, so we use the factuals for them. 
4. In VWTE, high-variance groups are weighted:
- Correct: More because they give us more information about the effect.
- A higher variance group is more evenly divided between treated and untreated samples. This means we have more information about what happens when the treatment status changes.
5. If counterfactuals are used for both treated and untreated samples, a simple linear regression will report the effect as:
- Correct: The coefficient of $X$ in the regression.
- As $X$ changes from 0 to 1, a term like $\beta \cdot X$ will change by $\beta$. So that's the effect, using counterfactuals (i.e., the regression model values) for both treated and untreated samples.


Suppose the factual outcomes ($Y$) of two treated items are $3$ and $4$, and those of two untreated items are $1$ and $2$. 

The counterfactual outcomes of the treated items are $0$ and $2$ (in that order), and those of the untreated items are $-3$ and $4$ (in that order). 


What is the marginal treatment effect?

```python
# Given data
treated_factual = [3, 4]  # Y(1) for treated units
treated_counterfactual = [0, 2]  # Y(0) for treated units
untreated_factual = [1, 2]  # Y(0) for untreated units
untreated_counterfactual = [-3, 4]  # Y(1) for untreated units

# Calculate treatment effects for untreated units
treatment_effects = []
for i, (y1, y0) in enumerate(zip(untreated_counterfactual, untreated_factual), start=1):
    effect = y1 - y0
    treatment_effects.append(effect)
    print(f"Untreated unit {i}: Y(1) - Y(0) = {y1} - {y0} = {effect}")

# Find the maximum treatment effect among untreated units
marginal_effect = max(treatment_effects)
print(f"\nMarginal treatment effect = {marginal_effect}")
```

```
Untreated unit 1: Y(1) - Y(0) = -3 - 1 = -4
Untreated unit 2: Y(1) - Y(0) = 4 - 2 = 2

Marginal treatment effect = 2
```


You only look at the untreated items and pick the one with the largest benefit:

$$\Delta_i \;=\; Y_i(1) \;-\; Y_i(0)$$

for those $i$ with $D_i=0$

1. List the untreated units
- Unit 3: $Y_3(0)=1,\quad Y_3(1)=-3$
- Unit 4: $Y_4(0)=2,\quad Y_4(1)=4$
2. Compute each untreated unit’s treatment effect

$$\Delta_3 = (-3) - 1 = -4,\qquad \Delta_4 = 4 - 2 = 2$$

3. Pick the maximum
$$\text{MATE} = \max\{\,-4,\;2\} = 2$$

So the marginal (next-to-treat) average treatment effect is $\boxed{2}$.