**Outline**
0. Be clear that we are going to study cross-sectional models and rely on unconfoundedness and exogeneity assumptions.
    1. Fits A/B experiments and where we can first apply DML
    2. Before we get into complexities of HTE, we should always first look at ATE so we can diagnose while things are simple.
1. HTE overview
    1. Understanding fine-grained (individual, segmented) treatment effects. Useful when we want to do personalized targeting.
    2. We can think of HTE as an estimated function, rather than a scalar.


* T/R/S/X Learner sources:
    * [econml meta-learner documentation](https://econml.azurewebsites.net/spec/estimation/metalearners.html)
    * [hte_tutorial](https://gsbdbi.github.io/ml_tutorial/hte_tutorial/hte_tutorial.html)



1. **model: T2X Learners**
    1. Nests T-learner and X-learner methods
    2. Really flexible but susceptible to noise
        1. *Start to see a theme that imposing structure and assumptions can help a lot*
        2. *Simulation example needed* Additional bias/noise because there is extrapolation into space without support.
    3. Refinement through GML and "filtering" methods
2. **model:  OLS**
    1. Nests R-learner
3. **model:  ML Weights: GRF, Neural Nets, and others**
    1. 
    
    
4. Things not covered
    1. Panel models
5. Fun mentions
    1. Integration with surrogate models
    2. Additional structural assumptions
    
    

## Causal Inference Crash Course
# Heterogeneous Treatment Effect Models and Inference
Julian Hsu

## Overview
* Use cases for heterogeneous treatment effects (HTE) models.    
* Additional challenges compared to non-HTE models
* Showcase a few families of HTE models:
    1. Flexible learners and "filtering" 
    2. OLS and DML
    3. ML-driven weights: GRF, Neural Nets, and others
    * An on-going theme will be the benefits of making structural or functional form assumptions
    * We will only cover cross-sectional models here for simplicity
* Conclude with fun extensions and a lot of citations.


## When do we care about heterogeneous treatment effects (HTE)?
* We can use data to recommend a universal policy:
    1. What's the federal minimum wage?
    2. Product return policy
    3. Product pricing
    *  These are not good use cases for heterogeneous treatment effects (HTE)
* We also want to make targeted policies or take customized actions:
    1. Which customers should be defaulted to faster delivery options?
    2. How do we match sellers with the best support or representatives?
    3. Which users see which sort of ads?
    4. Which orders should we scrutinize and delay for fraud investigation?
    * These are use cases for HTE.
    
    

## HTE use cases
* Just like non-HTE use cases, like Average Treatment Effect (ATE) or Average Treatment Effect on the Treated (ATET), HTE remains a *causal question.*
* Like all causal questions, the foundational problem is that we do not observe the outcomes under both the treatment and control conditions for the same unit.
* This means it inherits all the causal complexities from its ATE/ATET cousins, and more.
    * Cross-sectional:  overlap, unconfoundedness, etc.
    * Panel: parallel/overlapping trends
    * Your ears perk up when you can do a randomized experiment
* You should **always** first estimate ATE/ATETs before trying HTE 


## HTE modeling and notation
* Suppose you have a dataset with:
    * $Y_i$, outcome
    * $X_i$, features or confounders
    * $Z_i$, features you want to explore heterogeneity in
    * $W_i$, treatment indicator

* When we think about ATE/ATET, we can estimate it by taking the difference between the outcomes under treatment and control:
$$ \tau = E[Y_1 (X_i) - Y_0 (X_i) ]$$
    * Recall that the causal problem is that we only observe $Y_1$ for treated ($W_i=1$) units or $Y_0$ for control ($W_i=0$) units, not both.
    * Therefore I am controlling for $X_i$. Again, this doesn't guarantee you have the correct estimate, but is common practice when assumptions hold.
* For HTE, we are interested in variation across $Z_i$:
$$ \tau^{hte} (Z_i) = E[Y_1(X_i) - Y_0(X_i)  | Z_i]$$ 
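
To make the scalar-versus-function distinction concrete, here is a minimal sketch on simulated data. It assumes a randomized experiment (so a plain difference in means is a reasonable estimator) and a binary segment $Z_i$; the effect sizes, sample size, and the helper `diff_in_means` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical randomized experiment: W is a fair coin flip, Z is a binary segment.
z = rng.integers(0, 2, size=n)
w = rng.integers(0, 2, size=n)

# Assumed true effects: 1.0 in segment Z=0 and 3.0 in segment Z=1, so the ATE is about 2.0.
true_tau = np.where(z == 1, 3.0, 1.0)
y = 0.5 * z + true_tau * w + rng.normal(0.0, 1.0, size=n)


def diff_in_means(y, w):
    """Difference in mean outcomes between treated and control units."""
    return y[w == 1].mean() - y[w == 0].mean()


# ATE: one scalar for the whole population.
ate_hat = diff_in_means(y, w)

# HTE: one estimate per value of Z, i.e. a (very simple) function of Z.
cate_hat = {seg: diff_in_means(y[z == seg], w[z == seg]) for seg in (0, 1)}

print(f"ATE estimate: {ate_hat:.2f}")
for seg, est in cate_hat.items():
    print(f"CATE estimate for Z={seg}: {est:.2f}")
```

Estimating the ATE first, as recommended earlier, gives a sanity check: the segment-level estimates should average back to roughly the ATE.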

Adopting an OLS-type equation makes it easier to see that we are estimating a function instead of a scalar.

* From an OLS perspective, which is also useful for the OLS/DML approach, ATE/ATET would be estimated with:

$$ Y_i = \hat{\beta}X_i + \hat{\tau} W_i + \epsilon_i$$
* HTE has us instead estimate:

$$ Y_i = \hat{\beta}X_i + \hat{\tau}(Z_i) W_i + \epsilon_i$$



## HTE interpretations
$$ \tau^{hte} (Z_i) = E[Y_1 - Y_0  | Z_i]$$ 

* We can also call this the conditional average treatment effect (CATE)
* Since our estimate is no longer a single scalar number, but a function that inputs $Z_i$, we are now estimating an HTE function.
* Your use case can allow different interpretations. 
    * "How being offered a five-day versus two-day delivery options impacts different customers."
    * "How the average seller reacts to being treated with different services."
    

```python
from IPython.display import Image

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

import os

# Create the output folder up front so that plt.savefig does not fail.
figure_dir = os.path.join(os.getcwd(), 'Figures')
os.makedirs(figure_dir, exist_ok=True)

## Figure of arXiv papers
arXiv_paper_hits = pd.DataFrame(data={'year':[2017,2018,2019,2020,2021,2022],
                                     'hte_hits':[8,13,15,26,38,66],
                                     'causal_inf_hits':[78,118,154,219,305,446]})

# Year-over-year proportional growth; the first year has no prior year, so it is dropped.
hte_yoy_growth = arXiv_paper_hits['hte_hits'].pct_change().dropna()
ci_yoy_growth = arXiv_paper_hits['causal_inf_hits'].pct_change().dropna()

fig,ax = plt.subplots(ncols=1,nrows=1, figsize=(5,3))

ax.set_xlabel('Year')
ax.set_ylabel('Proportion Growth')

ax.set_xticks(arXiv_paper_hits['year'])
ax.set_xticklabels(labels=['{0:4.0f}'.format(a) for a in arXiv_paper_hits['year'].unique().tolist()])

ax.plot(arXiv_paper_hits['year'][1:],hte_yoy_growth, marker='o', label='"Heterogeneous\n Treatment\n Effect"', color='seagreen')
ax.plot(arXiv_paper_hits['year'][1:],ci_yoy_growth, marker='o', label='"Causal\n Inference"', color='purple', linestyle='--')
ax.legend(bbox_to_anchor=(1.05, 1.0))
ax.grid()
ax.set_title('YoY Growth in Submitted arXiv Papers')
fig.set_facecolor('lightsteelblue')
plt.savefig(os.path.join(figure_dir, 'HTE_Figure_0.png'), bbox_inches='tight')
plt.close(fig)

## Figure of model trained on treated sample
# Units with larger x are more likely to be treated, so treated and control
# units occupy different regions of x.
datax = np.random.uniform(0,1,100)
treatment = ( (np.exp(datax) / (1+np.exp(datax)) + np.random.uniform(-0.05,0.025, 100)) > 0.6).astype(float)
datay = 0.25 + 0.5*np.log(1+datax) + 0.25*treatment + np.random.uniform(-0.05,0.05,100)
df = pd.DataFrame(data={'W':treatment,
                        'x':datax,
                        'y':datay})
df.sort_values(by='x',inplace=True)
fig,ax = plt.subplots(ncols=1,nrows=1, figsize=(4,2))

ax.set_xlabel('x')
ax.set_ylabel('y')

ax.scatter(df.loc[df['W']==1, 'x'],
           df.loc[df['W']==1, 'y'], color='seagreen', label='Treated')
ax.scatter(df.loc[df['W']==0, 'x'],
           df.loc[df['W']==0, 'y'], color='coral', label='Control')
ax.legend(bbox_to_anchor=(1.0, 1.0))
ax.grid()
ax.set_title('Predicting out of Sample')
plt.savefig(os.path.join(figure_dir, 'HTE_Figure_10a.png'), bbox_inches='tight')
plt.close(fig)
```


# Some HTE Models
This is a very active literature, so consider this just a brief summary of broad classes of HTE models.
<img src="Figures/HTE_Figure_0.png" >


|Class | categorical treatment | continuous treatment | cross-sectional data | panel data |
|---:|:---|:---|:---|:---|
|T+X (T2X) Learner | Y | N | Y | Sometimes |
|OLS | Y | Y | Y | Y |
|ML-Weights | Y | N  | Y | Sometimes |




## Common ingredients across models

* Estimated counterfactual outcome, $\hat{Y}_1(X_i), \hat{Y}_0(X_i)$
    * How would treated units perform if they were control units instead, and vice versa?
* Estimated outcome, $\hat{Y}(X_i)$
* Propensity score, $\hat{P}(X_i)$
    * Probability that a given unit is treated.
    * We will rely on this with the unconfoundedness/exogeneity assumption common in causal models
    

## T+X (T2X) Learners - the big idea
$$ \tau^{hte} (Z_i) = E[Y_1(X_i) - Y_0(X_i)  | Z_i]$$ 
* Let's treat it as a prediction problem ("T-Learner").
    * For each observation, predict $Y_1(X_i)$ and $Y_0(X_i)$ values    
    * Use your favorite ML model to train two models 1: $Y_1(X_i)$; 2: $Y_0(X_i)$.
* Calculate observation-level differences: $\tau^{hte}_i = \hat{Y}_1(X_i) - \hat{Y}_0(X_i) $
* We look at variation in $\tau^{hte}_i$ across $Z_i$.    
    * We can train a third prediction model of $\tau^{hte}_i$ as a function of $Z_i$, both for interpretability and to reduce noise in $\tau^{hte}_i$. We will come back to this in a few slides.
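
The steps above can be sketched on simulated data. This is only an illustration under stated assumptions: treatment is randomized, a single feature `x` plays the role of both $X_i$ and $Z_i$, and a cubic least-squares fit (the hypothetical helper `fit_predict`) stands in for "your favorite ML model".

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical data: one feature x that serves as both X_i and Z_i.
x = rng.uniform(0.0, 1.0, size=n)
w = rng.integers(0, 2, size=n)            # randomized treatment, for simplicity
true_tau = 1.0 + 2.0 * x                  # assumed true HTE function
y = np.sin(2 * x) + true_tau * w + rng.normal(0.0, 0.5, size=n)


def poly_features(x, degree=3):
    """Columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)


def fit_predict(x_train, y_train, x_new):
    """Least-squares polynomial fit, standing in for any ML regressor."""
    coef, *_ = np.linalg.lstsq(poly_features(x_train), y_train, rcond=None)
    return poly_features(x_new) @ coef


# Model 1 is trained on treated units only, model 2 on control units only.
y1_hat = fit_predict(x[w == 1], y[w == 1], x)
y0_hat = fit_predict(x[w == 0], y[w == 0], x)

# Observation-level differences.
tau_hat = y1_hat - y0_hat

# Third model: summarize the unit-level effects as a linear function of Z_i (= x here).
slope, intercept = np.polyfit(x, tau_hat, deg=1)
print(f"tau_hat(z) is roughly {intercept:.2f} + {slope:.2f} * z")
```

Because the simulation fixes the true effect at $1 + 2x$, the fitted line can be compared against it; with real data no such ground truth exists.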

## T+X (T2X)  Learners - one question

1. In the figure below, how are we sure that our $\hat{Y}_1(X_i)$ model would do a good job predicting the outcome of the control units? 
    * We are extrapolating into an unknown space. We are training a model on treated units to predict the outcome of control units, so we do not have a ground truth to validate our $\hat{Y}_1(X_i)$ model.
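
A small simulation can make the extrapolation concern concrete. It reuses the flavor of the data-generating process behind the figure for this slide, where units with larger `x` are the ones that get treated. The cubic polynomial is an assumed stand-in for a flexible model, and the comparison against `y1_true` is only possible because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Units with larger x are the ones that get treated, so the arms barely overlap in x.
x = rng.uniform(0.0, 1.0, size=n)
score = np.exp(x) / (1.0 + np.exp(x)) + rng.uniform(-0.05, 0.025, size=n)
w = (score > 0.6).astype(int)

# Noiseless potential outcome under treatment (known only because we simulate).
y1_true = 0.25 + 0.5 * np.log(1.0 + x) + 0.25
y = 0.25 + 0.5 * np.log(1.0 + x) + 0.25 * w + rng.uniform(-0.05, 0.05, size=n)

# Fit a cubic polynomial for Y_1 on treated units only.
coef = np.polyfit(x[w == 1], y[w == 1], deg=3)
y1_hat = np.polyval(coef, x)

# Compare the error inside the treated support with the error on control units.
abs_err = np.abs(y1_hat - y1_true)
err_treated = abs_err[w == 1].mean()
err_control = abs_err[w == 0].mean()
print(f"mean |error| on treated units (in support):   {err_treated:.4f}")
print(f"mean |error| on control units (extrapolated): {err_control:.4f}")
```

The model looks accurate where it was trained, which is exactly why in-sample fit says little about how well it predicts for control units.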

<img src="Figures/HTE_Figure_10a.png" >


## T+X (T2X) Learners - relying on propensity scores
* We rely on additional information, the propensity score $\hat{P}(X_i)$
* Relying on the unconfoundedness assumption, we can compare treated and control units with similar propensity scores to have the correct estimate.
* A common way of doing this would be to take a weighted average of the estimates for control and treated units ("X-Learner").
* Künzel et al. (2017) propose taking the weighted average after predicting variation in $\hat{\tau}^{hte}_i$ across $Z_i$. We will revisit this in the next slide.

$$ \hat{\tau}^{hte,x}_i = \hat{P}(X_i) \hat{\tau}^0_i  + (1-\hat{P}(X_i)) \hat{\tau}^1_i $$

* Where $\hat{\tau}^0_i$ ($\hat{\tau}^1_i$) is the impact for control (treated) units.
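
A minimal sketch of the weighting above, on simulated observational data. Everything here is an assumption for illustration: quadratic least-squares fits (`ols_predict`) stand in for the outcome and effect models, and a small Newton-method logistic regression (`logistic_predict`) stands in for the propensity model.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8_000

# Hypothetical observational data: treatment is more likely for larger x.
x = rng.uniform(-1.0, 1.0, size=n)
p_true = 1.0 / (1.0 + np.exp(-1.5 * x))
w = rng.binomial(1, p_true)
true_tau = 1.0 + x                        # assumed true HTE function
y = 2.0 * x + true_tau * w + rng.normal(0.0, 0.5, size=n)


def design(x):
    """Intercept, linear, and quadratic terms; a stand-in for any ML model."""
    return np.column_stack([np.ones_like(x), x, x**2])


def ols_predict(x_train, y_train, x_new):
    coef, *_ = np.linalg.lstsq(design(x_train), y_train, rcond=None)
    return design(x_new) @ coef


def logistic_predict(x_train, w_train, x_new, n_iter=25):
    """Logistic regression on [1, x], fit by Newton's method."""
    a = np.column_stack([np.ones_like(x_train), x_train])
    beta = np.zeros(a.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-a @ beta))
        grad = a.T @ (w_train - p)
        hess = a.T @ (a * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    a_new = np.column_stack([np.ones_like(x_new), x_new])
    return 1.0 / (1.0 + np.exp(-a_new @ beta))


treated, control = w == 1, w == 0

# Step 1 (T-Learner): outcome models fit separately on each arm.
mu1_hat = ols_predict(x[treated], y[treated], x)
mu0_hat = ols_predict(x[control], y[control], x)

# Step 2: imputed unit-level effects, using the other arm's model as the counterfactual.
d1 = y[treated] - mu0_hat[treated]        # treated: observed minus predicted Y_0
d0 = mu1_hat[control] - y[control]        # control: predicted Y_1 minus observed

# Step 3: model the imputed effects within each arm.
tau1_hat = ols_predict(x[treated], d1, x)
tau0_hat = ols_predict(x[control], d0, x)

# Step 4: propensity-weighted combination, matching the formula above.
p_hat = logistic_predict(x, w, x)
tau_x = p_hat * tau0_hat + (1.0 - p_hat) * tau1_hat
print(f"mean |tau_x - true_tau|: {np.mean(np.abs(tau_x - true_tau)):.3f}")
```

The weights lean on the effect model from the arm with more data at a given `x`: where treatment is likely, the control-arm model had the better-estimated counterfactual to work with, so it gets the larger weight.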


## T+X (T2X) Learners - a second question

2. Independent of the above question, how do we know that variation across $Z_i$ is real or due to noise?
    * This is a big question we should ask of *every HTE model* and is an HTE-specific complexity. 
    
*  Chernozhukov et al. (2017) propose using the propensity score and variation over $Z_i$ in a single equation, where you estimate how much variation in $\hat{\tau}^{hte}_i$ is driven by selected $Z_i$ while simultaneously controlling for the propensity score.
* Kennedy (2020) proposes estimating $\hat{\tau}^{hte}_i$ using $\hat{Y}_1(X_i)$, $\hat{Y}_0(X_i)$, and $\hat{P}(X_i)$ in a single equation, and then estimating variation across $Z_i$.
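
As a rough sketch of the "single equation" idea in the doubly robust approach, the block below builds an AIPW-style pseudo-outcome from $\hat{Y}_1$, $\hat{Y}_0$, and $\hat{P}$, and then regresses it on $Z_i$. It is a simplification under stated assumptions: the propensity is known (0.3, as in a randomized experiment), linear fits stand in for ML models, and the sample splitting used in the paper is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

# Hypothetical data with a known propensity of 0.3, so p_hat is taken as given.
z = rng.uniform(0.0, 1.0, size=n)
p_hat = np.full(n, 0.3)
w = rng.binomial(1, p_hat)
true_tau = 0.5 + 1.5 * z                  # assumed true HTE function
y = 1.0 + z + true_tau * w + rng.normal(0.0, 1.0, size=n)


def linear_predict(z_train, y_train, z_new):
    """Least-squares line, standing in for any ML regressor."""
    slope, intercept = np.polyfit(z_train, y_train, deg=1)
    return intercept + slope * z_new


# Counterfactual outcome models, one per arm.
y1_hat = linear_predict(z[w == 1], y[w == 1], z)
y0_hat = linear_predict(z[w == 0], y[w == 0], z)

# Doubly robust pseudo-outcome: model difference plus
# inverse-propensity-weighted residuals from each arm.
pseudo = (
    y1_hat - y0_hat
    + w * (y - y1_hat) / p_hat
    - (1 - w) * (y - y0_hat) / (1.0 - p_hat)
)

# Final step: regress the pseudo-outcome on Z_i to describe the variation.
dr_slope, dr_intercept = np.polyfit(z, pseudo, deg=1)
print(f"tau_hat(z) is roughly {dr_intercept:.2f} + {dr_slope:.2f} * z")
```

The pseudo-outcome is noisy at the unit level, which is why the final regression on $Z_i$ matters: it averages that noise out while keeping the systematic variation.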
   

#### Summary - T2X Learners
* In summary, we start with estimating the counterfactual outcomes ("T-Learner"). However, this is susceptible to biases from predicting out of sample.
* An "X-Learner" addresses this by incorporating the propensity score so that we can estimate HTE by comparing units with similar propensity scores. This relies on the unconfoundedness assumption.
    * We should also look at variation across $Z_i$
* There are a lot of approaches and sequences for doing this.  

| Künzel et al. (2017) | Chernozhukov et al. (2017) |  Kennedy (2020)|
|:--- |:---|:---|
|1. Estimate  $\hat{\tau}^{hte}_i$ with a T-Learner. <br> 2. Predict $\hat{\tau}^{hte}_i$ over $Z_i$  <br> 3. Take weighted average with $\hat{P}(X_i)$ <br> | 1. Estimate variation over $Z_i$ by debiasing with $\hat{P}(X_i)$  | 1. Estimate $\hat{\tau}^{hte}_i$ with a doubly robust estimator combining counterfactual outcomes and $\hat{P}(X_i)$. <br> 2. Predict variation over $Z_i$  |



## OLS
* While T2X Learner HTE models allow a lot of flexibility, we saw that leveraging additional information, specifically the propensity score, yields improvements.
* We can simultaneously leverage both predictions of the outcome and propensity score with a common causal model, ordinary least squares (OLS)
* We will improve on its framework with a double/debiased machine learning (DML) approach

## OLS - simple model
* Recall that under the same assumptions as before, we can estimate the ATE/ATET with an OLS model:
$$ Y_i = \beta X_i + \hat{\tau} W_i + \epsilon_i $$
* We can incorporate HTE by including additional features
$$ Y_i = \beta X_i + \hat{\tau} W_i + \hat{\tau}^{hte,z} W_i \times Z_i + \epsilon_i $$
* Therefore, the HTE is a combination of the baseline treatment effect $\hat{\tau}$ and $\hat{\tau}^{hte,z} Z_i$.
$$ \hat{\tau}^{hte} = \hat{\tau} + \hat{\tau}^{hte,z} Z_i $$
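
The interaction model can be sketched with plain least squares. The data are simulated with assumed coefficients ($\beta = 0.8$, $\tau = 1.0$, $\tau^{hte,z} = 2.0$), treatment is randomized, and $Z_i$ is also included among the controls, which the equation above leaves implicit.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

x = rng.normal(size=n)                    # confounder X_i
z = rng.uniform(0.0, 1.0, size=n)         # heterogeneity feature Z_i
w = rng.integers(0, 2, size=n)            # randomized treatment W_i

# Assumed true parameters: beta = 0.8, tau = 1.0, tau_hte_z = 2.0.
y = 0.8 * x + 1.0 * w + 2.0 * w * z + rng.normal(0.0, 1.0, size=n)

# Columns: intercept, X_i, Z_i (as a control), W_i, W_i * Z_i.
design = np.column_stack([np.ones(n), x, z, w, w * z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
tau_hat, tau_hte_z_hat = coef[3], coef[4]


def hte(z_value):
    """Estimated HTE function: baseline effect plus the Z-dependent part."""
    return tau_hat + tau_hte_z_hat * z_value


print(f"tau_hat = {tau_hat:.2f}, tau_hte_z_hat = {tau_hte_z_hat:.2f}")
print(f"estimated effect at z = 0.0: {hte(0.0):.2f}")
print(f"estimated effect at z = 1.0: {hte(1.0):.2f}")
```

The function `hte` is the point of the exercise: the output of the model is a rule that maps any $Z_i$ to an effect, not a single number.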


## OLS - drawbacks of the simple model

$$ Y_i = \beta X_i + \hat{\tau} W_i + \hat{\tau}^{hte,z} W_i \times Z_i + \epsilon_i $$
* Under the assumptions above and a correctly specified linear model, this approach yields unbiased estimates of $\tau^{hte}$. Interpretation is also very straightforward. However, we can face difficulties when:
    1. The functions that determine $Y_i$ or $W_i$ cannot be well modeled linearly; or
    2. $Z_i$ has high dimensionality.
    
* We will solve both by incorporating approaches from double/debiased machine learning (DML) from Chernozhukov et al. (2016).

## OLS - DML applied to HTE, Part 1
* The Semenova et al. (2017) approach borrows the residualizing approach from DML, where we predict the observed outcome $\hat{Y}(X_i)$ and propensity score $\hat{P}(X_i)$.
    * This is the "first stage" in DML. We then run the "second stage":
$$ \tilde{Y}_i = \hat{\tau}  \tilde{W}_i + \hat{\tau}^{hte,z}  \tilde{W}_i \times Z_i + \eta_i $$
    * Where the $\tilde{Y}_i$ and $\tilde{W}_i$ are the difference between the predicted and observed outcome and treatment status, respectively.
      
* If we run this "second stage" as is, then we can still have problems if $Z_i$ is high-dimensional. We can think of this as a feature selection problem.
    * We can have high dimensionality over different transformations of a variable. For example: $Z_i = [x_{1i}, x^2_{1i}, x^3_{1i}, \log(x_{1i}), 1\{x_{1i} > 0 \}]$.
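
The two stages can be sketched as follows, with a low-dimensional $Z_i$ so that no feature selection is needed yet. The setup is assumed for illustration: data are simulated, polynomial least-squares fits (`basis`, `cross_fit`) stand in for the first-stage ML models, two-fold cross-fitting is used, and the outcome model conditions on both $X_i$ and $Z_i$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000

x = rng.uniform(-2.0, 2.0, size=n)        # confounder X_i
z = rng.uniform(0.0, 1.0, size=n)         # heterogeneity feature Z_i
p_true = 1.0 / (1.0 + np.exp(-x))         # nonlinear propensity
w = rng.binomial(1, p_true)

# Assumed truth: tau = 1.0, tau_hte_z = 2.0, with a nonlinear confounding term in x.
y = x**2 + (1.0 + 2.0 * z) * w + rng.normal(0.0, 1.0, size=n)


def basis(x, z=None, degree=3):
    """Polynomial terms in x, optionally interacted with [1, z]."""
    px = np.vander(x, degree + 1, increasing=True)
    return px if z is None else np.hstack([px, px * z[:, None]])


def cross_fit(features, target, folds):
    """Out-of-fold least-squares predictions (the DML first stage)."""
    pred = np.empty_like(target, dtype=float)
    for k in np.unique(folds):
        train, test = folds != k, folds == k
        coef, *_ = np.linalg.lstsq(features[train], target[train], rcond=None)
        pred[test] = features[test] @ coef
    return pred


# First stage: residualize the outcome and the treatment.
folds = rng.integers(0, 2, size=n)
y_tilde = y - cross_fit(basis(x, z), y, folds)
w_tilde = w - cross_fit(basis(x), w.astype(float), folds)

# Second stage: regress the residualized outcome on the residualized
# treatment and its interaction with Z_i.
second_stage = np.column_stack([w_tilde, w_tilde * z])
(tau_hat, tau_hte_z_hat), *_ = np.linalg.lstsq(second_stage, y_tilde, rcond=None)
print(f"tau_hat = {tau_hat:.2f}, tau_hte_z_hat = {tau_hte_z_hat:.2f}")
```

Note that the second stage is still linear in $Z_i$; the flexibility added by DML sits entirely in how $Y_i$ and $W_i$ are predicted from the controls.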



## OLS - DML applied to HTE, Part 2
* Semenova et al. (2017) incorporate selection over $Z_i$ by adapting a sample-split LASSO regression.
    * In general, LASSO regression coefficients do not have a causal interpretation.
    * We get around this by doing sample-splitting:
$$ \tilde{Y}_i = \hat{\tau}  \tilde{W}_i + \hat{\tau}^{hte,z}  \tilde{W}_i \times Z_i + \eta_i $$

* We will split the sample into training and test samples. We select $Z_i$ on the training set using LASSO, and then estimate OLS on the selected $Z_i$ on the test set.
    * We then average "selected" OLS coefficients across test samples.
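
A rough sketch of the select-then-estimate idea, not the authors' exact procedure. To keep it short, the residualized $\tilde{Y}_i$ and $\tilde{W}_i$ are simulated directly, the LASSO is a small coordinate-descent routine (`lasso_cd`) written for this example, the penalty `alpha=0.05` is an arbitrary choice, and the baseline term $\tilde{W}_i$ is always kept in the OLS step.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 4_000, 20

# Hypothetical residualized data: only the first two of 20 Z columns matter.
z = rng.normal(size=(n, k))
w_tilde = rng.binomial(1, 0.5, size=n) - 0.5
true_coef = np.zeros(k)
true_coef[:2] = [2.0, -1.5]
y_tilde = (1.0 + z @ true_coef) * w_tilde + rng.normal(0.0, 1.0, size=n)

# Second-stage features: [W~, W~ * Z_1, ..., W~ * Z_k].
features = np.column_stack([w_tilde, w_tilde[:, None] * z])


def lasso_cd(a, b, alpha, n_iter=100):
    """Coordinate descent for (1/2n)||b - a c||^2 + alpha * ||c||_1."""
    n_obs, n_col = a.shape
    coef = np.zeros(n_col)
    col_sq = (a**2).sum(axis=0) / n_obs
    resid = b.copy()
    for _ in range(n_iter):
        for j in range(n_col):
            resid += a[:, j] * coef[j]    # remove feature j's current contribution
            rho = a[:, j] @ resid / n_obs
            coef[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            resid -= a[:, j] * coef[j]    # add the updated contribution back
    return coef


# Select on one half, estimate by OLS on the other half, then swap roles.
halves = np.array_split(rng.permutation(n), 2)
coef_by_split = []
for train_idx, test_idx in (halves, halves[::-1]):
    lasso_coef = lasso_cd(features[train_idx], y_tilde[train_idx], alpha=0.05)
    selected = np.flatnonzero(lasso_coef[1:]) + 1
    cols = np.concatenate([[0], selected])         # always keep the baseline effect
    ols_coef, *_ = np.linalg.lstsq(
        features[test_idx][:, cols], y_tilde[test_idx], rcond=None
    )
    full = np.zeros(k + 1)
    full[cols] = ols_coef                          # unselected terms stay at zero
    coef_by_split.append(full)

# Average the OLS coefficients across the two test samples.
avg_coef = np.mean(coef_by_split, axis=0)
print("baseline effect:", round(avg_coef[0], 2))
print("largest interaction terms:", np.round(avg_coef[1:4], 2))
```

The LASSO coefficients themselves are discarded; they are only used to decide which columns enter the OLS fit on data that played no part in the selection.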



#### Notes - OLS and DML
* We can get HTE through OLS as well, where we interact the treatment indicator with $Z_i$.
* This works great under linearity assumptions and when $Z_i$ is low dimensional.
* Otherwise, we can leverage concepts from DML. We residualize the outcome and treatment features to allow non-linearity in the components, and use a sample-split LASSO regression to select elements of $Z_i$.
    * It is worth noting that we are invoking elements of the Partially Linear Model from DML, where we still assume linearity in the second stage.
    * The Interactive Regression Model from DML can also be applied to HTE, and is more similar to the T2X-Learners from the earlier set of slides.


## ML-Weights
* When we estimate differences between treatment and control units, we want to compare treatment and control units that are otherwise similar.
    * If we compare the outcomes of treatment/control that are similar on observable characteristics and we assume exogeneity, then the difference is causal.     
* We can do matching based on propensity scores, as in the example below. However, we may run into situations where a control unit can be a good match for multiple treated units (e.g., rows 1 and 3).

| | Matched <br> Set | Treated Unit <br> Propensity Score | Control Unit <br> Propensity Score |
|---:|---:|:---:| :---:|
| 1.| 1 | 0.20 | 0.21 |
| 2.| 1 |  | 0.19 |
| 3.| 2 | 0.23 | 0.22 |
| 4.| 2 |  | 0.24 |


## Weighting Units 

| | Matched <br> Set | Treated Unit <br> Propensity Score | Control Unit <br> Propensity Score |
|---:|---:|:---:| :---:|
| 1.| 1 | 0.20 | 0.21 |
| 2.| 1 |  | 0.19 |
| 3.| 2 | 0.23 | 0.22 |
| 4.| 2 |  | 0.24 |

* We can weight control units based on how close they are in propensity scores or other distributional assumptions. 
* The more similar control and treated units are, the larger the weight should be.
* Athey et al. (2018) use a Generalized Random Forest to determine weights that:
    1. Compares similar treatment and control units; and
    2. Explores variation across $Z_i$
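
To illustrate weighting by closeness, the sketch below uses the propensity scores from the table above and turns distances into weights with a Gaussian kernel. The kernel and the bandwidth of 0.02 are assumptions made for this example; GRF learns its weights from the forest rather than from a fixed kernel.

```python
import numpy as np

# Propensity scores taken from the table above.
treated_scores = np.array([0.20, 0.23])
control_scores = np.array([0.21, 0.19, 0.22, 0.24])

# Hypothetical bandwidth: how quickly the weight decays with distance.
bandwidth = 0.02

# Distance between every treated unit (rows) and every control unit (columns).
dist = np.abs(treated_scores[:, None] - control_scores[None, :])

# Gaussian kernel: closer control units get larger weights.
kernel = np.exp(-0.5 * (dist / bandwidth) ** 2)

# Normalize so that each treated unit's weights sum to one.
weights = kernel / kernel.sum(axis=1, keepdims=True)

for score, row in zip(treated_scores, weights):
    print(f"treated unit with score {score:.2f}: control weights {np.round(row, 3)}")
```

Unlike one-to-one matching, every control unit can contribute to every treated unit's comparison, so a control unit that is a good match for several treated units is no longer a problem.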
    

## Causal Tree is a building block for Generalized Random Forest (GRF)
* A Generalized Random Forest (GRF) is a forest of Causal Trees (CT), rather than standard Decision Trees (DT).
* We'll distinguish how standard Random Forests and Causal Forests would estimate $\tau^{hte}$.
* Both split the data sample into groups with similar features $X_i$ to predict $\tau^{hte}$. Splits are evaluated based on the variation in $\tau^{hte}$ (e.g., an entropy-like criterion). This splitting process has two components. For a given potential split of a node:
    1. How good is this bifurcation compared to other potential splits?
    2. What's the predicted $\tau^{hte}$ in each of the splits?
* DT uses the same data to answer 1. and 2. In contrast, CT uses different data for each. Think of this as having one training sample determine the splits, and another training sample estimate $\tau^{hte}$.
    * You can think of 1. as a way to maximize prediction power, and 2. as an evaluation (for causal inference, estimating the effect).
    * This allows us to calculate confidence intervals for CT and GRF estimates. Note this is the same high-level approach
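
The separation between choosing a split and estimating effects can be sketched with a single split (a "stump"). This is a toy illustration, not the GRF algorithm: data are simulated with a randomized treatment, the effect is assumed to jump at `x = 0.5`, and the split criterion (`split_score`) is a simple size-weighted squared gap in effects chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20_000

x = rng.uniform(0.0, 1.0, size=n)
w = rng.integers(0, 2, size=n)                 # randomized treatment
true_tau = np.where(x > 0.5, 2.0, 0.0)         # assumed effect: jumps at x = 0.5
y = x + true_tau * w + rng.normal(0.0, 1.0, size=n)


def effect(y, w):
    """Difference in means between treated and control units."""
    return y[w == 1].mean() - y[w == 0].mean()


# One half of the data chooses the split, the other half estimates the effects.
idx = rng.permutation(n)
split_half, est_half = idx[: n // 2], idx[n // 2 :]


def split_score(threshold, rows):
    """Size-weighted squared gap in effects between the two children."""
    left = rows[x[rows] <= threshold]
    right = rows[x[rows] > threshold]
    gap = effect(y[left], w[left]) - effect(y[right], w[right])
    return len(left) * len(right) / len(rows) ** 2 * gap**2


# Component 1: pick the best split using the splitting sample only.
candidates = np.linspace(0.1, 0.9, 17)
best_threshold = max(candidates, key=lambda t: split_score(t, split_half))

# Component 2: estimate the effect in each child using the estimation sample only.
left = est_half[x[est_half] <= best_threshold]
right = est_half[x[est_half] > best_threshold]
tau_left = effect(y[left], w[left])
tau_right = effect(y[right], w[right])

print(f"chosen split: x <= {best_threshold:.2f}")
print(f"effect in left child:  {tau_left:.2f}")
print(f"effect in right child: {tau_right:.2f}")
```

Because the estimation sample played no role in picking the threshold, the child-level estimates are not inflated by the search over candidate splits.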

## GRF approach
* GRF calculates weights so that:
    1. Control and treatment units with similar $X_i$ are compared to estimate $\hat{\tau}^{hte}$; and
    2. It maximizes variation in $\hat{\tau}^{hte}$.
* Note that this forces variation in $\hat{\tau}^{hte}$ across $X_i$, not necessarily $Z_i$. This is because the same weighting is used both to estimate the treatment effect through matching and to estimate heterogeneity.
* We can improve upon this by borrowing the residualization concept from DML. That is, you first calculate $\tilde{Y}_i$ and $\tilde{W}_i$ and then train the GRF to estimate variation across $Z_i$.

## Papers
### T2X Learners:
* Künzel, Sekhon, Bickel, Yu. *Meta-learners for estimating heterogeneous treatment effects using machine learning*  http://arxiv.org/abs/1706.03461
* Semenova, Chernozhukov. *Debiased Machine Learning of Conditional Average Treatment Effects and Other Causal Functions* https://arxiv.org/abs/1702.06240
* Chernozhukov, Demirer, Duflo, Fernández-Val. *Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments* https://arxiv.org/abs/1712.04802
* Kennedy. *Towards optimal doubly robust estimation of heterogeneous causal effects* https://arxiv.org/abs/2004.14497
* Sant'Anna, Zhao. *Doubly Robust Difference-in-Differences Estimators* https://arxiv.org/abs/1812.01723

### OLS:
* Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, Robins. *Double/Debiased Machine Learning for Treatment and Causal Parameters*  https://arxiv.org/abs/1608.00060
* Semenova, Goldman, Chernozhukov, Taddy. *Estimation and Inference on Heterogeneous Treatment Effects in High-Dimensional Dynamic Panels under Weak Dependence* https://arxiv.org/abs/1712.09988

### ML-Weights
* Athey, Tibshirani, Wager. *Generalized Random Forests* https://arxiv.org/abs/1610.01271 
* Friedberg, Tibshirani, Athey, Wager. *Local Linear Forests* https://arxiv.org/abs/1807.11408
* Wager, Athey. *Estimation and Inference of Heterogeneous Treatment Effects using Random Forests* https://arxiv.org/abs/1510.04342
* Farrell, Liang, Misra. *Deep Neural Networks for Estimation and Inference* https://arxiv.org/abs/1809.09953
