Taught by Wenhao Jiang · Department of Sociology · Duke University · Fall 2025
This week sets the stage for the course and introduces how and why Machine Learning (ML) can be integrated into causal inference.
- Motivate the integration of statistical prediction with causal inference in response to the emergence of high-dimensional data and the need for flexible, non-linear modeling of covariates.
- Review the statistical properties of the Conditional Expectation Function (CEF) and linear regression in a low-dimensional setting.
- Revisit the basic matrix formulation of linear regression.
- Introduce the Frisch–Waugh–Lovell (FWL) Theorem as a partialling-out technique in linear regression (a short R illustration appears at the end of this week's overview).
- Review asymptotic OLS inference and discuss issues with standard error estimation in high-dimensional settings.
- Summarize the concept of Neyman Orthogonality as an extension of the FWL Theorem to motivate Double Machine Learning (DML) in high-dimensional settings.
Optional Reading: For students who wish to explore the asymptotic properties of OLS in greater depth, see the Week 1 Supplements on asymptotic inference. Models that satisfy Neyman Orthogonality retain the classic asymptotic properties required for valid statistical inference.
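As a concrete preview of the partialling-out idea behind FWL, here is a minimal R sketch on hypothetical simulated data (the variable names are illustrative, not from the course materials). It verifies that the coefficient on a regressor in the full OLS fit equals the coefficient from regressing the residualized outcome on the residualized regressor.

```r
# FWL on simulated data: full-regression coefficient on x1 equals the
# residual-on-residual coefficient after partialling out x2.
set.seed(1)
n  <- 500
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)        # x1 is correlated with x2
y  <- 1 + 2 * x1 - x2 + rnorm(n)

# Full regression: y on x1 and x2
beta_full <- coef(lm(y ~ x1 + x2))["x1"]

# Partialling out: residualize y and x1 on x2, then regress residual on residual
ry  <- resid(lm(y ~ x2))
rx1 <- resid(lm(x1 ~ x2))
beta_fwl <- coef(lm(ry ~ rx1))["rx1"]

all.equal(unname(beta_full), unname(beta_fwl))  # TRUE, up to numerical error
```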
Building on Week 1, where we introduced both the benefits and the challenges of high-dimensional data, this week focuses on regularization regression methods. These approaches address high dimensionality in order to improve out-of-sample prediction and strengthen statistical inference.
- Review the motivation for using high-dimensional data in analysis, and examine the limitations of ordinary linear regression in high-dimensional settings.
- Introduce regularization methods for handling high-dimensional data. We focus in particular on LASSO regression as a feature selection method under approximate sparsity, and Ridge regression for dense coefficient distributions. We also cover variants that combine LASSO and Ridge penalties.
- Introduce cross-validation and plug-in methods for fine-tuning the penalty level in regularization.
- Revisit the Frisch–Waugh–Lovell (FWL) Theorem and introduce Double LASSO for statistical inference in high-dimensional settings.
- Present other LASSO-like methods that satisfy Neyman Orthogonality for valid inference.
- Demonstrate R implementations of regularization methods and Double LASSO, applying them to test the Convergence Hypothesis in Macroeconomics with high-dimensional data (illustrative sketches of both appear below).
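As a first sketch of the regularization and tuning steps above, the following minimal R example uses the glmnet package (our assumed package choice) on hypothetical simulated data. It fits LASSO and Ridge and selects the penalty level by cross-validation.

```r
# LASSO and Ridge with cross-validated penalty tuning (simulated data)
library(glmnet)

set.seed(1)
n <- 200; p <- 100                      # high-dimensional: p is large relative to n
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1.5, 1, rep(0, p - 3))    # approximately sparse coefficients
y <- drop(X %*% beta + rnorm(n))

cv_lasso <- cv.glmnet(X, y, alpha = 1)  # alpha = 1: LASSO
cv_ridge <- cv.glmnet(X, y, alpha = 0)  # alpha = 0: Ridge; 0 < alpha < 1: Elastic Net

cv_lasso$lambda.min   # penalty minimizing cross-validated error
cv_lasso$lambda.1se   # sparser "one-standard-error" choice

coef(cv_lasso, s = "lambda.1se")  # nonzero entries are the selected features
```

The 1-SE rule trades a small amount of cross-validated fit for a sparser, more interpretable model.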
- Slides: Week 2 Machine Learning Basics
- R Code: Regularization Methods
- R Code: Double LASSO and the Convergence Hypothesis
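For orientation before reading the course code above, here is a minimal sketch of the post-double-selection logic behind Double LASSO, again using glmnet on hypothetical simulated data rather than the growth data used in class.

```r
# Post-double-selection (Double LASSO) on simulated data
library(glmnet)

set.seed(1)
n <- 200; p <- 150
X <- matrix(rnorm(n * p), n, p)
d <- X[, 1] - X[, 2] + rnorm(n)                # treatment driven by a few controls
y <- 1.0 * d + 2 * X[, 1] + X[, 3] + rnorm(n)  # true treatment effect = 1

# Step 1: LASSO of y on X; Step 2: LASSO of d on X
b_y <- as.numeric(coef(cv.glmnet(X, y), s = "lambda.1se"))[-1]
b_d <- as.numeric(coef(cv.glmnet(X, d), s = "lambda.1se"))[-1]

# Step 3: OLS of y on d plus the union of controls selected in either step
controls <- union(which(b_y != 0), which(b_d != 0))
fit <- lm(y ~ d + X[, controls])
coef(summary(fit))["d", ]  # estimate and standard error for the treatment effect
```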
Building on Week 2, where we introduced linear regularization methods to address high-dimensional data, this week we turn to non-linear models in Machine Learning. These approaches are designed to capture flexible and complex relationships among covariates. Our focus will be on two broad classes: Tree-based Methods and Neural Networks, along with their key variants.
- Formally introduce the concept of the bias-variance tradeoff and explain its role in tuning Machine Learning models.
- Present classic Tree-based Methods, including Regression Trees, Bagging, Random Forests, and Boosted Trees, showing how each builds on the bias-variance tradeoff (see the sketch after this list).
- Introduce the foundational Neural Network framework and discuss the theoretical background of training a Neural Network model.
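To make the bias-variance tradeoff concrete, the sketch below compares a single regression tree with a random forest on held-out data. The data are hypothetical and simulated, and the rpart and randomForest packages are our assumed choices, not necessarily those used in class.

```r
# A single regression tree vs. a random forest on held-out data
library(rpart)          # single regression trees
library(randomForest)   # ensembles of trees

set.seed(1)
n <- 1000
df <- data.frame(x = runif(n, -2, 2))
df$y <- sin(2 * df$x) + rnorm(n, sd = 0.3)
train <- sample(n, n / 2)

tree   <- rpart(y ~ x, data = df[train, ])
forest <- randomForest(y ~ x, data = df[train, ], ntree = 500)

mse <- function(fit) mean((df$y[-train] - predict(fit, newdata = df[-train, ]))^2)
c(tree = mse(tree), forest = mse(forest))
# Averaging many decorrelated trees reduces variance relative to a single tree.
```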
Building on the Machine Learning methods introduced in the last two weeks, this week we focus on the Double Machine Learning (DML) approach in partial linear regression, where covariates may be high-dimensional. We formally justify DML using the concept of Neyman Orthogonality, a framework that ensures consistent estimation of the treatment effect even when nuisance functions are estimated with ML. We then connect DML to the potential outcomes framework in causal inference, introducing the key assumption of conditional ignorability, which links regression-based estimation to causal interpretation.
- Formally introduce Neyman Orthogonality and explain why orthogonality is key to making ML-based nuisance estimates usable for valid inference in DML.
- Connect DML to the partial linear regression model with high-dimensional covariates. We explain the importance of hyperparameter tuning and cross-fitting in DML and demonstrate the technique on the high-dimensional data we used to test the Convergence Hypothesis (a minimal cross-fitting sketch follows this list).
- Link DML to the potential outcomes framework and conditional ignorability. We highlight how the regression-based approach ties to causal interpretation under ignorability.
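As a minimal sketch of DML with two-fold cross-fitting in the partial linear model y = θd + g(X) + ε, the code below estimates the nuisance functions E[Y|X] and E[D|X] with random forests (our assumed nuisance learners) on hypothetical simulated data, then runs the residual-on-residual regression implied by the Neyman-orthogonal moment.

```r
# DML with two-fold cross-fitting in the partial linear model (simulated data)
library(randomForest)

set.seed(1)
n <- 1000; p <- 10
X <- matrix(runif(n * p), n, p)
d <- sin(2 * pi * X[, 1]) + rnorm(n, sd = 0.5)            # treatment depends on X
y <- 0.5 * d + cos(2 * pi * X[, 1]) + rnorm(n, sd = 0.5)  # true theta = 0.5

folds <- sample(rep(1:2, length.out = n))  # two folds for cross-fitting
res_y <- res_d <- numeric(n)

for (k in 1:2) {
  tr <- folds != k
  te <- folds == k
  # Fit E[Y|X] and E[D|X] on one fold; residualize on the held-out fold
  res_y[te] <- y[te] - predict(randomForest(X[tr, ], y[tr]), X[te, ])
  res_d[te] <- d[te] - predict(randomForest(X[tr, ], d[tr]), X[te, ])
}

# Residual-on-residual OLS: the Neyman-orthogonal estimate of theta
summary(lm(res_y ~ 0 + res_d))$coefficients
```

Because the confounder enters both equations nonlinearly, a naive regression of y on d would be biased; cross-fitting keeps the nuisance estimation errors orthogonal to the final estimating equation.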