
SOCIOL 690S: Machine Learning in Causal Inference

Taught by Wenhao Jiang · Department of Sociology · Duke University · Fall 2025


Week 1 Introduction, Motivation, and Linear Regression

This week sets the stage for the course and introduces how and why Machine Learning (ML) can be integrated into causal inference.

Roadmap

  • Motivate the integration of statistical prediction with causal inference in response to the emergence of high-dimensional data and the need for flexible, non-linear modeling of covariates.
  • Review the statistical properties of the Conditional Expectation Function (CEF) and linear regression in a low-dimensional setting.
    • The basic matrix formulation of linear regression is revisited.
  • Introduce the Frisch–Waugh–Lovell (FWL) Theorem as a partialling-out technique in linear regression (illustrated in the sketch after this list).
  • Review asymptotic OLS inference and discuss issues with standard error estimation in high-dimensional settings.
  • Summarize the concept of Neyman Orthogonality as an extension of the FWL Theorem to motivate Double Machine Learning (DML) in high-dimensional settings.
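
As a quick illustration of the partialling-out idea, here is a minimal sketch on simulated data; the setup and variable names are illustrative, not taken from the course materials:

```r
# FWL on simulated data: the coefficient on d from the full regression
# equals the coefficient from regressing residualized y on residualized d.
set.seed(1)
n <- 500
x <- matrix(rnorm(n * 3), n, 3)                 # controls
d <- drop(x %*% c(1, -1, 0.5)) + rnorm(n)       # treatment, correlated with controls
y <- 2 * d + drop(x %*% c(0.5, 0.5, 0.5)) + rnorm(n)

full_fit <- lm(y ~ d + x)                       # long regression

d_res <- resid(lm(d ~ x))                       # partial controls out of d
y_res <- resid(lm(y ~ x))                       # partial controls out of y
fwl_fit <- lm(y_res ~ d_res)

coef(full_fit)["d"]                             # about 2
coef(fwl_fit)["d_res"]                          # identical up to numerical error
```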

Materials

Optional Reading: For students who wish to explore the asymptotic properties of OLS in greater depth, see the Week 1 Supplements on asymptotic inference. Models that satisfy Neyman Orthogonality retain the classic asymptotic properties required for valid statistical inference.
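
For reference, the headline result those supplements build toward can be stated as follows; this is a standard textbook formulation with heteroskedasticity-robust variance, in our notation rather than quoted from the supplements:

```latex
% Asymptotic normality of OLS under standard regularity conditions:
\[
\sqrt{n}\,\bigl(\hat\beta - \beta_0\bigr) \xrightarrow{d} N(0,\,V),
\qquad
V = E[XX']^{-1}\, E[XX'\varepsilon^{2}]\, E[XX']^{-1},
\]
% with V estimated by its sample analogue (the "sandwich" estimator).
```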


Week 2 Machine Learning Basics

Building on Week 1, where we introduced both the benefits and the challenges of high-dimensional data, this week focuses on regularization regression methods. These approaches address high dimensionality in order to improve out-of-sample prediction and strengthen statistical inference.

Roadmap

  • Review the motivation for using high-dimensional data in analysis, and examine the limitations of ordinary linear regression in high-dimensional settings.
  • Introduce regularization methods for handling high-dimensional data. We focus in particular on LASSO regression as a feature selection method under approximate sparsity, and Ridge regression for dense coefficient distributions. We also cover variants that combine LASSO and Ridge penalties.
  • Introduce cross-validation and plug-in methods for fine-tuning the penalty level in regularization.
  • Revisit the Frisch–Waugh–Lovell (FWL) Theorem and introduce Double LASSO for statistical inference in high-dimensional settings.
  • Present other LASSO-like methods that satisfy Neyman orthogonality for valid inference.
  • Demonstrate R implementations of regularization methods and Double LASSO, applying them to test the Convergence Hypothesis in Macroeconomics with high-dimensional data.
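
Below is a minimal sketch of both penalties and of Double LASSO via partialling out on simulated data. The glmnet package and the simulated design are our assumptions; the course's own R demonstrations may use different tools (for instance, the hdm package, which implements plug-in penalty levels).

```r
# LASSO (alpha = 1) selects features under sparsity; Ridge (alpha = 0)
# shrinks all coefficients. cv.glmnet tunes the penalty by cross-validation.
# install.packages("glmnet")  # if needed
library(glmnet)

set.seed(1)
n <- 200; p <- 100
X <- matrix(rnorm(n * p), n, p)
d <- X[, 1] + 0.5 * X[, 2] + rnorm(n)            # treatment: sparse in X
y <- 1.5 * d + X[, 1] - X[, 3] + rnorm(n)        # outcome: sparse in X

lasso_cv <- cv.glmnet(X, y, alpha = 1)           # LASSO fit
ridge_cv <- cv.glmnet(X, y, alpha = 0)           # Ridge fit
coef(lasso_cv, s = "lambda.min")                 # sparse: most entries are zero

# Double LASSO via partialling out (FWL with regularized first stages):
y_res <- drop(y - predict(lasso_cv, newx = X, s = "lambda.min"))
d_lasso <- cv.glmnet(X, d, alpha = 1)
d_res <- drop(d - predict(d_lasso, newx = X, s = "lambda.min"))
summary(lm(y_res ~ d_res))                       # coefficient on d_res near 1.5
```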

Materials


Week 3 Machine Learning Advanced

Building on Week 2, where we introduced linear regularization methods to address high-dimensional data, this week we turn to non-linear models in Machine Learning. These approaches are designed to capture flexible and complex relationships among covariates. Our focus will be on two broad classes: Tree-based Methods and Neural Networks, along with their key variants.

Roadmap

  • Formally introduce the concept of the bias-variance tradeoff and explain its role in tuning Machine Learning models.
  • Present classic Tree-based Methods, including Regression Trees, Bagging, Random Forests, and Boosted Trees, showing how each builds on the bias-variance tradeoff (see the sketch after this list).
  • Introduce the foundational Neural Network framework and discuss the theoretical background of training a Neural Network model.
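
As an illustration of the tradeoff, the sketch below compares a single fully grown tree (low bias, high variance) with a random forest that averages many bootstrapped trees. The packages and simulated data are our assumptions, not prescribed by the course:

```r
# A deep, unpruned tree interpolates noise; bagging trees into a random
# forest reduces variance and typically lowers out-of-sample error.
# install.packages(c("rpart", "randomForest"))  # if needed
library(rpart)
library(randomForest)

set.seed(1)
n <- 500
train <- data.frame(x = runif(n, -3, 3))
train$y <- sin(train$x) + rnorm(n, sd = 0.3)
test <- data.frame(x = runif(n, -3, 3))
truth <- sin(test$x)

tree_fit <- rpart(y ~ x, data = train,
                  control = rpart.control(cp = 0, minsplit = 2))  # overfit tree
rf_fit <- randomForest(y ~ x, data = train, ntree = 500)

mean((predict(tree_fit, test) - truth)^2)  # higher test error (high variance)
mean((predict(rf_fit, test) - truth)^2)    # averaging reduces variance
```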

Materials


Week 4 Neyman Orthogonality and the Potential Outcomes Framework

Building on the Machine Learning methods introduced in the last two weeks, this week we focus on the Double Machine Learning (DML) approach in partial linear regression, where covariates may be high-dimensional. We formally justify DML using the concept of Neyman Orthogonality, a framework that ensures consistent estimation of the treatment effect even when nuisance functions are estimated with ML. We then connect DML to the potential outcomes framework in causal inference, introducing the key assumption of conditional ignorability, which links regression-based estimation to causal interpretation.
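
For the partially linear model, a standard Neyman-orthogonal (partialling-out) score is the one below. This is a textbook formulation in the spirit of Chernozhukov et al. (2018), included here for reference rather than quoted from the course notes:

```latex
% Partially linear model:
%   Y = D\theta_0 + g_0(X) + U,   E[U | D, X] = 0
%   D = m_0(X) + V,               E[V | X] = 0
% Partialling-out score with nuisance eta = (l, m), where l_0(X) = E[Y | X]:
\[
\psi(W;\theta,\eta)
  = \bigl( Y - \ell(X) - \theta\,(D - m(X)) \bigr)\,\bigl( D - m(X) \bigr)
\]
% Neyman orthogonality: small errors in the nuisance estimates have no
% first-order effect on the moment condition,
\[
\partial_\eta\, E\bigl[\psi(W;\theta_0,\eta)\bigr]\Big|_{\eta=\eta_0} = 0 .
\]
```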

Roadmap

  • Formally introduce Neyman Orthogonality and explain why orthogonality is key to making ML-based nuisance estimates usable for valid inference in Double Machine Learning (DML).

  • Connect DML to the partial linear regression model with high-dimensional covariates. We explain the importance of hyperparameter tuning and cross-fitting in DML and demonstrate the technique on the high-dimensional data used earlier to test the Convergence Hypothesis (a minimal cross-fitting sketch follows this list).

  • Link DML to the potential outcomes framework and conditional ignorability. We highlight how the regression-based approach ties to causal interpretation under ignorability.
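
Here is a minimal sketch of the cross-fitted DML recipe for the partially linear model, with random forests as nuisance learners. The learners, the fold count, and the simulated data are illustrative assumptions; packaged implementations such as the DoubleML R package handle these choices more carefully:

```r
# DML with 2-fold cross-fitting in Y = theta*D + g(X) + U, using random
# forests to estimate the nuisance functions E[Y | X] and E[D | X].
# install.packages("randomForest")  # if needed
library(randomForest)

set.seed(1)
n <- 1000; p <- 10
X <- matrix(rnorm(n * p), n, p)
d <- sin(X[, 1]) + 0.5 * X[, 2]^2 + rnorm(n)   # nonlinear treatment equation
y <- 0.5 * d + cos(X[, 1]) + rnorm(n)          # true theta_0 = 0.5

folds <- sample(rep(1:2, length.out = n))      # 2-fold cross-fitting
y_res <- d_res <- numeric(n)
for (k in 1:2) {
  tr <- folds != k
  te <- folds == k
  m_hat <- randomForest(X[tr, ], d[tr])        # E[D | X] on training fold
  l_hat <- randomForest(X[tr, ], y[tr])        # E[Y | X] on training fold
  d_res[te] <- d[te] - predict(m_hat, X[te, ]) # residuals on held-out fold
  y_res[te] <- y[te] - predict(l_hat, X[te, ])
}

theta_hat <- sum(d_res * y_res) / sum(d_res^2) # final partialling-out step
se <- sqrt(mean((y_res - theta_hat * d_res)^2 * d_res^2)) /
      (sqrt(n) * mean(d_res^2))                # influence-function-based SE
c(estimate = theta_hat, std.error = se)
```

Cross-fitting (fitting nuisances on one fold and residualizing on the other) is what prevents the ML learners' overfitting from biasing the final estimate.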

Materials


About

Repository for the course SOCIOL 690S Machine Learning in Causal Inference. Materials are updated frequently, including the syllabus.