# An Introduction to Double Machine Learning

## Must-know Concepts
- Causal discovery
    - the task of figuring out what causes what
    - aims to recover the underlying causal structure among variables from observational data by distinguishing cause–effect relationships from mere correlations
- Treatment variable
    - the variable whose causal effect on an outcome variable is of interest
- Outcome variable
    - the variable that is affected by the treatment variable
- Confounding variable
    - factors that simultaneously influence both the treatment and the outcome, potentially inducing spurious associations if not properly controlled

### Identification of Variables
Purpose: does stronger wind reduce PM$_{2.5}$ concentration?

| Variable    | What it represents       | Role in causal analysis | Why                                                           |
| ----------- | ------------------------ | ----------------------- | ------------------------------------------------------------- |
| Wind speed  | Strength of air movement | **Treatment (T)**       | This is the factor we conceptually “change” to see its effect |
| PM$_{2.5}$  | Air pollution level      | **Outcome (Y)**         | This is what we want to explain                               |
| Temperature | Air temperature          | **Confounder (X)**      | Affects both wind patterns and pollution chemistry            |
| Humidity    | Moisture in the air      | **Confounder (X)**      | Influences wind behavior and particle formation               |
| Season      | Time of year             | **Confounder (X)**      | Determines wind regimes and emission intensity                |
| Location    | Station characteristics  | **Confounder (X)**      | Influences local wind channeling and baseline pollution       |



## Definition
DML estimates causal parameters in a robust manner by using machine learning to model nuisance components, combined with orthogonalization and cross-fitting techniques.
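DML does not discover causality by itself. It estimates the size of a causal effect given an assumed causal structure, that is, once the treatment, outcome, and confounders have already been identified.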



## The workflow of DML
The objective of DML is to estimate the causal parameter $\theta$ (the direct effect of Treatment T on Outcome Y) while controlling for confounders X using machine learning models to capture complex relationships.
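
To see why regressing residuals on residuals can recover $\theta$, it may help to write the setup as a partially linear model. The sketch below is one common way to state it; the symbol $f(X)$ (the confounders' direct contribution to the outcome) and the error term $\epsilon$ are introduced here for illustration and do not appear elsewhere in these notes, while $g(X)$ and $m(X)$ match the nuisance functions used in the steps that follow.

$$
\begin{aligned}
Y &= \theta\, T + f(X) + \epsilon, & \mathbb{E}[\epsilon \mid T, X] &= 0 \\
T &= m(X) + \epsilon_T, & \mathbb{E}[\epsilon_T \mid X] &= 0 \\
g(X) &:= \mathbb{E}[Y \mid X] = \theta\, m(X) + f(X) \\
Y - g(X) &= \theta\,\bigl(T - m(X)\bigr) + \epsilon
\end{aligned}
$$

The last line follows by subtracting the third line from the first. Under these assumptions, the outcome residual is $\theta$ times the treatment residual plus noise, which is the relationship the final-stage regression exploits.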

1. Data Splitting (Cross-Fitting)
To avoid overfitting bias, the dataset is randomly partitioned into $K$ folds (typically 2 or 5).
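    - Action: Train the machine learning models on all folds except one and predict values for the held-out fold, rotating until every observation has an out-of-fold prediction. With $K = 2$, this means training on Fold A and predicting on Fold B, then swapping.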

2. Nuisance Estimation (Stage 1)
Two machine learning models are trained to capture the nuisance functions $g(X)$ and $m(X)$.
    - The outcome model $g(X)$ predicts the outcome Y based on confounders X.
        - Equation: $Y = g(X) + \epsilon_Y$
        - Goal: Capture how environmental or background factors explain the outcome. 
    - The treatment model $m(X)$ predicts the treatment T based on confounders X.
        - Equation: $T = m(X) + \epsilon_T$
        - Goal: Capture how background factors influence the assignment or level of the treatment.

3. Residualization (Orthogonalization)
Calculate the residuals by subtracting the predicted values from the actual observed values.
    - Outcome residual: $\tilde{Y} = Y - \hat{g}(X)$
        - Goal: Isolate the variation in Outcome that is not explained by confounders.
    - Treatment residual: $\tilde{T} = T - \hat{m}(X)$
        - Goal: Isolate the variation in Treatment that is not explained by confounders.
    
4. Causal Estimation (Stage 2)
Perform a final regression of the outcome residuals on the treatment residuals to identify the causal link.
    - Equation: $\tilde{Y} = \theta \tilde{T} + \eta$
    - Here, $\theta$ represents the estimated causal effect of Treatment T on Outcome Y, often called the debiased causal effect.
    - Because both residuals have had the part that is predictable from the confounders $X$ removed, $\theta$ is an approximately unbiased estimate of $T$’s impact on $Y$, provided that $X$ contains all relevant confounders and the nuisance models are reasonably accurate.

5. Aggregation
Average the estimates of $\theta$ across all folds to obtain the final causal effect estimate.
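
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the data are synthetic (with a true effect of -2.0 chosen arbitrarily), the helper names `fit_predict_ridge` and `dml_plr` are hypothetical, and a small polynomial ridge regression stands in for whatever machine learning model (random forest, gradient boosting, and so on) one would use for the nuisance functions in practice.

```python
import numpy as np


def fit_predict_ridge(X_train, y_train, X_test, degree=3, alpha=1e-3):
    """Hypothetical stand-in nuisance learner: ridge regression on
    per-column polynomial features up to `degree` (no cross terms)."""
    def features(X):
        cols = [np.ones(len(X))]
        for d in range(1, degree + 1):
            cols.extend(X[:, j] ** d for j in range(X.shape[1]))
        return np.column_stack(cols)

    A, B = features(X_train), features(X_test)
    # Closed-form ridge solution; the small penalty also touches the intercept.
    coef = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y_train)
    return B @ coef


def dml_plr(Y, T, X, n_folds=5, seed=0):
    """Cross-fitted residual-on-residual estimate of theta."""
    rng = np.random.default_rng(seed)
    # Step 1: random partition of the row indices into K folds.
    folds = np.array_split(rng.permutation(len(Y)), n_folds)
    thetas = []
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Step 2: fit g(X) and m(X) on the training folds, predict held-out fold.
        g_hat = fit_predict_ridge(X[train_idx], Y[train_idx], X[test_idx])
        m_hat = fit_predict_ridge(X[train_idx], T[train_idx], X[test_idx])
        # Step 3: residualize outcome and treatment.
        Y_res = Y[test_idx] - g_hat
        T_res = T[test_idx] - m_hat
        # Step 4: no-intercept least-squares slope of Y_res on T_res.
        thetas.append((T_res @ Y_res) / (T_res @ T_res))
    # Step 5: average the fold-wise estimates.
    return float(np.mean(thetas))


# Synthetic, confounded data (illustrative assumption, not real measurements).
rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 2))                       # confounders
T = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)   # treatment depends on X
true_theta = -2.0
Y = true_theta * T + 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)

theta_hat = dml_plr(Y, T, X)
# Naive slope of Y on T that ignores the confounders, for comparison.
naive = float(np.cov(T, Y)[0, 1] / np.var(T, ddof=1))

print(f"true theta:     {true_theta}")
print(f"naive estimate: {naive:.3f}")
print(f"DML estimate:   {theta_hat:.3f}")
```

In this setup the naive slope absorbs the confounders' influence on both variables, while the cross-fitted estimate should land close to the true value. Averaging fold-wise slopes follows step 5 as written; pooling all out-of-fold residuals into a single final regression is a common alternative.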



