Definition of the causal discovery task

Setting: Observe a single multivariate time series over variables $X_1, …, X_n$ in the time window $[t_0 - T, t_0]$

Assumption: This is a sample from a stationary stochastic process.

Task: Learn the time series graph by testing for (conditional) independencies in the data

Problem: Formally, the time series graph extends ad infinitum into the future and the past and therefore has an infinite number of vertices

Solution: Due to stationarity, its structure is repetitive in time. Therefore, it is sufficient to learn the edges that end at an arbitrary time step t. We can thus restrict ourselves to learning the subgraph within a finite time window.

Lagged edges: Edges between two vertices at different time steps

Contemporaneous edges: Edges between two vertices at the same time step

Autodependency edges: Edges between two time steps of the same variable

Lag: The $\tau$ in $X_{s - \tau}^{i} \to X_{s}^{j}$ with $\tau \ge 0$

Process order: The maximal lag in the time series graph.

Maximal considered time lag: The number $\tau_{\max} \ge 0$ specifying the considered time window $[t - \tau_{\max}, t]$.

Observation: We need $\tau_{\max}$ at least as large as the process order in order to learn the time series graph. 

Choosing $\tau_{\max}$ in practice: Through scientific domain knowledge or in a data-driven way (correlation functions, fit scores).
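
As a rough illustration of the data-driven route, the sketch below computes a lagged cross-correlation function and reads off the largest lag at which the correlation is still clearly non-zero. The helper `lagged_cross_correlation`, the toy data (a hypothetical dependence at lag 3) and the threshold of $3 / \sqrt{T}$ are assumptions made for illustration only, not a prescribed procedure.

```python
import numpy as np

def lagged_cross_correlation(x, y, max_lag):
    """Sample correlation between x[t - lag] and y[t] for lag = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    T = len(x)
    return np.array([
        np.corrcoef(x[: T - lag], y[lag:])[0, 1]
        for lag in range(max_lag + 1)
    ])

# Toy data (assumption): y depends on x with a lag of 3 time steps.
rng = np.random.default_rng(0)
T = 2000
x = rng.standard_normal(T)
y = rng.standard_normal(T)
y[3:] += 0.8 * x[:-3]

ccf = lagged_cross_correlation(x, y, max_lag=10)

# Heuristic threshold (assumption): about three standard errors for iid data.
threshold = 3.0 / np.sqrt(T)
significant_lags = np.flatnonzero(np.abs(ccf) > threshold)
suggested_tau_max = int(significant_lags.max())
print("lagged correlations:", np.round(ccf, 2))
print("suggested tau_max:", suggested_tau_max)
```

In a real analysis one would inspect such functions for all variable pairs and take $\tau_{\max}$ no smaller than the largest lag with a clear dependence.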

Problem: We observe only a single (multivariate) time series, so what are the samples required for independence testing?

Solution: Due to stationarity, only relative times (i.e. lags) matter, not absolute times. Therefore, we can create multiple samples in a sliding-window manner.
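
The sliding-window construction can be sketched as follows. The function name `sliding_window_samples` and the array layout are assumptions chosen for illustration: each sample collects the values of all variables at the lags $0, …, \tau_{\max}$ relative to one reference time step.

```python
import numpy as np

def sliding_window_samples(data, tau_max):
    """Cut a single time series of shape (T, n) into T - tau_max samples.

    The result has shape (T - tau_max, tau_max + 1, n), with
    samples[s, k, i] = data[s + tau_max - k, i], so k is the lag.
    """
    data = np.asarray(data)
    T = data.shape[0]
    if not 0 <= tau_max < T:
        raise ValueError("tau_max must satisfy 0 <= tau_max < T")
    # One lagged copy of the series per lag k, aligned on the reference time.
    lagged = [data[tau_max - k : T - k] for k in range(tau_max + 1)]
    return np.stack(lagged, axis=1)

# Toy series with T = 5 time steps and n = 2 variables.
data = np.arange(10).reshape(5, 2)
samples = sliding_window_samples(data, tau_max=2)
print(samples.shape)   # (number of samples, number of lags, number of variables)
print(samples[0])      # rows: lag 0, lag 1, lag 2 for the first reference time
```

Note that neighbouring samples overlap in time, so they are in general not independent of each other.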

Available background knowledge: The structure of the time series graph is repetitive in time. Causation cannot go back in time.

Consequence: We need only learn the edges that end at time t. In addition, we may be able to distinguish between Markov equivalent graphs.

Corollary: Since causation cannot go back in time, only the orientation of contemporaneous edges can remain ambiguous.

Straightforward approach:

Take a causal discovery algorithm, for example PC, and modify it using the available background knowledge to
1. simplify the algorithm
2. learn more than could be learned without the background knowledge.

Problem: This typically leads to poor statistical performance, especially for time series with large autocorrelation.

More sophisticated approach:

As one of its major lines of research, the Causal Inference Group at the DLR-Institute of Data Science develops causal discovery algorithms specialized to time series that yield superior statistical performance:

* PCMCI: assumes lagged links only, causal sufficiency
* PCMCI+: assumes causal sufficiency
* LPCMCI

Why does the straightforward approach not work well?

Time series data are typically autocorrelated, which leads to the following statistical complications:

1. Low statistical detection power: Many tests wrongly judge independence although dependence is true
2. Inflated false positives due to ill-calibrated tests of independence: Samples are not iid because they are drawn from different time steps of the same time series.
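
The second complication can be made concrete with a small simulation. The sketch below tests two independent AR(1) processes for correlation with a Fisher z-test, which assumes iid samples. The autoregressive coefficient 0.9, the sample size and the number of trials are arbitrary choices for illustration.

```python
import numpy as np

def ar1(T, a, rng):
    """Simulate x[t] = a * x[t - 1] + noise[t] with standard normal noise."""
    x = np.zeros(T)
    noise = rng.standard_normal(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + noise[t]
    return x

def false_positive_rate(a, T=200, trials=500, seed=0):
    """Share of trials in which two independent AR(1) series test as dependent."""
    rng = np.random.default_rng(seed)
    critical_value = 1.96  # two-sided 5% level of the standard normal
    rejections = 0
    for _ in range(trials):
        x = ar1(T, a, rng)
        y = ar1(T, a, rng)  # generated independently of x
        r = np.corrcoef(x, y)[0, 1]
        z = np.sqrt(T - 3) * np.arctanh(r)  # Fisher z, valid for iid samples
        rejections += abs(z) > critical_value
    return rejections / trials

rate_iid = false_positive_rate(a=0.0)
rate_autocorrelated = false_positive_rate(a=0.9)
print("false positive rate without autocorrelation:", rate_iid)
print("false positive rate with autocorrelation:   ", rate_autocorrelated)
```

Without autocorrelation the rejection rate should stay close to the nominal 5% level, whereas with strong autocorrelation the same test rejects far more often although the two series are independent.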

The widely used PC algorithm, a method based on conditional independence (CI) tests, has two main phases: a skeleton (link removal) phase and an orientation phase. The skeleton phase can use different CI tests: partial correlation, conditional mutual information $I(X; Y | S)$, etc.

Statistical power for detecting $X \not\perp\!\!\!\perp Y | S$ depends on
1. Sample size (given by dataset)
2. Significance level $\alpha$ (given by researcher / hyperparameter)
3. Condition dimension / complexity, i.e. cardinality $|S|$ (PC optimizes this)
4. Effect size, i.e. magnitude of $I(X; Y | S)$ (addressed here)

Suppose $Z \to X \to Y \leftarrow W$. Then conditioning on $Z$ reduces the effect size and conditioning on $W$ increases it:
* $I(X; Y | Z) \le I(X; Y) \le I(X; Y | W)$

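Proof. Consider the following chain-rule decomposition:
* $I(X, Z; Y) = I(X; Y) + \underbrace{I(Z; Y | X)}_{= 0 \text{ (Markovity)}} = \underbrace{I(Z; Y)}_{\ge 0 \text{ (non-negativity of MI)}} + I(X; Y | Z)$
* The second inequality follows analogously from $I(X; Y, W) = \underbrace{I(X; W)}_{= 0} + I(X; Y | W) = I(X; Y) + \underbrace{I(X; W | Y)}_{\ge 0}$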
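
The inequality can be checked numerically in a linear Gaussian toy model, where conditional mutual information has the closed form $I = -\frac{1}{2} \ln(1 - \rho^2)$ with $\rho$ the partial correlation. The coefficients and sample size below are arbitrary assumptions, and the helper names are hypothetical.

```python
import numpy as np

def partial_corr(x, y, cond=None):
    """Correlation of x and y after regressing out cond (with an intercept)."""
    if cond is not None:
        design = np.column_stack([np.ones_like(x), cond])
        x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
        y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(x, y)[0, 1]

def gaussian_cmi(x, y, cond=None):
    """Conditional mutual information under a joint Gaussian assumption."""
    rho = partial_corr(x, y, cond)
    return -0.5 * np.log(1.0 - rho**2)

# Toy model for Z -> X -> Y <- W with arbitrary coefficients (assumption).
rng = np.random.default_rng(0)
N = 20000
Z = rng.standard_normal(N)
W = rng.standard_normal(N)
X = 0.8 * Z + rng.standard_normal(N)
Y = 0.5 * X + 0.8 * W + rng.standard_normal(N)

cmi_given_z = gaussian_cmi(X, Y, Z)
mi_marginal = gaussian_cmi(X, Y)
cmi_given_w = gaussian_cmi(X, Y, W)
print("I(X;Y|Z) =", round(cmi_given_z, 3))
print("I(X;Y)   =", round(mi_marginal, 3))
print("I(X;Y|W) =", round(cmi_given_w, 3))
```

Conditioning on $Z$ removes part of the variance of $X$ that carries the signal, while conditioning on $W$ removes variance of $Y$ that is unrelated to $X$.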

Problem: The PC algorithm tests whether $X \not\perp\!\!\!\perp Y | S$ for all conditioning sets $S$ drawn from the (non-future) adjacencies of $X$ and $Y$, and the link is removed if $\min_S I(X; Y | S) < I_{\alpha}$

Effect-size-weakening conditions: False negatives (missed links)

Two sources of false positives (incorrect links)
1. Separating conditions removed in earlier steps
2. Ill-calibrated CI tests due to autocorrelation (see experiments)

The orientation phase then generally also suffers from wrong sepsets or from similar CI test issues (cf. the majority / conservative collider rule options in PC)

PCMCI+ improves the reliability of CI tests by optimizing the choice of conditioning sets.
Steps:
1. PC_1 lagged phase
2. MCI contemporaneous phase
3. Orientation phase

The PC_1 lagged phase differs from the PC algorithm in two ways:
1. S iterates through lagged links only
2. Only lagged conditions with largest association with $X_t^{i}$ are used: $S = \{A(X_t^{i})\}_{l=1}^p$ for every cardinality p => less likely to condition on effect-size-weakening parents of $X_{t - \tau}^{i}$.

Still sufficient to block most paths

PC_1 converges to the lagged parents plus the parents of contemporaneous ancestors: $\beta_t(X_t^{j})$

The MCI contemporaneous phase is initialized with the lagged links $\beta_t(X_t^{j})$ and all contemporaneous links, and conducts further CI tests for link removals.

The MCI phase iterates through contemporaneous conditions $S \subseteq A_t(X_t^{j})$ with MCI tests: $X_{t - \tau}^{i} \perp\!\!\!\perp X_t^{j} | S, \beta_t(X_t^{j}) \backslash \{X_{t - \tau}^{i}\}, \beta_{t - \tau}(X_{t - \tau}^{i})$

Conditioning on S blocks contemporaneous paths, while $\beta_t(X_t^{j})$ and $\beta_{t - \tau}(X_{t - \tau}^{i})$ block lagged paths.

Conditioning on both $\beta_t(X_t^{j})$ and $\beta_{t - \tau}(X_{t - \tau}^{i})$ is important since
1. Contemporaneous MCI effect size is larger than PC effect size
2. MCI tests are well-calibrated (both discussed in the paper)

Spurious links due to contemporaneous drivers are removed and sepsets stored.

Orientation phase: as for the PC algorithm, but also with MCI tests for the majority / conservative collider rule. Once PCMCI+ has converged, the learned links are repeated in time under the stationarity assumption.

Further theoretical considerations in paper:
* Asymptotic consistency: PCMCI+ is sound and complete
* MCI tests are well-calibrated also for autocorrelated data

Granger causality

In his influential work, C. W. J. Granger introduced a concept of causality based on prediction, which to this day is a commonly used tool for causal analyses of time series.

Definition: Time series $X_i$ is said to Granger cause time series $X_j$ if, conditional on the past of all time series but that of $X_i$, the past of $X_i$ helps in predicting $X_j$.

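Example: Linear case. Fit two linear autoregressive models for $X_j$: model 1 uses the past of all time series except $X_i$, and model 2 additionally uses the past of $X_i$. $X_i$ Granger causes $X_j$ if model 2 has a significantly smaller prediction error.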
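
A minimal sketch of the linear case, under simplifying assumptions: only two time series, a single lag, and ordinary least squares. Here model 1 is taken to be the regression of the target on its own past and model 2 the regression that additionally includes the past of the candidate cause. This labelling, the helper names and the toy data are assumptions for illustration. With more than two series, both models would also include the past of all remaining series.

```python
import numpy as np

def lagged_design(series_list, max_lag):
    """Design matrix with an intercept and lags 1..max_lag of each series."""
    T = len(series_list[0])
    columns = [np.ones(T - max_lag)]
    for series in series_list:
        for lag in range(1, max_lag + 1):
            columns.append(series[max_lag - lag : T - lag])
    return np.column_stack(columns)

def residual_sum_of_squares(design, target):
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float(residuals @ residuals)

def granger_f_statistic(cause, effect, max_lag=1):
    """F statistic comparing model 1 (own past) with model 2 (plus cause's past)."""
    target = effect[max_lag:]
    model_1 = lagged_design([effect], max_lag)
    model_2 = lagged_design([effect, cause], max_lag)
    rss_1 = residual_sum_of_squares(model_1, target)
    rss_2 = residual_sum_of_squares(model_2, target)
    dof = len(target) - model_2.shape[1]
    return ((rss_1 - rss_2) / max_lag) / (rss_2 / dof)

# Toy data (assumption): x_i drives x_j with lag 1, but not vice versa.
rng = np.random.default_rng(0)
T = 1000
x_i = np.zeros(T)
x_j = np.zeros(T)
for t in range(1, T):
    x_i[t] = 0.5 * x_i[t - 1] + rng.standard_normal()
    x_j[t] = 0.5 * x_j[t - 1] + 0.5 * x_i[t - 1] + rng.standard_normal()

f_i_to_j = granger_f_statistic(x_i, x_j)
f_j_to_i = granger_f_statistic(x_j, x_i)
print("F statistic for x_i -> x_j:", round(f_i_to_j, 1))
print("F statistic for x_j -> x_i:", round(f_j_to_i, 1))
```

A large F statistic indicates that model 2 predicts significantly better than model 1. In practice the statistic would be compared against the corresponding F distribution to obtain a p-value.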

Formalization: $X_i$ Granger causes $X_j$ if $X_t^{j} \not\perp\!\!\!\perp X_{< t}^{i} | X_{<t}^{-i}$

Theorem: Let G be a time series graph without contemporaneous edges. Then $X_t^{j} \not\perp\!\!\!\perp X_{< t}^{i} | X_{<t}^{-i}$ if and only if there is $X_{t - \tau}^{i} \to X_t^{j}$ for some $\tau > 0$

Limitations of Granger causality:
* Requires causal sufficiency, i.e. does not allow unobserved confounders.
* Does not allow contemporaneous causal relationships.
