Causal graph including noise variables

* Intuition: the noise variables $\eta_i$ stand in for all factors outside of the model, summarized by a distribution $p(\eta_1, …, \eta_n)$.
* The deterministic structural assignments transform this into a distribution $p(x_1, …, x_n)$ over the modeled variables $X$.
* Joint independence of $\{\eta_i\}$, namely $p(\eta_1, …, \eta_n) = \prod_i p(\eta_i)$, means that $X$ is large enough for all dependencies among the variables in $X$ to be modeled explicitly by the structural assignments.
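
The push-forward from noise to modeled variables can be illustrated with a minimal sketch. The three-variable chain, its coefficients, and the Gaussian noise are assumptions chosen purely for illustration; they are not part of the notes.

```python
import random

def sample_scm(n_samples, seed=0):
    """Draw samples from a toy SCM X1 -> X2 -> X3 (hypothetical coefficients)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Jointly independent noise: p(eta_1, eta_2, eta_3) = prod_i p(eta_i)
        eta1, eta2, eta3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        # Deterministic structural assignments map the noise to (x1, x2, x3)
        x1 = eta1
        x2 = 2.0 * x1 + eta2
        x3 = -1.0 * x2 + eta3
        samples.append((x1, x2, x3))
    return samples

def correlation(u, v):
    """Pearson correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

samples = sample_scm(10_000)
x1, x2, x3 = zip(*samples)
print(round(correlation(x1, x2), 2), round(correlation(x1, x3), 2))
```

Although the noise terms are independent, the sampled variables are dependent: all dependence among them is created by the structural assignments.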

Learn causal relationships, typically in the form of the causal graph, from observational data given appropriate assumptions which are to be specified. 

The fundamental distinction between causal and statistical relationships (cf. the ladder of causation) necessitates assumptions.

Constraint-based:
* The structure of the causal graph imposes constraints on the observational distribution, e.g. conditional independencies.
* Detect these constraints and use them to infer the causal graph.

Score-based:
* Devise a parametric function $\mathrm{Score}(D, G)$ that assigns a numerical score to each graph $G$ given the observed data $D$.
* Choose the best scoring graph.
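
As a rough sketch of the score-based idea, the following uses the BIC of a linear Gaussian model as the score. This choice of score, the two-variable setting, and the simulated data are assumptions made for illustration; the notes do not prescribe a particular score.

```python
import math
import random

def gaussian_node_loglik(residuals):
    """Maximized Gaussian log-likelihood of one node given its residuals."""
    n = len(residuals)
    var = sum(r * r for r in residuals) / n  # MLE of the residual variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def fit_residuals(child, parent=None):
    """Residuals of a least-squares fit of child on an intercept (and parent)."""
    n = len(child)
    mean_c = sum(child) / n
    if parent is None:
        return [c - mean_c for c in child]
    mean_p = sum(parent) / n
    cov = sum((p - mean_p) * (c - mean_c) for p, c in zip(parent, child))
    var = sum((p - mean_p) ** 2 for p in parent)
    slope = cov / var
    return [(c - mean_c) - slope * (p - mean_p) for p, c in zip(parent, child)]

def bic_score(data, graph):
    """BIC of a graph given as {node: parent or None}; higher is better."""
    n = len(next(iter(data.values())))
    loglik, n_params = 0.0, 0
    for node, parent in graph.items():
        res = fit_residuals(data[node], data[parent] if parent else None)
        loglik += gaussian_node_loglik(res)
        n_params += 2 + (1 if parent else 0)  # intercept, variance, optional slope
    return loglik - 0.5 * n_params * math.log(n)

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [1.5 * xi + rng.gauss(0, 1) for xi in x]
data = {"X": x, "Y": y}

candidates = {
    "empty": {"X": None, "Y": None},
    "X -> Y": {"X": None, "Y": "X"},
    "Y -> X": {"X": "Y", "Y": None},
}
scores = {name: bic_score(data, graph) for name, graph in candidates.items()}
for name, score in scores.items():
    print(name, round(score, 1))
```

Both directed graphs beat the empty graph, but they obtain the same score up to floating-point error: a linear Gaussian score alone cannot orient the edge.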

SCM-based (bivariate case):
* Make parametric assumptions on the noise distributions and on the functional relationship between cause $C$ and effect $E$, namely on $f$ in $E := f(C, \eta_E)$, such that $f$ cannot be inverted within the same function class.
* Choose the direction yielding the better fit as causal. Can be generalized to multivariate case. 
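
A minimal sketch of the bivariate idea, assuming a linear model with additive uniform (non-Gaussian) noise. "Fit" is judged here by how independent the regression residual is of the regressor, and the hypothetical `dependence_score` is only a crude proxy (correlation of squares) for a proper independence test.

```python
import random

def mean(values):
    return sum(values) / len(values)

def correlation(u, v):
    """Pearson correlation of two equally long sequences."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def dependence_score(cause, effect):
    """|corr| between squared centered regressor and squared residual."""
    mc, me = mean(cause), mean(effect)
    slope = (sum((c - mc) * (e - me) for c, e in zip(cause, effect))
             / sum((c - mc) ** 2 for c in cause))
    residuals = [(e - me) - slope * (c - mc) for c, e in zip(cause, effect)]
    return abs(correlation([(c - mc) ** 2 for c in cause],
                           [r ** 2 for r in residuals]))

# Simulated ground truth (assumption): C causes E with uniform additive noise
rng = random.Random(0)
c = [rng.uniform(-1, 1) for _ in range(5000)]
e = [ci + rng.uniform(-1, 1) for ci in c]

forward = dependence_score(c, e)   # model E := f(C) + noise
backward = dependence_score(e, c)  # model C := g(E) + noise
inferred = "C -> E" if forward < backward else "E -> C"
print(round(forward, 3), round(backward, 3), inferred)
```

In the causal direction the residual is (nearly) independent of the regressor, whereas in the anti-causal direction it is uncorrelated but clearly dependent. With Gaussian noise this asymmetry would vanish.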

Data generation by SCM

Assumption 1 (Data generation by SCM):
* There is an SCM that generates the observational distribution p (observed) and interventional distributions (unobserved)
* Reflects the view that the data generating process is a composition of autonomous causal mechanisms
* May fail if individual mechanisms 

Assumption 2 (Acyclicity): 
* The data generating SCM is acyclic.
* Many algorithms require this as a technical assumption, which is indeed a limitation. But for time-dependent processes it is easier to justify.
* Recent development: Assumption can be relaxed.

Corollary: Causal Markov condition

If $\cal C$ is an acyclic SCM, the causal graph $G = G(\cal C)$ is a Bayesian network for the observational distribution $p$ generated by $\cal C$:
* $X \bowtie Y | S \implies X \perp\!\!\!\perp Y | S$: d-separation implies independence
* $X \not\perp\!\!\!\perp Y | S \implies X \not\bowtie Y | S$: dependence implies d-connection

Assumption 3 (Causal faithfulness)
* Every independence in p is due to a d-separation in G:
* $X \perp\!\!\!\perp Y | S \implies X \bowtie Y | S$: independence implies d-separation

Corollary: Assumptions 1 - 3 imply that conditional independence in the observational distribution is equivalent to d-separation in the causal graph.

* $X \perp\!\!\!\perp Y | S \iff X \bowtie Y | S$
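
Since the equivalence reduces independence questions to graph questions, it may help to see how d-separation can be checked mechanically. The sketch below uses the moralized ancestral graph criterion; the dictionary representation (`parents` maps each node to its parents) and the function names are assumptions for illustration.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return the given nodes together with all their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def d_separated(parents, x, y, s=()):
    """Check whether x and y are d-separated given the set s in a DAG."""
    s = set(s)
    keep = ancestors(parents, {x, y} | s)
    # Moralize the ancestral subgraph: connect parents to children, marry co-parents
    neighbours = {node: set() for node in keep}
    for child in keep:
        pa = [p for p in parents.get(child, ()) if p in keep]
        for p in pa:
            neighbours[child].add(p)
            neighbours[p].add(child)
        for p, q in combinations(pa, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)
    # x and y are d-separated iff s blocks every path in the moral graph
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in s:
            continue
        seen.add(node)
        stack.extend(neighbours[node])
    return True

chain = {"X": [], "Y": ["X"], "Z": ["Y"]}     # X -> Y -> Z
collider = {"X": [], "Z": [], "Y": ["X", "Z"]}  # X -> Y <- Z

print(d_separated(chain, "X", "Z"), d_separated(chain, "X", "Z", {"Y"}))
print(d_separated(collider, "X", "Z"), d_separated(collider, "X", "Z", {"Y"}))
```

Conditioning on the middle node blocks the chain but opens the collider, which is the asymmetry that constraint-based methods exploit.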

Discussion:
* Most heavily criticized assumption of constraint-based causal discovery. 
* Certain relaxations to related but weaker assumptions possible. 
* Reflects the view that there are no accidental independencies due to tuned parameters
* For linear models non-faithful distributions provably correspond to a measure zero subset of the parameter space. 

Sources of violations: 

1. Counteracting mechanisms:

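In this example $X \perp\!\!\!\perp Z$ for $c = -ab$ despite d-connection: substituting $Y$ gives $Z = (ab + c)X + b\eta_Y + \eta_Z$, so the direct effect $c$ cancels the indirect effect $ab$ via $Y$.

$X = \eta_X, Y = aX + \eta_Y, Z = bY + cX + \eta_Z$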
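
The cancellation can be checked numerically. This sketch assumes the third assignment reads $Z = bY + cX + \eta_Z$ with standard Gaussian noise; the values of $a$ and $b$ are arbitrary illustrative choices.

```python
import random

def correlation(u, v):
    """Pearson correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    var_u = sum((p - mu) ** 2 for p in u)
    var_v = sum((q - mv) ** 2 for q in v)
    return cov / (var_u * var_v) ** 0.5

def corr_x_z(a, b, c, n=20_000, seed=0):
    """Sample the SCM and return the empirical correlation of X and Z."""
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = a * x + rng.gauss(0, 1)
        z = b * y + c * x + rng.gauss(0, 1)  # assumed form of the assignment
        xs.append(x)
        zs.append(z)
    return correlation(xs, zs)

a, b = 2.0, 1.5
corr_cancel = corr_x_z(a, b, c=-a * b)  # fine-tuned: direct and indirect paths cancel
corr_generic = corr_x_z(a, b, c=1.0)    # generic parameters: dependence is visible
print(round(corr_cancel, 3), round(corr_generic, 3))
```

Only the fine-tuned choice $c = -ab$ hides the dependence; any other value of $c$ restores it, which is the sense in which faithfulness violations of this kind require tuned parameters.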

2. Determinism: 

For $X = f(Z)$ deterministic we get $I(X; Y | Z) = I(f(Z); Y | Z) = 0$ and thus $X \perp\!\!\!\perp Y | Z$ even though $Y$ depends on $X$. 

$X = f(Z), Y = h(X) + \eta_Y, Z = \eta_Z$

One might argue that in complex systems unresolved processes almost always introduce noise, so that relationships are rarely exactly deterministic.

Here, $I(X; Y | Z)$ denotes the conditional mutual information.

3. Purely non-pairwise dependencies:

Here $I(X; Y) = I(Z; Y) = 0$ even though both $X$ and $Z$ are d-connected to $Y$.

$Y = X \oplus Z + \eta_Y, P(X = 0) = P(X = 1) = P(Z = 0) = P(Z = 1) = 0.5$

where $\oplus$ denotes the exclusive or (XOR).

If neither $X$ nor $Z$ is uniformly distributed, i.e. $P(X = 1) \ne 0.5$ and $P(Z = 1) \ne 0.5$, faithfulness is no longer violated.
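
The role of the uniform distribution can be verified by computing the mutual information exactly. The sketch assumes that $Y$ is the exclusive or of independent binary $X$ and $Z$ and drops the noise term $\eta_Y$ for simplicity; the function name is hypothetical.

```python
from itertools import product
from math import log2

def mutual_information_x_y(p_x1, p_z1):
    """Exact I(X; Y) in bits for Y = X xor Z with independent binary X and Z."""
    joint = {}
    for x, z in product((0, 1), repeat=2):
        p = (p_x1 if x else 1 - p_x1) * (p_z1 if z else 1 - p_z1)
        y = x ^ z
        joint[(x, y)] = joint.get((x, y), 0.0) + p
    p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
    p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

mi_uniform = mutual_information_x_y(0.5, 0.5)  # both inputs uniform
mi_skewed = mutual_information_x_y(0.3, 0.3)   # neither input uniform
print(mi_uniform, round(mi_skewed, 3))
```

By symmetry the same function gives $I(Z; Y)$ when its arguments are swapped. In this noise-free setting $I(X; Y)$ vanishes exactly when $Z$ is uniform, so both pairwise dependencies become visible only if neither input is uniform.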

Assumption 4 (Causal sufficiency)
* One observes (samples from) the marginal distribution $p(w) = \int p(w, l)\, dl$ of the observational distribution $p(x)$, where $W \subset X$ is a causally sufficient subset of variables and $L = X \backslash W$.
* Variables in $W$ are called observed, variables in $L$ are called unobserved.

Hidden common cause:

* An unobserved variable $L_i \in L = X \backslash W$ is a hidden common cause of two distinct observed variables $W_1, W_2 \in W$ if there are directed paths $\pi_1$ and $\pi_2$ from $L_i$ to, respectively, $W_1$ and $W_2$, that intersect only at vertices in $L$.

Causal sufficiency: A subset of variables $W \subset X$ is causally sufficient if there is no pair of distinct observed vertices $W_1, W_2 \in W$ with a hidden common cause.

Question: Can there be distinct DAGs G and G' that convey the exact same set of d-separations?

Relevance: If yes, then such G and G'
* are Bayesian networks for the same set of probability distributions
* can without further assumption

Markov equivalence: Two DAGs $G$ and $G'$ are Markov equivalent, also referred to as observationally equivalent and denoted $G \sim G'$, if they have the same set of vertices and every d-separation in one graph is also a d-separation in the other.

The Markov equivalence class $[G]$ of a DAG $G$ is defined accordingly as the set of all DAGs that are Markov equivalent to $G$.

Example 1: 
* $[X, Y]_{X \bowtie Y}$ with d-separation $X \bowtie Y$
* $[X, Y]_{X \not\bowtie Y} := \{X \to Y, Y \to X\}$ with no d-separations
* In the fundamental case of two dependent random variables we cannot decide without further assumptions whether $X \to Y$ or $Y \to X$.

Definition: Unshielded triple

Three vertices X, Y and Z form an unshielded triple $(X, Y, Z)$ if
1. X and Y are adjacent
2. Z and Y are adjacent
3. X and Z are not adjacent

Definition: v-structure

An unshielded triple $(X, Y, Z)$ forms a v-structure, also called an immorality, if $X \to Y \leftarrow Z$. In this case $Y$ is a collider.

Theorem: Characterization of Markov equivalence

Two DAGs are Markov equivalent if and only if
1. they have the same skeleton and 
2. they have the same v-structures
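
The two criteria of the theorem translate directly into a small check. In this sketch a DAG is given as a set of `(parent, child)` tuples over the same vertex set; the representation and function names are assumptions for illustration.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(edge) for edge in edges}

def v_structures(edges):
    """All triples (p, child, q) with p -> child <- q and p, q not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pa in parents.items():
        for p, q in combinations(sorted(pa), 2):
            if frozenset((p, q)) not in skel:
                found.add((p, child, q))
    return found

def markov_equivalent(edges_a, edges_b):
    """Same skeleton and same v-structures."""
    return (skeleton(edges_a) == skeleton(edges_b)
            and v_structures(edges_a) == v_structures(edges_b))

chain = {("X", "Y"), ("Y", "Z")}      # X -> Y -> Z
reversed_chain = {("Z", "Y"), ("Y", "X")}  # X <- Y <- Z
fork = {("Y", "X"), ("Y", "Z")}       # X <- Y -> Z
collider = {("X", "Y"), ("Z", "Y")}   # X -> Y <- Z

print(markov_equivalent(chain, reversed_chain))
print(markov_equivalent(chain, fork))
print(markov_equivalent(chain, collider))
```

Chain, reversed chain, and fork share skeleton and have no v-structure, so they are equivalent; the collider has the same skeleton but a v-structure and therefore forms its own class.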

Draw all DAGs that are Markov equivalent to the DAG shown below on the left hand side.

Skeleton: W, X, Y, Z

X and Z is a v-structure
* Switch X and W
* Switch Z and W

Graphically represent the entire Markov equivalence class of a DAG in a single graph that shows:
1. the common features
2. and the differences

of all members of the equivalence class

The CPDAG (completed partially directed acyclic graph) of a DAG $G$ is the graph with the following properties:
1. It has the same skeleton as G
2. It has an edge $X \to Y$ if $X \to Y$ in all $G' \in [G]$
3. It has an edge $X \leftrightarrow Y$ if there are $G'_1, G'_2 \in [G]$ such that $X \to Y$ in $G'_1$ and $X \leftarrow Y$ in $G'_2$.
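
For small graphs the CPDAG can be obtained by brute force directly from this definition: enumerate all orientations of the skeleton, keep the acyclic ones with the same v-structures, and compare edge directions across the resulting class. This is only an illustrative sketch with hypothetical function names; it scales exponentially in the number of edges and is not how practical algorithms proceed.

```python
from itertools import combinations, product

def is_acyclic(nodes, edges):
    """Repeatedly remove nodes without incoming edges; fail if none exists."""
    remaining, edges = set(nodes), set(edges)
    while remaining:
        sources = {n for n in remaining if not any(v == n for _, v in edges)}
        if not sources:
            return False
        remaining -= sources
        edges = {(u, v) for u, v in edges if u not in sources}
    return True

def v_structures(edges):
    """All triples (p, child, q) with p -> child <- q and p, q not adjacent."""
    adjacent = {frozenset(edge) for edge in edges}
    found = set()
    for (a, c1), (b, c2) in combinations(sorted(edges), 2):
        if c1 == c2 and a != b and frozenset((a, b)) not in adjacent:
            found.add((min(a, b), c1, max(a, b)))
    return found

def cpdag(edges):
    """Return (directed, undirected) edges of the CPDAG of a small DAG."""
    edges = set(edges)
    nodes = {n for edge in edges for n in edge}
    skeleton = sorted(tuple(sorted(edge)) for edge in edges)
    target = v_structures(edges)
    members = []  # all DAGs in the Markov equivalence class
    for flips in product((False, True), repeat=len(skeleton)):
        candidate = {(v, u) if flip else (u, v)
                     for (u, v), flip in zip(skeleton, flips)}
        if is_acyclic(nodes, candidate) and v_structures(candidate) == target:
            members.append(candidate)
    directed = set.intersection(*members)  # same direction in every member
    undirected = ({frozenset(e) for e in skeleton}
                  - {frozenset(e) for e in directed})
    return directed, undirected

chain = {("X", "Y"), ("Y", "Z")}                   # X -> Y -> Z
extended = {("X", "Y"), ("Z", "Y"), ("Y", "W")}    # X -> Y <- Z, Y -> W

print(cpdag(chain))
print(cpdag(extended))
```

For the chain no edge is oriented. In the second graph the v-structure fixes $X \to Y \leftarrow Z$, and $Y \to W$ is oriented as well, because reversing it would create additional v-structures.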

Task: Learn CPDAG of causal graph from conditional independencies in observational distribution