## Probability Theory and Graph Models

Observational: $p((x_i)_i) = \Pi_i p(x_i | PA_i)$ where $X_i \perp\!\!\!\perp (ND_i \backslash PA_i) | PA_i$ with $ND_i$ the non-descendants of $X_i$

Different choice of Markovian parents results in different causal models => Quest for common separation relationships leads to $X \bowtie Y | Z$

Open -> Close: 

> In the usual case, X and Y are dependent. Once Z is conditioned on, the path between them is blocked.

> For example $y = g(z)$ with $z = f(x)$ (chain), or $(x, y) = f(z)$ (fork): once $z$ is fixed to a constant, $x$ carries no further information about $y$.

* Chain: $X \to Z \to Y$. 
* Fork: $Z \to X, Y$

Close -> Open: 

> X and Y have no common cause and are independent. Once Z is conditioned on, X and Y become dependent.

> Take, for example, $f(x, y) = z$. Fixing $z$ to some constant $k$ yields the level set $N_{z = k} = \{(x, y) | f(x, y) = k\}$, on which $x$ and $y$ have to depend on each other.

* Collider (or n-th Desc): $X, Y \to_n Z$
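A small simulation can make the collider case concrete. The sketch below assumes a hypothetical linear-Gaussian collider $Z = X + Y + \text{noise}$ and approximates conditioning on $Z$ by keeping only samples with $Z$ close to $0$; the names `pearson`, `corr_marginal` and `corr_given_z` are illustrative choices, not part of the notes.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]                     # independent of x
z = [a + b + 0.1 * rng.gauss(0, 1) for a, b in zip(x, y)]   # collider X -> Z <- Y

# Marginally the collider path is closed: X and Y are (nearly) uncorrelated.
corr_marginal = pearson(x, y)

# Conditioning on Z is approximated by keeping samples with Z near k = 0.
keep = [i for i in range(n) if abs(z[i]) < 0.2]
corr_given_z = pearson([x[i] for i in keep], [y[i] for i in keep])
```

Within the selected band, $x + y \approx 0$, so `corr_given_z` should come out strongly negative while `corr_marginal` stays close to zero.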

Remember that the descendant relationship can go as far as possible, since $p(X_i | PA_i) \ne 0$.

D-separation is stricter than conditional independence ($X \bowtie Y | Z \implies X \perp\!\!\!\perp Y | Z$)

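$X$ and $Y$ are d-separated by $S$ if every path between them contains a node $Z$ that is either a non-collider with $Z \in S$, or a collider such that neither $Z$ nor any of its descendants is in $S$.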

## SCM

$X_i := f_i(PA_i, \eta_i) \forall i$ with $PA_i \subseteq X \backslash X_i$

* Observational distribution generated by $C$: $p_C((X_i)_i) = \Pi_i p_C(X_i | PA_i)$ (**Observed**)
* Interventional distribution generated by $C$: $p_C((X_i)_i | do(X_j := k_j)) = \frac{p_C((X_i)_i)}{p_C(X_j | PA_j)} \delta(X_j - k_j)$ if $p_C(X_j | PA_j) \ne 0$ (**Unobserved**)
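The difference between the two distributions can be checked on a tiny discrete example. The sketch below assumes a hypothetical binary SCM with graph $Z \to X$, $Z \to Y$, $X \to Y$; all probability tables are made up for illustration. The intervention is computed by truncated factorization, i.e. dividing the joint by $p(x | pa_X)$ and pinning $X$ to the intervened value.

```python
from itertools import product

# Hypothetical conditional probability tables (assumptions for this sketch).
p_z = {0: 0.5, 1: 0.5}
p_x1_given_z = {0: 0.2, 1: 0.8}                                        # p(X=1 | z)
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.7}   # p(Y=1 | x, z)

def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(z, x, y):
    """Observational distribution: product of p(x_i | pa_i)."""
    return p_z[z] * bern(p_x1_given_z[z], x) * bern(p_y1_given_xz[(x, z)], y)

def joint_do_x(z, x, y, k):
    """Truncated factorization: drop the factor p(x | z) and pin X to k."""
    if x != k:
        return 0.0
    return joint(z, x, y) / bern(p_x1_given_z[z], x)

def p_y1_given_x1():
    """Observational p(Y=1 | X=1)."""
    num = sum(joint(z, 1, 1) for z in (0, 1))
    den = sum(joint(z, 1, y) for z, y in product((0, 1), repeat=2))
    return num / den

def p_y1_do_x1():
    """Interventional p(Y=1 | do(X=1))."""
    return sum(joint_do_x(z, x, 1, 1) for z, x in product((0, 1), repeat=2))
```

With these tables, conditioning gives $p(Y = 1 | X = 1) = 0.62$ whereas intervening gives $p(Y = 1 | do(X = 1)) = 0.5$; the gap is due to the confounder $Z$.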

Ladder of Causation:

1. Association (Seeing): $p(Y | X)$, requires $p(\cdot)$
2. Intervention (Doing): $p(Y | do(X := x))$, requires $p(\cdot)$ plus ${\cal G}({\cal C})$
3. Counterfactuals (Imagining): $p(Y_X | Y'_{X'})$, requires $\cal C$ ($\implies {\cal G}({\cal C}), p(\cdot)$)

## Causal Discovery

> Constraint-based: Identify conditional independencies using statistical tests

> Score-based: ML approach, minimize loss over predicting the observed data

> SCM-based: Predict cause-effect relationships by analysing the noises

Constraint-based

Causal Markov Condition: ''d-separation implies independence'' (see Bayesian networks)

1. Data Generation by SCM
2. Acyclicity
3. Causal Faithfulness: ''independence implies d-separation''
4. Causal Sufficiency

1-3 together: ''independence $\iff$ d-separation'' (1-2 give the Causal Markov Condition, 3 adds the converse)

## Conditional Independence Testing

Hypothesis testing (frequentist): 
* Assume iid samples of random variables $X_i \sim p_{X; \theta}$
* $p_{X; \theta}$ depends on $\theta$ with unknown value

* Define a hypothesis $H_0: \theta = \theta_{0}$ for some $\theta_{0}$
* Choose test statistic $T$
* Obtain null distribution $p(T | H_0)$
* Fix a significance level $\alpha \in ]0, 1[$ for determining the rejection threshold
* Collect relevant data $D_n$ consisting of $n$ samples and calculate $T(D_n)$
* Reject $H_0$ if $T(D_n)$ falls into the rejection region (or, equivalently, if the p-value is $\le \alpha$)

Power is the true positive rate $TP = 1 - \beta$, where $\beta$ is the type II error rate. High power depends on:
* Effect size (dependency strength)
* Significance $\alpha$-level
* [DoF] Degrees of freedom $\sim$ sample size - number of parameters to estimate

Given $m > 1$ tests:
* Bonferroni correction: test each hypothesis at level $\frac{\alpha}{m}$. With p-values $p_i$ and all $H_0$ true, the union bound gives $P(\bigvee_i p_i \le \frac{\alpha}{m}) \le \sum_i P(p_i \le \frac{\alpha}{m}) \le m \cdot \frac{\alpha}{m} = \alpha$
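In code, the correction for $m$ tests amounts to comparing every p-value against $\alpha / m$. A minimal sketch, where `bonferroni_reject` and the example p-values are illustrative assumptions:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_0 for test i iff p_i <= alpha / m, which keeps the family-wise error <= alpha."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]

# Three tests at family-wise level 0.05: the per-test threshold is 0.05 / 3.
decisions = bonferroni_reject([0.001, 0.02, 0.04], alpha=0.05)
```

Only the first test is rejected here, although all three p-values are below the uncorrected level $0.05$.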

Correlation Coefficients: 

* Pearson Correlation Coefficient:
    * $\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}}$ for $(X_i, Y_i) \overset{iid}{\sim} p_{X, Y}$
    * $X \perp\!\!\!\perp Y \implies \rho(X, Y) = 0$ (or $\iff$ if $X, Y$ jointly Gaussian)
    * Estimate as:
        * $\hat{\rho}(X, Y) = \frac{\hat{Cov}(X, Y)}{\sqrt{\hat{Var}(X) \hat{Var}(Y)}}$
        * $\hat{Cov}(X, Y) = \frac{1}{n}\sum_{i = 1}^n (X_i - \overline{X_n})(Y_i - \overline{Y_n})$
        * $\hat{Var}(X) = \frac{1}{n}\sum_{i = 1}^n (X_i - \overline{X_n})^2$
        * $\hat{Var}(Y) = \frac{1}{n}\sum_{i = 1}^n (Y_i - \overline{Y_n})^2$

* Distance Correlation Coefficient:
    * $dCor(X, Y) = \frac{dCov(X, Y)}{\sqrt{dVar(X) dVar(Y)}}$ for $(X_i, Y_i) \overset{iid}{\sim} p_{X, Y}$
    <!-- * $X \perp\!\!\!\perp Y \implies dCor(X, Y) = 0$ -->
    * Estimate as:
        * $\hat{dCor}(X, Y) = \frac{\hat{dCov}(X, Y)}{\sqrt{\hat{dVar}(X) \hat{dVar}(Y)}}$
        * $\hat{dCov}(X, Y) = \frac{1}{n} \sqrt{\sum_{i, j} A_{ij} \cdot B_{ij}}$
        * $\hat{dVar}(X) = \frac{1}{n} \sqrt{ \sum_{i, j} A_{ij}^2}$
        * $\hat{dVar}(Y) = \frac{1}{n} \sqrt{ \sum_{i, j} B_{ij}^2}$
        * $A, B$: doubly centered matrices of the pairwise distances $|X_i - X_j|$ and $|Y_i - Y_j|$
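The two coefficients behave differently on non-linear dependence. The sketch below implements both sample estimators in plain Python for scalar data, assuming $A$ and $B$ are the doubly centered matrices of pairwise absolute distances; the function names are illustrative.

```python
def pearson(xs, ys):
    """Sample Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def centered_distances(vals):
    """Doubly centered matrix of pairwise distances |v_i - v_j|."""
    n = len(vals)
    d = [[abs(a - b) for b in vals] for a in vals]
    row = [sum(r) / n for r in d]      # symmetric matrix: column means equal row means
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def dcov(a, b):
    """Sample distance covariance from two centered distance matrices."""
    n = len(a)
    total = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))
    return max(total, 0.0) ** 0.5 / n  # guard against tiny negative rounding errors

def dcor(xs, ys):
    """Sample distance correlation dCov / sqrt(dVar_x * dVar_y)."""
    a, b = centered_distances(xs), centered_distances(ys)
    denom = (dcov(a, a) * dcov(b, b)) ** 0.5
    return dcov(a, b) / denom if denom > 0 else 0.0

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]               # deterministic but non-linear dependence
rho = pearson(xs, ys)
dc = dcor(xs, ys)
```

For this symmetric parabola `rho` is zero, while `dc` is clearly positive, which is the reason distance correlation is used for non-linear tests.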

Linear Dependency Tests with Additive Noise: 

* Multivariate Gaussian:
    * Partial Correlation Test (ParCorr)
    * Null hypothesis: $H_0: X \perp\!\!\!\perp Y | Z$ and $X, Y, Z$ jointly Gaussian and $\#(Z) = p$
    * Assume $X = Z \beta_X + \epsilon_X, Y = Z \beta_Y + \epsilon_Y$ for all known common causes $Z$
    * Estimate $\hat{\beta}_X, \hat{\beta}_Y$ by ordinary least squares (OLS)
    * Residuals $r_X = X - Z \hat{\beta}_X, r_Y = Y - Z \hat{\beta}_Y$ 
    * The residuals are uncorrelated, i.e. $\rho(r_X, r_Y) = 0$, if $X \perp\!\!\!\perp Y | Z$
    * $t$-Test statistic: $T = \sqrt{n - p - 2} \frac{\hat{\rho}(r_X, r_Y)}{\sqrt{1 - \hat{\rho}(r_X, r_Y)^2}}$ with $p$ as cardinality of $Z$
    * Under $H_0$: $T \sim T_{n - p - 2}$ (Student's $t$ with $n - p - 2$ degrees of freedom)

    * In general: 
        * $r_X = \epsilon_X + Z (\beta_X - \hat{\beta}_X), r_Y = \epsilon_Y + Z (\beta_Y - \hat{\beta}_Y)$
    * If $p = 0$ (empty $Z$): 
        * The residuals are $r_X = X, r_Y = Y$

* Discrete:
    * Null hypothesis: $H_0: X \perp\!\!\!\perp Y | Z$ and $X, Y, Z$ discrete and $\#(Z) = p$
    * $X \perp\!\!\!\perp Y | Z \iff p(X, Y | Z) = p(X | Z) p(Y | Z)$
    * $\Bbb{E}[S_{XYZ}^{abc} | S_{XZ}^{ac} = s_{XZ}^{ac}, S_{YZ}^{bc} = s_{YZ}^{bc}] = \frac{s_{XZ}^{ac} \cdot s_{YZ}^{bc}}{s_Z^{c}}$
    * $\chi^2$-Test statistic: $G^2 = 2\sum_{abc} s_{XYZ}^{abc} \log (\frac{s_{XYZ}^{abc} \cdot s_Z^c}{s_{XZ}^{ac} \cdot s_{YZ}^{bc}})$
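    * $G^2 \overset{n \to \infty}{\sim} \chi_v^2$ with DoF: $v = (m_X - 1)(m_Y - 1) m_Z$, where $m_X, m_Y$ are the numbers of categories of $X, Y$ and $m_Z$ is the number of joint configurations of $Z$ ($m_Z = 1$ for empty $Z$)
    
    * If $p = 0$: 
        * Replace $s_Z^c = n$ and $\sum_{abc} = \sum_{ab}$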
        * Then $G^2 = 2\sum_{ab} s_{XY}^{ab} \log (\frac{s_{XY}^{ab} \cdot n}{s_{X}^{a} \cdot s_{Y}^{b}})$
        * DoF: $(m_X - 1)(m_Y - 1)$
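The unconditional case ($p = 0$) of the discrete test is small enough to compute directly. A minimal sketch, assuming the counts $s_{XY}^{ab}$ are given as a nested list; `g_squared` and the two example tables are illustrative, not taken from a library.

```python
import math

def g_squared(table):
    """G^2 statistic and degrees of freedom for a table of counts s_XY^{ab}."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]          # s_X^a
    col_tot = [sum(col) for col in zip(*table)]    # s_Y^b
    g2 = 0.0
    for a, row in enumerate(table):
        for b, s_ab in enumerate(row):
            if s_ab > 0:                           # a zero count contributes nothing
                g2 += 2.0 * s_ab * math.log(s_ab * n / (row_tot[a] * col_tot[b]))
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return g2, dof

g2_indep, dof_indep = g_squared([[10, 20], [20, 40]])   # counts factorize exactly
g2_dep, dof_dep = g_squared([[30, 10], [10, 30]])       # mass concentrated on the diagonal
```

For one degree of freedom the 5% critical value of $\chi^2$ is about $3.84$, so the second table would lead to rejecting independence while the first would not.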

Non-linear Dependency Tests with Additive Noise:

* Multivariate Gaussian:
    * Gaussian process distance correlation test (GPDC)
    * Null hypothesis $H_0: X \perp\!\!\!\perp Y | Z$ and $\#(Z) = p$
    * Assume $X = f_X(Z) + \epsilon_X, Y = f_Y(Z) + \epsilon_Y$ for all known common causes $Z$
    * Residuals $r_X = X - \hat{f_X}(Z), r_Y = Y - \hat{f_Y}(Z)$
    * $dCor(r_X, r_Y) = \frac{dCov(r_X, r_Y)}{\sqrt{dVar(r_X) dVar(r_Y)}}$
    * Theoretical test distribution not known, use permutation testing for $dCor(r_X, r_Y)$

Conditional Mutual Information as general, non-parametric dependence measure:
* CMI-kNN: $\hat{I}(X; Y | Z) = \psi(k) + \frac{1}{n} \sum_{i = 1}^n [\psi(k_i^z) - \psi(k_i^{xz}) - \psi(k_i^{yz})]$
* Obtain the null distribution with a permutation test based on a local nearest-neighbor permutation scheme

Discussion:

> More general tests with less strict assumptions tend to be computationally more expensive, have smaller power for the same sample size, and are statistically hard.

> CI-testing: Keep $X - Y$, if $H_0$ (independence assumption) is rejected

> Higher power -> Lower probability of accepting independence when the variables are in fact dependent

## Algorithms

Features of Algorithms:

* Soundness: All found solutions are correct
* Completeness: All correct solutions are found

DAG enumeration algorithm:

1. $G = \{G_i | G_i \text{ is a DAG with vertices } X\}$
1. $G' = \{G_i \in G | (X \bowtie Y | Z)_{G_i} \leftrightarrow (X \perp\!\!\!\perp Y | Z)\}$
1. Take the skeleton $\hat{G}$ shared by all DAGs in $G'$ and orient the edges on which all of them agree

* Criticism: Infeasible to iterate over all DAGs

SGS algorithm:

Idea: 

* If $\exists S$ s.t. $X \perp\!\!\!\perp Y | S$: 
    * $Z \notin S \iff X \to Z \leftarrow Y$
    * All triples between $X$ and $Y$ should be blocked
* Else: $X - Y$

1. $\hat{G}$ as fully connected unoriented graph
1. $\forall X - Y$ do
1. $\exists S \subset X \backslash \{X, Y\} . X \perp\!\!\!\perp Y | S \implies$
    1. Remove edge $X - Y$ from $\hat{G}$
    1. $S$ = Sepset($X - Y$) = Sepset($Y - X$)
1. $\forall X - Z - Y$ do
1. $Z \notin$ Sepset($X - Y$) $\implies$
    1. $X \to Z \leftarrow Y$
1. Orient while applies:
    1. $\forall X \to Z - Y$ with $X, Y$ non-adjacent do 
        * $X \to Z \to Y$
    1. $\forall X \to Z \to Y - X$ do 
        * $X \to Z \to Y \leftarrow X$
    1. $\forall (X - Y, W, Z$ and $Y, Z \to W)$ do
        * $X - Y, Z$ and $X, Y, Z \to W$

* Criticism: The skeleton phase is still infeasible, since it searches over all subsets $S$ for every pair

PC algorithm:

Idea: Remove $X - Y$ iff $\exists S \subset T_{XY} \subset X \backslash \{X, Y\}$ s.t. $X \perp\!\!\!\perp Y | S$

* $Y \leftrightarrow_n X$ excluded due to acyclicity
* If $Y \not\to_n X$: $T_{XY} := pa(G, Y)$
* If $X \not\to_n Y$: $T_{XY} := pa(G, X)$

But: unknown parents

* $T_X \supset pa(G, X), T_Y \supset pa(G, Y)$ 
* No $X - Y$ iff $\exists S \subset T_X \lor S \subset T_Y . (X \perp\!\!\!\perp Y | S)$

1. $\hat{G}$ as fully connected unoriented graph
1. $p := 0$
1. while $\exists$ ordered pair of adjacent nodes $(X, Y) . |adj(\hat{G}, Y) \backslash \{X\}| \ge p$ do
1. $\forall$ ordered pairs of adjacent nodes $(X, Y)$ 
1. $\exists S \subset adj(\hat{G}, Y) \backslash \{X\} . (|S| = p, X \perp\!\!\!\perp Y | S) \implies$
    1. Remove edge $X - Y$ from $\hat{G}$
    1. Sepset($X, Y$) = Sepset($Y, X$) $:= S$
1. $p := p + 1$
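The skeleton loop above, followed by the collider orientation step known from SGS, can be sketched as follows. This is a simplified illustration that assumes a perfect CI oracle for the hypothetical ground truth $X \to Z \leftarrow Y$; `pc_skeleton`, `orient_colliders` and `oracle` are names chosen here, and a real implementation would plug in a statistical CI test instead.

```python
from itertools import combinations

def pc_skeleton(nodes, independent):
    """Skeleton phase of PC. `independent(x, y, S)` is a CI oracle or a CI test."""
    adj = {v: set(nodes) - {v} for v in nodes}      # start fully connected
    sepset = {}
    p = 0
    # Continue while some ordered pair (x, y) has at least p other neighbours of y.
    while any(len(adj[y] - {x}) >= p for y in nodes for x in adj[y]):
        for y in nodes:
            for x in sorted(adj[y]):                # sorted() copies, so removal is safe
                for S in combinations(sorted(adj[y] - {x}), p):
                    if independent(x, y, set(S)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(S)
                        break
        p += 1
    return adj, sepset

def orient_colliders(adj, sepset):
    """Orient x -> z <- y for unshielded triples x - z - y with z not in Sepset(x, y)."""
    arrows = set()
    for z in adj:
        for x, y in combinations(sorted(adj[z]), 2):
            if y not in adj[x] and z not in sepset[frozenset((x, y))]:
                arrows |= {(x, z), (y, z)}
    return arrows

def oracle(x, y, S):
    """CI oracle for the assumed ground truth X -> Z <- Y: only X and Y are independent, given no Z."""
    return {x, y} == {"X", "Y"} and "Z" not in S

adj, sepset = pc_skeleton(["X", "Y", "Z"], oracle)
arrows = orient_colliders(adj, sepset)
```

The edge $X - Y$ is removed at $p = 0$ with an empty sepset, and since $Z$ is not in that sepset, both remaining edges are oriented into $Z$.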

Extensions, Relaxing assumptions:

Independence tests:
> Unknown ground-truth independencies

> Thus, replace the oracle $X \perp\!\!\!\perp Y | S$ by a statistical test $\text{CI-Test}(X, Y, S)$

> No guarantee to find true CPDAG

> Still converges for $n \to \infty$ samples to true CPDAG

Order independence:
> Output of PC algorithm may depend on the order of input variables

> Thus, PC-stable algorithm

Weaker faithfulness:
> Strong assumption

> Thus, conservative PC algorithm requires weaker adjacency faithfulness

Causal Insufficiency:
> Unobserved variables

> Thus, FCI algorithm includes latent / selection variables

Cyclicity:
> Applicability of PC algorithm

> Still, PC remains sound and complete


## Causal Discovery via Restricted SCMs

Previously:
* D := iid sample from P(X)
* I(D) := independence tests
* G(I) := graph from the independencies via the Markov and faithfulness assumptions

Now:
* D := iid sample from P(X)
* R(D) := fit restricted SCM
* G(R) := read off causal graph

> Constraint-based causal discovery can only determine the Markov equivalence class

Non-identifiability of graph structures:

$\forall p(x, y) . \exists SCM . X := \eta_X, Y := f_Y(X, \eta_Y)$ with measurable $f_Y$ and $\eta_X \perp\!\!\!\perp \eta_Y$

Proof for all $p(x, y)$:
* Select $f_Y(x, \eta_Y) := F_{Y | x}^{-1}(\eta_Y)$
* $F_{Y | x}(y) = P(Y \le y | X = x)$ is the conditional cumulative distribution function, $F_{Y | x}^{-1}$ its (generalized) inverse
* Draw $\eta_Y \sim U(0, 1)$ uniformly and independently of $X$

Linear Additive Models
* $p(X, Y)$ admits a linear additive noise model from X to Y if it is generated by an SCM with $X := \eta_X$ and $Y := \alpha X + \eta_Y$, where $\eta_X \perp\!\!\!\perp \eta_Y$ 
* It additionally admits a linear additive noise model from Y to X iff $X$ and $\eta_Y$ are both Gaussian 
    * In that case the causal direction is not identifiable

Non-identifiability (via the Darmois-Skitovich theorem):
* Let $X_1, …, X_n$ be mutually independent
* Let $a_j, b_j \ne 0$ for all $j$
* If $(I_1 := \sum_j a_jX_j) \perp\!\!\!\perp (I_2 := \sum_j b_jX_j)$
* Then all $X_j$ are Gaussian
* Further assume $Y = \alpha X + \eta_Y$ and $X = \beta Y + \eta_X$
* With $X \perp\!\!\!\perp \eta_Y$ and $Y \perp\!\!\!\perp \eta_X$ 
* Then $I_1 := Y = \alpha X + \eta_Y$ and $I_2 := \eta_X = (1 - \alpha \beta) X - \beta \eta_Y$ are independent linear combinations of the independent variables $X$ and $\eta_Y$
* Hence $X$ and $\eta_Y$ are Gaussian (provided $\alpha, \beta, 1 - \alpha \beta \ne 0$)

LiNGAM:
* Assume $p(x, y)$ admits a linear non-Gaussian additive noise model in one direction of $X - Y$
* Linearly regress $Y$ on $X$ (and $X$ on $Y$)
* Predictors: $\hat{Y}(X)$ (and $\hat{X}(Y)$)
* Residuals: $r_Y = Y - \hat{Y}(X)$ (and $r_X = X - \hat{X}(Y)$)
* Test if $r_Y \perp\!\!\!\perp X$ (and $r_X \perp\!\!\!\perp Y$)
* If $r_Y \perp\!\!\!\perp X$ but $r_X \not\perp\!\!\!\perp Y$: $X \to Y$
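The regress-and-test recipe can be tried on simulated data. The sketch below assumes a hypothetical ground truth $X \to Y$ with uniform (non-Gaussian) noise. Note the loud simplification: instead of a proper independence test, it uses the absolute correlation of squared values as a crude dependence proxy, which happens to work for this example but is not a general test.

```python
import random

def mean(v):
    return sum(v) / len(v)

def pearson(u, v):
    """Sample Pearson correlation."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def residuals(cause, effect):
    """Residuals of `effect` after OLS regression on `cause` (with intercept)."""
    mc, me = mean(cause), mean(effect)
    slope = (sum((a - mc) * (b - me) for a, b in zip(cause, effect))
             / sum((a - mc) ** 2 for a in cause))
    return [(b - me) - slope * (a - mc) for a, b in zip(cause, effect)]

def dependence_score(regressor, resid):
    """Crude stand-in for an independence test: |corr| of the squared, centered values."""
    mr = mean(regressor)
    return abs(pearson([(a - mr) ** 2 for a in regressor], [r * r for r in resid]))

rng = random.Random(0)
n = 5000
x = [rng.uniform(-1, 1) for _ in range(n)]          # X := eta_X, uniform
y = [a + rng.uniform(-1, 1) for a in x]             # Y := X + eta_Y, uniform noise

score_xy = dependence_score(x, residuals(x, y))     # candidate direction X -> Y
score_yx = dependence_score(y, residuals(y, x))     # candidate direction Y -> X
direction = "X -> Y" if score_xy < score_yx else "Y -> X"
```

In the causal direction the residual is just the independent noise, so `score_xy` stays near zero; in the reverse direction the residual is uncorrelated with $Y$ but still dependent on it, which the proxy picks up.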

Principle of Independent Mechanisms:
1. The functional assignments $X_i := f(PA_i, \eta_i)$ of individual variables correspond to autonomous physical mechanisms
1. Say $X \to Y$. Then $p_X$ and $p_{Y | X}$ should contain no information about each other
1. Assume there are $l$ many different $p_X$ and $k$ many different $p_{Y | X}$ s.t. there are $kl$ many different $p_{X, Y}$
1. Based on (3), there are in general $kl$ many different $p_{Y}$ and $p_{X | Y}$
1. This means $p_Y$ and $p_{X | Y}$ (a-causal factorization) are more complicated than $p_X$ and $p_{Y | X}$ (causal factorization)
1. The same argument does not apply with X and Y swapped: when $X \to Y$, nothing requires $p_Y$ and $p_{X | Y}$ to contain no information about each other

Nonlinear Additive Models

* $p(X, Y)$ admits a non-linear additive noise model from X to Y if generated by a SCM, s.t. $X := \eta_X$ and $Y := f_y(X) + \eta_Y$ with $\eta_X \perp\!\!\!\perp \eta_Y$

Non-identifiability:
* Assume that $p(x, y)$ admits an additive noise model from X to Y
* And assume for all but countably many $x$ that $\exists y . ((\log p_{\eta_Y})''(y - f_y(x)) f_y'(x) \ne 0)$, and that $f_y$ is non-linear
* Then $p(x, y)$ also admits from Y to X iff $p(x, y)$ obeys certain non-generic constraints

NonLiNGAM:
* Assume $p(x, y)$ admits a non-linear additive noise model in one direction of $X - Y$
* Non-linearly regress $Y$ on $X$ (and $X$ on $Y$)
* Predictors: $\hat{f}_Y(X)$ (and $\hat{f}_X(Y)$)
* Residuals: $r_Y = Y - \hat{f}_Y(X)$ (and $r_X = X - \hat{f}_X(Y)$)
* Test if $r_Y \perp\!\!\!\perp X$ (and $r_X \perp\!\!\!\perp Y$)
* If $r_Y \perp\!\!\!\perp X$ but $r_X \not\perp\!\!\!\perp Y$: $X \to Y$

Post-Nonlinear Additive Models

* $p(X, Y)$ admits a post-nonlinear additive noise model from X to Y if generated by a SCM, s.t. $X := \eta_X$ and $Y := g_y(f_y(X) + \eta_Y)$ with $\eta_X \perp\!\!\!\perp \eta_Y$

Non-identifiability:
* Assume that $p(x, y)$ admits a post-nonlinear model from X to Y
* Then, $p(x, y)$ also admits from Y to X iff $p_x, f_y, g_y$ obey a certain differential equation
* Thus, $p(x, y)$ must be non-generic
* Nonlinearity of $f_y$ is necessary for identifiability


| **Name** | **Model** | **Functions** | **Identifiable** |
|-------------------|-------------------------|-------------------------|-------------------------|
| **General SCM** | $X_j := f_j(PA_j, \eta_j)$ | any | no |
| **ANM** | $X_j := f_j(PA_j) + \eta_j$ | nonlinear | yes |
| **CAM** | $X_j := \sum_{PA_j} f_j(X_i) + \eta_j$ | nonlinear | yes |
| **Linear Gaussian** | $X_j := \sum_{PA_j} c_{ij} X_i + \eta_j$ | linear | only with equal error variances |
| **Linear non-Gaussian** | $X_j := \sum_{PA_j} c_{ij} X_i + \eta_j$ | linear | yes |

Information-Geometric Causal Inference (IGCI):

* If $X \sim p_X$ and $Y := f(X)$ with $f$ invertible
* Then $p_X$ and $f$ do not contain information about each other
* Given $X \in [0, 1]$, the joint distribution $p(x, y)$ admits an IGCI model 
    * if $p_X$ is smooth and continuous and $Y = f(X)$ 
    * with $f$ smooth, invertible, strictly monotonic function 
    * with $f(0) = 0$ and $f(1) = 1$, such that 
* $cov[\log f', p_X] = \int_0^1 \log f'(x) p_X(x) dx - \int_0^1 \log f'(x) dx = 0$

Non-Identifiability:

* If $p(x, y)$ admits an IGCI model from X to Y:
    * Then from Y to X as well iff $f = id$

Intuition:

* Given $Y, (X_1, …, X_n)$ variables
* Given $p_e(x, y) := $ samples from environment $e$
* If $\exists PA_y \subseteq \{X_1, …, X_n\} . \forall e, f . (p_e(y | PA_y) = p_f(y | PA_y))$
    * Then $p_e(y | PA_y)$ is a stable physical mechanism
    * And $PA_y$ are causal parents of Y
* Ex. $p_X$: altitude, $p_{Y | X}$ mean temperature given altitude
    * $p_X$ is variable, but $p_{Y | X}$ is stable

> Other approaches might use online observational or interventional data to learn causal relationships using ML techniques.

## Latent and Selection Variables

Selection Bias, Latent Confounding

* $X = \{0, 1\}$ treatment, $Y = \{0, 1\}$ patient recovery
* Suppose $X \not\to Y$
* But $X = 1$ has unpleasant side-effects $Z$
* Its severity is influenced by unobserved general health level $L$
* $L$ is the only cause of patient recovery
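* Patients suffering from strong side-effects drop out of the study ($S = 0$)
* Thus, the data only contain patients with $S = 1$, i.e. the analysis implicitly conditions on $S$
* So $X$ and $Y$ are spuriously correlated: conditioning on $S$, a descendant of the collider $Z$, opens the path $X \to Z \leftarrow L \to Y$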
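The dropout story can be simulated to see the spurious association appear. The sketch below is a hypothetical parameterization of the example: the treatment is randomized and has no effect on recovery, side effects are worse for unhealthy treated patients, and strong side effects lead to dropout. All coefficients and thresholds are assumptions made for illustration.

```python
import random

rng = random.Random(1)
rows = []
for _ in range(20000):
    health = rng.gauss(0, 1)                                    # L: unobserved health level
    treated = 1 if rng.random() < 0.5 else 0                    # X: no causal effect on Y
    side_effect = treated * (1.0 - health) + 0.3 * rng.gauss(0, 1)   # Z: collider of X and L
    stays = 1 if side_effect < 1.0 else 0                       # S: dropout on strong side effects
    recovers = 1 if health + rng.gauss(0, 1) > 0 else 0         # Y: depends on L only
    rows.append((treated, recovers, stays))

def recovery_gap(data):
    """P(Y=1 | X=1) - P(Y=1 | X=0) estimated from the given rows."""
    t = [y for x, y, _ in data if x == 1]
    c = [y for x, y, _ in data if x == 0]
    return sum(t) / len(t) - sum(c) / len(c)

gap_all = recovery_gap(rows)                                    # full population
gap_selected = recovery_gap([r for r in rows if r[2] == 1])     # only patients with S = 1
```

In the full population the gap is close to zero, whereas among the remaining patients the treated group looks healthier, simply because its unhealthy members dropped out.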

Ancestral Graphs:

* DAGs that model unobserved variables explicitly could contain arbitrarily many latent nodes
* Instead of constraining the latent structure, represent only the ancestral relationships among the observed variables
* For any DAG there is a MAG

MAG (Maximal Ancestral Graph):
* Edge Types:
    * Directed (Parent $\to$ Child)
    * Bidirected (Spouse $\leftrightarrow$ Spouse): $L \to X, Y$
    * Undirected (Neighbor $-$ Neighbor): $X, Y \to S$ 
* Directed cycle if $X_i \to X_j$ and $X_j$ ancestor of $X_i$
* Almost directed cycle if $X_i \leftrightarrow X_j$ and $X_j$ ancestor of $X_i$
* Collider $X_k$ if $X_i *\to X_k \leftarrow* X_j$
* Ancestral graph: 
    * If no directed or almost directed cycles 
    * And $\forall X_i - X_j$, both $X_i$ and $X_j$ have no parents or spouses
* $X \in An(\{Y\} \cup S)$ means $X$ is a cause of $Y$ or any selection variable $S$
* Inducing path $p$ between $X_i, X_j$ relative to $\{L, S\}$ with:
    * $X_k$ is collider $\implies X_k \in An(\{X_i, X_j\} \cup S)$
    * $X_k$ is non-collider $\implies X_k \in L$
    * If $X_i$ and $X_j$ are adjacent, trivially inducing
* Active (m-connecting) path relative to $Z$ with $X_i, X_j \notin Z$:
    * $X_k$ is non-collider and $X_k \notin Z$
    * $X_k$ is collider and $X_k \in An(Z)$
* $X_i$ and $X_j$ m-separated by Z if no active path between them
* Maximal AG: 
    * iff $\forall X_i, X_j$ non-adjacent $\exists Z$ m-separating them
    * iff there is no inducing path relative to $\{\}$ between any non-adjacent $X_i, X_j$

* $X_i$ and $X_j$ m-connected relative to $Z \cup S$ for every $Z \subseteq X \backslash \{X_i, X_j\}$:
    * iff $\exists$ inducing path between $X_i, X_j$ rel. to $\{L, S\}$
    
DAG to MAG:
* Add edge $X_i, X_j \in X$ in MAG iff $\exists$ inducing path rel. to $\{L, S\}$
* Apply orientation:
    * $X_i \to X_j$ if $X_i \in An(X_j \cup S)$ and $X_j \notin An(X_i \cup S)$
    * $X_i \leftrightarrow X_j$ if $X_i \notin An(X_j \cup S)$ and $X_j \notin An(X_i \cup S)$
    * $X_i - X_j$ if $X_i \in An(X_j \cup S)$ and $X_j \in An(X_i \cup S)$

PAG:
* Similar to CPDAG of DAGs
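* Represents a Markov equivalence class [M] of MAGs; edge marks that vary within the class are denoted as circle marks $\circ$
* Maximally informative for [M] if every circle in $P$ is a variant mark and every other mark is invariant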

Discovery Algorithms for L, S Variables

Fast Causal Inference (FCI) Algorithm:

1. Skeleton Discovery (for finding initial skeleton and separating sets, => PC)
2. Orienting Colliders (=> PC)
3. Further Update Skeleton
4. Further Orient Colliders
5. Exhaustively Apply 10 Orientation Rules

Order of Analysis
1. Observed Dependence: Possible causal models?
1. Causal Markov Condition: statistical dependence => causal connectedness
1. Add Z and assume no unobserved variables
1. Observed conditional independence: $X \perp\!\!\!\perp Y | Z \iff p(X, Y | Z) = p(X | Z)p(Y | Z)$
1. Faithfulness Assumption: statistical independence => no causal connectedness
1. Markov Equivalence: Cannot distinguish graphs
1. Suppose $(V \perp\!\!\!\perp W), (X \perp\!\!\!\perp Y | Z), (X \perp\!\!\!\perp V | Z), (X \perp\!\!\!\perp W | Z), (Y \perp\!\!\!\perp V | Z), (Y \perp\!\!\!\perp W | Z)$
1. Assuming no latent variables: Which causal models explain these?
1. Only possible model: $V, W \to Z \to X, Y$
1. Only possible model (accounting for latents): $V, W \circ\to Z \to X, Y$

Selection Bias

1. For $X \perp\!\!\!\perp Y$, observe samples: 
    1. For $S = (Y > 0)$: $Y \to S$
    1. For $S = (X > 0)$: $X \to S$
    1. For $S = (X > 0) \lor (Y > 0)$: $X, Y \to S$ 
1. For $X \to Y$, observe samples:
    1. For $S = (Y > 0)$: $X \to Y \to S$
    1. For $S = (X > 0)$: $X \to S, Y$
    1. For $S = (X > 0) \lor (Y > 0)$: $S \leftarrow X \to Y \to S$


## Time Series

Constraint-based Causal Discovery

* Multivariate time series: $X := (X_1, …, X_n)$ observed at time steps $t \in \Z$
* Time series graph: One vertex per variable per time step
* Assume stationarity
    * Repetitive structure
    * No need for absolute timestamps
    * Multiple samples in a sliding window manner
* Observe $X$ in time window $[t_0 - T, t_0]$
* Learn the time series graph by testing CI
* Sufficient to learn the subgraph within the finite time window $[t - \tau_{\max}, t]$

Time Series Graphs
* Lagged Edges: $X_{i, s - \tau} \to X_{j, s}$ with lag $\tau > 0$
* Contemporaneous Edges: $X_{i, s} \to X_{j, s}$ (lag $\tau = 0$)
* Autodependency Edges: $X_{i, s - \tau} \to X_{i, s}$ with $\tau > 0$
* Process order: Maximal lag in graph
* Maximal considered lag $\tau_{\max}$ 
    * Specifies the time window $[t - \tau_{\max}, t]$
    * Must be at least as large as the process order
    * Chosen through scientific domain knowledge or in a data driven way      

Causal Discovery:
* Modify an algorithm (take PC) having the following phases: 
    * Skeleton link removal phase using CI-tests:
        * Partial Correlation
        * Distance Correlation
        * Conditional Mutual Information
    * Orientation phase
* With the following background knowledge:
    * Repetitive structure
    * Causation cannot go back in time
    * Only learn edges that end at $t$
    * Possible ambiguity of contemporaneous edges during the orientation
* Problem:
    * Low statistical detection power: The tests often judge independence although the variables are dependent
    * Inflated false positives: Samples not iid because autocorrelated

Statistical power for detecting $X \not\perp\!\!\!\perp Y | S$:
1. Sample size (Given by dataset)
2. Significance level $\alpha$ (Given as hyperparameter by researcher)
3. Condition dimension/complexity $|S|$ (PC optimizes this)
4. Effect size, dependency strength

Given $Z \to X \to Y \leftarrow W$, then $I(X; Y | Z) \le I(X; Y) \le I(X; Y | W)$:
* Proof of $I(X; Y | Z) \le I(X; Y)$
    * $I(X; Y | Z) \le I(X; Y)$
    * $ \iff I(X; Y) + I(Y; Z | X) - I(Y; Z) \le I(X; Y)$ (chain rule applied to $I(Y; X, Z)$)
    * $ \iff I(Y; Z | X) \le I(Y; Z)$ (and $Y \perp\!\!\!\perp Z | X$)
    * $ \iff 0 \le I(Y; Z)$ (true)
* Proof of $I(X; Y | W) \ge I(X; Y)$
    * $I(X; Y | W) \ge I(X; Y)$
    * $ \iff I(X; Y) + I(X; W | Y) - I(X; W) \ge I(X; Y)$
    * $ \iff I(X; W | Y) \ge I(X; W)$ (and $X \perp\!\!\!\perp W$)
    * $ \iff I(X; W | Y) \ge 0$ (true)
* Conditioning on $Z$ (parent of the cause $X$):
    * Reduced effect size
    * More judgments of independence
    * Lower $1 - \beta$, but higher $1 - \alpha$
* Conditioning on $W$ (other parent of the effect $Y$):
    * Increased effect size
    * Fewer judgments of independence
    * Higher $1 - \beta$, but lower $1 - \alpha$

Modify PC algorithm as:
* Test $X \not\perp\!\!\!\perp Y | S$ for all adjacent $X, Y$ and all conditioning sets $S$ of adjacent variables
* Remove link if $\min_S I(X; Y | S) < I_{\alpha_{PC}}$
* Previous PC algorithm: 
    * $\exists S . I(X; Y | S) = 0$ 
    * $\iff \min_S I(X; Y | S) = 0$ 
    * $\implies \min_S I(X; Y | S) < I_{\alpha_{PC}}$

PCMCI+:
* The choice of sepsets influences the performance of the orientation phase
* Improve the reliability of CI tests by optimizing the choice of cond. sets
* Phases:
    * $PC_1$ lagged phase:
        1. Iterate through lagged links only
        1. Sort them by association strength with $X_{j, t}$
        1. Select top $p$ conditions
        
        Result: $\forall p . S := \{{\cal A}(X_{j, t})\}_{l=1}^p$
        * Less likely to cond. on effect-size weakening parents of $X_{j, t - \tau}$
        * Still sufficient to block most paths
        * Converges to lagged parents + parents of contemporaneous ancestors $\hat{\beta}^{-}_t({X_{j, t}})$
            * Also all contemporaneous ancestors are replaced by their parents
            * No more contemporaneous links
        
        
    * $MCI$ contemporaneous phase
        * Goal: Remove spurious links due to contemporaneous drivers and store sepsets
        * Initialize with lagged links $\hat{\beta}^{-}_t({X_{j, t}})$ and all contemporaneous links
        * Conduct further CI tests for link removals
        * $\forall i, j, \tau$
            * $X_{i, t - \tau} \perp\!\!\!\perp X_{j, t} | S$ (block contemporaneous paths)
            * $\cup \hat{\beta}^{-}_t({X_{j, t}}) \backslash \{X_{i, t - \tau}\}$ (block lagged paths, see $X_{i, t - \tau}$ might be a parent of $X_{j, t}$)
            * $\cup \hat{\beta}^{-}_{t - \tau}({X_{i, t - \tau}})$ (block lagged paths, see $X_{j, t}$ is not a parent of $X_{i, t - \tau}$)
        * Then the MCI effect size is larger than the PC effect size
    * Orientation phase

* PCMCI+ is sound and complete
* MCI tests are well-calibrated also for autocorrelated data

Granger Causality

* Let ${\cal X}_{\ne i, <t}$ be the past of all time series except $X_i$
* Let $X_{i, <t}$ be the past of $X_i$
* $X_i$ Granger-causes $X_j$
    * If $X_{j, t} \not\perp\!\!\!\perp X_{i, <t} | {\cal X}_{\ne i, <t}$
    * If all edges are lagged, then it suffices to find $\tau > 0$ s.t. $X_{i, t - \tau} \to X_{j, t}$ 
    * Or statistically, if model 2 has a significantly smaller error:
        * Model 1: $X_{j, t} = \sum_{\tau = 1}^m \sum_{k \ne i} a_{\tau k}X_{k, t - \tau} + \eta_t$
        * Model 2: $X_{j, t} = \sum_{\tau = 1}^m \sum_{k \ne i} a_{\tau k}X_{k, t - \tau} + \sum_{\tau = 1}^m a_{\tau i}X_{i, t - \tau} + \eta_t'$
* Does not allow unobserved confounders and contemporaneous causal relationships.
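The model comparison can be sketched for two series and lag $m = 1$. The example assumes a hypothetical process in which $X_i$ drives $X_j$ with lag one but not vice versa, and it compares residual sums of squares directly rather than running a formal F-test; `rss` and `granger_ratio` are illustrative helpers.

```python
import random

def rss(y, cols):
    """Residual sum of squares of a least-squares fit of y on 1 or 2 regressors (no intercept)."""
    if len(cols) == 1:
        u = cols[0]
        b = sum(a * c for a, c in zip(u, y)) / sum(a * a for a in u)
        return sum((c - b * a) ** 2 for a, c in zip(u, y))
    u, v = cols
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    suv = sum(a * b for a, b in zip(u, v))
    suy = sum(a * c for a, c in zip(u, y))
    svy = sum(b * c for b, c in zip(v, y))
    det = suu * svv - suv * suv                 # solve the 2x2 normal equations
    b1 = (suy * svv - svy * suv) / det
    b2 = (svy * suu - suy * suv) / det
    return sum((c - b1 * a - b2 * b) ** 2 for a, b, c in zip(u, v, y))

def granger_ratio(cause, effect):
    """RSS(model 2) / RSS(model 1) for lag 1; values well below 1 suggest Granger causality."""
    target, own_past, cause_past = effect[1:], effect[:-1], cause[:-1]
    return rss(target, [own_past, cause_past]) / rss(target, [own_past])

rng = random.Random(0)
xi, xj = [0.0], [0.0]
for t in range(1, 3000):
    xi.append(0.5 * xi[t - 1] + rng.gauss(0, 1))                     # autonomous series
    xj.append(0.5 * xj[t - 1] + 0.8 * xi[t - 1] + rng.gauss(0, 1))   # driven by the past of xi

ratio_i_to_j = granger_ratio(xi, xj)
ratio_j_to_i = granger_ratio(xj, xi)
```

Adding the past of $X_i$ reduces the error for $X_j$ considerably, while adding the past of $X_j$ barely changes the error for $X_i$.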


## Causal Effect

> Quantifying causal relationships, like: how strong of an effect does an intervention on X have on Y?

Causal effect of X on Y: $\tau_{X \to Y} = \frac{\partial}{\partial x} \Bbb{E}[Y | do(X = x)]$

Assume linear model: $\tau_{X \to Y} \\ = \Bbb{E}[Y | do(X = x_0 + 1)] - \Bbb{E}[Y | do(X = x_0)] \\ = \Bbb{E}[Y | do(X = 1)] - \Bbb{E}[Y | do(X = 0)]$ 

* Non-zero means of the noise variables cancel out

> Effect of a (directed) path: Product of structural coefficients along the path

> Total effect: Sum of all effects of all paths

Given causal model: 
* $Z := \eta_Z, \eta_Z \sim N(0, \sigma_Z^2)$
* $X := c Z + \eta_X, \eta_X \sim N(0, \sigma_X^2)$ 
* $Y := a X + b Z + \eta_Y, \eta_Y \sim N(0, \sigma_Y^2)$ 

Perform as follows:
1. Regress out influence of Z on X: Residual $\Delta X \\ = X - \frac{\text{cov}(X, Z)}{\text{var}(Z)} Z \\ = X - \frac{c \cdot \text{cov}(\eta_Z, \eta_Z)}{\text{cov}(\eta_Z, \eta_Z)} Z \\ = X - c Z = \eta_X$
1. Regress out influence of Z on Y: Residual $\Delta Y \\ = Y - \frac{\text{cov}(Y, Z)}{\text{var}(Z)} Z \\ = Y - \frac{(ac + b) \cdot \text{cov}(\eta_Z, \eta_Z)}{\text{cov}(\eta_Z, \eta_Z)} Z \\ = Y - (ac + b) Z = a \eta_X + \eta_Y$
1. Linearly regress $\Delta Y$ on $\Delta X$: $\frac{\text{cov}(\Delta Y, \Delta X)}{\text{var}(\Delta X)} \\ = \frac{a \cdot \text{cov}(\eta_X, \eta_X)}{\text{cov}(\eta_X, \eta_X)} = a$
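The three steps can be checked numerically. The sketch below simulates the linear Gaussian model above with the assumed values $a = 2$, $b = 1$, $c = 1$ and unit noise variances, then compares the naive regression of $Y$ on $X$ with the residual-on-residual regression; helper names are illustrative.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance (1/n normalization)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def regress_out(target, z):
    """Residual of `target` after removing its linear dependence on z."""
    slope = cov(target, z) / cov(z, z)
    return [t - slope * w for t, w in zip(target, z)]

rng = random.Random(0)
a, b, c = 2.0, 1.0, 1.0                     # assumed structural coefficients
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [c * w + rng.gauss(0, 1) for w in z]
y = [a * u + b * w + rng.gauss(0, 1) for u, w in zip(x, z)]

naive = cov(x, y) / cov(x, x)               # confounded by Z: a + b*c*var(Z)/var(X) = 2.5
dx = regress_out(x, z)                      # step 1: Delta X
dy = regress_out(y, z)                      # step 2: Delta Y
estimate = cov(dy, dx) / cov(dx, dx)        # step 3: should be close to a = 2
```

The naive slope is biased upwards by the confounding path through $Z$, while `estimate` recovers the structural coefficient $a$ up to sampling noise.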

Definition: $Z$ with $Z \cap \{X, Y\} = \{\}$ is an adjustment set for $(X, Y)$ if:

$p(y | do(x)) = \begin{cases}p(y | x) & Z = \{\} \\ \int p(y | x, z) p(z) dz & \text{else} \end{cases}$

$\frac{\partial}{\partial x} \Bbb{E}[Y | do(X = x)] \\ = \frac{\partial}{\partial x} \int y p(y | do(X = x)) dy \\ = \frac{\partial}{\partial x} \int y \int p(y | X = x, Z = z) p(z) dz dy \\ = \int p(z) \frac{\partial}{\partial x} \int y p(y | X = x, Z = z) dy dz \\ = \int p(z) \frac{\partial}{\partial x} \Bbb{E}[Y | X = x, Z = z] dz ( = \Bbb{E}_Z \{\frac{\partial}{\partial x} \Bbb{E}[Y | X = x, Z = z]\}) \\ = \int p(z) \frac{\partial}{\partial x} (ax + bz) dz \\ = \int p(z) a dz = a$

> Also, the causal effect of X on Y is the coefficient of X in the multivariate linear regression of Y on X and Z, which gives the estimator $\hat{\tau}_{X \to Y}$.

Adjustment sets in Bayesian network: 
* $p(y | do(x)) = \int p(y | x, pa_X) p(pa_X) dpa_X$ with subpopulations $pa_X$ of the parents $PA_X$
* Causal effect is the avg. causal effects in subpopulations

Let: 
* Cause: $X$
* Effect: $Y$
* Parents: $PA_X$
* Other: $W$

Integrate over $x$, $w$ and $pa_X$:

$p(y | do(x_0)) \\ = \int p(y, pa_X | do(x_0)) dpa_X \\ = \int \int p(y, pa_X, w | do(x_0)) dw dpa_X \\ = \int \int \int p(x, y, pa_X, w | do(x_0)) dx dw dpa_X \\ = \int \int \int \frac{p(x, y, pa_X, w)}{p(x | pa_X)} \delta (x - x_0) dx dw dpa_X \\ = \int \int \int \frac{p(x, y, pa_X, w)}{p(x, pa_X)} p(pa_X) \delta (x - x_0) dx dw dpa_X \\ = \int \int \int p(y, w | x, pa_X) p(pa_X) \delta (x - x_0) dx dw dpa_X \\ = \int p(pa_X) \int \int p(y, w | x, pa_X) \delta (x - x_0) dx dw dpa_X \\ = \int p(pa_X) \int p(y, w | x_0, pa_X) dw dpa_X \\ = \int p(pa_X) p(y | x_0, pa_X) dpa_X$

Backdoor path: Path between $X$ and $Y$ that contains an arrow into $X$

Backdoor criterion: $Z$ is an adjustment set for $(X, Y)$ if (sufficient, not necessary):
1. Z blocks all backdoor paths relative to $(X, Y)$
1. No node in $Z$ is a descendant of $X$

Backdoor adjustment sets: $p(y | do(x)) = \int p(y | x, z) p(z) dz$ with $PA_X$ as a special case of $Z$

* $M(X \to Y) := $ the set of all variables on a directed path $X \to Y$ (other than $X, Y$)
* Path causal if $X \to … \to Y$ else non-causal
* $\text{forb}(X \to Y) := $ the set containing $X$ and all descendants of any node in $Y \cup M(X \to Y)$

Generalized adjustment criterion: $Z$ is adjustment set for $(X, Y)$ iff:
1. Z blocks all non-causal paths relative to $(X, Y)$
    * Block non-causal dependencies introduced by general non-causal paths
1. No node in $Z$ is in $\text{forb}(X \to Y)$
    * $Y \cup M(X \to Y) \subset \text{forb}(X \to Y)$: Do not block causal dependencies
    * $\text{forb}(X \to Y) \backslash (Y \cup M(X \to Y))$: Do not introduce collider bias

> $Z$ is an adjustment set $\iff$ Z satisfies the generalized adjustment criterion 

Different $Z$ lead to different estimators $\hat{\tau}_{X \to Y}$ with different variance.

Optimal adjustment set: $O(X \to Y) = pa(Y \cup M(X \to Y)) \backslash \text{forb}(X \to Y)$ has the smallest asymptotic variance

Asymptotic variance of the linear OLS estimator: $\text{var}(\hat{\tau}_{X \to Y}) \approx \frac{1}{n}\frac{\text{var}(Y | X, Z)}{\text{var}(X | Z)}$ (its square root is the standard error)

$\min \text{var}(\hat{\tau}_{X \to Y})$:

$\implies (1) \min \text{var}(Y | X, Z)$ 

$\implies (2) \max \text{var}(X | Z)$

> Condition on close causal ancestors of Y without blocking causal paths, and avoid conditioning on close ancestors of X.

## Causal Effect Continued

Beyond Adjustment: There are cases in which no adjustment sets exist.

Front-door criterion: A set $Z \subset X \backslash \{X, Y\}$ satisfies the front-door criterion for $(X, Y)$ if:
1. $Z$ intercepts all directed paths from $X$ to $Y$
2. There is no unblocked back-door path from $X$ to $Z$
3. All back-door paths from $Z$ to $Y$ are blocked by $X$

Front-door adjustment: $p(y | do(x)) = \int d z p(z | x) \int d x' p(y | x', z) p(x')$ 

Estimation of causal effect in linear models:

$\frac{\partial}{\partial x} \Bbb{E} [Y | do(X = x)] = \frac{\partial}{\partial x} \int d y y p(y | do(x)) \\ = \frac{\partial}{\partial x} \int d y y \int d z p(z | x) \int dx' p(y | x', z) p(x') \\ = \frac{\partial}{\partial x} \int d z p(z | x) \int dx' p(x') \int d y y p(y | x', z) \\ = \frac{\partial}{\partial x} \int d z p(z | x) \int dx' p(x') \Bbb{E} [Y | X = x', Z = z] \\ = \frac{\partial}{\partial x} \int d z p(z | x) \int dx' p(x') (a x' + b z) \\ = \frac{\partial}{\partial x} \int d z p(z | x) (a \Bbb{E}_X[X] + b z) \\ = \frac{\partial}{\partial x} (a \Bbb{E}_X[X] + b \Bbb{E} [Z | X = x]) \\ = b \frac{\partial}{\partial x} \Bbb{E} [Z | X = x] \\ = \nabla_Z \Bbb{E} [Y | X = x', Z = z] \frac{\partial}{\partial x} \Bbb{E} [Z | X = x] \\ = \sum_{Z_i \in Z} \frac{\partial}{\partial z_i} \Bbb{E} [Y | X = x', Z = z] \frac{\partial}{\partial x} \Bbb{E} [Z_i | X = x] \\ ( = \sum_{Z_i \in Z} \frac{\partial}{\partial z_i} \Bbb{E} [Y | do(Z = z)] \cdot \frac{\partial}{\partial x} \Bbb{E} [Z_i | do(X = x)])$

Assume $|Z| = 1$, then:
* No unblocked back-door paths from $X$ to $Z$: $\frac{\partial}{\partial x} \Bbb{E} [Z | do(X = x)] = \frac{\partial}{\partial x} \Bbb{E} [Z | X = x]$
* $X$ blocks all back-door paths from $Z$ to $Y$: $\frac{\partial}{\partial z} \Bbb{E} [Y | do(Z = z)] = \frac{\partial}{\partial z} \Bbb{E} [Y | X = x', Z = z]$
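For $|Z| = 1$ the front-door estimate is a product of two regression coefficients. The sketch below assumes a hypothetical linear model $L \to X$, $L \to Y$, $X \to Z \to Y$ with latent $L$ and made-up coefficients, so the true effect of $X$ on $Y$ is $1.5 \cdot 2 = 3$; the helper names are illustrative.

```python
import random

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slope(y, x):
    """OLS slope of y regressed on x."""
    x, y = center(x), center(y)
    return dot(x, y) / dot(x, x)

def coef_first(y, u, v):
    """Coefficient of u in the OLS regression of y on u and v."""
    y, u, v = center(y), center(u), center(v)
    det = dot(u, u) * dot(v, v) - dot(u, v) ** 2
    return (dot(u, y) * dot(v, v) - dot(v, y) * dot(u, v)) / det

rng = random.Random(0)
n = 20000
lat = [rng.gauss(0, 1) for _ in range(n)]                  # L: latent confounder of X and Y
x = [l + rng.gauss(0, 1) for l in lat]
z = [1.5 * a + rng.gauss(0, 1) for a in x]                 # mediator Z = 1.5 X + noise
y = [2.0 * m + 2.0 * l + rng.gauss(0, 1) for m, l in zip(z, lat)]

naive = slope(y, x)                                        # biased by L, expected near 4
front_door = slope(z, x) * coef_first(y, z, x)             # dE[Z|x]/dx * dE[Y|x',z]/dz
```

Neither factor requires observing $L$: the first regression is unconfounded, and the second blocks the back-door path from $Z$ to $Y$ by including $X$.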

Do-calculus:

Given notations:
* $G_{\overline{X}}$: Remove all $\to X$
* $G_{\overline{X} \underline{Z}}$: Remove all $\to X$ and $Z \to$
* $G_{\overline{X} \overline{Z(W)}}$: Remove all $\to X$ and all $\to Z(W)$, where $Z(W)$ are the $Z$-nodes that are not ancestors of any $W$-node in $G_{\overline{X}}$

```go
// Pseudocode: the predicate and apply helpers are placeholders.
for notDoFree() {
    switch {
    case matchesRule1(): // insert / delete observation
        applyRule1()
    case matchesRule2(): // exchange observation and action
        applyRule2()
    case matchesRule3(): // insert / delete action
        applyRule3()
    default:
        return // no rule applies: stop instead of looping forever
    }
}
```

1. $p(y | do(x), z, w) = p(y | do(x), w)$ if $Y \perp\!\!\!\perp Z | X, W$ in $G_{\overline{X}}$
2. $p(y | do(x), do(z), w) = p(y | do(x), z, w)$ if $Y \perp\!\!\!\perp Z | X, W$ in $G_{\overline{X}\underline{Z}}$
3. $p(y | do(x), do(z), w) = p(y | do(x), w)$ if $Y \perp\!\!\!\perp Z | X, W$ in $G_{\overline{X}\overline{Z(W)}}$

Do-Calculus Properties:
* $p(y | do(x), z)$ is identifiable iff it can be reduced to a do-free expression involving observed variables by a do-calculus algorithm.
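
As a worked illustration, assume the hypothetical three-node graph $Z \to X$, $Z \to Y$, $X \to Y$ (not taken from the notes above). The rules then recover the back-door adjustment formula:

1. $p(y | do(x)) = \int d z \, p(y | do(x), z) \, p(z | do(x))$ (law of total probability)
2. Rule 2: $p(y | do(x), z) = p(y | x, z)$, because $Y \perp\!\!\!\perp X | Z$ in $G_{\underline{X}}$ (the only remaining path $X \leftarrow Z \to Y$ is blocked by $Z$)
3. Rule 3: $p(z | do(x)) = p(z)$, because $Z \perp\!\!\!\perp X$ in $G_{\overline{X}}$ (the only remaining path $X \to Y \leftarrow Z$ is blocked at the collider $Y$)
4. Hence $p(y | do(x)) = \int d z \, p(y | x, z) \, p(z)$, which is do-free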

Instrumental variables:

* Assume $I \overset{a}{\to} X \overset{c}{\to} Y \overset{d}{\leftarrow} \gray{L} \overset{b}{\to} X$ with $L$ latent, so the effect of $X$ on $Y$ is not identifiable by adjustment
* Assume further a linear Gaussian model:
  * $L = \eta_L \sim N(0, \sigma_L^2)$
  * $I = \eta_I \sim N(0, \sigma_I^2)$
  * $X = a I + b L + \eta_X$ with $\eta_X \sim N(0, \sigma_X^2)$
  * $Y = c X + d L + \eta_Y$ with $\eta_Y \sim N(0, \sigma_Y^2)$
* So $\frac{\partial}{\partial x} \Bbb{E}[Y | do(X = x)] = c$
* Identify: $\text{cov}(I, X) = a \text{var}(I) + b \text{cov}(L, I) + \text{cov}(\eta_X, I) = a \text{var}(I)$
* Identify: $\text{cov}(I, Y) = c a \text{var}(I) + (c b + d)\text{cov}(L, I) + c \text{cov}(\eta_X, I) + \text{cov}(\eta_Y, I) = c a \text{var}(I)$
* Then $c = \frac{\text{cov}(I, Y)}{\text{cov}(I, X)}$ (requires $a \ne 0$)
* Estimate: regress $Y$ on $I$ and $X$ on $I$; $c$ is the ratio of the two regression coefficients
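
The identification argument above can be sketched as a simulation. The coefficients `a`, `b`, `c`, `d` and the unit noise variances below are illustrative assumptions. The covariance ratio should recover $c$, while a plain regression of $Y$ on $X$ stays confounded by $L$.

```python
import random


def cov(u, v):
    """Sample covariance of two equally long sequences."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v)) / (len(u) - 1)


def simulate_iv(n, a=1.0, b=1.0, c=2.0, d=3.0, seed=0):
    """Draw (I, X, Y) from the linear Gaussian model; L is never returned."""
    rng = random.Random(seed)
    instrument, xs, ys = [], [], []
    for _ in range(n):
        latent = rng.gauss(0, 1)                      # L, hidden confounder
        i = rng.gauss(0, 1)                           # I, instrument
        x = a * i + b * latent + rng.gauss(0, 1)
        y = c * x + d * latent + rng.gauss(0, 1)
        instrument.append(i)
        xs.append(x)
        ys.append(y)
    return instrument, xs, ys


instrument, xs, ys = simulate_iv(20000)
iv_estimate = cov(instrument, ys) / cov(instrument, xs)   # cov(I, Y) / cov(I, X)
ols_estimate = cov(xs, ys) / cov(xs, xs)                  # biased by the latent L
print(iv_estimate, ols_estimate)
```

With these settings the instrumental-variable estimate should land near `c = 2.0`, whereas the regression slope converges to $c + d b \, \text{var}(L) / \text{var}(X) = 3$.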

CPDAG causal estimation for non-identifiability:
1. Causal discovery -> CPDAG of the causal graph
2. Causal effect estimation using CPDAG

Generalized adjustment criterion: Sets of variables which are adjustment sets in every DAG in the Markov equivalence class represented by the CPDAG

IDA (Intervention Calculus when the DAG is Absent) algorithm: 
* Goal: Bound the causal effect when there is no adjustment set in CPDAG
* List the DAGs in Markov equivalence class
* For each DAG: determine an adjustment set, estimate the causal effect
* Return the multiset of possible causal effects; its minimum absolute value serves as a lower bound on the size of the true effect

Outlook:
* Identification of causal effects via surrogate experiments
* Transferring causal knowledge from one system to another
* Identification of causal effects in the presence of latent / selection variables
* Do-calculus in case of unknown DAG

## Big Ideas

<hr/>

Confusion Matrix:

|                   | **Actual Positive (P)** | **Actual Negative (N)** |
|-------------------|-------------------------|-------------------------|
| **Predicted Positive (P')** | True Positive (TP): $1 - \alpha$  | False Positive (FP): $\beta$  |
| **Predicted Negative (N')** | False Negative (FN): $\alpha$  | True Negative (TN): $1 - \beta$  |

* Precision: $\frac{TP}{TP + FP}$ - of all positive predictions, the fraction that is actually positive
* Recall (Sensitivity): $\frac{TP}{TP + FN}$ - of all actual positive cases, the fraction that is predicted positive
* Specificity: $\frac{TN}{TN + FP}$ - of all actual negative cases, the fraction that is predicted negative

* F1-Score: Harmonic mean of Precision and Recall - $\frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} = \frac{2}{\frac{TP + FP}{TP} + \frac{TP + FN}{TP}} = \frac{2}{\frac{2 * TP + FP + FN}{TP}} = \frac{2 * TP}{2 * TP + FP + FN}$. With the rates from the table (which presumes balanced classes, $P = N$): $\frac{2 * (1 - \alpha)}{2 * (1 - \alpha) + \alpha + \beta} = \frac{1 - \alpha}{1 - (\alpha - \beta) / 2}$
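
A small sketch of these metrics computed from raw counts. The function name and the example counts are illustrative, not part of the notes.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)   # closed form of the harmonic mean
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
harmonic_mean = 2 / (1 / metrics["precision"] + 1 / metrics["recall"])
print(metrics, harmonic_mean)
```

The closed form and the harmonic mean agree, which mirrors the algebra in the F1 bullet.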

<hr/>

Degrees of Freedom:

* If a linear equation system (LES) has n independent equations in n unknowns, it has exactly one solution
* With n-1 independent equations, there are infinitely many solutions forming a line (1 free variable)
* With n-2 independent equations, there are infinitely many solutions forming a plane (2 free variables)
* … 
* Each additional degree of freedom adds one dimension to the solution space
* In an extreme case, a NN has millions of parameters to estimate
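
For a consistent system, the number of free variables equals the number of unknowns minus the rank of the coefficient matrix. A minimal sketch with exact rational arithmetic follows; the helper names are assumptions for illustration.

```python
from fractions import Fraction


def rank(matrix):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(value) for value in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank_count = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank_count, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank_count], rows[pivot] = rows[pivot], rows[rank_count]
        for r in range(rank_count + 1, len(rows)):
            factor = rows[r][col] / rows[rank_count][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank_count])]
        rank_count += 1
    return rank_count


def degrees_of_freedom(coefficients):
    """Free variables of a consistent linear system with this coefficient matrix."""
    return len(coefficients[0]) - rank(coefficients)


print(degrees_of_freedom([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))  # three independent equations
print(degrees_of_freedom([[1, 1, 1], [1, -1, 0]]))            # two independent equations
print(degrees_of_freedom([[1, 1, 1], [2, 2, 2]]))             # second equation is redundant
```

The last call shows why the equations must be independent: a redundant equation does not remove a degree of freedom.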

<hr/>

Bonferroni Correction:

Given $m > 1$ tests:
* Assume $\alpha = \Sigma_i \alpha_i$
* $P(\bigvee_i \alpha_i) = 1 - P(\bigwedge_i \neg \alpha_i) = 1 - \Pi_i P(\neg \alpha_i) = 1 - \Pi_i (1 - \alpha_i)$ (assuming independent tests)
* $\max \frac{\log P(\bigvee_i \alpha_i)}{\Sigma_i \alpha_i} = \max \frac{\log (1 - \Pi_i (1 - \alpha_i))}{\Sigma_i \alpha_i} \iff \min \frac{\log (\Pi_i (1 - \alpha_i))}{\Sigma_i \alpha_i} = \min \frac{\Sigma_i \log(1 - \alpha_i)}{\Sigma_i \alpha_i}$ (for a fixed budget $\Sigma_i \alpha_i = \alpha$, both problems have the same optimiser)
* $0 = \frac{\partial}{\partial \alpha_i}(\frac{\Sigma_j \log(1 - \alpha_j)}{\Sigma_j \alpha_j}) = \frac{\frac{1}{\alpha_i - 1}(\Sigma_j \alpha_j) - \Sigma_j \log(1 - \alpha_j)}{(\Sigma_j \alpha_j)^2} \sim \frac{1}{\alpha_i - 1}(\Sigma_j \alpha_j) - \Sigma_j \log(1 - \alpha_j) = \Sigma_j (\frac{\alpha_j}{\alpha_i - 1} - \log(1 - \alpha_j)) = -\Sigma_j (\frac{\alpha_j}{1 - \alpha_i} - \log(\frac{1}{1 - \alpha_j})) \sim \Sigma_j \alpha_j - (1 - \alpha_i) \Sigma_j \log(\frac{1}{1 - \alpha_j}) = \alpha - (1 - \alpha_i) k$ with $k = \Sigma_j \log(\frac{1}{1 - \alpha_j})$ treated as a strictly positive constant; $\sim$ means equal up to a nonzero factor.

Then for all $i$: $\alpha = (1 - \alpha_i)k \iff \alpha_i = 1 - \frac{\alpha}{k}$
* Also, upper bound estimation yields $\alpha_1 = … = \alpha_m = \frac{\alpha}{m}$
* $1 - \Pi_i (1 - \alpha_i) = 1 - \Pi_i (1 - \frac{\alpha}{m}) = 1 - (1 - \frac{\alpha}{m})^m = 1 - P(\bigwedge \neg \frac{\alpha}{m}) = P(\bigvee \frac{\alpha}{m}) \le m * P(\frac{\alpha}{m}) = m * \frac{\alpha}{m} = \alpha$ (union bound)
* Interestingly, if $m \to \infty$, then $1 - (1 - \frac{\alpha}{m})^m \to 1 - e^{-\alpha}$
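
A short numerical sketch of the bound, assuming independent tests under the null. The values `alpha = 0.05` and `m = 20` are arbitrary illustrative choices.

```python
import math


def fwer_independent(alphas):
    """P(at least one false rejection) for independent tests under the null."""
    return 1.0 - math.prod(1.0 - a for a in alphas)


alpha, m = 0.05, 20
uncorrected = fwer_independent([alpha] * m)       # every test run at level alpha
corrected = fwer_independent([alpha / m] * m)     # Bonferroni: every test at alpha / m
limit = 1.0 - math.exp(-alpha)                    # value approached as m grows

print(uncorrected, corrected, limit)
```

The corrected family-wise error rate stays below `alpha` and sits just above the limiting value $1 - e^{-\alpha}$, while the uncorrected rate is far larger.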

<hr/>

Causal Sufficiency:

* $p(w) = \int p(w, l) dl$ with $L = X \backslash W$ latent variables.
* Implies that no pair of distinct observed vertices has a hidden common cause (a latent variable).
* All relevant common causes are within the model, so the causal model is sufficient
* Alternatively: $p((\eta_i)_i) = \Pi_i p(\eta_i)$, i.e. the noise terms are jointly independent

<hr/>

CMI (Conditional Mutual Information):

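* The pointwise ratio $\frac{p(x, y | z)}{p(x | z) p(y | z)}$ may be smaller than 1, so non-negativity only holds on average. For every fixed $z$, Jensen's inequality ($-\log$ is convex) gives:
    * $\int_y \int_x p(x, y | z) \log \frac{p(x, y | z)}{p(x | z) p(y | z)} dx dy = -\int_y \int_x p(x, y | z) \log \frac{p(x | z) p(y | z)}{p(x, y | z)} dx dy$
    * $\ge -\log \int_y \int_x p(x | z) p(y | z) dx dy \ge -\log 1 = 0$
* $\int_z p(z) dz \int_y \int_x p(x, y | z) \log \frac{p(x, y | z)}{p(x | z) p(y | z)} dx dy \ge 0$
* $\int_z \int_y \int_x p(x, y, z) \log \frac{p(x, y | z)}{p(x | z) p(y | z)} dx dy dz \ge 0$
* $I(X; Y | Z) \ge 0$

* Equality holds iff $p(x, y | z) = p(x | z) p(y | z)$, i.e. $X \perp\!\!\!\perp Y | Z$.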

* Also express $I(X; Y | Z)$ using $H$-entropy as 
    * $\int_z p(z) dz \int_y \int_x p(x, y | z) \log \frac{p(x, y | z)}{p(x | z) p(y | z)} dx dy$
    * $ = \int_z p(z) dz \int_y \int_x p(x, y | z) \log \frac{p(x | y, z)}{p(x | z)} dx dy$
    * $ = -\int_z p(z) dz \int_y \int_x p(x, y | z) (\log p(x | z) - \log p(x | y, z)) dx dy$
    * $ = H(X | Z) - H(X | Y, Z)$

* Chain Rule:
    * $I(X; Y) + I(X; Z | Y)$ 
    * $ = (H(X) - H(X | Y)) + (H(X | Y) - H(X | Y, Z))$
    * $ = H(X) - H(X | Y, Z)$
    * $ = (H(X) - H(X | Z)) + (H(X | Z) - H(X | Y, Z))$
    * $ = I(X; Z) + I(X; Y | Z)$
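
The entropy form and the chain rule can be verified on a small discrete distribution. The sketch below uses a made-up joint pmf over binary $(x, y, z)$; the helper names are assumptions for illustration.

```python
import math


def marginal(joint, keep):
    """Sum a joint pmf {(x, y, z): p} down to the coordinates listed in keep."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def entropy(pmf):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)


def cond_entropy(joint, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return entropy(marginal(joint, target + given)) - entropy(marginal(joint, given))


def cmi(joint, a, b, given=()):
    """I(A; B | C) = H(A | C) - H(A | B, C); an empty C gives I(A; B)."""
    return cond_entropy(joint, a, given) - cond_entropy(joint, a, b + given)


def cmi_direct(joint):
    """I(X; Y | Z) from the definition, for outcomes ordered as (x, y, z)."""
    p_z, p_xz, p_yz = marginal(joint, (2,)), marginal(joint, (0, 2)), marginal(joint, (1, 2))
    return sum(p * math.log(p * p_z[(z,)] / (p_xz[(x, z)] * p_yz[(y, z)]))
               for (x, y, z), p in joint.items() if p > 0)


# Made-up joint pmf over binary (x, y, z); the values sum to 1.
joint = {(0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.15,
         (1, 0, 0): 0.05, (1, 0, 1): 0.15, (1, 1, 0): 0.10, (1, 1, 1): 0.20}
X, Y, Z = (0,), (1,), (2,)

lhs = cmi(joint, X, Y) + cmi(joint, X, Z, Y)   # I(X; Y) + I(X; Z | Y)
rhs = cmi(joint, X, Z) + cmi(joint, X, Y, Z)   # I(X; Z) + I(X; Y | Z)
print(cmi(joint, X, Y, Z), cmi_direct(joint), lhs, rhs)
```

The first two printed values should match (entropy form versus definition), and so should the last two (chain rule).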

<hr/>

Determinism:

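* If $X = f(Z)$, then $X$ and $Z$ have to be merged into a single node.
* Proof sketch ($X$ carries no information beyond $Z$, i.e. $I(X; Y | Z) = 0$ for every $Y$):
    * $I(f(Z); Y | Z) = \int_z p(z) dz \int_y p(f(z), y | z) \log \frac{p(f(z), y | z)}{p(f(z) | z) p(y | z)} dy$ (given $Z = z$, $X$ takes the single value $f(z)$, so no integral over $x$ remains)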
    * $p(f(z) | z) = 1$, because $X := f(z)$ is always satisfied if $Z := z$ 
    * $p(f(z), y | z) = p(y | z)$, because similarly $X := f(z)$ on the left side doesn't change anything.
    * Then $p(f(z), y | z) \log \frac{p(f(z), y | z)}{p(f(z) | z) p(y | z)} = p(y | z) \log \frac{p(y | z)}{p(y | z)} = p(y | z) \log 1 = 0$
* Category theory suggests that if two variables have exactly the same relationships with all other nodes, then they are the same.
* Regard the noise of a variable as the characteristics of a person: if one person is married to another, they are fully codependent and may be counted as one person.


### Irrelevant Ideas

<hr/>

Multiple integration is a functional that takes a function as input and returns a new function (or a constant).

The lower and upper limits of integration must be partially ordered, i.e. $\int_a^b \int_{\alpha(x)}^{\beta(x)} \int_{\rho(x, y)}^{\sigma(x, y)} … dz dy dx $.

* Reflexive
* Transitive
* Antisymmetric

If a cycle arises, e.g. $\int_{\alpha(y)}^{\beta(y)} \int_{\rho(x)}^{\sigma(x)} … dy dx$, the order is no longer antisymmetric, and the expression is therefore no longer a function, because functions are defined as unique assignments.

One then has to find all partial orders in the integration and write the integral as the sum over all integration paths.

For example:

$\int \int_{\text{unit circle}} … dy dx = \int_{-1}^{1} \int_0^{\sqrt{1 - x^2}} … dy dx + \int_{-1}^{1} \int_0^{\sqrt{1 - y^2}} … dx dy = 2 * \int_{-1}^{1} \int_0^{\sqrt{1 - x^2}} … dy dx$

In this way every path is integrated separately. In the worst case this yields limits $\alpha_i(x_{\ne i}), \beta_i(x_{\ne i})$ for every dimension $i$, so all permutations of the $n$ dimensions give $n!$ paths.

<hr/>

Let $p_i : \Omega \to [0, 1] \forall i \in [n]$ be probability functions.

$\Bbb{E}[p_i | i \in [n]]$ is also a probability function, because $\int … \int \frac{1}{n} \sum_{i = 1}^n p_i d\vec{x} = \frac{1}{n} \sum_{i = 1}^n (\int … \int p_i d\vec{x}) = \frac{1}{n} * n = 1$

$\Pi_{i = 1}^n p_i$ is non-negative but in general does not integrate to 1, so it is a probability function only after normalisation: $\frac{\Pi_{i = 1}^n p_i}{\int … \int \Pi_{i = 1}^n p_i d\vec{x}}$. The tempting argument $\log \int … \int \Pi_{i = 1}^n p_i d\vec{x} = 0 \implies \int … \int \Sigma_{i = 1}^n \log p_i d\vec{x} = 0$ fails because the logarithm cannot be moved inside the integral. Counterexample: for two uniform densities $p_1 = p_2 = \frac{1}{2}$ on $[0, 2]$, $\int p_1 p_2 dx = \frac{1}{2}$.

In [2]:
# A minimal directed graph: each vertex maps to a list of (neighbor, arrow) pairs,
# where '->' marks an outgoing edge and '<-' an incoming edge.

class Vertex:
    def __init__(self, value, conditioned: bool = False):
        self.value = value
        self.conditioned = conditioned

    def __repr__(self):
        value = str(self.value)
        if self.conditioned:
            value += '[*]'
        return value

class Graph:
    def __init__(self):
        self.edges = {}

    def add_node(self, *args, **kwargs):
        v = Vertex(*args, **kwargs)
        self.edges[v] = []
        return v

    def add_edge(self, start_vertex, end_vertex):
        if start_vertex not in self.edges:
            raise KeyError('Start Vertex not in Graph')
        if end_vertex not in self.edges:
            raise KeyError('End Vertex not in Graph')

        self.edges[start_vertex].append((end_vertex, '->'))
        self.edges[end_vertex].append((start_vertex, '<-'))

    def get_nodes(self):
        return self.edges.keys()

    def get_neighbors(self, vertex):
        return self.edges[vertex]

    def size(self):
        return len(self.edges)

    def __repr__(self):
        # One line per vertex; str(vertex) also works for non-string values
        # and keeps the '[*]' marker of conditioned vertices.
        return ''.join(f'{vertex} -> {neighbors}\n'
                       for vertex, neighbors in self.edges.items())

In [3]:
graph = Graph()

a = graph.add_node('A')
b = graph.add_node('B')
c = graph.add_node('C')
d = graph.add_node('D')
e = graph.add_node('E')
f = graph.add_node('F')
g = graph.add_node('G')
h = graph.add_node('H')

graph.add_edge(a, b)
graph.add_edge(b, c)
graph.add_edge(d, c)
graph.add_edge(d, e)
graph.add_edge(f, e)
graph.add_edge(e, g)
graph.add_edge(f, h)


In [None]:
graph

In [5]:
class GraphAnalyser:
    def __init__(self, graph: Graph):
        self.graph = graph
        
    def find_all_triples(self):
        """Return every path x - y - z (x != z) once, labelled chain, fork or collider."""
        triples = set()
        triples_list = []
        reverse = {'->': '<-', '<-': '->'}
        for x in self.graph.get_nodes():
            for (y, arrow1) in self.graph.get_neighbors(x):
                for (z, arrow2) in self.graph.get_neighbors(y):
                    if z is x:
                        continue
                    if arrow1 == '<-' and arrow2 == '<-':
                        # x <- y <- z is the chain z -> y -> x, found when starting from z
                        continue
                    elif arrow1 == '->' and arrow2 == '->':
                        type_of_triple = 'chain'
                    elif arrow1 == '<-' and arrow2 == '->':
                        type_of_triple = 'fork'
                    else:  # x -> y <- z
                        type_of_triple = 'collider'

                    triple = ' '.join([str(x), arrow1, str(y), arrow2, str(z)])
                    reverse_triple = ' '.join([str(z), reverse[arrow2], str(y), reverse[arrow1], str(x)])

                    # Forks and colliders are symmetric, so skip the mirrored duplicate.
                    if triple not in triples and reverse_triple not in triples:
                        triples.add(triple)
                        triples_list.append((triple, type_of_triple))
        return triples_list
    

In [None]:
GraphAnalyser(graph).find_all_triples()