# Motivating example: Figure 4

In [2]:
!pip install git+https://github.com/y0-causal-inference/eliater.git@linear-regression


This notebook works through the motivating example (Figure 4) of the paper *Eliater: an open source software for causal query estimation from observational measurements of biomolecular networks*. The graph contains a single mediator $M_1$ that connects the exposure $X$ to the outcome $Y$.

In [13]:
import numpy as np

from eliater.examples.frontdoor_backdoor_discrete import (
    single_mediator_with_multiple_confounders_nuisances_discrete_example,
)

In [2]:
graph = single_mediator_with_multiple_confounders_nuisances_discrete_example.graph

In [21]:
data = single_mediator_with_multiple_confounders_nuisances_discrete_example.generate_data(
    num_samples=500, seed=500
)

In [22]:
data.head()

Unnamed: 0,X,M1,Z1,Z2,Z3,R1,R2,R3,Y
0,1,1,1,1,1,1,1,1,1
1,1,1,1,0,0,1,1,1,1
2,1,0,1,0,1,1,1,0,1
3,1,1,1,1,1,1,1,1,0
4,1,1,1,1,0,1,1,1,1


## Step 1: Verify correctness of the network structure

In [23]:
from eliater.network_validation import print_graph_falsifications

In [24]:
print_graph_falsifications(graph, data, method="chi-square", verbose=True, significance_level=0.01)

Failed tests: 0/26 (0.00%)
Reject null hypothesis when p<0.01
left    right    given        stats         p    dof    p_adj  p_adj_significant
X       Z2       Z1       0.34346    0.842206      2        1  False
X       Z3       Z1       1.12103    0.570914      2        1  False
X       Y        M1|Z1    0.0346805  0.999851      4        1  False
M1      Z3       Z1       0.630044   0.729773      2        1  False
M1      Z2       Z1       1.51285    0.469341      2        1  False
Y       Z1       X|Z2     3.842      0.427811      4        1  False
R3      Z3       R1|Y     1.20928    0.876569      4        1  False
R1      Z2       Z1       0.988663   0.609979      2        1  False
M1      Z1       X        1.44905    0.484554      2        1  False
R2      Z1       R1       0.151588   0.927007      2        1  False
R2      X        R1       0.0114721  0.99428       2        1  False
R1      Z3       Z1       1.39424    0.498017      2        1  False
Y       Z2       Z1|Z3    2.6

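None of the 26 conditional independence tests implied by the d-separations of the network is rejected at the 0.01 significance level, so the data are consistent with the network structure. Hence, we can proceed to step 2.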
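To make the `stats`, `dof`, and `p` columns of the table above more concrete, here is a minimal sketch of a stratified chi-square conditional independence test that uses only the Python standard library. It is an illustration under stated assumptions, not the implementation used by Eliater or y0: the helper names `chi2_sf_even_dof` and `conditional_chi_square` are hypothetical, the statistic is assumed to be summed over the strata of the conditioning variables, and the closed-form p-value is valid only for even degrees of freedom (which holds for every row shown).

```python
import math
from collections import Counter, defaultdict


def chi2_sf_even_dof(stat, dof):
    """Upper-tail p-value of a chi-square statistic (closed form, even dof only)."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive, even dof")
    half = stat / 2.0
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(dof // 2))


def conditional_chi_square(rows, left, right, given=()):
    """Sum Pearson chi-square statistics of `left` vs. `right` over strata of `given`.

    `rows` is a list of dicts, one per sample. Returns (statistic, dof).
    """
    # One contingency table per combination of values of the conditioning variables.
    strata = defaultdict(Counter)
    for row in rows:
        key = tuple(row[g] for g in given)
        strata[key][(row[left], row[right])] += 1

    stat, dof = 0.0, 0
    for table in strata.values():
        n = sum(table.values())
        row_tot, col_tot = Counter(), Counter()
        for (a, b), count in table.items():
            row_tot[a] += count
            col_tot[b] += count
        # Compare observed counts with the counts expected under independence.
        for a in row_tot:
            for b in col_tot:
                expected = row_tot[a] * col_tot[b] / n
                stat += (table[(a, b)] - expected) ** 2 / expected
        dof += (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, dof


# Recompute two p-values from the (rounded) statistic and dof printed in the table.
p_x_z2 = chi2_sf_even_dof(0.34346, 2)    # row: X, Z2 given Z1
p_x_y = chi2_sf_even_dof(0.0346805, 4)   # row: X, Y given M1|Z1
print(p_x_z2, p_x_y)
```

The two recomputed p-values agree with the `p` column to about five decimal places; the small residual comes from the rounding of the printed statistics. With two binary variables and one binary conditioning variable, the sketch yields `dof = 2`, and with two binary conditioning variables `dof = 4`, which matches the `dof` column.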

## Step 2: Check query identifiability

The causal query of interest is the average treatment effect of $X$ on $Y$, defined as:
$E[Y|do(X=1)] - E[Y|do(X=0)]$.
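
Assuming $Y$ takes only the values 0 and 1, as in the rows of `data.head()` shown above, each interventional expectation reduces to a probability. The following is only a restatement of the definition for the binary case, not an additional modeling assumption:

$$
E[Y \mid do(X=x)] = \sum_{y \in \{0,1\}} y \, P(Y=y \mid do(X=x)) = P(Y=1 \mid do(X=x)),
$$

$$
E[Y \mid do(X=1)] - E[Y \mid do(X=0)] = P(Y=1 \mid do(X=1)) - P(Y=1 \mid do(X=0)).
$$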

In [5]:
from y0.algorithm.identify import Identification
from y0.dsl import P, Variable

id_in = Identification.from_expression(
    query=P(Variable("Y") @ Variable("X")),
    graph=graph,
)
id_in

Identification(outcomes="{Y}, treatments="{X}",conditions="set()",  graph="NxMixedGraph(directed=<networkx.classes.digraph.DiGraph object at 0x12d599390>, undirected=<networkx.classes.graph.Graph object at 0x12d083510>)", estimand="P(M1, R1, R2, R3, X, Y, Z1, Z2, Z3)")

The query is identifiable. Hence we can proceed to step 3.

## Step 3: Find nuisance variables and mark them as latent

In [6]:
from eliater.discover_latent_nodes import find_nuisance_variables, mark_nuisance_variables_as_latent

The `find_nuisance_variables` function finds the nuisance variables of the input graph for the given treatments and outcomes.

In [7]:
nuisance_variables = find_nuisance_variables(
    graph, treatments=Variable("X"), outcomes=Variable("Y")
)
nuisance_variables

{R1, R2, R3}

The nuisance variables are $R_1$, $R_2$, and $R_3$.
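
As a rough illustration of what this step computes, the sketch below applies one reading of the nuisance-variable definition (descendants of the nodes on the causal paths from the treatment to the outcome that are not themselves ancestors of the outcome) to a small toy graph. Both the edge list and the helper `toy_nuisance_variables` are hypothetical: the edges are chosen only so that the result resembles the output above, they are not the graph of this example, and the function is not Eliater's implementation.

```python
from collections import defaultdict


def reachable(adjacency, start):
    """Return every node reachable from `start` by following `adjacency`."""
    seen, stack = set(), [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def toy_nuisance_variables(edges, treatment, outcome):
    """Descendants of causal-path nodes that are not ancestors of the outcome."""
    children, parents = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
        parents[dst].append(src)

    ancestors_of_outcome = reachable(parents, outcome)
    # Nodes lying on some directed path treatment -> ... -> outcome.
    on_causal_path = ({treatment} | reachable(children, treatment)) & (
        {outcome} | ancestors_of_outcome
    )
    # Everything downstream of a causal-path node.
    downstream = set()
    for node in on_causal_path:
        downstream |= reachable(children, node)
    return downstream - ancestors_of_outcome - {outcome}


# Hypothetical toy edges (NOT the graph used in this notebook).
toy_edges = [
    ("Z1", "X"), ("Z1", "Y"),   # a confounder of X and Y
    ("X", "M1"), ("M1", "Y"),   # the mediated causal path
    ("M1", "R1"), ("R1", "R2"), ("Y", "R3"),  # downstream-only variables
]
print(sorted(toy_nuisance_variables(toy_edges, "X", "Y")))  # ['R1', 'R2', 'R3']
```

In this toy graph, `Z1` is kept because it is an ancestor of `Y`, while `R1`, `R2`, and `R3` only sit downstream of the causal path and carry no information needed to estimate the effect of `X` on `Y`.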

## Step 4: Simplify the network

The following function, `remove_nuisance_variables`, finds the nuisance variables (step 3), marks them as latent, and then applies Evans' simplification rules to remove them. Running step 3 separately is therefore not required to obtain the result of step 4; we ran it only to illustrate which variables are affected. The new graph obtained in step 4 does not contain the nuisance variables.

In [8]:
from eliater.discover_latent_nodes import remove_nuisance_variables

In [9]:
new_graph = remove_nuisance_variables(graph, treatments=Variable("X"), outcomes=Variable("Y"))

## Step 5: Estimate the query

In [10]:
from y0.algorithm.estimation import estimate_ace

In [19]:
ATE_value = estimate_ace(
    graph=new_graph, treatments=Variable("X"), outcomes=Variable("Y"), data=data
)
ATE_value

0.20915697893053198

The estimated ATE is approximately 0.21, meaning that the average effect of $X$ on $Y$ is positive.

## Evaluation criterion

Because the data set is synthetic, we can generate two interventional data sets: one in which $X$ is set to 1 and one in which $X$ is set to 0. The ground-truth ATE is the mean of $Y$ in the first data set minus the mean of $Y$ in the second, which gives ATE = 0.01. A positive ATE indicates that increasing $X$ increases $Y$ on average.

In [20]:
# get interventional data where X is set to 1
intv_data_X_1 = single_mediator_with_multiple_confounders_nuisances_discrete_example.generate_data(
    num_samples=500, seed=500, treatments={Variable("X"): 1}
)

# get interventional data where X is set to 0
intv_data_X_0 = single_mediator_with_multiple_confounders_nuisances_discrete_example.generate_data(
    num_samples=500, seed=500, treatments={Variable("X"): 0}
)

# get the true value of ATE
print(np.mean(intv_data_X_1["Y"]) - np.mean(intv_data_X_0["Y"]))

0.010000000000000009
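
The same evaluation idea can be sketched on a toy model where the true ATE is known in closed form. The structural equations and coefficients below are hypothetical and are not those of the example's `generate_data`; the sketch only mirrors the procedure above (same seed in both interventional arms, difference of the means of `Y`) and contrasts it with a naive observational comparison that ignores confounding.

```python
import random


def simulate(num_samples, seed, do_x=None):
    """Sample a toy binary model Z -> X -> M -> Y with Z -> Y; optionally fix X."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_samples):
        z = int(rng.random() < 0.5)
        # Always draw the same noise terms so both arms share random numbers.
        u_x, u_m, u_y = rng.random(), rng.random(), rng.random()
        x = int(u_x < 0.2 + 0.6 * z) if do_x is None else do_x
        m = int(u_m < 0.2 + 0.6 * x)
        y = int(u_y < 0.1 + 0.5 * m + 0.2 * z)
        rows.append({"Z": z, "X": x, "M": m, "Y": y})
    return rows


def mean_y(rows):
    """Average of the outcome column."""
    return sum(row["Y"] for row in rows) / len(rows)


n, seed = 20_000, 500

# Ground truth by intervention: E[Y | do(X=1)] - E[Y | do(X=0)].
ate_interventional = mean_y(simulate(n, seed, do_x=1)) - mean_y(simulate(n, seed, do_x=0))

# Naive contrast on observational data, biased by the confounder Z.
obs = simulate(n, seed)
naive = mean_y([r for r in obs if r["X"] == 1]) - mean_y([r for r in obs if r["X"] == 0])

print(f"interventional ATE: {ate_interventional:.3f} (analytic value: 0.5 * 0.6 = 0.300)")
print(f"naive observational contrast: {naive:.3f}")
```

In this toy model the interventional difference recovers the analytic ATE up to sampling noise, whereas the naive contrast is inflated because `Z` raises both `X` and `Y`. A causal estimator applied to observational data is judged by how close it comes to the interventional value.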
