# Case study 2: The SARS-CoV-2 model

In [1]:
!pip install git+https://github.com/y0-causal-inference/eliater.git@linear-regression

Collecting git+https://github.com/y0-causal-inference/eliater.git@linear-regression
  Cloning https://github.com/y0-causal-inference/eliater.git (to revision linear-regression) to /private/var/folders/fs/kx46_43x04ndj3yryggvkg5r0000gn/T/pip-req-build-yzdp9x6b
  Running command git clone --quiet https://github.com/y0-causal-inference/eliater.git /private/var/folders/fs/kx46_43x04ndj3yryggvkg5r0000gn/T/pip-req-build-yzdp9x6b
  Running command git checkout -b linear-regression --track origin/linear-regression
  Switched to a new branch 'linear-regression'
  Branch linear-regression set up to track remote branch linear-regression from origin.
  Resolved https://github.com/y0-causal-inference/eliater.git to commit 3de3be53d14e16b9ec8831bf35478f34ee04a76f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting y0>=0.2.5 (from eliater==0.0.1.dev0)
  Obtainin

In [18]:
from eliater.frontdoor_backdoor_discrete import sars_large_example
import numpy as np

This is case study 1 in Figure 6 of the paper: Eliater: an open source software for causal query estimation from observational measurements of biomolecular networks. The figure below shows the SARS-CoV-2 network (Mohammad-Taheri et al., 2022; Zucker et al., 2021), which models the activation of Cytokine Release Syndrome (Cytokine Storm), a known cause of tissue damage in severely ill SARS-CoV-2 patients (Ulhaq and Soraya, 2020).

In [3]:
graph = sars_large_example.graph

This case study uses synthetic observational data, generated following common practice in biological modeling. For each endogenous variable $X$, including $EGFR$, we represented biomolecular reactions using Hill equations (Alon, 2019). Specifically, we generated observations of each node $X$ from a Binomial distribution with probability $\frac{1}{1 + \exp(\mathbf{\theta}^{\prime} Pa(X) + \theta_0)}$, where $Pa(X)$ is a $q \times 1$ vector of measurements of the parents of $X$, $\mathbf{\theta}^{\prime}$ is a $1 \times q$ parameter vector, and $\theta_0$ is a scalar. The exogenous variables were simulated from a Binomial distribution whose probability was drawn at random between 0.4 and 0.8.
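The sampling scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind `sars_large_example.generate_data`: the helper names, the two exogenous parents, and the coefficient values are hypothetical, and the Binomial distribution is assumed to have a single trial (one binary draw per observation).

```python
import math
import random


def sample_exogenous(rng, n):
    """Draw n binary values with a success probability picked uniformly in [0.4, 0.8]."""
    p = rng.uniform(0.4, 0.8)
    return [1 if rng.random() < p else 0 for _ in range(n)]


def sample_endogenous(rng, parent_columns, theta, theta0):
    """Draw one binary value per row with probability 1 / (1 + exp(theta' Pa(X) + theta0))."""
    n = len(parent_columns[0])
    values = []
    for i in range(n):
        linear = theta0 + sum(t * col[i] for t, col in zip(theta, parent_columns))
        prob = 1.0 / (1.0 + math.exp(linear))
        values.append(1 if rng.random() < prob else 0)
    return values


rng = random.Random(1)
n = 1000
u1 = sample_exogenous(rng, n)  # hypothetical exogenous parent
u2 = sample_exogenous(rng, n)  # hypothetical exogenous parent
# Hypothetical coefficients: with this sign convention, a negative theta
# raises the probability of X = 1 when the corresponding parent is active.
x = sample_endogenous(rng, [u1, u2], theta=[-1.5, 0.8], theta0=0.2)
print(sum(x) / n)  # empirical frequency of X = 1
```

Each endogenous node in a larger network would be sampled the same way, in topological order, once its parents have been generated.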

In [41]:
# get observational data
data = sars_large_example.generate_data(1000, seed=1)

In [42]:
data.head()

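|   | SARS_COV2 | ACE2 | Ang | AGTR1 | ADAM17 | Toci | Sil6r | EGF | TNF | Gefi | EGFR | PRR | NFKB | IL6STAT3 | IL6AMP | cytok |
|---|-----------|------|-----|-------|--------|------|-------|-----|-----|------|------|-----|------|----------|--------|-------|
| 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |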


## Step 1: Verify correctness of the network structure

In [45]:
from eliater.network_validation import print_graph_falsifications

In [46]:
print_graph_falsifications(graph=graph, data=data, method="chi-square", verbose=True, significance_level=0.01)

Failed tests: 0/99 (0.00%)
Reject null hypothesis when p<0.01
left       right      given                 stats          p    dof    p_adj  p_adj_significant
AGTR1      SARS_COV2  Ang              0.277998    0.870229       2        1  False
IL6STAT3   NFKB       ADAM17|EGFR      0.654706    0.956795       4        1  False
ACE2       EGF        Ang              0.890781    0.640574       2        1  False
EGF        IL6STAT3   ADAM17           0           1              2        1  False
SARS_COV2  TNF        Ang              1.50292     0.471677       2        1  False
ADAM17     SARS_COV2  Ang              0.362015    0.834429       2        1  False
Ang        PRR        SARS_COV2        0.600869    0.740496       2        1  False
AGTR1      TNF        ADAM17           1.76737     0.413257       2        1  False
EGF        TNF        ADAM17           0.0655956   0.967734       2        1  False
IL6STAT3   TNF        ADAM17           0.705062    0.702907       2        1  False
PR

None of the tests failed for this graph. This is expected: the data are simulated according to the graph structure, so the d-separations implied by the network should be consistent with the data.
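To make the `stats`, `p`, and `dof` columns above more concrete, here is a minimal standard-library sketch of a conditional chi-square independence test for binary variables. It is an illustration under stated assumptions, not necessarily how `print_graph_falsifications` computes its results: the function names and the toy data are hypothetical, and the test simply sums the Pearson statistic over the strata of the conditioning set.

```python
import math
import random
from collections import Counter, defaultdict


def chi2_sf(x, dof):
    """Upper tail probability of a chi-square variable, via the incomplete-gamma recurrence."""
    h = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < dof / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1.0
    return q


def conditional_chi_square(rows, left, right, given):
    """Sum 2x2 Pearson statistics over the strata defined by the `given` variables."""
    strata = defaultdict(Counter)
    for row in rows:
        strata[tuple(row[g] for g in given)][(row[left], row[right])] += 1
    stat, dof = 0.0, 0
    for counts in strata.values():
        n = sum(counts.values())
        row_tot = [counts[(a, 0)] + counts[(a, 1)] for a in (0, 1)]
        col_tot = [counts[(0, b)] + counts[(1, b)] for b in (0, 1)]
        if 0 in row_tot or 0 in col_tot:
            continue  # a degenerate stratum carries no information
        for a in (0, 1):
            for b in (0, 1):
                expected = row_tot[a] * col_tot[b] / n
                stat += (counts[(a, b)] - expected) ** 2 / expected
        dof += 1  # (2 - 1) * (2 - 1) degrees of freedom per stratum
    return stat, dof, chi2_sf(stat, dof)


# Hypothetical toy data in which X and Y are independent given Z.
rng = random.Random(0)
rows = []
for _ in range(1000):
    z = int(rng.random() < 0.5)
    rows.append({
        "Z": z,
        "X": int(rng.random() < 0.3 + 0.4 * z),
        "Y": int(rng.random() < 0.6 - 0.3 * z),
    })

stat, dof, p_value = conditional_chi_square(rows, "X", "Y", given=["Z"])
print(stat, dof, p_value)
```

With one binary conditioning variable the sketch yields two degrees of freedom, and with two it yields four, which matches the `dof` column above. Plugging the first row's statistic (0.277998 with 2 degrees of freedom) into `chi2_sf` gives a p-value that agrees with the reported 0.870229 to about four decimal places.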

## Step 2: Check query identifiability

In [8]:
from y0.algorithm.identify import Identification
from y0.dsl import P, Variable

id_in = Identification.from_expression(
    query=P(Variable('cytok') @ Variable('EGFR')),
    graph=graph,
)
id_in

Identification(outcomes="{cytok}, treatments="{EGFR}",conditions="set()",  graph="NxMixedGraph(directed=<networkx.classes.digraph.DiGraph object at 0x12316ec50>, undirected=<networkx.classes.graph.Graph object at 0x1231706d0>)", estimand="P(ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci, cytok)")

The query is identifiable.

## Step 3: Find nuisance variables and mark them as latent

In [9]:
from eliater.discover_latent_nodes import find_nuisance_variables

This function finds the nuisance variables for the input graph.

In [10]:
nuisance_variables = find_nuisance_variables(graph, treatments=Variable("EGFR"), outcomes=Variable("cytok"))
nuisance_variables

set()

No variables are identified as nuisance variables. Hence the simplification in the next step will produce a graph identical to the original graph.
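For intuition about what such a search looks for, the sketch below works on a toy graph. It assumes the following working definition, which is not stated in this notebook: a nuisance variable is a descendant of a node on a directed path from the treatment to the outcome that is not itself an ancestor of the outcome. The function names and the toy graph are hypothetical, and this is not the implementation of `find_nuisance_variables`.

```python
def reachable(children, start):
    """Return every node reachable from `start` along directed edges (excluding `start`)."""
    seen, stack = set(), [start]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def reverse(children):
    """Flip every edge so that reachability yields ancestors instead of descendants."""
    parents = {}
    for node, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(node)
    return parents


def find_nuisance_sketch(children, treatment, outcome):
    """Assumed definition: descendants of causal-path nodes that are not ancestors of the outcome."""
    desc_t = reachable(children, treatment)
    anc_y = reachable(reverse(children), outcome)
    on_causal_paths = (desc_t | {treatment}) & (anc_y | {outcome})
    below_paths = set()
    for node in on_causal_paths:
        below_paths |= reachable(children, node)
    return below_paths - anc_y - {outcome}


# Hypothetical toy graph: T -> M -> Y with confounder C and side branches N1, N2, N3.
toy = {"C": ["T", "Y"], "T": ["M"], "M": ["Y", "N1"], "N1": ["N3"], "Y": ["N2"]}
nuisance = find_nuisance_sketch(toy, "T", "Y")
print(sorted(nuisance))
```

Under this definition, an empty result such as the one above is what one would expect when every descendant of the treatment is also an ancestor of the outcome.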

## Step 4: Simplify the network

The following function finds the nuisance variables (Step 3), marks them as latent, and then applies Evans' simplification rules to remove them. As there are no nuisance variables, the new graph is the same as the original graph.

In [11]:
from eliater.discover_latent_nodes import remove_nuisance_variables

In [12]:
new_graph = remove_nuisance_variables(graph, treatments=Variable("EGFR"), outcomes=Variable("cytok"))

## Step 5: Estimate the query

In [13]:
from y0.algorithm.estimation import estimate_ace

In [40]:
estimate_ace(new_graph, treatments=Variable("EGFR"), outcomes=Variable("cytok"), data=data)

-0.4476229407307878

## Evaluation criterion
Because we used a synthetic data set, we could generate two interventional data sets, one with EGFR set to 1 and the other with EGFR set to 0. The ground-truth ATE is the difference between the average values of Cytokine Storm in the two interventional data sets, which the cells below compute as ATE=-0.02. The negative ATE indicates that the Gefitinib drug can reduce the increase in Cytokine Storm levels, and hence can help in treating patients with SARS-CoV-2.

In [29]:
# get interventional data where EGFR is set to 1
intv_data_EGFR_1 = sars_large_example.generate_data(num_samples=1000, seed=1, treatments={Variable('EGFR'): 1})

# get interventional data where EGFR is set to 0
intv_data_EGFR_0 = sars_large_example.generate_data(num_samples=1000, seed=1, treatments={Variable('EGFR'): 0})

In [30]:
# get the true value of the ATE
print(np.mean(intv_data_EGFR_1['cytok']) - np.mean(intv_data_EGFR_0['cytok']))

-0.018000000000000016


The estimated $\widehat{\mathrm{ATE}}=-0.45$ from Step 5 agrees in sign with the ground truth ATE=-0.02 but is considerably larger in magnitude. The discrepancy is due to the non-linear and complex data generation procedure, which resembles real-life experimental artifacts, and to the approximate nature of the modeling assumptions.
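The evaluation idea, computing a ground-truth ATE from two interventional data sets, can be illustrated on a toy structural causal model. The sketch below is a hypothetical three-variable example (confounder `C`, treatment `T`, outcome `Y`) with made-up probabilities, chosen so that the true ATE is -0.3 by construction. It does not reproduce the SARS-CoV-2 generator.

```python
import random


def bernoulli(rng, p):
    """Return 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def generate(num_samples, seed, do_t=None):
    """Sample the toy model; if `do_t` is given, T is fixed to it (an intervention)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_samples):
        c = bernoulli(rng, 0.5)
        # T is always drawn so that both interventions consume the same random stream.
        t_natural = bernoulli(rng, 0.2 + 0.6 * c)
        t = t_natural if do_t is None else do_t
        y = bernoulli(rng, 0.7 - 0.3 * t + 0.2 * c)
        rows.append({"C": c, "T": t, "Y": y})
    return rows


def mean(rows, column):
    return sum(row[column] for row in rows) / len(rows)


intv_t1 = generate(num_samples=5000, seed=1, do_t=1)
intv_t0 = generate(num_samples=5000, seed=1, do_t=0)
true_ate = mean(intv_t1, "Y") - mean(intv_t0, "Y")
print(true_ate)  # close to the analytic value of -0.3
```

Using the same seed for both interventions, as in the cells above, means the two data sets share their random draws, which reduces the Monte Carlo noise in the difference.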