# Simulation with a SARS-CoV-2 Model

In [1]:
import numpy as np

from eliater import version_df
from eliater.discover_latent_nodes import find_nuisance_variables, remove_nuisance_variables
from eliater.examples import sars_cov_2_example_discrete as example
from eliater.network_validation import print_graph_falsifications
from y0.algorithm.estimation import estimate_ace
from y0.algorithm.identify import Identification, identify_outcomes
from y0.dsl import P, Variable

version_df()

| | key | value |
|---|---|---|
| 0 | eliater | 0.0.1-dev-554fd195 |
| 1 | y0 | 0.2.8-dev-05cee105 |
| 2 | Run at | 2024-01-19 09:29:20 |


In [2]:
treatment = Variable("EGFR")
outcome = Variable("cytok")
SEED = 1

This is case study 1 in Figure 6 of the paper *Eliater: an open source software for causal query estimation from observational measurements of biomolecular networks*. The figure below shows the SARS-CoV-2 network (Mohammad-Taheri et al., 2022; Zucker et al., 2021), which models the activation of Cytokine Release Syndrome (Cytokine Storm), a known cause of tissue damage in severely ill SARS-CoV-2 patients (Ulhaq and Soraya, 2020).

![sars](../img/sars_cov2.png)

In [3]:
graph = example.graph

This case study used synthetic observational data, generated in a way inspired by common biological practice. For each endogenous variable $X$, including $EGFR$, we represented biomolecular reactions using Hill equations (Alon, 2019). Specifically, we generated observations of each node $X$ from a Binomial distribution with success probability $\frac{1}{1 + \exp(\mathbf{\theta}^{\prime} Pa(X) + \theta_0)}$, where $Pa(X)$ is a $q \times 1$ vector of measurements of the parents of $X$, $\mathbf{\theta}^{\prime}$ is a $1 \times q$ parameter vector, and $\theta_0$ is a scalar. The exogenous variables were simulated from a Binomial distribution whose success probability was drawn at random between 0.4 and 0.8.
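As a minimal sketch of this sampling scheme, the snippet below draws one exogenous parent and one endogenous child using the probability formula exactly as written above. It assumes a single-trial Binomial (Bernoulli) draw, which is consistent with the 0/1 values in the data but is not confirmed by the text; the coefficients `theta` and `theta_0` and the helper names are hypothetical and are not taken from the `eliater` example module.

```python
import math
import random


def activation_probability(parents, theta, theta_0):
    """Return 1 / (1 + exp(theta' Pa(X) + theta_0)), as written in the text."""
    linear = sum(t * p for t, p in zip(theta, parents)) + theta_0
    return 1.0 / (1.0 + math.exp(linear))


def sample_node(parents, theta, theta_0, rng):
    """Draw one binary observation of X given its parents' values."""
    return int(rng.random() < activation_probability(parents, theta, theta_0))


rng = random.Random(1)

# Exogenous variable: success probability drawn uniformly from [0.4, 0.8].
p_exo = rng.uniform(0.4, 0.8)
parent_draws = [int(rng.random() < p_exo) for _ in range(1000)]

# Endogenous variable with one parent; theta and theta_0 are made-up values.
child_draws = [
    sample_node([u], theta=[-2.0], theta_0=1.0, rng=rng) for u in parent_draws
]
```

With these illustrative coefficients, a parent value of 1 raises the child's activation probability to $1/(1+e^{-1})$, and a parent value of 0 lowers it to $1/(1+e)$.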

In [4]:
# get observational data
data = example.generate_data(1000, seed=SEED)
data.head()

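| | SARS_COV2 | ACE2 | Ang | AGTR1 | ADAM17 | Toci | Sil6r | EGF | TNF | Gefi | EGFR | PRR | NFKB | IL6STAT3 | IL6AMP | cytok |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |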


## Step 1: Verify correctness of the network structure

In [5]:
print_graph_falsifications(
    graph=graph, data=data, method="chi-square", verbose=True, significance_level=0.01
)

All 99 d-separations implied by the network's structure are consistent with the data, meaning that none of the data-driven conditional independency tests' null hypotheses were rejected at p<0.01.

Finished in 12.13 seconds.


| | left | right | given | stats | p | dof | p_adj | p_adj_significant |
|---|---|---|---|---|---|---|---|---|
| 0 | ACE2 | IL6STAT3 | AGTR1 | 7.227271 | 0.026954 | 2 | 1.0 | False |
| 1 | Ang | EGFR | EGF | 2.169821 | 0.337932 | 2 | 1.0 | False |
| 2 | Ang | Toci | | 0.710932 | 0.399135 | 1 | 1.0 | False |
| 3 | Sil6r | cytok | IL6AMP | 1.034914 | 0.596034 | 2 | 1.0 | False |
| 4 | IL6STAT3 | Toci | Sil6r | 3.610024 | 0.164472 | 2 | 1.0 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | ADAM17 | IL6STAT3 | Sil6r | 0.267183 | 0.874947 | 2 | 1.0 | False |
| 95 | Gefi | IL6STAT3 | | 0.092733 | 0.760730 | 1 | 1.0 | False |
| 96 | Ang | Gefi | | 0.028232 | 0.866566 | 1 | 1.0 | False |
| 97 | ADAM17 | IL6AMP | IL6STAT3\|NFKB | 1.631462 | 0.442316 | 2 | 1.0 | False |


None of the tests failed for this graph. Because the data were simulated according to the graph structure, we expect the d-separations implied by the network to be supported by the data.
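To make the `stats`, `dof`, and `p` columns above more concrete, here is a rough sketch of a chi-square conditional independence test for discrete data. It assumes the conditional statistic is the sum of per-stratum Pearson statistics without continuity correction; whether `print_graph_falsifications` computes it exactly this way is an assumption, and all function names below are hypothetical. The closed-form tail probabilities for 1 and 2 degrees of freedom are standard results.

```python
import math
from collections import Counter


def pearson_chi2(pairs):
    """Pearson chi-square statistic and dof for a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    xs = Counter(x for x, _ in pairs)
    ys = Counter(y for _, y in pairs)
    stat = 0.0
    for x in xs:
        for y in ys:
            expected = xs[x] * ys[y] / n
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat, (len(xs) - 1) * (len(ys) - 1)


def conditional_chi2(rows):
    """Sum the statistic and dof over strata of the conditioning value z."""
    strata = {}
    for x, y, z in rows:
        strata.setdefault(z, []).append((x, y))
    results = [pearson_chi2(pairs) for pairs in strata.values()]
    return sum(s for s, _ in results), sum(d for _, d in results)


def chi2_sf(stat, dof):
    """Upper-tail p-value of the chi-square distribution for dof 1 or 2."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if dof == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles dof 1 or 2")


# p-values recomputed from the reported statistics in rows 0 and 2 of the table.
p_row_0 = chi2_sf(7.227271, 2)  # ACE2 vs IL6STAT3 given AGTR1
p_row_2 = chi2_sf(0.710932, 1)  # Ang vs Toci, no conditioning set
```

Applied to the reported statistics, `chi2_sf` agrees with the `p` column for those rows. For binary variables, each stratum of a binary conditioning variable contributes one degree of freedom, which matches the `dof` of 2 shown for the conditional tests.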

## Step 2: Check query identifiability

In [6]:
identify_outcomes(graph=graph, treatments=treatment, outcomes=outcome)

Sum[ACE2, ADAM17, AGTR1, Ang, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci](P(ACE2 | SARS_COV2) * P(AGTR1 | ACE2, Ang, SARS_COV2) * P(IL6AMP | ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci) * P(IL6STAT3 | ACE2, ADAM17, AGTR1, Ang, SARS_COV2, Sil6r, Toci) * P(TNF | ACE2, ADAM17, AGTR1, Ang, SARS_COV2) * P(cytok | ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci) * Sum[ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, cytok](P(ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci, cytok)) * P(ADAM17 | ACE2, AGTR1, Ang, SARS_COV2, Toci) * P(Sil6r | ACE2, ADAM17, AGTR1, Ang, SARS_COV2, Toci) * P(Ang | ACE2, SARS_COV2) * P(SARS_COV2) * P(NFKB | ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, PRR, SARS_COV2, TNF) * P(PRR | ACE2, Gefi, SARS_COV2))

The query is identifiable.

## Step 3: Find nuisance variables and mark them as latent

The `find_nuisance_variables` function finds the nuisance variables for the input graph.

In [7]:
nuisance_variables = find_nuisance_variables(graph, treatments=treatment, outcomes=outcome)
nuisance_variables

set()

No variables are identified as nuisance variables. Hence the simplification in the next step will produce a graph that is the same as the original graph.
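For intuition about what such a search looks for, the sketch below works on a toy directed graph. It assumes that nuisance variables are the descendants of the variables on causal paths from treatment to outcome that are not themselves ancestors of the outcome. That definition is our reading of the Eliater documentation and is not stated in this notebook; the toy graph and the function names are hypothetical, and this is not the library's implementation.

```python
def descendants(children, start):
    """All nodes reachable from `start` along directed edges, excluding `start`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def ancestors(children, target):
    """All nodes that have a directed path to `target`."""
    return {node for node in children if target in descendants(children, node)}


def toy_nuisance_variables(children, treatment, outcome):
    """Descendants of causal-path nodes that are not ancestors of the outcome."""
    outcome_ancestors = ancestors(children, outcome)
    # Nodes lying on some directed path from treatment to outcome.
    on_causal_paths = ({treatment} | descendants(children, treatment)) & (
        outcome_ancestors | {outcome}
    )
    reachable = set()
    for node in on_causal_paths:
        reachable |= descendants(children, node)
    return reachable - outcome_ancestors - {outcome}


# Toy graph: Z -> X -> M -> Y, plus a side branch M -> N.
toy_graph = {"Z": ["X"], "X": ["M"], "M": ["Y", "N"], "Y": [], "N": []}
toy_result = toy_nuisance_variables(toy_graph, "X", "Y")

# Without the side branch, the result is empty, as for the SARS-CoV-2 graph.
chain_result = toy_nuisance_variables({"X": ["M"], "M": ["Y"], "Y": []}, "X", "Y")
```

In the toy graph only `N` qualifies: it is downstream of the causal path but does not influence the outcome, so under this definition it carries no information needed for the query.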

## Step 4: Simplify the network

The following function finds the nuisance variables (Step 3), marks them as latent, and then applies Evans' simplification rules to remove them. As there are no nuisance variables here, the new graph is the same as the original graph.

In [8]:
new_graph = remove_nuisance_variables(graph, treatments=treatment, outcomes=outcome)

## Step 5: Estimate the query

In [9]:
estimate_ace(new_graph, treatments=treatment, outcomes=outcome, data=data)

-0.21594360092118517

## Evaluation criterion
Because we used a synthetic data set, we were able to generate two interventional data sets, one in which EGFR was set to 1 and one in which EGFR was set to 0. The ATE was calculated as the difference between the average values of Cytokine Storm in the two interventional data sets, resulting in the ground truth ATE=-0.44. The negative ATE indicates that the Gefitinib drug can reduce the increase in Cytokine Storm levels and hence can help in treating patients with SARS-CoV-2.
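
In symbols, with $do(\cdot)$ denoting an intervention, this quantity can be written as follows. The sample-mean notation $\bar{y}_t$ is introduced here only for illustration and stands for the mean of `cytok` in the interventional data set where EGFR is set to $t$.

$$
\mathrm{ATE} = \mathbb{E}[\mathit{cytok} \mid do(\mathit{EGFR}=1)] - \mathbb{E}[\mathit{cytok} \mid do(\mathit{EGFR}=0)] \approx \bar{y}_{1} - \bar{y}_{0}
$$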

In [10]:
def get_background_ace(seed=None) -> float:
    # get interventional data where treatment is set to 1
    data_1 = example.generate_data(1000, {treatment: 1.0}, seed=seed)
    # get interventional data where treatment is set to 0
    data_0 = example.generate_data(1000, {treatment: 0.0}, seed=seed)
    return data_1.mean()[outcome.name] - data_0.mean()[outcome.name]


# get the true value of ATE
get_background_ace(seed=SEED)

-0.018000000000000016

The estimate from Step 5 is $\widehat{\mathrm{ATE}}=-0.22$, and the interventional simulation in this notebook returns $-0.02$; both have the same sign as the ground truth ATE=-0.44. The discrepancy in magnitude is due to the non-linear and complex data generation procedure, which resembles real-life experimental artifacts, and to the approximate nature of the modeling assumptions.