# Case study 1: Simulation with a SARS-CoV-2 Model

In [1]:
!pip install git+https://github.com/y0-causal-inference/eliater.git 

In [2]:
import numpy as np

# from eliater import version_df
from eliater.discover_latent_nodes import find_nuisance_variables, remove_nuisance_variables
from src.eliater.examples.sars_cov2 import sars_large_example as example
from src.eliater.examples.sars_cov2 import generate_continuous
from eliater.network_validation import print_graph_falsifications
from y0.algorithm.estimation import estimate_ace
from y0.algorithm.identify import Identification, identify_outcomes
from y0.dsl import P, Variable

# version_df()

In [3]:
treatment = Variable("EGFR")
outcome = Variable("cytok")
SEED = 100

This is case study 1 in Figure 6 of the paper "Eliater: an open source software for causal query estimation from observational measurements of biomolecular networks". The figure below shows the SARS-CoV-2 network (Mohammad-Taheri et al., 2022; Zucker et al., 2021), which models the activation of Cytokine Release Syndrome (Cytokine Storm), a known cause of tissue damage in severely ill SARS-CoV-2 patients (Ulhaq and Soraya, 2020).

In [4]:
graph = example.graph

This case study used synthetic observational data, generated in a way inspired by common biological practice. For each endogenous variable $X$, including $EGFR$, we represented biomolecular reactions using Hill equations (Alon, 2019). Specifically, we generated observations of each node $X$ from a Binomial distribution with probability $\frac{1}{1 + \exp(\mathbf{\theta}^{\prime} Pa(X) + \theta_0)}$, where $Pa(X)$ is a $q \times 1$ vector of measurements of the parents of $X$, $\mathbf{\theta}^{\prime}$ is a $1 \times q$ parameter vector, and $\theta_0$ is a scalar. The exogenous variables were simulated from a Binomial distribution with a random probability between 0.4 and 0.8.
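
The sampling scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the generator behind `example.generate_data`: the helper name `sample_node`, the parameter values, and the two-parent toy setup are all hypothetical.

```python
import numpy as np


def sample_node(parents, theta, theta_0, rng):
    """Draw one binary observation of X per row of parent measurements (hypothetical helper)."""
    # Probability 1 / (1 + exp(theta' Pa(X) + theta_0)), as written in the text
    p = 1.0 / (1.0 + np.exp(parents @ theta + theta_0))
    return rng.binomial(n=1, p=p)


rng = np.random.default_rng(100)
n = 1000

# Exogenous variables: Binomial with a random probability between 0.4 and 0.8
p_a = rng.uniform(0.4, 0.8)
p_b = rng.uniform(0.4, 0.8)
a = rng.binomial(1, p_a, size=n)
b = rng.binomial(1, p_b, size=n)

# Endogenous variable X with q = 2 parents; theta and theta_0 are made-up values
parents = np.column_stack([a, b])
theta = np.array([-1.0, 0.5])
x = sample_node(parents, theta, theta_0=0.2, rng=rng)
```

Here each row of `parents` plays the role of $Pa(X)$ for one observation, so the matrix product evaluates $\mathbf{\theta}^{\prime} Pa(X)$ for all observations at once.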

In [5]:
# get observational data
data = example.generate_data(1000, seed=SEED)
data.head()

Unnamed: 0,SARS_COV2,ACE2,Ang,AGTR1,ADAM17,Toci,Sil6r,EGF,TNF,Gefi,EGFR,PRR,NFKB,IL6STAT3,IL6AMP,cytok
0,73.679109,17.57684,93.861006,101.151905,98.54018,56.005886,69.424218,96.505141,99.296894,33.424504,1,97.370407,29.751989,29.366957,38.082656,59.631856
1,56.620967,31.727878,78.949305,99.479725,99.297518,56.022003,64.051996,98.54297,98.511394,47.897558,1,89.538987,38.158193,22.415073,37.129663,56.431981
2,65.01323,24.644308,89.7309,100.509177,99.437188,49.753317,82.287237,91.929421,99.807326,52.808541,1,94.206983,34.403547,44.666545,52.56607,77.488187
3,76.249522,19.484148,98.286645,98.466683,98.52875,55.121697,74.968228,97.613234,99.470404,50.439736,1,97.49369,29.059414,37.505798,41.889477,63.836998
4,76.961345,18.504758,92.974745,99.68678,97.174635,45.153245,55.506413,95.726874,99.440437,35.386174,1,98.792294,32.300719,19.319097,33.277613,52.521325


In [11]:
from sklearn.preprocessing import KBinsDiscretizer
# discretize the raw data into three equal-width ordinal bins per column
kbins = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
data_trans = kbins.fit_transform(data)


In [16]:
import pandas as pd
data_trans = pd.DataFrame(data_trans, columns = data.columns)
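
To see what equal-width binning with three bins does to a single column, the idea behind `strategy='uniform'` can be reproduced by hand. The sketch below uses made-up values and a hypothetical helper name (`uniform_bins`); it illustrates the concept and is not part of Eliater or scikit-learn.

```python
import numpy as np


def uniform_bins(values, n_bins=3):
    """Assign each value to one of n_bins equal-width bins (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    # Equal-width edges spanning the observed range of the column
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Compare against interior edges only, so the maximum lands in the last bin
    return np.digitize(values, edges[1:-1])


toy_column = [0.0, 10.0, 35.0, 60.0, 90.0]
codes = uniform_bins(toy_column)
print(codes.tolist())  # [0, 0, 1, 2, 2]
```

With a range of 0 to 90, the interior edges are 30 and 60, so a value exactly on an edge (60) is assigned to the upper bin.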

In [17]:
data_trans

Unnamed: 0,SARS_COV2,ACE2,Ang,AGTR1,ADAM17,Toci,Sil6r,EGF,TNF,Gefi,EGFR,PRR,NFKB,IL6STAT3,IL6AMP,cytok
0,2.0,0.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,0.0,2.0,2.0,0.0,1.0,1.0,1.0
1,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,1.0,0.0,0.0,1.0
2,1.0,0.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0
3,2.0,0.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,0.0,1.0,1.0,1.0
4,2.0,0.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,0.0,2.0,2.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,1.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0
996,1.0,1.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,2.0
997,1.0,1.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,0.0,2.0,2.0,1.0,2.0,2.0,2.0
998,1.0,0.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,0.0,2.0,2.0,2.0,1.0,1.0,2.0


## Step 1: Verify correctness of the network structure

In [18]:
print_graph_falsifications(
    graph=graph, data=data_trans, method="chi-square", verbose=True, significance_level=0.01
)

ValueError: using binary data test (chi-square) on continuous data

None of the tests failed for this graph. This is expected: the data were simulated according to the graph structure, so the d-separations implied by the network should be supported by the data.
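
Each d-separation implied by the graph corresponds to a (conditional) independence that can be tested on discrete data. As a rough sketch of the underlying idea, the code below computes a Pearson chi-square statistic for pairs of simulated three-level variables. It covers only the unconditional case, uses made-up data, and the helper `chi_square_statistic` is hypothetical; it is not how `print_graph_falsifications` is implemented.

```python
import numpy as np


def chi_square_statistic(x, y, n_levels=3):
    """Pearson chi-square statistic for two integer-coded variables (hypothetical helper)."""
    # Contingency table of observed counts
    observed = np.zeros((n_levels, n_levels))
    np.add.at(observed, (x, y), 1)
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())


rng = np.random.default_rng(100)
x = rng.integers(0, 3, size=1000)
y_independent = rng.integers(0, 3, size=1000)  # generated independently of x
y_dependent = x.copy()                         # fully determined by x

stat_independent = chi_square_statistic(x, y_independent)
stat_dependent = chi_square_statistic(x, y_dependent)

# Critical value of the chi-square distribution with (3-1)*(3-1) = 4 degrees
# of freedom at significance level 0.01
CRITICAL_VALUE = 13.277
print(stat_independent > CRITICAL_VALUE, stat_dependent > CRITICAL_VALUE)
```

A statistic above the critical value rejects independence at the 0.01 level, which for a d-separation test would count as evidence against the graph.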

## Step 2: Check query identifiability

In [18]:
identify_outcomes(graph=graph, treatments=treatment, outcomes=outcome)

Sum[ACE2, ADAM17, AGTR1, Ang, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci](P(ACE2 | SARS_COV2) * P(AGTR1 | ACE2, Ang, SARS_COV2) * P(IL6AMP | ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci) * P(IL6STAT3 | ACE2, ADAM17, AGTR1, Ang, SARS_COV2, Sil6r, Toci) * P(TNF | ACE2, ADAM17, AGTR1, Ang, SARS_COV2) * P(cytok | ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci) * Sum[ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, cytok](P(ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci, cytok)) * P(ADAM17 | ACE2, AGTR1, Ang, SARS_COV2, Toci) * P(Sil6r | ACE2, ADAM17, AGTR1, Ang, SARS_COV2, Toci) * P(Ang | ACE2, SARS_COV2) * P(SARS_COV2) * P(NFKB | ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, PRR, SARS_COV2, TNF) * P(PRR | ACE2, Gefi, SARS_COV2))

The query is identifiable.

## Step 3: Find nuisance variables and mark them as latent

This function finds the nuisance variables for the input graph.

In [19]:
nuisance_variables = find_nuisance_variables(graph, treatments=treatment, outcomes=outcome)
nuisance_variables

set()

No variable is identified as a nuisance variable. Hence, the simplification in the next step will return a graph identical to the original.
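
For intuition about what such a search looks for, the sketch below works on a toy directed graph. It assumes a working definition that is not stated in this notebook: a nuisance variable is a descendant of a node on a directed path from treatment to outcome that is not itself an ancestor of the outcome. The function `toy_nuisance_variables` and the toy edges are hypothetical, and the code ignores latent confounders, so it should not be read as the implementation of `find_nuisance_variables`.

```python
def reachable(adjacency, start):
    """Return all nodes reachable from start by following adjacency (excluding start)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def toy_nuisance_variables(edges, treatment, outcome):
    """Find nuisance variables under the assumed definition (hypothetical helper)."""
    children, parents = {}, {}
    for source, target in edges:
        children.setdefault(source, set()).add(target)
        parents.setdefault(target, set()).add(source)
    ancestors_of_outcome = reachable(parents, outcome) | {outcome}
    # Nodes lying on some directed path from treatment to outcome
    on_causal_paths = (reachable(children, treatment) | {treatment}) & ancestors_of_outcome
    downstream = set()
    for node in on_causal_paths:
        downstream |= reachable(children, node)
    # Keep only descendants that cannot influence the outcome
    return downstream - ancestors_of_outcome


toy_edges = [("T", "M"), ("M", "Y"), ("M", "N1"), ("Y", "N2"), ("C", "T"), ("C", "Y")]
print(sorted(toy_nuisance_variables(toy_edges, "T", "Y")))  # ['N1', 'N2']
```

In this toy graph `N1` and `N2` are downstream of the causal path but do not affect `Y`, so they can be dropped; the confounder `C` is kept because it is an ancestor of the outcome.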

## Step 4: Simplify the network

The following function finds the nuisance variables (step 3), marks them as latent, and then applies Evans' simplification rules to remove them. As there are no nuisance variables, the new graph is the same as the original graph.

In [48]:
new_graph = remove_nuisance_variables(graph, treatments=treatment, outcomes=outcome)

## Step 5: Estimate the query

In [49]:
estimate_ace(new_graph, treatments=treatment, outcomes=outcome, data=data)

0.7826734394202646

## Evaluation criterion
Because we used a synthetic data set, we were able to generate two interventional data sets: one in which EGFR was set to 1 and one in which EGFR was set to 0. The ATE was calculated by subtracting the average value of Cytokine Storm in the second data set from the average in the first, resulting in the ground truth ATE=-0.018. The negative ATE indicates that the Gefitinib drug can reduce the increase in Cytokine Storm levels and hence can help in treating patients with SARS-CoV-2.
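
Written out as a standard definition, the quantity described here is the difference of two interventional expectations, approximated by the difference of sample means over the two simulated data sets. The sample-mean symbols $\bar{y}_1$ and $\bar{y}_0$ are notation introduced for this illustration only:

$$\mathrm{ATE} = \mathbb{E}[\mathrm{cytok} \mid do(\mathrm{EGFR}=1)] - \mathbb{E}[\mathrm{cytok} \mid do(\mathrm{EGFR}=0)] \approx \bar{y}_1 - \bar{y}_0$$

where $\bar{y}_1$ and $\bar{y}_0$ are the average values of cytok in the data sets generated with EGFR set to 1 and 0, respectively.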

In [50]:
def get_background_ace(seed=None) -> float:
    # get interventional data where treatment is set to 1
    data_1 = generate_continuous(1000, {treatment: 1.0}, seed=seed)
    # get interventional data where treatment is set to 0
    data_0 = generate_continuous(1000, {treatment: 0.0}, seed=seed)
    return data_1.mean()[outcome.name] - data_0.mean()[outcome.name]

# get the true value of ATE
get_background_ace(seed=SEED)

0.7851941596613301

In [82]:
# def get_background_ace(seed=None) -> float:
#     # get interventional data where treatment is set to 1
#     data_1 = example.generate_data(1000, {treatment: 1.0}, seed=seed)
#     # get interventional data where treatment is set to 0
#     data_0 = example.generate_data(1000, {treatment: 0.0}, seed=seed)
#     return data_1.mean()[outcome.name] - data_0.mean()[outcome.name]
# 
# # get the true value of ATE
# get_background_ace(seed=SEED)

-0.024999999999999967

The estimated $\widehat{\mathrm{ATE}}=-0.2$ is comparable in sign and magnitude to the ground truth ATE=-0.02. The discrepancy between the two values is due to the non-linear and complex data generation procedure, which resembles real-life experimental artifacts, and to the approximate nature of the modeling assumptions.

In [65]:
# Relative change (%) between the estimate and the ground truth
((-0.22 + 0.018) / -0.018) * 100

1122.2222222222224
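
The same calculation can be wrapped in a small reusable function. This is a sketch; the name `relative_change_percent` is hypothetical, and the inputs are the two values used in the cell above.

```python
def relative_change_percent(estimate, truth):
    """Relative change of an estimate with respect to the ground truth, in percent."""
    return (estimate - truth) / truth * 100


# Estimated ATE of -0.22 against a ground truth ATE of -0.018
change = relative_change_percent(estimate=-0.22, truth=-0.018)
print(round(change, 2))  # 1122.22
```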