# Case study 1: Simulation with a SARS-CoV-2 Model

In [32]:
!pip install git+https://github.com/y0-causal-inference/eliater.git 

Collecting git+https://github.com/y0-causal-inference/eliater.git
  Cloning https://github.com/y0-causal-inference/eliater.git to /private/var/folders/fs/kx46_43x04ndj3yryggvkg5r0000gn/T/pip-req-build-nsw1l29c
  Running command git clone --quiet https://github.com/y0-causal-inference/eliater.git /private/var/folders/fs/kx46_43x04ndj3yryggvkg5r0000gn/T/pip-req-build-nsw1l29c
  Resolved https://github.com/y0-causal-inference/eliater.git to commit 33eb330e6faaa0fce6a0526d4764b800b1c8434c
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [39]:
import numpy as np

# from eliater import version_df
from eliater.discover_latent_nodes import find_nuisance_variables, remove_nuisance_variables
from eliater.examples.sars_cov2 import sars_large_example as example
from eliater.network_validation import print_graph_falsifications
from y0.algorithm.estimation import estimate_ace
from y0.algorithm.identify import Identification, identify_outcomes
from y0.dsl import P, Variable

# version_df()

In [40]:
treatment = Variable("EGFR")
outcome = Variable("cytok")
SEED = 100

This is case study 1, shown in Figure 6 of the paper *Eliater: an open source software for causal query estimation from observational measurements of biomolecular networks*. The figure below shows the SARS-CoV-2 network (Mohammad-Taheri et al., 2022; Zucker et al., 2021), which models the activation of Cytokine Release Syndrome (Cytokine Storm), a known cause of tissue damage in severely ill SARS-CoV-2 patients (Ulhaq and Soraya, 2020).

![sars](../img/SARS_COV.png)

In [41]:
graph = example.graph

This case study uses synthetic observational data. The generation of this synthetic data was inspired by common biological practices. The exogenous variables were modeled with a Gaussian distribution. For each endogenous variable $X$, we represented biomolecular reactions using Hill equations (Alon, 2019), approximated with a sigmoid function as follows:

$\mathcal{N}(\frac{100}{1 + \exp(\mathbf{\theta}^{\prime} Pa(X) + \theta_0)})$ 

where $Pa(X)$ is a $q \times 1$ vector of measurements of the parents of $X$, $\mathbf{\theta}'$ is a $1 \times q$ parameter vector, and $\theta_0$ is a scalar. $EGFR$ was generated from a Binomial distribution with probability $\frac{1}{1 + \exp(\mathbf{\theta}^{\prime} Pa(X) + \theta_0)}$. Hence, the observational data is mixed-type: the $EGFR$ column is binary (discrete), and the rest of the columns are continuous.
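The sampling scheme described above can be sketched in plain Python. This is a minimal illustration, not the generator behind `example.generate_data`: the function names, the parameter values, and the noise standard deviation `noise_sd` (which the formula above does not specify) are assumptions made for the example.

```python
import math
import random


def sigmoid_mean(parents, theta, theta_0, scale=100.0):
    """Return scale / (1 + exp(theta' Pa(X) + theta_0))."""
    linear = sum(t * p for t, p in zip(theta, parents)) + theta_0
    return scale / (1.0 + math.exp(linear))


def sample_continuous(parents, theta, theta_0, noise_sd, rng):
    """Draw a continuous node from a Gaussian centred on the sigmoid-shaped mean."""
    return rng.gauss(sigmoid_mean(parents, theta, theta_0), noise_sd)


def sample_binary(parents, theta, theta_0, rng):
    """Draw a 0/1 node with success probability 1 / (1 + exp(theta' Pa(X) + theta_0))."""
    probability = sigmoid_mean(parents, theta, theta_0, scale=1.0)
    return 1 if rng.random() < probability else 0


rng = random.Random(100)

# Hypothetical parent measurements and parameters, chosen only for illustration.
parents = [0.5, 1.2]
theta = [-0.8, 0.3]
theta_0 = 0.1

x = sample_continuous(parents, theta, theta_0, noise_sd=1.0, rng=rng)
egfr = sample_binary(parents, theta, theta_0, rng=rng)
print(x, egfr)
```

With a linear term of zero the mean sits at the midpoint of the range (50 on the 0 to 100 scale), and large positive or negative linear terms push it toward 0 or 100, which is what gives the continuous columns their bounded, saturating behaviour.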

In [42]:
# Generate 1,000 observational samples
data = example.generate_data(1000, seed=SEED)
data.head()

Unnamed: 0,SARS_COV2,ACE2,Ang,AGTR1,ADAM17,Toci,Sil6r,EGF,TNF,Gefi,EGFR,PRR,NFKB,IL6STAT3,IL6AMP,cytok
0,73.679109,17.57684,93.861006,101.151905,98.54018,56.005886,69.424218,96.505141,99.296894,33.424504,0,97.370407,28.459292,29.366957,37.477632,58.754197
1,56.620967,31.727878,78.949305,99.479725,99.297518,56.022003,64.051996,98.54297,98.511394,47.897558,0,89.538987,36.78086,22.415073,36.491974,55.497898
2,65.01323,24.644308,89.7309,100.509177,99.437188,49.753317,82.287237,91.929421,99.807326,52.808541,0,94.206983,33.08097,44.666545,51.904992,76.795535
3,76.249522,19.484148,98.286645,98.466683,98.52875,55.121697,74.968228,97.613234,99.470404,50.439736,0,97.49369,27.816667,37.505798,41.280651,63.00009
4,76.961345,18.504758,92.974745,99.68678,97.174635,45.153245,55.506413,95.726874,99.440437,35.386174,0,98.792294,30.996968,19.319097,32.713035,51.675823


## Step 1: Verify correctness of the network structure

We checked the consistency of the network structure against the observational data at a significance level of 0.01. Because `Eliater` does not support mixed-type data, we first discretized every column into binary values and then applied the $\chi^2$ test.
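Binarizing a continuous column with two equal-width bins amounts to splitting it at the midpoint of its observed range. The sketch below shows that idea in plain Python; `binarize_uniform` is a hypothetical helper written for illustration, and mapping a constant column to all zeros is an assumption rather than a documented behaviour of any library.

```python
def binarize_uniform(values):
    """Map each value to 0 or 1 using two equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Assumption: a constant column carries no information, so use one bin.
        return [0 for _ in values]
    midpoint = (lo + hi) / 2.0
    # Values at or above the midpoint fall into the upper bin.
    return [1 if v >= midpoint else 0 for v in values]


# Rounded ACE2 values from the first five rows shown above.
ace2 = [17.6, 31.7, 24.6, 19.5, 18.5]
print(binarize_uniform(ace2))
```

Because the split depends only on the minimum and maximum, a single extreme value can push almost every observation into one bin, which leaves very little variation for an independence test to work with.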

In [43]:
from sklearn.preprocessing import KBinsDiscretizer
# Discretize each column of the raw data into two equal-width bins
kbins = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
data_trans = kbins.fit_transform(data)

In [44]:
import pandas as pd
data_trans = pd.DataFrame(data_trans, columns = data.columns)

In [45]:
data_trans

Unnamed: 0,SARS_COV2,ACE2,Ang,AGTR1,ADAM17,Toci,Sil6r,EGF,TNF,Gefi,EGFR,PRR,NFKB,IL6STAT3,IL6AMP,cytok
0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
2,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0
3,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
996,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0
997,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0
998,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0


In [46]:
print_graph_falsifications(
    graph=graph, data=data_trans, method="chi-square", verbose=True, significance_level=0.01
)

ValueError: using binary data test (chi-square) on continuous data

Among all the 99 possible tests, 9 failed (9%). Because the data was generated synthetically from the network structure, we expected all the tests to pass. We attribute the failed tests to noise introduced by randomly sampling the data points.
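For intuition about what a single test does, the following sketch computes Pearson's $\chi^2$ statistic for two binary columns and compares its p-value with the 0.01 significance level. It covers only the unconditional case; the tests implied by the network are conditional independence tests, so this is an illustration of the statistic rather than of the library's procedure. The function names are hypothetical.

```python
import math


def chi_square_2x2(x, y):
    """Pearson chi-square statistic for two equally long sequences of 0/1 values."""
    n = len(x)
    counts = [[0, 0], [0, 0]]
    for a, b in zip(x, y):
        counts[a][b] += 1
    row_totals = [sum(counts[0]), sum(counts[1])]
    col_totals = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                statistic += (counts[i][j] - expected) ** 2 / expected
    return statistic


def p_value_1dof(statistic):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


SIGNIFICANCE_LEVEL = 0.01

x = [0, 0, 1, 1] * 25
independent_y = [0, 1, 0, 1] * 25  # every cell of the 2x2 table has the same count
dependent_y = list(x)              # y copies x exactly

for label, y in [("independent", independent_y), ("dependent", dependent_y)]:
    p = p_value_1dof(chi_square_2x2(x, y))
    verdict = "reject independence" if p < SIGNIFICANCE_LEVEL else "fail to reject"
    print(label, verdict)
```

A test "fails" in the sense used above when the network implies independence but the p-value falls below the significance level.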

## Step 2: Check query identifiability

In [None]:
identify_outcomes(graph=graph, treatments=treatment, outcomes=outcome)

The query is identifiable.

## Step 3: Find nuisance variables and mark them as latent

This function finds the nuisance variables for the input graph.

In [None]:
nuisance_variables = find_nuisance_variables(graph, treatments=treatment, outcomes=outcome)
nuisance_variables

No variable is identified as a nuisance variable. Hence, the simplified network in the next step will be the same as the original graph.
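To make the notion concrete, here is a toy sketch on a plain dictionary graph. It assumes the definition that nuisance variables are descendants of nodes on directed paths from the treatment to the outcome that are not themselves ancestors of the outcome. The helper names and the toy graphs are hypothetical, and the library's own implementation operates on its graph class and may differ in detail.

```python
def reachable(edges, start):
    """Return all nodes reachable from start along directed edges (start excluded)."""
    seen, stack = set(), [start]
    while stack:
        for child in edges.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def reverse(edges):
    """Flip every edge so that reachability yields ancestors instead of descendants."""
    flipped = {}
    for parent, children in edges.items():
        for child in children:
            flipped.setdefault(child, []).append(parent)
    return flipped


def toy_nuisance_variables(edges, treatment, outcome):
    """Descendants of causal-path nodes that are not ancestors of the outcome."""
    outcome_ancestors = reachable(reverse(edges), outcome) | {outcome}
    on_causal_paths = (reachable(edges, treatment) | {treatment}) & outcome_ancestors
    descendants = set()
    for node in on_causal_paths:
        descendants |= reachable(edges, node)
    return descendants - outcome_ancestors


# Toy graph: T -> M -> Y with a confounder Z and a side branch M -> N.
with_side_branch = {"Z": ["T", "Y"], "T": ["M"], "M": ["Y", "N"]}
# Same graph without the side branch: every node is an ancestor of Y.
without_side_branch = {"Z": ["T", "Y"], "T": ["M"], "M": ["Y"]}

print(toy_nuisance_variables(with_side_branch, "T", "Y"))
print(toy_nuisance_variables(without_side_branch, "T", "Y"))
```

The second toy graph mirrors the situation in this case study: when nothing branches off the causal paths without feeding back into the outcome, the set of nuisance variables is empty.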

## Step 4: Simplify the network

In `eliater`, steps 3 and 4 are combined into a single function. The following function finds the nuisance variables (Step 3), marks them as latent, and then applies Evans' simplification rules (Step 4) to remove them. As a result, running the `find_nuisance_variables` and `mark_nuisance_variables_as_latent` functions is not necessary to obtain the result of Step 4; Step 3 is shown separately above only to illustrate its result. As there are no nuisance variables, the new graph will be the same as the original graph.

In [None]:
new_graph = remove_nuisance_variables(graph, treatments=treatment, outcomes=outcome)

## Step 5: Estimate the query

In [None]:
estimate_ace(new_graph, treatments=treatment, outcomes=outcome, data=data)

## Evaluation criterion
Because we used a synthetic data set, we were able to generate two interventional data sets: one in which EGFR was set to 1 and one in which EGFR was set to 0. The ATE was calculated as the average value of Cytokine Storm under EGFR=1 minus the average value under EGFR=0, resulting in the ground truth ATE=0.796. The positive ATE indicates that the Gefitinib drug cannot reduce the Cytokine Storm levels, and hence cannot help in treating patients with SARS-CoV-2.

In [95]:
def get_background_ace(seed=None) -> float:
    """Return the ground-truth ATE from two simulated interventions on the treatment."""
    # Interventional data with the treatment set to 1
    data_1 = generate_continuous(1000, {treatment: 1.0}, seed=seed)
    # Interventional data with the treatment set to 0
    data_0 = generate_continuous(1000, {treatment: 0.0}, seed=seed)
    # Difference in the mean outcome between the two interventions
    return data_1[outcome.name].mean() - data_0[outcome.name].mean()

# get the true value of ATE
get_background_ace(seed=SEED)

0.7965175191822595

The estimated $\widehat{\mathrm{ATE}}=0.605$ is comparable in sign and magnitude to the ground truth ATE=0.796. The discrepancy is due to the non-linear, complex data generation procedure, which resembles real-life experimental artifacts, and to the approximate nature of the modeling assumptions.

In [96]:
# Relative change of the estimate with respect to the ground truth
((0.605 - 0.796)/0.796)

-0.23994974874371866

### Random Sampling Evaluation

In [47]:
# Population => Generate D = 10000 data points
D = example.generate_data(10000, seed=SEED)

In [48]:
# Samples => Generate 1000 datasets with 1000 points each (d) using random sampling

d_count = 1000
d_size = 1000
# Seed each draw so that the resampled data sets are reproducible
d = [D.sample(d_size, random_state=SEED + i) for i in range(d_count)]

In [49]:
ate = [estimate_ace(new_graph, treatments=treatment, outcomes=outcome, data=data) for data in d]

PerfectSeparationError: Perfect separation detected, results not available
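The error above is consistent with a binary variable being perfectly predicted within at least one resampled data set. That reading is an assumption, since the traceback is not shown. The sketch below shows a simple one-covariate check that could be used to screen samples before estimation. `is_perfectly_separated` is a hypothetical helper; separation by a combination of several covariates is more general and is not detected by it.

```python
def is_perfectly_separated(x, y):
    """Return True if one threshold on x splits the 0/1 labels in y without error."""
    x_when_0 = [a for a, b in zip(x, y) if b == 0]
    x_when_1 = [a for a, b in zip(x, y) if b == 1]
    if not x_when_0 or not x_when_1:
        # Only one class is present, so any model predicts it perfectly.
        return True
    return max(x_when_0) < min(x_when_1) or max(x_when_1) < min(x_when_0)


# Hypothetical covariate values and binary labels, for illustration only.
covariate = [1.0, 2.0, 3.0, 4.0]
separated_labels = [0, 0, 1, 1]      # every label-1 value exceeds every label-0 value
overlapping_labels = [0, 1, 0, 1]    # the two classes interleave
single_class_labels = [0, 0, 0, 0]   # the rare class is missing from the sample

print(is_perfectly_separated(covariate, separated_labels))
print(is_perfectly_separated(covariate, overlapping_labels))
print(is_perfectly_separated(covariate, single_class_labels))
```

A rare binary value makes both situations more likely in a subsample of 1,000 rows, so screening or skipping such samples and reporting how many were dropped is one way to let the evaluation run to completion.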