# Case study 2: The SARS-CoV-2 model

In [1]:
!pip install git+https://github.com/y0-causal-inference/eliater.git@linear-regression

Collecting git+https://github.com/y0-causal-inference/eliater.git@linear-regression
  Cloning https://github.com/y0-causal-inference/eliater.git (to revision linear-regression) to c:\users\pnava\appdata\local\temp\pip-req-build-qtw9eg_j
  Resolved https://github.com/y0-causal-inference/eliater.git to commit 3de3be53d14e16b9ec8831bf35478f34ee04a76f
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'


  Running command git clone --filter=blob:none --quiet https://github.com/y0-causal-inference/eliater.git 'C:\Users\pnava\AppData\Local\Temp\pip-req-build-qtw9eg_j'
  Running command git checkout -b linear-regression --track origin/linear-regression
  branch 'linear-regression' set up to track 'origin/linear-regression'.
  Switched to a new branch 'linear-regression'



In [17]:
from eliater.frontdoor_backdoor_discrete import sars_large_example

This is case study 2, shown in Figure 6 of the paper *Eliater: an analytical workflow and open source implementation for causal query estimation in biomolecular networks*.

In [3]:
graph = sars_large_example.graph

We generate the observational data below. The data are of mixed type: the exposure (EGFR) is binary and the remaining variables are continuous. The conditional independence tests in eliater do not currently support mixed-type data, so we discretize the data for use in step 1 only. The remaining steps support mixed-type data, so they use the original data.

In [4]:
# get observational data
data = sars_large_example.generate_data(1000, seed=100)

In [5]:
data.head()

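|   | SARS_COV2 | ACE2 | Ang | AGTR1 | ADAM17 | Toci | Sil6R | EGF | TNF | Gefi | EGFR | PRR | NFKB | IL6STAT3 | IL6AMP | cytok |
|---|-----------|------|-----|-------|--------|------|-------|-----|-----|------|------|-----|------|----------|--------|-------|
| 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |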


In [31]:
# # discretize the data (used for step 1 only)
# import pandas as pd
# from sklearn.preprocessing import KBinsDiscretizer
# # map each column of the raw data to 10 equal-width ordinal bins
# kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
# data_trans = kbins.fit_transform(data)
# print(data_trans)
# column_names = ["SARS_COV2",
#                 "ACE2",
#                 "Ang",
#                 "AGTR1",
#                 "ADAM17",
#                 "Toci",
#                 "Sil6R",
#                 "EGF",
#                 "TNF",
#                 "Gefi",
#                 "EGFR",
#                 "PRR",
#                 "NFKB",
#                 "IL6STAT3",
#                 "IL6AMP",
#                 "cytok",]
# # rebuild a labelled data frame from the transformed array
# obs_data_discrete = pd.DataFrame(data_trans, columns=column_names)
# # summarize first few rows
# obs_data_discrete.head()
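
To make the discretization idea concrete, here is a minimal, dependency-free sketch of equal-width binning, which is roughly what a uniform-strategy discretizer with ordinal encoding does. The helper names `uniform_bins` and `discretize_columns`, the toy table, and the choice of which columns count as continuous are all assumptions made for illustration; this is not the code used to produce the results in this notebook.

```python
def uniform_bins(values, n_bins=10):
    """Map each value to an ordinal bin index in [0, n_bins - 1] using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; put everything in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def discretize_columns(table, continuous_columns, n_bins=10):
    """Return a copy of `table` (dict of column lists) with the named columns binned."""
    return {
        name: uniform_bins(column, n_bins) if name in continuous_columns else list(column)
        for name, column in table.items()
    }


# Hypothetical toy table: EGFR is already binary, cytok is continuous.
toy = {"EGFR": [0, 1, 1, 0, 1], "cytok": [0.0, 2.5, 5.0, 7.5, 10.0]}
discrete = discretize_columns(toy, continuous_columns={"cytok"}, n_bins=4)
print(discrete)
```

Binary columns such as the exposure are passed through unchanged, so only the continuous variables lose resolution.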


## Step 1: Verify correctness of the network structure

In [6]:
from eliater.network_validation import print_graph_falsifications

In [9]:
print_graph_falsifications(graph=graph, data=data, method="chi-square", verbose=True, significance_level=0.01)

Failed tests: 0/99 (0.00%)
Reject null hypothesis when p<0.01
left       right      given                   stats           p    dof     p_adj  p_adj_significant
NFKB       SARS_COV2  Ang|PRR          13.3242       0.00979556      4  0.969761  False
EGFR       cytok      IL6AMP            8.92288      0.0115457       2  1         False
Toci       cytok      IL6AMP            0.136108     0.93421         2  1         False
SARS_COV2  Sil6r      Ang               0.396373     0.820217        2  1         False
NFKB       Sil6r      ADAM17            1.84486      0.397552        2  1         False
ACE2       IL6STAT3   Ang               1.25701      0.533388        2  1         False
AGTR1      Sil6r      ADAM17            3.61848      0.163779        2  1         False
SARS_COV2  Toci                         0.225658     0.634762        1  1         False
ACE2       Sil6r      Ang               2.45978      0.292325        2  1         False
PRR        Sil6r      Ang               1.1702

None of the tests failed for this graph. This is expected: the data are simulated according to the graph structure, so the d-separations implied by the network should be supported by the data.
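
The `stats`, `dof`, and `p` columns above can be read as the output of a chi-square test of conditional independence. The sketch below is an illustration of that idea, not Eliater's implementation: it assumes the statistic is a Pearson chi-square summed over the strata of the conditioning variables, which would explain why binary variables give `dof` 1 with no conditioning, 2 with one conditioning variable, and 4 with two. The function names and the toy data are hypothetical. The tail probability uses closed forms that only cover `dof == 1` and even `dof`.

```python
import math
from collections import Counter, defaultdict


def chi2_sf(stat, dof):
    """Upper-tail probability of a chi-square variable (dof == 1 or even dof only)."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2))
    if dof > 0 and dof % 2 == 0:
        half = stat / 2
        return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(dof // 2))
    raise ValueError("this sketch only handles dof == 1 or even dof")


def conditional_chi2(rows, x, y, given=()):
    """Pearson chi-square statistic and dof for x independent of y given `given`."""
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[g] for g in given)].append((row[x], row[y]))
    stat, dof = 0.0, 0
    for pairs in strata.values():
        n = len(pairs)
        joint = Counter(pairs)
        x_counts = Counter(a for a, _ in pairs)
        y_counts = Counter(b for _, b in pairs)
        for a, n_a in x_counts.items():
            for b, n_b in y_counts.items():
                expected = n_a * n_b / n
                stat += (joint[(a, b)] - expected) ** 2 / expected
        dof += (len(x_counts) - 1) * (len(y_counts) - 1)
    return stat, dof


# Hypothetical counts keyed by (Z, X, Y), built so that X and Y are independent given Z.
counts = {(0, 0, 0): 4, (0, 0, 1): 2, (0, 1, 0): 2, (0, 1, 1): 1,
          (1, 0, 0): 1, (1, 0, 1): 2, (1, 1, 0): 2, (1, 1, 1): 4}
rows = [{"Z": z, "X": x, "Y": y} for (z, x, y), n in counts.items() for _ in range(n)]

stat, dof = conditional_chi2(rows, "X", "Y", given=("Z",))
print(stat, dof, chi2_sf(stat, dof))

# Recompute p from the (stats, dof) pairs reported in the table above.
print(chi2_sf(13.3242, 4))   # table reports p = 0.00979556
print(chi2_sf(8.92288, 2))   # table reports p = 0.0115457
print(chi2_sf(0.225658, 1))  # table reports p = 0.634762
```

Note that the first row has an unadjusted `p` below 0.01 but is still not counted as a failure, because the decision is based on the adjusted value in `p_adj`.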

## Step 2: Check query identifiability

In [10]:
from y0.algorithm.identify import Identification
from y0.dsl import P, Variable

id_in = Identification.from_expression(
    query=P(Variable('cytok') @ Variable('EGFR')),
    graph=graph,
)
id_in

Identification(outcomes="{cytok}, treatments="{EGFR}",conditions="set()",  graph="NxMixedGraph(directed=<networkx.classes.digraph.DiGraph object at 0x0000027755227550>, undirected=<networkx.classes.graph.Graph object at 0x0000027755227490>)", estimand="P(ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci, cytok)")

The query is identifiable.

## Step 3: Find nuisance variables and mark them as latent

In [11]:
from eliater.discover_latent_nodes import find_nuisance_variables

This function finds the nuisance variables for the input graph.

In [12]:
nuisance_variables = find_nuisance_variables(graph, treatments=Variable("EGFR"), outcomes=Variable("cytok"))
nuisance_variables

set()

No variables are identified as nuisance variables, so the simplification in the next step will return a graph that is the same as the original.
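
For intuition about what counts as a nuisance variable, the sketch below works on a toy DAG. It assumes the working definition that a nuisance variable is a descendant of some node on a directed path from the treatment to the outcome that is not itself an ancestor of the outcome. The graph encoding (a dict from node to a set of children), the function names, and the node names are hypothetical, and this is not the `find_nuisance_variables` implementation.

```python
def reachable(edges, starts):
    """Return every node reachable from `starts` by following at least one edge."""
    seen, stack = set(), list(starts)
    while stack:
        node = stack.pop()
        for child in edges.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def reverse(edges):
    """Flip every edge so that reachability gives ancestors instead of descendants."""
    flipped = {}
    for parent, children in edges.items():
        for child in children:
            flipped.setdefault(child, set()).add(parent)
    return flipped


def nuisance_variables(edges, treatment, outcome):
    """Descendants of nodes on treatment-to-outcome paths that are not ancestors of the outcome."""
    descendants_or_self = reachable(edges, [treatment]) | {treatment}
    ancestors_or_self = reachable(reverse(edges), [outcome]) | {outcome}
    # In a DAG, a node lies on a directed treatment-to-outcome path iff it is in both sets.
    on_causal_paths = descendants_or_self & ancestors_or_self
    return reachable(edges, on_causal_paths) - ancestors_or_self - {treatment}


# Hypothetical DAG: C confounds X and Y, M mediates, N and D are downstream side branches.
toy_dag = {"C": {"X", "Y"}, "X": {"M"}, "M": {"Y", "N"}, "Y": {"D"}}
print(nuisance_variables(toy_dag, "X", "Y"))

# When every downstream node is an ancestor of the outcome, the result is empty,
# which mirrors the `set()` returned for this case study.
print(nuisance_variables({"X": {"M"}, "M": {"Y"}}, "X", "Y"))
```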

## Step 4: Simplify the network

The following function finds the nuisance variables (step 3), marks them as latent, and then applies Evans' simplification rules to remove them. As there are no nuisance variables here, the new graph is the same as the original graph.

In [13]:
from eliater.discover_latent_nodes import remove_nuisance_variables

In [14]:
new_graph = remove_nuisance_variables(graph, treatments=Variable("EGFR"), outcomes=Variable("cytok"))
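
As a rough illustration of how marking nodes as latent can shrink a graph, the sketch below applies only one of the simplification rules: a latent node with no children can be dropped, and this is repeated until nothing changes. The dict-of-children encoding, the function name `prune_childless_latents`, and the toy graph are assumptions for illustration; `remove_nuisance_variables` may apply further rules and is not reproduced here.

```python
def prune_childless_latents(edges, latents):
    """Repeatedly drop latent nodes that have no children; return (edges, remaining latents)."""
    edges = {node: set(children) for node, children in edges.items()}
    latents = set(latents)
    changed = True
    while changed:
        changed = False
        for node in sorted(latents):
            if not edges.get(node):
                # Remove the node together with every edge pointing into it.
                edges.pop(node, None)
                for children in edges.values():
                    children.discard(node)
                latents.discard(node)
                changed = True
    return edges, latents


# Hypothetical graph in which N and D were marked latent as nuisance variables.
toy_edges = {"X": {"M"}, "M": {"Y", "N"}, "N": {"D"}, "Y": set(), "D": set()}
simplified, remaining = prune_childless_latents(toy_edges, latents={"N", "D"})
print(simplified, remaining)
```

With an empty set of latent nodes, as in this case study, the function returns the graph unchanged.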

## Step 5: Estimate the query

In [15]:
from y0.algorithm.estimation import estimate_ace

In [16]:
estimate_ace(new_graph, treatments=Variable("EGFR"), outcomes=Variable("cytok"), data=data)

-0.4476229407307878
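
To help read this number, the sketch below assumes the usual definition of the average causal effect for a binary treatment, the expected outcome under do(treatment = 1) minus the expected outcome under do(treatment = 0), and estimates it on toy data by stratifying on a single confounder. This is a swapped-in illustration of adjustment, not the estimator that `estimate_ace` uses. The variable roles (`z` confounder, `x` treatment, `y` outcome), the function names, and the data are hypothetical.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def naive_difference(rows):
    """Difference in mean outcome between treated and untreated rows, ignoring confounding."""
    treated = [y for _, x, y in rows if x == 1]
    control = [y for _, x, y in rows if x == 0]
    return mean(treated) - mean(control)


def adjusted_ace(rows):
    """Stratify on z, take the treated-minus-control difference per stratum, weight by P(z)."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for z, x, y in rows:
        strata[z][x].append(y)
    ace = 0.0
    for groups in strata.values():
        # Assumes every stratum contains both treated and untreated rows.
        weight = (len(groups[0]) + len(groups[1])) / len(rows)
        ace += weight * (mean(groups[1]) - mean(groups[0]))
    return ace


# Hypothetical rows of (z, x, y): treatment is more common when z == 1, and z also raises y.
toy_rows = (
    [(0, 0, 0), (0, 0, 0), (0, 0, 1), (0, 0, 0)]
    + [(0, 1, 1), (0, 1, 0)]
    + [(1, 0, 1), (1, 0, 0)]
    + [(1, 1, 1)] * 4
)
print(naive_difference(toy_rows), adjusted_ace(toy_rows))
```

Under that definition, a negative estimate such as the one above would mean that setting EGFR to 1 lowers the expected value of cytok relative to setting it to 0.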