# Case study 2: The SARS-CoV-2 model

In [1]:
!pip install git+https://github.com/y0-causal-inference/eliater.git@linear-regression


In [3]:
import pandas as pd
from eliater.frontdoor_backdoor.sars_cov2_discrete import sars_large_example

This is case study 2 in Figure 6 of the paper *Eliater: an analytical workflow and open source implementation for causal query estimation in biomolecular networks*.

In [10]:
graph = sars_large_example.graph

We load the observational data below. The original data are of mixed type: the exposure (EGFR) is binary and the rest of the variables are continuous. The conditional independence tests in eliater do not currently support mixed-type data, so we discretize the data and use the discretized version only for step 1. The remaining steps support mixed-type data, so we use the original data for those steps.
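The discretization used for step 1 can be sketched as follows. This is a minimal, dependency-free illustration of equal-width binning, the same idea as scikit-learn's `KBinsDiscretizer` with `strategy='uniform'` and `encode='ordinal'`. The helper name `equal_width_bins`, the number of bins, and the sample values are assumptions made for the example and are not taken from the case study.

```python
def equal_width_bins(values, n_bins=2):
    """Map each continuous value to an ordinal bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column cannot be split; put everything in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical continuous measurements for one variable.
ace2 = [0.0, 2.5, 5.0, 7.5, 10.0]
ace2_binned = equal_width_bins(ace2, n_bins=2)
print(ace2_binned)
```

With two bins, every continuous column becomes binary, which matches the 0/1 coding visible in the discretized data set loaded below.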

In [15]:
# get observational data
data = pd.read_csv(
    "C:\\Users\\pnava\\PycharmProjects\\eliater\\src\\eliater\\data\\SARS_COV2_obs_data_discrete.csv",
    index_col=0
)

In [16]:
data.head()

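| Unnamed: 0 | SARS_COV2 | ACE2 | Ang | AGTR1 | ADAM17 | Toci | Sil6r | EGF | TNF | Gefi | EGFR | PRR | NFKB | IL6STAT3 | IL6AMP | cytok |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |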


In [31]:
# # discretize the data
# from sklearn.preprocessing import KBinsDiscretizer
# # the discretization transforms the raw data into ordinal bin indices
# kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
# data_trans = kbins.fit_transform(data)
# print(data_trans)
# column_names = ["SARS_COV2",
#                 "ACE2",
#                 "Ang",
#                 "AGTR1",
#                 "ADAM17",
#                 "Toci",
#                 "Sil6r",
#                 "EGF",
#                 "TNF",
#                 "Gefi",
#                 "EGFR",
#                 "PRR",
#                 "NFKB",
#                 "IL6STAT3",
#                 "IL6AMP",
#                 "cytok",]
# obs_data_discrete = pd.DataFrame(data_trans, columns=column_names)
# # summarize first few rows
# obs_data_discrete.head()


## Step 1: Verify correctness of the network structure

In [13]:
from eliater.network_validation import print_graph_falsifications

In [17]:
print_graph_falsifications(graph=graph, data=data, method="chi-square", verbose=True, significance_level=0.01)

Failed tests: 0/99 (0.00%)
Reject null hypothesis when p<0.01
left       right      given                   stats          p    dof    p_adj  p_adj_significant
AGTR1      IL6AMP     IL6STAT3|NFKB     0.0324134    0.983924       2        1  False
Gefi       Sil6r                        0.0201231    0.887194       1        1  False
ACE2       ADAM17     AGTR1             1.41006      0.494094       2        1  False
IL6AMP     SARS_COV2  IL6STAT3|NFKB     0.0902653    0.955871       2        1  False
ACE2       IL6AMP     IL6STAT3|NFKB     1.04406      0.593314       2        1  False
EGFR       Toci                         0            1              1        1  False
IL6STAT3   PRR        Sil6r             2.04415      0.359848       2        1  False
AGTR1      Toci                         2.2175       0.136454       1        1  False
SARS_COV2  Toci                         0.811494     0.367679       1        1  False
EGF        Sil6r      ADAM17            0.527229     0.76827      

None of the tests failed for this graph. This is expected: the data were simulated according to the graph structure, so the d-separations implied by the network should hold in the data.
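To make the test behind each row concrete, the following is a minimal sketch of a Pearson chi-square test of marginal independence between two binary variables, written in plain Python. It is not eliater's implementation: the function name `chi_square_independence` and the sample data are assumptions, no continuity correction or multiple-testing adjustment is applied, and the closed-form p-value holds only for one degree of freedom (a 2x2 table).

```python
import math
from collections import Counter


def chi_square_independence(xs, ys):
    """Pearson chi-square test for two binary (0/1) variables.

    Returns (statistic, p_value). Assumes a 2x2 table (dof = 1) in which
    every row total and column total is non-zero.
    """
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for x in (0, 1):
        for y in (0, 1):
            expected = row_totals[x] * col_totals[y] / n
            stat += (observed[(x, y)] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 dof.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical samples: the four (x, y) combinations occur equally often,
# so the two variables are exactly independent in this sample.
xs = [0, 0, 1, 1] * 25
ys = [0, 1, 0, 1] * 25
stat, p_value = chi_square_independence(xs, ys)
significance_level = 0.01
print(stat, p_value, p_value < significance_level)
```

A p-value at or above the significance level means the independence implied by the graph is not rejected, which is what the rows shown above report.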

## Step 2: Check query identifiability

In [18]:
from y0.algorithm.identify import Identification
from y0.dsl import P, Variable

id_in = Identification.from_expression(
    query=P(Variable('cytok') @ Variable('EGFR')),
    graph=graph,
)
id_in

Identification(outcomes="{cytok}, treatments="{EGFR}",conditions="set()",  graph="NxMixedGraph(directed=<networkx.classes.digraph.DiGraph object at 0x000002D0D715B710>, undirected=<networkx.classes.graph.Graph object at 0x000002D0D7159F10>)", estimand="P(ACE2, ADAM17, AGTR1, Ang, EGF, EGFR, Gefi, IL6AMP, IL6STAT3, NFKB, PRR, SARS_COV2, Sil6r, TNF, Toci, cytok)")

The query is identifiable.
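For intuition about what identifiability means, here is a small sketch, separate from the SARS-CoV-2 network, of a confounded three-variable model in which an interventional quantity is rewritten in purely observational terms using the back-door adjustment formula. The variable names and all probabilities are made up for the illustration, and the formula shown is not the estimand of the case-study graph.

```python
from itertools import product

# Hypothetical model: Z -> X, Z -> Y, X -> Y (Z confounds X and Y).
p_z = {0: 0.5, 1: 0.5}                      # P(Z = z)
p_x1_given_z = {0: 0.2, 1: 0.8}             # P(X = 1 | Z = z)
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y = 1 | X = x, Z = z)
                 (1, 0): 0.3, (1, 1): 0.7}


def joint(x, y, z):
    """Observational joint probability P(X = x, Y = y, Z = z)."""
    px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
    py = p_y1_given_xz[(x, z)] if y == 1 else 1 - p_y1_given_xz[(x, z)]
    return p_z[z] * px * py


def naive(x):
    """P(Y = 1 | X = x): conditioning only, biased by the confounder."""
    num = sum(joint(x, 1, z) for z in (0, 1))
    den = sum(joint(x, y, z) for y, z in product((0, 1), repeat=2))
    return num / den


def adjusted(x):
    """P(Y = 1 | do(X = x)) = sum_z P(Y = 1 | x, z) P(z), from the joint only."""
    total = 0.0
    for z in (0, 1):
        p_xz = sum(joint(x, y, z) for y in (0, 1))
        p_z_marginal = sum(joint(a, b, z) for a, b in product((0, 1), repeat=2))
        total += joint(x, 1, z) / p_xz * p_z_marginal
    return total


print(naive(1) - naive(0), adjusted(1) - adjusted(0))
```

In this toy model the true effect of X on Y is 0.2 by construction. The adjusted difference recovers it from observational quantities alone, while the naive difference of conditional probabilities does not.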

## Step 3: Find nuisance variables and mark them as latent

In [19]:
from eliater.discover_latent_nodes import find_nuisance_variables

This function finds the nuisance variables for the input graph.

In [20]:
nuisance_variables = find_nuisance_variables(graph, treatments=Variable("EGFR"), outcomes=Variable("cytok"))
nuisance_variables

set()

No variables are identified as nuisance variables. Hence, the simplification in the next step will return a graph that is the same as the original graph.
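As a rough illustration of the idea, the sketch below works on a toy graph and assumes that nuisance variables are the descendants of nodes on causal paths from the treatment to the outcome that are not themselves ancestors of the outcome. The function `toy_nuisance_variables`, the adjacency-dict representation, and the toy graph are all assumptions for the example; this is not the library's implementation.

```python
def descendants(graph, node):
    """All nodes reachable from `node` along directed edges (excluding itself)."""
    seen, stack = set(), list(graph.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, ()))
    return seen


def toy_nuisance_variables(graph, treatment, outcome):
    """Descendants of causal-path nodes that are not ancestors of the outcome."""
    nodes = set(graph) | {c for children in graph.values() for c in children}
    ancestors_of_outcome = {n for n in nodes if outcome in descendants(graph, n)}
    # In a DAG, a node lies on a causal path if it is the treatment, the
    # outcome, or both a descendant of the treatment and an ancestor of
    # the outcome.
    on_causal_path = {treatment, outcome} | (
        descendants(graph, treatment) & ancestors_of_outcome
    )
    reachable = set()
    for node in on_causal_path:
        reachable |= descendants(graph, node)
    return reachable - ancestors_of_outcome - {outcome}


# Hypothetical DAG as {parent: [children]}: X -> M -> Y, M -> N1 -> N2, C -> X.
toy_graph = {"C": ["X"], "X": ["M"], "M": ["Y", "N1"], "N1": ["N2"]}
print(toy_nuisance_variables(toy_graph, "X", "Y"))
```

Here `N1` and `N2` branch off the causal path without feeding back into `Y`, so they are flagged. A graph with no such side branches yields an empty set, as in the case study.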

## Step 4: Simplify the network

The following function finds the nuisance variables (step 3), marks them as latent, and then applies Evans' simplification rules to remove them. As there are no nuisance variables here, the new graph will be the same as the original graph.

In [21]:
from eliater.discover_latent_nodes import remove_nuisance_variables

In [22]:
new_graph = remove_nuisance_variables(graph, treatments=Variable("EGFR"), outcomes=Variable("cytok"))
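One of Evans' rules is easy to illustrate: a latent node with no children has no influence on the observed variables and can be dropped. The sketch below applies only that rule to a toy graph. The function name `remove_childless_latents` and the graph are assumptions for the example, and the remaining rules are not shown.

```python
def remove_childless_latents(graph, latents):
    """Repeatedly drop latent nodes that have no children.

    `graph` maps every node to a list of its children. The input is not
    modified; a simplified copy is returned.
    """
    graph = {node: list(children) for node, children in graph.items()}
    latents = set(latents)
    while True:
        childless = {n for n in latents if n in graph and not graph[n]}
        if not childless:
            return graph
        for node in childless:
            del graph[node]
        latents -= childless
        # Removing a node may leave its latent parents childless in turn,
        # so drop the dangling edges and loop again.
        for node in graph:
            graph[node] = [c for c in graph[node] if c not in childless]


# Hypothetical graph: X -> M -> Y, M -> N1 -> N2, with N1 and N2 marked latent.
toy_graph = {"X": ["M"], "M": ["Y", "N1"], "N1": ["N2"], "N2": [], "Y": []}
simplified = remove_childless_latents(toy_graph, latents={"N1", "N2"})
print(simplified)
```

When no node is marked latent, as in this case study, the loop exits immediately and the graph is returned unchanged.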

## Step 5: Estimate the query

In [23]:
from y0.algorithm.estimation import estimate_ace

In [25]:
estimate_ace(new_graph, treatments=Variable("EGFR"), outcomes=Variable("cytok"), data=data)

-0.13376185743219493
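
For a binary treatment, the average causal effect (ACE) is conventionally defined as a contrast between two interventional expectations. Assuming `estimate_ace` follows this convention with EGFR set to 1 versus 0 (the notebook does not spell this out), the quantity estimated here would be:

```latex
\mathrm{ACE} = \mathbb{E}\left[\text{cytok} \mid do(\text{EGFR} = 1)\right] - \mathbb{E}\left[\text{cytok} \mid do(\text{EGFR} = 0)\right]
```

Under that reading, the negative estimate suggests that setting EGFR to 1 lowers the expected value of cytok by about 0.13 relative to setting it to 0 in this simulated data set.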