# Case study 2: The T cell signaling pathway

In [1]:
!pip install git+https://github.com/y0-causal-inference/eliater.git@linear-regression

Collecting git+https://github.com/y0-causal-inference/eliater.git@linear-regression

  Running command git clone --filter=blob:none --quiet https://github.com/y0-causal-inference/eliater.git 'C:\Users\pnava\AppData\Local\Temp\pip-req-build-9801nic1'
  Running command git checkout -b linear-regression --track origin/linear-regression
  branch 'linear-regression' set up to track 'origin/linear-regression'.
  Switched to a new branch 'linear-regression'

  Cloning https://github.com/y0-causal-inference/eliater.git (to revision linear-regression) to c:\users\pnava\appdata\local\temp\pip-req-build-9801nic1
  Resolved https://github.com/y0-causal-inference/eliater.git to commit f666788b42cf32722a3bef39754a9bb19a375a92
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'


In [2]:
import pandas as pd

from eliater.examples import t_cell_signaling_example
from eliater.network_validation import print_graph_falsifications

This notebook reproduces case study 2 from the paper *Eliater: an open source software for causal query estimation from observational measurements of biomolecular networks*. The figure below shows the protein signaling network (G) of the T cell signaling pathway presented in (Sachs et al., 2005). It models the molecular mechanisms and regulatory processes involved in T cell activation, proliferation, and function.

In [None]:
graph = t_cell_signaling_example.graph
graph.draw()

The observational data consisted of quantitative multivariate flow cytometry measurements of phosphorylated proteins derived from thousands of individual primary immune system cells. The cells were subjected to general stimuli meant to activate the desired paths. The distributions of the individual protein measurements were skewed, and pairs of proteins exhibited nonlinear relationships. To account for this, the data were binned into two levels, corresponding to low and high concentrations, using Hartemink's approach (Hartemink, 2001), which preserves the dependence structure of the original data.
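To make the idea of two-level binning concrete, here is a minimal sketch that splits one protein's measurements at the median. This is only a simple stand-in: Hartemink's approach chooses cut points so as to preserve dependence between variables, which a per-variable median split does not attempt. The function name and the readings are hypothetical.

```python
from statistics import median


def binarize_at_median(values):
    """Map each measurement to 0 (low) or 1 (high) by comparing it with the median."""
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]


# Hypothetical, right-skewed intensity readings for a single protein.
readings = [12.0, 15.5, 14.2, 220.0, 18.9, 16.1, 950.0, 13.3]
levels = binarize_at_median(readings)
print(levels)
```

A median split is robust to the skew mentioned above because the cut point depends only on rank order, not on the magnitude of the extreme readings.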

In [21]:
# Get the data
# TODO: Change from local path to Github url
data = pd.read_csv(
    "C:\\Users\\pnava\\PycharmProjects\\eliater\\src\\eliater\\data\\sachs_discretized_2bin.csv",
    index_col=0,
)
data.head()

   Raf  Mek  Plcg  PIP2  PIP3  Erk  Akt  PKA  PKC  P38  Jnk
0    0    0     0     1     1    1    0    1    1    1    1
1    0    1     1     1     1    1    1    1    1    1    1
2    1    0     0     1     1    1    1    1    1    0    0
3    1    0     0     1     1    1    1    1    1    0    0
4    0    0     1     1     0    1    1    1    1    1    1


## Step 1: Verify correctness of the network structure

In [22]:
print_graph_falsifications(graph, data, method="chi-square", verbose=True, significance_level=0.01)

Failed tests: 6/35 (17.14%)
Reject null hypothesis when p<0.01
left    right    given               stats           p    dof        p_adj  p_adj_significant
Jnk     P38      PKA|PKC       171.716      0               4  0            True
Erk     PIP2     PKC            89.0334     0               2  0            True
Plcg    Raf      PKC           249.869      0               2  0            True
Erk     PIP3     PKC           478.923      0               2  0            True
Mek     Plcg     PKC           208.487      0               2  0            True
Akt     PKC      Erk|PIP3|PKA   38.1759     2.8056e-06      7  8.41681e-05  True
PIP3    PKC      PIP2|Plcg      17.2052     0.00176329      4  0.0511354    False
P38     Plcg     PKC             8.17906    0.0167471       2  0.468918     False
Erk     Plcg     PKC             5.37205    0.0681513       2  1            False
Mek     P38      PKA|PKC         3.01208    0.555806        4  1            False
Akt     Mek      Erk|PKA|PKC 

Of the 35 d-separations implied by the network, six failed (17.14%). Because the percentage of failed tests is below 30 percent, the effect of these failures on the estimation of the causal query is considered minor. Hence, we proceed to the next step.
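The `p_adj` column in the table above is numerically consistent with a Holm step-down correction over all 35 tests (for example, 2.8056e-06 × 30 ≈ 8.4168e-05). That is an inference from the printed numbers, not something the notebook states, so the sketch below should be read as one plausible reconstruction. The helper `holm_adjust` is hypothetical and is not part of eliater.

```python
def holm_adjust(sorted_p_values, total_tests):
    """Holm step-down adjustment for the smallest p-values out of `total_tests` tests.

    `sorted_p_values` must be in ascending order. Tests that are not listed are
    assumed to have larger p-values, so they do not change the listed entries.
    """
    adjusted = []
    running_max = 0.0
    for rank, p in enumerate(sorted_p_values):
        # The k-th smallest p-value (0-based rank) is scaled by (m - rank), capped at 1.
        candidate = min(1.0, (total_tests - rank) * p)
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted.append(running_max)
    return adjusted


# First ten p-values as printed in the table above (several are printed as 0).
p_values = [0, 0, 0, 0, 0, 2.8056e-06, 0.00176329, 0.0167471, 0.0681513, 0.555806]
total_tests = 35

p_adj = holm_adjust(p_values, total_tests)
failed = sum(p < 0.01 for p in p_adj)
failure_rate = 100 * failed / total_tests
print(failed, round(failure_rate, 2))
```

Under this assumption the sketch reproduces the six rejections at the 0.01 level and the 17.14% failure rate reported in the output.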

## Step 2: Check query identifiability

In [23]:
from y0.algorithm.identify import Identification
from y0.dsl import P, Variable

id_in = Identification.from_expression(
    # query=P(Variable('Erk') @ [Variable('Raf'), Variable('Mek')]),
    query=P(Variable("Erk") @ Variable("Raf")),
    graph=graph,
)
id_in

Identification(outcomes="{Erk}, treatments="{Raf}",conditions="set()",  graph="NxMixedGraph(directed=<networkx.classes.digraph.DiGraph object at 0x0000021A78E5BD50>, undirected=<networkx.classes.graph.Graph object at 0x0000021A78E583D0>)", estimand="P(Akt, Erk, Jnk, Mek, P38, PIP2, PIP3, PKA, PKC, Plcg, Raf)")

The query is identifiable. Hence, we can proceed to the next step.

## Step 3: Find nuisance variables and mark them as latent

The 'find_nuisance_variables' function finds the nuisance variables in the input graph for a given treatment and outcome.

In [24]:
from eliater.discover_latent_nodes import find_nuisance_variables, mark_nuisance_variables_as_latent

In [25]:
nuisance_variables = find_nuisance_variables(
    graph, treatments=Variable("Raf"), outcomes=Variable("Erk")
)
nuisance_variables

{Akt}

The nuisance variable is $Akt$.
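As a rough illustration of why a node such as Akt can be flagged, the sketch below assumes that a nuisance variable is a descendant of a node on a directed path from the treatment to the outcome that is not itself an ancestor of the outcome. Both this definition and the small edge list are assumptions made for illustration: the edges are a hypothetical subset and not the graph shipped with eliater, and `find_nuisance` is not the library's implementation.

```python
def reachable(adjacency, start):
    """Return every node reachable from `start` by following `adjacency` edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def find_nuisance(edges, treatment, outcome):
    """Descendants of causal-path nodes that are not ancestors of the outcome."""
    children, parents = {}, {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
        parents.setdefault(dst, []).append(src)
    ancestors_of_outcome = reachable(parents, outcome) | {outcome}
    # Nodes on a directed treatment-to-outcome path (treatment itself excluded).
    on_causal_path = reachable(children, treatment) & ancestors_of_outcome
    descendants = set()
    for node in on_causal_path:
        descendants |= reachable(children, node)
    return descendants - ancestors_of_outcome - {treatment}


# Hypothetical subset of edges, chosen only to illustrate the idea.
edges = [
    ("PKA", "Raf"), ("PKA", "Mek"), ("PKA", "Erk"), ("PKA", "Akt"),
    ("Raf", "Mek"), ("Mek", "Erk"), ("Erk", "Akt"),
]
nuisance = find_nuisance(edges, treatment="Raf", outcome="Erk")
print(nuisance)
```

In this toy graph Akt lies downstream of Erk and feeds into nothing that affects Erk, so measuring it adds no information about the effect of Raf on Erk.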

In [26]:
latent_variable_dag = mark_nuisance_variables_as_latent(
    graph,
    treatments=Variable("Raf"),
    outcomes=Variable("Erk"),
)

## Step 4: Simplify the network

In eliater, steps 3 and 4 are combined into a single function. The following function finds the nuisance variables (step 3), marks them as latent, and then applies Evans' simplification rules (step 4) to remove them. As a result, running the 'find_nuisance_variables' and 'mark_nuisance_variables_as_latent' functions is not necessary to obtain the result of step 4; we called them above only to illustrate the intermediate results. The new graph obtained in step 4 does not contain nuisance variables.

In [27]:
from eliater.discover_latent_nodes import remove_nuisance_variables

In [28]:
new_graph = remove_nuisance_variables(graph, treatments=Variable("Raf"), outcomes=Variable("Erk"))

## Step 5: Estimate the query

In [30]:
from y0.algorithm.estimation import estimate_ace

In [31]:
estimate_ace(new_graph, treatments=Variable("Raf"), outcomes=Variable("Erk"), data=data)

-0.30580881280669625
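For orientation, an average causal effect (ACE) for a binary treatment is commonly defined as E[Y | do(X=1)] − E[Y | do(X=0)]. The sketch below computes this quantity on made-up records by adjusting for a single discrete covariate (a stratified backdoor adjustment). It is not a description of what `estimate_ace` does internally, which may use a different estimator; the function `stratified_ace` and the data are hypothetical.

```python
from collections import defaultdict


def stratified_ace(records):
    """Backdoor-adjusted ACE for a binary treatment x and one discrete covariate z.

    Each record is a tuple (x, z, y). The result is
    sum over z of P(z) * (E[y | x=1, z] - E[y | x=0, z]).
    Assumes every stratum of z contains both treated and untreated records.
    """
    by_stratum = defaultdict(lambda: {0: [], 1: []})
    for x, z, y in records:
        by_stratum[z][x].append(y)
    total = len(records)
    ace = 0.0
    for groups in by_stratum.values():
        weight = (len(groups[0]) + len(groups[1])) / total
        mean_treated = sum(groups[1]) / len(groups[1])
        mean_control = sum(groups[0]) / len(groups[0])
        ace += weight * (mean_treated - mean_control)
    return ace


# Made-up binary records (x, z, y); they are not taken from the Sachs data.
records = [
    (0, 0, 1), (0, 0, 1), (1, 0, 0), (1, 0, 1),
    (0, 1, 1), (0, 1, 1), (1, 1, 0), (1, 1, 0),
]
ace = stratified_ace(records)
print(ace)
```

Under this definition, a negative value means the outcome is on average lower at the high treatment level than at the low one.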