# Causal Discovery with WhyNot

WhyNot provides tools to automatically construct the *causal graph* associated with runs of its dynamical system simulators.
Building on work in [automatic differentiation](https://github.com/HIPS/autograd), WhyNot traces the evolution of the state variables
during simulation and builds up the corresponding causal graph. This allows developers to write complicated simulators in plain Python
and NumPy and then extract the graph of the dynamics automatically, which is more flexible and less error-prone than tracking the dynamics by hand.
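
To illustrate the idea of tracing, here is a minimal sketch of dependency tracking through operator overloading. It is not WhyNot's implementation: the `Traced` class, the `toy_step` dynamics, and the `trace_edges` helper are all hypothetical names invented for this example, and the three-variable model is a toy, not the HIV simulator.

```python
class Traced:
    """A number that remembers which named inputs it was computed from."""

    def __init__(self, value, parents):
        self.value = value
        self.parents = frozenset(parents)

    def _combine(self, other, op):
        if isinstance(other, Traced):
            return Traced(op(self.value, other.value), self.parents | other.parents)
        # Plain constants contribute a value but no dependencies.
        return Traced(op(self.value, other), self.parents)

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    __radd__ = __add__

    def __sub__(self, other):
        return self._combine(other, lambda a, b: a - b)

    def __mul__(self, other):
        return self._combine(other, lambda a, b: a * b)

    __rmul__ = __mul__


def toy_step(state):
    """Hypothetical one-step dynamics over three variables (not the HIV model)."""
    healthy, infected, virus = state["healthy"], state["infected"], state["virus"]
    new_infections = 0.01 * healthy * virus
    return {
        "healthy": healthy - new_infections,
        "infected": infected + new_infections,
        "virus": 0.5 * virus + 2.0 * infected,
    }


def trace_edges(step, initial_values):
    """Run `step` on traced inputs and return edges (input at t, output at t+1)."""
    traced = {name: Traced(value, {name}) for name, value in initial_values.items()}
    outputs = step(traced)
    return {
        (f"{parent}_t", f"{child}_t+1")
        for child, result in outputs.items()
        for parent in result.parents
    }


edges = trace_edges(toy_step, {"healthy": 100.0, "infected": 1.0, "virus": 5.0})
for edge in sorted(edges):
    print(edge)
```

The simulator code in `toy_step` is ordinary arithmetic; the graph falls out of running it on wrapped inputs, so the edges cannot drift out of sync with the dynamics.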


In this notebook, we use these tools to test causal discovery algorithms. We run an experiment to discover the causal structure of the dynamics of the [HIV simulator](https://whynot-docs.readthedocs-hosted.com/en/latest/simulators.html#adams-hiv-simulator), and then evaluate the performance of the IC* (Inductive Causation with latent variables) algorithm from Pearl (2000) against the traced graph. We use the independence tests and IC implementation provided by the [causality](https://www.github.com/akelleh/causality) package.


**Note**: This feature is still experimental, and there are likely a few rough edges.

In [3]:
%load_ext autoreload
%autoreload 2

import itertools
import whynot as wn
import numpy as np
import pandas as pd

import scripts.causal_search as utils



## Learning the dynamics of the HIV simulator

We run an experiment to discover the causal structure of the dynamics for the HIV simulator. Rather than try to learn the structure for the entire unrolled dynamics, we instead focus on learning the causal structure of the dynamics *for a single time step*. 

There are 6 states in the simulator. Given a state $x_t \in \mathbb{R}^6$, the dynamics evolve according to $x_{t+1} = f(x_t) \in \mathbb{R}^6$, and we wish to uncover how each component of $x_{t+1}$, e.g. "infected CD4+ T-lymphocytes", depends on the components of $x_t$, e.g. "infected macrophages." 

Hence, this experiment has 12 nodes (one for each component of the state at time steps $t$ and $t+1$) and 20 directed edges between them, as determined by tracing the simulator execution.
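
To make the node count concrete, the sketch below enumerates the nodes using the variable names and time labels that appear in this notebook's printed graph output. The restriction of candidate edges to those running from a time-0 node to a time-1 node is our assumption about the single-step setup, not something the tracing tool is confirmed to enforce.

```python
import itertools

# Variable names as they appear in the notebook's printed node list.
variable_names = [
    "uninfected_T1", "infected_T1", "uninfected_T2",
    "infected_T2", "free_virus", "immune_response",
]
times = [0.0, 1.0]

# One node per (time, variable) pair.
nodes = [f"{name}_{time}" for time, name in itertools.product(times, variable_names)]

# Assumption: candidate edges point from a time-0 node to a time-1 node.
sources = [node for node in nodes if node.endswith("_0.0")]
targets = [node for node in nodes if node.endswith("_1.0")]
candidate_edges = list(itertools.product(sources, targets))

print(len(nodes))            # 12
print(len(candidate_edges))  # 36
```

Under that assumption, the 20 traced edges are a subset of 36 possible time-0 to time-1 dependencies.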

## Generating data

We first generate 500 initial states. Each initial state is an IID draw obtained by multiplying every component
of the default simulator state by an independent factor sampled uniformly from $[0.95, 1.05]$.



In [4]:
def sample_initial_state():
    """Sample initial state by randomly perturbing the default state."""
    state = wn.hiv.State().values()
    perturbed = state * np.random.uniform(low=0.95, high=1.05, size=6)
    return wn.hiv.State(*perturbed)

initial_states = [sample_initial_state() for _ in range(500)]

Given initial states, we run the simulator for a single time step forward.

In [6]:
config = wn.hiv.Config(delta_t=1.0, start_time=0, end_time=1)
runs = [wn.hiv.simulate(init_state, config) for init_state in initial_states]

## Extracting the causal graph

WhyNot automatically constructs the causal graph from simulator executions. Here we pass it the simulator module, the runs, and the configuration used to generate them.

In [11]:
true_graph = wn.causal_graphs.build_dynamics_graph(wn.hiv, runs, config)

print(f"Number of nodes: {len(true_graph.nodes)}, Number of edges: {len(true_graph.edges)}")
print(true_graph.nodes)

Number of nodes: 12, Number of edges: 20
['uninfected_T1_0.0', 'infected_T1_0.0', 'uninfected_T2_0.0', 'infected_T2_0.0', 'free_virus_0.0', 'immune_response_0.0', 'uninfected_T1_1.0', 'infected_T1_1.0', 'uninfected_T2_1.0', 'infected_T2_1.0', 'free_virus_1.0', 'immune_response_1.0']


## Running the IC* algorithm for causal discovery 

We first reformat the generated data into a dataframe, and then we pass this dataframe to the IC* algorithm
to learn the underlying structure between the variables.

In [13]:
# Generate a dataset consisting of all of the simulator covariates, unrolled over time
def flatten(run):
    """Flatten the covariates into a single long observation"""
    return np.concatenate([state.values() for state in run.states])

data = np.array([flatten(run) for run in runs]) 
columns = [f"{name}_{time}" for time, name in itertools.product(runs[0].times, wn.hiv.State.variable_names())]
df_hiv = pd.DataFrame(data, columns=columns)
df_hiv.head()

| | uninfected_T1_0.0 | infected_T1_0.0 | uninfected_T2_0.0 | infected_T2_0.0 | free_virus_0.0 | immune_response_0.0 | uninfected_T1_1.0 | infected_T1_1.0 | uninfected_T2_1.0 | infected_T2_1.0 | free_virus_1.0 | immune_response_1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 994539.8 | 9.6e-05 | 3156.788491 | 0.0001 | 1.046378 | 10.106029 | 994592.4 | 1.485004 | 3156.50239 | 0.589193 | 8.163717 | 10.109272 |
| 1 | 1005778.0 | 0.0001 | 3111.359607 | 9.8e-05 | 0.997685 | 10.311317 | 1005719.0 | 1.448924 | 3111.55989 | 0.560389 | 7.900592 | 10.294816 |
| 2 | 958916.5 | 0.000103 | 3229.39354 | 9.6e-05 | 1.044209 | 9.528145 | 959323.7 | 1.351149 | 3228.407837 | 0.568524 | 7.591201 | 9.58492 |
| 3 | 1038134.0 | 9.7e-05 | 3261.808979 | 0.000101 | 0.995414 | 10.05176 | 1037752.0 | 1.660425 | 3260.407173 | 0.652166 | 9.01232 | 10.061156 |
| 4 | 950757.2 | 0.0001 | 3156.497745 | 0.000102 | 0.965206 | 9.537344 | 951245.8 | 1.194011 | 3156.32313 | 0.495346 | 6.701102 | 9.591946 |


### Run the search algorithm

This might take a while.

In [14]:
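from causality.inference.search import IC
from causality.inference.independence_tests import RobustRegressionTest
ic_algorithm = IC(RobustRegressionTest)
estimated_graph = ic_algorithm.search(df_hiv, variable_types={column: 'c' for column in columns})  # 'c' marks each column as continuous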
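
One reason the search can be slow is that constraint-based algorithms such as IC* test many variable pairs against many candidate conditioning sets. With 12 variables there are already 66 pairs to examine. The sketch below shows the first step of such an algorithm (pruning edges between pairs that some conditioning set renders independent) on a three-variable toy problem. It is not the `causality` package's implementation: `find_skeleton`, `chain_oracle`, and `max_cond_size` are hypothetical names, and the independence test is a hand-written oracle rather than a statistical test.

```python
import itertools


def find_skeleton(variables, is_independent, max_cond_size=2):
    """Return (undirected edges, number of tests run).

    `is_independent(x, y, cond)` is a caller-supplied conditional
    independence test; here it is a hypothetical oracle.
    """
    edges = set()
    tests_run = 0
    for x, y in itertools.combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        separated = False
        # Try conditioning sets of increasing size until one separates x and y.
        for size in range(max_cond_size + 1):
            for cond in itertools.combinations(others, size):
                tests_run += 1
                if is_independent(x, y, frozenset(cond)):
                    separated = True
                    break
            if separated:
                break
        # Keep the edge only if no conditioning set separated the pair.
        if not separated:
            edges.add(frozenset((x, y)))
    return edges, tests_run


def chain_oracle(x, y, cond):
    """Toy oracle for the chain A -> B -> C: A and C are independent given B."""
    return {x, y} == {"A", "C"} and "B" in cond


skeleton, tests_run = find_skeleton(["A", "B", "C"], chain_oracle)
print(sorted(sorted(edge) for edge in skeleton))
print(tests_run)
```

Even in this toy case, three variables require several tests; the count grows quickly as variables and conditioning-set sizes increase.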

## Evaluating the results

How well did the causal structure learning algorithm perform?

In [15]:
# Assumption: the F1 helpers live in the `scripts.causal_search` module imported above as `utils`.
print("Original Graph: {} edges, Estimated Graph: {} edges".format(len(true_graph.edges), len(estimated_graph.edges)))
print("Undirected Edge F1 Score: {:.2f}".format(utils.undirected_f1(true_graph, estimated_graph)))
print("Directed Edge F1 Score: {:.2f}".format(utils.directed_f1(true_graph, estimated_graph)))

Original Graph: 20 edges, Estimated Graph: 14 edges
Undirected Edge F1 Score: 0.29
Directed Edge F1 Score: 0.09
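
The notebook does not show how the two F1 scores are computed. As a rough guide to what such scores measure, here is a minimal sketch of one common definition: F1 is the harmonic mean of precision and recall over edge sets, where the directed variant requires matching orientation and the undirected variant ignores it. The function names and the toy edge lists below are hypothetical, and the notebook's own helpers may use a different definition.

```python
def f1_score(true_edges, estimated_edges):
    """F1 between two edge sets: harmonic mean of precision and recall."""
    true_edges, estimated_edges = set(true_edges), set(estimated_edges)
    true_positives = len(true_edges & estimated_edges)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(estimated_edges)
    recall = true_positives / len(true_edges)
    return 2 * precision * recall / (precision + recall)


def directed_f1_sketch(true_edges, estimated_edges):
    """Edges match only if both source and target agree."""
    return f1_score(true_edges, estimated_edges)


def undirected_f1_sketch(true_edges, estimated_edges):
    """Edges match regardless of orientation."""
    def undirect(edges):
        return {frozenset(edge) for edge in edges}
    return f1_score(undirect(true_edges), undirect(estimated_edges))


# Toy example (not the HIV graphs): one edge is recovered with the wrong direction.
true_edges = [("a", "b"), ("b", "c"), ("a", "c")]
estimated_edges = [("a", "b"), ("c", "b")]

print(round(undirected_f1_sketch(true_edges, estimated_edges), 2))
print(round(directed_f1_sketch(true_edges, estimated_edges), 2))
```

Under this definition the directed score can never exceed the undirected score, since every correctly oriented edge is also a correct adjacency.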
