## Setting up the data and causal model: CausalityDataset

This notebook demonstrates how to use and configure `CausalityDataset` using an arbitrary `pd.DataFrame`.


In [2]:
%load_ext autoreload
%autoreload 2
import os, sys
import warnings
warnings.filterwarnings('ignore') # suppress sklearn deprecation warnings for now..

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# the checks below fall back to local source checkouts of causaltune, dowhy, and FLAML if they are not installed
root_path = os.path.realpath('../..')
try:
    import causaltune
except ModuleNotFoundError:
    sys.path.append(os.path.join(root_path, "auto-causality"))

try:
    import dowhy
except ModuleNotFoundError:
    sys.path.append(os.path.join(root_path, "dowhy"))

try:
    import flaml
except ModuleNotFoundError:
    sys.path.append(os.path.join(root_path, "FLAML"))

from causaltune import CausalTune
from causaltune.datasets import synth_ihdp
from causaltune.data_utils import CausalityDataset


In [3]:
# this makes the notebook expand to full width of the browser window
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [4]:
%%javascript

// turn off scrollable windows for large output
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


We illustrate the model setup with a subset of data from the Infant Health and Development Program (IHDP).

In [8]:
df = synth_ihdp(return_df=True).iloc[:,:5]
display(df.head())

|   | treatment | y_factual | x1 | x2 | x3 |
|---|-----------|-----------|----|----|----|
| 0 | 1 | 5.599916 | -0.528603 | -0.343455 | 1.128554 |
| 1 | 0 | 6.875856 | -1.736945 | -1.802002 | 0.383828 |
| 2 | 0 | 2.996273 | -0.807451 | -0.202946 | -0.360898 |
| 3 | 0 | 1.366206 | 0.390083 | 0.596582 | -1.85035 |
| 4 | 0 | 1.963538 | -1.045229 | -0.60271 | 0.011465 |


Generally, at least three arguments have to be supplied to `CausalityDataset`:
- `data`: input dataframe
- `treatment`: name of treatment column
- `outcomes`: list of names of outcome columns; provide as list even if there's just one outcome of interest

In addition, if the propensities to treat are known, then provide the corresponding column name(s) via `propensity_modifiers`.

In [9]:
cd = CausalityDataset(data=df, treatment='treatment', outcomes=['y_factual'])
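
For the case of known propensities mentioned above, the call might look like the following sketch. The toy dataframe, its column names (`y`, `x1`, `propensity`), and the data-generating process are hypothetical and only serve to show where the `propensity_modifiers` argument goes; the `CausalityDataset` call is skipped if `causaltune` is not importable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical toy data: the probability of treatment is known and depends on x1
x1 = rng.normal(size=n)
propensity = 1.0 / (1.0 + np.exp(-x1))       # known propensity to treat
treatment = rng.binomial(1, propensity)      # binary treatment drawn with that propensity
y = 2.0 * treatment + x1 + rng.normal(size=n)

toy_df = pd.DataFrame(
    {"treatment": treatment, "y": y, "x1": x1, "propensity": propensity}
)

try:
    from causaltune.data_utils import CausalityDataset

    # Pass the name of the column holding the known propensities
    toy_cd = CausalityDataset(
        data=toy_df,
        treatment="treatment",
        outcomes=["y"],  # a list, even for a single outcome
        propensity_modifiers=["propensity"],
    )
except ImportError:
    toy_cd = None  # causaltune is not available in this environment
```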

The next step, highly recommended though not always strictly necessary, is to call `cd.preprocess_dataset()`, which deals with missing values, removes outliers, etc.

In [10]:
cd.preprocess_dataset()
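
To give a rough idea of what this kind of preprocessing involves, here is a generic pandas sketch of missing-value imputation followed by outlier removal. It is not CausalTune's implementation: the helper `simple_preprocess`, the median imputation, and the z-score threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def simple_preprocess(frame, feature_cols, z_max=5.0):
    """Illustrative only: median-impute features, then drop rows with extreme values."""
    out = frame.copy()
    # Fill missing feature values with the column median
    out[feature_cols] = out[feature_cols].fillna(out[feature_cols].median())
    # Standardise each feature (guarding against zero variance)
    std = out[feature_cols].std(ddof=0).replace(0, 1.0)
    z = (out[feature_cols] - out[feature_cols].mean()) / std
    # Keep only rows whose features all lie within z_max standard deviations
    keep = (z.abs() <= z_max).all(axis=1)
    return out.loc[keep].reset_index(drop=True)


# Hypothetical toy data with one missing value and one extreme value in x1
toy = pd.DataFrame(
    {
        "treatment": [0, 1, 0, 1, 0, 1],
        "y": [1.0, 2.0, 1.5, 2.5, 1.2, 2.2],
        "x1": [0.1, np.nan, -0.2, 0.3, 0.0, 100.0],
    }
)
clean = simple_preprocess(toy, ["x1"], z_max=2.0)
print(clean)
```

In this toy example the missing `x1` entry is imputed and the row with `x1 = 100.0` is dropped, leaving five rows.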

Subsequently, use the preprocessed `CausalityDataset` object for training as follows: `ct.fit(data=cd, outcome='y_factual')`, where `ct` is a `CausalTune` instance.

In [11]:
ct = CausalTune(
    propensity_model='auto',
    components_time_budget=5,
    verbose=0
)
ct.fit(data=cd, outcome='y_factual')

Fitting a Propensity-Weighted scoring estimator to be used in scoring tasks
Initial configs: [{'estimator': {'estimator_name': 'backdoor.causaltune.models.NaiveDummy'}}, {'estimator': {'estimator_name': 'backdoor.causaltune.models.Dummy'}}, {'estimator': {'estimator_name': 'backdoor.econml.metalearners.SLearner'}}, {'estimator': {'estimator_name': 'backdoor.econml.metalearners.DomainAdaptationLearner'}}, {'estimator': {'estimator_name': 'backdoor.econml.dr.ForestDRLearner', 'min_propensity': 1e-06, 'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 5, 'min_weight_fraction_leaf': 0.0, 'max_features': 'auto', 'min_impurity_decrease': 0.0, 'max_samples': 0.45, 'min_balancedness_tol': 0.45, 'honest': True, 'subforest_size': 4}}, {'estimator': {'estimator_name': 'backdoor.econml.dml.CausalForestDML', 'drate': True, 'n_estimators': 100, 'criterion': 'mse', 'min_samples_split': 10, 'min_samples_leaf': 5, 'min_weight_fraction_leaf': 0.0, 'max_features': 'auto', 'min_impurity_decr

The causal graph that CausalTune uses is 

In other problems of causal inference, one may seek to follow an instrumental variable approach. 

If there are well-known propensity modifiers, it is also possible to make those explicit. This can, e.g., be used to pass them 