# Setting up the data and causal model: CausalityDataset

This notebook demonstrates how to build and configure a `CausalityDataset` from an arbitrary `pd.DataFrame`.


In [51]:
%load_ext autoreload
%autoreload 2
import os, sys
import warnings
warnings.filterwarnings('ignore') # suppress sklearn deprecation warnings for now..

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# if causaltune, dowhy, or FLAML are not installed, fall back to running them from source
root_path = os.path.realpath('../..')
try:
    import causaltune
except ModuleNotFoundError:
    sys.path.append(os.path.join(root_path, "causaltune"))

try:
    import dowhy
except ModuleNotFoundError:
    sys.path.append(os.path.join(root_path, "dowhy"))

try:
    import flaml
except ModuleNotFoundError:
    sys.path.append(os.path.join(root_path, "FLAML"))

from causaltune import CausalTune
from causaltune.datasets import synth_ihdp, iv_dgp_econml, generate_non_random_dataset
from causaltune.data_utils import CausalityDataset
from causaltune.dataset_processor import CausalityDatasetProcessor

In [2]:
# this makes the notebook expand to full width of the browser window
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

### Random assignment 
We first illustrate the model setup with a subset of data from the Infant Health and Development Program (IHDP).

In [3]:
df = synth_ihdp(return_df=True).iloc[:,:5]
display(df.head())

Unnamed: 0,treatment,y_factual,x1,x2,x3
0,1,5.599916,-0.528603,-0.343455,1.128554
1,0,6.875856,-1.736945,-1.802002,0.383828
2,0,2.996273,-0.807451,-0.202946,-0.360898
3,0,1.366206,0.390083,0.596582,-1.85035
4,0,1.963538,-1.045229,-0.60271,0.011465


Generally, at least three arguments have to be supplied to `CausalityDataset`:
- `data`: input dataframe
- `treatment`: name of treatment column
- `outcomes`: list of names of outcome columns; provide as list even if there's just one outcome of interest

In addition, if the propensities to treat are known, then provide the corresponding column name(s) via `propensity_modifiers`.

In [4]:
cd = CausalityDataset(data=df, treatment='treatment', outcomes=['y_factual'])

Next, call `cd.preprocess_dataset()` to handle missing values, remove outliers, and so on.

In [5]:
cd.preprocess_dataset()

By default, the causal model is built on the assumption that all remaining features are `effect_modifiers`:

In [6]:
print(cd.effect_modifiers)

['x1', 'x2', 'x3']
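
The default shown above amounts to treating every column without an explicitly assigned role as an effect modifier. A minimal sketch of that rule in plain Python follows. `infer_effect_modifiers` is a hypothetical helper written for illustration, not part of CausalTune, and the real implementation may differ in its details.

```python
def infer_effect_modifiers(columns, treatment, outcomes, assigned=()):
    """Return the columns left over once all explicitly assigned roles are removed.

    Hypothetical helper for illustration only; not CausalTune's implementation.
    """
    taken = {treatment, *outcomes, *assigned}
    return [c for c in columns if c not in taken]


# Columns of the IHDP subset used above.
ihdp_columns = ["treatment", "y_factual", "x1", "x2", "x3"]
remaining = infer_effect_modifiers(ihdp_columns, "treatment", ["y_factual"])
print(remaining)

# A column that has been given another role (here an instrument) is excluded as well.
iv_columns = ["x1", "x2", "x3", "x4", "y", "treatment", "Z"]
remaining_iv = infer_effect_modifiers(iv_columns, "treatment", ["y"], assigned=["Z"])
print(remaining_iv)
```

Passing `outcomes` as a list keeps the rule uniform for one or several outcome columns.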


Subsequently, use the preprocessed `CausalityDataset` object for training as follows: `ct.fit(data=cd, outcome='y_factual')`.

In [7]:
ct = CausalTune(components_time_budget=5)
ct.fit(data=cd, outcome='y_factual')

Fitting a Propensity-Weighted scoring estimator to be used in scoring tasks
Propensity Model Fitted Successfully


The causal graph that CausalTune uses is:

In [8]:
%matplotlib inline
ct.causal_model.view_model()

*Note that the variable `random` can be ignored and has no real meaning for the causal model.*

#### Adding common causes

If we have reason to assume that, for instance, `x1` and `x2` are common causes rather than effect modifiers, we can state this explicitly via `common_causes`:


In [9]:
cd = CausalityDataset(data=df, treatment='treatment', outcomes=['y_factual'], common_causes=['x1', 'x2'])

The causal graph becomes:

In [10]:
cd.preprocess_dataset()
ct = CausalTune(components_time_budget=5)
ct.fit(data=cd, outcome='y_factual')
ct.causal_model.view_model()

Fitting a Propensity-Weighted scoring estimator to be used in scoring tasks
Propensity Model Fitted Successfully
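
To see why declaring a variable as a common cause matters, consider a toy example that does not use CausalTune at all. The data below are constructed by hand (an assumption for illustration, not the IHDP data): `c` raises both the chance of treatment and the outcome, so a naive comparison of treated and untreated units overstates the effect, while adjusting for `c` recovers it.

```python
from statistics import mean

# Toy data: c is a common cause that raises both the chance of treatment and the outcome.
# The outcome is constructed as y = 2*t + 5*c, so the true treatment effect is 2.
rows = (
    [(0, 1)] * 2 + [(0, 0)] * 6    # c = 0: mostly untreated
    + [(1, 1)] * 6 + [(1, 0)] * 2  # c = 1: mostly treated
)
data = [{"c": c, "t": t, "y": 2 * t + 5 * c} for c, t in rows]


def diff_in_means(records):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    treated = [r["y"] for r in records if r["t"] == 1]
    control = [r["y"] for r in records if r["t"] == 0]
    return mean(treated) - mean(control)


# Ignoring c mixes the treatment effect with the effect of c itself.
naive = diff_in_means(data)

# Adjust for c: estimate the effect within each stratum, then weight by stratum size.
adjusted = sum(
    diff_in_means([r for r in data if r["c"] == level])
    * sum(r["c"] == level for r in data) / len(data)
    for level in (0, 1)
)
print(naive, adjusted)
```

Stratification is only one simple way to adjust; the estimators CausalTune searches over handle this differently.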


For next steps with CausalTune, see for instance the [random assignment, binary CATE example notebook](https://github.com/py-why/causaltune/blob/main/notebooks/Random%20assignment%2C%20binary%20CATE%20example.ipynb).

### Instrumental variable identification

In other problems of causal inference, one may seek to follow an instrumental variable approach ([Example notebook](https://github.com/py-why/causaltune/blob/main/notebooks/Comparing%20IV%20Estimators.ipynb)). 

In [11]:
# load data
df = iv_dgp_econml(p=4).data
del df['random']
print(df.head(5))

         x1        x2        x3        x4          y  treatment  Z
0 -0.662658  1.124321 -1.699940 -0.379268   5.236122          0  0
1 -0.788565  1.336684 -0.539586 -0.785838  12.039615          1  1
2 -0.344655 -0.204201 -1.267158  0.898114  23.469351          1  1
3  0.125284 -0.557028  0.403744  0.579168   5.300115          0  0
4  0.356507  0.330607  0.430286  1.201554  12.855370          0  0


Suppose we want to use $Z$ as an instrument.

In [12]:
cd = CausalityDataset(
    data=df, 
    treatment='treatment',
    outcomes=['y'],
    instruments=['Z']
    )
cd.preprocess_dataset()

In [13]:
print('Outcomes:', cd.outcomes)
print('Treatment:', cd.treatment)
print('Instruments:', cd.instruments)
print('Effect modifiers:', cd.effect_modifiers)

Outcomes: ['y']
Treatment: treatment
Instruments: ['Z']
Effect modifiers: ['x1', 'x2', 'x3', 'x4']


In [14]:
ct = CausalTune(
    components_time_budget=5,
    estimator_list=['iv.econml.iv.dml.DMLIV']
    )   
ct.fit(data=cd, outcome='y')

In [15]:
ct.causal_model.view_model()
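
As background on what an instrument buys us, here is a minimal sketch of the classic Wald (ratio) estimator on hand-made toy data. It is a simpler technique than the `DMLIV` estimator used above and is shown only to illustrate the idea. The data-generating rule and the helper `mean_where` are assumptions for illustration.

```python
from statistics import mean

# Toy data: u is an unobserved confounder, z a binary instrument that is
# independent of u and affects y only through the treatment t.
# The outcome is constructed as y = 3*t + 4*u, so the true effect is 3.
data = []
for z in (0, 1):
    for u in (0, 1):
        t = 1 if (z == 1 or u == 1) else 0
        data.append({"z": z, "t": t, "y": 3 * t + 4 * u})


def mean_where(records, key, value, target):
    """Mean of `target` over the records whose `key` equals `value`."""
    return mean(r[target] for r in records if r[key] == value)


# The naive comparison is biased because u drives both t and y.
naive = mean_where(data, "t", 1, "y") - mean_where(data, "t", 0, "y")

# Wald estimator: effect of z on y divided by effect of z on t.
wald = (
    (mean_where(data, "z", 1, "y") - mean_where(data, "z", 0, "y"))
    / (mean_where(data, "z", 1, "t") - mean_where(data, "z", 0, "t"))
)
print(naive, wald)
```

The ratio works here because the effect is constant and `z` is unrelated to `u` by construction; with real data those conditions have to be argued for.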

### Propensity modifiers

If the propensity modifiers are known, you can declare them explicitly. For example, known propensities can be passed directly to the model instead of fitting a propensity model (for more details, see the [Propensity Model Selection notebook](https://github.com/py-why/causaltune/blob/main/notebooks/Propensity%20Model%20Selection.ipynb)).

In [16]:
# load data
df = generate_non_random_dataset().data
del df['random']
print(df.head(5))

   T         Y        X1        X2        X3        X4        X5  propensity
0  0  1.650705  0.521524 -1.393497  0.010672 -0.828778  1.019257    0.245100
1  0 -0.888552 -0.782541 -1.384920 -0.233656  0.150249 -0.495169    0.205945
2  0 -0.516344 -0.154831 -0.098985  2.335176 -1.888928 -0.594854    0.235870
3  1  0.601679  0.109516  0.092910  0.525252 -1.172202 -0.177947    0.439021
4  0  0.569122 -0.365630 -0.343061 -0.420554 -0.995160  1.548502    0.335151


In [17]:
cd = CausalityDataset(
    data=df, 
    treatment='T',
    outcomes=['Y'],
    propensity_modifiers=['propensity']
    )
cd.preprocess_dataset()

In [18]:
print('Outcomes:', cd.outcomes)
print('Treatment:', cd.treatment)
print('Propensity Modifiers:', cd.propensity_modifiers)
print('Effect modifiers:', cd.effect_modifiers)

Outcomes: ['Y']
Treatment: T
Propensity Modifiers: ['propensity']
Effect modifiers: ['X1', 'X2', 'X3', 'X4', 'X5']


In [19]:
ct = CausalTune(
    components_time_budget=5,
)   
ct.fit(data=cd, outcome='Y')

Fitting a Propensity-Weighted scoring estimator to be used in scoring tasks
Propensity Model Fitted Successfully


In [20]:
ct.causal_model.view_model()
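
To illustrate why known propensities are useful, the sketch below applies plain inverse propensity weighting to hand-made toy data. This is a generic textbook estimator, not necessarily what CausalTune does internally. The data and the helper `ipw_ate` are assumptions for illustration.

```python
# Toy data with known propensities p = P(t = 1 | x): 0.25 when x = 0, 0.75 when x = 1.
# The outcome is constructed as y = t + 2*x, so the true average effect is 1.
rows = (
    [(0, 1)] * 1 + [(0, 0)] * 3    # x = 0: one in four treated
    + [(1, 1)] * 3 + [(1, 0)] * 1  # x = 1: three in four treated
)
data = [
    {"t": t, "y": t + 2 * x, "p": 0.25 if x == 0 else 0.75}
    for x, t in rows
]


def ipw_ate(records):
    """Inverse-propensity-weighted (Horvitz-Thompson) estimate of the average effect."""
    total = 0.0
    for r in records:
        if r["t"] == 1:
            total += r["y"] / r["p"]        # treated units weighted by 1 / p
        else:
            total -= r["y"] / (1 - r["p"])  # untreated units weighted by 1 / (1 - p)
    return total / len(records)


treated = [r["y"] for r in data if r["t"] == 1]
control = [r["y"] for r in data if r["t"] == 0]
naive = sum(treated) / len(treated) - sum(control) / len(control)
weighted = ipw_ate(data)
print(naive, weighted)
```

Because treatment is more likely when `x = 1`, the unweighted comparison is biased; reweighting by the known propensities removes that imbalance.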

### Pre-processing of the test dataset based on the training set
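Categorical columns in a `CausalityDataset` can be encoded with one of several common category encoders: OneHot, WoE, Label, or Target. The encoder is fitted on the training set and then applied to both the training and the test set.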

In [21]:
unique_values_1 = ['A', 'B', 'C', 'D', 'E']
unique_values_2 = ['F', 'G', 'H', 'I', 'J', 'K']
unique_values_3 = ['L', 'M', 'N', 'O', 'P', 'Q', 'R']

In [22]:
df_train = synth_ihdp(return_df=True).iloc[:,:5]
display(df_train.head())

Unnamed: 0,treatment,y_factual,x1,x2,x3
0,1,5.599916,-0.528603,-0.343455,1.128554
1,0,6.875856,-1.736945,-1.802002,0.383828
2,0,2.996273,-0.807451,-0.202946,-0.360898
3,0,1.366206,0.390083,0.596582,-1.85035
4,0,1.963538,-1.045229,-0.60271,0.011465


In [23]:
df_test = synth_ihdp(return_df=True).iloc[:,:5]
display(df_test.head())

Unnamed: 0,treatment,y_factual,x1,x2,x3
0,1,5.599916,-0.528603,-0.343455,1.128554
1,0,6.875856,-1.736945,-1.802002,0.383828
2,0,2.996273,-0.807451,-0.202946,-0.360898
3,0,1.366206,0.390083,0.596582,-1.85035
4,0,1.963538,-1.045229,-0.60271,0.011465


In [24]:
# Adding the category columns with random values
df_train['category_col1'] = np.random.choice(unique_values_1, len(df_train))
df_train['category_col2'] = np.random.choice(unique_values_2, len(df_train))
df_train['category_col3'] = np.random.choice(unique_values_3, len(df_train))

df_test['category_col1'] = np.random.choice(unique_values_1, len(df_test))
df_test['category_col2'] = np.random.choice(unique_values_2, len(df_test))
df_test['category_col3'] = np.random.choice(unique_values_3, len(df_test))

In [28]:
cd_train = CausalityDataset(
    data=df_train,
    treatment='treatment',
    outcomes=['y_factual'],
    effect_modifiers=['x1', 'x2', 'x3', 'category_col1', 'category_col2', 'category_col3']
)

cd_test = CausalityDataset(
    data=df_test,
    treatment='treatment',
    outcomes=['y_factual'],
    effect_modifiers=['x1', 'x2', 'x3', 'category_col1', 'category_col2', 'category_col3']
)

In [29]:
cd_train.data.head()

Unnamed: 0,treatment,y_factual,x1,x2,x3,category_col1,category_col2,category_col3,random
0,1,5.599916,-0.528603,-0.343455,1.128554,E,K,R,1
1,0,6.875856,-1.736945,-1.802002,0.383828,A,F,M,1
2,0,2.996273,-0.807451,-0.202946,-0.360898,D,H,O,0
3,0,1.366206,0.390083,0.596582,-1.85035,D,K,R,0
4,0,1.963538,-1.045229,-0.60271,0.011465,C,K,Q,0


In [30]:
cd_test.data.head()

Unnamed: 0,treatment,y_factual,x1,x2,x3,category_col1,category_col2,category_col3,random
0,1,5.599916,-0.528603,-0.343455,1.128554,B,H,M,1
1,0,6.875856,-1.736945,-1.802002,0.383828,C,I,M,0
2,0,2.996273,-0.807451,-0.202946,-0.360898,B,K,R,0
3,0,1.366206,0.390083,0.596582,-1.85035,C,H,P,1
4,0,1.963538,-1.045229,-0.60271,0.011465,A,H,O,1


Select the encoder via `encoder_type`, which accepts `"onehot"`, `"label"`, `"target"`, or `"woe"`:

In [31]:
dataset_processor = CausalityDatasetProcessor()
dataset_processor.fit(
    cd=cd_train,
    encoder_type="label"
)
cd_train = dataset_processor.transform(cd_train)
cd_test = dataset_processor.transform(cd_test)

In [32]:
cd_train.data.head()

Unnamed: 0,treatment,y_factual,x1,x2,x3,random,category_col1,category_col2,category_col3
0,1,5.599916,-0.528603,-0.343455,1.128554,1.0,1,1,1
1,0,6.875856,-1.736945,-1.802002,0.383828,1.0,2,2,2
2,0,2.996273,-0.807451,-0.202946,-0.360898,0.0,3,3,3
3,0,1.366206,0.390083,0.596582,-1.85035,0.0,3,1,1
4,0,1.963538,-1.045228,-0.60271,0.011465,0.0,4,1,4


In [33]:
cd_test.data.head()

Unnamed: 0,treatment,y_factual,x1,x2,x3,random,category_col1,category_col2,category_col3
0,1,5.599916,-0.528603,-0.343455,1.128554,1.0,5,3,2
1,0,6.875856,-1.736945,-1.802002,0.383828,0.0,4,4,2
2,0,2.996273,-0.807451,-0.202946,-0.360898,0.0,5,1,1
3,0,1.366206,0.390083,0.596582,-1.85035,1.0,4,3,6
4,0,1.963538,-1.045228,-0.60271,0.011465,1.0,2,3,3
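
The fit-on-train, transform-both pattern above can be sketched in plain Python. `SimpleLabelEncoder` is a hypothetical stand-in, not the encoder CausalTune uses. Its two rules are assumptions made for illustration: codes are assigned in order of first appearance, and unseen categories are mapped to `-1`.

```python
class SimpleLabelEncoder:
    """Hypothetical stand-in for a label encoder; not CausalTune's implementation."""

    def __init__(self, unseen_code=-1):
        self.unseen_code = unseen_code
        self.mapping_ = {}

    def fit(self, values):
        # Assign integer codes in order of first appearance in the training data.
        self.mapping_ = {}
        for v in values:
            self.mapping_.setdefault(v, len(self.mapping_) + 1)
        return self

    def transform(self, values):
        # Reuse the training mapping; categories never seen in training get unseen_code.
        return [self.mapping_.get(v, self.unseen_code) for v in values]


train = ["E", "A", "D", "D", "C"]
test = ["B", "C", "A", "E", "Z"]

# Fit once on the training column, then apply the same mapping to both sets.
encoder = SimpleLabelEncoder().fit(train)
train_codes = encoder.transform(train)
test_codes = encoder.transform(test)
print(train_codes)
print(test_codes)
```

Fitting only on the training data keeps the mapping identical across both sets and avoids leaking information from the test set into the encoding.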


### Example of model training on transformed data
With `outcome_model="auto"` in the `CausalTune` constructor, CausalTune searches a joint search space covering both the EconML estimators and the FLAML wrappers for common regressors. The previous behavior, refitting AutoML for each estimator, is available via `outcome_model="nested"`.

In [34]:
# training configs

# set evaluation metric
metric = "energy_distance"

# it's best to specify either time_budget or components_time_budget, 
# and let the other one be inferred; time in seconds
components_time_budget = 10

# specify training set size
train_size = 0.7
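
The metric name `energy_distance` refers to a statistical distance between two samples. As a rough illustration of the underlying quantity only (how CausalTune applies it when scoring estimators is not shown in this notebook), the squared energy distance between two one-dimensional samples can be sketched as follows. The function names are hypothetical.

```python
from itertools import product


def mean_abs_diff(a, b):
    """Average absolute difference over all pairs (x, y) with x from a and y from b."""
    return sum(abs(x - y) for x, y in product(a, b)) / (len(a) * len(b))


def energy_distance_sq(a, b):
    """Squared energy distance: 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    return 2 * mean_abs_diff(a, b) - mean_abs_diff(a, a) - mean_abs_diff(b, b)


# Identical samples are at distance zero; well-separated samples are not.
same = energy_distance_sq([0, 1], [0, 1])
shifted = energy_distance_sq([0, 0], [1, 1])
print(same, shifted)
```

Smaller values mean the two samples look more alike.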

In [35]:
ct = CausalTune(
    estimator_list=[
        "DomainAdaptationLearner",
        "CausalForestDML",
        "ForestDRLearner",
    ],
    metric=metric,
    verbose=1,
    components_time_budget=components_time_budget,
    train_size=train_size,
    outcome_model="auto",
)

In [50]:
# run causaltune
ct.fit(data=cd_train, outcome=cd_train.outcomes[0])

print('---------------------')
# return best estimator
print(f"Best estimator: {ct.best_estimator}")
# config of best estimator:
print(f"Best config: {ct.best_config}")
# best score:
print(f"Best score: {ct.best_score}")

In [37]:
predictions = ct.predict(cd_test)

In [49]:
predictions

### Using pre-processing in the model object
- Alternatively, pass `preprocess=True` to the `CausalTune` `fit` method to run the preprocessing automatically.
- Specify the encoder via `encoder_type`.
- For the `"woe"` and `"target"` encoders, also specify `encoder_outcome` (a binary target column); this is not needed for `"onehot"` or `"label"`.
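
The reason supervised encoders need a binary target becomes clear from how weight of evidence (WoE) is defined: each category is replaced by the log ratio of its share of positives to its share of negatives. A minimal sketch, assuming no smoothing and that every category contains both classes; `woe_table` is a hypothetical helper, not CausalTune's encoder:

```python
import math
from collections import Counter


def woe_table(categories, target):
    """Weight of evidence per category for a binary (0/1) target.

    Illustration only; assumes every category contains both classes
    (real encoders add smoothing to handle empty cells).
    """
    events = Counter(c for c, y in zip(categories, target) if y == 1)
    non_events = Counter(c for c, y in zip(categories, target) if y == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    return {
        c: math.log((events[c] / total_events) / (non_events[c] / total_non_events))
        for c in sorted(set(categories))
    }


categories = ["A", "A", "A", "A", "B", "B", "B", "B"]
target = [1, 1, 1, 0, 1, 0, 0, 0]
woe = woe_table(categories, target)
print(woe)
```

Label and one-hot encoders, by contrast, look only at the category values themselves, which is why they need no target column.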

In [39]:
unique_values_1 = ['A', 'B', 'C', 'D', 'E']
unique_values_2 = ['F', 'G', 'H', 'I', 'J', 'K']
unique_values_3 = ['L', 'M', 'N', 'O', 'P', 'Q', 'R']

df_train = synth_ihdp(return_df=True).iloc[:,:5]
df_test = synth_ihdp(return_df=True).iloc[:,:5]

df_train['category_col1'] = np.random.choice(unique_values_1, len(df_train))
df_train['category_col2'] = np.random.choice(unique_values_2, len(df_train))
df_train['category_col3'] = np.random.choice(unique_values_3, len(df_train))

df_test['category_col1'] = np.random.choice(unique_values_1, len(df_test))
df_test['category_col2'] = np.random.choice(unique_values_2, len(df_test))
df_test['category_col3'] = np.random.choice(unique_values_3, len(df_test))

In [40]:
df_train.head()

Unnamed: 0,treatment,y_factual,x1,x2,x3,category_col1,category_col2,category_col3
0,1,5.599916,-0.528603,-0.343455,1.128554,A,J,N
1,0,6.875856,-1.736945,-1.802002,0.383828,B,J,P
2,0,2.996273,-0.807451,-0.202946,-0.360898,A,J,P
3,0,1.366206,0.390083,0.596582,-1.85035,E,F,M
4,0,1.963538,-1.045229,-0.60271,0.011465,D,G,Q


In [41]:
df_test.head()

Unnamed: 0,treatment,y_factual,x1,x2,x3,category_col1,category_col2,category_col3
0,1,5.599916,-0.528603,-0.343455,1.128554,C,K,N
1,0,6.875856,-1.736945,-1.802002,0.383828,D,I,N
2,0,2.996273,-0.807451,-0.202946,-0.360898,A,G,P
3,0,1.366206,0.390083,0.596582,-1.85035,D,J,M
4,0,1.963538,-1.045229,-0.60271,0.011465,E,H,P


In [42]:
cd_train = CausalityDataset(
    data=df_train,
    treatment='treatment',
    outcomes=['y_factual'],
    effect_modifiers=['x1', 'x2', 'x3', 'category_col1', 'category_col2', 'category_col3']
)

cd_test = CausalityDataset(
    data=df_test,
    treatment='treatment',
    outcomes=['y_factual'],
    effect_modifiers=['x1', 'x2', 'x3', 'category_col1', 'category_col2', 'category_col3']
)

In [43]:
ct = CausalTune(
    estimator_list=[
        "DomainAdaptationLearner",
        "CausalForestDML",
        "ForestDRLearner",
    ],
    metric=metric,
    verbose=1,
    components_time_budget=components_time_budget,
    train_size=train_size,
    outcome_model="auto"
)

In [48]:
# run causaltune
ct.fit(data=cd_train, outcome=cd_train.outcomes[0], preprocess=True, encoder_type="label")

print('---------------------')
# return best estimator
print(f"Best estimator: {ct.best_estimator}")
# config of best estimator:
print(f"Best config: {ct.best_config}")
# best score:
print(f"Best score: {ct.best_score}")

In [45]:
predictions = ct.predict(cd_test, preprocess=True)

In [47]:
predictions