# Exploring Instrumental Variables with the [HIV Simulator](https://whynot-docs.readthedocs-hosted.com/en/latest/simulators.html#adams-hiv-simulator)


This notebook demonstrates how to generate observational datasets with non-trivial confounding and uses these datasets to explore instrumental variables.

In [2]:
import whynot as wn
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression


# Instrumental Variables Background

Suppose that we measure a set of features $X_1,\dots,X_n$, and a target outcome $Y$, for multiple different units. Some fraction of the units receives a treatment; hence, we also have access to a binary variable $T$ which indicates whether the given unit was treated or not.

We are interested in finding the average causal effect of treating a unit. In the language of causality, we want to find
$$\mathbb{E}[Y|\text{do}(T=1)] - \mathbb{E}[Y|\text{do}(T=0)].$$

We assume that the outcome is generated as a linear function of the features and the treatment:
$$Y = \alpha T + \sum_{i=1}^n \beta_i X_i.$$
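
As a short check on why $\alpha$ is the target here, assume (this is not stated above) that the features are measured before treatment, so intervening on $T$ leaves each $X_i$ unchanged. Under the linear model, this gives

$$\mathbb{E}[Y|\text{do}(T=t)] = \alpha t + \sum_{i=1}^n \beta_i \mathbb{E}[X_i],$$

and therefore

$$\mathbb{E}[Y|\text{do}(T=1)] - \mathbb{E}[Y|\text{do}(T=0)] = \alpha.$$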

If the treatment is uncorrelated with the feature variables, ordinary least squares (OLS) is unbiased and recovers $\alpha$ in expectation. In practice, however, the treatment is often correlated with the features: a unit typically receives treatment because its features indicated that treatment was necessary in the first place.

One way to get around this issue is by using instrumental variables (IVs). A valid instrument $Z$ is a variable that is correlated with $T$, is independent of $X_1,\dots,X_n$, and affects $Y$ only through $T$. One way to estimate $\alpha$ is then to first predict $T$ from $Z$ (the prediction is denoted $\hat T$), and then regress $Y$ onto $\{\hat T, X_1,\dots,X_n\}$ (instead of $\{T, X_1,\dots,X_n\}$). When $T$ is continuous, one common approach to estimating $\alpha$ is two-stage least squares (2SLS), in which $\hat T$ is obtained by regressing $T$ onto $Z$.
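
The two-stage procedure can be sketched on synthetic data. This is a minimal illustration, not the notebook's experiment: the data-generating process, the unobserved confounder `U`, the coefficients, and the helper `fit_line` are all assumptions made up for the example. It uses a single continuous treatment and a single instrument, and compares plain OLS with 2SLS.

```python
import random


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y ~ a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(0)
alpha = 2.0   # true causal effect (hypothetical)
n = 20000

Z = [rng.gauss(0, 1) for _ in range(n)]   # instrument
U = [rng.gauss(0, 1) for _ in range(n)]   # unobserved confounder (assumption)
T = [z + u + rng.gauss(0, 1) for z, u in zip(Z, U)]
Y = [alpha * t + 3.0 * u + rng.gauss(0, 1) for t, u in zip(T, U)]

# Plain OLS of Y on T: biased, because U drives both T and Y.
_, ols_slope = fit_line(T, Y)

# Stage 1: predict T from Z. Stage 2: regress Y on the prediction.
a, b = fit_line(Z, T)
T_hat = [a + b * z for z in Z]
_, tsls_slope = fit_line(T_hat, Y)

print("OLS estimate:  {:.2f}".format(ols_slope))
print("2SLS estimate: {:.2f}".format(tsls_slope))
```

In this setup the OLS slope converges to $\alpha + 1$ rather than $\alpha$, while the 2SLS slope converges to $\alpha$, because $Z$ is independent of the confounder and moves $Y$ only through $T$.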

# Setting up the simulator

We design an experiment on the [HIV simulator](https://whynot-docs.readthedocs-hosted.com/en/latest/simulators.html#adams-hiv-simulator) to demonstrate how to use instrumental variables to solve non-trivial causal inference problems.
We consider an experiment where units (in this case, people) are more likely to receive effective treatment if their indicators of infection are worse. In other words, **treatment status is confounded with indicators of infection.**

First, we write a function to generate the initial state (covariates) for each unit.

In [10]:
def initial_covariate_distribution(rng):
    """Sample initial state by randomly perturbing the default state.
    
    Parameters
    ----------
        rng: numpy random number generator.
        
    Returns
    -------
        wn.hiv.State: Initial state of the simulator.
    """
    state = wn.hiv.State()
    state.uninfected_T1 *= rng.uniform(0.45, 2.15)
    state.infected_T1 *=  rng.uniform(0.45, 2.15)
    state.uninfected_T2 *=  rng.uniform(0.45, 2.15)
    state.infected_T2 *=  rng.uniform(0.45, 2.15)
    state.free_virus *=  rng.uniform(0.45, 2.15)
    state.immune_response *=  rng.uniform(0.45, 2.15)
    
    # Whether or not the unit is "enrolled in the study"
    state.instrument = int(rng.rand() < 0.5)
    return state

Next, we write a function describing the probability of treatment assignment.

In our model, the probability of treatment is higher if both immune response and free virus are above critical thresholds at the time of intervention. As an instrument, we suppose each unit is enrolled in the trial with probability 0.5. Only enrolled units can be treated; unenrolled units have treatment probability zero.

In [11]:
def treatment_propensity(intervention, untreated_run):
    """Probability of treating each unit.

    We are more likely to treat units with high immune response and free virus
    at the time of intervention.
    
    Parameters
    -----------
        intervention: whynot.simulator.hiv.Intervention
        untreated_run: whynot.dynamics.run
            Rollout of the simulator without treatment.

    Returns
    -------
        treatment_prob: Probability of assigning the unit to treatment.

    """
    # Only treat units if they are enrolled in the study
    run = untreated_run
    if run.initial_state.instrument > 0:
        if run[intervention.time].immune_response > 10 and run[intervention.time].free_virus > 1:
            return 0.8
        return 0.2
    return 0.
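
To see what this rule implies for the observed data, the sketch below applies the same thresholds to plain numbers and draws a binary treatment from the returned probability. It assumes that the experiment converts the propensity into a treatment indicator through a Bernoulli draw, which is an assumption about how `DynamicsExperiment` uses `propensity_scorer`. The function `propensity`, the helper `treated_rate`, and the uniform ranges for the two measurements are hypothetical and chosen only for illustration.

```python
import random


def propensity(enrolled, immune_response, free_virus):
    """Same rule as treatment_propensity, written on plain numbers."""
    if enrolled:
        if immune_response > 10 and free_virus > 1:
            return 0.8
        return 0.2
    return 0.0


rng = random.Random(0)
units = []
for _ in range(10000):
    enrolled = rng.random() < 0.5
    immune_response = rng.uniform(0, 20)   # hypothetical range
    free_virus = rng.uniform(0, 2)         # hypothetical range
    p = propensity(enrolled, immune_response, free_virus)
    treated = rng.random() < p             # Bernoulli(p) draw (assumption)
    above = immune_response > 10 and free_virus > 1
    units.append((enrolled, above, treated))


def treated_rate(rows):
    """Fraction of treated units among the given rows."""
    return sum(t for _, _, t in rows) / len(rows)


rate_unenrolled = treated_rate([u for u in units if not u[0]])
rate_enrolled_above = treated_rate([u for u in units if u[0] and u[1]])
rate_enrolled_below = treated_rate([u for u in units if u[0] and not u[1]])

print("treated | not enrolled:             {:.2f}".format(rate_unenrolled))
print("treated | enrolled, above thresholds: {:.2f}".format(rate_enrolled_above))
print("treated | enrolled, otherwise:        {:.2f}".format(rate_enrolled_below))
```

The empirical treatment rates track the three propensity values, which is what makes enrollment informative about treatment while the health indicators still confound it.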

Finally, we put these pieces together into a `DynamicsExperiment`. The observed covariates are the six state variables that describe the individual's health, along with the instrument. The target outcome is the concentration of infected macrophages in cells/ml (which should be lower after receiving treatment).

For detailed information on the space of configuration and intervention parameters, see [here](https://whynot-docs.readthedocs-hosted.com/en/latest/simulator_configs/hiv.html).

In [13]:
experiment = wn.DynamicsExperiment(
    name="hiv_confounding",
    description="Study effect of increasing drug efficacy on infected macrophages (cells/ml) under confounding.",
    # Which simulator to use
    simulator=wn.hiv,
    # Configuration parameters for each rollout. Run for 150 steps.
    simulator_config=wn.hiv.Config(epsilon_1=0.1, end_time=150),
    # What intervention to perform in the simulator. 
    # In time step 100, increase drug efficacy from 0.1 to 0.5
    intervention=wn.hiv.Intervention(time=100, epsilon_1=0.5),     
    # Initial distribution over covariates
    state_sampler=initial_covariate_distribution,
    # Treatment assignment rule
    propensity_scorer=treatment_propensity,
    # Measured outcome: Infected macrophages (cells/ml) at step 150
    outcome_extractor=lambda run: run[149].infected_T2,
    # Observed covariates: Covariates of each unit at time of treatment and the instrument
    covariate_builder=lambda intervention, run: np.append(run[100].values(), run.initial_state.instrument))

## Generating data

We gather data from 500 individuals, who are more likely to receive treatment if they show signs of severe infection.

In [None]:
dset = experiment.run(num_samples=500)

Since we can simulate counterfactual outcomes, we get the exact causal effect of receiving treatment for each individual, as well as the average causal effect.

In [8]:
print("The average causal effect of receiving treatment is: {:.2f}".format(dset.sate))

The average causal effect of receiving treatment is: 0.33


## Estimating treatment effects with OLS

In [9]:
# Split into covariates and the instrument
(observations, T, Y) = dset.covariates, dset.treatments, dset.outcomes
X, Z = observations[:, :-1], observations[:, -1:]

First we run plain OLS to estimate the average causal effect.

In [10]:
ols_predictors = X
ols_predictors = np.concatenate([T.reshape(-1,1), ols_predictors], axis=1)
ols_model = sm.OLS(Y, ols_predictors)
ols_results = ols_model.fit()
est_ols = ols_results.params[0] # treatment is the first predictor
ols_rel_error = np.abs((est_ols - dset.sate) / dset.sate)
print("Relative Error in causal estimate of OLS: {:.2f}".format(ols_rel_error))

Relative Error in causal estimate of OLS: 1.99
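
For reference, the quantity printed above is the relative error of the estimate $\hat\alpha$ against the sample average treatment effect (SATE) reported by `dset.sate`:

$$\text{relative error} = \left|\frac{\hat\alpha - \text{SATE}}{\text{SATE}}\right|.$$

With the rounded values shown here, a relative error of $1.99$ on an effect of $0.33$ means the OLS estimate misses the true effect by roughly $1.99 \times 0.33 \approx 0.66$, about twice the size of the effect itself.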


## Estimating treatment effects with instrumental variables

To eliminate the bias, we turn to instrumental variables. "Enrollment" in the study $Z$ is a valid instrumental variable in this setting. We first predict the treatment indicator $\hat T$ from the instrument $Z$ using logistic regression, and then run OLS to regress $Y$ onto $\hat T$ and the other variables.

In [11]:
instrument = Z - np.mean(Z)
logistic_model = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(instrument.reshape(-1,1),T)
T_hat = logistic_model.predict(instrument.reshape(-1,1))

In [12]:
iv_features = np.concatenate([T_hat.reshape(-1,1), X], axis=1)
iv_model = sm.OLS(Y, iv_features)
iv_results = iv_model.fit()

est_iv = iv_results.params[0]
iv_rel_error = np.abs((est_iv - dset.sate) / dset.sate)
print("Relative Error in causal estimate of IV: {:.5f}".format(iv_rel_error))

Relative Error in causal estimate of IV: 0.70935
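
For a binary instrument and a binary treatment, a commonly used alternative to the two-step regression above is the Wald estimator, which divides the effect of $Z$ on $Y$ by the effect of $Z$ on $T$. The sketch below is not the method used in this notebook and does not touch the HIV simulator: it runs on synthetic data, and the effect size `alpha`, the severity variable `s`, and all coefficients are hypothetical. It mimics the setting here in one respect only, namely that unenrolled units are never treated.

```python
import random

rng = random.Random(0)
alpha = 1.0   # true treatment effect in this synthetic example
rows = []
for _ in range(20000):
    z = int(rng.random() < 0.5)               # instrument: enrolled or not
    s = rng.random()                          # severity, confounds T and Y
    p = (0.8 if s > 0.5 else 0.2) if z else 0.0
    t = int(rng.random() < p)                 # only enrolled units can be treated
    y = alpha * t + 2.0 * s + rng.gauss(0, 0.5)
    rows.append((z, t, y))


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Naive contrast between treated and untreated units (confounded by severity).
naive = (mean(y for _, t, y in rows if t == 1)
         - mean(y for _, t, y in rows if t == 0))

# Wald estimator: effect of Z on Y divided by effect of Z on T.
dy = (mean(y for z, _, y in rows if z == 1)
      - mean(y for z, _, y in rows if z == 0))
dt = (mean(t for z, t, _ in rows if z == 1)
      - mean(t for z, t, _ in rows if z == 0))
wald = dy / dt

print("Naive difference in means: {:.2f}".format(naive))
print("Wald estimate:             {:.2f}".format(wald))
```

Because enrollment is assigned independently of severity, the ratio removes the confounding that inflates the naive contrast in this synthetic setup.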
