# Parametric g-formula: stochastic interventions
In the previous tutorial we went over the basics of the parametric g-formula, using `TimeFixedGFormula` for basic interventions (such as treat-all and treat-none). We can also use the g-formula to look at stochastic interventions. Stochastic interventions are treatment plans under which not everyone is necessarily treated; instead, some randomly chosen percentage of individuals are treated.

To estimate the g-formula for stochastic treatments, the process is fairly similar. However, instead of treating everyone, only some percentage are treated. Treatment is assigned at random to that percentage of individuals, and then the $\hat{Y_i^a}$ are predicted under that assignment and averaged. This process is repeated some number of times, and the average of the averaged potential outcomes is returned.
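The procedure described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not zEpid's internal code: the function `stochastic_g_formula`, the `predict_risk` callable, and the toy outcome model with its coefficients are all hypothetical, and the sketch assumes each individual is treated independently with probability `p`.

```python
import numpy as np


def stochastic_g_formula(predict_risk, covariates, p, n_draws=100, seed=None):
    """Average predicted risk when each person is treated with probability p.

    predict_risk is a hypothetical callable taking (treatment, covariates)
    and returning one predicted risk per individual.
    """
    rng = np.random.default_rng(seed)
    n = covariates.shape[0]
    draw_means = []
    for _ in range(n_draws):
        # Randomly assign treatment: each person is treated with probability p
        a = rng.binomial(n=1, p=p, size=n)
        # Predict outcomes under this assignment and average over individuals
        draw_means.append(predict_risk(a, covariates).mean())
    # Average of the averaged predictions across the repeated random draws
    return float(np.mean(draw_means))


def toy_risk(a, x):
    """Toy logistic outcome model with made-up coefficients (illustration only)."""
    logit = -1.0 - 0.5 * a + 0.3 * x
    return 1.0 / (1.0 + np.exp(-logit))


# Simulated covariate, standing in for the confounders of a real analysis
x = np.random.default_rng(0).normal(size=500)

risk_80 = stochastic_g_formula(toy_risk, x, p=0.8, seed=1)
risk_none = float(toy_risk(np.zeros(500), x).mean())
risk_all = float(toy_risk(np.ones(500), x).mean())
print('RD (80% treated vs. none):', risk_80 - risk_none)
```

In a real analysis, `predict_risk` would be replaced by predictions from the fitted outcome model. Because treatment lowers the risk in the toy model, the stochastic plan's risk falls between the treat-all and treat-none risks.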

For our example, we will return to the previous data set on ART among HIV-infected individuals and all-cause mortality. First, we will load the data (again ignoring missing data).

In [1]:
import numpy as np
import pandas as pd

import zepid 
from zepid import load_sample_data, spline
from zepid.causal.gformula import TimeFixedGFormula

print(zepid.__version__)

0.9.0


In [2]:
df = load_sample_data(timevary=False).drop(columns=['cd4_wk45'])
dfs = df.dropna(subset=['dead']).reset_index().copy()
dfs.info()

dfs[['cd4_rs1', 'cd4_rs2']] = spline(dfs, 'cd40', n_knots=3, term=2, restricted=True)
dfs[['age_rs1', 'age_rs2']] = spline(dfs, 'age0', n_knots=3, term=2, restricted=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   index   517 non-null    int64  
 1   id      517 non-null    int64  
 2   male    517 non-null    int64  
 3   age0    517 non-null    int64  
 4   cd40    517 non-null    int64  
 5   dvl0    517 non-null    int64  
 6   art     517 non-null    int64  
 7   dead    517 non-null    float64
 8   t       517 non-null    float64
dtypes: float64(2), int64(7)
memory usage: 36.5 KB


Similar to the previous tutorial, we initialize `TimeFixedGFormula` with the data set (`dfs`), our treatment variable (`art`), and our binary outcome (`dead`). Then we fit a regression model predicting all-cause mortality as a function of ART and our set of confounding variables (age, CD4 T-cell count, detectable viral load, gender).

In [3]:
g = TimeFixedGFormula(dfs, exposure='art', outcome='dead')
g.outcome_model(model='art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')

Outcome Model
                 Generalized Linear Model Regression Results                  
Dep. Variable:                   dead   No. Observations:                  517
Model:                            GLM   Df Residuals:                      507
Model Family:                Binomial   Df Model:                            9
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -202.83
Date:                Wed, 30 Dec 2020   Deviance:                       405.67
Time:                        08:22:23   Pearson chi2:                     534.
No. Iterations:                     6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.9822      2.621     -

However, this time we do some background research and find that one potential intervention to increase ART prescriptions raises the probability of ART treatment to 80%. As a result, it is potentially misleading to compare the treat-all vs treat-none scenarios. Instead, we will compare the stochastic treatment where 80% of individuals are treated with ART to the scenario where no one is treated.

## Stochastic Treatment Plans
To do this using `TimeFixedGFormula`, we call the `fit_stochastic()` function instead of `fit()`. This function allows us to estimate a stochastic treatment. We specify `p=0.8` to have 80% of the population treated at random. By default, `fit_stochastic()` repeats this process 100 times and takes the average of these repeated random treatments. I will also use the `seed` argument to get replicable results. Let's look at the example.

In [4]:
g.fit_stochastic(p=0.8, seed=1000191)
r_80 = g.marginal_outcome

g.fit(treatment='none')
r_none = g.marginal_outcome

print('RD:', r_80 - r_none)

RD: -0.060414048704137496


Under the treatment plan where 80% of people are randomly treated, the risk of all-cause mortality would have been 6.0 percentage points lower than if no one had been treated.
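As a rough way to reason about this number, consider a short derivation. It assumes that each individual's treatment $A_i$ is drawn independently of their covariates with probability $p$, and it writes $\hat{Y}_i^{a=1}$ and $\hat{Y}_i^{a=0}$ for the model's predictions under treatment and no treatment (notation introduced here for illustration). Under that assumption, the expected marginal risk is a weighted average of the treat-all and treat-none risks:

$$
E\left[\frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i^{A_i}\right]
= \frac{1}{n}\sum_{i=1}^{n}\left[p\,\hat{Y}_i^{a=1} + (1-p)\,\hat{Y}_i^{a=0}\right]
= p\,\hat{R}_{all} + (1-p)\,\hat{R}_{none}
$$

$$
\widehat{RD}_{p} = p\,\hat{R}_{all} + (1-p)\,\hat{R}_{none} - \hat{R}_{none} = p\left(\hat{R}_{all} - \hat{R}_{none}\right)
$$

With $p=0.8$, the risk difference printed above would correspond to a treat-all versus treat-none risk difference of roughly $-0.0604 / 0.8 \approx -0.076$, up to Monte Carlo error from the finite number of random draws.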

After reading some more articles, we find an alternative treatment plan. Under this plan, 75% of men and 90% of women start using ART. For this plan, we are interested in a conditional stochastic treatment. Again, we want to compare this to the scenario where no one is treated.

## Conditional Stochastic Treatment Plans
For conditionally stochastic treatments, we instead pass `p` a list of probabilities. Additionally, we specify the `conditional` argument with the group restrictions, one per probability and in the same order. Again, we will need to use the magic-g functionality, referring to the data as `g` inside each condition string. Below is the example of the stochastic plan where 75% of men are treated and 90% of women are treated.

In [5]:
g.fit_stochastic(p=[0.75, 0.90], conditional=["g['male']==1", "g['male']==0"], seed=518012)
r_cs = g.marginal_outcome

print('RD:', r_cs - r_none)

RD: -0.05865619552516273


Under the treatment plan where 75% of men and 90% of women are randomly treated, the risk of all-cause mortality would have been 5.9 percentage points lower than if no one had been treated. This plan reduces the marginal mortality less than the previous stochastic plan because our HIV-infected population is predominantly men.
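The role of the sex distribution can be illustrated with a small NumPy sketch. It is not zEpid's implementation: the helper `assign_conditional` is hypothetical, and the simulated population (80% men) is made up for illustration rather than taken from the tutorial's data. If a fraction $m$ of the population is male, the expected share treated is $0.75m + 0.90(1-m)$, which falls below 80% whenever $m > 2/3$.

```python
import numpy as np


def assign_conditional(groups, probs, seed=None):
    """Draw a 0/1 treatment per person using a group-specific probability.

    groups: list of boolean masks, one per subgroup (hypothetical helper input)
    probs:  list of treatment probabilities, in the same order as groups
    """
    rng = np.random.default_rng(seed)
    treated = np.zeros(groups[0].shape[0], dtype=int)
    for mask, p in zip(groups, probs):
        # Within each subgroup, treat each person with that subgroup's probability
        treated[mask] = rng.binomial(n=1, p=p, size=int(mask.sum()))
    return treated


# Made-up population with 80% men (illustrative, not the tutorial's data)
rng = np.random.default_rng(0)
male = rng.binomial(n=1, p=0.8, size=10_000)

a = assign_conditional([male == 1, male == 0], [0.75, 0.90], seed=1)

# Expected overall share treated, given the observed fraction of men
expected = 0.75 * male.mean() + 0.90 * (1 - male.mean())
print('Share treated:', a.mean(), 'Expected:', expected)
```

In this simulated population the overall share treated is below 80%, which is why a plan like this one can yield a smaller marginal risk reduction than treating 80% of everyone at random.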

# Conclusion
In this tutorial, I detailed stochastic treatment plans using the g-formula. While presented for a binary outcome, the same procedure can also be used to estimate stochastic treatments for continuous outcomes. Please view the other tutorials for information on other functions in *zEpid*.

## Further Readings
Ahern et al. (2016) "Predicting the population health impacts of community interventions: the case of alcohol outlets and binge drinking." *AJPH* 106.11: 1938-1943.

Snowden et al. (2011) "Implementation of G-computation on a simulated data set: demonstration of a causal inference technique." *AJE* 173.7: 731-738.

Robins. (1986) "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect." *Mathematical modelling* 7.9-12: 1393-1512