# Iterative-Conditional G-Formula
For longitudinal data, an alternative to the Monte-Carlo g-formula is the iterative conditional g-formula. This expression of the g-formula has some nice features. First, we note that we can rewrite the g-formula from
$$E[Y^{\bar{a}}] = \sum_{\bar{l}} E[Y_t | \bar{A}_t = \bar{a}_t, \bar{L}_t = \bar{l}_t] \times \prod_{k=1}^{t} \Pr(L_k = l_k | \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_{k-1} = \bar{l}_{k-1})$$
as
$$E[Y^{\bar{a}}] = E[\dots E[E[Y_t | \bar{A}_t = \bar{a}_t, \bar{L}_t] | \bar{A}_{t-1} = \bar{a}_{t-1}, \bar{L}_{t-1}] \dots]$$
This different form means we can use a different estimation procedure to obtain an estimate of the counterfactual mean. Specifically, we need to estimate a series (iterations) of conditional expected values. The main advantage of this approach is that we no longer need to specify regression models *for each* time-varying variable. Rather, we only need to specify outcome models.
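
As a worked illustration of why the two forms agree, consider a sketch with only two time points and discrete covariates. The time ordering $L_1, A_1, L_2, A_2, Y_2$ and the plan $\bar{a} = (a_1, a_2)$ are assumptions made here for concreteness. Grouping the sums from the inside out gives

$$\begin{aligned}
E[Y_2^{\bar{a}}] &= \sum_{l_1} \sum_{l_2} E[Y_2 | \bar{A}_2 = \bar{a}, L_1 = l_1, L_2 = l_2] \Pr(L_2 = l_2 | A_1 = a_1, L_1 = l_1) \Pr(L_1 = l_1) \\
&= \sum_{l_1} E\big[ E[Y_2 | \bar{A}_2 = \bar{a}, \bar{L}_2] \,\big|\, A_1 = a_1, L_1 = l_1 \big] \Pr(L_1 = l_1) \\
&= E\Big[ E\big[ E[Y_2 | \bar{A}_2 = \bar{a}, \bar{L}_2] \,\big|\, A_1 = a_1, L_1 \big] \Big]
\end{aligned}$$

The sum over $l_2$ becomes a conditional expectation given the history up to time 1, and the sum over $l_1$ becomes a marginal expectation, so no model for the distribution of $L_2$ is required.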

## Estimation Procedure
For estimation, we work backwards in time, starting at the innermost of the iterated expectations and working outward. The estimation procedure is as follows:
1. Estimate $E[Y_t | \bar{A}_t = \bar{a}_t, \bar{L}_t]$
2. Predict $Q_t$ using the model fit in step 1 under the counterfactual treatment plan $\bar{a}$
3. Estimate $E[Q_t | \bar{A}_{t-1} = \bar{a}_{t-1}, \bar{L}_{t-1}]$
4. Predict $Q_{t-1}$ using the model fit in step 3 under plan $\bar{a}$
5. Repeat steps 3-4 until $Q_1$
6. Predict the mean outcome for time $t$ as $\bar{Q}_0 = E[Q_1]$

Individuals who have the event prior to time $t$ do not contribute to the iterative conditional procedure until the backward pass reaches a time at which they are observed in the sample. At that time, their observed outcome is used instead of their predicted outcome. Afterwards, predicted outcomes are used for the remaining steps.
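
The steps above can be sketched from scratch for two time points. This is a simplified illustration rather than what `IterativeCondGFormula` does internally: it assumes a continuous outcome, linear models fit by ordinary least squares, no censoring or prior events, and a made-up data-generating model whose true mean under always-treat is known. All variable names and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2020)
n = 20_000


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))


def fit_ols(X, y):
    """Ordinary least squares coefficients for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Hypothetical wide-format data: L1 -> A1 -> L2 -> A2 -> Y2
L1 = rng.normal(size=n)
A1 = rng.binomial(1, expit(0.5 * L1))
L2 = 0.5 * L1 + 0.5 * A1 + rng.normal(size=n)
A2 = rng.binomial(1, expit(0.5 * L2 + 0.5 * A1))
Y2 = 1.0 * A1 + 1.0 * A2 + 0.3 * L1 + 0.5 * L2 + rng.normal(size=n)

ones = np.ones(n)
plan = (1, 1)  # always-treat plan (a1, a2)

# Step 1: model for Y2 given treatment and covariate history
beta2 = fit_ols(np.column_stack([ones, A1, A2, L1, L2]), Y2)

# Step 2: predict Q2 with treatments set to the plan
Q2 = np.column_stack([ones, plan[0] * ones, plan[1] * ones, L1, L2]) @ beta2

# Step 3: model for Q2 given history up to time 1
beta1 = fit_ols(np.column_stack([ones, A1, L1]), Q2)

# Step 4: predict Q1 with the time-1 treatment set to the plan
Q1 = np.column_stack([ones, plan[0] * ones, L1]) @ beta1

# Steps 5-6: nothing left to iterate over, so average Q1
estimate = Q1.mean()

# Under this simulation, E[Y2^(1,1)] = 1 + 1 + 0.3*E[L1] + 0.5*E[L2 | a1=1] = 2.25
truth = 2.25
print("ICE estimate:", estimate, "truth:", truth)
```

Note that no model for $L_2$ is ever fit; only the two outcome regressions are needed.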

In the following example, we use a simulated data set that comes with *zEpid*. We are interested in the time-varying risk of $Y$ under several different treatment strategies. First, we load the longitudinal data.

In [1]:
import numpy as np
import pandas as pd

import zepid
from zepid import load_longitudinal_data
from zepid.causal.gformula import IterativeCondGFormula

print(zepid.__version__)

0.9.0


In [2]:
df = load_longitudinal_data()
df.head()

Unnamed: 0,W,L1,A1,Y1,L2,A2,Y2,L3,A3,Y3,id
0,0.148227,0.500839,0,0,0.588373,1.0,0.0,-0.166033,1.0,1.0,0
1,0.353487,0.856948,1,0,-1.441675,1.0,0.0,-1.521839,1.0,0.0,1
2,-1.08725,-1.175678,0,0,0.401431,1.0,0.0,-0.802022,0.0,0.0,2
3,0.247096,-1.334343,0,0,-0.428034,0.0,0.0,-0.092409,0.0,0.0,3
4,-0.15684,0.768438,1,0,-0.519126,1.0,0.0,-1.125145,1.0,0.0,4


Note that this data is set up differently from the data for `MonteCarloGFormula`. Here, each row corresponds to a single individual, and the columns for time-varying variables are indexed by time-point numbers. This layout is referred to as wide format, and the input data for `IterativeCondGFormula` must be in this format. The wide format allows specification of complex treatments and varying models.
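
If your own data are stored in long format (one row per person-time), they need to be reshaped first. Below is a minimal sketch using pandas on a small made-up data set; the column names `id`, `time`, `L`, `A`, and `Y` are hypothetical and are not part of the zEpid data.

```python
import pandas as pd

# Hypothetical long-format data: one row per individual per time point
long_df = pd.DataFrame({
    "id":   [0, 0, 0, 1, 1, 1],
    "time": [1, 2, 3, 1, 2, 3],
    "L":    [0.5, 0.6, -0.2, 0.9, -1.4, -1.5],
    "A":    [0, 1, 1, 1, 1, 1],
    "Y":    [0, 0, 1, 0, 0, 0],
})

# Spread each time-varying variable into one column per time point
wide = long_df.pivot(index="id", columns="time", values=["L", "A", "Y"])

# Flatten the (variable, time) column labels into names such as 'L1', 'A2', 'Y3'
wide.columns = [f"{var}{t}" for var, t in wide.columns]
wide = wide.reset_index()

print(wide)
```

The resulting columns (`L1`, `A1`, `Y1`, and so on) follow the same naming pattern as the example data used in this tutorial.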

## Initialize the g-formula
Now that our data is loaded, we will initialize the iterative-conditional g-formula. Instead of passing a single string to the `exposures` and `outcomes` arguments, we provide lists of strings corresponding to our exposure and outcome columns. Note that the order of these columns is forward in time.

In [3]:
icgf = IterativeCondGFormula(df, 
                             exposures=['A1', 'A2', 'A3'], 
                             outcomes=['Y1', 'Y2', 'Y3'])

This g-formula implementation will estimate the marginal outcome at the third time point ($E[Y_{t=3}^{\bar{a}}]$). For estimation at other time points, we will need to modify the g-formula, which we will do later.

## Outcome Models
Our next step is to specify the outcome models. Again, we pass a list of `patsy` regression formulas with time going forward. Note that treatments from earlier time points can be included in the model for each time point. In our data, we will assume the outcome depends only on the current covariate, the current treatment, and the treatment from the prior time period.

In [4]:
# Specifying regression models for each treatment-outcome pair
icgf.outcome_model(models=['A1 + L1',
                           'A2 + A1 + L2',
                           'A3 + A2 + L3'],
                   print_results=True)

These models will not be fit until `fit()` is called. Until then, the model specifications are stored in the background.

## Estimation
We can now estimate $E[Y_{t=3}^{\bar{a}}]$. We will begin by specifying that $\bar{a} = \{1, 1, 1\}$, the always-treat counterfactual. Below is code to estimate this.

In [5]:
# Estimating marginal 'Y3' under treat-all at every time
icgf.fit(treatments=[1, 1, 1])
print(icgf.marginal_outcome)

Sequential Outcome Model
                 Generalized Linear Model Regression Results                  
Dep. Variable:                     Y3   No. Observations:                  591
Model:                            GLM   Df Residuals:                      587
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -254.00
Date:                Wed, 30 Dec 2020   Deviance:                       508.00
Time:                        08:09:04   Pearson chi2:                     584.
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.2584      

The regression models display their output in this example because `print_results=True` was set. As you can see, each successive model has additional observations. This follows from the estimation procedure described above: individuals who were censored or had the event prior to a time point do not factor into the models until they are observed.

We can also specify different treatments. For example, we may imagine that treatment $A$ is only provided at the initial time ($t=1$) and never again after that. Below is code to estimate this counterfactual.

In [6]:
icgf.outcome_model(models=['A1 + L1',
                           'A2 + A1 + L2',
                           'A3 + A2 + L3'],
                   print_results=False)
icgf.fit(treatments=[1, 0, 0])
print(icgf.marginal_outcome)

0.5466511436458725


Or we can estimate the outcome when treating only at $t=2$.

In [7]:
icgf.fit(treatments=[0, 1, 0])
print(icgf.marginal_outcome)

0.5650101131697806


Under this format, we can estimate a multitude of different interventions. For confidence intervals, a bootstrapping procedure should be used. 
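
One way to set up such a bootstrap is sketched below. The helper `bootstrap_ci` and the function `estimate_always_treat` are hypothetical names introduced for illustration, not part of zEpid. The estimator function reuses only the `IterativeCondGFormula` calls shown in this tutorial. Because the data are in wide format, resampling rows with replacement is the same as resampling individuals. The percentile interval is one of several possible choices.

```python
import numpy as np
import pandas as pd


def bootstrap_ci(data, estimator, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for estimator(data).

    data      : wide-format DataFrame, one row per individual
    estimator : function taking a DataFrame and returning a single number
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample individuals (rows) with replacement
        idx = rng.integers(0, n, size=n)
        resampled = data.iloc[idx].reset_index(drop=True)
        estimates[b] = estimator(resampled)
    lower, upper = np.percentile(estimates,
                                 [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


def estimate_always_treat(d):
    """Refit the iterative conditional g-formula on a resampled data set."""
    # Imported here so the helper above can be used without zEpid installed
    from zepid.causal.gformula import IterativeCondGFormula

    g = IterativeCondGFormula(d,
                              exposures=['A1', 'A2', 'A3'],
                              outcomes=['Y1', 'Y2', 'Y3'])
    g.outcome_model(models=['A1 + L1',
                            'A2 + A1 + L2',
                            'A3 + A2 + L3'],
                    print_results=False)
    g.fit(treatments=[1, 1, 1])
    return g.marginal_outcome
```

With the tutorial's data, a call would look like `bootstrap_ci(df, estimate_always_treat)`. In practice, more resamples than the default of 200 used in this sketch may be preferable.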

But how do we estimate at different time points? For example, suppose we now want to estimate $E[Y_{t=2}^{\bar{a}}]$. We do this by re-specifying the iterative conditional g-formula with only the first two columns for our treatment and outcome. Below is code to estimate $E[Y_{t=2}^{\bar{a}}]$, where $\bar{a} = \{1, 1\}$.

In [8]:
icgf = IterativeCondGFormula(df, exposures=['A1', 'A2'], outcomes=['Y1', 'Y2'])
icgf.outcome_model(models=['A1 + L1',
                           'A2 + A1 + L2'],
                   print_results=False)
icgf.fit(treatments=[1, 1])
print(icgf.marginal_outcome)

0.3496200889449614


We can repeat a similar process for $E[Y_{t=1}^{\bar{a}}]$. This should give the same result as `TimeFixedGFormula`, since the two estimation procedures coincide when there is only a single time point.

In [9]:
icgf = IterativeCondGFormula(df, exposures=['A1'], outcomes=['Y1'])
icgf.outcome_model(models=['A1 + L1'],
                   print_results=False)
icgf.fit(treatments=[1])
print("Iterative-Conditional:\t", icgf.marginal_outcome)

# Demonstrating equivalence
from zepid.causal.gformula import TimeFixedGFormula
g = TimeFixedGFormula(df[['L1', 'A1', 'Y1']], exposure='A1', outcome='Y1')
g.outcome_model(model='A1 + L1', print_results=False)
g.fit(treatment='all')
print("Time-Fixed:\t\t", g.marginal_outcome)

Iterative-Conditional:	 0.228307987340576
Time-Fixed:		 0.228307987340576


# Conclusion
In this tutorial, I discussed the iterative conditional estimation procedure for the parametric g-formula and detailed the use of `IterativeCondGFormula`. Please view the other tutorials for further information on other functions in zEpid.

## Further Readings
Kreif N et al. (2017). Estimating the comparative effectiveness of feeding interventions in the pediatric intensive care unit: a demonstration of longitudinal targeted maximum likelihood estimation. *American Journal of Epidemiology*, 186(12), 1370-1379