# Targeted Maximum Likelihood Estimator
The targeted maximum likelihood estimator (TMLE) is a doubly robust estimator. What distinguishes it from other doubly robust estimators (such as augmented IPTW) is a secondary targeting step (hence the name) that optimizes the bias-variance tradeoff for the target parameter. A common target parameter is the sample average treatment effect, which compares the counterfactual where all individuals in the study sample were treated with the counterfactual where no individual was treated. Throughout this section, I will use the notation that is common in the TMLE literature, which differs slightly from the notation in other documents.

The target parameter is defined as
$$\psi = E_L\left[E\left[Y^{a=1}|L\right] - E\left[Y^{a=0}|L\right]\right]$$
We will focus on this parameter, but other measures (like the risk ratio and odds ratio) are also implemented.

## Doubly Robust Estimators
Before continuing, I will briefly outline what a doubly-robust estimator is and why you would want to use one. In observational research with high-dimensional data, we (generally) are forced to use parametric models to adjust for many confounders. In this scenario, we assume that our parametric models are correctly specified. Our statistical model, $\mathcal{M}$, must include the distribution that the data came from. 

With other estimators, like IPTW or g-formula, we have one chance to specify $\mathcal{M}$ correctly. Doubly-robust estimators use a model to predict the treatment (like IPTW) and another model to predict the outcome (like g-formula). The estimator then combines the estimates, such that if either is correct, then our estimate will be consistent. Essentially, we get two chances to get the statistical model correct.
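To make the "two chances" idea concrete, here is a minimal sketch that evaluates the augmented-IPTW estimating function exactly over a small hypothetical population with a single binary confounder. All probabilities are made up for illustration, and the "misspecified" models simply ignore $L$. It is not part of *zEpid*; it only illustrates the double-robustness property.

```python
from itertools import product

# Hypothetical population (all numbers are illustrative assumptions)
P_L = {0: 0.5, 1: 0.5}                   # Pr(L = l)
PI_TRUE = {0: 0.2, 1: 0.6}               # Pr(A = 1 | L = l)
Q_TRUE = {(1, 0): 0.1, (0, 0): 0.2,      # Pr(Y = 1 | A = a, L = l), keyed by (a, l)
          (1, 1): 0.3, (0, 1): 0.5}

# Misspecified models that ignore L (they only match the marginal probabilities)
PI_WRONG = {0: 0.4, 1: 0.4}
Q_WRONG = {(1, 0): 0.25, (1, 1): 0.25, (0, 0): 0.30, (0, 1): 0.30}


def aipw(pi_model, q_model):
    """Population value of the augmented-IPTW estimator under the given models."""
    total = 0.0
    for l, a, y in product((0, 1), repeat=3):
        # Probability of this (L, A, Y) cell under the true distribution
        p_a = PI_TRUE[l] if a == 1 else 1 - PI_TRUE[l]
        p_y = Q_TRUE[(a, l)] if y == 1 else 1 - Q_TRUE[(a, l)]
        p_cell = P_L[l] * p_a * p_y
        q1, q0 = q_model[(1, l)], q_model[(0, l)]
        pi1 = pi_model[l]
        # Outcome-model contrast plus inverse-probability-weighted residuals
        contribution = (q1 - q0
                        + a / pi1 * (y - q1)
                        - (1 - a) / (1 - pi1) * (y - q0))
        total += p_cell * contribution
    return total


truth = sum(P_L[l] * (Q_TRUE[(1, l)] - Q_TRUE[(0, l)]) for l in (0, 1))
both_right = aipw(PI_TRUE, Q_TRUE)
only_outcome_right = aipw(PI_WRONG, Q_TRUE)
only_treatment_right = aipw(PI_TRUE, Q_WRONG)
both_wrong = aipw(PI_WRONG, Q_WRONG)

print(f"truth:                   {truth:.3f}")
print(f"both models correct:     {both_right:.3f}")
print(f"only outcome correct:    {only_outcome_right:.3f}")
print(f"only treatment correct:  {only_treatment_right:.3f}")
print(f"both models wrong:       {both_wrong:.3f}")
```

In this toy population the estimator recovers the true risk difference whenever at least one of the two models is correct, and it is biased only when both are wrong.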

A more in-depth description of doubly robust estimators is available in [this pre-print](https://statnav.files.wordpress.com/2017/10/doublerobustness-preprint.pdf).

## TMLE Procedure
I will briefly outline how TMLE (complete-case) is estimated.

1) Initial estimates for $Y$ are predicted from a statistical model. Predicted values of $Y$ are generated under each treatment, $Y^{a=1}$ and $Y^{a=0}$. This outcome model is commonly referred to as $Q_0$
$$E\left[Y | A, L\right]$$

2) Predicted probabilities of treatment are estimated from a second statistical model. 
$$\Pr(A=1|L)$$

3) Using the predicted probabilities from step 2, we calculate what is referred to as the "clever covariate". With $\pi_1 = \Pr(A=1|L)$ and $\pi_0 = \Pr(A=0|L) = 1 - \pi_1$, clever covariates are calculated for each individual using the following formula
$$ H(A=a,L) = \frac{I(A=1)}{\pi_1} - \frac{I(A=0)}{\pi_0}$$

4) Calculate the updated counterfactual outcomes $Q_n$ through the targeting step. For a single targeting step, we fit the following logistic regression model
$$logit\left(E(Y|A,L)\right) = logit(\hat{Y}^a) + \sigma * H$$
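where the logit of the initial predicted outcome at the observed treatment, $logit(\hat{Y}^a)$, is an offset, so $\sigma$ is the only coefficient estimated.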

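5) From the targeting step, we predict the targeted estimates via
$$\hat{Y}_*^1 = expit\left(logit(\hat{Y}^1) + \sigma * H(A=1,L)\right)$$
$$\hat{Y}_*^0 = expit\left(logit(\hat{Y}^0) + \sigma * H(A=0,L)\right)$$
where $expit$ is the inverse of the logit function (it returns the predictions to the probability scale), $H(A=1,L) = 1/\pi_1$, and $H(A=0,L) = -1/\pi_0$. Then, from the predicted individual outcomes, we estimate the target parameter using
$$\hat{\psi} = \frac{1}{n} \sum_{i=1}^n \left(\hat{Y}_{*,i}^1 - \hat{Y}_{*,i}^0\right)$$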
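Steps 3 through 5 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the *zEpid* implementation: the inputs are hypothetical values for eight individuals, the helper `fit_sigma` is my own name for a one-coefficient logistic regression with an offset and no intercept, and the predicted probabilities from steps 1 and 2 are taken as given.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def fit_sigma(y, offset, h, max_iter=50, tol=1e-12):
    """Logistic regression with an offset, no intercept, one coefficient (Newton-Raphson)."""
    sigma = 0.0
    for _ in range(max_iter):
        mu = [expit(o + sigma * hi) for o, hi in zip(offset, h)]
        score = sum(hi * (yi - mi) for hi, yi, mi in zip(h, y, mu))
        information = sum(hi ** 2 * mi * (1 - mi) for hi, mi in zip(h, mu))
        step = score / information
        sigma += step
        if abs(step) < tol:
            break
    return sigma


# Hypothetical inputs for 8 individuals (illustrative values, not the zEpid data)
y = [1, 0, 0, 1, 0, 1, 0, 0]                            # observed outcome
a = [1, 1, 1, 1, 0, 0, 0, 0]                            # observed treatment
pi1 = [0.6, 0.5, 0.7, 0.4, 0.3, 0.5, 0.4, 0.6]          # step 2: Pr(A=1|L)
q1 = [0.30, 0.20, 0.25, 0.35, 0.20, 0.30, 0.25, 0.30]   # step 1: prediction under a=1
q0 = [0.40, 0.30, 0.35, 0.50, 0.30, 0.45, 0.35, 0.40]   # step 1: prediction under a=0

# Step 3: clever covariate evaluated at the observed treatment
h = [ai / p - (1 - ai) / (1 - p) for ai, p in zip(a, pi1)]

# Step 4: estimate sigma, using the initial prediction at the observed treatment as offset
offset = [logit(q1i if ai == 1 else q0i) for ai, q1i, q0i in zip(a, q1, q0)]
sigma = fit_sigma(y, offset, h)

# Step 5: update both counterfactual predictions and average their difference
q1_star = [expit(logit(q) + sigma / p) for q, p in zip(q1, pi1)]         # H(A=1,L) = 1/pi_1
q0_star = [expit(logit(q) - sigma / (1 - p)) for q, p in zip(q0, pi1)]   # H(A=0,L) = -1/pi_0
psi = sum(t - u for t, u in zip(q1_star, q0_star)) / len(y)

print(f"sigma = {sigma:.3f}, psi = {psi:.3f}")
```

Note that the updated predictions are mapped back through `expit`, so they remain valid probabilities between 0 and 1.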

For a more in-depth discussion, please refer to [Schuler and Rose 2017](https://academic.oup.com/aje/article/185/1/65/2662306).

## An example
To motivate our example, we will use a simulated data set included with *zEpid*. In the data set, we have a cohort of HIV-positive individuals. We are interested in the sample average treatment effect of antiretroviral therapy (ART) on all-cause mortality at 45 weeks. Based on substantive background knowledge, we believe that the treated and untreated populations are exchangeable conditional on gender, age, CD4 T-cell count, and detectable viral load.

In [3]:
from zepid import load_sample_data, spline
from zepid.causal.doublyrobust import TMLE

df = load_sample_data(False)
df[['age_rs1', 'age_rs2']] = spline(df, 'age0', n_knots=3, term=2, restricted=True)
df[['cd4_rs1', 'cd4_rs2']] = spline(df, 'cd40', n_knots=3, term=2, restricted=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 547 entries, 0 to 546
Data columns (total 13 columns):
id          547 non-null int64
male        547 non-null int64
age0        547 non-null int64
cd40        547 non-null int64
dvl0        547 non-null int64
art         547 non-null int64
dead        517 non-null float64
t           547 non-null float64
cd4_wk45    460 non-null float64
age_rs1     547 non-null float64
age_rs2     547 non-null float64
cd4_rs1     547 non-null float64
cd4_rs2     547 non-null float64
dtypes: float64(7), int64(6)
memory usage: 59.8 KB


To start, we will focus on a complete-case analysis. Therefore, we will drop the `cd4_wk45` column and all rows with a missing value for `dead`.

In [4]:
dfcc = df.drop(columns=['cd4_wk45']).dropna()
dfcc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 517 entries, 0 to 546
Data columns (total 12 columns):
id         517 non-null int64
male       517 non-null int64
age0       517 non-null int64
cd40       517 non-null int64
dvl0       517 non-null int64
art        517 non-null int64
dead       517 non-null float64
t          517 non-null float64
age_rs1    517 non-null float64
age_rs2    517 non-null float64
cd4_rs1    517 non-null float64
cd4_rs2    517 non-null float64
dtypes: float64(6), int64(6)
memory usage: 52.5 KB


Our data is now ready for a complete-case analysis using TMLE. First, we initialize `TMLE` with our complete-case data (`dfcc`), the treatment (`art`), and the outcome (`dead`).

In [5]:
tml = TMLE(dfcc, exposure='art', outcome='dead')

### Treatment Model
As the first step, we will estimate the treatment model. We believe the sufficient set for the treatment model is gender (`male`), age (`age0`), CD4 T-cell count (`cd40`), and detectable viral load (`dvl0`). To relax the functional form assumptions, we will model age and CD4 using restricted quadratic splines.

In [6]:
tml.exposure_model('male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')


----------------------------------------------------------------
MODEL: art ~ male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0
-----------------------------------------------------------------
                 Generalized Linear Model Regression Results                  
Dep. Variable:                    art   No. Observations:                  517
Model:                            GLM   Df Residuals:                      508
Model Family:                Binomial   Df Model:                            8
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -206.06
Date:                Fri, 22 Feb 2019   Deviance:                       412.12
Time:                        18:28:53   Pearson chi2:                     510.
No. Iterations:                     5   Covariance Type:             nonrobust
                 coef    std err          z      P>|z|      [0.025      0.975]

By default, `TMLE` uses a logistic regression model to estimate the probabilities of treatment, and the corresponding summary of the model fit is printed to the console.

### Outcome Model
Now, we will estimate the outcome model. We will model the outcome as a function of ART (`art`), gender (`male`), age (`age0`), CD4 T-cell count (`cd40`), and detectable viral load (`dvl0`). Again, we will model age and CD4 using restricted quadratic splines.

In [7]:
tml.outcome_model('male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')


----------------------------------------------------------------
MODEL: dead ~ male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0
-----------------------------------------------------------------
                 Generalized Linear Model Regression Results                  
Dep. Variable:                   dead   No. Observations:                  517
Model:                            GLM   Df Residuals:                      508
Model Family:                Binomial   Df Model:                            8
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -204.76
Date:                Fri, 22 Feb 2019   Deviance:                       409.52
Time:                        18:31:30   Pearson chi2:                     511.
No. Iterations:                     6   Covariance Type:             nonrobust
                 coef    std err          z      P>|z|      [0.025      0.975]

When a binary outcome is input, `TMLE` uses a logistic regression to estimate the probabilities of the outcome. Model output is printed to the console by default.

### Targeting step
The targeting step and estimation is done through the `fit()` function. 

In [8]:
tml.fit()

We can view our results after the targeting step by using the `summary()` function.

In [9]:
tml.summary()

----------------------------------------------------------------------
Risk Difference:  -0.084
95.0% two-sided CI: (-0.153 , -0.015)
----------------------------------------------------------------------
Risk Ratio:  0.536
95.0% two-sided CI: (0.28 , 1.025)
----------------------------------------------------------------------
Odds Ratio:  0.486
95.0% two-sided CI: (0.235 , 1.003)
----------------------------------------------------------------------


As seen, `TMLE` estimates the common target parameters of interest along with their confidence intervals, which are estimated using efficient influence curves. At this point in my understanding, I cannot tell you much about influence curves or the theory underlying them.
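For readers who want to see the mechanics, the following is a minimal sketch of an influence-curve-based confidence interval for the risk difference, following the formula presented in Schuler and Rose 2017. The function name and the input values are hypothetical, and I am assuming, not asserting, that *zEpid* follows the same steps internally.

```python
import math
import statistics


def influence_curve_ci(y, a, pi1, q1_star, q0_star, z=1.96):
    """Risk difference with a Wald-type interval based on the influence curve."""
    n = len(y)
    psi = sum(t - u for t, u in zip(q1_star, q0_star)) / n
    ic = []
    for yi, ai, p, t, u in zip(y, a, pi1, q1_star, q0_star):
        h = ai / p - (1 - ai) / (1 - p)       # clever covariate at observed treatment
        qa = t if ai == 1 else u              # targeted prediction at observed treatment
        ic.append(h * (yi - qa) + (t - u) - psi)
    se = math.sqrt(statistics.variance(ic) / n)
    return psi, se, (psi - z * se, psi + z * se)


# Hypothetical targeted predictions for 8 individuals (illustrative values only)
y = [1, 0, 0, 1, 0, 1, 0, 0]
a = [1, 1, 1, 1, 0, 0, 0, 0]
pi1 = [0.6, 0.5, 0.7, 0.4, 0.3, 0.5, 0.4, 0.6]
q1_star = [0.35, 0.27, 0.29, 0.44, 0.24, 0.36, 0.30, 0.37]
q0_star = [0.34, 0.23, 0.28, 0.37, 0.24, 0.36, 0.28, 0.29]

psi, se, (lower, upper) = influence_curve_ci(y, a, pi1, q1_star, q0_star)
print(f"RD = {psi:.3f}, SE = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With only eight made-up observations the interval is very wide; the point of the sketch is the calculation, not the numbers.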

## Tuning Parameters
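Luckily, in our example we don't have an issue estimating the predicted probabilities of ART. When some predicted probabilities are very close to 0 or 1, the clever covariate takes extreme values, which can cause issues in the estimation procedure. One solution is to "trim" the estimated propensity scores using the `bound` argument of `exposure_model()`, as shown below with bounds of 0.01 and 0.99.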

In [13]:
tml = TMLE(dfcc, exposure='art', outcome='dead')
tml.exposure_model('male + age0 + cd40 + cd4_rs1 + cd4_rs2 + dvl0', bound=[0.01, 0.99], print_results=False)
tml.outcome_model('art + male + age0 + cd40 + cd4_rs1 + cd4_rs2 + dvl0', print_results=False)
tml.fit()
tml.summary()

----------------------------------------------------------------------
Risk Difference:  -0.082
95.0% two-sided CI: (-0.154 , -0.009)
----------------------------------------------------------------------
Risk Ratio:  0.552
95.0% two-sided CI: (0.285 , 1.07)
----------------------------------------------------------------------
Odds Ratio:  0.502
95.0% two-sided CI: (0.239 , 1.054)
----------------------------------------------------------------------
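
To see why bounding helps, the sketch below truncates a few hypothetical predicted probabilities to the interval [0.01, 0.99] and compares the resulting clever covariates. The helper names and values are made up for illustration, and I am assuming that bounding means truncating values to the limits; the exact behavior of the `bound` argument should be checked against the *zEpid* documentation.

```python
def bound_probability(p, lower, upper):
    """Truncate a predicted probability to the interval [lower, upper]."""
    return min(max(p, lower), upper)


def clever_covariate(a, pi1):
    """H(A=a, L) = I(A=1)/pi_1 - I(A=0)/pi_0, with pi_0 = 1 - pi_1."""
    return a / pi1 - (1 - a) / (1 - pi1)


# Hypothetical predicted probabilities of treatment, two of them extreme
pi1 = [0.001, 0.25, 0.60, 0.998]
a = [1, 0, 1, 0]

bounded = [bound_probability(p, 0.01, 0.99) for p in pi1]
h_raw = [clever_covariate(ai, p) for ai, p in zip(a, pi1)]
h_bounded = [clever_covariate(ai, p) for ai, p in zip(a, bounded)]

for p, b, hr, hb in zip(pi1, bounded, h_raw, h_bounded):
    print(f"pi1={p:<6} bounded={b:<5} H raw={hr:9.2f}  H bounded={hb:8.2f}")
```

The two extreme individuals would otherwise carry clever covariates of roughly 1000 and -500; after bounding, no value exceeds 100 in absolute value.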


summarization here....

## Machine Learning
In the previous example, we used the default statistical model of `TMLE`. One of the major advantages of `TMLE` is the ability to use machine learning models to predict the treatment and the outcome.

I will demonstrate TMLE with super learner. Super learner is a generalized stacking algorithm. Briefly, it allows us to estimate the ....
An implementation of super learner in Python can be downloaded from [this GitHub repo](https://github.com/alexpkeil1/SuPyLearner). 

summarization...

## Missing Outcome Data
While we conducted a complete-case analysis previously, TMLE can also natively handle missing outcome data. This is accomplished by using inverse probability of missingness weights in the background. We can add this model by calling the `missing_model()` function.
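
The idea behind inverse probability of missingness weights can be sketched with a tiny hypothetical data set in which outcomes are missing only among the treated. The records and helper name are made up for illustration, and the sketch shows only the weighting itself, not how *zEpid* combines the weights with the targeting step.

```python
# Hypothetical records: (treatment, outcome), where None marks a missing outcome
records = [(1, 1), (1, 1), (1, 0), (1, None), (1, None), (1, None),
           (0, 0), (0, 0), (0, 0), (0, 1)]


def prob_observed(records, a):
    """Proportion with an observed outcome among those with treatment a."""
    group = [y for ai, y in records if ai == a]
    return sum(y is not None for y in group) / len(group)


# Model for missingness: here simply stratified by treatment
p_obs = {a: prob_observed(records, a) for a in (0, 1)}

# Each observed individual is weighted by 1 / Pr(observed | A)
observed = [(a, y) for a, y in records if y is not None]
weights = [1 / p_obs[a] for a, _ in observed]

complete_case_risk = sum(y for _, y in observed) / len(observed)
weighted_risk = sum(w * y for w, (_, y) in zip(weights, observed)) / sum(weights)

print(f"complete-case risk: {complete_case_risk:.3f}")
print(f"weighted risk:      {weighted_risk:.3f}")
```

Each observed treated individual stands in for two people, so the weighted sample has the same size and treatment distribution as the full data. This recovers the overall risk only under the assumption that outcomes are missing at random given treatment.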

We will compare the default logistic regression model results and super learner results for the missing outcome data. 

summarization...

## Continuous Outcomes
Continuous outcomes are also implemented. Mention bounding of $Y$ in the background

If no custom model is specified, users can request either Normal or Poisson distributions for continuous outcomes.
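
In the TMLE literature, a continuous outcome is commonly rescaled to the interval [0, 1] so that the same logistic targeting step can be used, and the estimate is transformed back afterwards. The sketch below shows that rescaling with hypothetical CD4 values; the function names are my own, and I am assuming that the bounding done in the background works along these lines.

```python
def bound_outcome(y, lower, upper):
    """Rescale a continuous outcome to the interval [0, 1]."""
    return [(yi - lower) / (upper - lower) for yi in y]


def unbound_difference(psi_bounded, lower, upper):
    """Transform a mean difference on the [0, 1] scale back to the original scale."""
    return psi_bounded * (upper - lower)


# Hypothetical CD4 T-cell counts (illustrative values only)
cd4 = [120, 350, 500, 810, 275]
lower, upper = min(cd4), max(cd4)

cd4_bounded = bound_outcome(cd4, lower, upper)
print([round(v, 3) for v in cd4_bounded])

# A hypothetical mean difference of 0.1 on the bounded scale, back-transformed
print(unbound_difference(0.1, lower, upper))
```

Differences scale by the width of the bounds only, because the lower bound cancels when two means are subtracted.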

`cd4_wk45`

summarization...

# Conclusion
In this tutorial, I have described TMLE and its usage in *zEpid*. `TMLE` can be used to estimate treatment effects on binary or continuous outcomes. Additionally, estimation with outcome data missing at random is possible.

## References

Schuler, Megan S., and Sherri Rose. "Targeted maximum likelihood estimation for causal inference in observational studies." American Journal of Epidemiology 185.1 (2017): 65-73.

van der Laan, Mark J., and Sherri Rose. Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media, 2011.

van der Laan, Mark J., and Daniel Rubin. "Targeted maximum likelihood learning." The International Journal of Biostatistics 2.1 (2006).

Gruber, Susan, and Mark J. van der Laan. "tmle: An R package for targeted maximum likelihood estimation." (2011).