---
title: "Targeted Learning with Stochastic Treatment Regimes"
author: "[Nima Hejazi](https://nimahejazi.org), [Jeremy
Coyle](https://github.com/jeremyrcoyle), and [Mark van der
Laan](https://vanderlaan-lab.org)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
vignette: >
%\VignetteIndexEntry{Targeted Learning with Stochastic Treatment Regimes}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r options, echo=FALSE}
options(scipen=999)
```
## Introduction
Stochastic treatment regimes provide a relatively simple way to assess the
effects of continuous treatments, via parameters that quantify the effect of
counterfactually shifting the observed values of a treatment of interest.
Here, we present an implementation of a recent algorithm for computing
targeted minimum loss-based estimates of treatment shift parameters defined in
terms of a shifting function $d(A,W)$. For a technical presentation of the
algorithm, the interested reader is invited to consult @diaz2018stochastic.
For additional background on Targeted Learning and previous work on stochastic
treatment regimes, please consider consulting @vdl2011targeted,
@vdl2018targeted, and @diaz2012population.
To start, let's load the packages we'll use and set a seed for simulation:
```{r setup, message=FALSE, warning=FALSE}
library(data.table)
library(sl3)
library(tmle3)
library(tmle3shift)
set.seed(429153)
```
## Data and Notation
Consider $n$ observed units $O_1, \ldots, O_n$, where each random variable $O =
(W, A, Y)$ corresponds to a single observational unit. Let $W$ denote baseline
covariates (e.g., age, sex, education level), $A$ an intervention variable of
interest (e.g., nutritional supplements), and $Y$ an outcome of interest (e.g.,
disease status). Though it need not be the case, let $A$ be continuous-valued,
i.e. $A \in \mathbb{R}$. Let $O_i \sim \mathcal{P} \in \mathcal{M}$, where
$\mathcal{M}$ is the nonparametric statistical model defined as the set of
continuous densities on $O$ with respect to some dominating measure. To
formalize the definition of stochastic interventions and their corresponding
causal effects, we introduce a nonparametric structural equation model (NPSEM),
based on @pearl2000causality, to define how the system changes under posited
interventions:
\begin{align*}\label{eqn:npsem}
  W &= f_W(U_W) \\
  A &= f_A(W, U_A) \\
  Y &= f_Y(A, W, U_Y).
\end{align*}
We denote the observed data structure $O = (W, A, Y)$.
Letting $A$ denote a continuous-valued treatment, we assume that the
distribution of $A$ conditional on $W = w$ has support in the interval
$(l(w), u(w))$; for convenience, assume this support condition holds almost
everywhere (_a.e._). That is, the
minimum natural value of treatment $A$ for an individual with covariates
$W = w$ is $l(w)$; similarly, the maximum is $u(w)$. Then, a simple stochastic
intervention, based on a shift $\delta$, may be defined as
\begin{equation}\label{eqn:shift}
d(a, w) =
\begin{cases}
a - \delta & \text{if } a > l(w) + \delta \\
a & \text{if } a \leq l(w) + \delta,
\end{cases}
\end{equation}
where $0 \leq \delta \leq u(w)$ is an arbitrary pre-specified value that
defines the degree to which the observed value $A$ is to be shifted, where
possible. For the purpose of using such a shift in practice, the present
software provides the functions `shift_additive` and `shift_additive_inv`,
which define a variation of this shift under the assumption that the density
of treatment $A$, conditional on the covariates $W$, is supported _a.e._
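To fix ideas, here is a minimal sketch of an everywhere-supported additive
shift and its inverse as plain R functions. The helpers
`shift_additive_sketch()` and `shift_additive_inv_sketch()` are hypothetical
stand-ins for illustration only; the exported `shift_additive()` and
`shift_additive_inv()` follow the `tmle3shift` API rather than this simplified
form.
```{r shift_sketch}
# a minimal sketch of an additive shift d(a, w) = a + delta and its inverse,
# assuming the conditional density of A given W is supported a.e.; these
# helpers are hypothetical stand-ins, as the exported shift_additive() and
# shift_additive_inv() follow the tmle3shift API rather than this simple form
shift_additive_sketch <- function(a, delta = 0.5) a + delta
shift_additive_inv_sketch <- function(a, delta = 0.5) a - delta

# the inverse recovers the observed treatment values after shifting
a_obs <- c(0.2, 1.1, 2.3)
all.equal(shift_additive_inv_sketch(shift_additive_sketch(a_obs)), a_obs)
```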
### Simulate Data
```{r sim_data}
# simulate simple data for tmle-shift sketch
n_obs <- 1000 # number of observations
n_w <- 1 # number of baseline covariates
tx_mult <- 2 # multiplier for the effect of W = 1 on the treatment
## baseline covariates -- simple, binary
W <- as.numeric(replicate(n_w, rbinom(n_obs, 1, 0.5)))
## create treatment based on baseline W
A <- as.numeric(rnorm(n_obs, mean = tx_mult * W, sd = 1))
## create outcome as a linear function of A, W + white noise
Y <- A + W + rnorm(n_obs, mean = 0, sd = 0.5)
```
The above composes our observed data structure $O = (W, A, Y)$. To formally
express this fact using the `tlverse` grammar introduced by the [`tmle3`
package](https://github.com/tlverse/tmle3), we create a single data object and
specify the functional relationships between the nodes in the _directed acyclic
graph_ (DAG) via _nonparametric structural equation models_ (NPSEMs), reflected
in the node list that we set up:
```{r data_nodes}
# organize data and nodes for tmle3
data <- data.table(W, A, Y)
node_list <- list(W = "W", A = "A", Y = "Y")
head(data)
```
We now have an observed data structure (`data`) and a specification of the role
that each variable in the data set plays as the nodes in a DAG.
## Methodology
To start, we will initialize a specification for the TMLE of our parameter of
interest (called a `tmle3_Spec` in the `tlverse` nomenclature) simply by calling
`tmle_shift`. We specify the argument `shift_val = 0.5` when initializing the
`tmle3_Spec` object to communicate that we're interested in a shift of $0.5$ on
the scale of the treatment $A$ -- that is, we specify $\delta = 0.5$ (note that
this is an arbitrarily chosen value for this example).
```{r spec_init}
# initialize a tmle specification
tmle_spec <- tmle_shift(
  shift_val = 0.5,
  shift_fxn = shift_additive,
  shift_fxn_inv = shift_additive_inv
)
```
As seen above, the `tmle_shift` specification object (like all `tmle3_Spec`
objects) does _not_ store the data for our specific analysis of interest. Later,
we'll see that passing a data object directly to the `tmle3` wrapper function,
alongside the instantiated `tmle_spec`, will serve to construct a `tmle3_Task`
object internally (see the [`tmle3`
documentation](https://tlverse.org/tmle3) for details). Note that, by default,
the `tmle_spec` object is set up to facilitate cross-validated estimation of
the likelihood components (CV-TMLE), which allows certain empirical process
(Donsker class) conditions to be circumvented by removing the contribution of
an empirical process term from the estimated influence function
[@zheng2011cross]. In practice, this automatic incorporation of
cross-validation means that the user need not be concerned with verifying
these theoretical conditions; moreover, cross-validated estimation of the
efficient influence function is expected to better control the estimated
variance.
### _Interlude:_ Constructing Optimal Stacked Regressions with `sl3`
To easily incorporate ensemble machine learning into the estimation procedure,
we rely on the facilities provided in the [`sl3` R
package](https://tlverse.org/sl3). For a complete guide on using the `sl3` R
package, consider consulting https://tlverse.org/sl3, or https://tlverse.org for
the [`tlverse` ecosystem](https://github.com/tlverse), of which `sl3` is a core
component.
Using the framework provided by the [`sl3` package](https://tlverse.org/sl3),
the nuisance parameters of the TML estimator may be fit with ensemble learning,
using the cross-validation framework of the Super Learner algorithm of
@vdl2007super. To estimate the treatment mechanism (denoted $g$), we must make use
of learning algorithms specifically suited to conditional density estimation; a
list of such learners may be extracted from `sl3` using `sl3_list_learners()`:
```{r sl3_density_lrnrs_search}
sl3_list_learners("density")
```
To proceed, we'll select two of the above learners: `Lrnr_haldensify`, which
uses the highly adaptive lasso for conditional density estimation, based on an
algorithm given by @diaz2011super, and `Lrnr_density_semiparametric`, an
approach for semiparametric conditional density estimation (instantiated below
in two variants, one modeling only the conditional mean and another also
modeling the conditional variance):
```{r sl3_density_lrnrs}
# learners used for conditional density regression (i.e., propensity score)
haldensify_lrnr <- Lrnr_haldensify$new(
  n_bins = 3, grid_type = "equal_mass",
  lambda_seq = exp(seq(-1, -9, length = 100))
)
hse_lrnr <- Lrnr_density_semiparametric$new(mean_learner = Lrnr_glm$new())
mvd_lrnr <- Lrnr_density_semiparametric$new(
  mean_learner = Lrnr_glm$new(),
  var_learner = Lrnr_mean$new()
)
sl_lrn_dens <- Lrnr_sl$new(
  learners = list(haldensify_lrnr, hse_lrnr, mvd_lrnr),
  metalearner = Lrnr_solnp_density$new()
)
```
We also require an approach for estimating the outcome regression (denoted $Q$).
For this, we build a Super Learner composed of an intercept model, a main terms
GLM, and the [xgboost algorithm](https://xgboost.ai/) for gradient boosting:
```{r sl3_regression_lrnrs}
# learners used for conditional expectation regression (e.g., outcome)
mean_lrnr <- Lrnr_mean$new()
glm_lrnr <- Lrnr_glm$new()
xgb_lrnr <- Lrnr_xgboost$new()
sl_lrn <- Lrnr_sl$new(
  learners = list(mean_lrnr, glm_lrnr, xgb_lrnr),
  metalearner = Lrnr_nnls$new()
)
```
We can make the above explicit with respect to standard notation by bundling
the ensemble learners into a `list` object below.
```{r make_lrnr_list}
# specify outcome and treatment regressions and create learner list
Q_learner <- sl_lrn
g_learner <- sl_lrn_dens
learner_list <- list(Y = Q_learner, A = g_learner)
```
The `learner_list` object above specifies the role that each of the ensemble
learners we've generated is to play in computing initial estimators to be used
in building a TMLE for the parameter of interest here. In particular, it makes
explicit the fact that our `Q_learner` is used in fitting the outcome regression
while our `g_learner` is used in fitting our treatment mechanism regression.
### Targeted Estimation of Stochastic Intervention Effects
With the `tmle3_Spec` object, data, node list, and learner list in hand, we
are ready to construct the TML estimator with a single call to the `tmle3`
wrapper function:
```{r fit_tmle}
tmle_fit <- tmle3(tmle_spec, data, node_list, learner_list)
tmle_fit
```
The `print` method of the resultant `tmle_fit` object conveniently displays the
results from computing our TML estimator.
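Beyond the printed display, the results may also be accessed programmatically.
As a brief example, the fitted object exposes a summary table of estimates
(consult the `tmle3` documentation for the exact fields):
```{r fit_summary, eval=FALSE}
# extract the table of estimates (point estimate, standard error, confidence
# limits) from the fitted object; column names follow tmle3's summary output
tmle_fit$summary
```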
### Statistical Inference for Targeted Maximum Likelihood Estimates
Recall that the asymptotic distribution of TML estimators has been studied
thoroughly:
$$\psi_n - \psi_0 = (P_n - P_0) \cdot D(\bar{Q}_n^{\star}, g_n) +
R(\hat{P}^{\star}, P_0),$$
which, provided the following two conditions hold:

1. $D(\bar{Q}_n^{\star}, g_n)$ converges to $D(P_0)$ in $L_2(P_0)$ norm, and
2. the size of the class of functions considered for estimation of
   $\bar{Q}_n^{\star}$ and $g_n$ is bounded (technically, $\exists \mathcal{F}$
   s.t. $D(\bar{Q}_n^{\star}, g_n) \in \mathcal{F}$ *__whp__*, where
   $\mathcal{F}$ is a Donsker class),

readily admits the conclusion that
$\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) + R(\hat{P}^{\star}, P_0)$.
Under the additional condition that the remainder term $R(\hat{P}^{\star},
P_0)$ decays as $o_P \left( \frac{1}{\sqrt{n}} \right)$, we have that
$$\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) + o_P \left( \frac{1}{\sqrt{n}}
\right),$$
which, by a central limit theorem, establishes a Gaussian limiting distribution
for the estimator:
$$\sqrt{n}(\psi_n - \psi_0) \to N(0, V(D(P_0))),$$
where $V(D(P_0))$ is the variance of the efficient influence curve (canonical
gradient) when $\psi_0$ admits an asymptotically linear representation.
The above implies that $\psi_n$ is a $\sqrt{n}$-consistent estimator of
$\psi_0$, that it is asymptotically normal (as given above), and that it is
locally efficient. This allows us to build Wald-type confidence intervals in a
straightforward manner:
$$\psi_n \pm z_{1 - \alpha/2} \cdot \frac{\sigma_n}{\sqrt{n}},$$
where $\sigma_n^2$ is an estimator of $V(D(P_0))$. The estimator $\sigma_n^2$
may be obtained using the bootstrap or computed directly via
$$\sigma_n^2 = \frac{1}{n} \sum_{i = 1}^{n} D^2(\bar{Q}_n^{\star},
g_n)(O_i).$$
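As a concrete illustration of this computation, here is a minimal sketch of
the Wald-type interval described above. The inputs `psi_n` and `eif_vals` are
hypothetical stand-ins for the TML point estimate and the estimated efficient
influence function values $D(\bar{Q}_n^{\star}, g_n)(O_i)$, respectively; in
practice, both are computed internally by `tmle3` and reported in the fit
summary.
```{r wald_ci_sketch, eval=FALSE}
# a minimal sketch of the Wald-type CI above; `psi_n` and `eif_vals` are
# hypothetical stand-ins for the TML estimate and the estimated efficient
# influence function values evaluated at each observation
wald_ci <- function(psi_n, eif_vals, level = 0.95) {
  n <- length(eif_vals)
  sigma_n <- sqrt(mean(eif_vals^2))   # sigma_n^2 = (1/n) sum_i D^2(O_i)
  z <- qnorm(1 - (1 - level) / 2)     # z_{1 - alpha/2}
  c(lower = psi_n - z * sigma_n / sqrt(n),
    est = psi_n,
    upper = psi_n + z * sigma_n / sqrt(n))
}
```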
Having now reviewed these facts, let's examine the results of computing our
TML estimator:
```{r tmle_inference}
tmle_fit
```
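As a quick sanity check, the simple data-generating mechanism above admits a
closed-form truth to compare against. Assuming `shift_additive` applies the
upward additive shift $a \mapsto a + \delta$ everywhere (the simulated
treatment has unbounded support), the true counterfactual mean is
$E[A + \delta + W] = E[2W] + 0.5 + E[W]$:
```{r truth_check}
# closed-form truth under the simulation: Y = A + W + noise, with
# A ~ N(2W, 1) and W ~ Bernoulli(0.5), so shifting A by delta = 0.5 gives
# E[Y_shift] = E[A] + delta + E[W] = 1 + 0.5 + 0.5 = 2
true_psi <- (2 * 0.5) + 0.5 + 0.5
true_psi
```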
## References