/
pensim.Rmd
361 lines (295 loc) · 10.2 KB
/
pensim.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
---
title: "pensim package user guide"
author: "Levi Waldron"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{pensim package user guide}
%\VignetteEncoding{UTF-8}
bibliography: pensim.bib
output:
html_document:
number_sections: yes
toc: true
---
# Introduction
This package acts as a wrapper to the *penalized* R package to
add the following functionality to that package:
* repeated tuning on different data folds, with parallelization to
take advantage of multiple processors
* two-dimensional tuning of the Elastic Net
* calculation of cross-validated risk scores through a nested
cross-validation procedure, with parallelization to take advantage
of multiple processors
It also provides a function for simulation of collinear
high-dimensional data with survival or binary response.
This package was developed in support of the study by @waldron_optimized_2011.
This paper contains greater detail
on proper application of the methods provided here. Please cite this
paper when using the pensim package in your research, as well as the
penalized package (@goeman_l1_2010).
# Example data
*pensim* provides example data from a microarray experiment
investigating survival of cancer patients with lung adenocarcinomas
(@beer_gene-expression_2002). Load this data
and do an initial pre-filter of genes with low IQR:
```{r dataprep}
library(pensim)
data(beer.exprs)
data(beer.survival)
##select just 100 genes to speed computation, just for the sake of example:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100),]
#
gene.quant <- apply(beer.exprs.sample, 1, quantile, probs = 0.75)
dat.filt <- beer.exprs.sample[gene.quant > log2(100),]
gene.iqr <- apply(dat.filt, 1, IQR)
dat.filt <- as.matrix(dat.filt[gene.iqr > 0.5,])
dat.filt <- t(dat.filt)
dat.filt <- data.frame(dat.filt)
#
library(survival)
surv.obj <- Surv(beer.survival$os, beer.survival$status)
```
Note that the expression data are in "wide" format, with one column
per predictor (gene). It is recommended to put covariate data in a
dataframe object, rather than a matrix.
# Nested cross-validation
Unbiased estimation of prediction accuracy involves two levels of
cross-validation: an outer level for estimating prediction accuracy,
and an inner level for model tuning. This process is simplified by
the opt.nested.crossval function.
It is recommended first to establish the arguments for the penalized
regression by testing on the *penalized* package functions `optL1`
(for LASSO), `optL2` (for Ridge) or `cvl` (for Elastic Net). Here we use
LASSO. Setting `maxlambda1=5` is not a generally recommended procedure,
but is useful in this toy example to avoid converging on the null
model.
```{r lassotest}
library(penalized)
testfit <- optL1(
response = surv.obj,
penalized = dat.filt,
fold = 5,
maxlambda1 = 5,
positive = FALSE,
standardize = TRUE,
trace = FALSE
)
```
Now pass these arguments to `opt.nested.crossval()` for cross-validated
calculation and assessment of risk scores, with the additional
arguments:
* `outerfold` and `nprocessors`: number of folds for the outer level of cross-validation, and the number of processors to use for the outer level of cross-validation (see `?opt.nested.crossval`)
* optFUN and scaling: which optimization function to use (`opt1D` for LASSO or Ridge, `opt2D` for Elastic Net) - see `?opt.splitval`.
* setpen and nsim: setpen defines whether to do LASSO ("**L1**") or Ridge ("**L2**"), `nsim` defines the number of times to repeat tuning (see `?opt1D`. `opt2D` has different required arguments.)
```{r opt.nested.crossval}
set.seed(1)
preds <-
opt.nested.crossval(
outerfold = 5,
nprocessors = 1,
#opt.nested.crossval arguments
optFUN = "opt1D",
scaling = FALSE,
#opt.splitval arguments
setpen = "L1",
nsim = 1,
#opt1D arguments
response = surv.obj,
#rest are penalized::optl1 arguments
penalized = dat.filt,
fold = 5,
positive = FALSE,
standardize = TRUE,
trace = FALSE
)
```
Ideally nsim would be 50, and outerfold and fold would be 10, but the
values below speed computation 200x compared to these recommended
values. Note that here we are using the `standardize=TRUE` argument of
`optL1` rather than the `scaling=TRUE` argument of opt.splitval. These
two approaches to scaling are roughly equivalent, but the scaling
approaches are not the same (`scaling=TRUE` does z-score,
`standardize=TRUE` scales to unit central L2 norm), and results will not
be identical. Also, using `standardize=TRUE` scales variables but
provides coeffients for the original scale, whereas using scaling=TRUE
scales variables in the training set then applies the same scales to
the test set.
## Summarization and plotting
Cox fit on the continuous risk predictions:
```{r coxfit}
coxfit.continuous <- coxph(surv.obj~preds)
summary(coxfit.continuous)
```
Dichotomize the cross-validated risk predictions at the median, for visualization:
```{r dichot}
preds.dichot <- preds > median(preds)
```
Plot the ROC curve:
```{r ROCplot, fig.cap="**Figure 1: ROC plot of cross-validated continuous risk predictions at 12 months.** Note that the predictions are better if you don't randomly select 250 genes to start with! We only did this to ease the load on the CRAN checking servers."}
nobs <- length(preds)
cutoff <- 12
if (requireNamespace("survivalROC", quietly = TRUE)) {
preds.roc <-
survivalROC::survivalROC(
Stime = beer.survival$os,
status = beer.survival$status,
marker = preds,
predict.time = cutoff,
span = 0.01 * nobs ^ (-0.20)
)
plot(
preds.roc$FP,
preds.roc$TP,
type = "l",
xlim = c(0, 1),
ylim = c(0, 1),
xlab = paste("FP", "\n", "AUC = ", round(preds.roc$AUC, 3)),
lty = 2,
ylab = "TP",
main = "LASSO predictions\n ROC curve at 12 months"
)
abline(0, 1)
}
```
# Getting coefficients for model fit on all the data
Finally, we can get coefficients for the model fit on all the data,
for future use. Note that nsim should ideally be greater than 1, to
train the model using multiple foldings for cross-validation. The
output of `opt1D` or `opt2D` will be a matrix with one row per simulation.
The default behavior in `opt.nested.crossval()` is to take the
simulation with highest cross-validated partial log likelihood (**CVL**),
which is the recommended way to select a model from the multiple
simulations.
```{r full.model}
beer.coefs <- opt1D(
setpen = "L1",
nsim = 1,
response = surv.obj,
penalized = dat.filt,
fold = 5,
maxlambda1 = 5,
positive = FALSE,
standardize = TRUE,
trace = FALSE
)
```
We can also include unpenalized covariates, if desired. Note that
when keeping only one variable for a penalized or unpenalized
covariate, indexing a dataframe like `[1]` instead of doing `[, 1]`
preserves the variable name. With `[, 1]` the variable name gets
converted to "".
```{r unpenalized.eg}
beer.coefs.unpen <-
opt1D(
setpen = "L1",
nsim = 1,
response = surv.obj,
penalized = dat.filt[-1],
# This is equivalent to dat.filt[,-1]
unpenalized = dat.filt[1],
fold = 5,
maxlambda1 = 5,
positive = FALSE,
standardize = TRUE,
trace = FALSE
)
```
Note the non-zero first coefficient this time, due to it being unpenalized:
```{r lookatcoefs}
beer.coefs[1, 1:5] #example output with no unpenalized covariates
beer.coefs.unpen[1, 1:5] #example output with first covariate unpenalized
```
# Simulation of collinear high-dimensional data with survival or binary outcome
The *pensim* also provides a convenient means to simulation
high-dimensional expression data with (potentially censored) survival
outcome or binary outcome which is dependent on specified covariates.
## Binary outcome example
First, generate the data. Here we simulate 20 variables. The first
15 (**group "a"**) are uncorrelated, and have no association with
outcome. The final five (**group "b"**) have covariance of 0.8 to each
other variable in that group. The response variable is associated
with the first variable group "b" (`firstonly=TRUE`) with a
coefficient of 2.
Binary outcomes for $n_s= 50$ samples are simulated as a Bernoulli
distribution with probability for patient s:
\begin{equation}
p_{s} =\frac{1}{1 + exp(-\beta X_{s})}
\end{equation}
with $\beta_{s,16} = 0.5$ and all other $\beta_{s,i}$ equal to zero.
The code for this simulation is as follows:
```{r genbinary}
set.seed(9)
x <- create.data(
nvars = c(15, 5),
cors = c(0, 0.8),
associations = c(0, 2),
firstonly = c(TRUE, TRUE),
nsamples = 50,
response = "binary",
logisticintercept = 0.5
)
```
Take a look at the simulated data:
```{r lookbinary}
summary(x)
x$summary
```
A simple logistic model fails at variable selection in this case:
```{r fitmodel}
simplemodel <- glm(outcome ~ ., data = x$data, family = binomial)
summary(simplemodel)
```
But LASSO does a better job, selecting several of the collinear
variables in the "b" group of variables which are associated with
outcome:
```{r binarylassodemo}
lassofit <-
opt1D(
nsim = 3,
nprocessors = 1,
setpen = "L1",
penalized = x$data[1:20],
response = x$data[, "outcome"],
trace = FALSE,
fold = 10
)
print(lassofit)
```
And visualize the data as a heatmap:
```{r heatmap, fig.cap = "**Figure 2: Heatmap of simulated data with binary response.**"}
dat <- t(as.matrix(x$data[,-match("outcome", colnames(x$data))]))
heatmap(dat, ColSideColors = ifelse(x$data$outcome == 0, "black", "white"))
```
## Survival outcome example
We simulate these data in the same way, but with
`response="timetoevent"`. Here censoring is uniform random between
times 2 and 10, generating approximately 34\% censoring:
```{r survoutcome}
set.seed(1)
x <- create.data(
nvars = c(15, 5),
cors = c(0, 0.8),
associations = c(0, 0.5),
firstonly = c(TRUE, TRUE),
nsamples = 50,
censoring = c(2, 10),
response = "timetoevent"
)
```
How many events are censored?
```{r howmanycensored}
sum(x$data$cens == 0) / nrow(x$data)
```
Kaplan-Meier plot of this simulated cohort:
```{r simulatedKM, fig.cap = "**Figure 3: Kaplan-Meier plot of survival of simulated cohort.**"}
library(survival)
surv.obj <- Surv(x$data$time, x$data$cens)
plot(survfit(surv.obj ~ 1), ylab = "Survival probability", xlab = "time")
```
# Session Info
```{r}
sessionInfo()
```
# References