-
Notifications
You must be signed in to change notification settings - Fork 1
/
linreg.Rmd
230 lines (182 loc) · 7.55 KB
/
linreg.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
---
title: "Accelerating co-ordinate ascent updates for linear regression using DAAREM"
date: June 19, 2019
site: workflowr::wflow_site
output: workflowr::wflow_html
---
In this small demonstration, we show how the
[DAAREM method][daarem-paper] can be used to accelerate co-ordinate
ascent algorithms for linear regression models.
We begin with a simple case in which the regression coefficients are
independently and identically drawn from a simple normal prior with
zero mean---*i.e.*, ridge regression. The co-ordinate ascent update
for $\hat{\beta}_i$, the estimate of the regression coefficient for
variable $i$, is $$\hat{\beta}_i = \frac{(X^Ty)_i - \sum_{j\,\neq\,i}
(X^T\!X)_{ij} \hat{\beta}_j} {(X^T\!X)_{ii} + 1/\sigma_0^2},$$ where
$X$ is the $n \times p$ matrix storing the $n$ observations of $p$
variables, $y$ is the $n$-vector of regression outcomes, and the prior
on the coefficients is assumed to be *i.i.d* normal with mean zero and
variance $\sigma^2 \sigma_0^2$, where $\sigma^2$ is the variance of
the residual.
```{r knitr, echo=FALSE}
knitr::opts_chunk$set(comment = "#",results = "hold",collapse = TRUE,
fig.align = "center")
```
## Analysis settings
These variables specify how the data are generated: `n` is the number
of simulated samples, `p` is the number of simulated predictors, `na`
is the number of simulated predictors that have a nonzero effect, `se`
is the variance of the residual.
```{r sim-settings}
n <- 200
p <- 500
na <- 10
se <- 4
```
This specifies the prior on the regression coefficients: it is normal
with zero mean and variance `s0`.
```{r ridge-settings}
s0 <- 1/se
```
## Set up environment
Load some packages and function definitions used in the example below.
```{r load-pkgs, warning=FALSE, message=FALSE}
library(MASS)
library(daarem)
library(ggplot2)
library(cowplot)
source("../code/misc.R")
source("../code/datasim.R")
source("../code/ridge.R")
source("../code/mr_ash.R")
```
Initialize the sequence of pseudorandom numbers.
```{r set-seed}
set.seed(1)
```
## Simulate a data set
Simulate predictors with "decaying" correlations.
```{r sim-x}
X <- simulate_predictors_decaying_corr(n,p,s = 0.5)
X <- scale(X,center = TRUE,scale = FALSE)
```
Generate additive effects for the markers so that exactly `na` of them have
a nonzero effect on the trait.
```{r sim-beta}
i <- sample(p,na)
b <- rep(0,p)
b[i] <- rnorm(na)
```
Simulate the continuous outcomes, and center them.
```{r sim-y}
y <- drop(X %*% b + sqrt(se)*rnorm(n))
y <- y - mean(y)
```
## Run ridge regression co-ordinate ascent updates
Set the initial estimate of the coefficients.
```{r init-estimate}
b0 <- rep(0,p)
```
Fit the ridge regression model by running 100 iterations of the basic
co-ordinate ascent updates. Note that the co-ordinate ascent updates
are very simple, and are easily implemented in a single line of R
code; see the code for the `ridge.update` function.
```{r fit-ridge}
out <- system.time(fit1 <- ridge(X,y,b0,s0,numiter = 100))
f1 <- ridge.objective(X,y,fit1$b,s0)
cat(sprintf("Computation took %0.2f seconds.\n",out["elapsed"]))
cat(sprintf("Objective value at solution is %0.12f.\n",f1))
```
## Run accelerated co-ordinate ascent updates
Fit the ridge regression model again, this time using DAAREM to speed
up the co-ordinate ascent algorithm.
```{r fit-ridge-2}
out <- system.time(fit2 <- daarridge(X,y,b0,s0,numiter = 100))
f2 <- ridge.objective(X,y,fit2$b,s0)
cat(sprintf("Computation took %0.2f seconds.\n",out["elapsed"]))
cat(sprintf("Objective value at solution is %0.12f.\n",f2))
```
We see that the DAAREM solution is better (it has a higher posterior
value).
## Plot improvement in solution over time
Since the ridge estimate as a closed-form solution, we can easily
compare the above estimates obtained via co-ordinate ascent against
the actual solution.
```{r ridge-solution}
bhat <- drop(solve(t(X) %*% X + diag(rep(1/s0,p)),t(X) %*% y))
f <- ridge.objective(X,y,bhat,s0)
```
This plot shows the improvement in the solution over time for the two
co-ordinate ascent algorithms: the vertical axis ("distance to best
solution") shows the difference between the largest log-posterior
obtained, and the log-posterior at the actual ridge solution (`bhat`).
```{r plot-iter-vs-objective, fig.height=4, fig.width=6}
pdat <-
rbind(data.frame(iter = 1:100,dist = f - fit1$value,method = "basic"),
data.frame(iter = 1:100,dist = f - fit2$value,method = "accelerated"))
p <- ggplot(pdat,aes(x = iter,y = dist,col = method)) +
geom_line(size = 1) +
scale_y_continuous(trans = "log10",breaks = 10^seq(-8,4)) +
scale_color_manual(values = c("darkorange","dodgerblue")) +
labs(x = "iteration",y = "distance from solution")
print(p)
```
From this plot, we see that the accelerated algorithm progresses much
more rapidly toward the solution; after 100 iterations, it nearly
recovers the actual ridge estimates, whereas the unaccelerated version
is still very far away.
## Linear regression with mixture-of-normals priors
Next, we consider a less simple case in which the regression
coefficients are independently and identically drawn from mixture of
zero-centered normals; this can be seen as a multivariate extension to
the [adaptive shrinkage model][adaptive-shrinkage-paper], so we call
this "multivariate regression adaptive shrinkage" (mr-ash). Although
posterior computations with this model are more difficult than with
ridge regression, we can nonetheless obtain simple co-ordinate ascent
updates for computing posterior expectations of the coefficients if we
introduce a *variational approximation* to the posterior
distribution. The full derivation is omitted here; see the code in the
`mr_ash_update` function for details. (Note that the co-ordinate
ascent updates, unlike the ridge regression updates, are only
guaranteed to recover a *local maximum* of the objective function
being optimized.)
These two variables specify the variances and mixture weights for the
mixture-of-normals priors. Here we illustrate mr-ash with a prior
that is a mixture of three normals.
```{r mr-ash-settings}
s0 <- c(0.1,1,10)^2/se
w <- c(0.5,0.25,0.25)
```
## Run mr-ash co-ordinate ascent updates
Fit the mr-ash model by running 200 iterations of the basic
co-ordinate ascent updates.
```{r fit-mr-ash}
out <- system.time(fit3 <- mr_ash(X,y,b0,se,s0,w,numiter = 100))
cat(sprintf("Computation took %0.2f seconds.\n",out["elapsed"]))
```
## Run accelerated mr-ash co-ordinate ascent updates
Fit the mr-ash model again, this time using DAAREM to speed up the
co-ordinate ascent updates.
```{r fit-mr-ash-2}
out <- system.time(fit4 <- daar_mr_ash(X,y,b0,se,s0,w,numiter = 100))
cat(sprintf("Computation took %0.2f seconds.\n",out["elapsed"]))
```
Like the plot above, plot shows the improvement in the solution over
time for the basic and accelated co-ordinate ascent algorithms for
mr-ash. Both algorithms end up at the same solution, but the
"accelerated" version indeed arrives at the solution much more
quickly.
```{r plot-iter-vs-mr-ash-objective, fig.height=4, fig.width=6}
f <- max(c(fit3$value,fit4$value)) + 1e-8
pdat <-
rbind(data.frame(iter = 1:100,dist = f - fit3$value,method = "basic"),
data.frame(iter = 1:100,dist = f - fit4$value,method = "accelerated"))
p <- ggplot(pdat,aes(x = iter,y = dist,col = method)) +
geom_line(size = 1) +
scale_y_continuous(trans = "log10",breaks = 10^seq(-8,4)) +
scale_color_manual(values = c("darkorange","dodgerblue")) +
labs(x = "iteration",y = "distance from best solution")
print(p)
```
[daarem-paper]: https://doi.org/10.1080/10618600.2019.1594835
[adaptive-shrinkage-paper]: https://doi.org/10.1093/biostatistics/kxw041