analysis/linreg.Rmd

---
title: "Accelerating co-ordinate ascent updates for linear regression using DAAREM"
date: June 19, 2019
site: workflowr::wflow_site
output: workflowr::wflow_html
---

In this small demonstration, we show how the
[DAAREM method][daarem-paper] can be used to accelerate co-ordinate
ascent algorithms for linear regression models.

We begin with a simple case in which the regression coefficients are
independently and identically drawn from a simple normal prior with
zero mean---*i.e.*, ridge regression. The co-ordinate ascent update
for $\hat{\beta}_i$, the estimate of the regression coefficient for
variable $i$, is $$\hat{\beta}_i = \frac{(X^Ty)_i - \sum_{j\,\neq\,i}
(X^T\!X)_{ij} \hat{\beta}_j} {(X^T\!X)_{ii} + 1/\sigma_0^2},$$ where
$X$ is the $n \times p$ matrix storing the $n$ observations of $p$
variables, $y$ is the $n$-vector of regression outcomes, and the prior
on the coefficients is assumed to be *i.i.d* normal with mean zero and
variance $\sigma^2 \sigma_0^2$, where $\sigma^2$ is the variance of
the residual.

```{r knitr, echo=FALSE}
knitr::opts_chunk$set(comment = "#",results = "hold",collapse = TRUE,
                      fig.align = "center")
```

## Analysis settings

These variables specify how the data are generated: `n` is the number
of simulated samples, `p` is the number of simulated predictors, `na`
is the number of simulated predictors that have a nonzero effect, `se`
is the variance of the residual.

```{r sim-settings}
n  <- 200 
p  <- 500
na <- 10
se <- 4
```

This specifies the prior on the regression coefficients: it is normal
with zero mean and variance `s0`.

```{r ridge-settings}
s0 <- 1/se
```

## Set up environment

Load some packages and function definitions used in the example below.

```{r load-pkgs, warning=FALSE, message=FALSE}
library(MASS)
library(daarem)
library(ggplot2)
library(cowplot)
source("../code/misc.R")
source("../code/datasim.R")
source("../code/ridge.R")
source("../code/mr_ash.R")
```

Initialize the sequence of pseudorandom numbers.

```{r set-seed}
set.seed(1)
```

## Simulate a data set

Simulate predictors with "decaying" correlations.

```{r sim-x}
X <- simulate_predictors_decaying_corr(n,p,s = 0.5)
X <- scale(X,center = TRUE,scale = FALSE)
```

Generate additive effects for the markers so that exactly `na` of them have
a nonzero effect on the trait.

```{r sim-beta}
i    <- sample(p,na)
b    <- rep(0,p)
b[i] <- rnorm(na)
```

Simulate the continuous outcomes, and center them.

```{r sim-y}
y <- drop(X %*% b + sqrt(se)*rnorm(n))
y <- y - mean(y)
```

## Run ridge regression co-ordinate ascent updates

Set the initial estimate of the coefficients.

```{r init-estimate}
b0 <- rep(0,p)
```

Fit the ridge regression model by running 100 iterations of the basic
co-ordinate ascent updates. Note that the co-ordinate ascent updates
are very simple, and are easily implemented in a single line of R
code; see the code for the `ridge.update` function.

```{r fit-ridge}
out <- system.time(fit1 <- ridge(X,y,b0,s0,numiter = 100))
f1  <- ridge.objective(X,y,fit1$b,s0)
cat(sprintf("Computation took %0.2f seconds.\n",out["elapsed"]))
cat(sprintf("Objective value at solution is %0.12f.\n",f1))
```

## Run accelerated co-ordinate ascent updates

Fit the ridge regression model again, this time using DAAREM to speed
up the co-ordinate ascent algorithm.

```{r fit-ridge-2}
out <- system.time(fit2 <- daarridge(X,y,b0,s0,numiter = 100))
f2  <- ridge.objective(X,y,fit2$b,s0)
cat(sprintf("Computation took %0.2f seconds.\n",out["elapsed"]))
cat(sprintf("Objective value at solution is %0.12f.\n",f2))
```

We see that the DAAREM solution is better (it has a higher posterior
value).

## Plot improvement in solution over time

Since the ridge estimate as a closed-form solution, we can easily
compare the above estimates obtained via co-ordinate ascent against
the actual solution.

```{r ridge-solution}
bhat <- drop(solve(t(X) %*% X + diag(rep(1/s0,p)),t(X) %*% y))
f    <- ridge.objective(X,y,bhat,s0)
```

This plot shows the improvement in the solution over time for the two
co-ordinate ascent algorithms: the vertical axis ("distance to best
solution") shows the difference between the largest log-posterior
obtained, and the log-posterior at the actual ridge solution (`bhat`).

```{r plot-iter-vs-objective, fig.height=4, fig.width=6}
pdat <-
  rbind(data.frame(iter = 1:100,dist = f - fit1$value,method = "basic"),
        data.frame(iter = 1:100,dist = f - fit2$value,method = "accelerated"))
p <- ggplot(pdat,aes(x = iter,y = dist,col = method)) +
  geom_line(size = 1) +
  scale_y_continuous(trans = "log10",breaks = 10^seq(-8,4)) +
  scale_color_manual(values = c("darkorange","dodgerblue")) +
  labs(x = "iteration",y = "distance from solution")
print(p)
```

From this plot, we see that the accelerated algorithm progresses much
more rapidly toward the solution; after 100 iterations, it nearly
recovers the actual ridge estimates, whereas the unaccelerated version
is still very far away.

## Linear regression with mixture-of-normals priors

Next, we consider a less simple case in which the regression
coefficients are independently and identically drawn from mixture of
zero-centered normals; this can be seen as a multivariate extension to
the [adaptive shrinkage model][adaptive-shrinkage-paper], so we call
this "multivariate regression adaptive shrinkage" (mr-ash). Although
posterior computations with this model are more difficult than with
ridge regression, we can nonetheless obtain simple co-ordinate ascent
updates for computing posterior expectations of the coefficients if we
introduce a *variational approximation* to the posterior
distribution. The full derivation is omitted here; see the code in the
`mr_ash_update` function for details. (Note that the co-ordinate
ascent updates, unlike the ridge regression updates, are only
guaranteed to recover a *local maximum* of the objective function
being optimized.)

These two variables specify the variances and mixture weights for the
mixture-of-normals priors. Here we illustrate mr-ash with a prior
that is a mixture of three normals.

```{r mr-ash-settings}
s0 <- c(0.1,1,10)^2/se
w  <- c(0.5,0.25,0.25)
```

## Run mr-ash co-ordinate ascent updates

Fit the mr-ash model by running 200 iterations of the basic
co-ordinate ascent updates. 

```{r fit-mr-ash}
out <- system.time(fit3 <- mr_ash(X,y,b0,se,s0,w,numiter = 100))
cat(sprintf("Computation took %0.2f seconds.\n",out["elapsed"]))
```

## Run accelerated mr-ash co-ordinate ascent updates

Fit the mr-ash model again, this time using DAAREM to speed up the
co-ordinate ascent updates.

```{r fit-mr-ash-2}
out <- system.time(fit4 <- daar_mr_ash(X,y,b0,se,s0,w,numiter = 100))
cat(sprintf("Computation took %0.2f seconds.\n",out["elapsed"]))
```

Like the plot above, plot shows the improvement in the solution over
time for the basic and accelated co-ordinate ascent algorithms for
mr-ash. Both algorithms end up at the same solution, but the
"accelerated" version indeed arrives at the solution much more
quickly.

```{r plot-iter-vs-mr-ash-objective, fig.height=4, fig.width=6}
f    <- max(c(fit3$value,fit4$value)) + 1e-8
pdat <- 
  rbind(data.frame(iter = 1:100,dist = f - fit3$value,method = "basic"),
        data.frame(iter = 1:100,dist = f - fit4$value,method = "accelerated"))
p <- ggplot(pdat,aes(x = iter,y = dist,col = method)) +
  geom_line(size = 1) +
  scale_y_continuous(trans = "log10",breaks = 10^seq(-8,4)) +
  scale_color_manual(values = c("darkorange","dodgerblue")) +
  labs(x = "iteration",y = "distance from best solution")
print(p)
```

[daarem-paper]: https://doi.org/10.1080/10618600.2019.1594835
[adaptive-shrinkage-paper]: https://doi.org/10.1093/biostatistics/kxw041