analysis/linreg.Rmd

---
title: "Evaluating linear regression methods in four scenarios from Zou & Hastie (2005)"
author: "Peter Carbonetto, Gao Wang and Matthew Stephens"
date: April 2, 2019
output:
  html_document:
    theme: readable
    include:
      before_body: include/header.html
      after_body: include/footer.html
---

In this short analysis, we compare the prediction accuracy of several
linear regression in the four simulation examples of
[Zou & Hastie (2005)][zou-hastie-2005]. The six methods compared are:

+ ridge regression;

+ the Lasso;

+ the Elastic Net;

+ Sum of Single Effects regression (SuSiE), described [here][susie];

+ variational inference for Bayesian variable selection, or "varbvs",
  described [here][varbvs]; and

+ "varbvsmix", an elaboration of varbvs that replaces the single normal
  prior with a mixture-of-normals.

See [here][github-repo] for setup instructions. A webpage was
generated from this source code by running
`rmarkdown::render("linreg.Rmd")` in the "analysis" directory.

```{r knitr, echo=FALSE}
knitr::opts_chunk$set(comment = "#",results = "hold",collapse = TRUE,
                      fig.align = "center")
```

Load packages
-------------

Load a few packages and custom functions used in the analysis below.

```{r load-pkgs, warning=FALSE, message=FALSE}
library(dscrutils)
library(ggplot2)
library(cowplot)
source("../code/plots.R")
```

Import DSC results
------------------

Here we use function "dscquery" from the dscrutils package to extract
the DSC results we are interested in---the mean squared error in the
predictions from each method and in each simulation scenario. After
this call, the "dsc" data frame should contain results for 480
pipelines---6 methods times 4 scenarios times 20 data sets simulated
in each scenario.

```{r import-dsc-results}
library(dscrutils)
methods <- c("ridge","lasso","elastic_net","susie","varbvs","varbvsmix")
dsc <- dscquery("../dsc/linreg",c("simulate.scenario","fit","mse.err"),
                 verbose = FALSE)
dsc <- transform(dsc,fit = factor(fit,methods))
nrow(dsc)
```

Note that you will need to run the DSC before querying the results;
see [here][github-repo] for instructions on running the DSC. If you
did not run the DSC to generate these results, you can replace the
dscquery call above by this line to load the pre-extracted results
stored in a CSV file:

```{r import-results-from-csv, eval=FALSE}
dsc <- read.csv("../output/linreg_mse.csv")
```

This is how the CSV file was created:

```{r write-dsc-results}
write.csv(dsc,"../output/linreg_mse.csv",row.names = FALSE,quote = FALSE)
```

## Summarize and discuss simulation results

The boxplots summarize the prediction errors in each of the simulations.

```{r create-boxplots, fig.height=8, fig.width=6.5}
p1 <- mse.boxplot(subset(dsc,simulate.scenario == 1)) + ggtitle("Scenario 1")
p2 <- mse.boxplot(subset(dsc,simulate.scenario == 2)) + ggtitle("Scenario 2")
p3 <- mse.boxplot(subset(dsc,simulate.scenario == 3)) + ggtitle("Scenario 3")
p4 <- mse.boxplot(subset(dsc,simulate.scenario == 4)) + ggtitle("Scenario 4")
plot_grid(p1,p2,p3,p4)
```

Here are a few initial (*i.e.,* imprecise) impressions from these plots.

In most cases, the Elastic Net does at least as well, or better than
the Lasso, which is what we would expect.

Ridge regression actually achieves excellent accuracy in all cases
except Scenario 4. Ridge regression is expected to do less well in
Scenario 4 because the majority of the true coefficients are zero, so
a sparse model would be favoured.

In Scenario 4 where the predictors are correlated in a structured way,
and the effects are sparse, varbvs and varbvsmix perform better than
the other methods.

varbvsmix yields competitive predictions in all four scenarios.

## Session information

This is the version of R and the packages that were used to generate
these results.

```{r session-info}
sessionInfo()
```

[github-repo]: https://github.com/stephenslab/dsc-linreg
[susie]: https://doi.org/10.1101/501114
[varbvs]: https://projecteuclid.org/euclid.ba/1339616726 
[zou-hastie-2005]: https://doi.org/10.1111/j.1467-9868.2005.00503.x