-
Notifications
You must be signed in to change notification settings - Fork 1
/
linreg.Rmd
127 lines (96 loc) · 4.03 KB
/
linreg.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
---
title: "Evaluating linear regression methods in four scenarios from Zou & Hastie (2005)"
author: "Peter Carbonetto, Gao Wang and Matthew Stephens"
date: April 2, 2019
output:
html_document:
theme: readable
include:
before_body: include/header.html
after_body: include/footer.html
---
In this short analysis, we compare the prediction accuracy of several
linear regression in the four simulation examples of
[Zou & Hastie (2005)][zou-hastie-2005]. The six methods compared are:
+ ridge regression;
+ the Lasso;
+ the Elastic Net;
+ Sum of Single Effects regression (SuSiE), described [here][susie];
+ variational inference for Bayesian variable selection, or "varbvs",
described [here][varbvs]; and
+ "varbvsmix", an elaboration of varbvs that replaces the single normal
prior with a mixture-of-normals.
See [here][github-repo] for setup instructions. A webpage was
generated from this source code by running
`rmarkdown::render("linreg.Rmd")` in the "analysis" directory.
```{r knitr, echo=FALSE}
knitr::opts_chunk$set(comment = "#",results = "hold",collapse = TRUE,
fig.align = "center")
```
Load packages
-------------
Load a few packages and custom functions used in the analysis below.
```{r load-pkgs, warning=FALSE, message=FALSE}
library(dscrutils)
library(ggplot2)
library(cowplot)
source("../code/plots.R")
```
Import DSC results
------------------
Here we use function "dscquery" from the dscrutils package to extract
the DSC results we are interested in---the mean squared error in the
predictions from each method and in each simulation scenario. After
this call, the "dsc" data frame should contain results for 480
pipelines---6 methods times 4 scenarios times 20 data sets simulated
in each scenario.
```{r import-dsc-results}
library(dscrutils)
methods <- c("ridge","lasso","elastic_net","susie","varbvs","varbvsmix")
dsc <- dscquery("../dsc/linreg",c("simulate.scenario","fit","mse.err"),
verbose = FALSE)
dsc <- transform(dsc,fit = factor(fit,methods))
nrow(dsc)
```
Note that you will need to run the DSC before querying the results;
see [here][github-repo] for instructions on running the DSC. If you
did not run the DSC to generate these results, you can replace the
dscquery call above by this line to load the pre-extracted results
stored in a CSV file:
```{r import-results-from-csv, eval=FALSE}
dsc <- read.csv("../output/linreg_mse.csv")
```
This is how the CSV file was created:
```{r write-dsc-results}
write.csv(dsc,"../output/linreg_mse.csv",row.names = FALSE,quote = FALSE)
```
## Summarize and discuss simulation results
The boxplots summarize the prediction errors in each of the simulations.
```{r create-boxplots, fig.height=8, fig.width=6.5}
p1 <- mse.boxplot(subset(dsc,simulate.scenario == 1)) + ggtitle("Scenario 1")
p2 <- mse.boxplot(subset(dsc,simulate.scenario == 2)) + ggtitle("Scenario 2")
p3 <- mse.boxplot(subset(dsc,simulate.scenario == 3)) + ggtitle("Scenario 3")
p4 <- mse.boxplot(subset(dsc,simulate.scenario == 4)) + ggtitle("Scenario 4")
plot_grid(p1,p2,p3,p4)
```
Here are a few initial (*i.e.,* imprecise) impressions from these plots.
In most cases, the Elastic Net does at least as well, or better than
the Lasso, which is what we would expect.
Ridge regression actually achieves excellent accuracy in all cases
except Scenario 4. Ridge regression is expected to do less well in
Scenario 4 because the majority of the true coefficients are zero, so
a sparse model would be favoured.
In Scenario 4 where the predictors are correlated in a structured way,
and the effects are sparse, varbvs and varbvsmix perform better than
the other methods.
varbvsmix yields competitive predictions in all four scenarios.
## Session information
This is the version of R and the packages that were used to generate
these results.
```{r session-info}
sessionInfo()
```
[github-repo]: https://github.com/stephenslab/dsc-linreg
[susie]: https://doi.org/10.1101/501114
[varbvs]: https://projecteuclid.org/euclid.ba/1339616726
[zou-hastie-2005]: https://doi.org/10.1111/j.1467-9868.2005.00503.x