---
title: "MASH v FLASH simulation results"
output:
  workflowr::wflow_html:
    code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Here I compare MASH and FLASH fits to data simulated from various MASH and FLASH models. In addition to comparing the performance of the various FLASH fits, I was interested in seeing how FLASH performed on data generated from a MASH model (and vice versa). For code, see [below](#code).
## Fitting methods
The MASH fit is produced following the recommendations in the MASH vignettes (using both canonical matrices and data-driven matrices).
Five FLASH fits are produced. The first two serve as baselines. The "Vanilla" FLASH fit uses the default method to fit a flash object, adding up to ten factors greedily and then backfitting. The "Zero" fit does the same, but sets `var_type = "zero"` rather than using the default parameter option `var_type = "by_column"`. That is, the "Vanilla" fit estimates the residual variances, with the constraint that the variances are identical across any given column, while the "Zero" fit fixes the standard errors at their known values (here, `S = 1`).
The remaining three fits also use parameter option `var_type = "zero"`. FLASH-OHL (for "one-hots last") adds up to ten factors greedily, then adds a one-hot vector for each row in the data matrix, then backfits the whole thing. As described in the [vignette](intro.html), the purpose of the one-hot vectors is to account for certain "canonical" covariance structures.
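To make the one-hot construction concrete, here is a minimal base-R sketch. It is illustrative only and is not the code in `code/fits.R`; all variable names are hypothetical. Stacking one one-hot vector per row gives the identity matrix as a loadings matrix, so an effect that is unique to a single row needs only one nonzero entry in the corresponding factor.

```r
n <- 25  # number of rows in the data matrix
p <- 4   # a few columns, for illustration only

# Column i of this matrix is the one-hot vector for row i.
one_hot_loadings <- diag(n)

# Hypothetical factors: column 2 carries an effect unique to row 7.
factors <- matrix(0, nrow = n, ncol = p)
factors[7, 2] <- 1.5

# The implied effects matrix has a single nonzero entry, in row 7 of column 2.
X_unique <- one_hot_loadings %*% factors
which(X_unique != 0, arr.ind = TRUE)
```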
FLASH-OHF (for "one-hots first") adds the one-hot vectors first, then backfits, then greedily adds up to ten factors. These greedily added factors are not subsequently backfit, so FLASH-OHF can be much faster than FLASH-OHL. FLASH-OHF+ begins with the FLASH-OHF fit and then performs a second backfitting.
## Simulations
All simulated datasets $Y$ are of dimension 25 x 1000. In each case, $Y = X + E$, where $X$ is the matrix of "true" effects and $E$ is a matrix of $N(0, 1)$ noise. One simulation is from the null model, three are generated according to a MASH model, and three are generated from a FLASH model. See the individual sections below for details.
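The shared setup $Y = X + E$ can be sketched in base R as follows. This is a minimal illustration, not the code in `code/sims.R`; the helper name `add_noise` and the seed are assumptions. The null model is used as the example because its $X$ is simply a matrix of zeros.

```r
# Hypothetical helper: add i.i.d. N(0, 1) noise to a matrix of true effects.
add_noise <- function(X) {
  X + matrix(rnorm(length(X)), nrow = nrow(X), ncol = ncol(X))
}

set.seed(1)
n <- 25    # rows
p <- 1000  # columns

X_null <- matrix(0, nrow = n, ncol = p)  # null model: no true effects
Y_null <- add_noise(X_null)

dim(Y_null)
mean(Y_null^2)  # should be close to the noise variance, 1
```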
The MASH fits are evaluated using built-in functions `get_pm()` to calculate MSE, `get_psd()` to calculate confidence intervals, and `get_lfsr()` to calculate true and false positive rates.
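As a rough sketch of how true and false positive rates can be derived from a matrix of lfsr values, consider the base-R function below. The function name `roc_points` is hypothetical, and the definition of a "positive" (lfsr at or below a threshold, with a true positive being any nonzero true effect, regardless of sign) is an assumption that may differ from the definition used in `code/utils.R`.

```r
# Hypothetical helper: TPR and FPR at each lfsr threshold.
# `lfsr` and `X` are matrices of the same dimension; `X` holds the true effects.
roc_points <- function(lfsr, X, thresholds = seq(0, 0.5, by = 0.01)) {
  is_nonnull <- X != 0
  t(sapply(thresholds, function(thresh) {
    called <- lfsr <= thresh  # effects declared significant at this threshold
    c(threshold = thresh,
      tpr = sum(called & is_nonnull) / sum(is_nonnull),
      fpr = sum(called & !is_nonnull) / sum(!is_nonnull))
  }))
}

# Tiny worked example with two null and two nonnull effects.
X_toy    <- matrix(c(0, 0, 1, -2), nrow = 2)
lfsr_toy <- matrix(c(0.4, 0.05, 0.01, 0.2), nrow = 2)
roc_toy  <- roc_points(lfsr_toy, X_toy, thresholds = c(0.1, 0.5))
roc_toy
```

Plotting the `tpr` column against the `fpr` column over a fine grid of thresholds traces out an ROC curve.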
For the FLASH fits, only MSE is calculated using a built-in function (`flash_get_fitted_values()`). Confidence intervals and true and false positive rates are calculated by sampling from the posterior using the function `flash_sampler()`. For details, see the [code](#code) at the end of this page.
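The sampling-based summaries can be sketched in base R as below. This is an illustration under stated assumptions, not the code in `code/utils.R`: it assumes the posterior samples have already been arranged into an $n \times p \times n_{\text{samp}}$ array (the format returned by `flash_sampler()` is not shown here), and the helper name `summarize_samples` is hypothetical. A toy conjugate normal model stands in for a FLASH posterior.

```r
# Hypothetical helper: summarize posterior samples of an effects matrix.
# `samples` has dimension (n, p, nsamp); `X` is the matrix of true effects.
summarize_samples <- function(samples, X, level = 0.95) {
  alpha <- 1 - level
  post_mean <- apply(samples, c(1, 2), mean)
  lower <- apply(samples, c(1, 2), quantile, probs = alpha / 2)
  upper <- apply(samples, c(1, 2), quantile, probs = 1 - alpha / 2)
  # Local false sign rate: posterior probability of getting the sign wrong.
  p_pos <- apply(samples > 0, c(1, 2), mean)
  p_neg <- apply(samples < 0, c(1, 2), mean)
  list(mse = mean((post_mean - X)^2),
       ci_coverage = mean(lower <= X & X <= upper),
       lfsr = 1 - pmax(p_pos, p_neg))
}

# Toy stand-in for a fitted model: N(0, 1) prior and N(0, 1) noise,
# so the posterior for each entry is N(Y / 2, 1 / 2).
set.seed(1)
n <- 5; p <- 40; nsamp <- 500
X <- matrix(rnorm(n * p), n, p)
Y <- X + matrix(rnorm(n * p), n, p)
samples <- array(rnorm(n * p * nsamp, mean = rep(Y / 2, nsamp), sd = sqrt(0.5)),
                 dim = c(n, p, nsamp))

res <- summarize_samples(samples, X)
res$mse
res$ci_coverage
```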
## Results overview
Without the one-hot vectors, the Vanilla and Zero FLASH fits perform poorly on the MASH simulations (see the ROC curves below). The other FLASH fits all perform similarly to one another. They are outperformed by MASH, especially at low FPR thresholds, but this is expected since the data are, after all, generated from the MASH model. Of these three FLASH methods, OHF is (as expected) the fastest, and is about twice as fast as the slowest method, OHF+.
The Vanilla and Zero fits do much better on the FLASH simulations (again, as expected, since the data are now generated from the FLASH model). More surprisingly, the MASH fit does nearly as well as (and, in some cases, better than) the FLASH fits. OHF is again the fastest of the three FLASH fits that include one-hot vectors, but it performs much more poorly than the other two methods. On the rank-5 model, OHF+ also performs poorly relative to OHL.
Overall, the OHL method consistently performs best among the three FLASH methods that include one-hot vectors. It is slower than OHF, and sometimes even slower than OHF+, but the performance gains on the FLASH data are substantial.
## Null model
Here the entries of $X$ are all zero.
```{r sim1, echo=F}
#
# The output for this analysis was produced by running the code in
# code/main.R. See below ("Code") for details.
#
tmp <- readRDS("./output/sim1res.rds")
knitr::kable(tmp, digits=3)
```
![](images/sim1time.png)
## Model with independent effects
Now each column $X_{:, j}$ is either identically zero (with probability 0.8) or nonnull. In the latter case, the entries of the $j$th column of $X$ are i.i.d. $N(0, 1)$.
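A minimal base-R sketch of this simulation is given below. It is not the code in `code/sims.R`; the function name `sim_independent` and its argument names are hypothetical, and it assumes that each column is null independently with probability 0.8.

```r
# Hypothetical helper: true effects with independent entries in nonnull columns.
sim_independent <- function(n = 25, p = 1000, null_prob = 0.8) {
  X <- matrix(0, nrow = n, ncol = p)
  # Each column is nonnull independently with probability 1 - null_prob.
  nonnull <- rbinom(p, size = 1, prob = 1 - null_prob) == 1
  X[, nonnull] <- rnorm(n * sum(nonnull))
  X
}

set.seed(1)
X_indep <- sim_independent()
Y_indep <- X_indep + matrix(rnorm(length(X_indep)), nrow = nrow(X_indep))

# Proportion of columns that are identically zero (should be near 0.8).
mean(colSums(X_indep != 0) == 0)
```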
```{r sim2, echo=F}
tmp <- readRDS("./output/sim2res.rds")
knitr::kable(tmp, digits=3)
```
![](images/sim2ROC.png)
![](images/sim2time.png)
## Model with independent and shared effects
Again 80% of the columns of $X$ are identically zero. But now, only half of the nonzero columns have entries that are i.i.d. $N(0, 1)$. The other half have entries that are identical across rows, with a value that is drawn from the $N(0, 1)$ distribution. (In other words, the covariance matrix for these columns is a matrix of all ones.)
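The claim about the all-ones covariance matrix can be checked numerically. The base-R sketch below builds only the shared-effect columns (variable names are hypothetical, and the number of columns is inflated so that the empirical estimate is stable) and computes their empirical second-moment matrix.

```r
set.seed(1)
n <- 25
m <- 5000  # number of shared-effect columns (large, for a stable estimate)

# One N(0, 1) draw per column, repeated down all n rows of that column.
shared_value <- rnorm(m)
X_shared <- matrix(rep(shared_value, each = n), nrow = n, ncol = m)

# Empirical second-moment matrix across columns; every entry should be near 1.
emp_cov <- tcrossprod(X_shared) / m
range(emp_cov)
```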
```{r sim3, echo=F}
tmp <- readRDS("./output/sim3res.rds")
knitr::kable(tmp, digits=3)
```
![](images/sim3ROC.png)
![](images/sim3time.png)
## Model with independent, shared, and unique effects
This model is similar to the above two, but now only a third of the nonnull columns have independently distributed entries and a third have shared entries. The other third have a unique nonzero entry. (This corresponds, for example, to a gene that is only expressed in a single condition.) The unique effects are distributed uniformly across rows.
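The unique-effect columns can be sketched in base R as follows. Variable names are hypothetical, and the sketch assumes that the single nonzero entry is drawn from $N(0, 1)$, which the description above does not state explicitly.

```r
set.seed(1)
n <- 25
m <- 200  # number of unique-effect columns, for illustration

X_unique <- matrix(0, nrow = n, ncol = m)

# For each column, pick the affected row uniformly at random.
active_row <- sample.int(n, size = m, replace = TRUE)

# Matrix indexing: set entry (active_row[j], j) for every column j.
X_unique[cbind(active_row, seq_len(m))] <- rnorm(m)

table(colSums(X_unique != 0))  # every column has exactly one nonzero entry
```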
```{r sim4, echo=F}
tmp <- readRDS("./output/sim4res.rds")
knitr::kable(tmp, digits=3)
```
![](images/sim4ROC.png)
![](images/sim4time.png)
## Rank 1 FLASH model
This is the FLASH model $X = LF$, where $L$ is an $n$ by $k$ matrix and $F$ is a $k$ by $p$ matrix. In this first simulation, $k = 1$. 80% of the entries in $F$ and 50% of the entries in $L$ are equal to zero. The other entries are i.i.d. $N(0, 1)$.
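A minimal base-R sketch of simulating from this model is shown below. It is not the code in `code/sims.R`; the function name `sim_flash` and its arguments are hypothetical, and it assumes that each entry is set to zero independently with the stated probability.

```r
# Hypothetical helper: true effects X = L F with sparse loadings and factors.
sim_flash <- function(n = 25, p = 1000, k = 1, l_zero = 0.5, f_zero = 0.8) {
  # Each entry is N(0, 1), then zeroed out independently with the given probability.
  Lmat <- matrix(rnorm(n * k) * rbinom(n * k, 1, 1 - l_zero), nrow = n, ncol = k)
  Fmat <- matrix(rnorm(k * p) * rbinom(k * p, 1, 1 - f_zero), nrow = k, ncol = p)
  list(X = Lmat %*% Fmat, Lmat = Lmat, Fmat = Fmat)
}

set.seed(1)
sim <- sim_flash(k = 1)
Y <- sim$X + matrix(rnorm(length(sim$X)), nrow = nrow(sim$X))

dim(sim$X)
mean(sim$Fmat == 0)  # proportion of zero entries in the factor (near 0.8)
```

Changing `k` and `l_zero` gives higher-rank variants with denser loadings.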
```{r sim5, echo=F}
tmp <- readRDS("./output/sim5res.rds")
knitr::kable(tmp, digits=3)
```
![](images/sim5ROC.png)
![](images/sim5time.png)
## Rank 5 FLASH model
This is the same as above with $k = 5$ and with only 20% of the entries in $L$ equal to zero.
```{r sim6, echo=F}
tmp <- readRDS("./output/sim6res.rds")
knitr::kable(tmp, digits=3)
```
![](images/sim6ROC.png)
![](images/sim6time.png)
## Rank 3 FLASH model with UV
This is similar to the above with $k = 3$ and with 30% of the rows in $L$ equal to zero. In addition, a dense rank-one matrix $W$ is added to $X$ to mimic the effects of unwanted variation. Here, $W = UV$, with $U$ an $n$ by 1 vector and $V$ a 1 by $p$ vector, both of which have entries distributed $N(0, 0.25)$.
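The unwanted-variation term can be sketched in base R as below. Variable names are hypothetical. The description does not say whether 0.25 is a variance or a standard deviation; this sketch assumes it is a variance, so the standard deviation passed to `rnorm()` is 0.5.

```r
set.seed(1)
n <- 25
p <- 1000

# Dense vectors with N(0, 0.25) entries (0.25 is assumed to be the variance).
U <- matrix(rnorm(n, sd = 0.5), nrow = n, ncol = 1)
V <- matrix(rnorm(p, sd = 0.5), nrow = 1, ncol = p)

# Rank-one matrix of unwanted variation, to be added to the sparse effects X.
W <- U %*% V

dim(W)
mean(W != 0)  # dense: every entry is nonzero
```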
```{r sim7, echo=F}
tmp <- readRDS("./output/sim7res.rds")
knitr::kable(tmp, digits=3)
```
![](images/sim7ROC.png)
![](images/sim7time.png)
## Code
for simulating datasets...
```{r sims, code=readLines("../code/sims.R")}
```
...for fitting MASH and FLASH objects...
```{r fits, code=readLines("../code/fits.R")}
```
...for evaluating performance...
```{r utils, code=readLines("../code/utils.R")}
```
...for running simulations and plotting results...
```{r mashvflash, code=readLines("../code/mashvflash.R")}
```
...and the main function calls.
```{r main, code=readLines("../code/main.R"), eval=FALSE}
```