---
title: "MASH v FLASH analysis of GTEx data"
output:
workflowr::wflow_html:
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Here I analyze some GTEx data. The dataset is available at https://stephenslab.github.io/gtexresults/. I use the "random.z" dataset, which consists of $z$-scores for 44 tissues at a random subset of 20,000 tests. For the code used in this analysis, see [below](#code).
## Fitting methods
I used the same methods to fit the data that I used in my [simulation study](MASHvFLASHsims.html). These methods assume that noise is independent among conditions. It is not, but it is still useful to see how the methods compare when applied to a real dataset.
The simulation study suggested that the "one-hots last" (OHL) method typically produces a better fit than the "one-hots first" (OHF) method, even though it can take quite a bit longer. Here I examine that comparison in more detail.
First I load the data and the fits.
```{r load_fits}
devtools::load_all("/Users/willwerscheid/GitHub/flashr/")
library(mashr)
gtex <- readRDS(gzcon(url("https://github.com/stephenslab/gtexresults/blob/master/data/MatrixEQTLSumStats.Portable.Z.rds?raw=TRUE")))
data <- gtex$random.z
data <- t(data)
fl_data <- flash_set_data(data, S = 1)
gtex_mfit <- readRDS("./output/gtexmfit.rds")
gtex_flfit <- readRDS("./output/gtexflfit.rds")
```
The OHL fit was produced by greedily adding a total of 17 factors, then adding 44 fixed one-hot factors (one per condition), then backfitting the whole thing. The objective attained was `r format(flash_get_objective(fl_data, gtex_flfit$fits$OHL), scientific=F)`.
The OHF fit added the 44 fixed one-hot factors, then backfit them, then added only 4 (!) more factors greedily. The resulting objective was much worse than that of the OHL fit, at `r format(flash_get_objective(fl_data, gtex_flfit$fits$OHF), scientific=F)`.
Finally, I tried applying an additional backfitting step to the OHF fit to see how much the objective improved (I call this method "FLASH-OHF+" in the simulation study). The final objective was `r format(flash_get_objective(fl_data, gtex_flfit$fits$OHFp), scientific=F)`: better, but still not as good as the OHL fit.
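The OHL procedure described above can be sketched using the flashr interface loaded earlier. This chunk is illustrative only and is not evaluated; the `Kmax` value and the exact argument names are assumptions and may differ across flashr versions:

```{r ohl_sketch, eval=FALSE}
# Greedily add factors until the objective stops improving
# (for this dataset, the greedy pass stopped at 17 factors).
fl <- flash_add_greedy(fl_data, Kmax = 50)
# Add 44 fixed one-hot loadings, one per condition (tissues are rows).
fl <- flash_add_fixed_l(fl_data, LL = diag(nrow(data)), f_init = fl)
# Backfit everything (greedy factors and one-hot factors together).
fl_OHL <- flash_backfit(fl_data, fl)
```

The OHF variant reverses the order: the fixed one-hot factors are added and backfit first, and the greedy factors are added afterwards.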
It seems clear that the OHL method is the way to go. However, it does take a long time (about 50% longer than MASH):
```{r plot_timing}
data <- c(gtex_mfit$timing$ed, gtex_mfit$timing$mash,
          gtex_flfit$timing$OHL$greedy, gtex_flfit$timing$OHL$backfit,
          gtex_flfit$timing$OHF$greedy, gtex_flfit$timing$OHF$backfit,
          gtex_flfit$timing$OHFp$greedy, gtex_flfit$timing$OHFp$backfit)
time_units <- units(data)
data <- matrix(as.numeric(data), 2, 4)
barplot(data, axes = TRUE,
        main = paste("Average time to fit in", time_units),
        names.arg = c("MASH", "FL-OHL", "FL-OHF", "FL-OHF+"),
        legend.text = c("ED/Greedy", "MASH/Backfit"),
        ylim = c(0, max(colSums(data)) * 1.5))
```
## MASH v FLASH posterior means
The posterior means are quite similar (correlation coefficient = 0.95). The dashed line plots $y = x$:
![](images/gtexcompare.png)
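The comparison in the figure can be reproduced along these lines. This is a non-evaluated sketch: it assumes mashr's `get_pm()` and flashr's `flash_get_fitted_values()` accessors, and that the OHL fit is the one being compared:

```{r pm_sketch, eval=FALSE}
# Posterior means from the two fits, arranged as conditions x tests.
m_pm <- t(get_pm(gtex_mfit$m))
fl_pm <- flash_get_fitted_values(gtex_flfit$fits$OHL)
# Correlation between the two sets of posterior means.
cor(as.vector(m_pm), as.vector(fl_pm))
# Scatterplot with the dashed line y = x.
plot(as.vector(m_pm), as.vector(fl_pm), pch = ".",
     xlab = "MASH posterior means", ylab = "FLASH posterior means")
abline(0, 1, lty = "dashed")
```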
## MASH v FLASH LFSR
Next I look at confusion matrices for gene-condition pairs that are declared significant at a given LFSR threshold. As in the [simulation study](MASHvFLASHsims.html), I used a built-in function to evaluate LFSR for the MASH fit and I sampled from the posterior for the FLASH fit. In general, FLASH appears to be more conservative than MASH.
```{r lfsr}
m_lfsr <- t(get_lfsr(gtex_mfit$m))
fl_lfsr <- readRDS("./output/gtexfllfsr.rds")
confusion_matrix <- function(t) {
  mash_signif <- m_lfsr <= t
  flash_signif <- fl_lfsr <= t
  round(table(mash_signif, flash_signif) / length(mash_signif),
        digits = 3)
}
```
At 5%:
```{r cm05}
confusion_matrix(.05)
```
At 1%:
```{r cm01}
confusion_matrix(.01)
```
<!-- Note: we cannot do 0.1% if we only have 200 samples from the posterior! -->
<!-- At 0.1%: -->
<!-- ```{r cm001} -->
<!-- confusion_matrix(.001) -->
<!-- ``` -->
## Code
Click "Code" to view the code used to obtain the above results.
```{r gtexcode, echo=F, include=F}
knitr::read_chunk("./code/gtex.R")
```
```{r gtex, eval=F}
```