/
MASHvFLASHgtex2.Rmd
287 lines (224 loc) · 12.4 KB
/
MASHvFLASHgtex2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
---
title: "MASH v FLASH workflows for GTEx"
output:
workflowr::wflow_html:
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
While I previously compared MASH and FLASH fits on [strong tests only](MASHvFLASHgtex.html) and on a [random subset of tests](MASHvFLASHrandom.html), here I propose several workflows that are analogous to the one suggested for the GTEx data in [this MASH vignette](https://stephenslab.github.io/mashr/articles/eQTL_outline.html). For the code used in this analysis, see [below](#code).
## Fitting Methods
For MASH, I follow the workflow in the vignette linked above, except that I assume that the null tests are uncorrelated (that is, I set $V = I$). This is almost certainly not the case, but some more work needs to be done before we can handle the case $V \ne I$ in FLASH.
The workflows for FLASH proceed along similar lines to the workflow for MASH:
1. I obtain "data-driven" loadings (analogous to MASH's data-driven covariance matrices) by fitting a FLASH object to the "strong" tests using either the "OHF" method or the "Zero" method described in my [simulation study](MASHvFLASHsims.html).
2. I fix the loadings obtained in step 1 at their expectation $EL$ and add them as fixed data-driven loadings to a FLASH object. Further, I add 45 fixed "canonical" loadings (a vector of all ones and a one-hot vector for each of 44 conditions), which can be viewed as analogous to MASH's canonical covariance matrices. (See the discussion in my [vignette](intro.html).) Since the "Zero" method generates many more data-driven loadings than the "OHF" method, I experiment with only keeping a subset of the data-driven loadings from the "Zero" method.
3. With the loadings fixed, I backfit a FLASH object to the random subset of tests to obtain priors $g_f$ on the factors. I use the random subset rather than the "strong" subset because I want the priors to hold generally, and not just for the strong tests.
4. Finally, using the same fixed loadings as in step 2 and fixing the priors $g_f$ at the values obtained in step 3, I backfit a FLASH object to the strong tests to get posterior means and variances.
I fit five FLASH objects, using five variations of the above workflow:
```{r flash_methods}
methods <- data.frame(fit = as.character(1:5),
data.driven = c("OHF", "Zero (top 5)",
"Zero (top 10)", "Zero (top 20)",
"Zero (full)"),
include.canonical = c(rep("Yes", 4), "No"))
knitr::kable(methods)
```
## Comments on MASH fit
```{r load_mash}
library(mashr)
fpath <- "./output/MASHvFLASHgtex2/"
m_final <- readRDS(paste0(fpath, "m.rds"))
```
It took 2.4 minutes to run Extreme Deconvolution to find data-driven covariance matrices. The MASH fit on the random subset of tests (to determine mixture weights) required 2.7 minutes, and the final MASH fit on the strong tests required only 16 seconds. The total fitting time was 5.3 minutes.
The estimated mixture weights were as follows. Note in particular that there are large weights on the data-driven matrices "ED_tPCA" and "ED_PCA_2," as well as on the canonical "equal_effects" matrix. There are moderate weights on the null matrix, some unique effects (including Testis, Thyroid, and Transformed fibroblasts), and two of the canonical "simple_het" matrices (where effect sizes are assumed to be of equal variance and equally correlated, with correlation coefficients of, respectively, 0.25 and 0.5).
```{r mixwts}
barplot(get_estimated_pi(m_final), las = 2, cex.names = 0.4)
```
Correlation plots for the data-driven matrices are as follows. The first ("ED_tPCA") describes effects that are shared across a handful of tissues (which, notably, does not include brain tissues).
```{r corr}
library(corrplot)
corrplot(m_final$fitted_g$Ulist[["ED_tPCA"]], tl.cex=0.5)
```
The second ("ED_PCA_2") describes a pattern of sharing among brain tissues that is, interestingly, strongly anti-correlated with whole blood.
```{r corr2}
corrplot(m_final$fitted_g$Ulist[["ED_PCA_2"]], tl.cex=0.5)
```
## Comments on FLASH fits
```{r load_flash}
devtools::load_all("/Users/willwerscheid/GitHub/flashr/")
fl_final <- list()
fl_final[[1]] <- readRDS(paste0(fpath, "OHF.rds"))
fl_final[[2]] <- readRDS(paste0(fpath, "top5.rds"))
fl_final[[3]] <- readRDS(paste0(fpath, "top10.rds"))
fl_final[[4]] <- readRDS(paste0(fpath, "top20.rds"))
fl_final[[5]] <- readRDS(paste0(fpath, "zero.rds"))
```
### Data-driven loadings (step 1)
14.3 minutes were required to backfit the 45 canonical loadings and then greedily obtain 2 additional factor/loading pairs using the "OHF" method. The loadings thus obtained were as follows:
```{r dd.OHF}
missing.tissues <- c(7, 8, 19, 20, 24, 25, 31, 34, 37)
gtex.colors <- read.table("https://github.com/stephenslab/gtexresults/blob/master/data/GTExColors.txt?raw=TRUE", sep = '\t', comment.char = '')[-missing.tissues, 2]
par(mar=c(1,1,1,1))
par(mfrow=c(2,1))
for (i in 46:47) {
barplot(fl_final[[1]]$EL[, i], main=paste('Loading', i), las=2,
cex.names = 0.4, col=as.character(gtex.colors), names="")
}
```
6.3 minutes were required to greedily add 37 factor/loading pairs using the "Zero" method. To view plots of the loadings, scroll down to the "FLASH loadings" section [below](#flash_loadings). Note in particular that most of the loadings beyond the first 16 or so are strongly loaded on a single condition. This observation motivated my desire to discard the majority of data-driven loadings. I reasoned that retaining only the "top" data-driven loadings would cut down on the time needed to backfit but would not change the final fit very much.
### Priors on factors (steps 2-3)
The times needed to fit priors to the random subset of tests were as follows.
```{r t_random}
t <- readRDS(paste0(fpath, "t.rds"))
t_random <- data.frame(fit = as.character(1:5),
fixed.loadings = c("Canonical + OHF", "Canonical + Top 5 Zero", "Canonical + Top 10 Zero", "Canonical + Top 20 Zero", "Zero only"),
number.of.factors = sapply(fl_final, function(x) {flash_get_nfactors(x)}),
minutes.to.fit = sapply(t$random, function(x) {as.numeric(x, units="mins")})
)
knitr::kable(t_random, digits=1)
```
To inspect the fitted priors, I plot $w = 1 - \pi_0$ for each $g_f$ (that is, the proportion of genes that we expect to have nonnull loadings for each factor/loading pair).
```{r plot_w}
par(mfrow = c(1, 2))
plot_gf.w <- function(fl, names, colors, main) {
w <- 1 - sapply(fl$gf, function(g) {g$pi0})
barplot(w, ylim=c(0, 1), ylab="w", xlab="factor/loading",
cex.names=0.4, names=names, col=colors, main=main)
}
canonical.names <- c("EE", as.character(1:44))
c.col <- "blue"
d.col <- "red"
plot_gf.w(fl_final[[1]],
c(canonical.names, paste0("D", 1:2)),
c(rep(c.col, 45), rep(d.col, 2)),
main="Canonical + OHF")
plot_gf.w(fl_final[[2]],
c(canonical.names, paste0("D", 1:5)),
c(rep(c.col, 45), rep(d.col, 5)),
main="Canonical + Top 5")
plot_gf.w(fl_final[[3]],
c(canonical.names, paste0("D", 1:10)),
c(rep(c.col, 45), rep(d.col, 10)),
main="Canonical + Top 10")
plot_gf.w(fl_final[[4]],
c(canonical.names, paste0("D", 1:20)),
c(rep(c.col, 45), rep(d.col, 20)),
main="Canonical + Top 20")
plot_gf.w(fl_final[[5]],
paste0("D", 1:37),
rep(d.col, 37),
main="Zero")
```
It is also instructive to plot $\sigma = 1 / \sqrt{a}$ for the $g_f$s that correspond to canonical factors (I don't include data-driven factors here because the value of $a$ is meaningless unless the factors have been normalized). These plots roughly describe how large we can expect non-null identical and unique effects to be.
```{r plot_s}
par(mfrow = c(1, 2))
plot_gf.s <- function(fl, main) {
s <- sqrt(1 / sapply(fl$gf, function(g) {g$a}))
barplot(s[1:45], ylim=c(0, 6), ylab="sigma", xlab="factor/loading",
cex.names=0.4, names=canonical.names, col="blue", main=main)
}
plot_gf.s(fl_final[[1]],
main="Canonical + OHF")
plot_gf.s(fl_final[[2]],
main="Canonical + Top 5")
plot_gf.s(fl_final[[3]],
main="Canonical + Top 10")
plot_gf.s(fl_final[[4]],
main="Canonical + Top 20")
```
### Final backfit (step 4)
The times needed for the final backfits (with loadings and priors $g_f$ fixed at values determined during steps 1-3) were as follows.
```{r t_final}
t_final <- data.frame(fit = as.character(1:5),
fixed.loadings = c("Canonical + OHF", "Canonical + Top 5 Zero", "Canonical + Top 10 Zero", "Canonical + Top 20 Zero", "Zero only"),
number.of.factors = sapply(fl_final, function(x) {flash_get_nfactors(x)}),
minutes.to.fit = sapply(t$final, function(x) {as.numeric(x, units="mins")})
)
knitr::kable(t_final, digits=1)
```
The proportion of variance explained per factor/loading pair is as follows. Since, in each case, the "equal effects" factor/loading pair constitutes by far the largest proportion of variance explained (around 0.7), I only plot the PVE for the other factor/loading pairs.
```{r pve}
par(mfrow = c(1, 2))
plot_pve <- function(fl, names, colors, main) {
pve <- flash_get_pve(fl)
barplot(pve[2:length(pve)], ylim=c(0, .05), ylab="PVE",
xlab="factor/loading", cex.names=0.4,
names=names, col=colors, main=main)
}
canonical.names <- as.character(1:44)
plot_pve(fl_final[[1]],
c(canonical.names, paste0("D", 1:2)),
c(rep(c.col, 44), rep(d.col, 2)),
main="Canonical + OHF")
plot_pve(fl_final[[2]],
c(canonical.names, paste0("D", 1:5)),
c(rep(c.col, 44), rep(d.col, 5)),
main="Canonical + Top 5")
plot_pve(fl_final[[3]],
c(canonical.names, paste0("D", 1:10)),
c(rep(c.col, 44), rep(d.col, 10)),
main="Canonical + Top 10")
plot_pve(fl_final[[4]],
c(canonical.names, paste0("D", 1:20)),
c(rep(c.col, 44), rep(d.col, 20)),
main="Canonical + Top 20")
plot_pve(fl_final[[5]],
paste0("D", 2:37),
rep(d.col, 36),
main="Zero")
```
## Total fitting time
The total time needed for each workflow is as follows.
```{r total_time}
fl_t <- (t_random$minutes.to.fit + t_final$minutes.to.fit
+ c(t$strong$OHF, t$strong$zero * c(5, 10, 20, 37) / 37))
total_time <- data.frame(method = c("OHF", "Top 5", "Top 10",
"Top 20", "Zero", "MASH"),
total.time.to.fit = c(fl_t, 5.3))
knitr::kable(total_time, digits=1)
```
## MASH v FLASH initial observations
There is substantial disagreement among fits. The correlation matrix for posterior means is as follows.
```{r cor_mat}
all_fitted <- sapply(fl_final, flash_get_fitted_values)
all_fitted <- cbind(all_fitted, as.vector(t(get_pm(m_final))))
colnames(all_fitted) <- c("OHF", "Top 5", "Top 10",
"Top 20", "Zero", "MASH")
round(cor(all_fitted), digits=3)
```
Entry $(i, j)$ in the following matrix gives, of the values that method $j$ finds to be significant at 5%, the proportion of values that method $i$ also finds to be significant at 5%. For example, of the values that MASH finds to be significant, OHF also finds 90% to be significant. But of the values that OHF finds to be significant, MASH only agrees in 81% of cases.
```{r signif_mat}
fl_lfsr <- readRDS(paste0(fpath, "fllfsr.rds"))
all_signif <- sapply(fl_lfsr, function(x) {as.numeric(x <= 0.05)})
all_signif <- cbind(all_signif, as.vector(t(get_lfsr(m_final) <= 0.05)))
m <- ncol(all_signif)
agree_mat <- matrix(0, nrow=m, ncol=m)
for (i in 1:m) {
for (j in 1:m) {
agree_mat[i, j] <- (sum(all_signif[, i] * all_signif[, j]) /
sum(all_signif[, j]))
}
}
rownames(agree_mat) <- colnames(agree_mat) <- colnames(all_fitted)
round(agree_mat, digits=2)
```
I will offer some explanations for the differences among fits in a subsequent analysis.
## FLASH loadings
The FLASH loadings (as fitted on the strong tests) are as follows.
```{r flash_factors}
par(mar=c(1,1,1,1))
par(mfrow=c(3,2))
for(i in 1:flash_get_nfactors(fl_final[[5]])){
barplot(fl_final[[5]]$EL[, i], main=paste('Loading', i), las=2,
cex.names = 0.4, col=as.character(gtex.colors), names="")
}
```
## Code
Click "Code" to view the code used to obtain the above results.
```{r gtexcode, echo=F, include=F}
knitr::read_chunk("./code/MASHvFLASHgtex2.R")
```
```{r gtex2, eval=F}
```