analysis/MASHvFLASHgtex2.Rmd

---
title: "MASH v FLASH workflows for GTEx"
output: 
  workflowr::wflow_html:
    code_folding: hide
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

While I previously compared MASH and FLASH fits on [strong tests only](MASHvFLASHgtex.html) and on a [random subset of tests](MASHvFLASHrandom.html), here I propose several workflows that are analogous to the one suggested for the GTEx data in [this MASH vignette](https://stephenslab.github.io/mashr/articles/eQTL_outline.html). For the code used in this analysis, see [below](#code).


## Fitting Methods

For MASH, I follow the workflow in the vignette linked above, except that I assume that the null tests are uncorrelated (that is, I set $V = I$). This is almost certainly not the case, but some more work needs to be done before we can handle the case $V \ne I$ in FLASH.

The workflows for FLASH proceed along similar lines to the workflow for MASH:

1. I obtain "data-driven" loadings (analogous to MASH's data-driven covariance matrices) by fitting a FLASH object to the "strong" tests using either the "OHF" method or the "Zero" method described in my [simulation study](MASHvFLASHsims.html).

2. I fix the loadings obtained in step 1 at their expectation $EL$ and add them as fixed data-driven loadings to a FLASH object. Further, I add 45 fixed "canonical" loadings (a vector of all ones and a one-hot vector for each of 44 conditions), which can be viewed as analogous to MASH's canonical covariance matrices. (See the discussion in my [vignette](intro.html).) Since the "Zero" method generates many more data-driven loadings than the "OHF" method, I experiment with only keeping a subset of the data-driven loadings from the "Zero" method.

3. With the loadings fixed, I backfit a FLASH object to the random subset of tests to obtain priors $g_f$ on the factors. I use the random subset rather than the "strong" subset because I want the priors to hold generally, and not just for the strong tests.

4. Finally, using the same fixed loadings as in step 2 and fixing the priors $g_f$ at the values obtained in step 3, I backfit a FLASH object to the strong tests to get posterior means and variances.

I fit five FLASH objects, using five variations of the above workflow:
```{r flash_methods}
methods <- data.frame(fit = as.character(1:5),
                      data.driven = c("OHF", "Zero (top 5)", 
                                      "Zero (top 10)", "Zero (top 20)",
                                      "Zero (full)"),
                      include.canonical = c(rep("Yes", 4), "No"))
knitr::kable(methods)
```


## Comments on MASH fit

```{r load_mash}
library(mashr)

fpath <- "./output/MASHvFLASHgtex2/"
m_final <- readRDS(paste0(fpath, "m.rds"))
```

It took 2.4 minutes to run Extreme Deconvolution to find data-driven covariance matrices. The MASH fit on the random subset of tests (to determine mixture weights) required 2.7 minutes, and the final MASH fit on the strong tests required only 16 seconds. The total fitting time was 5.3 minutes. 

The estimated mixture weights were as follows. Note in particular that there are large weights on the data-driven matrices "ED_tPCA" and "ED_PCA_2," as well as on the canonical "equal_effects" matrix. There are moderate weights on the null matrix, some unique effects (including Testis, Thyroid, and Transformed fibroblasts), and two of the canonical "simple_het" matrices (where effect sizes are assumed to be of equal variance and equally correlated, with correlation coefficients of, respectively, 0.25 and 0.5).
```{r mixwts}
barplot(get_estimated_pi(m_final), las = 2, cex.names = 0.4)
```

Correlation plots for the data-driven matrices are as follows. The first ("ED_tPCA") describes effects that are shared across a handful of tissues (which, notably, does not include brain tissues). 

```{r corr}
library(corrplot)

corrplot(m_final$fitted_g$Ulist[["ED_tPCA"]], tl.cex=0.5)
```

The second ("ED_PCA_2") describes a pattern of sharing among brain tissues that is, interestingly, strongly anti-correlated with whole blood.

```{r corr2}
corrplot(m_final$fitted_g$Ulist[["ED_PCA_2"]], tl.cex=0.5)
```


## Comments on FLASH fits

```{r load_flash}
devtools::load_all("/Users/willwerscheid/GitHub/flashr/")

fl_final <- list()
fl_final[[1]] <- readRDS(paste0(fpath, "OHF.rds"))
fl_final[[2]] <- readRDS(paste0(fpath, "top5.rds"))
fl_final[[3]] <- readRDS(paste0(fpath, "top10.rds"))
fl_final[[4]] <- readRDS(paste0(fpath, "top20.rds"))
fl_final[[5]] <- readRDS(paste0(fpath, "zero.rds"))
```

### Data-driven loadings (step 1)

14.3 minutes were required to backfit the 45 canonical loadings and then greedily obtain 2 additional factor/loading pairs using the "OHF" method. The loadings thus obtained were as follows:

```{r dd.OHF}
missing.tissues <- c(7, 8, 19, 20, 24, 25, 31, 34, 37)
gtex.colors <- read.table("https://github.com/stephenslab/gtexresults/blob/master/data/GTExColors.txt?raw=TRUE", sep = '\t', comment.char = '')[-missing.tissues, 2]

par(mar=c(1,1,1,1))
par(mfrow=c(2,1))
for (i in 46:47) {
  barplot(fl_final[[1]]$EL[, i], main=paste('Loading', i), las=2,
          cex.names = 0.4, col=as.character(gtex.colors), names="")
}
```

6.3 minutes were required to greedily add 37 factor/loading pairs using the "Zero" method. To view plots of the loadings, scroll down to the "FLASH loadings" section [below](#flash_loadings). Note in particular that most of the loadings beyond the first 16 or so are strongly loaded on a single condition. This observation motivated my desire to discard the majority of data-driven loadings. I reasoned that retaining only the "top" data-driven loadings would cut down on the time needed to backfit but would not change the final fit very much.

### Priors on factors (steps 2-3)

The times needed to fit priors to the random subset of tests were as follows.

```{r t_random}
t <- readRDS(paste0(fpath, "t.rds"))
t_random <- data.frame(fit = as.character(1:5),
                       fixed.loadings = c("Canonical + OHF", "Canonical + Top 5 Zero", "Canonical + Top 10 Zero", "Canonical + Top 20 Zero", "Zero only"), 
                       number.of.factors = sapply(fl_final, function(x) {flash_get_nfactors(x)}),
                       minutes.to.fit = sapply(t$random, function(x) {as.numeric(x, units="mins")})
)
knitr::kable(t_random, digits=1)
```

To inspect the fitted priors, I plot $w = 1 - \pi_0$ for each $g_f$ (that is, the proportion of genes that we expect to have nonnull loadings for each factor/loading pair).

```{r plot_w}
par(mfrow = c(1, 2))
plot_gf.w <- function(fl, names, colors, main) {
  w <- 1 - sapply(fl$gf, function(g) {g$pi0})
  barplot(w, ylim=c(0, 1), ylab="w", xlab="factor/loading",
          cex.names=0.4, names=names, col=colors, main=main)
}
canonical.names <- c("EE", as.character(1:44))
c.col <- "blue"
d.col <- "red"
plot_gf.w(fl_final[[1]], 
          c(canonical.names, paste0("D", 1:2)),
          c(rep(c.col, 45), rep(d.col, 2)),
          main="Canonical + OHF")
plot_gf.w(fl_final[[2]], 
          c(canonical.names, paste0("D", 1:5)),
          c(rep(c.col, 45), rep(d.col, 5)),
          main="Canonical + Top 5")
plot_gf.w(fl_final[[3]], 
          c(canonical.names, paste0("D", 1:10)),
          c(rep(c.col, 45), rep(d.col, 10)),
          main="Canonical + Top 10")
plot_gf.w(fl_final[[4]], 
          c(canonical.names, paste0("D", 1:20)),
          c(rep(c.col, 45), rep(d.col, 20)),
          main="Canonical + Top 20")
plot_gf.w(fl_final[[5]], 
          paste0("D", 1:37),
          rep(d.col, 37),
          main="Zero")
```

It is also instructive to plot $\sigma = 1 / \sqrt{a}$ for the $g_f$s that correspond to canonical factors (I don't include data-driven factors here because the value of $a$ is meaningless unless the factors have been normalized). These plots roughly describe how large we can expect non-null identical and unique effects to be.

```{r plot_s}
par(mfrow = c(1, 2))
plot_gf.s <- function(fl, main) {
  s <- sqrt(1 / sapply(fl$gf, function(g) {g$a}))
  barplot(s[1:45], ylim=c(0, 6), ylab="sigma", xlab="factor/loading",
          cex.names=0.4, names=canonical.names, col="blue", main=main)
}
plot_gf.s(fl_final[[1]], 
          main="Canonical + OHF")
plot_gf.s(fl_final[[2]],
          main="Canonical + Top 5")
plot_gf.s(fl_final[[3]],
          main="Canonical + Top 10")
plot_gf.s(fl_final[[4]],
          main="Canonical + Top 20")
```

### Final backfit (step 4)

The times needed for the final backfits (with loadings and priors $g_f$ fixed at values determined during steps 1-3) were as follows.

```{r t_final}
t_final <- data.frame(fit = as.character(1:5),
                      fixed.loadings = c("Canonical + OHF", "Canonical + Top 5 Zero", "Canonical + Top 10 Zero", "Canonical + Top 20 Zero", "Zero only"), 
                       number.of.factors = sapply(fl_final, function(x) {flash_get_nfactors(x)}),
                       minutes.to.fit = sapply(t$final, function(x) {as.numeric(x, units="mins")})
)
knitr::kable(t_final, digits=1)
```

The proportion of variance explained per factor/loading pair is as follows. Since, in each case, the "equal effects" factor/loading pair constitutes by far the largest proportion of variance explained (around 0.7), I only plot the PVE for the other factor/loading pairs.

```{r pve}
par(mfrow = c(1, 2))
plot_pve <- function(fl, names, colors, main) {
  pve <- flash_get_pve(fl)
  barplot(pve[2:length(pve)], ylim=c(0, .05), ylab="PVE", 
          xlab="factor/loading", cex.names=0.4,
          names=names, col=colors, main=main)
}
canonical.names <- as.character(1:44)
plot_pve(fl_final[[1]], 
         c(canonical.names, paste0("D", 1:2)),
         c(rep(c.col, 44), rep(d.col, 2)),
         main="Canonical + OHF")
plot_pve(fl_final[[2]], 
         c(canonical.names, paste0("D", 1:5)),
         c(rep(c.col, 44), rep(d.col, 5)),
         main="Canonical + Top 5")
plot_pve(fl_final[[3]], 
         c(canonical.names, paste0("D", 1:10)),
         c(rep(c.col, 44), rep(d.col, 10)),
         main="Canonical + Top 10")
plot_pve(fl_final[[4]], 
         c(canonical.names, paste0("D", 1:20)),
         c(rep(c.col, 44), rep(d.col, 20)),
         main="Canonical + Top 20")
plot_pve(fl_final[[5]], 
         paste0("D", 2:37),
         rep(d.col, 36),
         main="Zero")
```


## Total fitting time

The total time needed for each workflow is as follows.

```{r total_time}
fl_t <- (t_random$minutes.to.fit + t_final$minutes.to.fit
         + c(t$strong$OHF, t$strong$zero * c(5, 10, 20, 37) / 37))
total_time <- data.frame(method = c("OHF", "Top 5", "Top 10", 
                                    "Top 20", "Zero", "MASH"),
                         total.time.to.fit = c(fl_t, 5.3))
knitr::kable(total_time, digits=1)
```


## MASH v FLASH initial observations

There is substantial disagreement among fits. The correlation matrix for posterior means is as follows.

```{r cor_mat}
all_fitted <- sapply(fl_final, flash_get_fitted_values)
all_fitted <- cbind(all_fitted, as.vector(t(get_pm(m_final))))
colnames(all_fitted) <- c("OHF", "Top 5", "Top 10", 
                          "Top 20", "Zero", "MASH")
round(cor(all_fitted), digits=3)
```

Entry $(i, j)$ in the following matrix gives, of the values that method $j$ finds to be significant at 5%, the proportion of values that method $i$ also finds to be significant at 5%. For example, of the values that MASH finds to be significant, OHF also finds 90% to be significant. But of the values that OHF finds to be significant, MASH only agrees in 81% of cases.

```{r signif_mat}
fl_lfsr <- readRDS(paste0(fpath, "fllfsr.rds"))

all_signif <- sapply(fl_lfsr, function(x) {as.numeric(x <= 0.05)})
all_signif <- cbind(all_signif, as.vector(t(get_lfsr(m_final) <= 0.05)))
m <- ncol(all_signif)
agree_mat <- matrix(0, nrow=m, ncol=m)
for (i in 1:m) {
  for (j in 1:m) {
    agree_mat[i, j] <- (sum(all_signif[, i] * all_signif[, j]) /
                          sum(all_signif[, j]))
  }
}
rownames(agree_mat) <- colnames(agree_mat) <- colnames(all_fitted)
round(agree_mat, digits=2)
```

I will offer some explanations for the differences among fits in a subsequent analysis.


## FLASH loadings

The FLASH loadings (as fitted on the strong tests) are as follows.

```{r flash_factors}
par(mar=c(1,1,1,1))
par(mfrow=c(3,2))
for(i in 1:flash_get_nfactors(fl_final[[5]])){
  barplot(fl_final[[5]]$EL[, i], main=paste('Loading', i), las=2,
          cex.names = 0.4, col=as.character(gtex.colors), names="")
}
```


## Code

Click "Code" to view the code used to obtain the above results.
```{r gtexcode, echo=F, include=F}
knitr::read_chunk("./code/MASHvFLASHgtex2.R")
```
```{r gtex2, eval=F}
```