analysis/MASHvFLASHsims.Rmd

---
title: "MASH v FLASH simulation results"
output: 
  workflowr::wflow_html:
    code_folding: hide
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

Here I compare MASH and FLASH fits to data simulated from various MASH and FLASH models. In addition to comparing different methods for obtaining FLASH fits, I was interested to see how FLASH performed on data generated from a MASH model (and vice versa). For the code used in this analysis, see [below](#code).


## Fitting methods

The MASH fit is produced following the recommendations in the MASH vignettes (using both canonical matrices and data-driven matrices).

Three FLASH fits are produced. The first serves as a baseline. The "Zero" FLASH fit simply adds up to 25 factors greedily, with `var_type` set to `"zero"` to reflect the fact that standard errors are fixed at 1.

The other two fits also use parameter option `var_type = "zero"`. OHL (for "one-hots last") adds up to 25 factors greedily, then adds "canonical" factors (an all-ones vector and a one-hot vector for each row in the data matrix). It backfits the canonical factors, but not the greedily-added ones.

OHF (for "one-hots first") adds the canonical factors first, then backfits, then greedily adds up to 25 factors.

I also experimented with doing an additional backfit for each of the three FLASH fits. It increased the time needed to fit (sometimes as much as eightfold), but results were otherwise very similar.


## Simulations

All simulated datasets $Y$ are of dimension 25 x 1000. In each case, $Y = X + E$, where $X$ is the matrix of "true" effects and $E$ is a matrix of $N(0, 1)$ noise. One simulation is from the null model, three are generated according to a MASH model, and two are generated from a FLASH model. See the individual sections below for details. 

The MASH fits are evaluated using built-in functions `get_pm()` to calculate MSE, `get_psd()` to calculate confidence intervals, and `get_lfsr()` to calculate true and false positive rates. 

For the FLASH fits, only MSE is calculated using a built-in function (`flash_get_fitted_values()`). Confidence intervals and true and false positive rates are calculated by sampling from the posterior using function `flash_sampler()`. For details, see the code [below](#code).


## Results: null model

Here the entries of $X$ are all zero.

While FLASH does very well, MASH does poorly here. I'm not sure why this is the case; in runs with different seeds, MASH performed at least as well as FLASH.

```{r sim1, echo=F}
#
# The output for this analysis was produced by running the code in
#   code/MASHvFLASHsims.R. See below ("Code") for details.
#
fpath <- "./output/MASHvFLASHsims/greedy/"
tmp <- readRDS(paste0(fpath, "sim1res.rds"))
knitr::kable(tmp, digits=3)
```

![](images/sim1time.png)

## Results: MASH models

As expected, MASH outperforms FLASH when the data is generated from a MASH model. The MSE obtained using a MASH fit is about half the MSE obtained using any of the FLASH fits, and the 95% confidence intervals given by MASH contain the "true" values in 98-99% of cases, whereas the FLASH coverage can fall a bit short of 95%. Further, the MASH ROC curves consistently dominate the FLASH ROC curves.

Interestingly, however, there is no clear victor among the FLASH methods. Results for the "Zero" and OHL methods are nearly identical. I suspect that this is because loadings for the "canonical" factors are estimated to be essentially zero, so that there is no substantial difference between the two fits. The OHF method tends to do slightly better in terms of MSE and confidence interval coverage, but the OHL ROC curves dominate the OHF ROC curves for FPR < 0.6. 

The OHF method is consistently faster than both MASH and the other FLASH methods, while the OHL method is the slowest of the four methods.

### Independent effects

Here the columns $X_{:, j}$ are either identically zero (with probability 0.8) or identically nonzero. In the latter case, the entries of the $j$th column of $X$ are i.i.d. $N(0, 2^2)$.

```{r sim2, echo=F}
tmp <- readRDS(paste0(fpath, "sim2res.rds"))
knitr::kable(tmp, digits=2)
```

![](images/sim2ROC.png)
![](images/sim2time.png)

### Independent and shared effects

Again 80% of the columns of $X$ are identically zero. But now, only half of the nonzero columns have entries that are i.i.d. $N(0, 2^2).$ The other half have entries that are identical across rows, with a value that is drawn from a $N(0, 2)$ distribution. (In other words, the covariance matrix for these columns is a matrix whose entries are all equal to 2.)

```{r sim3, echo=F}
tmp <- readRDS(paste0(fpath, "sim3res.rds"))
knitr::kable(tmp, digits=2)
```

![](images/sim3ROC.png)
![](images/sim3time.png)

### Independent, shared, and unique effects

This model is similar to the above two, but now only a third of the nonnull columns have independently distributed entries and a third have shared entries. The other third have a unique nonzero entry. (This corresponds, for example, to a gene that is only expressed in a single condition.) The unique effects are distributed uniformly across rows, and are drawn from a $N(0, 10^2)$ distribution.

```{r sim4, echo=F}
tmp <- readRDS(paste0(fpath, "sim4res.rds"))
knitr::kable(tmp, digits=2)
```

![](images/sim4ROC.png)
![](images/sim4time.png)

## Results: FLASH models

Interestingly, MASH is very competitive with FLASH even when the true model is a FLASH model. Indeed, of the four fits, the MASH fit gives the best results for the rank 5 model.

Results for the "Zero" and OHL fits are again very similar, and both do quite a bit better than the OHF method, which does poorly in terms of MSE on the rank 1 model and in terms of FPR/TPR trade-off on the rank 5 model. (However, it outperforms the "Zero" and OHL fits in terms of confidence interval coverage.) Further, the OHF method is much slower here (if not worryingly so). I suspect that this is because the canonical factors are quickly "zeroed out" when the greedily-added "data-driven" factors (which, I assume, are pretty close to the true $F$) have already been added, whereas much more time is spent backfitting them when no data-driven factors have yet been added.

### Rank 1 model

This is the FLASH model $X = LF$, where $L$ is an $n$ by $k$ matrix and $F$ is a $k$ by $p$ matrix. In this first simulation, $k = 1$. 80% of the columns in $F$ and 20% of the entries in $L$ are equal to zero. The other entries of $F$ are i.i.d. $N(0, 1)$; the nonzero entries of $L$ are i.i.d. $N(0, 2^2)$.

```{r sim5, echo=F}
tmp <- readRDS(paste0(fpath, "sim5res.rds"))
knitr::kable(tmp, digits=2)
```

![](images/sim5ROC.png)
![](images/sim5time.png)

### Rank 5 model 

This is the same as above with $k = 5$, with 80% of the entries in $L$ equal to zero, and with the other entries i.i.d. $N(0, 1)$.

```{r sim6, echo=F}
tmp <- readRDS(paste0(fpath, "sim6res.rds"))
knitr::kable(tmp, digits=2)
```

![](images/sim6ROC.png)
![](images/sim6time.png)


## Code

for simulating datasets...
```{r sims, code=readLines("../code/sims.R")}
```

...for fitting MASH and FLASH objects...
```{r fits, code=readLines("../code/fits.R")}
```

...for evaluating performance...
```{r utils, code=readLines("../code/utils.R")}
```

...for running simulations and plotting results...
```{r mashvflash, code=readLines("../code/mashvflash.R")}
```

...and the main function calls.
```{r main, code=readLines("../code/MASHvFLASHsims.R"), eval=FALSE}
```