# Multivariate adaptive shrinkage (MASH) analysis of eQTL data

This is a "blackbox" implementation of the `mashr` eQTL workflow. 

Since the submission of Urbut 2017, we have improved the implementation of the MASH algorithm and released it as an R package, [`mashr`](https://github.com/stephenslab/mashr). Major improvements over Urbut 2017 are:

1. Faster computation of likelihood and posterior matrices via matrix algebra tricks and a C++ implementation.
2. Faster computation of MASH mixture via convex optimization.
3. A `FLASH`-based method for constructing prior covariance matrices.

Here the input data have already been converted from the original eQTL summary statistics to a format convenient for analysis in MASH. For details on the data conversion step, see the README and the `fastqtl_to_mash.ipynb` workflow notebook.

In this notebook, we do not apply the inference to all gene-snp pairs. Rather, we focus on the "top" gene-snp pairs as a demonstration. It should be straightforward to configure the posterior computation step to work on all gene-snp pairs instead.

To run the workflow:

```bash
sos run workflows/mashr_flashr_workflow.ipynb # --data ... --cwd ...
```

The notebook runs the default setting, i.e., all analysis steps. To view all available analysis options:

```bash
sos run workflows/mashr_flashr_workflow.ipynb -h
```

The output of this notebook is written to the `./mashr_flashr_workflow_output` folder (configurable via the `cwd` parameter). We keep track of results from every MASH step, though the inference of interest is in the `*.posterior.rds` file generated at the end of the pipeline.
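
To locate the final result after a run, something like the following minimal sketch could be used. It assumes the default output folder name from this notebook; `find_posterior_files` is an illustrative helper, not part of SoS or `mashr`, and the `.rds` files themselves still need to be read in R with `readRDS`.

```python
from pathlib import Path


def find_posterior_files(workdir="mashr_flashr_workflow_output"):
    """Return the sorted list of *.posterior.rds files directly under workdir."""
    return sorted(Path(workdir).glob("*.posterior.rds"))


if __name__ == "__main__":
    # Prints nothing if the workflow has not been run yet.
    for path in find_posterior_files():
        print(path)
```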

The current mashr implementation requires that you place a copy of the [MOSEK license file](https://www.mosek.com/products/academic-licenses) at `<workdir>/mosek.lic` (i.e., `mashr_flashr_workflow_output/mosek.lic` if you did not change any settings below).
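
Before launching the workflow, it can help to confirm that the license is in place and is a real copy rather than a symbolic link, which is what the `[global]` section's error message asks for. The following is a minimal sketch under that assumption; `check_mosek_license` is a hypothetical helper written for illustration and is not part of SoS, MOSEK, or `mashr`.

```python
from pathlib import Path


def check_mosek_license(workdir="mashr_flashr_workflow_output"):
    """Return a list of problems found with <workdir>/mosek.lic (empty if none)."""
    license_path = Path(workdir) / "mosek.lic"
    problems = []
    if license_path.is_symlink():
        # The workflow asks for a real copy, not a symbolic link.
        problems.append(f"{license_path} is a symbolic link; copy the file instead")
    elif not license_path.is_file():
        problems.append(f"{license_path} does not exist or is not a regular file")
    return problems


if __name__ == "__main__":
    for problem in check_mosek_license():
        print(problem)
```

An empty list means the file is where the workflow expects it; it does not check that the license itself is valid.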

In [None]:
[global]
parameter: cwd = path('./mashr_flashr_workflow_output')
parameter: data = path("data/MatrixEQTLSumStats.Portable.Z.rds")
parameter: vhat = 1
parameter: alpha = 1
parameter: mosek_license = file_target(f"{cwd:a}/mosek.lic")
flash_data = file_target(f"{cwd:a}/{data:bn}.flash.rds")
fail_if(not mosek_license.is_file(), msg = f'Please put a valid copy (NOT a symbolic link!) of MOSEK license to: \n``{mosek_license}``')

### FLASH analysis
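
The FLASH step below turns flash factors and clustered loadings into candidate prior covariance matrices in two ways. Each factor `f` contributes a rank-one matrix `f f^T` (assuming `mashr::cov_from_factors` forms the outer product of each factor), and each loading cluster with mean `mu` and covariance `Sigma` contributes `F (Sigma + mu mu^T) F^T`, where `F` holds the first `K` factors. The arithmetic can be sketched in pure Python on toy numbers; all helper names here are illustrative and not part of `flashr` or `mashr`.

```python
def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def outer(u, v):
    """Outer product u v^T of two vectors."""
    return [[x * y for y in v] for x in u]


def rank_one_cov(factor):
    """Rank-one covariance f f^T built from a single factor."""
    return outer(factor, factor)


def loading_cluster_cov(factors, sigma, mu):
    """F (Sigma + mu mu^T) F^T for one loading cluster.

    factors: R x K matrix (R conditions, K factors)
    sigma:   K x K cluster covariance
    mu:      length-K cluster mean
    """
    mu_outer = outer(mu, mu)
    second_moment = [[s + m for s, m in zip(s_row, m_row)]
                     for s_row, m_row in zip(sigma, mu_outer)]
    return matmul(matmul(factors, second_moment), transpose(factors))


if __name__ == "__main__":
    F = [[1, 0], [0, 1], [1, 1]]      # toy example: 3 conditions, 2 factors
    Sigma = [[1, 0], [0, 2]]
    mu = [1, 1]
    print(loading_cluster_cov(F, Sigma, mu))
    print(rank_one_cov([1, 0, 1]))
```

Adding `mu mu^T` to `Sigma` gives the uncentered second moment of the cluster, so the resulting matrix reflects where a cluster of loadings sits as well as how it spreads.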

In [None]:
[flash: provides = flash_data]
# Perform FLASH analysis (time estimate: 20min)
depends: R_library("flashr@stephenslab/flashr"), R_library('mclust'), R_library('plyr')
K = 5 # flash factors
input: f"{data:a}"
output: flash_data
R: expand = "${ }", workdir = cwd
    get_flash_prior = function(strong.z, K = ${K}, factor_plot = "${_output:n}.factors.pdf") {
        Z.center = apply(strong.z, 2, function(x) x - mean(x))
        flash.data = flashr::flash_set_data(Z.center)
        fmodel = flashr::flash(flash.data, greedy = TRUE, backfit = TRUE)
        factors = flashr::flash_get_ldf(fmodel)$f
        row.names(factors) = colnames(strong.z)
        pve.order = order(flashr::flash_get_pve(fmodel), decreasing = TRUE)
        if (!is.null(factor_plot)) {
            pdf(factor_plot)
            #par(mar=c(.5,.5,.5,.5))
            #par(mfrow=c(ceiling(length(pve.order) / 2), 2))
            for(i in pve.order){
              barplot(factors[,i], main=paste0('Factor ',i, ' pve= ', round(flashr::flash_get_pve(fmodel)[i],3)), las=2, cex.names = 0.7)
            }
            dev.off()
          }
        # flash on the loading
        K = min(K, ncol(fmodel$EL))
        loading = fmodel$EL[,1:K]
        colnames(loading) = paste0('Factor',seq(1,K))
        flash.loading = flashr::flash_set_data(loading)
        flmodel = flashr::flash(flash.loading, greedy = TRUE, backfit = TRUE)
        # Cluster loadings
        library(mclust)
        mod = Mclust(loading)
        U_list = plyr::alply(mod$parameters$variance$sigma,3)
        mu_list = plyr::alply(mod$parameters$mean,2)
        ll = list()
        for (i in 1:length(U_list)){
          ll[[i]] = U_list[[i]] + mu_list[[i]] %*% t(mu_list[[i]])
        }
        # Use the same K factors as the loadings above (K may be smaller than requested)
        Factors = fmodel$EF[,1:K]
        U.loading = lapply(ll, function(U){Factors %*% (U %*% t(Factors))})
        names(U.loading) = paste0('Load', K, "_", (1:length(U.loading)))
        # rank 1
        Flash_res = flashr::flash_get_lf(fmodel)
        U.flash = c(mashr::cov_from_factors(t(as.matrix(factors)), "Flash"),
                    list("tFlash" = t(Flash_res) %*% Flash_res / nrow(Z.center)))
        return(list(U.loading = U.loading, U.flash = U.flash))
    }
    #
    res = get_flash_prior(readRDS(${_input:r})$strong.z)
    saveRDS(res, ${_output:r})

### MASH analysis
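
One of the data-driven candidates assembled in the first MASH step below is the empirical covariance of the column-centered z-scores (named `XX` in the R code): each column is centered at its mean, and the cross-product is divided by the number of rows. A minimal pure-Python sketch of that computation on toy data follows; the function names are illustrative only.

```python
def center_columns(rows):
    """Subtract each column's mean from every entry in that column."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def empirical_cov(rows):
    """Compute t(D) %*% D / nrow(D), where D is the column-centered input."""
    centered = center_columns(rows)
    n = len(centered)
    p = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(p)]
            for i in range(p)]


if __name__ == "__main__":
    z = [[1, 2], [3, 6]]   # toy example: 2 gene-snp pairs, 2 conditions
    print(empirical_cov(z))
```

Note that the divisor is the number of rows, matching the R code, rather than the `n - 1` used by R's `cov()`.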

In [None]:
[default_1: shared = {'mash_input': '_input', 'mash_vhat': '_output[1]'}]
# Compute data-driven / canonical prior matrices (time estimate: ~2h)
depends: R_library("ExtremeDeconvolution"), R_library("mashr@stephenslab/mashr"), flash_data
K = 3 # as in mash paper
P = 3 # as in mash paper
input: f"{data:a}"
output: f"{cwd:a}/{data:bn}.V{vhat}.K{K}.P{P}.rds", f"{cwd:a}/{data:bn}.Vhat.rds"
R: expand = "${ }", workdir = cwd
    dat = readRDS(${_input:r})
    data_temp = mashr::mash_set_data(as.matrix(dat$random.b), as.matrix(dat$random.s))
    vhat = mashr::estimate_null_correlation(data_temp)
    rm(data_temp)
    mash_data = mashr::mash_set_data(as.matrix(dat$strong.b), Shat=as.matrix(dat$strong.s), alpha=${alpha}, V=${"vhat" if vhat else "diag(nrow(vhat))"})
    # FLASH matrices
    flash_res = readRDS(${flash_data:r})
    # SVD matrices
    U.pca = mashr::cov_pca(mash_data, ${P})
    # D.center is used to compute the empirical covariance matrix
    D.center = apply(as.matrix(dat$strong.z), 2, function(x) x - mean(x))
    # Denoised data-driven matrices
    U.dd = c(flash_res$U.flash, flash_res$U.loading, U.pca, list("XX" = t(D.center) %*% D.center / nrow(dat$strong.z)))
    U.ed = mashr::cov_ed(mash_data, U.dd)
    # Canonical matrices
    U.can = mashr::cov_canonical(mash_data)
    saveRDS(list(Ulist = c(U.ed, U.can), DD_raw = U.dd), ${_output[0]:r})
    saveRDS(vhat, ${_output[1]:r})

In [None]:
[default_2]
# Fit MASH mixture model (time estimate: <5min)
depends: R_library("REBayes"), sos_variable('mash_input'), sos_variable('mash_vhat')
output: f"{_input:n}.mash_model.rds"
R: expand = "${ }", workdir = cwd, env = {'MOSEKLM_LICENSE_FILE': str(mosek_license)}
    dat = readRDS(${mash_input:r})
    vhat = readRDS(${mash_vhat:r})
    mash_data = mashr::mash_set_data(as.matrix(dat$random.b), Shat=as.matrix(dat$random.s), alpha=${alpha}, V=${"vhat" if vhat else "diag(nrow(vhat))"})
    saveRDS(mashr::mash(mash_data, Ulist = readRDS(${_input:r})$Ulist, outputlevel = 1), ${_output:r})

In [None]:
[default_3]
# Posterior inference on the "top" set of gene-snp pairs (time estimate: <3min)
depends: sos_variable('mash_input'), sos_variable('mash_vhat')
output: f"{_input:n}.posterior.rds"
R: expand = "${ }", workdir = cwd
    dat = readRDS(${mash_input:r})
    vhat = readRDS(${mash_vhat:r})
    mash_data = mashr::mash_set_data(as.matrix(dat$strong.b), Shat=as.matrix(dat$strong.s), alpha=${alpha}, V=${"vhat" if vhat else "diag(nrow(vhat))"})
    saveRDS(mashr::mash_compute_posterior_matrices(readRDS(${_input:r}), mash_data), ${_output:r})

In [1]:
%sessioninfo

SoS Version: 0.9.14.2
