# MASH analysis of GTEx data, Urbut 2017

This notebook performs the analysis of GTEx data for Urbut et al 2017, using (by default) a prototype version of the MASH implementation that produced the results discussed in the manuscript.

Note that:

1. The input data have been converted from eQTL summary statistics to MASH format using a separate workflow (`fastqtl_to_mash.ipynb`).
2. In this notebook we did not apply the inference to all gene-snp pairs. Rather, we focused on the "top" gene-snp pairs as a demonstration. It should be straightforward to configure the posterior computation step to work on all gene-snp pairs instead, but we recommend using the `mashr` workflow, which is a lot faster (see below).
3. We have improved the MASH implementation since the paper was submitted. See section "Analysis with `mashr` package" for the `mashr` workflow, and section "Run this notebook" for how to trigger this new workflow.

## Run this notebook
For repeated runs it might be easier to execute from the command line instead of in Jupyter:

```bash
sos run workflows/gtex6_mash_analysis.ipynb # --data ... --cwd ...
```

By default the notebook runs all the analysis steps. To view all available analyses:

```bash
sos run workflows/gtex6_mash_analysis.ipynb -h
```

Additionally, I ran it on the dataset after LD pruning (for the LD-related discussion in the supplemental information):

```bash
sos run workflows/gtex6_mash_analysis.ipynb --data data/MatrixEQTLSumStats.Portable.ld2.Z.rds
```

The output of this notebook can be found under the `./gtex6_workflow_output` folder (configurable), and consists of the following files:

```
gtex6_mash_analysis.html
MatrixEQTLSumStats.Portable.Z.sfa.rds.log
MatrixEQTLSumStats.Portable.Z.sfa.rds
MatrixEQTLSumStats.Portable.Z.coved.K3.P3.rds
MatrixEQTLSumStats.Portable.Z.coved.K3.P3.lite.single.expanded.rds
pis.remove_before_rerun.pdf
MatrixEQTLSumStats.Portable.Z.coved.K3.P3.lite.single.expanded.pihat.rds
MatrixEQTLSumStats.Portable.Z.coved.K3.P3.lite.single.expanded.loglik.rds
MatrixEQTLSumStats.Portable.Z.coved.K3.P3.lite.single.expanded.posterior.rds
```

We keep track of results from every MASH step, though the inference of interest should be found in the `*.posterior.rds` file generated at the end of the pipeline.
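
The file names in the listing follow from the `output:` statements of the workflow steps further down. As a rough illustration, the sketch below rebuilds them with Python's `pathlib`. The helper `strip_ext` is hypothetical: it only mimics how the SoS `:n` and `:nn` format modifiers appear to drop one or two trailing extensions, and it is not part of SoS itself.

```python
from pathlib import PurePosixPath

def strip_ext(name, times=1):
    """Drop the last `times` extensions (hypothetical stand-in for SoS `:n` / `:nn`)."""
    p = PurePosixPath(name)
    for _ in range(times):
        p = p.with_suffix("")
    return str(p)

data = "data/MatrixEQTLSumStats.Portable.Z.rds"
K, P = 3, 3

# Base name without directory and without the final `.rds` (like `{data:bn}`)
base = strip_ext(PurePosixPath(data).name)

sfa = f"{base}.sfa.rds"
coved = f"{base}.coved.K{K}.P{P}.rds"
expanded = f"{strip_ext(coved)}.lite.single.expanded.rds"
pihat = f"{strip_ext(expanded)}.pihat.rds"
loglik = f"{strip_ext(expanded)}.loglik.rds"
# The posterior step strips two extensions (`.pihat.rds`) from its first input
posterior = f"{strip_ext(pihat, 2)}.posterior.rds"

for name in (sfa, coved, expanded, pihat, loglik, posterior):
    print(name)
```

Changing `--data` therefore changes the prefix of every output file, which is why the LD-pruned run above does not overwrite the default results.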

### A faster implementation via package `mashr`

See section "Analysis with `mashr` package" and the `mashr` workflow for details. To use this workflow, add `--algorithm mashr` to the run commands above.

If you run from the docker image `gaow/mash-paper`, you need to place a copy of the [MOSEK license file](https://www.mosek.com/products/academic-licenses) at `<workdir>/mosek.lic` (ie, `gtex6_workflow_output/mosek.lic` if you did not change any settings below).

### The prototype (default) MASH implementation

Some notes about the code used for the `mash` workflow analysis (some available from the `mash` workflow, some in the downloaded `mash_scripts_download` code):

1. `ms=deconvolution.em.with.bovy(t.stat,factor.mat,v.j,lambda.mat,K=3,P=3)`

   produces an object with the denoised matrices for feeding into the *mash* covariance code. The *factor.mat* and *lambda.mat* called within have been produced by SFA and are single-rank factors and loadings approximating the empirical covariance.

2. `covmat=compute.hm.covmat.all.max.step(b.hat=z.stat,se.hat=v.j,t.stat=z.stat,Q=5,lambda.mat,A=A,factor.mat,max.step=max.step,zero=TRUE)$covmat`

   produces a list of covariance matrices entitled *covmat"A".rds* upon which to base the mixture of multivariate normals.

3. `compute.hm.train.log.lik(train.b = train.z,se.train = train.v,covmat = covmat,A,pen=TRUE)`

   computes the HM weights on the training data: it uses the set of randomly chosen genes to train our model and produces a matrix of likelihoods and corresponding hierarchical weights, as well as the mixture proportions.

4. `weightedquants=lapply(seq(1:nrow(z.stat)),function(j){total.quant.per.snp(j,covmat,b.gp.hat=z.stat,se.gp.hat = v.j,pis,A,checkpoint = FALSE)})`

   produces files containing the posterior means, upper and lower tail probabilities, null probabilities, and lfsr for all J genes across 44 conditions.

In [1]:
[global]
parameter: cwd = path('./gtex6_workflow_output')
parameter: data = path("data/MatrixEQTLSumStats.Portable.Z.rds")
# path configured to /opt folder outside $HOME, to make it easier to use with `docker`
parameter: mash_src = file_target("/opt/mash-paper/main.R")
parameter: sfa_exe = file_target("/opt/sfa/bin/sfa_linux")
sfa_data = file_target(f"{cwd:a}/{data:bn}.sfa.rds")
# parameters for mashr analysis
parameter: algorithm = 'paper'
parameter: empirical_cov = 0
parameter: vhat = 1
mosek_license = file_target(f"{cwd:a}/mosek.lic")

## Covariance pattern discovery
This obtains the covariance matrices, ie, the priors, for the `mash` model.

### SFA
We analyze the data with SFA. The cell below downloads the SFA software and runs it on the data with rank `K = 5`.

In [2]:
[sfa_download: provides = sfa_exe]
# Download / install SFA (no need if running from docker `gaow/mash-paper`)
download: decompress = True, dest_dir = f'{sfa_exe:ad}'
    http://stephenslab.uchicago.edu/assets/software/sfa/sfa1.0.tar.gz

[sfa: provides = sfa_data]
# Perform SFA analysis (time estimate: <1min)
depends: sfa_exe
K = 5
tmpfile = path(f"{cwd:a}/{data:bn}.max.txt")
input: f"{data:a}"
output: sfa_data
R: expand = "${ }", stdout = f"{_output}.log", workdir = cwd
    z = readRDS(${_input:r})$test.z
    write.table(z, ${tmpfile:r}, col.names=F,row.names=F)
    cmd = paste0('${sfa_exe} -gen ${tmpfile} -g ', dim(z)[1], ' -n ', dim(z)[2], 
                 ' -k ${K} -iter 50 -rand 999 -o ${_output:bn}')
    system(cmd)
    saveRDS(list(F = read.table("${_output:n}_F.out"),
                lambda = read.table("${_output:n}_lambda.out"),
                sigma2 = read.table("${_output:n}_sigma2.out"),
                alpha = read.table("${_output:n}_alpha.out")), ${_output:r})
bash: workdir = cwd, expand = '${ }'
    rm -f *{_F.out,_lambda.out,_sigma2.out,_alpha.out} ${tmpfile}
    rm -r output

###  Create and refine multi-rank covariance matrices
Here we create 3 covariance matrices:

* SFA (rank 5, previously computed)
* SVD (rank 3, to be computed)
* Empirical covariance

and apply [Extreme Deconvolution](https://github.com/jobovy/extreme-deconvolution) to refine the matrices. We observed that Extreme Deconvolution preserves rank.
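
To make these ingredients concrete, here is a minimal pure-Python sketch of two building blocks: a rank-one covariance matrix formed from a single factor, and the empirical covariance of column-centered z-scores (the same `t(D.center) %*% D.center / n` form used in the `mashr` step later in this notebook). The function names and the toy matrix are illustrative assumptions, not part of the MASH code.

```python
def rank_one_cov(f):
    """Outer product f f^T: a rank-one covariance built from one factor."""
    return [[a * b for b in f] for a in f]

def empirical_cov(Z):
    """Column-center Z (rows = gene-snp pairs, columns = conditions), then Z^T Z / n."""
    n, r = len(Z), len(Z[0])
    means = [sum(row[j] for row in Z) / n for j in range(r)]
    centered = [[row[j] - means[j] for j in range(r)] for row in Z]
    return [[sum(c[a] * c[b] for c in centered) / n for b in range(r)]
            for a in range(r)]

# Toy z-score matrix: 3 gene-snp pairs, 2 conditions (made-up numbers)
Z = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
print(rank_one_cov([1.0, 2.0]))
print(empirical_cov(Z))
```

In the toy matrix the second column is exactly twice the first, so the empirical covariance is itself rank one; real data would give a full-rank matrix that the low-rank SFA and SVD approximations summarize.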

In [3]:
[mash_scripts_download: provides = mash_src]
# Download / install MASH scripts (no need if running from docker `gaow/mash-paper`)
output: mash_src
download: decompress = True, dest_dir = cwd
    https://github.com/stephenslab/mashr-paper/archive/v0.2-1.zip
bash: expand = True, workdir = cwd
    mkdir -p {mash_src:ad}
    cp mashr-paper-0.2-1/R/* {mash_src:ad} && rm -rf mashr-paper-0.2-1

[mash-paper_1: shared = {'mash_input': '_input'}]
# Compute data-driven prior matrices (time estimate: ~30min)
depends: R_library("ExtremeDeconvolution"), mash_src, sfa_data
K = 3
P = 3
input: f"{data:a}"
output: f"{cwd:a}/{data:bn}.coved.K{K}.P{P}.rds"
R: expand = "${ }", workdir = cwd
    setwd(${mash_src:dar})
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd:ar})
    dat = readRDS(${_input:r})
    t.stat = dat$test.z
    mean.mat = matrix(rep(0,ncol(t.stat)*nrow(t.stat)),ncol=ncol(t.stat),nrow=nrow(t.stat))
    s.j = matrix(rep(1,ncol(t.stat)*nrow(t.stat)),ncol=ncol(t.stat),nrow=nrow(t.stat))
    v.mat = dat$vhat
    v.j=list()
    for(i in 1:nrow(t.stat)){v.j[[i]]=v.mat}
    K = ${K}
    P = ${P}
    R = ncol(t.stat)
    sfa = readRDS(${sfa_data:r})
    init.cov = init.covmat(t.stat=t.stat, factor.mat = as.matrix(sfa$F),lambda.mat = as.matrix(sfa$lambda), K=K,P=P)
    init.cov.list = list()
    for(i in 1:K){init.cov.list[[i]]=init.cov[i,,]}
    projection = list();for(l in 1:nrow(t.stat)){projection[[l]]=diag(1,R)}
    e = ExtremeDeconvolution::extreme_deconvolution(ydata=t.stat, ycovar=v.j,
                                                      xamp=rep(1/K,K), xmean=mean.mat, xcovar=init.cov.list,
                                                      fixmean=T, projection=projection, logfile=${_output:nr})
    true.covs = array(dim=c(K,R,R))
    for(i in 1:K){true.covs[i,,]=e$xcovar[[i]]}
    saveRDS(list(true.covs=true.covs,pi=e$xamp), ${_output:r})

### Add in canonical and single-rank covariance matrices

Now we additionally include 2 other types of covariance matrices:
* canonical configurations (aka `bmalite`)
* single rank SFA

We also expand the list of matrices by a grid of scaling factors. At the end of this step (cell below) we are ready to fit the mash model.
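
As a rough picture of what expanding by a grid means, the sketch below scales every covariance matrix by every value in a grid, so that `K` matrices and `L` grid values give `K * L` prior components. The function name `expand_by_grid`, the grid values, and the choice to multiply the matrix entries directly (rather than, say, by the squared grid value) are all assumptions made for illustration; the prototype code defines its own grid.

```python
def expand_by_grid(cov_list, grid):
    """Return every matrix in `cov_list` scaled by every factor in `grid`."""
    return [[[w * x for x in row] for row in U]
            for w in grid
            for U in cov_list]

# Two toy 2x2 "canonical" patterns: independent effects and fully shared effects
identity = [[1.0, 0.0], [0.0, 1.0]]
shared = [[1.0, 1.0], [1.0, 1.0]]

expanded = expand_by_grid([identity, shared], grid=[0.5, 1.0, 2.0])
print(len(expanded))  # 2 matrices x 3 grid values
```

The grid lets the mixture capture effects of different magnitudes that share the same pattern across conditions.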

In [4]:
[mash-paper_2: shared = {'prior_matrices': '_output'}]
# Add in canonical configurations and single rank SFA priors (time estimate: <1min)
depends: sos_variable('mash_input'), sfa_data
output: f"{_input:n}.lite.single.expanded.rds"
R: expand = "${ }", workdir = cwd
    setwd(${mash_src:dar})
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd:ar})
    dat = readRDS(${mash_input:r})
    z.stat = dat$test.z
    rownames(z.stat) = NULL
    colnames(z.stat) = NULL
    v.mat = dat$vhat
    s.j = matrix(rep(1,ncol(z.stat)*nrow(z.stat)),ncol=ncol(z.stat),nrow=nrow(z.stat))
    sfa = readRDS(${sfa_data:r})
    res = compute.hm.covmat.all.max.step(b.hat=z.stat,se.hat=s.j,
                                          t.stat=z.stat,Q=5,
                                          lambda.mat=as.matrix(sfa$lambda),
                                          A='.remove_before_rerun',
                                          factor.mat=as.matrix(sfa$F),
                                          max.step=readRDS(${_input:r}),
                                          zero=TRUE)
    saveRDS(res, ${_output:r})

bash: workdir = cwd
    rm -f *.remove_before_rerun.rds

## Fit MASH mixture model
Using a training set, the cell below computes the weights for the input covariance matrices (priors) in the MASH mixture. The output contains a matrix of log-likelihoods as well as the weights learned from the hierarchical model.
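
The idea of learning mixture weights from a likelihood matrix can be sketched with a plain EM update. This is a simplified illustration only: the step below calls a penalized routine (`pen=1`) and loads `SQUAREM`, and neither the penalty nor any acceleration is reproduced here. The function name and the toy likelihood matrix are assumptions.

```python
def em_mixture_weights(lik, n_iter=200):
    """Plain EM for mixture proportions given a fixed likelihood matrix.

    lik[j][k] is the likelihood of observation j under component k.
    Assumes every row has at least one positive entry.
    """
    n, k = len(lik), len(lik[0])
    pi = [1.0 / k] * k
    for _ in range(n_iter):
        resp_sum = [0.0] * k
        for row in lik:
            weighted = [p * l for p, l in zip(pi, row)]
            total = sum(weighted)
            for c in range(k):
                # E-step: responsibility of component c for this observation
                resp_sum[c] += weighted[c] / total
        # M-step: new weight is the average responsibility
        pi = [s / n for s in resp_sum]
    return pi

# Toy likelihoods: 4 observations, 2 components (made-up numbers)
lik = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.1, 0.6]]
pi_hat = em_mixture_weights(lik)
print(pi_hat)
```

Because the covariance matrices are fixed at this stage, the likelihood matrix is computed once and only the weights are updated, which is what makes this step a comparatively simple optimization.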

In [5]:
[mash-paper_3]
# Fit MASH mixture model (time estimate: ~2.5hr)
depends: sos_variable('mash_input'), R_library("SQUAREM")
output: f"{_input:n}.pihat.rds", f"{_input:n}.loglik.rds"
R: expand = "${ }", workdir = cwd
    library("SQUAREM")
    setwd(${mash_src:dar})
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd:ar})
    dat = readRDS(${mash_input:r})
    v.mat = dat$vhat
    covmat = readRDS(${_input:r})$covmat
    train.z = as.matrix(dat$train.z)
    rownames(train.z) = NULL
    colnames(train.z) = NULL
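    # Unit standard errors (the data are z-scores); avoids the NaN that train.z/train.z gives for zero entries
    train.v = matrix(1, nrow(train.z), ncol(train.z))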
    res = compute.hm.train.log.lik.pen.vmat(train.b=train.z,
                                            covmat=covmat,
                                            A='.remove_before_rerun', 
                                            pen=1,
                                            train.s=train.v,
                                            cormat=v.mat)
    saveRDS(res$pis, ${_output[0]:r})
    saveRDS(res$lik.mat, ${_output[1]:r})

bash: workdir = cwd
    rm -f *.remove_before_rerun.rds

## Posterior inference
Applying hyperparameters learned from the training set to the test set, the cell below computes posterior quantities. 
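
To show what these posterior quantities look like, here is a univariate simplification: a single z-score with unit standard error and a prior that mixes zero-mean normals, with variance 0 standing for the null point mass. MASH itself works with multivariate normals across all conditions, so this is only a sketch of the logic; the function names, weights and variances are made up for illustration.

```python
import math

def normal_pdf(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def posterior_1d(z, pis, variances):
    """Posterior summary for z ~ N(b, 1), b ~ sum_k pis[k] * N(0, variances[k])."""
    # Posterior component weights: prior weight times marginal likelihood N(z; 0, u + 1)
    w = [p * normal_pdf(z, u + 1.0) for p, u in zip(pis, variances)]
    total = sum(w)
    w = [x / total for x in w]
    mean = null = neg = 0.0
    for wk, u in zip(w, variances):
        if u == 0.0:
            null += wk  # point mass at zero
            continue
        shrink = u / (u + 1.0)  # posterior variance, also the shrinkage factor
        m, s = shrink * z, math.sqrt(shrink)
        mean += wk * m
        neg += wk * normal_cdf(-m / s)  # P(b < 0 | z) under this component
    pos = 1.0 - null - neg
    # Local false sign rate: probability of getting the sign wrong
    lfsr = min(neg + null, pos + null)
    return {"mean": mean, "null_prob": null, "lfsr": lfsr}

print(posterior_1d(4.0, pis=[0.5, 0.3, 0.2], variances=[0.0, 1.0, 4.0]))
```

A large z-score pushes the weight away from the null component, so the null probability and lfsr both shrink while the posterior mean moves toward, but stays below, the observed value.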

In [6]:
[mash-paper_4]
# Posterior inference on the "top" set of gene-snp pairs (time estimate: ~3.5hr)
depends: sos_variable('mash_input'), sos_variable('prior_matrices')
output: f"{_input[0]:nn}.posterior.rds"
R: expand = "${ }", workdir = cwd
    setwd(${mash_src:dar})
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd:ar})
    dat = readRDS(${mash_input:r})
    z.stat = dat$test.z
    v.mat = dat$vhat
    s.j = matrix(rep(1,ncol(z.stat)*nrow(z.stat)),ncol=ncol(z.stat),nrow=nrow(z.stat))
    pis = readRDS(${_input[0]:r})$pihat
    covmat = readRDS(${prior_matrices:r})$covmat
    res = lapply(seq_len(nrow(z.stat)), function(j){
        total.quant.per.snp.with.vmat(j=j, covmat=covmat, 
                                      b.gp.hat=z.stat, 
                                      cormat=v.mat, 
                                      se.gp.hat=s.j, 
                                      pis=pis, 
                                      A='remove_before_rerun', 
                                      checkpoint=TRUE)})
    # data formatting.
    out = do.call(Map, c(f = rbind, res))
    saveRDS(out, ${_output:r})

bash: workdir = cwd
    rm -f *.remove_before_rerun.rds

The MASH analysis is now complete. I will use a separate notebook to summarize, plot and visualize the results of the analysis.

## Analysis with `mashr` package

Since the submission of Urbut 2017 we have improved the implementation of the MASH algorithm and released it as an R package, [`mashr`](https://github.com/stephenslab/mashr). Major improvements relevant to Urbut 2017 are:

1. Faster computation of likelihood and posterior matrices via matrix algebra tricks and a C++ implementation.
2. Faster computation of MASH mixture via convex optimization.

Below is the same workflow as the `mash` workflow previously discussed, but implemented with `mashr`.

In [8]:
[mash-fast_1: shared = {'mash_input': '_input'}]
# Compute data-driven / canonical prior matrices (time estimate: <10min)
depends: R_library("ExtremeDeconvolution"), R_library("mashr@stephenslab/mashr"), sfa_data
K = 3 # as in mash paper
P = 3 # as in mash paper
input: f"{data:a}"
output: f"{cwd:a}/{data:bn}.cov{empirical_cov}.K{K}.P{P}.rds"
R: expand = "${ }", workdir = cwd
    dat = readRDS(${_input:r})
    sfa_data = readRDS(${sfa_data:r})
    mash_data = mashr::mash_set_data(as.matrix(dat$test.z), matrix(1, nrow(dat$test.z), ncol(dat$test.z)))
    sfa_res = as.matrix(sfa_data$lambda) %*% as.matrix(sfa_data$F)
    # SFA matrices
    U.sfa = c(mashr::cov_from_factors(as.matrix(sfa_data$F), "sfa${K}"), list("tSFA" = t(sfa_res) %*% sfa_res / nrow(dat$test.z)))
    # SVD matrices
    U.pca = mashr::cov_pca(mash_data, ${P})
    # Empirical data matrices
    # `cov_ed` will take significantly longer when this empirical covariance matrix is added
    D.center = apply(as.matrix(dat$test.z), 2, function(x) x - mean(x))
    # Denoised data-driven matrices
    U.dd = c(U.sfa, U.pca, if (${empirical_cov} == 1) list("XX" = t(D.center) %*% D.center / nrow(dat$test.z)) else list())
    U.ed = mashr::cov_ed(mash_data, U.dd)
    # Canonical matrices
    U.can = mashr::cov_canonical(mash_data)
    saveRDS(list(Ulist = c(U.ed, U.can), DD_raw = U.dd), ${_output:r})

[mash-fast_2]
# Fit MASH mixture model (time estimate: <5min)
depends: R_library("REBayes")
output: f"{_input:n}.V{vhat}.mash_model.rds"
R: expand = "${ }", workdir = cwd, env = {'MOSEKLM_LICENSE_FILE': str(mosek_license)}
    dat = readRDS(${mash_input:r})
    null.z = rbind(dat$train.z, dat$validate.z)
    if (${vhat}) {
        V = cor(null.z[which(apply(abs(null.z),1, max) < 2),])
    } else {
        V = diag(ncol(null.z))
    }
    mash_data = mashr::mash_set_data(as.matrix(null.z), matrix(1, nrow(null.z), ncol(null.z)), V=as.matrix(V))
    saveRDS(mashr::mash(mash_data, Ulist = readRDS(${_input:r})$Ulist, outputlevel = 1), ${_output:r})
  
[mash-fast_3]
# Posterior inference on the "top" set of gene-snp pairs (time estimate: <3min)
output: f"{_input:n}.posterior.rds"
R: expand = "${ }", workdir = cwd
    dat = readRDS(${mash_input:r})
    null.z = rbind(dat$train.z, dat$validate.z)
    if (${vhat}) {
        V = cor(null.z[which(apply(abs(null.z),1, max) < 2),])
    } else {
        V = diag(ncol(null.z))
    }
    mash_data = mashr::mash_set_data(as.matrix(dat$test.z), matrix(1, nrow(dat$test.z), ncol(dat$test.z)), V=as.matrix(V))
    saveRDS(mashr::mash_compute_posterior_matrices(readRDS(${_input:r}), mash_data), ${_output:r})
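
When `vhat` is set, the two steps above estimate the residual correlation `V` from rows of the training and validation z-scores whose largest absolute value is below 2, that is, rows that look null in every condition. A minimal pure-Python sketch of that filter-then-correlate idea follows; the function name and the toy data are assumptions, and it assumes at least two rows pass the filter and no column is constant.

```python
import math

def null_correlation(Z, threshold=2.0):
    """Correlation across conditions, using only rows whose largest |z| is below `threshold`."""
    rows = [r for r in Z if max(abs(x) for x in r) < threshold]
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sums of cross-products; the 1/(n-1) factor cancels in the correlation
    S = [[sum(c[a] * c[b] for c in centered) for b in range(p)] for a in range(p)]
    return [[S[a][b] / math.sqrt(S[a][a] * S[b][b]) for b in range(p)]
            for a in range(p)]

# Toy z-scores: the last row has |z| >= 2 and is excluded (made-up numbers)
Z = [[1.0, 1.0], [-1.0, -1.0], [1.0, -0.5], [-1.0, 0.5], [3.0, 3.0]]
V = null_correlation(Z)
print(V)
```

Restricting to apparently null rows is meant to keep genuine shared effects from inflating the estimated correlation between conditions.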

## Export and run default pipeline

In [7]:
[export]
# Export notebook to HTML file
input: [item for item in paths(sys.argv) if item.suffix == '.ipynb'], group_by = 1
output: [(f'{cwd:a}/{item:bn}.full.html', f'{cwd:a}/{item:bn}.lite.html') for item in paths(sys.argv) if item.suffix == '.ipynb'], group_by = 2
bash: expand = True, stderr = False
  sos convert {_input} {_output[0]}
  sos convert {_input} {_output[1]} --template sos-report
  
[default]
# Run all analysis in this notebook
if algorithm == 'paper':
    sos_run('export+mash-paper')
else:
    fail_if(not mosek_license.is_file(), msg = f'Please put a valid copy (NOT a symbolic link!) of MOSEK license to: \n``{mosek_license}``')
    sos_run('export+mash-fast')

In [9]:
%sessioninfo

SoS Version: 0.9.13.3
