# Multivariate adaptive shrinkage (MASH) analysis of GTEx data

This notebook contains code used to generate results for the Urbut *et al* (2017) manuscript.

Important notes:

1. Although you may open this notebook in Jupyter, you should not step through the code sequentially as you would in a typical Jupyter notebook; this is because the code in this notebook is meant to be run using the [Script of Scripts (SoS)](https://github.com/vatlab/SoS)
framework.

2. This notebook is meant to reproduce Urbut 2017. Although you can provide your own data set with the `--data` option and perform a MASH analysis, we recommend using [a more recent implementation of MASH](https://github.com/stephenslab/mashr) for your own analysis; it is much faster due to improvements in both the algorithm and the code.

3. In this notebook, we did not apply the inference to all gene-SNP pairs. Rather, we focused on the "top" gene-SNP pairs as a demonstration. It should be straightforward to configure the posterior computation step to work on all gene-SNP pairs instead.

## Run MASH as implemented in Urbut 2017
For repeated runs it might be easier to execute from the command line instead of in Jupyter:

```bash
sos run workflows/gtex6_mash_analysis.ipynb # --data ... --cwd ...
```

By default, the notebook runs all the analysis steps. To view all available analyses:

```bash
sos run workflows/gtex6_mash_analysis.ipynb -h
```

Additionally, I ran it on the data set after LD pruning (for the LD-related discussion in the supplementary information):

```bash
sos run workflows/gtex6_mash_analysis.ipynb --data data/MatrixEQTLSumStats.Portable.ld2.Z.rds
```

The results of this notebook are written to the `./output` folder (configurable with the `--cwd` option), which should contain the following files:

```
gtex6_mash_analysis.html
MatrixEQTLSumStats.Portable.Z.sfa.rds.log
MatrixEQTLSumStats.Portable.Z.sfa.rds
MatrixEQTLSumStats.Portable.Z.coved.K3.P3.rds
MatrixEQTLSumStats.Portable.Z.coved.K3.P3.lite.single.expanded.rds
MatrixEQTLSumStats.Portable.Z.coved.K3.P3.lite.single.expanded.pihat.rds
MatrixEQTLSumStats.Portable.Z.coved.K3.P3.lite.single.expanded.loglik.rds
MatrixEQTLSumStats.Portable.Z.coved.K3.P3.lite.single.expanded.posterior.rds
```

We keep track of results from every MASH step, though the inference of interest should be found in the `*.posterior.rds` file generated at the end of the pipeline.

### Analysis Steps

The pipeline automatically executes the following:

+ Compute a sparse factorization of the (centered) z-scores using the
  [SFA software](http://stephenslab.uchicago.edu/software.html#sfa),
  with K = 5 factors, and save the factors in an `.rds` file. This
  will be used to construct the mixture-of-multivariate normals
  prior. This step is labeled `sfa`, and should only take a few
  minutes to run.

+ Compute additional "data-driven" prior matrices by computing a
  singular value decomposition of the (centered) z-scores and low-rank
  approximations to the empirical covariance matrices. Most of the
  work in this step involves running the Extreme Deconvolution
  method. The outcome of running the Extreme Deconvolution method is
  saved to a new `.rds` file. This step is labeled `mash-paper_1` and
  may take several hours to run (in one run on a MacBook Pro with
  a 3.5 GHz Intel Core i7, it took over 6 hours to complete).

+ Compute a final collection of "canonical" and single-rank prior
  matrices based on SFA and the "BMAlite" models of Flutre *et al*
  (2013). These matrices are written to another `.rds` file. This
  step is labeled `mash-paper_2`, and should take at most a minute to
  run.

+ The `mash-paper_3` step fits the MASH ("multivariate adaptive
  shrinkage") model to the GTEx data (the centered z-scores); the
  model parameters estimated in this fitting step are the weights of
  the multivariate normal mixture. The outputs from this step are the
  estimated mixture weights and the conditional likelihood
  matrix. These two outputs are saved to two separate `.rds` files.
  This step is expected to take at most a few hours to complete.

+ The `mash-paper_4` step computes posterior statistics using the
  fitted MASH model from the previous step. These posterior statistics
  are summarized and visualized in subsequent analyses. The posterior
  statistics are saved to another `.rds` file. This step is expected
  to take a few hours to complete.

### The prototype (default) MASH implementation: some details

Some notes about the code used for the `mash` workflow analysis (some available from the `mash` workflow, some in the code downloaded by the `mash_scripts_download` step):

1. `deconvolution.em.with.bovy(...)` produces an object with the denoised matrices for feeding into the *mash* covariance code. The *factor.mat* and *lambda.mat* called within have been produced by SFA and are single-rank factors and loadings approximating the empirical covariance.

2. `compute.hm.covmat.all.max.step(...)$covmat` produces a list of covariance matrices named *covmat"A".rds* upon which to base the mixture of multivariate normals.

3. `compute.hm.train.log.lik(...)` uses the set of randomly chosen genes to train the model: it computes the hierarchical model (HM) weights on the training data and produces a matrix of likelihoods and the corresponding hierarchical weights, as well as the mixture proportions.

4. `total.quant.per.snp(...)` produces files containing the posterior means, upper and lower tail probabilities, null probabilities, and lfsr for all J genes across 44 conditions.

In [None]:
[global]
parameter: cwd = path('./output')
parameter: data = path("data/MatrixEQTLSumStats.Portable.Z.rds")
# path configured to /opt folder outside $HOME, to make it easier to use with `docker`
parameter: mash_src = file_target("/opt/mash-paper/main.R")
parameter: sfa_exe = file_target("/opt/sfa/bin/sfa_linux")
parameter: vhat = 1
sfa_data = file_target(f"{cwd:a}/{data:bn}.sfa.rds")

## Covariance pattern discovery
This step obtains the covariance matrices, i.e., the priors, for the `mash` model.

### SFA
We analyze the data with SFA. The cell below downloads the SFA software and runs it on the data with rank `K = 5`.

In [None]:
[sfa_download: provides = sfa_exe]
# Download / install SFA (no need if running from docker `gaow/mash-paper`)
download: decompress = True, dest_dir = f'{sfa_exe:ad}'
    http://stephenslab.uchicago.edu/assets/software/sfa/sfa1.0.tar.gz

[sfa: provides = sfa_data]
# Perform SFA analysis (time estimate: <1min)
depends: sfa_exe
K = 5
tmpfile = path(f"{cwd:a}/{data:bn}.max.txt")
input: f"{data:a}"
output: sfa_data
R: expand = "${ }", stdout = f"{_output:n}.log", workdir = cwd
    z = readRDS(${_input:r})$strong.z
    write.table(z, ${tmpfile:r}, col.names=F,row.names=F)
    cmd = paste0('${sfa_exe} -gen ${tmpfile} -g ', dim(z)[1], ' -n ', dim(z)[2], 
                 ' -k ${K} -iter 50 -rand 999 -o ${_output:bn}')
    system(cmd)
    saveRDS(list(F = read.table("${_output:n}_F.out"),
                lambda = read.table("${_output:n}_lambda.out"),
                sigma2 = read.table("${_output:n}_sigma2.out"),
                alpha = read.table("${_output:n}_alpha.out")), ${_output:r})
bash: workdir = cwd, expand = '${ }'
    rm -f *{_F.out,_lambda.out,_sigma2.out,_alpha.out} ${tmpfile}
    rm -r output

### Create and refine multi-rank covariance matrices
Here we create 3 covariance matrices:

* SFA (rank 5, previously computed)
* SVD (rank 3, to be computed)
* Empirical covariance

and apply [Extreme Deconvolution](https://github.com/jobovy/extreme-deconvolution) to refine the matrices. We observed that Extreme Deconvolution preserves rank.

In [None]:
[mash_scripts_download: provides = mash_src]
# Download / install MASH scripts (no need if running from docker `gaow/mash-paper`)
output: mash_src
download: decompress = True, dest_dir = cwd
    https://github.com/stephenslab/mashr-paper/archive/v0.2-1.zip
bash: expand = True, workdir = cwd
    mkdir -p {mash_src:ad}
    cp mashr-paper-0.2-1/R/* {mash_src:ad} && rm -rf mashr-paper-0.2-1

[mash-paper_1: shared = {'mash_input': '_input'}]
# Compute data-driven prior matrices 
# (time estimate: 40min to 4hrs depending on the machine power)
depends: R_library("mashr"), mash_src, sfa_data
K = 3
P = 3
input: f"{data:a}"
output: f"{cwd:a}/{data:bn}.coved.K{K}.P{P}.rds"
R: expand = "${ }", workdir = cwd
    library(mashr) # `mashr` package offers function `extreme_deconvolution`
    script.source = sapply(list.files(${mash_src:dar}, pattern = "*.R", full.names=TRUE), source, .GlobalEnv)
    z.stat = readRDS(${_input:r})$strong.z
    sfa = readRDS(${sfa_data:r})
    s.j = matrix(rep(1,ncol(z.stat)*nrow(z.stat)),ncol=ncol(z.stat),nrow=nrow(z.stat))
    res = deconvolution.em.with.bovy(z.stat, 
                                      as.matrix(sfa$F), 
                                      s.j, 
                                      as.matrix(sfa$lambda),
                                      K = ${K}, P = ${P})
    saveRDS(res, ${_output:r})

### Add in canonical and single-rank covariance matrices

We additionally include two other types of covariance matrices:
* canonical configurations (aka `bmalite`)
* single rank SFA

We also expand the list of matrices over a grid of scaling factors. At the end of this step (cell below) we are ready to fit the mash model.
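
As a rough illustration of what "canonical configurations" and grid expansion mean, the Python sketch below builds an identity matrix, one singleton matrix per condition, and an all-ones (equal effects) matrix, then scales each of them by every value in a grid. It is not the `compute.hm.covmat.all.max.step` code: the function names `canonical_covariances` and `expand_by_grid`, the particular set of canonical matrices, and the convention of multiplying each matrix by the grid value directly are all assumptions made for illustration.

```python
def canonical_covariances(n_conditions):
    """Build simple 'canonical' covariance matrices (illustrative set only)."""
    R = n_conditions
    mats = {}
    # Independent effects of equal size in every condition.
    mats["identity"] = [[1.0 if i == j else 0.0 for j in range(R)] for i in range(R)]
    # Effect present in exactly one condition r.
    for r in range(R):
        mats[f"singleton_{r}"] = [
            [1.0 if i == j == r else 0.0 for j in range(R)] for i in range(R)
        ]
    # Identical effect in all conditions.
    mats["equal_effects"] = [[1.0] * R for _ in range(R)]
    return mats


def expand_by_grid(mats, grid):
    """Scale every matrix by every grid value (assumed convention: w * U)."""
    expanded = {}
    for name, U in mats.items():
        for w in grid:
            expanded[(name, w)] = [[w * x for x in row] for row in U]
    return expanded


if __name__ == "__main__":
    priors = expand_by_grid(canonical_covariances(3), grid=[0.5, 1.0, 2.0])
    print(len(priors), "scaled prior matrices")
```

With R conditions this toy set contains R + 2 matrices, so the expanded list has (R + 2) times the grid length entries; the data-driven matrices from the previous step would be expanded in the same way.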

In [None]:
[mash-paper_2: shared = {'prior_matrices': '_output'}]
# Add in canonical configurations and single rank SFA priors (time estimate: <1min)
depends: sos_variable('mash_input'), sfa_data
output: f"{_input:n}.lite.single.expanded.rds"
R: expand = "${ }", workdir = cwd
    script.source = sapply(list.files(${mash_src:dar}, pattern = "*.R", full.names=TRUE), source, .GlobalEnv)
    z.stat = readRDS(${mash_input:r})$strong.z
    rownames(z.stat) = NULL
    colnames(z.stat) = NULL
    s.j = matrix(rep(1,ncol(z.stat)*nrow(z.stat)),ncol=ncol(z.stat),nrow=nrow(z.stat))
    sfa = readRDS(${sfa_data:r})
    res = compute.hm.covmat.all.max.step(b.hat=z.stat,se.hat=s.j,
                                          t.stat=z.stat,Q=5,
                                          lambda.mat=as.matrix(sfa$lambda),
                                          A='.${_input:bn}.remove_before_rerun',
                                          factor.mat=as.matrix(sfa$F),
                                          max.step=readRDS(${_input:r}),
                                          zero=TRUE)
    saveRDS(res$covmat, ${_output:r})

bash: workdir = cwd, expand = True
    rm -f *.{_input:bn}.remove_before_rerun.*

## Fit MASH mixture model
Using the random SNP set, the cell below computes the weights of the input covariance matrices (priors) in the MASH mixture. The output contains a matrix of log-likelihoods as well as the weights learned from the hierarchical model.
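
The weight estimation can be pictured as an EM iteration on a precomputed likelihood matrix, with one row per SNP and one column per prior covariance matrix. The Python sketch below is a minimal, unaccelerated version under that assumption. The workflow itself calls `compute.hm.train.log.lik.pen.vmat` with `pen=TRUE` and loads SQUAREM; the penalty and any acceleration are omitted here, and `em_mixture_weights` is a hypothetical name.

```python
def em_mixture_weights(lik, n_iter=200):
    """Estimate mixture weights from a J x K matrix of component likelihoods.

    lik[j][k] is the likelihood of observation j under component k.
    Assumes every row has at least one strictly positive entry.
    """
    J = len(lik)
    K = len(lik[0])
    pi = [1.0 / K] * K  # start from uniform weights
    for _ in range(n_iter):
        counts = [0.0] * K
        for row in lik:
            # E-step: posterior responsibility of each component for this row.
            weighted = [p * l for p, l in zip(pi, row)]
            total = sum(weighted)
            for k in range(K):
                counts[k] += weighted[k] / total
        # M-step: new weights are the average responsibilities.
        pi = [c / J for c in counts]
    return pi


if __name__ == "__main__":
    toy_lik = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]
    print(em_mixture_weights(toy_lik))
```

Because only the weights are updated, the likelihood matrix needs to be computed once, which is consistent with the workflow saving it alongside the estimated weights.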

In [None]:
[mash-paper_3]
# Fit MASH mixture model (time estimate: ~2.5hr)
depends: sos_variable('mash_input'), R_library("SQUAREM")
output: f"{_input:n}.V{vhat}.pihat.rds", f"{_input:n}.V{vhat}.loglik.rds"
R: expand = "${ }", workdir = cwd
    library("SQUAREM")
    script.source = sapply(list.files(${mash_src:dar}, pattern = "*.R", full.names=TRUE), source, .GlobalEnv)
    dat = readRDS(${mash_input:r})
    covmat = readRDS(${_input:r})
    random.z = as.matrix(dat$random.z)
    rownames(random.z) = NULL
    colnames(random.z) = NULL
    random.v = matrix(rep(1,ncol(random.z)*nrow(random.z)),ncol=ncol(random.z),nrow=nrow(random.z))
    res = compute.hm.train.log.lik.pen.vmat(train.b=random.z,
                                            covmat=covmat,
                                            cormat=${"dat$vhat" if vhat else "diag(nrow(dat$vhat))"},
                                            A='.${_output[0]:bnn}.remove_before_rerun', 
                                            pen=TRUE,
                                            train.s=random.v)
    saveRDS(res$pis, ${_output[0]:r})
    saveRDS(res$lik.mat, ${_output[1]:r})

bash: workdir = cwd, expand = True
    rm -f *.{_output[0]:bnn}.remove_before_rerun.*

## Posterior inference
Applying hyperparameters learned from the random set to the top set (strongest cis SNP of eQTL association), the cell below computes posterior quantities. 
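
To make the posterior quantities concrete, the Python sketch below works through a deliberately simplified case: a single condition, a z-score with unit standard error, and a prior that is a mixture of zero-mean normals in which a variance of 0 stands for the null component. The actual step is multivariate across conditions and is carried out by `total.quant.per.snp.with.vmat`; the function `posterior_summary` and its arguments are hypothetical and only illustrate how posterior means, tail probabilities, the null probability and the lfsr relate to one another.

```python
import math


def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def posterior_summary(z, pis, prior_vars):
    """Posterior summary for one z-score under a mixture-of-normals prior.

    Model (one condition only): z | b ~ N(b, 1), b ~ sum_k pis[k] * N(0, prior_vars[k]).
    """
    # Posterior component weights: prior weight times marginal likelihood N(z; 0, u + 1).
    w = [p * normal_pdf(z, u + 1.0) for p, u in zip(pis, prior_vars)]
    total = sum(w)
    w = [x / total for x in w]

    mean = null = neg = pos = 0.0
    for wk, u in zip(w, prior_vars):
        if u == 0.0:
            null += wk  # point mass at zero contributes only to the null probability
            continue
        shrink = u / (u + 1.0)      # posterior variance, and shrinkage factor on z
        mu = shrink * z             # component posterior mean
        p_neg = normal_cdf(-mu / math.sqrt(shrink))
        mean += wk * mu
        neg += wk * p_neg
        pos += wk * (1.0 - p_neg)

    # Local false sign rate: probability of getting the sign wrong, counting zero as wrong.
    lfsr = null + min(pos, neg)
    return {"mean": mean, "null": null, "neg": neg, "pos": pos, "lfsr": lfsr}


if __name__ == "__main__":
    print(posterior_summary(2.0, pis=[0.5, 0.5], prior_vars=[0.0, 1.0]))
```

In this toy setting the null, negative and positive probabilities sum to one, and the lfsr is never smaller than the null probability.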

In [None]:
[mash-paper_4]
# Posterior inference on the "top" set of gene-snp pairs 
# (time estimate: ~5hr on single thread)
depends: sos_variable('mash_input'), sos_variable('prior_matrices')
output: f"{_input[0]:nn}.posterior.rds"
R: expand = "${ }", workdir = cwd
    script.source = sapply(list.files(${mash_src:dar}, pattern = "*.R", full.names=TRUE), source, .GlobalEnv)
    dat = readRDS(${mash_input:r})
    z.stat = dat$strong.z
    s.j = matrix(rep(1,ncol(z.stat)*nrow(z.stat)),ncol=ncol(z.stat),nrow=nrow(z.stat))
    pis = readRDS(${_input[0]:r})$pihat
    covmat = readRDS(${prior_matrices:r})
    res = lapply(seq_len(nrow(z.stat)), function(j){
          total.quant.per.snp.with.vmat(j=j, 
                                      covmat=covmat, 
                                      b.gp.hat=z.stat,
                                      se.gp.hat=s.j, 
                                      cormat=${"dat$vhat" if vhat else "diag(nrow(dat$vhat))"},
                                      pis=pis, 
                                      A='', 
                                      checkpoint=TRUE)})
    # data formatting.
    out = do.call(Map, c(f = rbind, res))
    saveRDS(out, ${_output:r})

The MASH analysis is now complete. I use separate vignettes to summarize, plot, and visualize the results of the analysis.

## Export and run default pipeline

In [None]:
[export]
# Export notebook to HTML file
input: [item for item in paths(sys.argv) if item.suffix == '.ipynb'], group_by = 1
output: [(f'{cwd:a}/{item:bn}.full.html', f'{cwd:a}/{item:bn}.lite.html') for item in paths(sys.argv) if item.suffix == '.ipynb'], group_by = 2
bash: expand = True, stderr = False
  sos convert {_input} {_output[0]}
  sos convert {_input} {_output[1]} --template sos-report-toc
  
[default]
# Run all analysis in this notebook
sos_run('export')
sos_run('mash-paper')

In [1]:
%sessioninfo

SoS Version: 0.9.14.2
