analysis/gtex.Rmd

---
title: "mash analysis of GTEx data"
author: "Sarah Urbut, Gao Wang, Peter Carbonetto and Matthew Stephens"
site: workflowr::wflow_site
output:
  workflowr::wflow_html:
    toc: false
---

## Overview

To reproduce the results of Urbut, Wang & Stephens (2017), please
follow these instructions. You are welcome to adapt these steps to
your own study. Please also visit the
[mashr R package repository][mashr], which has a more user-friendly
interface and tutorials on how to apply multivariate adaptive
shrinkage (mash) to association analysis gene expression (eQTL
analysis).

The complete analyses of the GTEx data require installation of several
programs and libraries, as well as large data sets that are
specifically prepared for mash. To facilitate reproducing our results,
we provide data that was pre-processed using the
[fastqtl2mash preprocessing pipeline](fastqtl2mash.html). We have also
developed a [Docker container][docker-mash-paper] that includes all
software components necessary to run the analyses. Docker can run on
most popular operating systems (Mac, Windows and Linux) and cloud
computing services such as Amazon Web Services and Microsoft Azure. If
you have not used Docker before, you might want to read
[this][docker-overview] to learn the basic concepts and understand the
main benefits of Docker.

For details on how the Docker image was configured, see
`mash.dockerfile` in the `workflows` directory of the
[git repository][gtexresults]. The Docker image used for our analyses
is based on [gaow/lab-base][docker-lab-base], a customized Docker
image for development with R and Python.

If you find a bug in any of these steps, please post an
[issue][issues].

## Download and install Docker

Download [Docker][docker-download] (note that a free
[community edition][docker-ce] of Docker is available), and install it
following the instructions provided on the Docker website. Once you
have installed Docker, check that Docker is working correctly by
following Part 1 of the ["Getting Started" guide][docker-getting-started].
If you are new to Docker, we recommend reading the entire "Getting
Started" guide.

**Note:** Setting up Docker requires that you have administrator
access to your computer. [Singularity][singularity] is an
alternative that accepts Docker images and does not require
administrator access.

## Download and test Docker image

Run this `alias` command in the shell, which will be used below to run
commands inside the Docker container:

```bash
alias mash-docker='docker run --security-opt label:disable -t '\
'-P -h MASH -w $PWD -v $HOME:/home/$USER -v /tmp:/tmp -v $PWD:$PWD '\
'-u $UID:${GROUPS[0]} -e HOME=/home/$USER -e USER=$USER gaow/mash-paper'
```

The `-v` flags in this command map directories between the standard
computing environment and the Docker container. Since the analyses
below will write files to these directories, it is important to ensure
that:

  + Environment variables `$HOME` and `$PWD` are set to valid and
    writeable directories (usually your home and current working
    directories, respectively).

  + `/tmp` should also be a valid and writeable directory.

If any of these statements are not true, please adjust the `alias`
accordingly. The remaining options only affect operation of the
container, and so should function the same regardless of your operating
system.

Next, run a simple command in the Docker container to check that has
loaded successfully:

```
mash-docker uname -sn
```

This command will download the Docker image if it has not already been
downloaded.

If the container was successfully run, you should see this information
about the Docker container outputted to the screen:

```
Linux MASH
```

You can also run these commands to show the information about the
image downloaded to your computer and the container that has run
(and exited):

```bash
docker image list
docker container list --all
```

*Note:* If you get error "Cannot connect to the Docker daemon. Is the
docker daemon running on this host?" in Linux or macOS, see
[here for Linux][docker-daemon-linux]
or [here for Mac][docker-daemon-mac] for
suggestions on how to resolve this issue.

## Clone or download the gtexresults repository

Clone or download the [gtexresults][gtexresults] repository to your
computer, then change your working directory in the shell to the root
of the repository, e.g.,

```bash
cd gtexresults
```

All the commands below will be run from this directory.

## Fit mash model and compute posterior statistics

Assuming your working directory is the root of the git repository (you
can check this by running `pwd`), run all the steps of the analysis
with this command:

```bash
mash-docker sos run workflows/gtex6_mash_analysis.ipynb
```

This command will take several hours to run—see below for more
information on the individual steps. All outputs generated by this
command will be saved to folder `output` inside the repository.

Note that you may recognize file `gtex6_mash_analysis.ipynb` as a
Jupyter notebook. Indeed, you may open this notebook in Jupyter.
However, you should not step through the code sequentially as you
would in a typical Jupyter notebook; this is because the code in this
notebook is meant to be run using the
[Script of Scripts (SoS)](https://github.com/vatlab/SoS) framework.

This command will execute the following steps of the analysis:

+ Compute a sparse factorization of the (centered) *z*-scores using the
  [SFA software](http://stephenslab.uchicago.edu/software.html#sfa),
  with K = 5 factors, and save the factors in an `.rds` file. This
  will be used to construct the mixture-of-multivariate normals
  prior. This step is labeled `sfa`, and should only take a few
  minutes to run.

+ Compute additional "data-driven" prior matrices by computing a
  singular value decomposition of the (centered) *z*-scores and
  low-rank approximations to the empirical covariance matrices. Most
  of the work in this step involves running the Extreme Deconvolution
  method. The outcome of running the Extreme Deconvolution method is
  saved to a new `.rds` file. This step is labeled `mash-paper_1` and
  may take several hours to run (in one run on a MacBook Pro with
  a 3.5 GHz Intel Core i7, it took over 6 hours to complete).

+ Compute a final collection of "canonical" and single-rank prior
  matrices based on SFA and the "BMAlite" models of Flutre *et al*
  (2013). These matrices are again written to another `.rds`
  file. This step is labeled `mash-paper_2`, and should take at most a
  minute to run.

+ The `mash-paper_3` step fits the mash model to the GTEx data (the
  centered *z*-scores); the model parameters estimated in this fitting
  step are the weights of the multivariate normal mixture. The outputs
  from this step are the estimated mixture weights and the conditional
  likelihood matrix. These two outputs are saved to two separate
  `.rds` files.  This step is expected to take at most a few hours to
  complete.

+ The `mash-paper_4` step computes posterior statistics using the
  fitted mash model from the previous step. These posterior statistics
  are summarized and visualized in subsequent analysis steps. The
  posterior statistics are saved to another `.rds` file. This step is
  expected to take a few hours to complete.

**Note:** All containers that have run and exited will still be
retained in the Docker system. Run `docker container list --all` to
list all previous run containers. To clear these previously run
containers, run `docker container prune`. See
[here](https://stackoverflow.com/questions/17014263/should-i-be-concerned-about-excess-non-running-docker-containers)
for more information.

## Generate figures and tables summarizing results of mash analysis

Once you have completed the mash analysis pipeline, the next step is
to examine and interpret the results. We provide R code implementing
this step; you can either view the webpages listed below, or view the
R Markdown source files in the `analysis` directory of the
[gtexresults][gtexresults] repository.

If you were unable to complete the mash analysis pipeline, we have
provided the outputs needed to generate the figures and tables
below. (If you were able to successfully complete the mash analysis,
then these files will be overwritten by your outputs.) For
convenience, the results needed to generate the figures and tables
have been saved in the `output` folder.

Before running the R code in the pages listed below, you will need to
first install several R packages that are used in the code:

```R
install.packages(c("colorRamps","rmeta"))
```

Note that at the bottom of each webpage, we have recorded information
about the exact version of R and the R Packages that were used. This
might be useful if you would like to replicate our computing setup as
closely as possible. Also note that the webpages were generated from
the R Markdown files using [workflowr][workflowr].

Visit the following links to view the R code we used to generate
summarizies of the GTEx results incorporated into the manuscript:

+ [Primary correlation patterns identified by mash in GTEx data](Uk3.html).

+ Correlation patterns from other components with larger weights:
  [second](Uk2.html), [fourth](Uk4.html), [fifth](Uk5.html) and
  [eighth](Uk8.html) covariance components.

+ [Examples illustrating how mash uses patterns of sharing to inform
effect estimates](GTExExamples.html).

+ [Overall sharing among top eQTLs](HeterogeneityTables.html).

+ [Pairwise sharing by magnitude of eQTLs among tissues](SharingMag.html).

+ [Pairwise sharing by sign of eQTLs among tissues](SharingSign.html).

+ [Comparison of genes with tissue-specific eQTLs against other genes](ExpressionAnalysis.html).

+ [Tissue effective sample sizes](SampleSize.html).

+ [Number of tissue-specific eQTLs in each tissue](Tspecific.html).

+ [Distribution of eQTL sharing across tissues](SharingHist.html).

## Additional usage notes

+ All containers that have run and exited will still be retained in
the Docker system. Run `docker container list --all` to list all
previous run containers. To clear these previously run containers, run
`docker container prune`. See [here][docker-prune] for more
information.

+ See the Jupyter notebook to get more details; how the notebook
should be interpreted.

+ In the `data` folder, we have provided a file
[MatrixEQTLSumStats.Portable.Z.rds](data/MatrixEQTLSumStats.Portable.Z.rds)
containing eQTL summary statistics from the GTEx study, in a format
suitable for running mash. This was generated from the original eQTL
summary statistics downloaded from the GTEx Portal website, then
converted using the code in `fastqtl_to_mash.ipynb`. See
[here](fastqtl2mash.html) for details on this step.

+ The `output` folder contains some additional results files that will
not be generated directly from the mash analysis pipeline (described
above); these are files that were generated by running the mash
analysis on only brain tissues, and on all GTEx tissues excluding
brain tissues. In order to generate results similar to these, the mash
analysis pipeline has an option to specific subsets of tissues. These
analyses on tissue subsets were only used to verify some of the
results of the main mash analysis run on all the tissues jointly.

+ Run the following command to update the Docker image: `docker pull
gaow/mash-paper`

[gtexresults]: https://github.com/stephenslab/gtexresults
[mashr]: https://github.com/stephenslab/mashr
[issues]: https://github.com/stephenslab/gtexresults/issues
[workflowr]: https://github.com/jdblischak/workflowr
[docker-overview]: https://docs.docker.com/engine/docker-overview
[docker-getting-started]: https://docs.docker.com/get-started
[docker-download]: https://docs.docker.com/install
[docker-ce]: https://www.docker.com/community-edition
[docker-lab-base]: https://hub.docker.com/r/gaow/lab-base
[docker-mash-paper]: https://hub.docker.com/r/gaow/mash-paper
[singularity]: https://singularity.lbl.gov/docs-docker
[docker-daemon-linux]: https://askubuntu.com/questions/477551/how-can-i-use-docker-without-sudo
[docker-daemon-mac]: https://github.com/wodby/docker4drupal/issues/15 encounter fact