---
title: "MASH analysis of GTEx data"
author: "Sarah Urbut, Gao Wang, Peter Carbonetto and Matthew Stephens"
site: workflowr::wflow_site
output:
  workflowr::wflow_html:
    toc: false
---

If you find a bug, please post an
[issue](https://github.com/stephenslab/gtexresults/issues).


## Overview

To reproduce the results of Urbut, Wang & Stephens (2017), please
follow these instructions.

For more information, please see the
[README](https://github.com/stephenslab/gtexresults).

The complete analyses require the installation of several programs and
libraries, as well as several large data sets. To facilitate
reproducing our results, we provide pre-processed data for use with
the core analysis, and a bioinformatics pipeline with a small toy
data set to demonstrate the pre-processing step. We have also developed a
[Docker container](https://hub.docker.com/r/gaow/mash-paper) that
includes all software components necessary to run the analyses. Docker
can run on most popular operating systems (Mac, Windows and Linux),
as well as on cloud computing services such as Amazon Web Services and
Microsoft Azure. If you have not used Docker before, you might want to
read the [Docker overview](https://docs.docker.com/engine/docker-overview)
to learn the basic concepts and understand the benefits of Docker.

For details on how the Docker image was configured, see the
[Dockerfile](workflows/Dockerfile). The Docker image used for our
analyses is based on
[gaow/lab-base](https://hub.docker.com/r/gaow/lab-base), a Docker
image for development with R and Python.

If you prefer to run the analyses without Docker, you will need to set
up the computing environment yourself. The software and libraries used
are Python 3.x, R, SFA, ExtremeDeconvolution, MOSEK, OpenMP, MKL, GSL,
HDF5 tools, pytables and rhdf5; the mashr package, an improved
implementation of MASH, is also installed in the Docker image. The
[Dockerfile](workflows/Dockerfile) shows how these components were
installed.

## 1. Download and install Docker

Download [Docker](https://docs.docker.com/install) (note that a free
[community edition](https://www.docker.com/community-edition) of
Docker is available), and install it following the instructions
provided on the Docker website. Once you have installed Docker, check
that Docker is working correctly by following
[Part 1 of the Getting Started guide](https://docs.docker.com/get-started).
If you are using Docker for the first time, we recommend reading the
entire Getting Started guide. *Note that setting up Docker requires
that you have administrator access to your computer.*
([Singularity](https://singularity.lbl.gov/docs-docker) is an
alternative that accepts Docker images and does not require
administrator access.)

## 2. Download and test Docker image

Run this `alias` command in the shell. It defines `mash-docker`, which
is used below to run commands inside the Docker container:

```bash
alias mash-docker='docker run --security-opt label:disable -t -P -h MASH '\
'-w $PWD -v $HOME:/home/$USER -v /tmp:/tmp -v $PWD:$PWD '\
'-u $UID:${GROUPS[0]} -e HOME=/home/$USER -e USER=$USER gaow/mash-paper'
```

The `-v` flags in this command map directories between the standard
computing environment and the Docker container. Since the analyses
below will write files to these directories, it is important to ensure
that:

  + Environment variables `$HOME` and `$PWD` are set to valid and
    writeable directories (usually your home and current working
    directories, respectively).

  + `/tmp` should also be a valid and writeable directory.

If any of these conditions does not hold, please adjust the `alias`
accordingly. The remaining options only affect operation of the
container, and so should function the same regardless of your operating
system.
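
The conditions above can be checked directly in the shell. The following is only a minimal sketch: the helper name `check_writable` is hypothetical and is not part of this repository.

```bash
# Hypothetical helper (not part of the repository): report whether each
# directory given as an argument exists and is writeable.
check_writable() {
  rc=0
  for dir in "$@"; do
    if [ -d "$dir" ] && [ -w "$dir" ]; then
      echo "ok: $dir"
    else
      echo "not writeable: $dir"
      rc=1
    fi
  done
  return "$rc"
}

# Check the three directories mapped by the -v flags in the alias.
check_writable "$HOME" "$PWD" /tmp || echo "Adjust the alias before continuing."
```

If any directory is reported as not writeable, change the corresponding `-v` mapping in the `alias` to a directory that is.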

Next, run a simple command in the Docker container to check that it has
loaded successfully:

```
mash-docker uname -sn
```

This command will download the Docker image if it has not already been
downloaded.

If the container ran successfully, you should see this information
about the Docker container printed to the screen:

```
Linux MASH
```

You can also run these commands to show the information about the
image downloaded to your computer and the container that has run
(and exited):

```bash
docker image list
docker container list --all
```

*Note:* If you get the error "Cannot connect to the Docker daemon. Is the
docker daemon running on this host?" on Linux or macOS, see
[here for Linux](https://askubuntu.com/questions/477551/how-can-i-use-docker-without-sudo)
or [here for Mac](https://github.com/wodby/docker4drupal/issues/15) for
suggestions on how to resolve this issue.
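
When diagnosing this error, it can help to tell apart a missing Docker installation from a daemon that is installed but not reachable. The sketch below is one way to do that; the helper name `check_docker` is hypothetical, and it assumes that `docker info` exits with a non-zero status when it cannot contact the daemon.

```bash
# Hypothetical helper (not part of the repository): distinguish
# "docker is not installed" from "the daemon is not reachable".
check_docker() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker command not found"
    return 1
  fi
  # Assumption: "docker info" fails when it cannot contact the daemon.
  if ! docker info >/dev/null 2>&1; then
    echo "docker is installed, but the daemon is not reachable"
    return 2
  fi
  echo "docker daemon is reachable"
}

check_docker || true
```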

## 3. Clone or download this repository

Clone or download the `gtexresults` repository to your computer, then
change your working directory in the shell to the root of the
repository, e.g.,

```bash
cd gtexresults
```

After doing this, running `ls -1` should show the top-level contents
of this repository:

```
LICENSE
README.md
TODO.txt
analysis
data
docs
output
workflows
```

All commands below will be run from this directory.
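
Because every later command assumes this working directory, a quick check can save a failed run. This is a minimal sketch based only on the folder names listed above; the helper name `in_repo_root` is hypothetical.

```bash
# Hypothetical helper (not part of the repository): succeed only if the
# given directory (default: the current one) contains the top-level
# folders listed above.
in_repo_root() {
  root="${1:-.}"
  for name in analysis data docs output workflows; do
    if [ ! -d "$root/$name" ]; then
      echo "missing folder: $name"
      return 1
    fi
  done
  echo "looks like the repository root"
}

in_repo_root || echo "Change to the gtexresults directory first."
```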

## 4. Fit MASH model and compute posterior statistics

Assuming your working directory is the root of the git repository (you
can check by running `pwd`), run all the steps of the analysis with
this command:

```bash
mash-docker sos run workflows/gtex6_mash_analysis.ipynb
```

This command will take several hours to run; see below for more
information on the individual steps. All outputs generated by this
command will be saved to the `output` folder inside the
repository.

Note that you may recognize this file as a Jupyter notebook. Indeed,
you may open this notebook in Jupyter. However, you should not step
through the code sequentially as you would in a typical Jupyter
notebook; this is because the code in this notebook is meant to be run
using the [Script of Scripts (SoS)](https://github.com/vatlab/SoS)
framework.

This command will execute the following steps of the analysis:

+ Compute a sparse factorization of the (centered) z-scores using the
  [SFA software](http://stephenslab.uchicago.edu/software.html#sfa),
  with K = 5 factors, and save the factors in an `.rds` file. This
  will be used to construct the mixture-of-multivariate normals
  prior. This step is labeled `sfa`, and should only take a few
  minutes to run.

+ Compute additional "data-driven" prior matrices by computing a
  singular value decomposition of the (centered) z-scores and low-rank
  approximations to the empirical covariance matrices. Most of the
  work in this step involves running the Extreme Deconvolution
  method. The outcome of running the Extreme Deconvolution method is
  saved to a new `.rds` file. This step is labeled `mash-paper_1` and
  may take several hours to run (in one run on a MacBook Pro with
  a 3.5 GHz Intel Core i7, it took over 6 hours to complete).

+ Compute a final collection of "canonical" and single-rank prior matrices
  based on SFA and the "BMAlite" models of Flutre *et al.*
  (2013). These matrices are written to another `.rds` file. This
  step is labeled `mash-paper_2`, and should take at most a minute to
  run.

+ The `mash-paper_3` step fits the MASH ("multivariate adaptive
  shrinkage") model to the GTEx data (the centered z-scores); the
  model parameters estimated in this fitting step are the weights of
  the multivariate normal mixture. The outputs from this step are the
  estimated mixture weights and the conditional likelihood
  matrix. These two outputs are saved to two separate `.rds` files.
  This step is expected to take at most a few hours to complete.

+ The `mash-paper_4` step computes posterior statistics using the
  fitted MASH model from the previous step. These posterior statistics
  are summarized and visualized in subsequent analyses. The posterior
  statistics are saved to another `.rds` file. This step is expected
  to take a few hours to complete.
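
Since each step above saves its results to `.rds` files, listing those files is one way to see how far a run has progressed. This is a minimal sketch: the helper name `list_rds` is hypothetical, and it assumes only that the files end in `.rds` somewhere under the `output` folder. The exact file names are not given here, so none are assumed.

```bash
# Hypothetical helper (not part of the repository): print every .rds
# file under a directory (default: output), one path per line.
list_rds() {
  dir="${1:-output}"
  find "$dir" -type f -name '*.rds' | sort
}

# Run from the root of the repository once the analysis has started.
if [ -d output ]; then
  list_rds output
fi
```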

Finally, note that all containers that have run and exited will still
be retained in the Docker system. Run `docker container list --all` to
list all previously run containers. To clear these previously run
containers, run `docker container prune`. See
[here](https://stackoverflow.com/questions/17014263/should-i-be-concerned-about-excess-non-running-docker-containers)
for more information.

## 5. Generate figures and tables

Install some packages from CRAN:

```R
# Add commands here to install packages.
```

**For convenience, the results needed to generate the figures and
tables have been saved in the `output` folder.**

**FIXME: update figure plotting instructions**

The input data needed to run this analysis are all available under
inputs. The analysis may take some time to run. We have provided the
outputs of running mash in `Data_vhat`.

This repo is organized so that you can run MASH using the GTEx data
contained in **Inputs** to produce the parameters and posteriors from
mashr.

The directory **Plots_for_Paper_vmat** contains `.Rmd` files that plot
the figures from the paper, using our results, which are provided in
**Results_Data**.

Figure 3: [Summary of primary patterns identified by mash in GTEx
data](Fig.Uk3.html)

Figure 4: [Examples illustrating how mash uses patterns of sharing to inform effect estimates in the GTEx data.](Fig.GTExExamples.html)

Figure 5: [Histogram of Sharing](Fig.SharingHist.pdf)

Figure 6: [Pairwise sharing by magnitude of eQTL among tissues](Fig.SharingMag.html)

Supplementary Figure 1: [Sample sizes and effective sample sizes from mash analysis across tissues](Fig.SampleSize.html)

Supplementary Figure 2: There are 4 figures here:

[Summary of covariance matrices Uk with largest estimated weight (> 1%) in GTEx data: Uk2](Fig.Uk2.pdf)

[Summary of covariance matrices Uk with largest estimated weight (> 1%) in GTEx data: Uk4](Fig.Uk4.html)

[Summary of covariance matrices Uk with largest estimated weight (> 1%) in GTEx data: Uk5](Fig.Uk5.html)

[Summary of covariance matrices Uk with largest estimated weight (> 1%) in GTEx data: Uk8](Fig.Uk8.html)

Supplementary Figure 3: Illustration of how linkage disequilibrium can impact effect estimates: [table](TwoSNP.html) and [figure](TwoSNPPlot.pdf)

Supplementary Figure 4: [Pairwise Sharing By Sign](Fig.SharingSign.html)

Supplementary Figure 5: [Number of “tissue-specific eQTLs” in each tissue.](Fig.Tspecific.html)

Supplementary Figure 6: [Expression levels in genes with “tissue-specific eQTLs” are similar to those in other genes](Fig.ExpressionAnalysis.html)

Table 1: Heterogeneity Analysis [Simulation](Table.hettablesim.html) and [Data](Table.HeterogeneityTables.html).

## More detailed usage notes

Above we have given the minimal instructions necessary to reproduce
the results of Urbut *et al* (2017). Here are some additional details
about the analyses.

*TO DO: Things that will go here:*

+ Explain how to get a summary of the possible analysis steps that can
  be run.

+ See the Jupyter notebook to get more details; how the notebook
  should be interpreted.

+ Explain how to run the analysis using the improved (faster)
  implementation of the [mashr R
  package](https://github.com/stephenslab/mashr).

+ *MOSEK?*

+ In the [data](data) folder, we have provided a file
  [MatrixEQTLSumStats.Portable.Z.rds](data/MatrixEQTLSumStats.Portable.Z.rds)
  containing eQTL summary statistics in a format convenient for
  running MASH. This was generated from the original eQTL summary
  statistics downloaded from the GTEx Portal website, then converted
  using the code in `fastqtl_to_mash.ipynb`. See below for details on
  this step.

## Developer notes

Run the following command to update the Docker image:

```bash
docker pull gaow/mash-paper
```