---
title: "Converting FastQTL results to mash format"
author: "Sarah Urbut, Gao Wang, Peter Carbonetto and Matthew Stephens"
site: workflowr::wflow_site
output:
  workflowr::wflow_html:
    toc: false
---
## Overview
We provide code to convert association statistics in
[FastQTL][fastqtl] format, or a format similar to FastQTL, to a format
that is more suited for analysis with mash. This code was used to
generate `MatrixEQTLSumStats.Portable.Z.rds` in the
[git repository][gtexresults] from the SNP-gene association statistics
included as part of Release 6 of the [GTEx Project][gtex] (the source
file was named `GTEx_Analysis_V6_all-snp-gene-associations.tar`).
Here we give instructions for using this code, and demonstrate how to
convert a toy FastQTL data set. This toy data set is included in the
[git repository][gtexresults].
To facilitate running our conversion procedure, we have also developed
a [Docker container][docker-hdf5tools] that includes all the required
software components, notably the HDF5 libraries used to create
intermediate data files that can be efficiently queried. Docker can
run on most popular operating systems (Mac, Windows and Linux) and
cloud computing services such as Amazon Web Services and Microsoft
Azure. If you have not used Docker before, you might want to read
[this][docker-overview] to learn the basic concepts and understand the
main benefits of Docker.
For details on how the Docker image was configured, see
`hdf5tools.dockerfile` in the `workflows` directory of the
[git repository][gtexresults]. The Docker image used for our analyses is
based on [gaow/lab-base][docker-lab-base], a customized Docker image
for development with R and Python.
If you find a bug in any of these steps, please post an
[issue][issues].
## Download and install Docker
Download [Docker][docker-download] (note that a free
[community edition][docker-ce] of Docker is available), and install it
following the instructions provided on the Docker website. Once you
have installed Docker, check that Docker is working correctly by
following Part 1 of the ["Getting Started" guide][docker-getting-started].
If you are new to Docker, we recommend reading the entire "Getting
Started" guide.
**Note:** Setting up Docker requires that you have administrator
access to your computer. [Singularity][singularity] is an
alternative that accepts Docker images and does not require
administrator access.
## Download and test Docker image
Run this `alias` command in the shell. The alias will be used below to
run commands inside the Docker container:
```bash
alias fastqtl2mash-docker='docker run --security-opt label:disable -t '\
'-P -h MASH -w $PWD -v $HOME:/home/$USER -v /tmp:/tmp -v $PWD:$PWD '\
'-u $UID:${GROUPS[0]} -e HOME=/home/$USER -e USER=$USER gaow/hdf5tools'
```
The `-v` flags in this command map directories between the standard
computing environment and the Docker container. Since the analyses
below will write files to these directories, it is important to ensure
that:
+ Environment variables `$HOME` and `$PWD` are set to valid and
writeable directories (usually your home and current working
directories, respectively).
+ `/tmp` is also a valid and writeable directory.
If any of these conditions does not hold, please adjust the `alias`
accordingly. The remaining options only affect operation of the
container, and so should function the same regardless of your operating
system.
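The conditions above can be checked from the shell before defining the alias. The following is a minimal sketch, not part of the gtexresults repository; the helper name `check_writable_dir` is hypothetical, and the three directories are the ones mounted by the `-v` flags.

```bash
# Hypothetical helper (not part of the gtexresults repository): report
# whether a path is an existing, writeable directory.
check_writable_dir() {
  if [ -d "$1" ] && [ -w "$1" ]; then
    echo "ok: $1"
  else
    echo "not writable: $1"
    return 1
  fi
}

# Check the three directories that the -v flags in the alias mount
# into the container. "|| true" keeps the loop going after a failure.
for dir in "$HOME" "$PWD" /tmp; do
  check_writable_dir "$dir" || true
done
```

Any directory reported as `not writable` should be replaced in the corresponding `-v` option of the alias.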
Next, run a simple command in the Docker container to check that it has
loaded successfully:
```bash
fastqtl2mash-docker uname -sn
```
This command will download the Docker image if it has not already been
downloaded.
If the container ran successfully, you should see this information
about the Docker container printed to the screen:
```
Linux MASH
```
You can also run these commands to show the information about the
image downloaded to your computer and the container that has run
(and exited):
```bash
docker image list
docker container list --all
```
*Note:* If you get the error "Cannot connect to the Docker daemon. Is the
docker daemon running on this host?" on Linux or macOS, see
[here for Linux][docker-daemon-linux]
or [here for Mac][docker-daemon-mac] for
suggestions on how to resolve this issue.
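To tell a missing Docker installation apart from a daemon that is not running, a quick check along the following lines may help. This is a sketch only; the function name `docker_status` is hypothetical, and it relies on `docker info` exiting with a non-zero status when the daemon cannot be reached.

```bash
# Hypothetical helper: classify the state of the local Docker setup.
docker_status() {
  if ! command -v docker >/dev/null 2>&1; then
    # The docker client is not on the PATH.
    echo "docker not installed"
  elif docker info >/dev/null 2>&1; then
    # The client could query the daemon.
    echo "docker daemon reachable"
  else
    # The client exists, but the daemon did not respond (or access
    # to it was denied).
    echo "docker daemon not reachable"
  fi
}

docker_status
```

If the daemon is reported as not reachable, the links above describe the usual causes.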
## Clone or download the gtexresults repository
Clone or download the [gtexresults][gtexresults] repository to your
computer, then change your working directory in the shell to the root
of the repository, e.g.,
```bash
cd gtexresults
```
All the commands below will be run from this directory.
## Convert eQTL summary statistics
Next, use the `fastqtl_to_mash.ipynb` code in the `workflows`
directory to convert the toy data set in FastQTL format to the mash
format. The toy data are stored in the `data/fastqtl` subdirectory of
the git repository.
Having followed the above steps to set up the Docker container on your
computer, the data conversion can be carried out with the following
command:
```bash
fastqtl2mash-docker sos run workflows/fastqtl_to_mash.ipynb \
--data_list data/fastqtl/FastQTLSumStats.list \
--gene_list data/fastqtl/GTEx_genes.txt
```
If successful, this command will write several files to a newly
created directory, `fastqtl_to_mash_output`. One file,
`FastQTLSumStats.mash.rds`, contains the eQTL summary statistics in an
RDS file, which is easily loaded into R; see `help(readRDS)` in R for
details. For more information about the contents of this file, and
how they can be provided as input to the mash methods using the
`set_mash_data` function, see the documentation inside the
[fastqtl2mash notebook][fastqtl2mash-notebook] and the vignettes in
the [mashr package][mashr].
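As a quick sanity check after the conversion, one might confirm that the output file exists and look at its top-level structure. The sketch below assumes it is run from the root of the repository and that `Rscript` is available on your computer (it is skipped otherwise); the helper name `check_output_file` is hypothetical.

```bash
# Path taken from the text above, relative to the repository root.
out_file="fastqtl_to_mash_output/FastQTLSumStats.mash.rds"

# Hypothetical helper: succeed only if the file exists and is non-empty.
check_output_file() {
  if [ -s "$1" ]; then
    echo "found: $1"
  else
    echo "missing or empty: $1"
    return 1
  fi
}

# If the file is present and R is installed, print the top-level
# structure of the object stored in the RDS file.
if check_output_file "$out_file" && command -v Rscript >/dev/null 2>&1; then
  Rscript -e 'str(readRDS(commandArgs(trailingOnly = TRUE)[1]), max.level = 1)' "$out_file"
fi
```

This prints only the names and dimensions of the stored elements; it does not interpret them.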
## Additional usage notes
+ All containers that have run and exited will still be retained in
the Docker system. Run `docker container list --all` to list all
previously run containers. To clear these containers, run
`docker container prune`. See [here][docker-prune] for more
information.
+ The conversion procedure has several options which were not
illustrated in the example above. View the `fastqtl_to_mash.ipynb`
file in Jupyter, or in your Web browser
[here][fastqtl2mash-notebook], for more details about the available
options, specifications of the input files, and other usage information.
+ Converting the full GTEx data set is computationally intensive and
is best done in a high-performance computing environment with
configurations to run the workflow across different compute nodes. See
[here][sos-cluster-info] for details.
+ Results labeled "test" in the outputted RDS file have been more
appropriately relabeled as "strong". So if you use any of the existing
R code on your data, you may need to rename some of the variables;
e.g., `test.z` in the previous output is now `strong.z`.
+ Run the following command to update the Docker image: `docker pull
gaow/hdf5tools`
[gtex]: http://gtexportal.org
[gtexresults]: https://github.com/stephenslab/gtexresults
[issues]: https://github.com/stephenslab/gtexresults/issues
[fastqtl]: http://fastqtl.sourceforge.net
[mashr]: https://github.com/stephenslab/mashr
[docker-lab-base]: https://hub.docker.com/r/gaow/lab-base
[docker-overview]: https://docs.docker.com/engine/docker-overview
[docker-download]: https://docs.docker.com/install
[docker-ce]: https://www.docker.com/community-edition
[docker-getting-started]: https://docs.docker.com/get-started
[docker-hdf5tools]: https://hub.docker.com/r/gaow/hdf5tools
[singularity]: https://singularity.lbl.gov/docs-docker
[docker-prune]: https://stackoverflow.com/questions/17014263/should-i-be-concerned-about-excess-non-running-docker-containers
[docker-daemon-linux]: https://askubuntu.com/questions/477551/how-can-i-use-docker-without-sudo
[docker-daemon-mac]: https://github.com/wodby/docker4drupal/issues/15
[sos-cluster-info]: https://vatlab.github.io/sos-docs/doc/documentation/Remote_Execution.html
[fastqtl2mash-notebook]: https://github.com/stephenslab/gtexresults/blob/master/workflows/fastqtl_to_mash.ipynb