fix: major rewrite of SRA download workflow (#157)
Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>
mkatsanto and uniqueg committed Feb 3, 2024
1 parent ede615b commit 6db184c
Showing 26 changed files with 485 additions and 188 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
@@ -97,6 +97,8 @@ jobs:
- name: Run htsinfer test script
run: bash tests/test_htsinfer_workflow/test.local.sh

- name: Run SRA downloads workflow
run: bash tests/test_sra_download_with_singularity/test.local.sh

integration-singularity-tempflag:
needs:
41 changes: 22 additions & 19 deletions README.md
@@ -269,24 +269,20 @@ your run.
# Sample downloads from SRA

An independent Snakemake workflow `workflow/rules/sra_download.smk` is included
for the download of SRA samples with [sra-tools].
for the download of sequencing libraries from the Sequence Read Archive and
conversion into FASTQ.

> Note: as of Snakemake 7.3.1, only profile conda is supported.
> Singularity fails because the *sra-tools* Docker container only has `sh`
but `bash` is required.

> Note: The workflow uses the implicit temporary directory
from snakemake, which is called with [resources.tmpdir].

The workflow expects the following config:
* `samples`, a sample table (tsv) with column *sample* containing *SRR* identifiers,
see example [here](tests/input_files/sra_samples.tsv).
The workflow expects the following parameters in the configuration file:
* `samples`, a sample table (tsv) with column *sample* containing *SRR*
identifiers (ERR and DRR are also supported), see
[example](tests/input_files/sra_samples.tsv).
* `outdir`, an output directory
* `samples_out`, a pointer to a modified sample table with location of fastq files
* `samples_out`, a pointer to a modified sample table with the locations of
the corresponding FASTQ files
* `cluster_log_dir`, the cluster log directory.
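
Collected in one place, a configuration for this workflow might look as
follows (values are illustrative; the same keys can equally be passed on the
command line via `--config`, as in the example below):

```yaml
# Illustrative configuration for workflow/rules/sra_download.smk
samples: "tests/input_files/sra_samples.tsv"  # TSV with a 'sample' column of SRR/ERR/DRR IDs
outdir: "results/sra_downloads"               # output directory
samples_out: "results/sra_downloads/sra_samples.out.tsv"  # sample table with FASTQ locations
log_dir: "logs"                               # workflow log directory
cluster_log_dir: "logs/cluster_log"           # cluster log directory
```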

For executing the example one can use the following
(with activated *zarp* environment):
For executing the example with Conda environments, one can use the following
command (from within the activated `zarp` Conda environment):

```bash
snakemake --snakefile="workflow/rules/sra_download.smk" \
@@ -297,11 +297,19 @@ snakemake --snakefile="workflow/rules/sra_download.smk" \
log_dir="logs" \
cluster_log_dir="logs/cluster_log"
```
After successful execution, `results/sra_downloads/sra_samples.out.tsv` should contain:

Alternatively, change the argument to `--profile` from `local-conda` to
`local-singularity` to execute the workflow steps within Singularity
containers.

After successful execution, `results/sra_downloads/sra_samples.out.tsv` should
contain:

```tsv
sample fq1 fq2
SRR18552868 results/sra_downloads/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/SRR18549672/SRR18549672_2.fastq.gz
sample fq1 fq2
SRR18552868 results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142 results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz
```
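
A quick sanity check on such a table can be sketched in plain shell: the
hypothetical helper `count_layouts` below counts single- vs paired-end entries
by whether the `fq2` column is filled (tab-separated columns `sample`, `fq1`,
`fq2` with a header line are assumed, as in the table above):

```bash
#!/usr/bin/env bash
# Count single-end vs paired-end samples in a ZARP-style sample table.
count_layouts() {
    awk -F'\t' 'NR > 1 {
        if ($3 == "") single++; else paired++
    } END {
        printf "single-end: %d\npaired-end: %d\n", single, paired
    }' "$1"
}

# Example with a table like the one shown above (paths shortened):
printf 'sample\tfq1\tfq2\nSRR18552868\ta.fastq.gz\t\nSRR18549672\tb_1.fastq.gz\tb_2.fastq.gz\n' \
    > /tmp/sra_samples.tsv
count_layouts /tmp/sra_samples.tsv   # prints: single-end: 1 / paired-end: 1
```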


@@ -355,5 +359,4 @@ After successful execution - if all parameters could be either inferred or were
[slurm]: <https://slurm.schedmd.com/documentation.html>
[zavolan-lab]: <https://www.biozentrum.unibas.ch/research/researchgroups/overview/unit/zavolan/research-group-mihaela-zavolan/>
[pipeline-documentation]: pipeline_documentation.md
[sra-tools]: <https://github.com/ncbi/sra-tools>
[resources.tmpdir]: <https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?#standard-resources>
4 changes: 2 additions & 2 deletions config/config.yaml
@@ -1,6 +1,6 @@
---
# Required fields
samples: "tests/input_files/samples.tsv"
samples: "samples.tsv"
output_dir: "results"
log_dir: "logs"
cluster_log_dir: "logs/cluster"
@@ -15,4 +15,4 @@
report_url: "https://zavolan.biozentrum.unibas.ch/"
author_name: "NA"
author_email: "NA"
...
...
2 changes: 1 addition & 1 deletion install/environment.yml
@@ -4,5 +4,5 @@ channels:
- bioconda
dependencies:
- python>=3.10, <3.12
- snakemake
- snakemake<8
- graphviz
65 changes: 65 additions & 0 deletions pipeline_documentation.md
@@ -47,6 +47,16 @@ on installation and usage please see [here](README.md).
- [`map_genome_star`](#map_genome_star)
- [`quantification_salmon`](#quantification_salmon)
- [`genome_quantification_kallisto`](#genome_quantification_kallisto)
- [Description of SRA download workflow steps](#description-of-sra-download-workflow-steps)
  - [SRA Sequencing mode-independent](#sra-sequencing-mode-independent)
- [`get_layout`](#get_layout)
- [`prefetch`](#prefetch)
- [`add_fq_file_path`](#add_fq_file_path)
- [SRA Sequencing mode-specific](#sra-sequencing-mode-specific)
- [`fasterq_dump`](#fasterq_dump)
  - [`compress_fastq`](#compress_fastq)
- [`process_fastq`](#process_fastq)


## Third-party software used

@@ -59,13 +69,16 @@ on installation and usage please see [here](README.md).
| **bedtools** | [GPLv2][license-gpl2] | _"[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF"_ | [code][code-bedtools] / [manual][code-bedtools] |
| **cutadapt** | [MIT][license-mit] | _"[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads"_ | [code][code-cutadapt] / [manual][docs-cutadapt] / [publication][pub-cutadapt] |
| **gffread** | [MIT][license-mit] | _"[...] validate, filter, convert and perform various other operations on GFF files"_ | [code][code-gffread] / [manual][docs-gffread] |
| **Entrez Direct** | [custom][license-entrez-direct] | _"[...] an advanced method for accessing the NCBI's set of interconnected databases from a UNIX terminal window"_ | [code][code-entrez-direct] / [manual][docs-entrez-direct] / [publication][pub-entrez-direct] |
| **FastQC** | [GPLv3][license-gpl3] | _"A quality control analysis tool for high throughput sequencing data"_ | [code][code-fastqc] / [manual][docs-fastqc] |
| **ImageMagick** | [custom][license-imagemagick]^ | _"[...] create, edit, compose, or convert bitmap images"_ | [code][code-imagemagick] / [manual][docs-imagemagick] |
| **kallisto** | [BSD-2][license-bsd2] | _"[...] program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads"_ | [code][code-kallisto] / [manual][docs-kallisto] / [publication][pub-kallisto] |
| **MultiQC** | [GPLv3][license-gpl3] | _"Aggregate results from bioinformatics analyses across many samples into a single report"_ | [code][code-multiqc] / [manual][docs-multiqc] / [publication][pub-multiqc] |
| **pigz** | [custom][license-pigz] | _"[...] parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data"_ | [code][code-pigz] / [manual][docs-pigz] |
| **RSeqC** | [GPLv3][license-gpl3] | _"[...] comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure."_ | [code][code-rseqc] / [manual][docs-rseqc] / [publication][pub-rseqc] |
| **Salmon** | [GPLv3][license-gpl3] | _"Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment"_ | [code][code-salmon] / [manual][docs-salmon] / [publication][pub-salmon] |
| **SAMtools** | [MIT][license-mit] | _"[...] suite of programs for interacting with high-throughput sequencing data"_ | [code][code-samtools] / [manual][docs-samtools] / [publication][pub-samtools] |
| **SRA Tools** | [custom][license-sra-tools] | _"[...] collection of tools and libraries for using data in the INSDC Sequence Read Archives"_ | [code][code-sra-tools] / [manual][docs-sra-tools] |
| **STAR** | [MIT][license-mit] | _"**S**pliced **T**ranscripts **A**lignment to a **R**eference"_ - _"RNA-seq aligner"_ | [code][code-star] / [manual][docs-star] / [publication][pub-star] |

^ compatible with [GPLv3][license-gpl3]
@@ -715,18 +728,63 @@ Generate pseudoalignments of reads to transcripts with
- `--single`: Quantify single-end reads **(single-end only)**
- `--pseudobam`: Save pseudoalignments to transcriptome to BAM file


## Description of SRA download workflow steps

> This separate workflow consists of three Snakemake files: a main
> `sra_download.smk` and an individual Snakemake file for each sequencing mode
> (single-end and paired-end), as parameters for some tools differ between
> sequencing modes. The main `sra_download.smk` contains general steps for
> downloading the samples from the SRA repository and determining the
> sequencing mode in order to execute the appropriate subsequent rules.
> Individual steps of the workflow are described briefly, and links to the
> respective software manuals are given. Parameters that can be modified by
> the user (via the samples table) are also described. Descriptions of steps
> for which individual rules exist for single- and paired-end sequencing
> libraries are combined, and only differences between the modes are
> highlighted.

### SRA Sequencing mode-independent

#### `get_layout`
Determine the library layout (single- or paired-end) of each sample using
`efetch` from [**Entrez Direct**](#third-party-software-used).
- **Output**
  - A file named either `PAIRED` or `SINGLE`, which is used downstream to run
    the appropriate subworkflow.
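
The rule's logic can be approximated offline. The sketch below (with an
illustrative helper named after the rule) parses a runinfo-style CSV, as
returned by `efetch -db sra -format runinfo`, and extracts the
`LibraryLayout` field; the column position is looked up from the header rather
than hard-coded, since column order is not guaranteed:

```bash
#!/usr/bin/env bash
# Extract the LibraryLayout (SINGLE/PAIRED) from a runinfo-style CSV.
get_layout() {
    awk -F',' 'NR == 1 {
        for (i = 1; i <= NF; i++) if ($i == "LibraryLayout") col = i
    } NR == 2 { print $col }' "$1"
}

# Real runinfo output has many more columns; this is a trimmed example.
cat > /tmp/runinfo.csv <<'EOF'
Run,LibraryLayout,Platform
SRR18549672,PAIRED,ILLUMINA
EOF
get_layout /tmp/runinfo.csv   # prints: PAIRED
```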

#### `prefetch`
Download the SRA entry using [**SRA Tools**](#third-party-software-used).

#### `add_fq_file_path`
Aggregate the FASTQ file path(s) of all samples into a single table.
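
The aggregation amounts to concatenating per-sample rows under a common
header; a minimal shell sketch (the helper name and file paths are
illustrative, not taken from the workflow):

```bash
#!/usr/bin/env bash
# Aggregate per-sample TSV rows (sample, fq1[, fq2]) into one table.
aggregate_samples() {
    printf 'sample\tfq1\tfq2\n'
    cat "$@"   # each input file holds one header-less row per sample
}

printf 'SRR18552868\tSRR18552868.fastq.gz\t\n' > /tmp/row1.tsv
printf 'SRR18549672\tSRR18549672_1.fastq.gz\tSRR18549672_2.fastq.gz\n' > /tmp/row2.tsv
aggregate_samples /tmp/row1.tsv /tmp/row2.tsv > /tmp/sra_samples.out.tsv
```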

### SRA Sequencing mode-specific

#### `fasterq_dump`
Convert the SRA entry to FASTQ file(s) using [**SRA Tools**](#third-party-software-used).

#### `compress_fastq`
Compress FASTQ file(s) to `.gz` format with [**pigz**](#third-party-software-used).
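
A hedged sketch of this step (the helper mirrors the rule name but is
illustrative): it prefers `pigz` with several threads and falls back to plain
`gzip` when `pigz` is not on the `PATH`; the thread count is an arbitrary
choice here:

```bash
#!/usr/bin/env bash
# Compress a FASTQ file, using pigz if available, otherwise gzip.
compress_fastq() {
    local fq="$1"
    if command -v pigz > /dev/null 2>&1; then
        pigz -p 4 "$fq"   # parallel compression with 4 threads
    else
        gzip "$fq"        # single-threaded fallback
    fi
}

printf '@read1\nACGT\n+\nIIII\n' > /tmp/example.fastq
compress_fastq /tmp/example.fastq   # produces /tmp/example.fastq.gz
```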

#### `process_fastq`
Record the path(s) of the compressed `fastq.gz` file(s) in a per-sample table,
later aggregated into a single table by `add_fq_file_path`.


[code-alfa]: <https://github.com/biocompibens/ALFA>
[code-bedgraphtobigwig]: <https://github.com/ucscGenomeBrowser/kent>
[code-bedtools]: <https://github.com/arq5x/bedtools2>
[code-cutadapt]: <https://github.com/marcelm/cutadapt>
[code-gffread]: <https://github.com/gpertea/gffread>
[code-entrez-direct]: <https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/>
[code-fastqc]: <https://github.com/s-andrews/FastQC>
[code-imagemagick]: <https://github.com/ImageMagick/ImageMagick/>
[code-kallisto]: <https://github.com/pachterlab/kallisto>
[code-multiqc]: <https://github.com/ewels/MultiQC>
[code-pigz]: <https://github.com/madler/pigz>
[code-rseqc]: <http://rseqc.sourceforge.net/>
[code-salmon]: <https://github.com/COMBINE-lab/salmon>
[code-samtools]: <https://github.com/samtools/samtools>
[code-sra-tools]: <https://github.com/ncbi/sra-tools>
[code-star]: <https://github.com/alexdobin/STAR>
[custom-script-gtf-to-bed12]: <https://github.com/zavolanlab/zgtf>
[custom-script-tin]: <https://github.com/zavolanlab/tin-score-calculation>
@@ -738,25 +796,32 @@ Generate pseudoalignments of reads to transcripts with
[docs-cutadapt]: <https://cutadapt.readthedocs.io/en/stable/>
[docs-cutadapt-m]: <https://cutadapt.readthedocs.io/en/stable/guide.html#filtering-reads>
[docs-gffread]: <http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread>
[docs-entrez-direct]: <https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/README>
[docs-fastqc]: <http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/>
[docs-imagemagick]: <https://imagemagick.org/>
[docs-kallisto]: <http://pachterlab.github.io/kallisto/manual.html>
[docs-multiqc]: <https://multiqc.info/docs/>
[docs-pigz]:<https://zlib.net/pigz/pigz.pdf>
[docs-rseqc]: <http://rseqc.sourceforge.net/#usage-information>
[docs-salmon]: <https://salmon.readthedocs.io/en/latest/>
[docs-salmon-selective-alignment]: <https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/>
[docs-samtools]: <http://www.htslib.org/doc/samtools.html>
[docs-snakemake]: <https://snakemake.readthedocs.io/en/stable/>
[docs-snakemake-target-rule]: <https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#step-7-adding-a-target-rule>
[docs-sra-tools]: <https://github.com/ncbi/sra-tools/wiki>
[docs-star]: <https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>
[docs-star-rpm-norm]: <https://ycl6.gitbooks.io/rna-seq-data-analysis/visualization.html>
[license-bsd2]: <https://opensource.org/licenses/BSD-2-Clause>
[license-entrez-direct]: <https://www.ncbi.nlm.nih.gov/books/NBK179288/>
[license-gpl2]: <https://opensource.org/licenses/GPL-2.0>
[license-gpl3]: <https://opensource.org/licenses/GPL-3.0>
[license-imagemagick]: <https://github.com/ImageMagick/ImageMagick/blob/master/LICENSE>
[license-mit]: <https://opensource.org/licenses/MIT>
[license-pigz]: <https://github.com/madler/pigz/blob/master/README>
[license-sra-tools]: <https://github.com/ncbi/sra-tools/blob/master/LICENSE>
[pub-alfa]: <https://doi.org/10.1186/s12864-019-5624-2>
[pub-cutadapt]: <https://doi.org/10.14806/ej.17.1.200>
[pub-entrez-direct]: <https://www.ncbi.nlm.nih.gov/books/NBK179288/>
[pub-kallisto]: <https://doi.org/10.1038/nbt.3519>
[pub-multiqc]: <https://doi.org/10.1093/bioinformatics/btw354>
[pub-rseqc]: <https://doi.org/10.1093/bioinformatics/bts356>
4 changes: 1 addition & 3 deletions tests/test_htsinfer_workflow/test.local.sh
@@ -42,6 +42,4 @@ snakemake \
--keep-incomplete

# Check md5 sum of some output files
#find results/ -type f -name \*\.gz -exec gunzip '{}' \;
#find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"
md5sum --check "expected_output.md5"
1 change: 0 additions & 1 deletion tests/test_integration_workflow/test.local.sh
@@ -83,4 +83,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')

2 changes: 1 addition & 1 deletion tests/test_integration_workflow/test.slurm.sh
@@ -82,4 +82,4 @@ diff \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')
11 changes: 0 additions & 11 deletions tests/test_integration_workflow/test.temp.flag.sh
@@ -41,17 +41,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output_temp_flag.md5"

# Checksum file generated with
# find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
# find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
# md5sum $(cat expected_output.files) > expected_output_temp_flag.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
3 changes: 0 additions & 3 deletions tests/test_integration_workflow_multiple_lanes/test.local.sh
@@ -42,8 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"



# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -74,4 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')

13 changes: 0 additions & 13 deletions tests/test_integration_workflow_multiple_lanes/test.slurm.sh
@@ -42,17 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"

# Checksum file generated with
# find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
# find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
# md5sum $(cat expected_output.files) > expected_output.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -83,5 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')


12 changes: 0 additions & 12 deletions tests/test_integration_workflow_with_conda/test.local.sh
@@ -42,17 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"

# Checksum file generated with
#find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
#find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
#md5sum $(cat expected_output.files) > expected_output.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -83,4 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')

13 changes: 0 additions & 13 deletions tests/test_integration_workflow_with_conda/test.slurm.sh
@@ -42,17 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"

# Checksum file generated with
# find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
# find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
# md5sum $(cat expected_output.files) > expected_output.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -83,5 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')


6 changes: 3 additions & 3 deletions tests/test_sra_download_with_conda/expected_output.files
@@ -1,4 +1,4 @@
results/sra_downloads/sra_samples.out.tsv
results/sra_downloads/SRR18549672/SRR18549672_1.fastq
results/sra_downloads/SRR18549672/SRR18549672_2.fastq
results/sra_downloads/SRR18552868/SRR18552868.fastq
results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq
results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq
results/sra_downloads/compress/ERR2248142/ERR2248142.fastq
