fix: major rewrite of SRA download workflow (#157)
Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>
mkatsanto and uniqueg committed Feb 3, 2024
1 parent ede615b commit 6db184c
Showing 26 changed files with 485 additions and 188 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
@@ -97,6 +97,8 @@ jobs:
- name: Run htsinfer test script
run: bash tests/test_htsinfer_workflow/test.local.sh

- name: Run SRA downloads workflow
run: bash tests/test_sra_download_with_singularity/test.local.sh

integration-singularity-tempflag:
needs:
41 changes: 22 additions & 19 deletions README.md
@@ -269,24 +269,20 @@ your run.
# Sample downloads from SRA

An independent Snakemake workflow `workflow/rules/sra_download.smk` is included
for the download of SRA samples with [sra-tools].
for the download of sequencing libraries from the Sequence Read Archive and
conversion into FASTQ.

> Note: as of Snakemake 7.3.1, only profile conda is supported.
> Singularity fails because the *sra-tools* Docker container only has `sh`
but `bash` is required.

> Note: The workflow uses the implicit temporary directory
from snakemake, which is called with [resources.tmpdir].

The workflow expects the following config:
* `samples`, a sample table (tsv) with column *sample* containing *SRR* identifiers,
see example [here](tests/input_files/sra_samples.tsv).
The workflow expects the following parameters in the configuration file:
* `samples`, a sample table (tsv) with column *sample* containing *SRR*
identifiers (ERR and DRR are also supported), see
[example](tests/input_files/sra_samples.tsv).
* `outdir`, an output directory
* `samples_out`, a pointer to a modified sample table with location of fastq files
* `samples_out`, a pointer to a modified sample table with the locations of
the corresponding FASTQ files
* `cluster_log_dir`, the cluster log directory.
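
Collected in one place, a configuration for this workflow might look as
follows (values are illustrative; the same keys can equally be passed on the
command line via `--config`, as in the example below):

```yaml
# Illustrative configuration for workflow/rules/sra_download.smk
samples: "tests/input_files/sra_samples.tsv"  # TSV with a 'sample' column of SRR/ERR/DRR IDs
outdir: "results/sra_downloads"               # output directory
samples_out: "results/sra_downloads/sra_samples.out.tsv"  # sample table with FASTQ locations
log_dir: "logs"                               # workflow log directory
cluster_log_dir: "logs/cluster_log"           # cluster log directory
```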

For executing the example one can use the following
(with activated *zarp* environment):
For executing the example with Conda environments, one can use the following
command (from within the activated `zarp` Conda environment):

```bash
snakemake --snakefile="workflow/rules/sra_download.smk" \
@@ -297,11 +297,19 @@ snakemake --snakefile="workflow/rules/sra_download.smk" \
log_dir="logs" \
cluster_log_dir="logs/cluster_log"
```
After successful execution, `results/sra_downloads/sra_samples.out.tsv` should contain:

Alternatively, change the argument to `--profile` from `local-conda` to
`local-singularity` to execute the workflow steps within Singularity
containers.

After successful execution, `results/sra_downloads/sra_samples.out.tsv` should
contain:

```tsv
sample fq1 fq2
SRR18552868 results/sra_downloads/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/SRR18549672/SRR18549672_2.fastq.gz
sample fq1 fq2
SRR18552868 results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142 results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz
```
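
A quick sanity check on such a table can be sketched in plain shell: the
hypothetical helper `count_layouts` below counts single- vs paired-end entries
by whether the `fq2` column is filled (tab-separated columns `sample`, `fq1`,
`fq2` with a header line are assumed, as in the table above):

```bash
#!/usr/bin/env bash
# Count single-end vs paired-end samples in a ZARP-style sample table.
count_layouts() {
    awk -F'\t' 'NR > 1 {
        if ($3 == "") single++; else paired++
    } END {
        printf "single-end: %d\npaired-end: %d\n", single, paired
    }' "$1"
}

# Example with a table like the one shown above (paths shortened):
printf 'sample\tfq1\tfq2\nSRR18552868\ta.fastq.gz\t\nSRR18549672\tb_1.fastq.gz\tb_2.fastq.gz\n' \
    > /tmp/sra_samples.tsv
count_layouts /tmp/sra_samples.tsv   # prints: single-end: 1 / paired-end: 1
```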


@@ -355,5 +359,4 @@ After successful execution - if all parameters could be either inferred or were
[slurm]: <https://slurm.schedmd.com/documentation.html>
[zavolan-lab]: <https://www.biozentrum.unibas.ch/research/researchgroups/overview/unit/zavolan/research-group-mihaela-zavolan/>
[pipeline-documentation]: pipeline_documentation.md
[sra-tools]: <https://github.com/ncbi/sra-tools>
[resources.tmpdir]: <https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?#standard-resources>
4 changes: 2 additions & 2 deletions config/config.yaml
@@ -1,6 +1,6 @@
---
# Required fields
samples: "tests/input_files/samples.tsv"
samples: "samples.tsv"
output_dir: "results"
log_dir: "logs"
cluster_log_dir: "logs/cluster"
@@ -15,4 +15,4 @@
report_url: "https://zavolan.biozentrum.unibas.ch/"
author_name: "NA"
author_email: "NA"
...
...
2 changes: 1 addition & 1 deletion install/environment.yml
@@ -4,5 +4,5 @@ channels:
- bioconda
dependencies:
- python>=3.10, <3.12
- snakemake
- snakemake<8
- graphviz
65 changes: 65 additions & 0 deletions pipeline_documentation.md
@@ -47,6 +47,16 @@ on installation and usage please see [here](README.md).
- [`map_genome_star`](#map_genome_star)
- [`quantification_salmon`](#quantification_salmon)
- [`genome_quantification_kallisto`](#genome_quantification_kallisto)
- [Description of SRA download workflow steps](#description-of-sra-download-workflow-steps)
  - [SRA Sequencing mode-independent](#sra-sequencing-mode-independent)
- [`get_layout`](#get_layout)
- [`prefetch`](#prefetch)
- [`add_fq_file_path`](#add_fq_file_path)
- [SRA Sequencing mode-specific](#sra-sequencing-mode-specific)
- [`fasterq_dump`](#fasterq_dump)
  - [`compress_fastq`](#compress_fastq)
- [`process_fastq`](#process_fastq)


## Third-party software used

@@ -59,13 +69,16 @@ on installation and usage please see [here](README.md).
| **bedtools** | [GPLv2][license-gpl2] | _"[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF"_ | [code][code-bedtools] / [manual][code-bedtools] |
| **cutadapt** | [MIT][license-mit] | _"[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads"_ | [code][code-cutadapt] / [manual][docs-cutadapt] / [publication][pub-cutadapt] |
| **gffread** | [MIT][license-mit] | _"[...] validate, filter, convert and perform various other operations on GFF files"_ | [code][code-gffread] / [manual][docs-gffread] |
| **Entrez Direct** | [custom][license-entrez-direct] | _"[...] an advanced method for accessing the NCBI's set of interconnected databases from a UNIX terminal window"_ | [code][code-entrez-direct] / [manual][docs-entrez-direct] / [publication][pub-entrez-direct] |
| **FastQC** | [GPLv3][license-gpl3] | _"A quality control analysis tool for high throughput sequencing data"_ | [code][code-fastqc] / [manual][docs-fastqc] |
| **ImageMagick** | [custom][license-imagemagick]^ | _"[...] create, edit, compose, or convert bitmap images"_ | [code][code-imagemagick] / [manual][docs-imagemagick] |
| **kallisto** | [BSD-2][license-bsd2] | _"[...] program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads"_ | [code][code-kallisto] / [manual][docs-kallisto] / [publication][pub-kallisto] |
| **MultiQC** | [GPLv3][license-gpl3] | _"Aggregate results from bioinformatics analyses across many samples into a single report"_ | [code][code-multiqc] / [manual][docs-multiqc] / [publication][pub-multiqc] |
| **pigz** | [custom][license-pigz] | _"[...] parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data"_ | [code][code-pigz] / [manual][docs-pigz] |
| **RSeqC** | [GPLv3][license-gpl3] | _"[...] comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure."_ | [code][code-rseqc] / [manual][docs-rseqc] / [publication][pub-rseqc] |
| **Salmon** | [GPLv3][license-gpl3] | _"Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment"_ | [code][code-salmon] / [manual][docs-salmon] / [publication][pub-salmon] |
| **SAMtools** | [MIT][license-mit] | _"[...] suite of programs for interacting with high-throughput sequencing data"_ | [code][code-samtools] / [manual][docs-samtools] / [publication][pub-samtools] |
| **SRA Tools** | [custom][license-sra-tools] | _"[...] collection of tools and libraries for using data in the INSDC Sequence Read Archives"_ | [code][code-sra-tools] / [manual][docs-sra-tools] |
| **STAR** | [MIT][license-mit] | _"**S**pliced **T**ranscripts **A**lignment to a **R**eference"_ - _"RNA-seq aligner"_ | [code][code-star] / [manual][docs-star] / [publication][pub-star] |

^ compatible with [GPLv3][license-gpl3]
@@ -715,18 +728,63 @@ Generate pseudoalignments of reads to transcripts with
- `--single`: Quantify single-end reads **(single-end only)**
- `--pseudobam`: Save pseudoalignments to transcriptome to BAM file


## Description of SRA download workflow steps

> This separate workflow consists of three Snakemake files: a main
> `sra_download.smk` and an individual Snakemake file for each sequencing mode
> (single-end and paired-end), as parameters for some tools differ between
> sequencing modes. The main `sra_download.smk` contains general steps for
> downloading the samples from the SRA repository and determining the
> sequencing mode in order to execute the appropriate subsequent rules.
> Individual steps of the workflow are described briefly, and links to the
> respective software manuals are given. Parameters that can be modified by
> the user (via the samples table) are also described. Descriptions of steps
> for which individual rules exist for single- and paired-end sequencing
> libraries are combined, and only differences between the modes are
> highlighted.

### SRA Sequencing mode-independent

#### `get_layout`
Determine the library layout (single- or paired-end) of each sample using
`efetch` from [**Entrez Direct**](#third-party-software-used).
- **Output**
  - A file named either `PAIRED` or `SINGLE`, which is used downstream to run
    the appropriate subworkflow.
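
The rule's logic can be approximated offline. The sketch below (with an
illustrative helper named after the rule) parses a runinfo-style CSV, as
returned by `efetch -db sra -format runinfo`, and extracts the
`LibraryLayout` field; the column position is looked up from the header rather
than hard-coded, since column order is not guaranteed:

```bash
#!/usr/bin/env bash
# Extract the LibraryLayout (SINGLE/PAIRED) from a runinfo-style CSV.
get_layout() {
    awk -F',' 'NR == 1 {
        for (i = 1; i <= NF; i++) if ($i == "LibraryLayout") col = i
    } NR == 2 { print $col }' "$1"
}

# Real runinfo output has many more columns; this is a trimmed example.
cat > /tmp/runinfo.csv <<'EOF'
Run,LibraryLayout,Platform
SRR18549672,PAIRED,ILLUMINA
EOF
get_layout /tmp/runinfo.csv   # prints: PAIRED
```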

#### `prefetch`
Download the SRA entry using [**SRA Tools**](#third-party-software-used).

#### `add_fq_file_path`
Aggregate the FASTQ file path(s) of all samples into a single table.
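
The aggregation amounts to concatenating per-sample rows under a common
header; a minimal shell sketch (the helper name and file paths are
illustrative, not taken from the workflow):

```bash
#!/usr/bin/env bash
# Aggregate per-sample TSV rows (sample, fq1[, fq2]) into one table.
aggregate_samples() {
    printf 'sample\tfq1\tfq2\n'
    cat "$@"   # each input file holds one header-less row per sample
}

printf 'SRR18552868\tSRR18552868.fastq.gz\t\n' > /tmp/row1.tsv
printf 'SRR18549672\tSRR18549672_1.fastq.gz\tSRR18549672_2.fastq.gz\n' > /tmp/row2.tsv
aggregate_samples /tmp/row1.tsv /tmp/row2.tsv > /tmp/sra_samples.out.tsv
```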

### SRA Sequencing mode-specific

#### `fasterq_dump`
Convert the SRA entry to FASTQ file(s) using [**SRA Tools**](#third-party-software-used).

#### `compress_fastq`
Compress FASTQ file(s) to `.gz` format with [**pigz**](#third-party-software-used).
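
A hedged sketch of this step (the helper mirrors the rule name but is
illustrative): it prefers `pigz` with several threads and falls back to plain
`gzip` when `pigz` is not on the `PATH`; the thread count is an arbitrary
choice here:

```bash
#!/usr/bin/env bash
# Compress a FASTQ file, using pigz if available, otherwise gzip.
compress_fastq() {
    local fq="$1"
    if command -v pigz > /dev/null 2>&1; then
        pigz -p 4 "$fq"   # parallel compression with 4 threads
    else
        gzip "$fq"        # single-threaded fallback
    fi
}

printf '@read1\nACGT\n+\nIIII\n' > /tmp/example.fastq
compress_fastq /tmp/example.fastq   # produces /tmp/example.fastq.gz
```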

#### `process_fastq`
Record the path(s) of the compressed `fastq.gz` file(s) in a per-sample table,
later aggregated into a single table by `add_fq_file_path`.


[code-alfa]: <https://github.com/biocompibens/ALFA>
[code-bedgraphtobigwig]: <https://github.com/ucscGenomeBrowser/kent>
[code-bedtools]: <https://github.com/arq5x/bedtools2>
[code-cutadapt]: <https://github.com/marcelm/cutadapt>
[code-gffread]: <https://github.com/gpertea/gffread>
[code-entrez-direct]: <https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/>
[code-fastqc]: <https://github.com/s-andrews/FastQC>
[code-imagemagick]: <https://github.com/ImageMagick/ImageMagick/>
[code-kallisto]: <https://github.com/pachterlab/kallisto>
[code-multiqc]: <https://github.com/ewels/MultiQC>
[code-pigz]: <https://github.com/madler/pigz>
[code-rseqc]: <http://rseqc.sourceforge.net/>
[code-salmon]: <https://github.com/COMBINE-lab/salmon>
[code-samtools]: <https://github.com/samtools/samtools>
[code-sra-tools]: <https://github.com/ncbi/sra-tools>
[code-star]: <https://github.com/alexdobin/STAR>
[custom-script-gtf-to-bed12]: <https://github.com/zavolanlab/zgtf>
[custom-script-tin]: <https://github.com/zavolanlab/tin-score-calculation>
@@ -738,25 +796,32 @@ Generate pseudoalignments of reads to transcripts with
[docs-cutadapt]: <https://cutadapt.readthedocs.io/en/stable/>
[docs-cutadapt-m]: <https://cutadapt.readthedocs.io/en/stable/guide.html#filtering-reads>
[docs-gffread]: <http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread>
[docs-entrez-direct]: <https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/README>
[docs-fastqc]: <http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/>
[docs-imagemagick]: <https://imagemagick.org/>
[docs-kallisto]: <http://pachterlab.github.io/kallisto/manual.html>
[docs-multiqc]: <https://multiqc.info/docs/>
[docs-pigz]:<https://zlib.net/pigz/pigz.pdf>
[docs-rseqc]: <http://rseqc.sourceforge.net/#usage-information>
[docs-salmon]: <https://salmon.readthedocs.io/en/latest/>
[docs-salmon-selective-alignment]: <https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/>
[docs-samtools]: <http://www.htslib.org/doc/samtools.html>
[docs-snakemake]: <https://snakemake.readthedocs.io/en/stable/>
[docs-snakemake-target-rule]: <https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#step-7-adding-a-target-rule>
[docs-sra-tools]: <https://github.com/ncbi/sra-tools/wiki>
[docs-star]: <https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>
[docs-star-rpm-norm]: <https://ycl6.gitbooks.io/rna-seq-data-analysis/visualization.html>
[license-bsd2]: <https://opensource.org/licenses/BSD-2-Clause>
[license-entrez-direct]: <https://www.ncbi.nlm.nih.gov/books/NBK179288/>
[license-gpl2]: <https://opensource.org/licenses/GPL-2.0>
[license-gpl3]: <https://opensource.org/licenses/GPL-3.0>
[license-imagemagick]: <https://github.com/ImageMagick/ImageMagick/blob/master/LICENSE>
[license-mit]: <https://opensource.org/licenses/MIT>
[license-pigz]: <https://github.com/madler/pigz/blob/master/README>
[license-sra-tools]: <https://github.com/ncbi/sra-tools/blob/master/LICENSE>
[pub-alfa]: <https://doi.org/10.1186/s12864-019-5624-2>
[pub-cutadapt]: <https://doi.org/10.14806/ej.17.1.200>
[pub-entrez-direct]: <https://www.ncbi.nlm.nih.gov/books/NBK179288/>
[pub-kallisto]: <https://doi.org/10.1038/nbt.3519>
[pub-multiqc]: <https://doi.org/10.1093/bioinformatics/btw354>
[pub-rseqc]: <https://doi.org/10.1093/bioinformatics/bts356>
4 changes: 1 addition & 3 deletions tests/test_htsinfer_workflow/test.local.sh
@@ -42,6 +42,4 @@ snakemake \
--keep-incomplete

# Check md5 sum of some output files
#find results/ -type f -name \*\.gz -exec gunzip '{}' \;
#find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"
md5sum --check "expected_output.md5"
1 change: 0 additions & 1 deletion tests/test_integration_workflow/test.local.sh
@@ -83,4 +83,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')

2 changes: 1 addition & 1 deletion tests/test_integration_workflow/test.slurm.sh
@@ -82,4 +82,4 @@ diff \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')
11 changes: 0 additions & 11 deletions tests/test_integration_workflow/test.temp.flag.sh
@@ -41,17 +41,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output_temp_flag.md5"

# Checksum file generated with
# find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
# find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
# md5sum $(cat expected_output.files) > expected_output_temp_flag.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
3 changes: 0 additions & 3 deletions tests/test_integration_workflow_multiple_lanes/test.local.sh
@@ -42,8 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"



# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -74,4 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')

13 changes: 0 additions & 13 deletions tests/test_integration_workflow_multiple_lanes/test.slurm.sh
@@ -42,17 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"

# Checksum file generated with
# find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
# find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
# md5sum $(cat expected_output.files) > expected_output.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -83,5 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')


12 changes: 0 additions & 12 deletions tests/test_integration_workflow_with_conda/test.local.sh
@@ -42,17 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"

# Checksum file generated with
#find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
#find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
#md5sum $(cat expected_output.files) > expected_output.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -83,4 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')

13 changes: 0 additions & 13 deletions tests/test_integration_workflow_with_conda/test.slurm.sh
@@ -42,17 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"

# Checksum file generated with
# find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
# find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
# md5sum $(cat expected_output.files) > expected_output.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -83,5 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')


6 changes: 3 additions & 3 deletions tests/test_sra_download_with_conda/expected_output.files
@@ -1,4 +1,4 @@
results/sra_downloads/sra_samples.out.tsv
results/sra_downloads/SRR18549672/SRR18549672_1.fastq
results/sra_downloads/SRR18549672/SRR18549672_2.fastq
results/sra_downloads/SRR18552868/SRR18552868.fastq
results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq
results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq
results/sra_downloads/compress/ERR2248142/ERR2248142.fastq
