fix: major rewrite of SRA download workflow #157

Merged · 13 commits · Feb 3, 2024
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
@@ -97,6 +97,8 @@ jobs:
- name: Run htsinfer test script
run: bash tests/test_htsinfer_workflow/test.local.sh

- name: Run SRA downloads workflow
run: bash tests/test_sra_download_with_singularity/test.local.sh

integration-singularity-tempflag:
needs:
41 changes: 22 additions & 19 deletions README.md
@@ -269,24 +269,20 @@ your run.
# Sample downloads from SRA

An independent Snakemake workflow `workflow/rules/sra_download.smk` is included
for the download of SRA samples with [sra-tools].
for the download of sequencing libraries from the Sequence Read Archive and
conversion into FASTQ.

> Note: as of Snakemake 7.3.1, only profile conda is supported.
> Singularity fails because the *sra-tools* Docker container only has `sh`
but `bash` is required.

> Note: The workflow uses the implicit temporary directory
from snakemake, which is called with [resources.tmpdir].

The workflow expects the following config:
* `samples`, a sample table (tsv) with column *sample* containing *SRR* identifiers,
see example [here](tests/input_files/sra_samples.tsv).
The workflow expects the following parameters in the configuration file:
* `samples`, a sample table (tsv) with column *sample* containing *SRR*
identifiers (ERR and DRR are also supported), see
[example](tests/input_files/sra_samples.tsv).
* `outdir`, an output directory
* `samples_out`, a pointer to a modified sample table with location of fastq files
* `samples_out`, a pointer to a modified sample table with the locations of
the corresponding FASTQ files
* `cluster_log_dir`, the cluster log directory.
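
For illustration, these parameters might be collected in a configuration file of
roughly the following shape and passed to Snakemake with `--configfile`; the file
name and values below are a sketch, not files shipped with the repository:

```bash
# Hypothetical configuration file for the SRA download workflow;
# adjust the paths to your own sample table and output locations.
cat > sra_download_config.yaml << 'EOF'
samples: "tests/input_files/sra_samples.tsv"
outdir: "results/sra_downloads"
samples_out: "results/sra_downloads/sra_samples.out.tsv"
log_dir: "logs"
cluster_log_dir: "logs/cluster_log"
EOF

# Use the same profile value as in the example command below
# (local-conda or local-singularity).
snakemake --snakefile="workflow/rules/sra_download.smk" \
    --configfile="sra_download_config.yaml" \
    --profile="local-conda"
```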

For executing the example one can use the following
(with activated *zarp* environment):
For executing the example with Conda environments, one can use the following
command (from within the activated `zarp` Conda environment):

```bash
snakemake --snakefile="workflow/rules/sra_download.smk" \
@@ -297,11 +293,19 @@ snakemake --snakefile="workflow/rules/sra_download.smk" \
log_dir="logs" \
cluster_log_dir="logs/cluster_log"
```
After successful execution, `results/sra_downloads/sra_samples.out.tsv` should contain:

Alternatively, change the argument to `--profile` from `local-conda` to
`local-singularity` to execute the workflow steps within Singularity
containers.

After successful execution, `results/sra_downloads/sra_samples.out.tsv` should
contain:

```tsv
sample fq1 fq2
SRR18552868 results/sra_downloads/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/SRR18549672/SRR18549672_2.fastq.gz
sample fq1 fq2
SRR18552868 results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142 results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz
```


@@ -355,5 +359,4 @@ After successful execution - if all parameters could be either inferred or were
[slurm]: <https://slurm.schedmd.com/documentation.html>
[zavolan-lab]: <https://www.biozentrum.unibas.ch/research/researchgroups/overview/unit/zavolan/research-group-mihaela-zavolan/>
[pipeline-documentation]: pipeline_documentation.md
[sra-tools]: <https://github.com/ncbi/sra-tools>
[resources.tmpdir]: <https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?#standard-resources>
4 changes: 2 additions & 2 deletions config/config.yaml
@@ -1,6 +1,6 @@
---
# Required fields
samples: "tests/input_files/samples.tsv"
samples: "samples.tsv"
output_dir: "results"
log_dir: "logs"
cluster_log_dir: "logs/cluster"
@@ -15,4 +15,4 @@
report_url: "https://zavolan.biozentrum.unibas.ch/"
author_name: "NA"
author_email: "NA"
...
...
2 changes: 1 addition & 1 deletion install/environment.yml
@@ -4,5 +4,5 @@ channels:
- bioconda
dependencies:
- python>=3.10, <3.12
- snakemake
- snakemake<8
- graphviz
65 changes: 65 additions & 0 deletions pipeline_documentation.md
@@ -47,6 +47,16 @@ on installation and usage please see [here](README.md).
- [`map_genome_star`](#map_genome_star)
- [`quantification_salmon`](#quantification_salmon)
- [`genome_quantification_kallisto`](#genome_quantification_kallisto)
- [Description of SRA download workflow steps](#description-of-sra-download-workflow-steps)
  - [SRA Sequencing mode-independent](#sra-sequencing-mode-independent)
- [`get_layout`](#get_layout)
- [`prefetch`](#prefetch)
- [`add_fq_file_path`](#add_fq_file_path)
- [SRA Sequencing mode-specific](#sra-sequencing-mode-specific)
- [`fasterq_dump`](#fasterq_dump)
    - [`compress_fastq`](#compress_fastq)
- [`process_fastq`](#process_fastq)


## Third-party software used

@@ -59,13 +69,16 @@ on installation and usage please see [here](README.md).
| **bedtools** | [GPLv2][license-gpl2] | _"[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF"_ | [code][code-bedtools] / [manual][code-bedtools] |
| **cutadapt** | [MIT][license-mit] | _"[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads"_ | [code][code-cutadapt] / [manual][docs-cutadapt] / [publication][pub-cutadapt] |
| **gffread** | [MIT][license-mit] | _"[...] validate, filter, convert and perform various other operations on GFF files"_ | [code][code-gffread] / [manual][docs-gffread] |
| **Entrez Direct** | [custom][license-entrez-direct] | _"[...] an advanced method for accessing the NCBI's set of interconnected databases from a UNIX terminal window"_ | [code][code-entrez-direct] / [manual][docs-entrez-direct] / [publication][pub-entrez-direct] |
| **FastQC** | [GPLv3][license-gpl3] | _"A quality control analysis tool for high throughput sequencing data"_ | [code][code-fastqc] / [manual][docs-fastqc] |
| **ImageMagick** | [custom][license-imagemagick]^ | _"[...] create, edit, compose, or convert bitmap images"_ | [code][code-imagemagick] / [manual][docs-imagemagick] |
| **kallisto** | [BSD-2][license-bsd2] | _"[...] program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads"_ | [code][code-kallisto] / [manual][docs-kallisto] / [publication][pub-kallisto] |
| **MultiQC** | [GPLv3][license-gpl3] | _"Aggregate results from bioinformatics analyses across many samples into a single report"_ | [code][code-multiqc] / [manual][docs-multiqc] / [publication][pub-multiqc] |
| **pigz** | [custom][license-pigz] | _"[...] parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data"_ | [code][code-pigz] / [manual][docs-pigz] |
| **RSeqC** | [GPLv3][license-gpl3] | _"[...] comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure."_ | [code][code-rseqc] / [manual][docs-rseqc] / [publication][pub-rseqc] |
| **Salmon** | [GPLv3][license-gpl3] | _"Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment"_ | [code][code-salmon] / [manual][docs-salmon] / [publication][pub-salmon] |
| **SAMtools** | [MIT][license-mit] | _"[...] suite of programs for interacting with high-throughput sequencing data"_ | [code][code-samtools] / [manual][docs-samtools] / [publication][pub-samtools] |
| **SRA Tools** | [custom][license-sra-tools] | _"[...] collection of tools and libraries for using data in the INSDC Sequence Read Archives"_ | [code][code-sra-tools] / [manual][docs-sra-tools] |
| **STAR** | [MIT][license-mit] | _"**S**pliced **T**ranscripts **A**lignment to a **R**eference"_ - _"RNA-seq aligner"_ | [code][code-star] / [manual][docs-star] / [publication][pub-star] |

^ compatible with [GPLv3][license-gpl3]
@@ -715,18 +728,63 @@ Generate pseudoalignments of reads to transcripts with
- `--single`: Quantify single-end reads **(single-end only)**
- `--pseudobam`: Save pseudoalignments to transcriptome to BAM file


## Description of SRA download workflow steps

> This separate workflow consists of three Snakemake files: a main
> `sra_download.smk` and one Snakemake file per sequencing mode (single-end and
> paired-end), as parameters for some tools differ between the two modes.
> The main `sra_download.smk` contains the general steps for downloading the
> samples from the SRA repository and determining the sequencing mode, so that
> the appropriate subsequent rules can be executed. Individual steps of the
> workflow are described briefly, and links to the respective software manuals
> are given. Parameters that can be modified by the user (via the samples table)
> are also described. Descriptions of steps for which separate rules exist for
> single- and paired-end sequencing libraries are combined, and only the
> differences between the modes are highlighted.


### SRA Sequencing mode-independent

#### `get_layout`
Get the library layout of each sample (paired- or single-end) using `efetch` from [**Entrez Direct**](#third-party-software-used).
- **Output**
  - A file named either PAIRED or SINGLE, which is used downstream to run the
    appropriate subworkflow.
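
A hedged sketch of one way to query the layout with Entrez Direct (the exact query used by the rule may differ):

```bash
# Retrieve the run info for a single accession and extract the
# "LibraryLayout" column (PAIRED or SINGLE) by its header name.
acc=SRR18552868
efetch -db sra -id "$acc" -format runinfo \
  | awk -F',' -v acc="$acc" '
      NR == 1 { for (i = 1; i <= NF; i++) if ($i == "LibraryLayout") col = i; next }
      $1 == acc { print $col }'
```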

#### `prefetch`
Download the SRA entry for each sample using [**SRA Tools**](#third-party-software-used).
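
A minimal sketch of the underlying call; the output location is an assumption based on the example paths above:

```bash
# Download the SRA run for one accession into the workflow's output directory;
# prefetch typically creates <outdir>/<accession>/<accession>.sra.
prefetch SRR18552868 --output-directory results/sra_downloads
```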

#### `add_fq_file_path`
Aggregate the FASTQ file path(s) of all samples into a single table.
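
Conceptually, this amounts to concatenating the per-sample records under a common header, roughly as sketched below (the per-sample file name `fq_path.tsv` is hypothetical):

```bash
# Write the header once, then append one row per sample.
printf 'sample\tfq1\tfq2\n' > results/sra_downloads/sra_samples.out.tsv
cat results/sra_downloads/*/fq_path.tsv >> results/sra_downloads/sra_samples.out.tsv
```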

### SRA Sequencing mode-specific

#### `fasterq_dump`
Convert the SRA entry to FASTQ file(s) using [**SRA Tools**](#third-party-software-used).
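
A hedged sketch for a paired-end sample; the actual rules may use different flags, and the single- and paired-end rules differ in their options:

```bash
# Convert a prefetched paired-end entry into FASTQ; mates are written to
# SRR18549672_1.fastq and SRR18549672_2.fastq in the output directory.
fasterq-dump results/sra_downloads/SRR18549672/SRR18549672.sra \
  --split-files \
  --outdir results/sra_downloads/SRR18549672 \
  --threads 4
```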

#### `compress_fastq`
Compress FASTQ file(s) to `.gz` format using [**pigz**](#third-party-software-used).
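
For example (the path is a placeholder and the thread count arbitrary):

```bash
# Compress with multiple threads; pigz replaces the input file
# with its .gz counterpart, like gzip.
pigz --processes 4 path/to/SRR18549672_1.fastq
```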

#### `process_fastq`
Record the `fastq.gz` file path(s) of a sample in a per-sample table, which is later aggregated into a single table by `add_fq_file_path`.
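
An illustrative sketch for one paired-end sample (the per-sample file name `fq_path.tsv` is hypothetical; single-end samples would leave the `fq2` column empty):

```bash
acc=SRR18549672
printf '%s\t%s\t%s\n' "$acc" \
  "results/sra_downloads/compress/${acc}/${acc}_1.fastq.gz" \
  "results/sra_downloads/compress/${acc}/${acc}_2.fastq.gz" \
  > "results/sra_downloads/${acc}/fq_path.tsv"
```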


[code-alfa]: <https://github.com/biocompibens/ALFA>
[code-bedgraphtobigwig]: <https://github.com/ucscGenomeBrowser/kent>
[code-bedtools]: <https://github.com/arq5x/bedtools2>
[code-cutadapt]: <https://github.com/marcelm/cutadapt>
[code-gffread]: <https://github.com/gpertea/gffread>
[code-entrez-direct]: <https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/>
[code-fastqc]: <https://github.com/s-andrews/FastQC>
[code-imagemagick]: <https://github.com/ImageMagick/ImageMagick/>
[code-kallisto]: <https://github.com/pachterlab/kallisto>
[code-multiqc]: <https://github.com/ewels/MultiQC>
[code-pigz]: <https://github.com/madler/pigz>
[code-rseqc]: <http://rseqc.sourceforge.net/>
[code-salmon]: <https://github.com/COMBINE-lab/salmon>
[code-samtools]: <https://github.com/samtools/samtools>
[code-sra-tools]: <https://github.com/ncbi/sra-tools>
[code-star]: <https://github.com/alexdobin/STAR>
[custom-script-gtf-to-bed12]: <https://github.com/zavolanlab/zgtf>
[custom-script-tin]: <https://github.com/zavolanlab/tin-score-calculation>
@@ -738,25 +796,32 @@
[docs-cutadapt]: <https://cutadapt.readthedocs.io/en/stable/>
[docs-cutadapt-m]: <https://cutadapt.readthedocs.io/en/stable/guide.html#filtering-reads>
[docs-gffread]: <http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread>
[docs-entrez-direct]: <https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/README>
[docs-fastqc]: <http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/>
[docs-imagemagick]: <https://imagemagick.org/>
[docs-kallisto]: <http://pachterlab.github.io/kallisto/manual.html>
[docs-multiqc]: <https://multiqc.info/docs/>
[docs-pigz]:<https://zlib.net/pigz/pigz.pdf>
[docs-rseqc]: <http://rseqc.sourceforge.net/#usage-information>
[docs-salmon]: <https://salmon.readthedocs.io/en/latest/>
[docs-salmon-selective-alignment]: <https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/>
[docs-samtools]: <http://www.htslib.org/doc/samtools.html>
[docs-snakemake]: <https://snakemake.readthedocs.io/en/stable/>
[docs-snakemake-target-rule]: <https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#step-7-adding-a-target-rule>
[docs-sra-tools]: <https://github.com/ncbi/sra-tools/wiki>
[docs-star]: <https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>
[docs-star-rpm-norm]: <https://ycl6.gitbooks.io/rna-seq-data-analysis/visualization.html>
[license-bsd2]: <https://opensource.org/licenses/BSD-2-Clause>
[license-entrez-direct]: <https://www.ncbi.nlm.nih.gov/books/NBK179288/>
[license-gpl2]: <https://opensource.org/licenses/GPL-2.0>
[license-gpl3]: <https://opensource.org/licenses/GPL-3.0>
[license-imagemagick]: <https://github.com/ImageMagick/ImageMagick/blob/master/LICENSE>
[license-mit]: <https://opensource.org/licenses/MIT>
[license-pigz]: <https://github.com/madler/pigz/blob/master/README>
[license-sra-tools]: <https://github.com/ncbi/sra-tools/blob/master/LICENSE>
[pub-alfa]: <https://doi.org/10.1186/s12864-019-5624-2>
[pub-cutadapt]: <https://doi.org/10.14806/ej.17.1.200>
[pub-entrez-direct]: <https://www.ncbi.nlm.nih.gov/books/NBK179288/>
[pub-kallisto]: <https://doi.org/10.1038/nbt.3519>
[pub-multiqc]: <https://doi.org/10.1093/bioinformatics/btw354>
[pub-rseqc]: <https://doi.org/10.1093/bioinformatics/bts356>
4 changes: 1 addition & 3 deletions tests/test_htsinfer_workflow/test.local.sh
@@ -42,6 +42,4 @@ snakemake \
--keep-incomplete

# Check md5 sum of some output files
#find results/ -type f -name \*\.gz -exec gunzip '{}' \;
#find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"
md5sum --check "expected_output.md5"
1 change: 0 additions & 1 deletion tests/test_integration_workflow/test.local.sh
@@ -83,4 +83,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')

2 changes: 1 addition & 1 deletion tests/test_integration_workflow/test.slurm.sh
@@ -82,4 +82,4 @@ diff \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')
11 changes: 0 additions & 11 deletions tests/test_integration_workflow/test.temp.flag.sh
@@ -41,17 +41,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output_temp_flag.md5"

# Checksum file generated with
# find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
# find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
# md5sum $(cat expected_output.files) > expected_output_temp_flag.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
3 changes: 0 additions & 3 deletions tests/test_integration_workflow_multiple_lanes/test.local.sh
@@ -42,8 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"



# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -74,4 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')

13 changes: 0 additions & 13 deletions tests/test_integration_workflow_multiple_lanes/test.slurm.sh
@@ -42,17 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"

# Checksum file generated with
# find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
# find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
# md5sum $(cat expected_output.files) > expected_output.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -83,5 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')


12 changes: 0 additions & 12 deletions tests/test_integration_workflow_with_conda/test.local.sh
@@ -42,17 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"

# Checksum file generated with
#find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
#find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
#md5sum $(cat expected_output.files) > expected_output.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -83,4 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')

13 changes: 0 additions & 13 deletions tests/test_integration_workflow_with_conda/test.slurm.sh
@@ -42,17 +42,6 @@ find results/ -type f -name \*\.gz -exec gunzip '{}' \;
find results/ -type f -name \*\.zip -exec sh -c 'unzip -o {} -d $(dirname {})' \;
md5sum --check "expected_output.md5"

# Checksum file generated with
# find results/ \
# -type f \
# -name \*\.gz \
# -exec gunzip '{}' \;
# find results/ \
# -type f \
# -name \*\.zip \
# -exec sh -c 'unzip -o {} -d $(dirname {})' \;
# md5sum $(cat expected_output.files) > expected_output.md5

# Check whether STAR produces expected alignments
# STAR alignments need to be fully within ground truth alignments for tests to pass; not checking
# vice versa because processing might cut off parts of reads (if testing STAR directly, add '-f 1'
@@ -83,5 +72,3 @@ diff \
diff \
<(cat results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/synthetic_10_reads_paired_synthetic_10_reads_paired.salmon.pe/quant.genes.sf | cut -f1,5 | tail -n +2 | sort -k1,1) \
<(cat ../input_files/synthetic.mate_1.bed | cut -f7 | sort | uniq -c | sort -k2nr | awk '{printf($2"\t"$1"\n")}')


6 changes: 3 additions & 3 deletions tests/test_sra_download_with_conda/expected_output.files
@@ -1,4 +1,4 @@
results/sra_downloads/sra_samples.out.tsv
results/sra_downloads/SRR18549672/SRR18549672_1.fastq
results/sra_downloads/SRR18549672/SRR18549672_2.fastq
results/sra_downloads/SRR18552868/SRR18552868.fastq
results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq
results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq
results/sra_downloads/compress/ERR2248142/ERR2248142.fastq