Genome DNA sequence and annotations were downloaded from Ensembl.
Pyfaidx [1] was used to filter out non-canonical
chromosomes. Agat [2] was used to correct common
issues found in Ensembl genome annotation files, filter out
non-canonical chromosomes, and remove transcripts with a TSL
(transcript support level) equal to NA. Samtools [3] and Picard [4]
were used to index genome sequences.
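
The two filtering steps described above can be illustrated with a minimal pure-Python sketch. It does not reproduce the Pyfaidx or Agat implementations used by the pipeline. The `CANONICAL` pattern (numbered chromosomes, X, Y and MT, without a "chr" prefix) and the function names `filter_fasta` and `filter_gtf` are assumptions made for illustration, as is the attribute spelling `transcript_support_level "NA"`.

```python
import re

# Assumption: "canonical" means numbered chromosomes plus X, Y and MT,
# named as in Ensembl files (no "chr" prefix).
CANONICAL = re.compile(r"^(\d+|X|Y|MT)$")


def filter_fasta(lines):
    """Yield only the FASTA lines that belong to canonical chromosomes."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first word after ">".
            fields = line[1:].split()
            keep = bool(fields) and CANONICAL.match(fields[0]) is not None
        if keep:
            yield line


def filter_gtf(lines):
    """Yield GTF lines on canonical chromosomes whose TSL is not NA."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header comments untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or not CANONICAL.match(fields[0]):
            continue  # malformed line or non-canonical sequence
        # Assumption: the TSL is stored as transcript_support_level "NA".
        if 'transcript_support_level "NA' in fields[8]:
            continue
        yield line


if __name__ == "__main__":
    fasta = [">1 dna:chromosome\n", "ACGT\n", ">KI270728.1 dna:scaffold\n", "TTTT\n"]
    print("".join(filter_fasta(fasta)), end="")
```

Both functions stream line by line, so they can be applied to an open file handle without loading a whole genome into memory.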
Raw fastq file quality was assessed with FastQC [5].
Raw fastq files were trimmed using Fastp [6]. Cleaned
reads were aligned to the indexed Ensembl genome with Bowtie2
[7]. Sambamba [8] was used to sort,
filter, mark duplicates, and compress aligned reads. Quality
controls were done on cleaned, sorted, deduplicated aligned reads
using Picard [4] and Samtools [3].
Additional quality assessments were done with RSeQC [9],
NGSderive [10], and GOleft [11].
Quality reports produced during both trimming and mapping steps
were aggregated with MultiQC [12].
The whole pipeline was powered by Snakemake [13].
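
As a rough illustration of how a workflow engine orders the steps described in this section, the sketch below encodes them as a dependency graph and sorts it topologically with the Python standard library (`graphlib`, Python 3.9 or later). It is not Snakemake's API and not the pipeline's actual rule set: the step labels in `STEPS` and the function `execution_order` are hypothetical names, and the dependencies are inferred from the text.

```python
from graphlib import TopologicalSorter

# Hypothetical step labels; each step maps to the set of steps it depends on.
STEPS = {
    "download_genome": set(),
    "filter_chromosomes": {"download_genome"},
    "index_genome": {"filter_chromosomes"},
    "fastqc_raw_reads": set(),
    "fastp_trimming": set(),
    "bowtie2_alignment": {"fastp_trimming", "index_genome"},
    "sambamba_processing": {"bowtie2_alignment"},
    "mapping_quality_controls": {"sambamba_processing"},
    "multiqc_report": {"fastp_trimming", "mapping_quality_controls"},
}


def execution_order(steps):
    """Return one valid execution order in which dependencies come first."""
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    for name in execution_order(STEPS):
        print(name)
```

Several orders are valid for this graph; the only guarantee is that each step appears after all of its dependencies.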
This pipeline is freely available on GitHub; details about
installation, usage, and results can be found on the
Snakemake workflow page.
[1] | Shirley, Matthew D., et al. Efficient "pythonic" access to FASTA files using pyfaidx. No. e1196. PeerJ PrePrints, 2015. |
[2] | Dainat J. AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format. (Version v0.7.0). Zenodo. https://www.doi.org/10.5281/zenodo.3552717 |
[3] | Li, Heng, et al. "The sequence alignment/map format and SAMtools." Bioinformatics 25.16 (2009): 2078-2079. |
[4] | McKenna, Aaron, et al. "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data." Genome research 20.9 (2010): 1297-1303. |
[5] | Andrews, S. "FastQC: a quality control tool for high throughput sequence data." (2010). |
[6] | Chen, Shifu, et al. "fastp: an ultra-fast all-in-one FASTQ preprocessor." Bioinformatics 34.17 (2018): i884-i890. |
[7] | Langmead, Ben, and Steven L. Salzberg. "Fast gapped-read alignment with Bowtie 2." Nature methods 9.4 (2012): 357-359. |
[8] | Tarasov, Artem, et al. "Sambamba: fast processing of NGS alignment formats." Bioinformatics 31.12 (2015): 2032-2034. |
[9] | Wang, Liguo, Shengqin Wang, and Wei Li. "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28.16 (2012): 2184-2185. |
[10] | McLeod, Clay, et al. "St. Jude Cloud: a pediatric cancer genomic data-sharing ecosystem." Cancer discovery 11.5 (2021): 1082-1099. |
[11] | Pedersen, Brent S., et al. "Indexcov: fast coverage quality control for whole-genome sequencing." Gigascience 6.11 (2017): gix090. |
[12] | Ewels, Philip, et al. "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32.19 (2016): 3047-3048. |
[13] | Köster, Johannes, and Sven Rahmann. "Snakemake—a scalable bioinformatics workflow engine." Bioinformatics 28.19 (2012): 2520-2522. |
Authors: | Thibault Dayris |
Version: | 3.5.1 of 06/09/2024 |