Material and methods

Genome DNA sequences and annotations were downloaded from Ensembl. Pyfaidx [1] was used to filter out non-canonical chromosomes from the genome sequence. Agat [2] was used to correct common issues found in Ensembl genome annotation files, filter out non-canonical chromosomes, and remove transcripts whose transcript support level (TSL) is NA. Samtools [3] and Picard [4] were used to index genome sequences.
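
The filtering logic described above can be illustrated with a minimal, standard-library-only sketch. It is not the pipeline's implementation (which relies on Pyfaidx and Agat). The set of canonical chromosome names, the function names, and the assumption that the TSL is stored in a ``transcript_support_level`` GTF attribute are illustrative choices made here.

```python
import re

# Assumption: "canonical" means chromosomes 1-22, X, Y and MT, named the
# Ensembl way (no "chr" prefix). Adjust this set for the organism in use.
CANONICAL = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

# Assumption: the TSL is stored in a 'transcript_support_level "..."' attribute.
TSL_PATTERN = re.compile(r'transcript_support_level "([^"]*)"')


def filter_fasta(lines, keep=CANONICAL):
    """Yield FASTA lines that belong to sequences whose name is in `keep`."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            keeping = bool(parts) and parts[0] in keep
        if keeping:
            yield line


def filter_gtf(lines, keep=CANONICAL):
    """Yield GTF lines on kept chromosomes whose TSL is not NA."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[0] not in keep:
            continue
        match = TSL_PATTERN.search(fields[8])
        if match and match.group(1).startswith("NA"):
            continue  # drop features annotated with TSL NA
        yield line
```

Unlike Agat, this sketch works line by line and does not repair the gene/transcript hierarchy, so a gene whose transcripts are all removed would be left without children.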

Raw fastq file quality was assessed with FastQC [5]. Raw fastq files were trimmed using Fastp [6]. Cleaned reads were aligned to the indexed Ensembl genome with Bowtie2 [7]. Sambamba [8] was used to sort, filter, mark duplicates, and compress aligned reads. Quality controls were performed on cleaned, sorted, deduplicated aligned reads using Picard [4] and Samtools [3]. Additional quality assessments were done with RSeQC [9], NGSderive [10], and GOleft [11]. Quality reports produced during both trimming and mapping steps were aggregated with MultiQC [12].
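
The order of the main steps above can be sketched as a list of command lines built in Python. This is an illustration only: the actual parameters used by the pipeline are not stated here, so the sketch assumes single-end reads, uses only basic input/output options of each tool, and omits the read-filtering step because its criteria are not given. The function name and all file names are hypothetical.

```python
def build_commands(sample, fastq, index_prefix, threads=4):
    """Return the ordered command lines for one single-end sample.

    Nothing is executed; each command is a list of arguments that could be
    passed to subprocess.run(). File names are hypothetical placeholders.
    """
    trimmed = f"{sample}.trimmed.fastq.gz"
    sam = f"{sample}.sam"
    unsorted_bam = f"{sample}.unsorted.bam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.markdup.bam"
    t = str(threads)
    return [
        # 1. Raw read quality assessment
        ["fastqc", fastq],
        # 2. Trimming, with JSON and HTML reports for later aggregation
        ["fastp", "-i", fastq, "-o", trimmed,
         "-j", f"{sample}.fastp.json", "-h", f"{sample}.fastp.html"],
        # 3. Alignment of cleaned reads to the indexed genome
        ["bowtie2", "-p", t, "-x", index_prefix, "-U", trimmed, "-S", sam],
        # 4. Compression (SAM to BAM), sorting, then duplicate marking
        ["sambamba", "view", "-S", "-f", "bam", "-o", unsorted_bam, sam],
        ["sambamba", "sort", "-t", t, "-o", sorted_bam, unsorted_bam],
        ["sambamba", "markdup", "-t", t, sorted_bam, dedup_bam],
        # 5. Aggregation of the reports found in the working directory
        ["multiqc", "."],
    ]


if __name__ == "__main__":
    for command in build_commands("sampleA", "sampleA.fastq.gz", "genome_index"):
        print(" ".join(command))
```

In the real workflow, Snakemake resolves this ordering from the input and output files declared by each rule rather than from a hard-coded list.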

The whole pipeline was run with Snakemake [13]. It is freely available on GitHub; details about installation, usage, and results can be found on the Snakemake workflow page.

[1] Shirley, Matthew D., et al. "Efficient 'pythonic' access to FASTA files using pyfaidx." No. e1196. PeerJ PrePrints, 2015.
[2] Dainat, J. "AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format." (Version v0.7.0). Zenodo. https://www.doi.org/10.5281/zenodo.3552717
[3] Li, Heng, et al. "The sequence alignment/map format and SAMtools." Bioinformatics 25.16 (2009): 2078-2079.
[4] McKenna, Aaron, et al. "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data." Genome Research 20.9 (2010): 1297-1303.
[5] Andrews, S. "FastQC: a quality control tool for high throughput sequence data." (2010).
[6] Chen, Shifu, et al. "fastp: an ultra-fast all-in-one FASTQ preprocessor." Bioinformatics 34.17 (2018): i884-i890.
[7] Langmead, Ben, and Steven L. Salzberg. "Fast gapped-read alignment with Bowtie 2." Nature Methods 9.4 (2012): 357-359.
[8] Tarasov, Artem, et al. "Sambamba: fast processing of NGS alignment formats." Bioinformatics 31.12 (2015): 2032-2034.
[9] Wang, Liguo, Shengqin Wang, and Wei Li. "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28.16 (2012): 2184-2185.
[10] McLeod, Clay, et al. "St. Jude Cloud: a pediatric cancer genomic data-sharing ecosystem." Cancer Discovery 11.5 (2021): 1082-1099.
[11] Pedersen, Brent S., et al. "Indexcov: fast coverage quality control for whole-genome sequencing." GigaScience 6.11 (2017): gix090.
[12] Ewels, Philip, et al. "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32.19 (2016): 3047-3048.
[13] Köster, Johannes, and Sven Rahmann. "Snakemake—a scalable bioinformatics workflow engine." Bioinformatics 28.19 (2012): 2520-2522.
Authors: Thibault Dayris
Version: 3.5.1 of 06/09/2024