🧬 Combinatorial Bioinformatic Meta-Framework

Efficient Bioinformatics Workflows for High-Throughput Sequence Analysis

🔭 Overview

The Combinatorial Bioinformatic Meta-Framework (CBMF) is a single point of access to thousands of biomedical research software packages, with additional tools and resources for analyzing high-throughput sequencing data. It uses the micromamba package manager to create conda environments, install software, and manage dependencies. The framework is modular, so users can select only the tools they need for their specific analyses.

📄 Abstract

The rapid growth of next-generation sequencing technologies has generated an unprecedented volume of biological data, posing significant challenges for bioinformatics analysis. Traditional scripting approaches often lack reproducibility and can be complex for users without extensive programming expertise. This project introduces a collection of Linux shell scripts designed to address these challenges. These scripts implement standardized workflows for quality control, alignment, and report generation tasks across diverse datasets. By automating these processes, the scripts ensure reproducibility, minimize human error, and promote consistent data processing. This suite offers a scalable and reliable solution for comprehensive bioinformatics analysis, representing an important advancement in making high-throughput sequencing data more accessible and manageable for the broader research community.

🚀 Quick Start

Clone the repository and run the initialization script in the cbmf directory:

git clone https://github.com/rdnajac/cbmf && cd cbmf && cbmf init

Then skip ahead to the Sequencing Workflows section. Otherwise, read on for more information about the CBMF, how it works, and how to use it.

📦 Package Management

Bundled with the CBMF is the lightweight package manager micromamba, which provides access to the entire suite of bioinformatics software available on Bioconda¹, a channel for the conda package manager (including all available Bioconductor² software).

Micromamba is not a conda distribution, but a statically linked C++ executable that can be used to install conda environments. It is a lightweight binary that handles the installation of conda environments without root privileges, or the need for a base environment or a Python installation, making it ideal for use in high-performance computing clusters.

If you want to install it on your own, skip the init scripts and run:

"$SHELL" <(curl -L micro.mamba.pm/install.sh)

CBMF comes with some sensible defaults and pre-configured environments for common bioinformatics tasks, but users can easily create their own environments with micromamba.
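As a rough illustration of creating a custom environment, the sketch below wraps the same micromamba create call used elsewhere in this README. The create_env helper, the DRY_RUN switch, the environment name align, and the package list are all assumptions made for the example; they are not part of CBMF.

```shell
# Hypothetical helper: print (default) or run the micromamba command that
# creates an environment from the conda-forge and bioconda channels.
# Usage: create_env NAME PACKAGE...
create_env() {
    name=$1
    shift
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: only show the command that would be executed.
        echo "micromamba create --yes -n $name -c conda-forge -c bioconda $*"
    else
        micromamba create --yes -n "$name" -c conda-forge -c bioconda "$@"
    fi
}

# Example: an environment named "align" (illustrative) with three tools.
create_env align bwa bowtie2 samtools
```

Setting DRY_RUN=0 before the call runs the command for real; the dry-run default makes it easy to check the package list first.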

🧬 Sequencing Workflows

The following sections outline the steps involved in analyzing high-throughput sequencing data:

Demultiplexing
Quality Control
Alignment
Assembly and Quantification

But first, a quick note on file types and formats:

Caution

This table is a work in progress and may not be complete.

File Type	Description	Typical Extension
FASTQ	Raw reads from sequencer (often compressed)	.fastq, .fastq.gz
FASTA	Sequence data	.fasta, .fa, .fna
GTF	Gene Transfer Format (GFF2 variant)	.gtf, .gtf.gz
GFF	General Feature Format	.gff3, .gff3.gz
SAM	Sequence Alignment/Map	.sam, .sam.gz
BAM	Binary Alignment/Map	.bam
CRAM	Compressed Reference-oriented Alignment Map	.cram
VCF	Variant Call Format	.vcf
BED	Browser Extensible Data	.bed
TSV	Tab-Separated Values	.tsv
CSV	Comma-Separated Values	.csv
JSON	JavaScript Object Notation	.json
Markdown	Markup language for documentation	.md, .markdown
TXT	Plain Text	.txt
MD5	Checksum for file integrity	.md5, .txt
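To make the FASTQ row more concrete: FASTQ files conventionally store each read as four lines (header, sequence, separator, qualities), so the read count is the line count divided by four. The count_reads helper below is a hypothetical sketch that assumes this four-line layout and handles both plain and gzip-compressed files.

```shell
# Hypothetical helper: count reads in a FASTQ file, assuming the common
# four-lines-per-record layout. Handles .gz files via gzip -dc.
# Usage: count_reads FILE
count_reads() {
    case $1 in
        *.gz) lines=$(gzip -dc "$1" | wc -l) ;;
        *)    lines=$(wc -l < "$1") ;;
    esac
    # Four lines per record: header, sequence, separator, qualities.
    echo $((lines / 4))
}

# Example (file name is a placeholder):
# count_reads sample_R1.fastq.gz
```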

🔀 Demultiplexing

Skip this section and read about Quality Control if you have already received the demultiplexed FASTQ files.

The following command is the default bcl2fastq command for demultiplexing on the NextSeq, with the --no-lane-splitting option added to combine the reads from all four lanes into a single FASTQ file:

bcl2fastq --no-lane-splitting \
    --ignore-missing-bcls \
    --ignore-missing-filter \
    --ignore-missing-positions \
    --ignore-missing-controls \
    --auto-set-to-zero-barcode-mismatches \
    --find-adapters-with-sliding-window \
    --adapter-stringency 0.9 \
    --mask-short-adapter-reads 35 \
    --minimum-trimmed-read-length 35 \
    -R "$run_folder" \
    -o "$output_folder" \
    --sample-sheet "$sample_sheet"

You can copy and paste this command if you set the variables $run_folder, $output_folder, and $sample_sheet to the appropriate values.
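Before running the command, the three variables can be set and sanity-checked. The following is a minimal sketch: the paths are placeholders and check_demux_inputs is a hypothetical helper, not something shipped with CBMF or bcl2fastq.

```shell
# Hypothetical helper: verify that the run folder and sample sheet exist,
# then create the output folder. Returns non-zero if an input is missing.
# Usage: check_demux_inputs RUN_FOLDER OUTPUT_FOLDER SAMPLE_SHEET
check_demux_inputs() {
    if [ ! -d "$1" ]; then
        echo "run folder not found: $1" >&2
        return 1
    fi
    if [ ! -f "$3" ]; then
        echo "sample sheet not found: $3" >&2
        return 1
    fi
    mkdir -p "$2"
}

# Placeholder paths (assumptions); replace them with your own.
run_folder="/path/to/run_folder"
output_folder="/path/to/fastq_output"
sample_sheet="/path/to/SampleSheet.csv"

if check_demux_inputs "$run_folder" "$output_folder" "$sample_sheet"; then
    echo "inputs look fine; ready to run bcl2fastq"
fi
```

Checking the inputs first avoids starting a long demultiplexing run that fails on a mistyped path.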

Warning

As of this document's last revision, bcl2fastq is no longer supported; use BCL Convert (bcl-convert) if you used a recent Illumina sequencer (NovaSeq, NextSeq 1000/2000, etc.).

🔍 Quality Control

Quality control is an essential step in the analysis of high-throughput sequencing data. It allows us to assess the quality of the reads and identify any issues that may affect downstream analysis, like adapter contamination or low-quality reads. More interesting quality issues include GC bias, mitochondrial contamination, and over-representation of certain sequences.

Tool	Description	Source
FastQC³	Generates HTML reports containing straightforward metrics	GitHub
GATK⁴	Analyzes high-throughput sequencing data	GitHub
Picard Tools	Manipulates high-throughput sequencing data	Comes packaged with GATK4
MultiQC	Aggregates results from bioinformatics analyses	GitHub

FastQC, GATK, and Picard need a suitable Java Runtime Environment (JRE); MultiQC is a Python package. Let micromamba handle the installation of these runtimes and the tools from conda-forge and Bioconda:

micromamba create -n qc -c conda-forge -c bioconda fastqc gatk4 picard multiqc
micromamba run -n qc fastqc -o <output_dir> <fastq_file>
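With many samples, the FastQC call is typically repeated per file and the reports are then aggregated with MultiQC. The sketch below shows one way to do that; the run_qc helper, the DRY_RUN switch, and the assumption that inputs end in .fastq.gz are illustrative choices, not CBMF features.

```shell
# Hypothetical helper: run FastQC on every *.fastq.gz file in a directory,
# then aggregate the reports with MultiQC. With DRY_RUN=1 (the default)
# the commands are only printed.
# Usage: run_qc FASTQ_DIR OUTPUT_DIR
run_qc() {
    fastq_dir=$1
    out_dir=$2
    if [ "${DRY_RUN:-1}" = "1" ]; then
        run=echo
    else
        run=
        mkdir -p "$out_dir"   # FastQC expects the output directory to exist
    fi
    for fq in "$fastq_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue   # skip when the glob matches nothing
        $run micromamba run -n qc fastqc -o "$out_dir" "$fq"
    done
    # MultiQC scans the FastQC output directory and writes its report there.
    $run micromamba run -n qc multiqc -o "$out_dir" "$out_dir"
}

# Example (directory names are placeholders):
# DRY_RUN=0 run_qc fastq qc_reports
```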

You can also use the qc configuration file in the dev folder to create the environment:

micromamba env create -f dev/qc.yml
micromamba activate qc
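For reference, a conda-style environment file lists a name, channels, and dependencies. The snippet below writes a minimal example; the file name my-qc.yml and its contents are assumptions for illustration, and the actual dev/qc.yml may pin versions or include other tools.

```shell
# Write a minimal environment file. The file name and package list are
# illustrative; the real dev/qc.yml may differ.
env_file="${env_file:-my-qc.yml}"
cat > "$env_file" <<'EOF'
name: qc
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc
  - gatk4
  - picard
  - multiqc
EOF
echo "wrote $env_file"
```

The resulting file can be passed to micromamba env create -f in the same way as dev/qc.yml.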

Tip

After aligning the reads to the reference genome, these tools can be rerun on the resulting SAM/BAM files to confirm that the alignment was successful or to consolidate the results from paired-end sequencing.

🏗️ Alignment

Before we can analyze the data, we need to align the reads to a reference genome. That in turn requires downloading the reference genome and building the index files. The most recent major releases from NCBI Datasets can be found on the Genome Reference Consortium page.

Reference Genomes

species	assembly	release date	accession	ftp link
human	GRCh38	xxxx-xx-xx	GCA_000001405.15	ftp
mouse	GRCm39	xxxx-xx-xx	GCA_000001635.9	ftp

Tip

Skip building indexes from scratch and use the pre-built indexes for bowtie2, bwa, hisat2, and samtools in the seqs_for_alignment_pipelines.ucsc_ids folder. (It even has the GTF and GFF annotation files we'll need later.)
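If the pre-built indexes do not cover your genome, indexes can be built from a FASTA file. The sketch below only prints the usual index-building commands so they can be reviewed first; genome.fa is a hypothetical file name and print_index_commands is an illustrative helper.

```shell
# Hypothetical helper: print the index-building command for each tool.
# Usage: print_index_commands FASTA
print_index_commands() {
    fasta=$1
    prefix=${fasta%.*}   # strip the final extension, e.g. genome.fa -> genome
    echo "samtools faidx $fasta"
    echo "bwa index $fasta"
    echo "bowtie2-build $fasta $prefix"
    echo "hisat2-build $fasta $prefix"
}

# Example with a placeholder file name.
print_index_commands genome.fa
```

Piping the output to a shell (for example, print_index_commands genome.fa | sh) would run the commands inside an environment where the tools are installed. Indexing a mammalian genome can take a long time and considerable memory.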

Aligners

Tool	Description	Key Features	Best For	Source
BWA⁵	Maps short DNA sequences to reference genome	Uses Burrows-Wheeler Transform (BWT) for indexing; efficient for short reads; supports paired-end reads	Whole Genome Sequencing (WGS), Exome Sequencing	GitHub
STAR⁶	Specialized for RNA-Seq alignment	Uses seed-extension search; detects novel splice junctions; fast and accurate for long reads	RNA-Seq, especially with long reads	GitHub
HISAT2⁷	Splice-aware aligner for DNA and RNA sequences	Uses graph-based alignment; memory-efficient; supports both DNA and RNA alignment	RNA-Seq, WGS, particularly for large genomes	GitHub
Bowtie2⁸	Efficient short read aligner	Uses FM-index (based on the BWT); supports gapped, local, and paired-end alignment; memory-efficient	ChIP-seq, WGS	GitHub
Subread⁹	Seed-and-vote algorithm-based aligner	Fast and accurate; supports indel detection; includes read counting functionality (featureCounts)	RNA-Seq, DNA-Seq	GitHub
Subjunc¹⁰	Exon-exon junction detector (part of Subread package)	Detects novel exon-exon junctions; uses seed-and-vote algorithm; can be used independently or with Subread	RNA-Seq, specifically for junction detection	GitHub

FASTQ to BAM/CRAM

(Work in progress)
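One common route from paired-end FASTQ files to a sorted BAM and then a CRAM can be sketched as follows. This is an assumption about a typical BWA-MEM plus samtools workflow, not necessarily the one CBMF implements; print_align_commands and all file names are hypothetical, and the commands are printed rather than executed.

```shell
# Hypothetical helper: print a paired-end pipeline
# (BWA-MEM -> coordinate-sorted BAM -> BAM index -> CRAM).
# Usage: print_align_commands REF R1 R2 SAMPLE [THREADS]
print_align_commands() {
    ref=$1
    r1=$2
    r2=$3
    sample=$4
    threads=${5:-4}   # default to 4 threads when none is given
    # Align and sort in one pipe to avoid writing an intermediate SAM file.
    echo "bwa mem -t $threads $ref $r1 $r2 | samtools sort -@ $threads -o $sample.bam -"
    echo "samtools index $sample.bam"
    # CRAM stores reads relative to the reference, so the reference is required.
    echo "samtools view -C -T $ref -o $sample.cram $sample.bam"
}

# Example with placeholder file names.
print_align_commands genome.fa sample_R1.fastq.gz sample_R2.fastq.gz sample 8
```

The reference passed to samtools view -T must be the same FASTA that was used for alignment, and it is needed again whenever the CRAM is decoded.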

🔬 Assembly and Quantification

Read the wiki for details on experiment-specific processing and analysis.

📑 Additional Resources

Writing guides:

Nerd stuff:

Message boards:

FAQs:

FAQs - NCBI

👍 Acknowledgements

Shout out to these awesome docs:

Thank you to my labmates in the Palomero Lab for their feedback and guidance.

Footnotes

1. Grüning, B., Dale, R., Sjödin, A. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15, 475–476 (2018). https://doi.org/10.1038/s41592-018-0046-7
2. Gentleman, R.C., Carey, V.J., Bates, D.M. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80 (2004). https://doi.org/10.1186/gb-2004-5-10-r80
3. Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
4. McKenna A, Hanna M, Banks E, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297-1303. PMID: 20644199
5. Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. PMID: 19451168
6. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. PMID: 23104886
7. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357-360. PMID: 25751142
8. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-359. PMID: 22388286
9. Liao Y, Smyth GK, Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research. 2013;41(10):e108. PMID: 23558742
10. Liao Y, Smyth GK, Shi W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research. 2019;47(8):e47. PMID: 30783653
