BAM Statistics, Feature Counting and Annotation
Clone or download
Latest commit 61354e4 Sep 20, 2018
Permalink
Failed to load latest commit information.
client 10X Genomics example Sep 20, 2018
docker removed pkg Jul 10, 2018
example atac and chip examples Sep 20, 2018
gtf added GRCh38 Jan 23, 2018
maps ATAC-Seq Apr 20, 2018
motif motif parsing Aug 24, 2018
scripts moved Rscripts Aug 22, 2018
src updated htslib to v1.9 Sep 18, 2018
.gitignore ignore motif folder Aug 24, 2018
.gitmodules rm boost submodule May 17, 2018
.travis.yml boost travis May 17, 2018
LICENSE Initial commit Feb 25, 2016
Makefile initializer string htslib error Sep 18, 2018
README.md updated readme Aug 30, 2018
alfred.png logo: smaller font Aug 21, 2018

README.md

install with bioconda Anaconda-Server Badge Build Status Docker Build GitHub license GitHub Releases GitHub Issues

Alfred installation

Statically linked binaries are available from the Alfred github release page. There is also an Alfred Bioconda package.

Alfred installation from source

To build Alfred from source you need some build essentials and the Boost libraries, i.e. for Ubuntu:

apt install build-essential g++ cmake git-all liblzma-dev zlib1g-dev libbz2-dev liblzma-dev libboost-date-time-dev libboost-program-options-dev libboost-system-dev libboost-filesystem-dev libboost-iostreams-dev

Once you have installed these system libraries you can compile and link Alfred.

git clone --recursive https://github.com/tobiasrausch/alfred.git

cd alfred/

make all

make install

./bin/alfred -h

BAM Alignment Quality Control

Alfred computes various alignment metrics and summary statistics by read group

./src/alfred qc -r <ref.fa> -o qc.tsv.gz <align.bam>

Plotting alignment statistics

Rscript scripts/stats.R qc.tsv.gz

To convert all the alignment metrics from column format to row format for readability

zgrep ^ME qc.tsv.gz | cut -f 2- | datamash transpose | column -t

Interactive Quality Control Browser

Quality control metrics can be browsed interactively using the web front end of Alfred.

./src/alfred qc -r <ref.fa> -f json -o qc.json.gz <align.bam>

Then just upload the qc.json.gz file to the Alfred GUI https://gear.embl.de/alfred. A convenient feature of the web-front end is that multiple samples can be compared (as shown in the online example) if json files are merged prior to the upload.

python ./scripts/merge.py sample1.json.gz sample2.json.gz sampleN.json.gz | gzip -c > multisample.json.gz

BAM Alignment Quality Control for Targeted Sequencing

If target regions are provided, Alfred computes the average coverage for each target and the on-target rate.

./src/alfred qc -r <ref.fa> -b <targets.bed.gz> -o <qc.tsv.gz> <align.bam>

For instance, for a human whole-exome data set.

cd maps/ && Rscript exon.R

./src/alfred qc -r <hg19.fa> -b maps/exonic.hg19.bed.gz -o qc.tsv.gz <exome.bam>

Rscript scripts/stats.R qc.tsv.gz

Alternatively, one can use the interactive GUI and upload the json file.

./src/alfred qc -r <hg19.fa> -b maps/exonic.hg19.bed.gz -f json -o qc.json.gz <exome.bam>

BAM Alignment Quality Control for ATAC-Seq

For ATAC-Seq data, the insert size distribution should reveal the DNA pitch and a clear nucleosome pattern with a peak for single nucleosomes and dimers. The transcription start site (TSS) enrichment should be >5 for a good ATAC-Seq library and ideally the duplicate rate is <20%, the alignment rate >70% and the standardized SD in coverage >0.3.

cd maps/ && Rscript promoter.R

./src/alfred qc -r <hg19.fa> -b maps/hg19.promoter.bed.gz -o qc.tsv.gz <atac.bam>

Rscript scripts/stats.R qc.tsv.gz

zgrep ^ME qc.tsv.gz | datamash transpose | egrep "^Dup|^MappedFraction|^SD|^Enrich"

ATAC-Seq libraries often have a large number of mitochondrial reads depending on the library preparation.

zgrep ^CM qc.tsv.gz | egrep "Mapped|chrM"

Alternatively, one can use the interactive GUI and upload the json file.

./src/alfred qc -r <hg19.fa> -b maps/hg19.promoter.bed.gz -f json -o qc.json.gz <atac.bam>

BAM Feature Counting for RNA-Seq

Alfred can also assign reads to gene annotation features from a GTF file such as counting reads by gene or transcript identifier.

cd gtf/ && ./downloadGTF.sh

./src/alfred count_rna -g gtf/Homo_sapiens.GRCh37.75.gtf.gz <align.GRCh37.bam>

BAM Read Counting for DNA-Seq

For DNA sequencing, Alfred can be used to calculate the coverage in overlapping or non-overlapping windows or in given set of intervals.

./src/alfred count_dna -o <cov.gz> <align.GRCh37.bam>

To plot the whole-chromosome coverage profile for chr1-22 and chrX.

Rscript scripts/rd.R <cov.gz>

BAM Feature Annotation

Alfred can also be used to annotate peaks from ChIP-Seq or ATAC-Seq experiments. For instance to annotate overlapping/neighboring genes up to a distance of 10,000bp:

./src/alfred annotate -d 10000 -g gtf/Homo_sapiens.GRCh37.75.gtf.gz <peaks.bed>

Motif annotation can also be done. For example:

cd motif/ && ./downloadMotifs.sh

./src/alfred annotate -r <hg19.fa> -m motif/jaspar.gz <peaks.bed>

Example E. coli data set

The github source code includes a minimal example to check that alfred compiled properly from source and that the web front end is working.

./src/alfred qc -r example/E.coli.fa.gz -o example/stats.tsv.gz example/E.coli.cram

Rscript scripts/stats.R example/stats.tsv.gz

For the web front end.

./src/alfred qc -r example/E.coli.fa.gz -f json -o ecoli.json.gz example/E.coli.cram

Please upload ecoli.json.gz to the Alfred web application.

Example plots

The Alfred web application includes a multi-sample example for whole-exome data including multiple read-groups. Additional examples for whole-genome sequencing, long-read sequencing, Hi-C, ATAC-Seq and other data sets are coming soon.

Credits

htseq-count was used to benchmark and validate the RNA read counting.