🚦 Run Picard on BAM files and collate 90 metrics into one file.

picardmetrics

Run Picard tools and collate multiple metrics files. Check the quality of your sequencing data.

Summary

Run picardmetrics like this:

for bam in data/project1/sample?/sample?.bam
do
  # -k keeps the BAM file with marked duplicate reads
  # -r runs RNA-seq Picard metrics
  # -o specifies where to put the output files
  picardmetrics run -k -r -o out/rnaseq "$bam"
done

# The final output file will be called "project1-all-metrics.tsv"
picardmetrics collate project1 out/rnaseq

picardmetrics runs up to 12 Picard tools on each BAM file and collates all of the output files into a single table with up to 90 different metrics. It also automatically creates the .refFlat and .rRNA.list files required for CollectRnaSeqMetrics.
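Before loading the collated table into R, it can help to sanity-check it from the shell. The following is a minimal sketch, not part of picardmetrics itself. It assumes the collated file is tab-separated with one header row and one row per sample, which is what the `read.delim` call below also expects. The helper names `metrics_shape` and `metrics_column` are hypothetical.

```shell
#!/usr/bin/env bash

# Print the number of samples (data rows) and columns in a collated TSV.
# Assumes the first line is a header and fields are tab-separated.
metrics_shape() {
  local tsv="$1"
  awk -F'\t' '
    NR == 1 { cols = NF }
    END { printf "samples=%d columns=%d\n", (NR > 0 ? NR - 1 : 0), cols }
  ' "$tsv"
}

# Print the values of one column, looked up by its header name.
# Exits with status 1 if the header does not contain that name.
metrics_column() {
  local tsv="$1" name="$2"
  awk -F'\t' -v name="$name" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == name) col = i
      if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
      next
    }
    { print $col }
  ' "$tsv"
}

# Example usage, guarded so nothing happens if the file is absent.
if [ -f project1-all-metrics.tsv ]; then
  metrics_shape project1-all-metrics.tsv
  metrics_column project1-all-metrics.tsv PF_READS
fi
```

If the number of samples does not match the number of BAM files you processed, one of the `picardmetrics run` steps probably failed and is worth rerunning before you plot anything.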

See the picardmetrics manual for more details.

Next, plot and explore the metrics in R:

library(ggplot2)

dat <- read.delim("project1-all-metrics.tsv", stringsAsFactors = FALSE)

ggplot(dat) +
  geom_point(aes(PF_READS, PF_ALIGNED_BASES))

See two example BAM files in the data/ folder. The test/test.sh script illustrates the usage of picardmetrics and tests that it works correctly. See the outputs in the out/ folder. You can also download the reference files used to test picardmetrics.

Example

Figure: Genes detected vs. Mean MAPQ (left) and Percent of bases vs. Sample (right).

Use Picard to assess the quality of your sequencing data. This example shows RNA-seq data from hundreds of glioblastoma cells and gliomasphere cell lines.

On the left, each point represents an RNA-seq sample. We see that samples with high mean mapping quality have the greatest number of detected genes. Further, the color reveals variation in the percent of reads per sample that are assigned to exons.

On the right, each bar represents an RNA-seq sample. Each sample is broken down into the percent of sequenced bases coming from different genomic regions. We see that many samples have few sequenced bases coming from coding regions relative to intergenic regions.

Installation

# Download the code.
git clone https://github.com/slowkow/picardmetrics

cd picardmetrics

# Download and install the dependencies.
make get-deps PREFIX=~/.local

# Install picardmetrics and the man page.
make install PREFIX=~/.local

# Edit the configuration file for your project.
vim ~/picardmetrics.conf
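After installing, you may want to confirm that your shell can find the program. This sketch assumes that `make install PREFIX=~/.local` places the executable in `~/.local/bin`, which is the usual convention but is not stated here, so check the Makefile if in doubt. The helper `prepend_to_path` is a hypothetical name used only for illustration.

```shell
# Prepend a directory to PATH unless it is already listed there.
prepend_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present, nothing to do
    *) PATH="$1:$PATH" ;;
  esac
}

# Assumption: the install step put the executable under $PREFIX/bin.
prepend_to_path "$HOME/.local/bin"
export PATH

# Report whether the shell can now find picardmetrics.
if command -v picardmetrics >/dev/null 2>&1; then
  echo "picardmetrics found at: $(command -v picardmetrics)"
else
  echo "picardmetrics not found on PATH" >&2
fi
```

To make the change permanent, the same `PATH` line can go in your shell startup file.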

If you wish, you can install the dependencies manually instead of using make get-deps.

Contributing

Please submit an issue to report bugs or ask questions.

Please contribute bug fixes or new features with a pull request to this repository.

Related work

RNA-SeQC

RNA-SeQC is a Java program that computes a series of quality control metrics for RNA-seq data. The input can be one or more BAM files. The output consists of HTML reports and tab-delimited files of metrics data. This program can be valuable for comparing sequencing quality across different samples or experiments to evaluate different experimental parameters. It can also be run on individual samples as a means of quality control before continuing with downstream analysis.

RSeQC

The RSeQC package provides a number of useful modules that can comprehensively evaluate high-throughput sequence data, especially RNA-seq data. Some basic modules quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation, mapped reads distribution, coverage uniformity, strand specificity, etc.

QoRTs

The QoRTs software package is a fast, efficient, and portable multifunction toolkit designed to assist in the analysis, quality control, and data management of RNA-Seq datasets. Its primary function is to aid in the detection and identification of errors, biases, and artifacts produced by paired-end high-throughput RNA-Seq technology. In addition, it can produce count data designed for use with differential expression and differential exon usage tools, as well as individual-sample and/or group-summary genome track files suitable for use with the UCSC genome browser (or any compatible browser).