Seqqs (SEQuence Quality Statistics, pronounced "seeks") is a C library for quickly gathering quality statistics from sequence files. It's mostly adapted from qrqc, except it is designed to be run in quality processing pipelines. It can also be compiled as a dynamic library and called from other programs.
Seqqs is meant to check nucleotide composition, k-mer abundance, length distribution, and base quality at many points in a quality control pipeline. Why might you want to do this? Quality control programs can misbehave: don't trust your tools or data (the "golden rule of bioinformatics"). In several cases, I've seen pathologically bad data quality lead to a program severely misbehaving too. If uncaught, this may lead to confounding during downstream analysis, as one sequencing sample of initially poor quality may be overtrimmed or have many reads removed (I've seen this in practice, along with the statistical consequences). It's much easier to put Seqqs in your pipeline and quickly check the results to ensure both your data and tools are working as they should be.
Requirements and Installation
Seqqs can be compiled using GCC or Clang; compilation during development used the latter. Seqqs relies on Heng Li's kseq.h and khash.h, which are bundled with the source.
Seqqs requires Zlib, which can be obtained at http://www.zlib.net/.
To install, just run make in the Seqqs source directory.
Documentation is internal; just compile and run ./seqqs. Here are some usage examples.
Without any options, seqqs works like so:
cat in.fq | seqqs - # or: seqqs in.fq
The - tells seqqs to read from standard input. Without any options, this will create the statistics tables under their default file names.
seqqs is designed to be placed in pipelines and act as a quality gathering step without disrupting the flow (similar to Unix tee). To enable this, use -e (for emit):
cat in.fq | seqqs -e -
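To make the pass-through idea concrete, here is a minimal Python sketch of a tee-like filter: it tallies read lengths while writing every record back out unchanged. It is an illustration only, not how seqqs is implemented. The function name tally_and_emit is hypothetical, and the sketch assumes unwrapped, four-line FASTQ records.

```python
import io
from collections import Counter


def tally_and_emit(in_stream, out_stream):
    """Copy 4-line FASTQ records from in_stream to out_stream unchanged,
    returning a Counter of the read lengths seen along the way."""
    lengths = Counter()
    while True:
        record = [in_stream.readline() for _ in range(4)]
        if not record[0]:  # empty string means end of input
            break
        lengths[len(record[1].rstrip("\n"))] += 1  # second line is the sequence
        out_stream.writelines(record)  # emit the record untouched
    return lengths


# Toy input: two reads of length 4 and 6.
fastq = "@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n"
out = io.StringIO()
lengths = tally_and_emit(io.StringIO(fastq), out)
```

In a real pipeline the two streams would be sys.stdin and sys.stdout, so the next tool downstream sees exactly the bytes it would have seen without the statistics step.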
For complex quality pipelines, seqqs can also take a prefix argument (-p) to prevent overwriting output files. If we wanted to create a workflow that gathers quality statistics on raw input, then trims using Heng Li's seqtk trimfq command, and then gathers statistics on the trimmed output, we could use:
cat in.fq | seqqs -e -p raw-$(date +%F) - | seqtk trimfq - | \
    seqqs -e -p trimmed-$(date +%F) - > trimmed.fq
seqqs can also gather positional k-mers, which can help in discovering enrichment due to positional contaminants like untrimmed barcodes and adapters. As a quick aside: you should check for these! Many sequencing data sets are plagued by positional contaminants, especially as barcoding grows in popularity. The k-mer option is -k <n>, where n is the k-mer size:
cat in.fq | seqqs -k 6 -
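For readers unfamiliar with positional k-mers, the following Python sketch shows the idea: count each k-mer separately at every start position, so a barcode or adapter that always sits at the same offset stands out. This is an illustrative sketch under simple assumptions, not the seqqs implementation; the function name positional_kmers and the toy reads are made up.

```python
from collections import Counter, defaultdict


def positional_kmers(seqs, k):
    """Return {position: Counter of k-mers starting at that 0-based position}."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for pos in range(len(seq) - k + 1):
            counts[pos][seq[pos:pos + k]] += 1
    return counts


# Hypothetical reads that all begin with the same untrimmed barcode "ACGT".
reads = ["ACGTTTGA", "ACGTCCAG", "ACGTGATC"]
counts = positional_kmers(reads, 4)

# The most common k-mer at position 0 and how many reads carry it.
top_kmer, top_count = counts[0].most_common(1)[0]
```

Here every read shares the same 4-mer at position 0, while later positions show no such enrichment; that contrast is the signature of a positional contaminant.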
seqqs can also work with interleaved paired-end files. The results are no different, but two output files (one for each set of reads in a pair) are created. These have names like the default, except they have _1.txt and _2.txt suffixes. Also, seqqs will warn if pairing looks incorrect. If -s (strict) is set, seqqs will error out if interleaved pairs do not have the same name (ignoring /1 and /2 and excluding the comment).
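The name comparison described above can be sketched in a few lines of Python. This is only an approximation of the rule as stated (drop the comment, drop a trailing /1 or /2, then compare); the exact logic seqqs uses may differ, and the helper names base_name and pairs_match are hypothetical.

```python
def base_name(header):
    """Strip the leading '@', any comment after whitespace, and a trailing /1 or /2."""
    parts = header.lstrip("@").split()
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def pairs_match(header1, header2):
    """True if two interleaved FASTQ headers refer to the same read pair."""
    return base_name(header1) == base_name(header2)


# Toy headers: the first pair matches, the second does not.
ok = pairs_match("@read7/1 lane=3", "@read7/2 lane=3")
bad = pairs_match("@read7/1", "@read8/2")
```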
All tables are tab-delimited with headers, and can be easily analyzed by a program of your choice. qrqc will soon have functions to gather this output and make plots from it.
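As one example of a "program of your choice", here is a minimal Python sketch that reads a tab-delimited table with a header row using the standard csv module. The column names (A, C, G, T) and the sample values are made up for illustration; check the headers of the actual output files before relying on them.

```python
import csv
import io


def read_table(handle):
    """Read a tab-delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up stand-in for an output table; real column names may differ.
sample = "A\tC\tG\tT\n10\t20\t30\t40\n25\t25\t25\t25\n"
rows = read_table(io.StringIO(sample))

# Fraction of G+C in each row of this toy table.
gc = [
    (int(r["G"]) + int(r["C"])) / sum(int(v) for v in r.values())
    for r in rows
]
```

With a real file, the same function can be given an open file handle in place of the io.StringIO object.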
- BAM support