seqqs is a C program/library for gathering quality statistics from sequencing data
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
tests
wrappers
Makefile
README.md
khash.h
kseq.h
pairs.c
seqqs.c

README.md

seqqs

Seqqs (SEQuence Quality Statistics, pronounced "seeks") is a C library for quickly gathering quality statistics from sequence files. It's mostly adapted from qrqc, except it is designed to be run in quality processing pipelines. It can also be compiled as a dynamic library and called from other programs.

Seqqs is meant to check nucleotide composition, k-mer abundance, length distribution, and base quality at many points on in a quality control pipeline. Why might you want to do this? Quality control programs can misbehave — don't trust your tools or data (the "golden rule of bioinformatics"). In several cases, I've seen pathologically bad data quality lead to a program severely misbehaving to. This may lead to confounding during downstream analysis if uncaught, as one sequencing sample of initially poor quality may be overtrimmed, or many reads removed (I've seen this in practice, and the statistical consequences). It's much easier to put Seqqs in your pipeline, and quickly check the results to ensure both your data and tools are working as they should be.

Requirements and Installation

Seqqs can be compiled using GCC or Clang; compilation during development used the latter. Seqqs relies on Heng Li's kseq.h and khash.h, which is bundled with the source.

Seqqs requires Zlib, which can be obtained at http://www.zlib.net/.

To install, just run make in the seqqs directory.

Usage

Documentation is internal; just compile and run ./seqqs. Here are some usage examples.

Without any options, seqqs works like so:

cat in.fq | seqqs -
# or:
seqqs in.fq

Note that - tells seqqs to read from standard input. Without any options, this will create qual.txt, nucl.txt, and len.txt.

seqqs is designed to be placed in pipelines and act as a quality gathering step without disrupting the flow (similar to Unix tee). To enable this, use -e (for emit):

cat in.fq | seqqs -e -

For complex quality pipelines, seqqs can also take a prefix argument to prevent overwriting output files. If we wanted to create a complex workflow that gathers quality on raw input, gathers quality statistics, then trims using Heng Li's seqtk trimfq command, and then gathers output statistics, we could use:

cat in.fq | seqqs -e -p raw-$(date +%F) - | seqtk trimfq - | \
  seqqs -e -p trimmed-$(date +%F) > trimmed.fq

seqqs can also gather positional k-mers, which can help in discovering enrichment due to positional contaminants like untrimmed barcodes and adapters. As a quick aside: you should check for these! Many sequencing data set are plagued by positional contaminants, especially as barcoding grows in popularity. The k-mer option is -k <n> where n is the k-mer size:

cat in.fq | seqqs -k 6

seqqs can also work with interleaved paired-end files. The results are no different, but two output files (one for each set of reads in a pair) are created. These have the names like the default, except they have _1.txt and _2.txt suffixes. Also, seqqs will warn if pairing looks incorrect. If -s (strict) is set, seqqs will error out if interleaved pairs do not have the same name (ignoring /1 and /2 and excluding the comment).

Using Output

All tables are tab-delimited with headers, and can be easily analyzed by a program of your choice. qrqc will soon have functions to gather this output and make plots from it.

Todo

  • BAM support