Skip to content
Utility programs to trim or sort Illumina reads with adapter sequences
C++
Find file

README.rst

Illumina reads adapter screening utilities

This contains several utility programs that removes the adapters and low quality bases from Illumina reads.

Author:Haibao Tang (tanghaibao)
Contributor:Tristan Lefebure
Email:htang@jcvi.org
License:BSD

Installation

The program depends on the excellent SeqAn library. Please download the library, and place seqan/ in the same folder.

Please note that trimReads is no longer compatible with seqan-1.4+. For back-ward compatibility, a copy of older seqan is now included as seqan.tgz.

To install, run:

$ tar zxvf seqan.tgz
$ make

trimReads

Functionality emulates cutadapt. The adapter sequences are identified through Waterman-Eggert algorithm implemented in SeqAn. The quality trimming are a simple algorithm that takes the quality values, deduct a user specified cutoff, and then finds the max-sum segment. This method guarantees that the average base quality is higher than the user cutoff.

There are other options to cut adapters, including cutadapt and FASTX_TOOLKIT. The main advantage of this program:

  • Accepts an adapter FASTA file
  • Fast, robust and flexibility
  • Qual/adapter trimming in one step
  • Can trim both 5`- and 3`- end

Just run:

trimReads

to see a list of program options:

Illumina reads trimming utility
Author: Haibao Tang <htang@jcvi.org>

Usage: trimReads [options] fastqfile

  -h, --help                   displays this help message
  -o, --outfile                Output file name. (default replace suffix with .trimmed.fastq)
  -f, --adapterfile            FASTA formatted file containing the adapters for removal  (default adapters.fasta)
  -s, --score                  Minimum score to call adapter match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 15)
  -q, --quality-cutoff         Trim low-quality regions below quality cutoff. The algorithm is similar to the one used by BWA by finding a max-sum segment within the quality string. Set it to 0 to skip quality trimming.  (default 20)
  -m, --minimum-length         Discard trimmed reads that are shorter than LENGTH. (default 30)
  -Q, --quality-encoding       Read quality encoding for input file. 64 for Illumina, 33 for Sanger.  (default 64)
  -d, --discard-adapter-reads  Discard reads with adapter sequences rather than trim (default 0)

Find a list of adapters to remove (more will slow down search), default is adapters.fasta. When ready:

trimReads test.fastq

to get a trimmed file test.trimmed.fastq. To turn off the quality trimming, just set -q to 0:

trimReads -q 0 test.fastq

The detected adapter stretch will have quality values of AAAAAAAAAAA.... This will help you verify that the sequence masked is indeed adapters. For example:

@SNPSTER4:7:1:2:458#0/1 run=090205_SNPSTER4_0273_30GAUAAXX_PE
ATTGAAGTGTTTGGGGTTCAAACACCGACAGATCGGAAGAGCGGTTCAGCAGGAAAGCCGAGACACACATCGGTATCCGCTTTTTTTTTT
+
aba`aaa]a`aaaaaa]a_aa\aa`aa_^AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBB

sortPairedReads

This program sorts all read pairs into three sets:

  • Adapter set: the pairs with either /1 or /2 match adapters (in most cases both will match). These are fragments up to 1X read length.
  • Overlap set: the pairs with /1 and /2 having dovetail overlap. These are fragments up to 2X read length.
  • Clean set: survived the above two searches.

The reason for this sorting is to get rid of the short fragments (set 1 and set 2) commonly in the Illumina PE library. Some libraries are worse than others. The goal is to input the mated library within nominal insert size ranges.

Just run:

sortPairedReads

to see a list of program options:

Sort pairs of Illumina reads
Author: Haibao Tang <htang@jcvi.org>

Usage: sortPairedReads [options] fastqfile1 fastqfile2

  -h, --help               displays this help message
  -O, --nooverlap          Turn off overlapping reads detection, and do not create .overlap.fastq file. (default 0)
  -f, --adapterfile        FASTA formatted file containing the adapters for removal  (default adapters.fasta)
  -s, --adapterMatchScore  Minimum score to call adapter match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 15)
  -t, --endMatchScore      Minimum score to call dovetail match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 20)
  -Q, --quality-encoding   Read quality encoding for input file. 64 for Illumina, 33 for Sanger.  (default 64)
  -v, --verbose            Print alignments for debugging  (default 0)

For any given two fastq files, the output contains 4 files: fastqfile1.adapters.fastq (set 1), fastqfile1.overlap.fastq (set 2), fastqfile1.clean.fastq and fastqfile2.clean.fastq (set 3). For genome assembler inputs, I recommend discard set 1, treat set 2 as unmated, and treat set 3 as mated.

For example:

$ sortPairedReads s1.fastq s2.fastq
[0] Illumina_PE-1 found 0 times
[1] Illumina_PE-2 found 0 times
[2] Illumina_PE-1rc found 54 times
[3] Illumina_PE-2rc found 83 times

Processed 2500 sequences took 3.33262 seconds.

$ ls *.*.fastq
s1.clean.fastq  s2.clean.fastq  s1.adapters.fastq  s1.overlap.fastq

Turn -O on if you don't like .overlap.fastq:

$ sortPairedReads s1.fastq s2.fastq -O
...

$ ls *.*.fastq
s1.clean.fastq  s2.clean.fastq  s1.adapters.fastq
Something went wrong with that request. Please try again.