Illumina reads adapter screening utilities
This package contains several utility programs that remove adapters and low-quality bases from Illumina reads.
Author: Haibao Tang (tanghaibao)
The program depends on the excellent SeqAn library. Please download the library and place seqan/ in the same folder.
Please note that trimReads is no longer compatible with seqan-1.4+. For backward compatibility, a copy of an older seqan is now included as seqan.tgz.
To install, run:
    $ tar zxvf seqan.tgz
    $ make
The functionality emulates cutadapt. Adapter sequences are identified with the Waterman-Eggert algorithm as implemented in SeqAn. Quality trimming is a simple algorithm that takes the quality values, subtracts a user-specified cutoff from each, and then finds the max-sum segment. This method guarantees that the average base quality of the retained segment is higher than the user cutoff.
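The max-sum idea described above can be sketched in a few lines of Python. This is only an illustration, not the code used by trimReads: the function name quality_trim_span and its arguments are made up for this example, and the offset argument plays the role of the quality encoding (64 or 33).

```python
from typing import Tuple


def quality_trim_span(qualities: str, cutoff: int = 20, offset: int = 64) -> Tuple[int, int]:
    """Return (start, end) of the max-sum segment of (Q - cutoff); end is exclusive.

    Hypothetical helper for illustration; (0, 0) means nothing passes the cutoff.
    """
    best_sum, best_start, best_end = 0, 0, 0
    run_sum, run_start = 0, 0
    for i, ch in enumerate(qualities):
        # Decode the quality character and subtract the cutoff.
        run_sum += (ord(ch) - offset) - cutoff
        if run_sum <= 0:
            # A non-positive running sum cannot improve a later segment: restart.
            run_sum, run_start = 0, i + 1
        elif run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end


# Made-up read with Sanger-style (offset 33) qualities: '#' is Q2, 'I' is Q40.
seq = "ACGTACGTA"
qual = "##IIIII##"
start, end = quality_trim_span(qual, cutoff=20, offset=33)
print(seq[start:end], qual[start:end])  # GTACG IIIII
```

Because the chosen segment has a positive sum of (Q - cutoff), its average quality is necessarily above the cutoff, which is the guarantee mentioned above.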
Features:
- Accepts an adapter FASTA file
- Fast, robust, and flexible
- Quality and adapter trimming in one step
- Can trim both the 5'- and 3'-ends
Run trimReads -h to see a list of program options:
    Illumina reads trimming utility
    Author: Haibao Tang <firstname.lastname@example.org>
    Usage: trimReads [options] fastqfile
      -h, --help                    displays this help message
      -o, --outfile                 Output file name. (default replace suffix with
                                    .trimmed.fastq)
      -f, --adapterfile             FASTA formatted file containing the adapters for
                                    removal (default adapters.fasta)
      -s, --score                   Minimum score to call adapter match. Default
                                    scoring scheme for +1 match, -3 for
                                    mismatch/gapOpen/gapExtension. (default 15)
      -q, --quality-cutoff          Trim low-quality regions below quality cutoff.
                                    The algorithm is similar to the one used by BWA
                                    by finding a max-sum segment within the quality
                                    string. Set it to 0 to skip quality trimming.
                                    (default 20)
      -m, --minimum-length          Discard trimmed reads that are shorter than
                                    LENGTH. (default 30)
      -Q, --quality-encoding        Read quality encoding for input file. 64 for
                                    Illumina, 33 for Sanger. (default 64)
      -d, --discard-adapter-reads   Discard reads with adapter sequences rather than
                                    trim (default 0)
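To get a feel for what the -s threshold means under the +1/-3 scoring scheme, here is a small Python sketch. It is a plain Smith-Waterman score (best single local alignment), used as a simplified stand-in for the Waterman-Eggert search in SeqAn; the function name best_local_score and the test sequences are made up for this example.

```python
def best_local_score(read: str, adapter: str, match: int = 1, penalty: int = -3) -> int:
    """Best local alignment score with +1 per match and -3 per mismatch or gap base.

    Illustrative Smith-Waterman scoring only; not the SeqAn implementation.
    """
    prev = [0] * (len(adapter) + 1)
    best = 0
    for r in read:
        curr = [0]
        for j, a in enumerate(adapter, start=1):
            diag = prev[j - 1] + (match if r == a else penalty)  # match or mismatch
            up = prev[j] + penalty          # gap in the adapter
            left = curr[j - 1] + penalty    # gap in the read
            cell = max(0, diag, up, left)   # local alignment: never below zero
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


adapter = "AGATCGGAAGAGCGGTTCAG"            # 20-base adapter prefix, for illustration
read = "ACGT" + adapter + "TTTT"
print(best_local_score(read, adapter))      # 20
print(best_local_score(read, adapter) >= 15)  # True: passes the default -s 15
```

With this scheme a single mismatch costs as much as three matches earn, so a hit that reaches the default score of 15 must be a fairly long, nearly exact match.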
Prepare a FASTA file of adapters to remove (a longer list slows down the search); the default is adapters.fasta. When ready, run:
    trimReads test.fastq
to get a trimmed file test.trimmed.fastq. To turn off the quality trimming, just set -q 0:
    trimReads -q 0 test.fastq
The detected adapter stretch is masked with a single, constant quality value. This will help you verify that the masked sequence is indeed adapter. For example:
    @SNPSTER4:7:1:2:458#0/1 run=090205_SNPSTER4_0273_30GAUAAXX_PE
    ATTGAAGTGTTTGGGGTTCAAACACCGACAGATCGGAAGAGCGGTTCAGCAGGAAAGCCGAGACACACATCGGTATCCGCTTTTTTTTTT
    +
    aba`aaa]a`aaaaaa]a_aa\aa`aa_^AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBB
sortPairedReads sorts all read pairs into three sets:
- Adapter set: the pairs in which either /1 or /2 matches an adapter (in most cases both will match). These are fragments up to 1X read length.
- Overlap set: the pairs in which /1 and /2 have a dovetail overlap. These are fragments up to 2X read length.
- Clean set: the pairs that survive the above two searches.
The reason for this sorting is to get rid of the short fragments (sets 1 and 2) that are common in Illumina PE libraries. Some libraries are worse than others. The goal is to keep, as the mated library, only the pairs that fall within the nominal insert size range.
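The three-way sorting can be sketched in Python as below. This is a deliberately simplified illustration, not how sortPairedReads is implemented: it uses exact string matching instead of scored alignments, and the names revcomp, dovetail_overlap, and classify_pair are made up for this example. Under exact matching, an overlap of n bases scores n with +1 per match, so min_overlap=20 mirrors the default -t 20.

```python
from typing import Iterable

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def dovetail_overlap(read1: str, read2: str, min_overlap: int = 20) -> int:
    """Longest exact overlap between the end of /1 and the start of revcomp(/2), or 0."""
    mate = revcomp(read2)
    for n in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-n:] == mate[:n]:
            return n
    return 0


def classify_pair(read1: str, read2: str, adapters: Iterable[str], min_overlap: int = 20) -> str:
    """Assign a pair to 'adapters' (set 1), 'overlap' (set 2), or 'clean' (set 3)."""
    if any(a in read1 or a in read2 for a in adapters):
        return "adapters"   # fragment shorter than one read length
    if dovetail_overlap(read1, read2, min_overlap):
        return "overlap"    # fragment shorter than two read lengths
    return "clean"


adapters = ["AGATCGGAAGAGC"]          # short adapter prefix, for illustration only
fragment = "ACCGTTAGGCTA" * 5         # made-up 60-base fragment, read length 40
read1 = fragment[:40]
read2 = revcomp(fragment)[:40]
print(classify_pair(read1, read2, adapters))  # overlap
```

The adapter test runs first because a fragment shorter than one read length reads through into the adapter on both mates, which is why set 1 is checked before set 2.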
Run sortPairedReads -h to see a list of program options:
    Sort pairs of Illumina reads
    Author: Haibao Tang <email@example.com>
    Usage: sortPairedReads [options] fastqfile1 fastqfile2
      -h, --help                displays this help message
      -O, --nooverlap           Turn off overlapping reads detection, and do not
                                create .overlap.fastq file. (default 0)
      -f, --adapterfile         FASTA formatted file containing the adapters for
                                removal (default adapters.fasta)
      -s, --adapterMatchScore   Minimum score to call adapter match. Default
                                scoring scheme for +1 match, -3 for
                                mismatch/gapOpen/gapExtension. (default 15)
      -t, --endMatchScore       Minimum score to call dovetail match. Default
                                scoring scheme for +1 match, -3 for
                                mismatch/gapOpen/gapExtension. (default 20)
      -Q, --quality-encoding    Read quality encoding for input file. 64 for
                                Illumina, 33 for Sanger. (default 64)
      -v, --verbose             Print alignments for debugging (default 0)
For any given two fastq files, the output contains 4 files:
- fastqfile1.adapters.fastq (set 1)
- fastqfile1.overlap.fastq (set 2)
- fastqfile1.clean.fastq and fastqfile2.clean.fastq (set 3)
For genome assembler inputs, I recommend discarding set 1, treating set 2 as unmated, and treating set 3 as mated.
    $ sortPairedReads s1.fastq s2.fastq
    Illumina_PE-1 found 0 times
    Illumina_PE-2 found 0 times
    Illumina_PE-1rc found 54 times
    Illumina_PE-2rc found 83 times
    Processed 2500 sequences took 3.33262 seconds.
    $ ls *.*.fastq
    s1.clean.fastq s2.clean.fastq s1.adapters.fastq s1.overlap.fastq
Turn the -O option on if you don't want overlap detection; no .overlap.fastq file is created in that case:
    $ sortPairedReads s1.fastq s2.fastq -O
    ...
    $ ls *.*.fastq
    s1.clean.fastq s2.clean.fastq s1.adapters.fastq