findprimer

This program checks whether a primer or linker is present in a read taken directly from an Illumina sequencing machine, assuming that the primer or linker is at the beginning of the read. Given a primer such as gaaaatctctagca, or a linker such as GGCACATCGATTTCTGCGAGNNNNNNNNNNNNCTCCGCTTAAGGGACT, the program calculates the edit distance between the primer (or linker) and the beginning of each read, and outputs the coordinates where the read should be cut. Here are examples of the output:

@M03249:8:000000000-ABY6R:1:1101:19266:2177     1:N:0:0 GGCACATCGATTTCTGCGAGNNNNNNNNNNNNCTCCGCTTAAGGGACT        0       47      0
@M03249:8:000000000-ABY6R:1:1101:11955:2181     1:N:0:0 GGCACATCGATTTCTGCGAGNNNNNNNNNNNNCTCCGCTTAAGGGACT        0       48      1
@M03249:8:000000000-ABY6R:1:1101:16499:2188     1:N:0:0 GGCACATCGATTTCTGCGAGNNNNNNNNNNNNCTCCGCTTAAGGGACT        0       47      0
@M03249:8:000000000-ABY6R:1:1101:13346:2194     1:N:0:0 GGCACATCGATTTCTGCGAGNNNNNNNNNNNNCTCCGCTTAAGGGACT        0       47      0
@M03249:8:000000000-ABY6R:1:1101:20494:2279     1:N:0:0 GGCACATCGATTTCTGCGAGNNNNNNNNNNNNCTCCGCTTAAGGGACT        0       46      1

where 46, 47, and 48 are the 0-based positions of the last matched base, and the final number on each row is the edit distance. Reads should be filtered by edit distance; a maximum of about one tenth of the primer/linker length is normally a good choice.
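The matching step described above can be sketched as a small dynamic program that aligns the whole primer against every prefix of the read and keeps the best one. This is a minimal Python illustration, not the program's actual C++ code: the name find_primer, the tie-breaking rule (the shortest prefix with the lowest distance wins), and the -1 returned when no read base is used are all assumptions, and only the N, ?, and . wildcards are handled.

```python
def find_primer(primer, read):
    """Return (end, distance) for the best match of primer at the start of read.

    end is the 0-based index of the last read base covered by the primer,
    or -1 if the best alignment uses no read bases (assumed convention).
    """
    primer = primer.upper()          # the primer is upper-cased, the read is not
    wildcards = set("N?.")           # primer characters that match any base
    m = len(primer)

    # prev[i] = edit distance between primer[:i] and the read prefix seen so far
    prev = list(range(m + 1))
    best_end, best_dist = -1, m      # empty prefix: every primer base is missing

    for j, base in enumerate(read, start=1):
        cur = [j] + [0] * m
        for i in range(1, m + 1):
            p = primer[i - 1]
            cost = 0 if (p in wildcards or p == base) else 1
            cur[i] = min(prev[i - 1] + cost,   # match or mismatch
                         prev[i] + 1,          # extra base in the read
                         cur[i - 1] + 1)       # primer base missing from the read
        if cur[m] < best_dist:                 # strict '<' keeps the shortest prefix
            best_end, best_dist = j - 1, cur[m]
        prev = cur

    return best_end, best_dist
```

With a function like this, a filter such as `dist <= len(primer) // 10` would implement the rule of thumb above; rounding down is an assumption, since the text does not say how to round.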

trimprimer

Given sequence names and cut points, this program selects the listed reads and trims them. The sequence names must be in the same order as in the fastq file, because the program makes only a single sequential pass through the file when looking for the reads. Example input:

@M03249:8:000000000-ABY6R:1:1101:15413:1732	0	47	0
@M03249:8:000000000-ABY6R:1:1101:16481:1751	0	48	0
@M03249:8:000000000-ABY6R:1:1101:17183:1757	0	47	0
@M03249:8:000000000-ABY6R:1:1101:18023:1770	0	46	0
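The single-pass behavior can be sketched as follows. This is a hedged Python illustration rather than the actual trimprimer code: the function name trim_reads and the tuple-based record format are assumptions, and it keeps only reads listed in the cut list, removing bases 0 through end inclusive from both the sequence and the quality string.

```python
def trim_reads(records, cuts):
    """Yield trimmed FASTQ records in one sequential pass.

    records: iterable of (header, seq, plus, qual) tuples in file order.
    cuts:    iterable of (name, end) pairs in the same order as the reads;
             end is the 0-based index of the last base to remove.
    """
    cuts = iter(cuts)
    target = next(cuts, None)
    for header, seq, plus, qual in records:
        if target is None:
            break                        # no more reads to keep
        name = header.split()[0]         # read name without the comment
        if name != target[0]:
            continue                     # read is not listed: discard it
        end = target[1]
        yield header, seq[end + 1:], plus, qual[end + 1:]
        target = next(cuts, None)
```

Because the cut list is consumed in order and never rewound, a name that appears out of order is never found, which is why the ordering requirement matters.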

A typical workflow for an Illumina MiSeq run

1. Demultiplex with idemp.

idemp

1. Find the linker at the beginning of R1 reads and the primer in R2 reads.

findprimer -f R1.fastq.gz -p GGCACATCGATTTCTGCGAGNNNNNNNNNNNNCTCCGCTTAAGGGACT -o testR1.txt 
findprimer -f R2.fastq.gz -p gaaaatctctagca -o testR2.txt

1. Join the two tables by read name.

join testR1.txt testR2.txt | tr " " "\t" > R1R2.txt
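The `join` utility is specified for inputs that are sorted on the join field, so it is worth confirming that both tables list the reads in a compatible order. As an order-independent alternative, the same step can be sketched in Python with a dictionary lookup. The function name join_by_name and the list-of-fields row format are assumptions made for this illustration.

```python
def join_by_name(r1_rows, r2_rows):
    """Join two tables on their first field (the read name).

    Each row is a list of fields. The result keeps the order of r1_rows and
    drops reads that are missing from r2_rows, like a default `join`.
    """
    r2_by_name = {row[0]: row[1:] for row in r2_rows}
    return [row + r2_by_name[row[0]]
            for row in r1_rows
            if row[0] in r2_by_name]
```

Each joined row then holds the read name, the R1 fields, and the R2 fields, in that order.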

1. Get read names and cut positions for R1 and R2 separately

awk '$5>0 && $6<5 && $10>0 && $11<4 {OFS="\t"; print $1,$4,$5,$6}' R1R2.txt > R1cuts.txt
awk '$5>0 && $6<5 && $10>0 && $11<4 {OFS="\t"; print $1,$9,$10,$11}' R1R2.txt > R2cuts.txt
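For readers less familiar with awk, the two filters above can be sketched in Python. The column meanings are inferred from the example output earlier in this document (after the join: field 1 is the read name, fields 4 to 6 are the R1 start, end, and edit distance, and fields 9 to 11 are the same for R2), so treat them as assumptions. The name select_cuts and its parameters are hypothetical.

```python
def select_cuts(rows, r1_max_dist=4, r2_max_dist=3):
    """Split joined rows into R1 and R2 cut lists.

    rows: lists of string fields from the joined table (0-based indexing here,
    so awk's $5 is f[4]). A read pair is kept only if both ends were found
    (end > 0) and both edit distances are within their limits.
    """
    r1_cuts, r2_cuts = [], []
    for f in rows:
        r1_end, r1_dist = int(f[4]), int(f[5])
        r2_end, r2_dist = int(f[9]), int(f[10])
        if (r1_end > 0 and r1_dist <= r1_max_dist
                and r2_end > 0 and r2_dist <= r2_max_dist):
            r1_cuts.append((f[0], f[3], f[4], f[5]))    # name, start, end, distance
            r2_cuts.append((f[0], f[8], f[9], f[10]))
    return r1_cuts, r2_cuts
```

The defaults of 4 and 3 mirror the `$6<5` and `$11<4` tests in the awk commands. Because a pair is kept or dropped as a whole, the two cut files always list the same reads in the same order.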

1. Extract reads from R1 and R2.

trimprimer -f R1.fastq.gz -t R1cuts.txt -o R1.filt.fastq.gz
trimprimer -f R2.fastq.gz -t R2cuts.txt -o R2.filt.fastq.gz

1. Check for the linker and primer in the filtered files again; they should now give negative coordinates.

findprimer -f R1.filt.fastq.gz -p GGCACATCGATTTCTGCGAGNNNNNNNNNNNNCTCCGCTTAAGGGACT -o testR1.filt.txt 
findprimer -f R2.filt.fastq.gz -p gaaaatctctagca -o testR2.filt.txt

1. Map reads.

Installation

git clone https://github.com/yhwu/primer
cd primer
make
make test

Usage

[yhwu@local primer]$ ./findprimer
Usage:
   findprimer -f fastq -p primer -m n -o outFile

Options:
   fastq    fastq file
   primer   primer or linker sequence
   n        allowed base mismatches, optional, default=1+primer/20
   outFile  output folder, optional, default=.

Output: rows of the following columns
   sequence_name
   sequence_name_comment
   primer/linker
   primer/linker_length
   primer/linker_on_sequence
   edit distance
   primer_start	#0 based
   primer_end	#0 based

Note:
   1. primer is always converted to upper cases.
   2. sequence is not converted.
   3. N, ?, . in primer matches all.
   4. N in sequence does not match any.
   5. Other characters follow IUPAC nucleotide code.
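
One possible reading of the five matching rules above, as a minimal Python sketch. The table holds the standard IUPAC nucleotide ambiguity codes; the name base_matches is hypothetical, and the choice to let a primer wildcard match an N in the sequence is an assumption, since rules 3 and 4 do not say which one takes precedence.

```python
# Standard IUPAC nucleotide codes: each primer letter maps to the bases it accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}
WILDCARDS = set("N?.")   # rule 3: these primer characters match everything


def base_matches(primer_base, read_base):
    """Return True if one primer character accepts one read character."""
    p = primer_base.upper()              # rule 1: primer is upper-cased
    if p in WILDCARDS:
        return True
    # Rule 2: the read base is compared as-is, so lower case never matches.
    # Rule 4: an N in the read is not in any accepted set, so it never matches.
    return read_base in set(IUPAC.get(p, ""))
```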


[yhwu@local primer]$ ./trimprimer
Usage:
   trimprimer -f fastq -t trimfile -o outFile

Options:
   fastq    fastq file
   trimfile file of sequence names and cut points
   outFile  output folder, optional, default=.

Note: the trimfile should contain rows with the following fields
   sequence_name  start with @
   start          ignored, always cut from 0
   end            end is the last character removed, 0 based

Note:
   sequence_name does not contain sequence comment, so be careful
   with paired end reads.
   sequence not in the trim file will be discarded
   sequence names must be in the same order as in the fastq file
