Helper for joining and splitting FASTQ files.
Requires Python 3.6 or above.
# Linux
## split
python3 join_and_split.py split -m fastq_file
## join
python3 join_and_split.py join -f forward.fastq -r reverse.fastq
# Windows
## split
python join_and_split.py split -m fastq_file
## join
python join_and_split.py join -f forward.fastq -r reverse.fastq
Use -t to set the linker text. By default, the program uses "JOINTEXT".
When splitting, "fastq_file" can be multiple files. Use "*.fastq" (including the quotation marks) to select all ".fastq" files in the current folder.
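Quoting the pattern keeps the shell from expanding it, so the pattern reaches the program as literal text. As a rough sketch of how such a pattern could be expanded in Python (the function name `expand_inputs` is hypothetical and is not taken from join_and_split.py):

```python
import glob


def expand_inputs(patterns):
    """Expand wildcard patterns into an ordered list of unique file names."""
    files = []
    seen = set()
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the literal text when nothing matches, so the caller
        # can report a clear "file not found" error later.
        for name in matches or [pattern]:
            if name not in seen:
                seen.add(name)
                files.append(name)
    return files


if __name__ == "__main__":
    # Lists every ".fastq" file in the current folder, if any exist.
    print(expand_inputs(["*.fastq"]))
```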
Divide NGS data by barcode and primer.
- Python 3.5 or above
- Biopython
- regex
- vsearch (Optional)
To install Biopython and regex, run as administrator:
pip install biopython regex
Support ambiguous bases.
Extend vsearch options. Improve output.
Integrate vsearch.
Use regex instead of BLAST. Faster and easier.
Parallel version, use BLAST.
Single core version. Use BLAST.
Deprecated.
It can handle merged paired-end sequences like this:
barcode-adapter-primer-sequence-primer-adapter-barcode
Or just handle one direction:
barcode-adapter-primer-sequence
Sequences are divided by barcode according to the given barcode file. If a barcode differs from the expected one by even a single base, the sequence is dropped.
Some protocols add an adapter sequence between the barcode and the primer. If your data do not have one, set the adapter length to zero with "--adapter 0". The default value is 14.
Use "-m" to set the barcode mode in the form "length*repeat". For example, "8*1" means an 8-base barcode that appears only once. The default is "5*2", i.e., a 5-base barcode repeated twice.
Note that the forward and reverse barcodes may be different sequences, but they SHOULD FOLLOW THE SAME MODE!
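To illustrate how the mode and adapter length describe the 5' end of a read, here is a minimal sketch. It assumes the repeated barcode copies are contiguous, so the barcode region spans length × repeat bases. The function names `parse_mode` and `split_head` are hypothetical and do not come from the program itself.

```python
def parse_mode(mode):
    """Parse a mode string such as "5*2" into (barcode_length, repeat)."""
    length, _, repeat = mode.partition("*")
    if not (length.isdigit() and repeat.isdigit()):
        raise ValueError("mode must look like LENGTH*REPEAT, e.g. 5*2")
    length, repeat = int(length), int(repeat)
    if length < 1 or repeat < 1:
        raise ValueError("length and repeat must be at least 1")
    return length, repeat


def split_head(read, mode="5*2", adapter_len=14):
    """Split a read into (barcode_region, adapter, remainder).

    Assumes the layout barcode-adapter-primer-sequence, with the
    barcode copies placed directly next to each other.
    """
    length, repeat = parse_mode(mode)
    barcode_end = length * repeat
    adapter_end = barcode_end + adapter_len
    if len(read) < adapter_end:
        raise ValueError("read is shorter than barcode plus adapter")
    return read[:barcode_end], read[barcode_end:adapter_end], read[adapter_end:]
```

With the defaults, the first 10 bases are treated as the barcode region and the next 14 as the adapter. Passing `adapter_len=0` mirrors "--adapter 0".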
Use "-s" or "--strict" to enable strict mode. In strict mode, the program checks whether the barcodes at the head (5') and tail (3') are equal and whether the barcode at the tail (3') is correct. Otherwise, it only checks the barcode at the head (5') of the sequence.
Barcode file looks like this:
sample,barcode-f,barcode-r
S0001,ATACG,ATACG
S0002,ATATA,TATAC
S0003,ATACG
...
Here barcode-f is the barcode at the 5' end and barcode-r is the barcode at the 3' end. All sequences should be written in the forward direction.
If the forward and reverse barcodes are the same, you can omit the reverse barcode in the table.
To avoid potential errors, please do not use spaces in the sample name.
Note that fields are separated by an English comma (","), not a Chinese full-width comma ("，").
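As a rough sketch of how a barcode file in this format could be read, including the rule that an omitted reverse barcode falls back to the forward one (the function name `read_barcodes` is hypothetical and is not the program's own parser):

```python
import csv
import io


def read_barcodes(handle):
    """Map sample name -> (forward_barcode, reverse_barcode)."""
    reader = csv.reader(handle)
    next(reader)  # skip the header line
    table = {}
    for row in reader:
        fields = [field.strip() for field in row]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or incomplete lines
        sample, forward = fields[0], fields[1]
        # An omitted reverse barcode defaults to the forward barcode.
        reverse = fields[2] if len(fields) > 2 and fields[2] else forward
        table[sample] = (forward.upper(), reverse.upper())
    return table


example = io.StringIO(
    "sample,barcode-f,barcode-r\n"
    "S0001,ATACG,ATACG\n"
    "S0002,ATATA,TATAC\n"
    "S0003,ATACG\n"
)
barcodes = read_barcodes(example)
```

For a real file, pass an open file object instead of the `io.StringIO` example, e.g. `open("barcode.csv", newline="")`, where the file name is only a placeholder.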
Primer file looks like this:
gene,forward,reverse
rbcL,ATCGATCGATCGA,TACGTACGTACG
matK,AAAATTTTCCCC,GGGGTTACCAAAA
...
Or:
gene,sequence
rbcL-f,ATCGATCGATCGA
rbcL-r,TACGTACGTACG
You can use Microsoft Excel to prepare these two files and save as CSV format, or use any text editor you prefer.
Make sure you don't miss the first line.
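The two primer layouts carry the same information. The sketch below reads either one into the same structure. It assumes that in the two-column layout the primer names end in "-f" or "-r", as in the example above. The function name `read_primers` is hypothetical.

```python
import csv
import io


def read_primers(handle):
    """Map gene name -> {"forward": seq, "reverse": seq} for either layout."""
    reader = csv.reader(handle)
    header = [field.strip().lower() for field in next(reader)]
    three_columns = len(header) >= 3
    primers = {}
    for row in reader:
        fields = [field.strip() for field in row]
        if len(fields) < (3 if three_columns else 2):
            continue  # skip blank or incomplete lines
        if three_columns:
            # Layout: gene,forward,reverse
            primers[fields[0]] = {
                "forward": fields[1].upper(),
                "reverse": fields[2].upper(),
            }
        else:
            # Layout: gene,sequence with names such as rbcL-f and rbcL-r
            gene, _, direction = fields[0].rpartition("-")
            key = {"f": "forward", "r": "reverse"}.get(direction.lower())
            if key is None or not gene:
                raise ValueError("primer name must end in -f or -r: " + fields[0])
            primers.setdefault(gene, {})[key] = fields[1].upper()
    return primers
```

Both example files shown above would give the same entry for rbcL, with the forward and reverse primers stored under one gene name.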
If you use a PBS job submission system, you can use this script to submit the task. It covers the whole workflow, from merging the two read directions with FLASH and join_fastq.py to dividing the merged sequences.