A pipeline for open-reference clustering and annotation of amplicon sequencing data
Inputs:
- Input sequence libraries in FASTA format
- Reference database in FASTA format, with the taxonomy included either in the sequence description or in a separate file
- (optional, undocumented) Aligned reference sequences in FASTA format, which make it possible to join alignments made against different references
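The two ways of supplying the reference taxonomy can be read into the same lookup table. The following is a minimal sketch, not part of the pipeline itself. It assumes that the FASTA description is everything after the first whitespace on the header line, and that the separate file is tab-separated with the sequence ID in the first column; the actual layout expected by the pipeline is not stated here. The function names `read_fasta_taxonomy` and `read_taxonomy_table` are hypothetical.

```python
import io


def read_fasta_taxonomy(handle):
    """Map sequence ID -> header description (assumed to hold the taxonomy)."""
    taxonomy = {}
    for line in handle:
        if not line.startswith(">"):
            continue  # sequence lines are not needed for the lookup
        parts = line[1:].strip().split(None, 1)
        if not parts:
            continue  # skip an empty header line
        seq_id = parts[0]
        taxonomy[seq_id] = parts[1].strip() if len(parts) > 1 else ""
    return taxonomy


def read_taxonomy_table(handle):
    """Map sequence ID -> lineage from an (assumed) tab-separated file."""
    taxonomy = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        seq_id, _, lineage = line.partition("\t")
        taxonomy[seq_id] = lineage.strip()
    return taxonomy


# Usage with in-memory data; real input would come from open(path).
reference = io.StringIO(">ref1 Bacteria;Firmicutes\nACGT\n>ref2\nGGCC\n")
print(read_fasta_taxonomy(reference))
```

Both functions return a plain dictionary, so later steps do not need to know which of the two input styles was used.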
Steps of the pipeline:
For each input sample, align the reads to the reference database, keeping a single best hit per read, and write the alignments in SAM (.sam) format.
Run open-reference clustering using the provided software (test_openref.cpp). The resulting clusters may include reads from all input samples.
Convert the output formats as needed and/or add the taxonomy associated with the reference database sequences.
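To illustrate the data that flows between the alignment and annotation steps, here is a minimal sketch that reads one sample's SAM file, separates mapped from unmapped reads, and attaches taxonomy to the per-reference read counts. It does not reproduce the clustering done by test_openref.cpp, whose input and output formats are not described here. It relies only on the standard SAM columns (QNAME, FLAG, RNAME) and the unmapped flag bit 0x4; the function names, the example records, and the `"unassigned"` label are assumptions made for illustration.

```python
import io
from collections import Counter


def read_sam_hits(handle):
    """Return (hits, unmapped) for one sample.

    hits maps read name -> reference name; unmapped lists reads with no hit.
    """
    hits, unmapped = {}, []
    for line in handle:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a valid alignment record
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        if flag & 4 or rname == "*":
            unmapped.append(qname)  # flag bit 0x4 marks an unmapped read
        else:
            hits.setdefault(qname, rname)  # keep the first record per read
    return hits, unmapped


def annotate_counts(hits, taxonomy):
    """Count reads per reference and attach the reference taxonomy."""
    counts = Counter(hits.values())
    return [(ref, n, taxonomy.get(ref, "unassigned"))
            for ref, n in counts.most_common()]


# Usage with in-memory data; the records below are made up for illustration.
sam = io.StringIO(
    "@HD\tVN:1.0\n"
    "read1\t0\trefA\t1\t255\t4M\t*\t0\t0\tACGT\t*\n"
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\t*\n"
)
hits, unmapped = read_sam_hits(sam)
print(annotate_counts(hits, {"refA": "Bacteria;Firmicutes"}), unmapped)
```

In an open-reference setting, the reads that end up in `unmapped` are the ones that cannot be assigned to a reference sequence and would have to be clustered without one.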
The software was tested on Ubuntu Linux, using usearch to run the alignments.