usearch_global match potential match bug #298

Closed
Felipealbornoz opened this issue Mar 22, 2018 · 1 comment

@Felipealbornoz

Hi, when I use -usearch_global to "blast" my OTUs against a custom database, it does not show the best hit. A particular OTU, searched against the database with --maxaccepts 1 and --id 0.97, matches SPECIES1 with 98% similarity. But when I use --maxaccepts 3, the third hit is SPECIES2 with 99.5% similarity. However, SPECIES2 never gets selected. I am using the following command:

vsearch -usearch_global OTUS.fasta --db db.fasta --id 0.97 --maxaccepts 1 --dbmatched dbmatched.fasta --notmatched notmatched.fasta --output_no_hits --blast6out otu.tax.csv

@torognes
Owner

When you run vsearch with usearch_global, it performs a search using a heuristic algorithm. That means that it is not guaranteed to find the best match, but it usually finds a very good match.

The heuristic looks at the number of shared k-mers (8-mers) between the query and each database sequence, and starts with the database sequences that have the most k-mers in common with the query. When you specify --maxaccepts 1, the search stops at the first sequence found that satisfies the similarity threshold set with the --id option (e.g. 97%). If you set a higher --maxaccepts value (e.g. 3), it will keep looking until that many (i.e. 3) sequences satisfy the similarity threshold, and report them in order of decreasing similarity.

If the sequence with the highest number of shared k-mers is not the one with the highest alignment similarity, you will get a suboptimal result when using --maxaccepts 1. This is probably what happened in the example you provided.

The option --maxrejects is also important: it sets how many database sequences below the similarity threshold will be considered before the search is stopped. The default is 32.
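To make the interaction between k-mer ranking, --maxaccepts and --maxrejects concrete, here is a minimal Python sketch of the idea described above. It is not vsearch's actual implementation: `toy_identity` is a stand-in that only counts matching positions in equal-length sequences (vsearch computes a real alignment), the sequences and the names `ref_clustered` and `ref_scattered` are made up for illustration, and `k=4` is used only because the toy sequences are short.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_identity(a, b):
    """Stand-in for alignment identity (assumption, not what vsearch does):
    fraction of matching positions in two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def search(query, db, min_id, maxaccepts=1, maxrejects=32, k=8):
    """Examine targets in order of decreasing shared k-mer count.

    Stops after maxaccepts hits at or above min_id, or after maxrejects
    targets below it. A value of 0 means "no limit" for either option.
    """
    qk = kmers(query, k)
    ranked = sorted(db, key=lambda name: len(qk & kmers(db[name], k)),
                    reverse=True)
    accepted, rejects = [], 0
    for name in ranked:
        ident = toy_identity(query, db[name])
        if ident >= min_id:
            accepted.append((name, ident))
            if maxaccepts and len(accepted) >= maxaccepts:
                break
        else:
            rejects += 1
            if maxrejects and rejects >= maxrejects:
                break
    # Report accepted hits in order of decreasing similarity.
    return sorted(accepted, key=lambda hit: hit[1], reverse=True)


# Made-up toy data: clustered mismatches destroy fewer k-mers than
# scattered ones, so the less similar target is ranked first.
QUERY = "ACGTTGCAAGCTTAGGCATC"
DB = {
    "ref_clustered": "ACGTTGCAACAATAGGCATC",  # 3 adjacent mismatches
    "ref_scattered": "ACGTTACAAGCTTACGCATC",  # 2 separated mismatches
}

print(search(QUERY, DB, min_id=0.8, maxaccepts=1, k=4))
print(search(QUERY, DB, min_id=0.8, maxaccepts=2, k=4))
```

With this toy data, `ref_clustered` shares more 4-mers with the query and is examined first, so `maxaccepts=1` reports it alone even though `ref_scattered` has the higher identity. With `maxaccepts=2` both are accepted and `ref_scattered` is listed first.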

To get more accurate results you could use --maxaccepts 1000 --maxrejects 1000, but it will take more time.

You could also use --maxaccepts 0 --maxrejects 0, which will cause vsearch to consider all database sequences. It will take much longer as all the heuristics are bypassed.

I hope this clarifies how vsearch and these options work.
