Why is classify-consensus-vsearch so slow? #80

Closed
BenKaehler opened this issue Jul 12, 2017 · 4 comments

@BenKaehler
Member

BenKaehler commented Jul 12, 2017

classify-consensus-vsearch was 50 times slower than classify-consensus-blast in the runtime analysis for the paper. Users have noticed.

We should double check that there isn't anything strange going on.

@BenKaehler
Member Author

The plot thickens. It would appear that longish sequences make vsearch very slow. That may not be surprising, given that we are performing global alignments. I have attached the data files here and here.
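As a rough illustration of why length matters for global alignment, the sketch below counts the cells in a textbook Needleman-Wunsch dynamic-programming matrix for the two average sequence lengths reported in this thread (230 and 1428). This is an assumption made for intuition only: vsearch's actual implementation is vectorized and filters candidates by k-mers first, so real runtimes will not follow this ratio exactly. The function name `dp_cells` is hypothetical.

```python
def dp_cells(query_len: int, target_len: int) -> int:
    """Cells in a full global-alignment DP matrix (textbook Needleman-Wunsch).

    The matrix has one extra row and column for the empty prefix.
    """
    return (query_len + 1) * (target_len + 1)


# Average sequence lengths taken from the two vsearch runs in this thread.
short_cells = dp_cells(230, 230)
long_cells = dp_cells(1428, 1428)
ratio = long_cells / short_cells

print(f"cells at length 230:  {short_cells}")
print(f"cells at length 1428: {long_cells}")
print(f"ratio: {ratio:.1f}x")
```

Under this simplified model, each pairwise alignment between full-length sequences costs roughly 38 times as much as one between amplicon-length sequences, before accounting for how many candidate alignments are attempted per query.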

This command works ok, with an average sequence length of 230.

$ time vsearch --usearch_global query.fasta --id 0.97 --strand both --maxaccepts 1 --maxrejects 0 --output_no_hits --db ref.fasta --threads 1 --blast6out out
vsearch v2.0.3_linux_x86_64, 1007.8GB RAM, 56 cores
https://github.com/torognes/vsearch

Reading file ref.fasta 100%  
11493263 nt in 50020 seqs, min 32, max 250, avg 230
WARNING: 604 sequences shorter than 32 nucleotides discarded.
Masking 100%  
Counting unique k-mers 100%  
Creating index of unique k-mers 100%  
Searching 100%  
Matching query sequences: 5292 of 6184 (85.58%)

real	23m35.282s
user	16m24.472s
sys	0m0.148s

This command does not work ok, with an average sequence length of 1428. Note that I got sick of waiting and killed it when it was only 30% complete after more than 10 hours.

$ cat not_works_ok/command 
$ time vsearch --usearch_global 10000.fna --id 0.8 --strand both --maxaccepts 1 --maxrejects 0 --output_no_hits --db 10000.fna --threads 1 --blast6out out
vsearch v2.0.3_linux_x86_64, 1007.8GB RAM, 56 cores
https://github.com/torognes/vsearch

Reading file 10000.fna 100%  
14284110 nt in 10000 seqs, min 1274, max 2353, avg 1428
Masking 100%  
Counting unique k-mers 100%  
Creating index of unique k-mers 100%  
Searching 30%^C
real	613m24.420s
user	524m9.868s
sys	0m10.192s
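The two timings above can be put on a per-query footing with some quick arithmetic. The sketch below uses only the numbers in the logs (wall-clock time, 6184 queries in the first run, 10000 queries at 30% progress in the second) and assumes that the progress percentage is linear in the number of queries processed, which the logs do not confirm. The two runs also differ in database size and `--id` threshold, so this is not a controlled comparison.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a `time`-style 'XmY.YYYs' reading to seconds."""
    return minutes * 60 + seconds


# Run 1: average length 230, completed, 6184 queries.
short_real = to_seconds(23, 35.282)
short_per_query = short_real / 6184

# Run 2: average length 1428, killed at 30% of 10000 queries.
# ASSUMPTION: progress is proportional to queries processed.
long_real = to_seconds(613, 24.420)
long_queries_done = 0.30 * 10000
long_per_query = long_real / long_queries_done

# Projected wall-clock time had the second run been left to finish.
projected_hours = long_real / 0.30 / 3600

print(f"short run: {short_per_query:.3f} s/query")
print(f"long run:  {long_per_query:.3f} s/query")
print(f"slowdown:  {long_per_query / short_per_query:.1f}x per query")
print(f"projected total for long run: {projected_hours:.1f} h")
```

On these assumptions the long-sequence run spends roughly 50 times as long per query and would have needed about 34 hours to finish.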

@nbokulich
Member

@BenKaehler can we close this issue? It seems that we resolved this: vsearch (unlike, say, BLAST+) experiences a dramatic runtime increase when very long sequences are used.

However, vsearch works fine (with runtimes approximately equivalent to BLAST+) when amplicon sequences are used.

The issue is with vsearch itself, not with how we're wrapping it.
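Since the slowdown depends on sequence length, it may help to check the length distribution of a FASTA file before starting a search. The helper below is a minimal sketch, not part of vsearch or the plugin; `fasta_length_stats` is a hypothetical name, and it reports the same kind of summary that vsearch prints when reading a file (sequence count, min, max, average).

```python
from typing import Iterable, Tuple


def fasta_length_stats(lines: Iterable[str]) -> Tuple[int, int, int, float]:
    """Return (n_seqs, min_len, max_len, mean_len) for FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            # Sequence data may span several lines per record.
            current += len(line)
    if current is not None:
        lengths.append(current)
    if not lengths:
        raise ValueError("no FASTA records found")
    return len(lengths), min(lengths), max(lengths), sum(lengths) / len(lengths)


# Tiny in-memory example with two records of length 8 and 12.
example = [">a", "ACGT", "ACGT", ">b", "ACGTACGTACGT"]
n_seqs, shortest, longest, mean_len = fasta_length_stats(example)
print(n_seqs, shortest, longest, mean_len)
```

For a real file, the same function can be given an open file handle, for example `with open("ref.fasta") as fh: fasta_length_stats(fh)`. An average length well above typical amplicon sizes would suggest that long runtimes like the one reported above are possible.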

jairideout added the type:bug label on Aug 31, 2017
@jairideout
Member

@BenKaehler @nbokulich is this safe to close?

@nbokulich
Member

Yes, I think it is safe to close. It is not, in any case, a bug: vsearch just runs much more slowly on full-length sequences. I don't have write privileges, so I cannot close it. Thanks @jairideout!
