
Memory requirements for very large database (> 500GB) #297

Closed
Confurious opened this issue Mar 22, 2018 · 12 comments

@Confurious

Hi, I am wondering what the memory requirements are for searching against a very large database (500 GB - 1 TB)? On the query side I can do the splitting, but splitting the database would produce less than optimal results.
Thanks

@torognes
Owner

Just for storing the database in memory VSEARCH requires at least 5 bytes of memory for each nucleotide in the database, plus some more for the headers and other information. With a database of 500GB to 1TB I think you would need at least 3 to 6TB of memory. I have never tested running VSEARCH with such a large database.
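The estimate above can be sketched as a quick back-of-the-envelope calculation. The 5 bytes per nucleotide figure comes from the comment; the overhead factor for headers and other information is an illustrative assumption, not a documented vsearch number.

```python
# Rough estimate of RAM needed to hold a nucleotide database in memory,
# based on the ~5 bytes per nucleotide figure mentioned above.
# The 20% overhead for headers etc. is an illustrative assumption.

def estimate_memory_bytes(db_nucleotides, bytes_per_nt=5, overhead_factor=1.2):
    """Estimate RAM (in bytes) needed to hold the database in memory."""
    return db_nucleotides * bytes_per_nt * overhead_factor

# A 500 GB FASTA file holds very roughly 500e9 nucleotides.
tb = estimate_memory_bytes(500e9) / 1e12
print(f"~{tb:.1f} TB RAM")  # on the order of a few TB, matching the answer above
```

For a 1 TB database the same arithmetic lands around 6 TB of RAM, which is where the "3 to 6 TB" range in the answer comes from.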

@Confurious
Author

Ouch, I can't have 3 to 6 TB consistently. Just out of curiosity, what would you recommend if one has to run queries against a very large database? Is splitting the database and combining the results of each query against each chunk (through some sort of rules) the only way? Is that how BLAST handles things when it divides databases into small chunks (max = 2 GB)? I would very much like an alternative to BLAST, and I was hoping vsearch would be the one. Thanks

@torognes
Owner

I think I need some more information about the type of search you want to perform in order to be able to give a good answer. What type of sequences are the query and database sequence, how long are they? Are you looking only for the top hit, or do you need more hits for each query?

It may certainly be possible to split the database and then combine the results in some way.
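One hypothetical way to combine results from split-database searches: run the search against each chunk separately, then merge the per-chunk hit tables and keep the best hits per query. The sketch below assumes a blast6-style tab-separated layout (query, target, %identity, ...), which vsearch can emit with `--blast6out`; the file names and top-N cutoff are illustrative.

```python
# Sketch of the split-and-merge idea: search each database chunk
# separately, then merge per-chunk hit tables, keeping the top hits
# per query ranked by percent identity. Assumes blast6-style TSV rows:
# query_id, target_id, percent_identity, ... (remaining columns ignored).

import csv
from collections import defaultdict

def merge_hits(chunk_files, top_n=5):
    hits = defaultdict(list)  # query id -> list of (identity, target)
    for path in chunk_files:
        with open(path) as fh:
            for row in csv.reader(fh, delimiter="\t"):
                query, target, identity = row[0], row[1], float(row[2])
                hits[query].append((identity, target))
    # Keep only the top_n hits per query across all chunks.
    return {q: sorted(h, reverse=True)[:top_n] for q, h in hits.items()}
```

Note this simple merge is only valid for rank-by-identity criteria; a BLAST-style e-value would depend on total database size and could not be merged this naively.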

@Confurious
Author

Confurious commented Mar 23, 2018

The aim was basically to construct a large database that includes selected bacteria, viruses, animal genomes, etc., and to be able to classify a DNA fragment from samples of different sources. As a result, the database is larger than usual. The database will be a collection of reference genomes or draft genomes; the queries will be either raw reads or contigs assembled from reads. I need more than the top hit to make a reasonably conservative taxonomy assignment (using the LCA method).

That's great to hear! Assuming I use vsearch for this purpose, what would you recommend basing the pooled decision on? Percent identity alone? Is there a BLAST-like e-value or bit score in vsearch?
Thanks
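The LCA approach mentioned above can be sketched in a few lines: given the taxonomic lineages of all retained hits for a query, keep the longest shared prefix of ranks. The lineages below are illustrative examples, not part of any real database.

```python
# Minimal sketch of the lowest-common-ancestor (LCA) idea: the assigned
# taxonomy is the deepest rank shared by all retained hits for a query.

def lca(lineages):
    """Return the longest common prefix of a list of lineages (root first)."""
    common = []
    for ranks in zip(*lineages):
        if len(set(ranks)) == 1:
            common.append(ranks[0])
        else:
            break
    return common

hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Vibrionales"],
]
print(lca(hits))  # ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria']
```

A conservative variant would first filter hits by a percent-identity threshold before taking the LCA, which is the pooling decision asked about above.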

@torognes
Owner

torognes commented Apr 3, 2018

VSEARCH is designed to work with rather short sequences, like single reads or short fragments. It does not work well with longer sequences, e.g. 5kb or longer, as it will be rather slow. Including entire genomes in the database is not recommended. I will therefore advise you to find another tool for this.

@Confurious
Author

Confurious commented Apr 3, 2018 via email

@torognes
Owner

torognes commented Apr 3, 2018

Both the queries and database sequences are supposed to be rather short.

@Confurious
Author

Confurious commented Apr 3, 2018 via email

@torognes
Owner

torognes commented Apr 3, 2018

NT and NR contain many long sequences in addition to the short ones.

VSEARCH performs full optimal global alignment of the entire sequences instead of the hit-and-extend approach used in BLAST and other tools. This is why VSEARCH is so slow with long sequences: the time taken is proportional to the product of the lengths of the two sequences.
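The quadratic cost described above can be seen in a toy Needleman-Wunsch scorer: full global alignment fills a dynamic-programming table with one cell per pair of positions, so the work grows as len(a) × len(b). The scoring values here are illustrative, not vsearch's defaults.

```python
# Toy Needleman-Wunsch global alignment score, illustrating why full
# global alignment takes time proportional to len(a) * len(b):
# the DP table below has one cell per pair of positions.

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # leading gaps in b
        dp[i][0] = i * gap
    for j in range(1, cols):          # leading gaps in a
        dp[0][j] = j * gap
    for i in range(1, rows):          # O(len(a) * len(b)) cells in total
        for j in range(1, cols):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag, dp[i-1][j] + gap, dp[i][j-1] + gap)
    return dp[rows-1][cols-1]

print(nw_score("GATTACA", "GATTACA"))  # 7: perfect match of 7 bases
```

Aligning two 5 Mb genomes this way would mean ~2.5 × 10^13 cells, which is why a hit-and-extend heuristic is the practical choice for whole genomes.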

@colinbrislawn
Contributor

If you are looking for faster (but less exact) alignment tools, may I suggest bbmap? It was designed for searching very large databases and is wildly fast. If my database were >500 GB, I would start there.

If you are willing to spend more time to get more accuracy, you could try minimap2, written by the famous developer of bwa.

@Confurious
Author

Confurious commented Apr 3, 2018 via email

@colinbrislawn
Contributor

colinbrislawn commented Apr 3, 2018

That's the impression I got from the preprint, but I'm not sure.

Any heuristic local aligner will be faster than vsearch, which is designed for optimal alignments on short reads. These are really different tools for different jobs.

EDIT: According to the author, yes "minimap2 is a much better mapper than bwa-mem in almost every aspect".
