
Memory requirements for very large database (> 500GB) #297

Closed
Confurious opened this issue Mar 22, 2018 · 12 comments

@Confurious

Hi, I am wondering what the memory requirements are for searching against a very large database (500 GB - 1 TB)? On the query side I can do the splitting, but splitting the database would produce less than optimal results.
Thanks

@torognes
Owner

Just for storing the database in memory VSEARCH requires at least 5 bytes of memory for each nucleotide in the database, plus some more for the headers and other information. With a database of 500GB to 1TB I think you would need at least 3 to 6TB of memory. I have never tested running VSEARCH with such a large database.
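The estimate above can be sketched as a quick back-of-the-envelope calculation. The 5 bytes per nucleotide figure comes from the comment; the overhead factor for headers and other information is an illustrative assumption, not a documented vsearch number.

```python
# Rough estimate of RAM needed to hold a nucleotide database in memory,
# based on the ~5 bytes per nucleotide figure mentioned above.
# The 20% overhead for headers etc. is an illustrative assumption.

def estimate_memory_bytes(db_nucleotides, bytes_per_nt=5, overhead_factor=1.2):
    """Estimate RAM (in bytes) needed to hold the database in memory."""
    return db_nucleotides * bytes_per_nt * overhead_factor

# A 500 GB FASTA file holds very roughly 500e9 nucleotides.
tb = estimate_memory_bytes(500e9) / 1e12
print(f"~{tb:.1f} TB RAM")  # on the order of a few TB, matching the answer above
```

For a 1 TB database the same arithmetic lands around 6 TB of RAM, which is where the "3 to 6 TB" range in the answer comes from.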

@Confurious
Author

Ouch, I can't have 3 to 6 TB consistently. Just out of curiosity, what would you recommend if one has to run queries against a very large database? Is splitting the database and combining the results of each query against each chunk (through some sort of rules) the only way? Is that how BLAST handles things when it divides databases into small chunks (max = 2 GB)? I would very much like an alternative to BLAST, and I was hoping vsearch would be the one. Thanks

@torognes
Owner

I think I need some more information about the type of search you want to perform in order to be able to give a good answer. What type of sequences are the query and database sequence, how long are they? Are you looking only for the top hit, or do you need more hits for each query?

It may certainly be possible to split the database and then combine the results in some way.
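One hypothetical way to combine results from split-database searches: run the search against each chunk separately, then merge the per-chunk hit tables and keep the best hits per query. The sketch below assumes a blast6-style tab-separated layout (query, target, %identity, ...), which vsearch can emit with `--blast6out`; the file names and top-N cutoff are illustrative.

```python
# Sketch of the split-and-merge idea: search each database chunk
# separately, then merge per-chunk hit tables, keeping the top hits
# per query ranked by percent identity. Assumes blast6-style TSV rows:
# query_id, target_id, percent_identity, ... (remaining columns ignored).

import csv
from collections import defaultdict

def merge_hits(chunk_files, top_n=5):
    hits = defaultdict(list)  # query id -> list of (identity, target)
    for path in chunk_files:
        with open(path) as fh:
            for row in csv.reader(fh, delimiter="\t"):
                query, target, identity = row[0], row[1], float(row[2])
                hits[query].append((identity, target))
    # Keep only the top_n hits per query across all chunks.
    return {q: sorted(h, reverse=True)[:top_n] for q, h in hits.items()}
```

Note this simple merge is only valid for rank-by-identity criteria; a BLAST-style e-value would depend on total database size and could not be merged this naively.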

@Confurious
Author

Confurious commented Mar 23, 2018

The aim was basically to construct a large database that includes selected bacteria, viruses, animal genomes, etc., and to be able to classify a DNA fragment from samples of different sources. As a result, the database is larger than usual. The database will be a collection of reference genomes or draft genomes; the queries will be either raw reads or contigs assembled from reads. I need more than the top hit to make a reasonably conservative taxonomy assignment (using the LCA method).

That's great to hear! Assuming I use vsearch for this purpose, what would you recommend basing the pooled decision on? Percent identity alone? Is there a BLAST-like e-value or bit score in vsearch?
Thanks
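The LCA approach mentioned above can be sketched in a few lines: given the taxonomic lineages of all retained hits for a query, keep the longest shared prefix of ranks. The lineages below are illustrative examples, not part of any real database.

```python
# Minimal sketch of the lowest-common-ancestor (LCA) idea: the assigned
# taxonomy is the deepest rank shared by all retained hits for a query.

def lca(lineages):
    """Return the longest common prefix of a list of lineages (root first)."""
    common = []
    for ranks in zip(*lineages):
        if len(set(ranks)) == 1:
            common.append(ranks[0])
        else:
            break
    return common

hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Vibrionales"],
]
print(lca(hits))  # ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria']
```

A conservative variant would first filter hits by a percent-identity threshold before taking the LCA, which is the pooling decision asked about above.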

@torognes
Owner

torognes commented Apr 3, 2018

VSEARCH is designed to work with rather short sequences, like single reads or short fragments. It does not work well with longer sequences, e.g. 5kb or longer, as it will be rather slow. Including entire genomes in the database is not recommended. I will therefore advise you to find another tool for this.

@Confurious
Author

Confurious commented Apr 3, 2018 via email

@torognes
Owner

torognes commented Apr 3, 2018

Both the queries and database sequences are supposed to be rather short.

@Confurious
Author

Confurious commented Apr 3, 2018 via email

@torognes
Owner

torognes commented Apr 3, 2018

NT and NR contain many long sequences in addition to the short ones.

VSEARCH performs full optimal global alignment of the entire sequences instead of the hit-and-extend approach used in BLAST and other tools. This is why VSEARCH is so slow with long sequences: the time taken is proportional to the product of the lengths of the two sequences.
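The quadratic cost described above can be seen in a toy Needleman-Wunsch scorer: full global alignment fills a dynamic-programming table with one cell per pair of positions, so the work grows as len(a) × len(b). The scoring values here are illustrative, not vsearch's defaults.

```python
# Toy Needleman-Wunsch global alignment score, illustrating why full
# global alignment takes time proportional to len(a) * len(b):
# the DP table below has one cell per pair of positions.

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # leading gaps in b
        dp[i][0] = i * gap
    for j in range(1, cols):          # leading gaps in a
        dp[0][j] = j * gap
    for i in range(1, rows):          # O(len(a) * len(b)) cells in total
        for j in range(1, cols):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag, dp[i-1][j] + gap, dp[i][j-1] + gap)
    return dp[rows-1][cols-1]

print(nw_score("GATTACA", "GATTACA"))  # 7: perfect match of 7 bases
```

Aligning two 5 Mb genomes this way would mean ~2.5 × 10^13 cells, which is why a hit-and-extend heuristic is the practical choice for whole genomes.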

@colinbrislawn
Contributor

If you are looking for faster (but less exact) alignment tools, may I suggest bbmap? It was designed for searching very large databases and is wildly fast. If my database were >500 GB, I would start there.

If you are willing to spend more time to get more accuracy, you could try minimap2, written by the famous developer of bwa.

@Confurious
Author

Confurious commented Apr 3, 2018 via email

@colinbrislawn
Contributor

colinbrislawn commented Apr 3, 2018

That's the impression I got from the preprint, but I'm not sure.

Any heuristic local aligner will be faster than vsearch, which is designed for optimal alignments on short reads. These are really different tools for different jobs.

EDIT: According to the author, yes "minimap2 is a much better mapper than bwa-mem in almost every aspect".
