Memory requirements for very large database (> 500GB) #297
Just for storing the database in memory, VSEARCH requires at least 5 bytes of memory for each nucleotide in the database, plus some more for the headers and other information. With a database of 500 GB to 1 TB, I think you would need at least 3 to 6 TB of memory. I have never tested running VSEARCH with such a large database.
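For concreteness, a back-of-the-envelope version of that calculation (a rough sketch only; the 5 bytes/nucleotide figure is from the comment above, while the header overhead fraction is an assumption for illustration):

```python
# Rough memory estimate for loading a FASTA database into VSEARCH,
# based on the ~5 bytes per nucleotide figure mentioned above.

BYTES_PER_NUCLEOTIDE = 5          # from the comment above
HEADER_OVERHEAD_FRACTION = 0.2    # assumed extra for headers and other metadata (illustrative)

def estimate_memory_gb(database_size_gb: float) -> float:
    """Estimate RAM needed, treating database size as roughly equal to nucleotide count."""
    nucleotide_bytes = database_size_gb * BYTES_PER_NUCLEOTIDE
    return nucleotide_bytes * (1 + HEADER_OVERHEAD_FRACTION)

for size_gb in (500, 1000):  # 500 GB and 1 TB databases
    print(f"{size_gb} GB database -> roughly {estimate_memory_gb(size_gb) / 1000:.1f} TB of RAM")
```

With these assumptions the estimate comes out at roughly 3 TB for a 500 GB database and 6 TB for a 1 TB database, matching the figures above.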
Ouch, I can't have 3 to 6 TB consistently. Just out of curiosity, what would you recommend if one has to run queries against a very large database? Is splitting the database and combining the results of each query against each part (through some sort of rules) the only way? Is that how BLAST handles things when dividing databases into small chunks (max = 2 GB)? I would very much like an alternative to BLAST, and I was hoping vsearch would be the one. Thanks
I think I need some more information about the type of search you want to perform in order to be able to give a good answer. What type of sequences are the query and database sequences, and how long are they? Are you looking only for the top hit, or do you need more hits for each query? It may certainly be possible to split the database and then combine the results in some way.
The aim was basically to construct a large database that includes selected bacteria, viruses, animal genomes, etc., and to be able to classify a DNA fragment from samples of different sources. As a result, the database is larger than usual. The database will be a collection of reference genomes or draft genomes; the queries will be either reads or contigs assembled from reads. I need more than the top hit to make a reasonably conservative taxonomy assignment (using the LCA method). That's great to hear! Assuming I am to use vsearch for this purpose, what would you recommend basing the pooling decision on? Percentage identity alone? Is there a BLAST-like e-value or bit score in vsearch?
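For reference, the LCA idea mentioned here boils down to keeping only the deepest taxonomic rank shared by all retained hits. A minimal sketch, not tied to any particular tool (the example lineages are made up for illustration):

```python
# Minimal sketch of LCA-style taxonomy assignment from multiple hits.
# Lineages are written root-to-leaf as lists of rank names; in practice
# they would come from whatever reference taxonomy is used.

def lca(lineages):
    """Return the deepest taxonomic prefix shared by all hit lineages."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):          # walk ranks in parallel, root first
        if all(r == ranks[0] for r in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared

hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales", "Escherichia coli"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales", "Salmonella enterica"],
]
print(lca(hits))  # ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacterales']
```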
VSEARCH is designed to work with rather short sequences, like single reads or short fragments. It does not work well with longer sequences, e.g. 5 kb or longer, as it will be rather slow. Including entire genomes in the database is not recommended. I will therefore advise you to find another tool for this.
Hello, I did not know that the database sequences are supposed to be short fragments too? Or is it just the queries that need to be? Thanks
Both the queries and the database sequences are supposed to be rather short.
I see. I suppose NT and NR are kind of like collections of short fragments. However, I assume a lot of average users would attempt to make customized databases with reference genomes of bacteria etc., which are still millions of base pairs long. Is this because of semi-global alignment instead of local alignment? Thanks
NT and NR contain many long sequences in addition to the short ones. VSEARCH performs full optimal global alignment of the entire sequences instead of the hit-and-extend approach used in BLAST and other tools. This is why VSEARCH is so slow with long sequences: the alignment takes time proportional to the product of the lengths of the two sequences.
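To illustrate why full global alignment scales poorly with sequence length, here is a textbook Needleman-Wunsch sketch (not VSEARCH's actual implementation; the scoring values are arbitrary):

```python
# Textbook global (Needleman-Wunsch) alignment score. Note the nested loops over
# both sequences, giving O(len(a) * len(b)) time and memory. With two 5 Mb genomes
# that is ~2.5e13 matrix cells, which is why full global alignment of long sequences is slow.

def global_alignment_score(a, b, match=2, mismatch=-4, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[rows - 1][cols - 1]

print(global_alignment_score("ACGTACGT", "ACGAACGT"))  # one mismatch, otherwise identical
```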
If you are looking for faster (but less exact) alignment tools, may I suggest bbmap? It was designed for searching very large databases and is wildly fast. If my database were >500 GB, I would start there. If you are willing to spend more time to get more accuracy, you could try minimap2, written by the famous developer of bwa.
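A minimal sketch of driving minimap2 from Python for such a search (assuming minimap2 is installed and on the PATH; the file names are placeholders, and `-a` simply requests SAM output instead of the default PAF):

```python
# Run minimap2 on placeholder files via a subprocess call.
# reference.fa and contigs.fa are hypothetical inputs for illustration.
import subprocess

with open("alignments.sam", "w") as out:
    subprocess.run(
        ["minimap2", "-a", "reference.fa", "contigs.fa"],  # -a: emit SAM instead of PAF
        stdout=out,
        check=True,
    )
```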
Thanks! minimap2 looks like a direct upgrade to bwa-mem, as it excels at both short- and long-read mapping?!
That's the impression I got from the preprint, but I'm not sure. Any heuristic local aligner will be faster than vsearch, which is designed for optimal alignments of short reads. These are really different tools for different jobs. EDIT: According to the author, yes, "minimap2 is a much better mapper than bwa-mem in almost every aspect".
Hi, I am wondering what the memory requirements are for searching against a very large database (500 GB to 1 TB)? On the query side I can do the splitting, but splitting the database would produce less than optimal results.
Thanks