2Gbp limit on DB blocks? #88
We're testing daligner on a 13 gigabase genome de novo assembly, but daligner gives me the following error:

    daligner2.0: Fatal error, DB blocks are greater than 2Gbp!

Is there documentation somewhere on the limits? We have 41M reads; I tried just a subset of 10M reads, ~107 Gbp total, but it hits the same error. I've also tried different DBsplit sizes, but the error above doesn't seem to be part of that code path.
This system has 2TB of RAM, so it can use quite a bit more memory if that is the concern.
To provide just a bit more info about our situation, here are the totals:
Based on the 2Gbp number, I'd have to do something like 150-200 separate runs, correct?
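(Back-of-the-envelope, using only the numbers quoted in this thread, that estimate is roughly consistent with the 2Gbp cap:)

    # 10M reads ~ 107 Gbp      =>  ~10.7 kbp average read length
    # 41M reads * ~10.7 kbp    =>  ~440 Gbp in total
    # ~440 Gbp / 2 Gbp per block  =>  ~220 blocks
    # in the same ballpark as the 150-200 figure, allowing for rounding
    # at read boundaries and any length filtering (DBsplit -x)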
You need to call DBsplit to partition the database into blocks. daligner is not a "monolithic" application where you just call it on the data. You have to split the DB into blocks, which will be the unit of parallelism on your cluster runs, and you can use HPC.daligner to produce a script of commands that will compare all the blocks against each other.
-- Gene
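(For concreteness, a minimal sketch of that workflow, assuming the reads are already loaded into a database named MYDB with fasta2DB; the split size is illustrative:)

    # partition the DB into ~1000 Mbp blocks (-s is in Mbp, default 200)
    DBsplit -s1000 MYDB
    # emit a script covering every block-vs-block comparison
    HPC.daligner MYDB > daligner_jobs.sh
    # run locally, or submit each command to the cluster scheduler
    bash daligner_jobs.sh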
Thanks for the response! Yes, I'm calling DBsplit on these files, and running the first pair of blocks of each as a test run.
The default 200mb split size is clearly too small - it only uses ~50gb of ram per job and spends a lot of time on I/O. It also gives 1779 blocks and 396k jobs.
I have tried a 1000mb split size but it is still small - it only uses ~154gb of ram, with 356 blocks and 16k jobs. Quite a bit better, but still under 10% of available memory.
I have tried an 1800mb split size, which gives 198 blocks and 5k jobs, so that's getting more reasonable. It allocates about 180gb of ram and starts the "Comparing" stage, but then segfaults. I'm guessing the 2Gbp number is an estimate and I'm hitting the true limit here?
I would love to get to around a 500gb allocation; any split sizes of 3200mb and up give the error message above.
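(The reported counts line up with all-vs-all block comparison, assuming HPC.daligner bundles about four block comparisons per job, which I believe is its default:)

    # B blocks  =>  B*(B+1)/2 block-vs-block comparisons
    # 1779 blocks: 1779*1780/2 = 1,583,310 comparisons  =>  ~396k jobs
    #  356 blocks:  356*357/2  =    63,546 comparisons  =>   ~16k jobs
    #  198 blocks:  198*199/2  =    19,701 comparisons  =>    ~5k jobs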
Looks like you need to do repeat masking. Your genome seems highly repetitive.
-- Gene
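(For reference, a sketch of a masking pass with the companion DAMASKER module, assuming it is installed alongside daligner; the -g/-c values are placeholders, and the exact track names to pass via -m should be checked against the Catrack calls in the generated scripts:)

    # mask tandem repeats
    HPC.TANmask MYDB > tan.sh && bash tan.sh
    # mask high-copy interspersed repeats (-c is a coverage threshold)
    HPC.REPmask -g10 -c30 MYDB > rep.sh && bash rep.sh
    # then run the overlap jobs with those intervals soft-masked
    HPC.daligner -mtan -mrep MYDB > daligner_jobs.sh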
Yes, it's a highly repetitive hexaploid genome, and we have an initial assembly (which is highly collapsed), so it would be pointless to mask at this point. I'm happy to dig into the code, but I was just hoping for some explanation of the limits. It's difficult to tell whether this is an implementation-specific issue or a problem inherent to the algorithm itself when applied to data at this scale.