2Gbp limit on DB blocks? #88

Closed
pbnjay opened this issue Apr 30, 2019 · 5 comments

pbnjay commented Apr 30, 2019

We're testing daligner on a de novo assembly of a 13 Gbp genome, but it fails with the following error:

daligner2.0: Fatal error, DB blocks are greater than 2Gbp!

Is there documentation somewhere on the limits? We have 41M reads; I tried just a subset of 10M reads (~107 Gbp total), but it produces the same error as above. I've also tried different DBsplit sizes, but the error above doesn't seem to be part of that code path.

This system has 2 TB of RAM, so it can use quite a bit more memory if that is the concern.
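A guess at where the 2 Gbp figure might come from, not confirmed anywhere in this thread: if positions within a block are held in signed 32-bit integers, a block can address at most 2^31 - 1 bases, which would explain a ceiling of roughly 2 Gbp.

    # Assumption (not confirmed in this thread): intra-block coordinates are
    # signed 32-bit integers, so a block could span at most 2**31 - 1 bases.
    max_block_bp = 2**31 - 1
    print(f"{max_block_bp:,} bp  (~{max_block_bp / 1e9:.2f} Gbp per block)")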


pbnjay commented Apr 30, 2019

To provide just a bit more info about our situation, here are the totals:

 Statistics for all reads in the data set

      32,255,959 reads        out of      41,818,118  ( 77.1%)
 355,794,256,224 base pairs   out of 441,985,913,537  ( 80.5%)

Based on the 2Gbp number, I'd have to do something like 150-200 separate runs, correct?
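A quick sanity check on that estimate, using the trimmed total from the DBstats output above and assuming blocks are filled right up to the 2 Gbp cap:

    # Rough block count: trimmed bases divided by the 2 Gbp per-block cap.
    total_bp = 355_794_256_224      # trimmed base pairs from DBstats above
    block_cap_bp = 2_000_000_000    # 2 Gbp ceiling from the error message
    print(total_bp / block_cap_bp)  # ~178 blocks, i.e. inside the 150-200 range

Note that an all-against-all comparison of N blocks then requires N(N+1)/2 block pairs, so the number of individual daligner comparisons is far larger than the block count itself.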


thegenemyers commented May 2, 2019 via email


pbnjay commented May 2, 2019

Thanks for the response! Yes, I'm calling DBsplit on these files, and running the first pair of blocks of each as a test run.

The default 200 Mbp split size is clearly too small: each job only uses ~50 GB of RAM and spends a lot of time on I/O. It also yields 1,779 blocks and ~396k jobs.

I have tried a 1,000 Mbp split size, but it is still small: each job only uses ~154 GB of RAM, with 356 blocks and ~16k jobs. Quite a bit better, but still under 10% of available memory.

I have tried a 1,800 Mbp split size, which gives 198 blocks and ~5k jobs, so it's getting more reasonable. It allocates about 180 GB of RAM and starts the "Comparing" stage, but then segfaults. I'm guessing the 2 Gbp number is an estimate and I'm hitting the true limit here?

I would love to get to around a 500 GB allocation, but any split size of 3,200 Mbp and up gives the error message above.
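Those block and job counts are consistent with roughly total_bp / split_size blocks and an all-against-all schedule of N(N+1)/2 block pairs, bundled a few comparisons per job. The factor of 4 comparisons per job in the sketch below is an assumption that happens to reproduce the reported job counts; it is not stated anywhere in this thread.

    # Reproduce the reported block/job counts from the ~355.8 Gbp trimmed total
    # for a given DBsplit -s size (in Mbp).
    total_bp = 355_794_256_224
    comparisons_per_job = 4                    # assumed bundling factor
    for s_mbp in (200, 1000, 1800):
        blocks = round(total_bp / (s_mbp * 1_000_000))
        pairs = blocks * (blocks + 1) // 2     # all-against-all block pairs
        jobs = pairs // comparisons_per_job
        print(f"-s{s_mbp}: {blocks} blocks, ~{jobs:,} jobs")
    # -s200:  1779 blocks, ~395,827 jobs   (reported: 1779 blocks, 396k jobs)
    # -s1000:  356 blocks,  ~15,886 jobs   (reported:  356 blocks,  16k jobs)
    # -s1800:  198 blocks,   ~4,925 jobs   (reported:  198 blocks,   5k jobs)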


thegenemyers commented May 3, 2019 via email


pbnjay commented May 3, 2019

Yes, it's a highly repetitive hexaploid genome, and we have an initial assembly (which is highly collapsed), so it would be pointless to mask at this point.

I'm happy to dig into the code, but I was just hoping for some explanation of the limits. It's difficult to tell whether this is an implementation-specific issue or a problem inherent to the algorithm itself when applied to data at this scale.

pbnjay closed this as completed May 15, 2019