
Speeding up dnmtools states #26

Closed
hchetia opened this issue Oct 31, 2022 · 6 comments
hchetia commented Oct 31, 2022

Hi,
Is there a way to make dnmtools states use more memory/cores?

Best
H

andrewdavidsmith (Collaborator) commented

Possibly. I'll keep this open, probably rename the issue, and add detail so we have a roadmap for how to do that. However, there's no simple switch we can flip to make this happen right away.

hchetia (Author) commented Nov 1, 2022

Hi @andrewdavidsmith
Thanks for your response.
I ran dnmtools states on a 44 GB SAM file. It has been running for more than 100 hours and has generated only 3 MB of epireads.
A snippet from `top`:
[screenshot of `top` output]

It seems like `states` could be made capable of using more memory and cores. The input SAM could be split into temporary chunks ("samlets"), and those samlets could be read in parallel.

Thanks.
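The splitting idea above can be sketched in Python. The function and SAM records below are hypothetical illustrations for discussion, not part of dnmtools:

```python
# Hypothetical sketch: partition SAM records into per-chromosome chunks
# ("samlets") that could then be processed in parallel. Each samlet keeps
# a copy of the header so it remains a valid SAM fragment.

def split_by_chromosome(sam_lines):
    header, chunks = [], {}
    for line in sam_lines:
        if line.startswith("@"):          # header lines start with '@'
            header.append(line)
            continue
        chrom = line.split("\t")[2]       # RNAME is the 3rd SAM field
        chunks.setdefault(chrom, []).append(line)
    return {c: header + reads for c, reads in chunks.items()}

# Tiny fabricated example (only the first four SAM fields shown):
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t*",
    "r2\t0\tchr2\t50\t*",
    "r3\t0\tchr1\t200\t*",
]
samlets = split_by_chromosome(sam)
```

In a real implementation the samlets would be written to temporary files rather than held in memory, but the grouping logic would be the same.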

@hchetia hchetia closed this as completed Nov 1, 2022
@hchetia hchetia reopened this Nov 1, 2022
andrewdavidsmith (Collaborator) commented

That's not the expected behavior. If the reads are not sorted in the expected order, there's a chance the computation turns from linear time into quadratic time. If you can find a way to share the data with me, I can check. I know you might not want to share all of it, but the problem might not be reproducible on just a small part. Let me know, and feel free to email me.

andrewdavidsmith (Collaborator) commented Nov 1, 2022

The right thing for us to do is have the code verify that the reads are sorted; at the moment, the code sometimes attempts to just proceed and compensate when it gets unexpected input.

I also notice from your screen capture that the program is using 17.2g of vmem, but only 3.2g of pmem, which suggests something else is going on and the program is likely thrashing. Are you sure you have sufficient available physical memory for that process?
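The sort verification described in the first paragraph could look roughly like the sketch below. This is an illustrative Python version, not dnmtools' actual (C++) code:

```python
# Illustrative check: within each chromosome block of a SAM file, mapped
# read positions must be nondecreasing. Returns False on the first
# out-of-order record.

def reads_sorted(sam_lines):
    prev_chrom, prev_pos = None, -1
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.split("\t")
        chrom, pos = fields[2], int(fields[3])
        if chrom == "*":                  # unmapped read, no position
            continue
        if chrom == prev_chrom and pos < prev_pos:
            return False                  # position went backwards
        prev_chrom, prev_pos = chrom, pos
    return True

# Fabricated records for demonstration (first four SAM fields only):
sorted_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t*",
    "r2\t0\tchr1\t250\t*",
    "r3\t0\tchr2\t50\t*",
]
unsorted_sam = sorted_sam + ["r4\t0\tchr2\t10\t*"]
```

A check like this is a single linear pass, so it costs far less than the quadratic slowdown it guards against.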

hchetia (Author) commented Dec 6, 2022

@andrewdavidsmith You were right: the algorithm was compensating for unsorted input. With sorted input, the conversion to epireads now runs successfully and quickly.
In terms of accelerating the program, I agree that the code should verify the sorting first.
Run details: ~15 minutes to generate 1.5 GB of epireads from a 35 GB deduplicated, sorted SAM input (hg38).
Adding CPU info and meminfo here in case it's helpful to you-
CPU(s): 96; Threads per core: 2; Cores per socket: 24; Sockets: 2; NUMA nodes: 2
Vendor ID: GenuineIntel; CPU family: 6; Model: 85; Model name: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
Memory: 790 GB
I don't understand the thrashing part. Sharing another snippet here:
[screenshot of `top` output]

andrewdavidsmith (Collaborator) commented

@hchetia I'm closing this because I think the issue has been solved. The program should not continue if the input is unsorted in a way that would cause a slowdown. Specifically, all reads from the same chromosome need to be consecutive.
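The specific requirement stated here (each chromosome's reads forming one consecutive block, with no particular ordering required between chromosomes) is weaker than full coordinate sorting, and can be checked in one pass. A hypothetical sketch, not dnmtools' actual code:

```python
# Illustrative check: every chromosome's reads must form a single
# consecutive block in the SAM file. A chromosome that re-appears after
# another chromosome has intervened violates the requirement.

def chroms_consecutive(sam_lines):
    seen, prev = set(), None
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        chrom = line.split("\t")[2]       # RNAME is the 3rd SAM field
        if chrom != prev:
            if chrom in seen:             # chromosome seen earlier: not grouped
                return False
            seen.add(chrom)
            prev = chrom
    return True

# Fabricated records (first four SAM fields only):
grouped = ["r1\t0\tchr1\t1", "r2\t0\tchr1\t2", "r3\t0\tchr2\t1"]
interleaved = ["r1\t0\tchr1\t1", "r2\t0\tchr2\t1", "r3\t0\tchr1\t2"]
```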
