Improve efficiency of link_ancestors #2442
Also, rough experiments suggest runtime is larger for a more recent set of census ancestors 😱 (edit: memory too, I think).
Hmm, where's the memory being used, do you think? What's the size of the actual output as you scale the sample size?
The number of rows in the output should be proportional to the recombination rate times the sample size (i.e. linear wrt sample size).
And as with ibd-segments, I suspect the size comes from the number of segments recorded internally at each ancestral node. This should be larger as you look at nodes higher up in the trees.
It would be interesting to see some data here and see what we can dig up. I've found mprof really helpful for memory stuff.
Thanks Jerome, will do!
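For reference, here's a minimal sketch of one way to get a peak-memory number with memory_profiler (the package that provides mprof). The run_link_ancestors function is a hypothetical placeholder for whatever workload is being profiled, and mprof itself is more commonly driven from the shell via `mprof run <script>` followed by `mprof plot`.

```python
from memory_profiler import memory_usage


def run_link_ancestors():
    # Hypothetical placeholder for the real workload, e.g. building the
    # AncestorMap / calling tables.link_ancestors on a simulated tree sequence.
    pass


if __name__ == "__main__":
    # Sample memory every 0.1 s while the workload runs and report the peak.
    samples = memory_usage((run_link_ancestors, (), {}), interval=0.1)
    print(f"peak memory: {max(samples):.1f} MiB")
```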
Okay, I used scalene instead of mprof, but it seems to confirm that this change would make a huge difference in practice.
I ran three simulations using the same demographic scenario, which contained a census at time=500, and then applied the AncestorMap (the Python mock-up of link_ancestors) to each of these, using the census nodes as the 'ancestors' input. I also used the same random seed in each case. Note that the table output in simulations (2) and (3) is identical, so any differences in memory/time/CPU etc. are due to computations happening after all the ancestral nodes have been processed.
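For context, here is a minimal sketch of the kind of setup being described, using msprime's census events and tskit's TableCollection.link_ancestors. The demographic model, sample size, sequence length, and seed below are placeholders rather than the parameters of the actual experiments.

```python
import msprime

# Placeholder demography: a single population with a census at time 500.
demography = msprime.Demography()
demography.add_population(name="pop0", initial_size=10_000)
demography.add_census(time=500)

ts = msprime.sim_ancestry(
    samples={"pop0": 1_000},      # placeholder sample size
    demography=demography,
    sequence_length=1e7,          # placeholder length
    recombination_rate=1e-8,      # placeholder rate
    random_seed=1234,             # placeholder seed
)

# The census nodes are the 'ancestors' argument to link_ancestors.
census_nodes = [node.id for node in ts.nodes() if node.time == 500]

tables = ts.dump_tables()
ancestor_edges = tables.link_ancestors(
    samples=list(ts.samples()), ancestors=census_nodes
)
print(ancestor_edges.num_rows, "rows in the output edge table")
```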
Here's the peak memory usage and runtime for each scenario:
Here is the full profiling output for simulation 2, and here is the script I was profiling (just a stripped-down copy of python/tests/simplify.py). If you scroll down to the function profile at the bottom, you can see that the vast majority of the memory goes towards initialising new segments, which are fed into the cumulative lists of ancestral segments.
I didn't directly measure the size of the table output, but there's strong evidence to suggest it's a negligible factor here. For simulations 2 and 3, this table must be at most 15 Mb (but probably a lot smaller), and given how big simulation 2 was, that looks pretty small compared to the size of the internal segment lists.
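To make that concrete, here is a rough, illustrative sketch of the kind of per-ancestor segment bookkeeping the profile points at. This is not the actual simplify.py / AncestorMap code; the Segment class and record_ancestry helper below are hypothetical stand-ins.

```python
# Illustrative only: a hypothetical stand-in for the per-ancestor segment
# bookkeeping, not the actual simplify.py / AncestorMap implementation.
import dataclasses
from collections import defaultdict


@dataclasses.dataclass
class Segment:
    left: float
    right: float
    node: int


# Maps ancestral node ID -> list of segments of sample ancestry it carries.
ancestry = defaultdict(list)


def record_ancestry(parent, left, right, child):
    # Every edge interval processed allocates a fresh Segment object. With
    # many edges and many overlapping intervals per ancestor, these small
    # allocations, not the final edge table, dominate peak memory.
    ancestry[parent].append(Segment(left, right, child))
```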
So, tl;dr -- I think this would be a good change to make!
I'm noticing that link_ancestors is scaling much more poorly than I intuitively expected (memory seems super-linear wrt sample size?! :( ). I'll think a bit more about why this might be, but there is one reasonably small 'edit' to the algorithm we could make that should improve things somewhat.
Currently, we're iterating through all of the edges in the edge table (see here) but we should actually just stop as soon as we reach a point in the edge table where the parent node IDs exceed the largest of the supplied ancestral node IDs.
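A rough sketch of that early exit is below, assuming (as with msprime output) that node IDs are assigned in time order, so parent IDs are non-decreasing along the time-sorted edge table. The function and variable names here are illustrative, not the actual simplify.py / link_ancestors internals.

```python
def edges_up_to_ancestors(edges, ancestors):
    """Yield edges until the parent ID exceeds the largest ancestral node ID.

    Assumes parent IDs are non-decreasing along the (time-sorted) edge table,
    which holds for msprime output where node IDs are assigned in time order.
    """
    max_ancestor = max(ancestors)
    for edge in edges:
        if edge.parent > max_ancestor:
            # No remaining edge can have one of the requested ancestors as a
            # parent, so the rest of the table can be skipped entirely.
            break
        yield edge
```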