Improve efficiency of link_ancestors #2442
Also, rough experiments suggest runtime is larger for a more recent set of census ancestors 😱 (edit: memory too, I think).
Hmm, where's the memory being used, do you think? What's the size of the actual output as you scale the sample size?
The number of rows in the output should be proportional to the recombination rate times the sample size (i.e. linear wrt sample size).
And as with ibd-segments, I suspect the size comes from the number of segments recorded internally at each ancestral node. This should be larger as you look at nodes higher up in the trees.
It would be interesting to see some data here and see what we can dig up. I've found mprof really helpful for memory stuff.
Thanks Jerome, will do!
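For reference, here's a minimal sketch of one way to get a peak-memory number with memory_profiler (the package that provides mprof). The run_link_ancestors function is a hypothetical placeholder for whatever workload is being profiled, and mprof itself is more commonly driven from the shell via `mprof run <script>` followed by `mprof plot`.

```python
from memory_profiler import memory_usage


def run_link_ancestors():
    # Hypothetical placeholder for the real workload, e.g. building the
    # AncestorMap / calling tables.link_ancestors on a simulated tree sequence.
    pass


if __name__ == "__main__":
    # Sample memory every 0.1 s while the workload runs and report the peak.
    samples = memory_usage((run_link_ancestors, (), {}), interval=0.1)
    print(f"peak memory: {max(samples):.1f} MiB")
```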
Okay, I used scalene instead of mprof, but it seems to confirm that this change would make a huge difference in practice.
I ran three simulations using the same demographic scenario, which contained a census at time=500, and then applied the AncestorMap (the Python mock-up of link_ancestors) to each of these, using the census nodes as the 'ancestors' input. I also used the same random seed in each case. Note that the table output in simulations (2) and (3) is identical, so any differences in memory/time/CPU etc. are due to computations happening after all the ancestral nodes have been processed.
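For context, here is a minimal sketch of the kind of setup being described, using msprime's census events and tskit's TableCollection.link_ancestors. The demographic model, sample size, sequence length, and seed below are placeholders rather than the parameters of the actual experiments.

```python
import msprime

# Placeholder demography: a single population with a census at time 500.
demography = msprime.Demography()
demography.add_population(name="pop0", initial_size=10_000)
demography.add_census(time=500)

ts = msprime.sim_ancestry(
    samples={"pop0": 1_000},      # placeholder sample size
    demography=demography,
    sequence_length=1e7,          # placeholder length
    recombination_rate=1e-8,      # placeholder rate
    random_seed=1234,             # placeholder seed
)

# The census nodes are the 'ancestors' argument to link_ancestors.
census_nodes = [node.id for node in ts.nodes() if node.time == 500]

tables = ts.dump_tables()
ancestor_edges = tables.link_ancestors(
    samples=list(ts.samples()), ancestors=census_nodes
)
print(ancestor_edges.num_rows, "rows in the output edge table")
```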
Here's the peak memory usage and runtime for each scenario:
Here is the full profiling output for simulation 2, and here is the script I was profiling (just a stripped-down copy of python/tests/simplify.py). If you scroll down to the function profile at the bottom, you can see that the vast majority of the memory goes towards initialising new segments, which are fed into the cumulative lists of ancestral segments.
I didn't directly measure the size of the table output, but there's strong evidence to suggest it's a negligible factor here. For simulations 2 and 3, this table must be at most 15 Mb (but probably a lot smaller), and given how big simulation 2 was, that looks pretty small compared to the size of the internal segment lists.
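To make that concrete, here is a rough, illustrative sketch of the kind of per-ancestor segment bookkeeping the profile points at. This is not the actual simplify.py / AncestorMap code; the Segment class and record_ancestry helper below are hypothetical stand-ins.

```python
# Illustrative only: a hypothetical stand-in for the per-ancestor segment
# bookkeeping, not the actual simplify.py / AncestorMap implementation.
import dataclasses
from collections import defaultdict


@dataclasses.dataclass
class Segment:
    left: float
    right: float
    node: int


# Maps ancestral node ID -> list of segments of sample ancestry it carries.
ancestry = defaultdict(list)


def record_ancestry(parent, left, right, child):
    # Every edge interval processed allocates a fresh Segment object. With
    # many edges and many overlapping intervals per ancestor, these small
    # allocations, not the final edge table, dominate peak memory.
    ancestry[parent].append(Segment(left, right, child))
```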
So, tl;dr -- I think this would be a good change to make!
I'm noticing that link_ancestors is scaling much more poorly than I intuitively expected (memory seems super-linear wrt sample size?! :( ). I'll think a bit more about why this might be, but there is one reasonably small 'edit' to the algorithm we could make that should improve things somewhat.
Currently, we're iterating through all of the edges in the edge table (see here) but we should actually just stop as soon as we reach a point in the edge table where the parent node IDs exceed the largest of the supplied ancestral node IDs.
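A rough sketch of that early exit is below, assuming (as with msprime output) that node IDs are assigned in time order, so parent IDs are non-decreasing along the time-sorted edge table. The function and variable names here are illustrative, not the actual simplify.py / link_ancestors internals.

```python
def edges_up_to_ancestors(edges, ancestors):
    """Yield edges until the parent ID exceeds the largest ancestral node ID.

    Assumes parent IDs are non-decreasing along the (time-sorted) edge table,
    which holds for msprime output where node IDs are assigned in time order.
    """
    max_ancestor = max(ancestors)
    for edge in edges:
        if edge.parent > max_ancestor:
            # No remaining edge can have one of the requested ancestors as a
            # parent, so the rest of the table can be skipped entirely.
            break
        yield edge
```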