Keep path to ancient in simplify #782

jeromekelleher · 2020-08-20T11:50:09Z

This is an initial prototype showing that we can solve #775 if we snip out the "used up" ancestry as we progress through simplify. This is initially just a proof of concept, as it does things in exactly the opposite way as would be intended.

python/tests/simplify.py

gtsambos

Okay, I can see what you're trying to do here @jeromekelleher! I'll start with the more technical comments before they leave my brain -- doing it all in one step at the end of the simplification processes introduces some inefficiencies that you wouldn't have if you were keeping track of the uncoalesced segments dynamically. For, instance, you calculate the list of uncoalesced segments gaps anew independently for each node, but they are highly dependent on each other -- every interval of uncoalesced ancestry that you 'find' as you progress further up the tree sequence is an interval that can be removed from 1 or more of the nodes lower in the tree.

Here's a sketch of how I'd approach this dynamically:

There'd be some array of lists of uncoalesced segments, say self.uncoalesced_segs, with one array element per node.
As you loop over edges (L, R, P, C) in the edge table:

Add the interval (L, R) to self.uncoalesced_segs[u] if one of the following is true:

u is the child node C and u is a sample
u is the parent node P
You'll also need to check that (L, R) isn't already in self.uncoalesced_segs[u].

Remove the intersection of the interval (L, R) from segment(s) in self.uncoalesced_segs[C]
Do all the normal simplify things (recording new edges, nodes etc)

python/tests/simplify.py

gtsambos · 2020-08-24T10:02:58Z

Does this make sense?

Also, I was trying to think about how this PR is different from what I've been working on in #679 and #488. I think the main difference is that here, each segment has a state of being either coalesced or uncoalesced, and although this state changes as you work through simplify -- an uncoalesced segment becomes coalesced as soon as it has an ancestor somewhere further up in the tree -- it is a 'global' property of the segment, in that it doesn't depend on the node's relationship to all other nodes, but only to its most closely related node. This is different from the IBD work, where the state of coalescence is specific/relative to every other sample node.

gtsambos · 2020-08-24T10:08:36Z

Also also, I still feel this is a very complicated change to make just to solve the problem in #775. However, there would be other interesting information that you'd get for 'free' by doing it this way (assuming there aren't other costs to algorithmic speed etc). For instance, you'd be able to see how the total length of segments of uncoalesced ancestry changes as you move further back in time. I imagine this would have interesting/somewhat predictable theoretical properties, and would be very sensitive to aspects of demography

gtsambos · 2020-08-24T10:20:36Z

Hmm, so it occurred to me that my approach in the review comment might negate one of the advantages you saw for this change -- the object holding the gaps in ancestry, self-uncoalesced_segs, will take up some memory, and though the amount of memory should decrease as the algorithm continues, it might not offset the memory that you'll be saving by trimming the list of ancestral segments A

jeromekelleher · 2020-08-24T11:33:22Z

This makes sense @gtsambos, but I'm actually not suggesting that we store the ancestry gaps like I did in this quick hack. My proposal is to snip out the used up segments of ancestry as we go through the algorithm. I'm having a go at it at the moment, let's see what it looks like.

jeromekelleher · 2020-08-24T17:59:48Z

Here's a working version of simplify with ancestry snipping and a keep_input_roots option. Implementing snipping out the ancestry was a bit tedious, as expected, but the code isn't very complicated and it's hard to imagine it slowing things down in any real sense (unless the segment chains get really fragmented), and it could allow us to save some memory.

Once this was done, the keep_input_roots option was easy, with a couple of tricky wrinkles:

Annoyingly, the edges we output aren't guaranteed to be in the right order in the output table. In particular, if a root is internal in another tree along the sequence, then when we write out the root edges for it later, they will not be adjacent in the table. I don't think there's much we can do about this, other then mitigate the full sort it implies (either sort part of the table, or else merge the edges intelligently). But, I guess this won't ever happen in the forward simulation case (right?), so maybe it's not so critical that we handle it well, we just have to detect reliably when the sort is needed.
There's probably some subtleties with how we handle mutations on the new edges - I haven't thought this through properly yet.

What do you think @petrelharp - is this worth pushing through for the forthcoming SLiM release? I can spent maybe another day on it, if you think it's worth it, and should get most stuff sorted in that much time.

jeromekelleher · 2020-08-24T18:56:00Z

@gtsambos - I've updated this with a good version of the ancestry snipping. This could be useful for what you're working on?

petrelharp · 2020-08-24T19:00:48Z

What do you think @petrelharp - is this worth pushing through for the forthcoming SLiM release?

I've discussed with @bhaller and we think it is worthwhile - if this works out, then we would ditch all the confusing "first generation" stuff, which would be quite nice for the users (and make the code nicer).

petrelharp · 2020-08-24T20:55:07Z

I've had a close read of this - looks good! - and see what the problem with sorting is. I haven't thought of a solution, but you're right: for forwards simulations all the extra roots are (usually) at the same time, so no sorting will be necessary. (This isn't true if you add a new history-less population partway through, but that's quite uncommon, I think.)

bhaller · 2020-08-24T21:27:05Z

I've had a close read of this - looks good! - and see what the problem with sorting is. I haven't thought of a solution, but you're right: for forwards simulations all the extra roots are (usually) at the same time, so no sorting will be necessary. (This isn't true if you add a new history-less population partway through, but that's quite uncommon, I think.)

Well, I don't know how common it is, but it is certainly legal in SLiM, so we will need to think about how to handle it; I guess we'll need a flag to force the sort to happen in that generation, or something?

petrelharp · 2020-08-24T21:38:25Z

Well, I don't know how common it is, but it is certainly legal in SLiM, so we will need to think about how to handle it; I guess we'll need a flag to force the sort to happen in that generation, or something?

The proposal isn't to leave it up to the user, but rather to have simplify not do more sorting than is necessary. The nice thing here is that in the usual case in SLiM, there won't be any necessary sorting. For instance: we can check if the youngest root is no older than the last parent in the edge table; if so, no sorting is necessary.

jeromekelleher · 2020-08-25T09:46:17Z

The ancestry snipping process has basically no effect on performance as far as I can tell. Running simplify on a 1.8 gig unsimplified tree sequence with 100k samples (200 generations using the haploid_wright_fisher example program). The difference between the two runs here is < 1%, so I'm not going to get worried about that right now. Likewise RAM hasn't really changed (but I haven't tried to free the used-up ancestry segments - that could be worthwhile).

This doesn't include the mapping roots bit (or any mutations) - I just wanted to get a like-for-like benchmark just crunching through lots of topology. Looks good so far.

BEFORE

$ /usr/bin/time python3 benchmark-simplify.py 
loaded in 10.364779717987403 s
nodes =  20100000
edges =  40000000
Simplified in  6.4478405870031565
nodes =  659800
edges =  3121843
15.06user 3.02system 0:17.75elapsed 101%CPU (0avgtext+0avgdata 8191192maxresident)k
0inputs+0outputs (0major+3155214minor)pagefaults 0swaps

AFTER

jk@tempest$ /usr/bin/time python3 benchmark-simplify.py 
loaded in 10.335074374976102 s
nodes =  20100000
edges =  40000000
Simplified in  6.466902140993625
nodes =  659800
edges =  3121843
14.95user 3.18system 0:17.73elapsed 102%CPU (0avgtext+0avgdata 8205248maxresident)k
0inputs+0outputs (0major+3158301minor)pagefaults 0swaps

petrelharp · 2020-08-25T14:58:18Z

Wow, that's great. Let me know when you want me to have a look, or to come up with some more tests.

jeromekelleher · 2020-08-25T15:22:31Z

Will do @petrelharp - I'm going to spend another couple of hours polishing it up, and it'd be great if you could take a good look then.

codecov · 2020-08-25T15:53:37Z

Codecov Report

Merging #782 into master will increase coverage by 0.36%.
The diff coverage is 81.34%.

@@            Coverage Diff             @@
##           master     #782      +/-   ##
==========================================
+ Coverage   87.72%   88.09%   +0.36%     
==========================================
  Files          24       24              
  Lines       19447    19529      +82     
  Branches     3653     3674      +21     
==========================================
+ Hits        17060    17204     +144     
+ Misses       1292     1223      -69     
- Partials     1095     1102       +7

Flag	Coverage Δ
#c_tests	`86.07% <80.31%> (+0.74%)`	⬆️
#python_c_tests	`90.16% <100.00%> (+<0.01%)`	⬆️
#python_tests	`98.89% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
python/tskit/tables.py	`99.70% <ø> (ø)`
python/tskit/trees.py	`98.13% <ø> (ø)`
c/tskit/tables.c	`82.78% <80.15%> (+1.50%)`	⬆️
c/tskit/core.c	`94.08% <100.00%> (ø)`
python/_tskitmodule.c	`83.00% <100.00%> (+<0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20af34a...6f885de. Read the comment docs.

jeromekelleher · 2020-08-25T17:50:06Z

OK, I think this is ready for review now. I'd like to bump up the C test coverage a bit (we're not hitting a bunch of the awkward corner cases) but I'm not sure what's the best way to get at them. Should probably craft a "simplest...." example to do it. Probably the easiest way is to make a really fragmented small ts...

Anyway, ready for review. Good to get your eyes on this @petrelharp @benjeffery @gtsambos!

petrelharp

Looks excellent! The only loose ends I could find are having to do with the modification to sort_tables. Feel free to ignore my PR - the behavior is being tested elsewhere, although the test I wrote makes it more obvious, maybe.

c/tskit/tables.c

python/tests/test_topology.py

python/tests/tsutil.py

gtsambos · 2020-08-26T07:55:10Z

Anyway, ready for review. Good to get your eyes on this @petrelharp @benjeffery @gtsambos!

@gtsambos - I've updated this with a good version of the ancestry snipping. This could be useful for what you're working on?

hooray, exciting! I'll look over tonight. And yes, if this works as well as it look like it does in your tests, it should be able to replace the awkward "deleting bits of ancestral segments" stuff that's currently giving me some grief in #679.

gtsambos · 2020-08-26T07:57:34Z

Also, what's the best way to submit revisions to this (like @petrelharp has done) if I want to suggest something? Do I fork your repo, make my changes to the branch and then submit a PR as usual (?)

petrelharp · 2020-08-26T08:39:28Z

I did git fetch origin pull/782/head:new_simplify; git checkout new_simplify, did edits, pushed to my fork of tskit-dev/tskit, and made a pull request (um, from my fork's page, I think?).

jeromekelleher · 2020-08-26T08:44:58Z

I did git fetch origin pull/782/head:new_simplify; git checkout new_simplify, did edits, pushed to my fork of tskit-dev/tskit, and made a pull request (um, from my fork's page, I think?).

You need to make a PR against this branch in my fork (jeromekelleher:simplify-keep-path-to-ancient). If you go to my fork and open a PR, it should give you the options to select the base branch and the PR branch.

Thanks!

jeromekelleher · 2020-08-26T08:45:52Z

(Oops, I was quoting @petrelharp above who had already solved the problem!)

jeromekelleher · 2020-08-26T10:05:27Z

TODO

Update changelog
Bump up C test coverage with a hand crafted example to cover awkward cases in snip ancestry.

benjeffery

I've been through this without deep-diving the algorithm, looks really good.

petrelharp · 2020-08-26T15:27:05Z

Bump up C test coverage with a hand crafted example to cover awkward cases in snip ancestry.

Ping if you'd like me to do this one.

jeromekelleher · 2020-08-26T16:20:03Z

Ping if you'd like me to do this one.

Thanks, but I think I know how to do it, so I'll have a go soon.

jeromekelleher · 2020-08-26T19:58:26Z

Huh - turns out the reason we couldn't get test coverage on the squash loop is that it never gets called, ever. So, we can delete it! Gotta love the test suite here, it's totally awesome.

jeromekelleher · 2020-08-26T20:22:29Z

OK, that's as much coverage as we're going to get - let's merge!

jeromekelleher · 2020-08-26T21:02:40Z

Actuall, crud, no, let's not merge. Forgot to update the changelog.

gtsambos · 2020-08-26T21:15:55Z

I had a brief look at this last night in bed, it looks really nice :) I'm going to look a bit more closely now at how you are dealing with the removal/modification of the removed segments in the ancestral map, but a general 👍 from me

gtsambos

hey, awesome job @jeromekelleher, this looks great! So far I've focused on the Python code, and specifically on the topology-related stuff (haven't thought much about the implications for remapping of mutations) -- I think I'm going to need a bit more time to look at the C code here in detail, as it just takes a while to parse what's going on here, but I think you should go ahead and merge the PR. Thanks also for cleaning up that test suite of simplify-esque examples -- there are defs a lot of overlap between those that it's good to consolidate.

python/tests/simplify.py

Also updates the internal algorithm to remove ancestry that has been processed.

stick test

jeromekelleher marked this pull request as draft August 20, 2020 11:50

jeromekelleher commented Aug 20, 2020

View reviewed changes

python/tests/simplify.py Outdated Show resolved Hide resolved

jeromekelleher mentioned this pull request Aug 20, 2020

Add keep_input_roots option to simplify #775

Closed

gtsambos self-requested a review August 20, 2020 12:48

gtsambos reviewed Aug 24, 2020

View reviewed changes

python/tests/simplify.py Outdated Show resolved Hide resolved

python/tests/simplify.py Outdated Show resolved Hide resolved

jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from ddc1e68 to 7269401 Compare August 24, 2020 17:49

jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from 8152a64 to a9f0f2b Compare August 25, 2020 15:35

jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from 8f43743 to d99c031 Compare August 25, 2020 17:47

jeromekelleher marked this pull request as ready for review August 25, 2020 17:47

jeromekelleher requested review from benjeffery, gtsambos and petrelharp and removed request for gtsambos August 25, 2020 17:47

petrelharp reviewed Aug 25, 2020

View reviewed changes

c/tskit/tables.c Show resolved Hide resolved

python/tests/test_topology.py Show resolved Hide resolved

python/tests/test_topology.py Outdated Show resolved Hide resolved

python/tests/test_topology.py Show resolved Hide resolved

python/tests/tsutil.py Show resolved Hide resolved

benjeffery mentioned this pull request Aug 26, 2020

Some questions about missing data #716

Closed

benjeffery approved these changes Aug 26, 2020

View reviewed changes

jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from b2c2e84 to a17a840 Compare August 26, 2020 19:56

jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from a17a840 to c9c7eb9 Compare August 26, 2020 20:22

gtsambos reviewed Aug 27, 2020

View reviewed changes

python/tests/simplify.py Show resolved Hide resolved

gtsambos mentioned this pull request Aug 27, 2020

C implementation of IBD-finding code. #679

Merged

jeromekelleher and others added 5 commits August 27, 2020 09:21

Add keep_input_roots to simplify.

f2225fc

Also updates the internal algorithm to remove ancestry that has been processed.

stick test

dc90ef5

stick test

make sure no sites are skipped

b232615

Clarify the semantics of skipping sort on sites/mutations.

0f65728

Removed the squash loop from simplifier_extract_ancestry.

6f885de

jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from c9c7eb9 to 6f885de Compare August 27, 2020 08:27

jeromekelleher added the AUTOMERGE-REQUESTED label Aug 27, 2020

mergify bot merged commit 1eabf28 into tskit-dev:master Aug 27, 2020

mergify bot removed the AUTOMERGE-REQUESTED label Aug 27, 2020

jeromekelleher deleted the simplify-keep-path-to-ancient branch August 27, 2020 08:47

gtsambos mentioned this pull request Aug 27, 2020

Ancestral nodes from a founding generation can now be retained in simplify #781

Closed

Keep path to ancient in simplify #782

Keep path to ancient in simplify #782

Uh oh!

Conversation

jeromekelleher commented Aug 20, 2020

Uh oh!

Uh oh!

gtsambos left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gtsambos commented Aug 24, 2020

Uh oh!

gtsambos commented Aug 24, 2020

Uh oh!

gtsambos commented Aug 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Aug 24, 2020

Uh oh!

jeromekelleher commented Aug 24, 2020

Uh oh!

jeromekelleher commented Aug 24, 2020

Uh oh!

petrelharp commented Aug 24, 2020

Uh oh!

petrelharp commented Aug 24, 2020

Uh oh!

bhaller commented Aug 24, 2020

Uh oh!

petrelharp commented Aug 24, 2020

Uh oh!

jeromekelleher commented Aug 25, 2020

Uh oh!

petrelharp commented Aug 25, 2020

Uh oh!

jeromekelleher commented Aug 25, 2020

Uh oh!

codecov bot commented Aug 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher commented Aug 25, 2020

Uh oh!

petrelharp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gtsambos commented Aug 26, 2020

Uh oh!

gtsambos commented Aug 26, 2020

Uh oh!

petrelharp commented Aug 26, 2020

Uh oh!

jeromekelleher commented Aug 26, 2020

Uh oh!

jeromekelleher commented Aug 26, 2020

Uh oh!

jeromekelleher commented Aug 26, 2020

Uh oh!

benjeffery left a comment

Choose a reason for hiding this comment

Uh oh!

petrelharp commented Aug 26, 2020

Uh oh!

jeromekelleher commented Aug 26, 2020

Uh oh!

jeromekelleher commented Aug 26, 2020

Uh oh!

jeromekelleher commented Aug 26, 2020

Uh oh!

jeromekelleher commented Aug 26, 2020

Uh oh!

gtsambos commented Aug 26, 2020

gtsambos left a comment •

edited

Loading

gtsambos commented Aug 24, 2020 •

edited

Loading

codecov bot commented Aug 25, 2020 •

edited

Loading