Skip to content

Conversation

@jeromekelleher
Copy link
Member

This is an initial prototype showing that we can solve #775 if we snip out the "used up" ancestry as we progress through simplify. This is initially just a proof of concept, as it does things in exactly the opposite way as would be intended.

@jeromekelleher jeromekelleher marked this pull request as draft August 20, 2020 11:50
Copy link
Member

@gtsambos gtsambos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I can see what you're trying to do here @jeromekelleher! I'll start with the more technical comments before they leave my brain -- doing it all in one step at the end of the simplification processes introduces some inefficiencies that you wouldn't have if you were keeping track of the uncoalesced segments dynamically. For, instance, you calculate the list of uncoalesced segments gaps anew independently for each node, but they are highly dependent on each other -- every interval of uncoalesced ancestry that you 'find' as you progress further up the tree sequence is an interval that can be removed from 1 or more of the nodes lower in the tree.

Here's a sketch of how I'd approach this dynamically:

There'd be some array of lists of uncoalesced segments, say self.uncoalesced_segs, with one array element per node.
As you loop over edges (L, R, P, C) in the edge table:

  1. Add the interval (L, R) to self.uncoalesced_segs[u] if one of the following is true:
  • u is the child node C and u is a sample
  • u is the parent node P
    You'll also need to check that (L, R) isn't already in self.uncoalesced_segs[u].
  1. Remove the intersection of the interval (L, R) from segment(s) in self.uncoalesced_segs[C]
  2. Do all the normal simplify things (recording new edges, nodes etc)

@gtsambos
Copy link
Member

Does this make sense?

Also, I was trying to think about how this PR is different from what I've been working on in #679 and #488. I think the main difference is that here, each segment has a state of being either coalesced or uncoalesced, and although this state changes as you work through simplify -- an uncoalesced segment becomes coalesced as soon as it has an ancestor somewhere further up in the tree -- it is a 'global' property of the segment, in that it doesn't depend on the node's relationship to all other nodes, but only to its most closely related node. This is different from the IBD work, where the state of coalescence is specific/relative to every other sample node.

@gtsambos
Copy link
Member

Also also, I still feel this is a very complicated change to make just to solve the problem in #775. However, there would be other interesting information that you'd get for 'free' by doing it this way (assuming there aren't other costs to algorithmic speed etc). For instance, you'd be able to see how the total length of segments of uncoalesced ancestry changes as you move further back in time. I imagine this would have interesting/somewhat predictable theoretical properties, and would be very sensitive to aspects of demography

@gtsambos
Copy link
Member

gtsambos commented Aug 24, 2020

Hmm, so it occurred to me that my approach in the review comment might negate one of the advantages you saw for this change -- the object holding the gaps in ancestry, self-uncoalesced_segs, will take up some memory, and though the amount of memory should decrease as the algorithm continues, it might not offset the memory that you'll be saving by trimming the list of ancestral segments A

@jeromekelleher
Copy link
Member Author

This makes sense @gtsambos, but I'm actually not suggesting that we store the ancestry gaps like I did in this quick hack. My proposal is to snip out the used up segments of ancestry as we go through the algorithm. I'm having a go at it at the moment, let's see what it looks like.

@jeromekelleher jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from ddc1e68 to 7269401 Compare August 24, 2020 17:49
@jeromekelleher
Copy link
Member Author

Here's a working version of simplify with ancestry snipping and a keep_input_roots option. Implementing snipping out the ancestry was a bit tedious, as expected, but the code isn't very complicated and it's hard to imagine it slowing things down in any real sense (unless the segment chains get really fragmented), and it could allow us to save some memory.

Once this was done, the keep_input_roots option was easy, with a couple of tricky wrinkles:

  1. Annoyingly, the edges we output aren't guaranteed to be in the right order in the output table. In particular, if a root is internal in another tree along the sequence, then when we write out the root edges for it later, they will not be adjacent in the table. I don't think there's much we can do about this, other then mitigate the full sort it implies (either sort part of the table, or else merge the edges intelligently). But, I guess this won't ever happen in the forward simulation case (right?), so maybe it's not so critical that we handle it well, we just have to detect reliably when the sort is needed.
  2. There's probably some subtleties with how we handle mutations on the new edges - I haven't thought this through properly yet.

What do you think @petrelharp - is this worth pushing through for the forthcoming SLiM release? I can spent maybe another day on it, if you think it's worth it, and should get most stuff sorted in that much time.

@jeromekelleher
Copy link
Member Author

@gtsambos - I've updated this with a good version of the ancestry snipping. This could be useful for what you're working on?

@petrelharp
Copy link
Contributor

What do you think @petrelharp - is this worth pushing through for the forthcoming SLiM release?

I've discussed with @bhaller and we think it is worthwhile - if this works out, then we would ditch all the confusing "first generation" stuff, which would be quite nice for the users (and make the code nicer).

@petrelharp
Copy link
Contributor

I've had a close read of this - looks good! - and see what the problem with sorting is. I haven't thought of a solution, but you're right: for forwards simulations all the extra roots are (usually) at the same time, so no sorting will be necessary. (This isn't true if you add a new history-less population partway through, but that's quite uncommon, I think.)

@bhaller
Copy link

bhaller commented Aug 24, 2020

I've had a close read of this - looks good! - and see what the problem with sorting is. I haven't thought of a solution, but you're right: for forwards simulations all the extra roots are (usually) at the same time, so no sorting will be necessary. (This isn't true if you add a new history-less population partway through, but that's quite uncommon, I think.)

Well, I don't know how common it is, but it is certainly legal in SLiM, so we will need to think about how to handle it; I guess we'll need a flag to force the sort to happen in that generation, or something?

@petrelharp
Copy link
Contributor

Well, I don't know how common it is, but it is certainly legal in SLiM, so we will need to think about how to handle it; I guess we'll need a flag to force the sort to happen in that generation, or something?

The proposal isn't to leave it up to the user, but rather to have simplify not do more sorting than is necessary. The nice thing here is that in the usual case in SLiM, there won't be any necessary sorting. For instance: we can check if the youngest root is no older than the last parent in the edge table; if so, no sorting is necessary.

@jeromekelleher
Copy link
Member Author

The ancestry snipping process has basically no effect on performance as far as I can tell. Running simplify on a 1.8 gig unsimplified tree sequence with 100k samples (200 generations using the haploid_wright_fisher example program). The difference between the two runs here is < 1%, so I'm not going to get worried about that right now. Likewise RAM hasn't really changed (but I haven't tried to free the used-up ancestry segments - that could be worthwhile).

This doesn't include the mapping roots bit (or any mutations) - I just wanted to get a like-for-like benchmark just crunching through lots of topology. Looks good so far.

BEFORE

$ /usr/bin/time python3 benchmark-simplify.py 
loaded in 10.364779717987403 s
nodes =  20100000
edges =  40000000
Simplified in  6.4478405870031565
nodes =  659800
edges =  3121843
15.06user 3.02system 0:17.75elapsed 101%CPU (0avgtext+0avgdata 8191192maxresident)k
0inputs+0outputs (0major+3155214minor)pagefaults 0swaps

AFTER

jk@tempest$ /usr/bin/time python3 benchmark-simplify.py 
loaded in 10.335074374976102 s
nodes =  20100000
edges =  40000000
Simplified in  6.466902140993625
nodes =  659800
edges =  3121843
14.95user 3.18system 0:17.73elapsed 102%CPU (0avgtext+0avgdata 8205248maxresident)k
0inputs+0outputs (0major+3158301minor)pagefaults 0swaps

@petrelharp
Copy link
Contributor

Wow, that's great. Let me know when you want me to have a look, or to come up with some more tests.

@jeromekelleher
Copy link
Member Author

Will do @petrelharp - I'm going to spend another couple of hours polishing it up, and it'd be great if you could take a good look then.

@jeromekelleher jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from 8152a64 to a9f0f2b Compare August 25, 2020 15:35
@codecov
Copy link

codecov bot commented Aug 25, 2020

Codecov Report

Merging #782 into master will increase coverage by 0.36%.
The diff coverage is 81.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #782      +/-   ##
==========================================
+ Coverage   87.72%   88.09%   +0.36%     
==========================================
  Files          24       24              
  Lines       19447    19529      +82     
  Branches     3653     3674      +21     
==========================================
+ Hits        17060    17204     +144     
+ Misses       1292     1223      -69     
- Partials     1095     1102       +7     
Flag Coverage Δ
#c_tests 86.07% <80.31%> (+0.74%) ⬆️
#python_c_tests 90.16% <100.00%> (+<0.01%) ⬆️
#python_tests 98.89% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/tables.py 99.70% <ø> (ø)
python/tskit/trees.py 98.13% <ø> (ø)
c/tskit/tables.c 82.78% <80.15%> (+1.50%) ⬆️
c/tskit/core.c 94.08% <100.00%> (ø)
python/_tskitmodule.c 83.00% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20af34a...6f885de. Read the comment docs.

@jeromekelleher jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from 8f43743 to d99c031 Compare August 25, 2020 17:47
@jeromekelleher jeromekelleher marked this pull request as ready for review August 25, 2020 17:47
@jeromekelleher jeromekelleher requested review from benjeffery, gtsambos and petrelharp and removed request for gtsambos August 25, 2020 17:47
@jeromekelleher
Copy link
Member Author

OK, I think this is ready for review now. I'd like to bump up the C test coverage a bit (we're not hitting a bunch of the awkward corner cases) but I'm not sure what's the best way to get at them. Should probably craft a "simplest...." example to do it. Probably the easiest way is to make a really fragmented small ts...

Anyway, ready for review. Good to get your eyes on this @petrelharp @benjeffery @gtsambos!

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks excellent! The only loose ends I could find are having to do with the modification to sort_tables. Feel free to ignore my PR - the behavior is being tested elsewhere, although the test I wrote makes it more obvious, maybe.

@gtsambos
Copy link
Member

Anyway, ready for review. Good to get your eyes on this @petrelharp @benjeffery @gtsambos!

@gtsambos - I've updated this with a good version of the ancestry snipping. This could be useful for what you're working on?

hooray, exciting! I'll look over tonight. And yes, if this works as well as it look like it does in your tests, it should be able to replace the awkward "deleting bits of ancestral segments" stuff that's currently giving me some grief in #679.

@gtsambos
Copy link
Member

Also, what's the best way to submit revisions to this (like @petrelharp has done) if I want to suggest something? Do I fork your repo, make my changes to the branch and then submit a PR as usual (?)

@petrelharp
Copy link
Contributor

I did git fetch origin pull/782/head:new_simplify; git checkout new_simplify, did edits, pushed to my fork of tskit-dev/tskit, and made a pull request (um, from my fork's page, I think?).

@jeromekelleher
Copy link
Member Author

I did git fetch origin pull/782/head:new_simplify; git checkout new_simplify, did edits, pushed to my fork of tskit-dev/tskit, and made a pull request (um, from my fork's page, I think?).

You need to make a PR against this branch in my fork (jeromekelleher:simplify-keep-path-to-ancient). If you go to my fork and open a PR, it should give you the options to select the base branch and the PR branch.

Thanks!

@jeromekelleher
Copy link
Member Author

(Oops, I was quoting @petrelharp above who had already solved the problem!)

@jeromekelleher
Copy link
Member Author

TODO

  • Update changelog
  • Bump up C test coverage with a hand crafted example to cover awkward cases in snip ancestry.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been through this without deep-diving the algorithm, looks really good.

@petrelharp
Copy link
Contributor

Bump up C test coverage with a hand crafted example to cover awkward cases in snip ancestry.

Ping if you'd like me to do this one.

@jeromekelleher
Copy link
Member Author

Ping if you'd like me to do this one.

Thanks, but I think I know how to do it, so I'll have a go soon.

@jeromekelleher jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from b2c2e84 to a17a840 Compare August 26, 2020 19:56
@jeromekelleher
Copy link
Member Author

Huh - turns out the reason we couldn't get test coverage on the squash loop is that it never gets called, ever. So, we can delete it! Gotta love the test suite here, it's totally awesome.

@jeromekelleher jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from a17a840 to c9c7eb9 Compare August 26, 2020 20:22
@jeromekelleher
Copy link
Member Author

OK, that's as much coverage as we're going to get - let's merge!

@jeromekelleher
Copy link
Member Author

Actuall, crud, no, let's not merge. Forgot to update the changelog.

@gtsambos
Copy link
Member

I had a brief look at this last night in bed, it looks really nice :) I'm going to look a bit more closely now at how you are dealing with the removal/modification of the removed segments in the ancestral map, but a general 👍 from me

Copy link
Member

@gtsambos gtsambos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey, awesome job @jeromekelleher, this looks great! So far I've focused on the Python code, and specifically on the topology-related stuff (haven't thought much about the implications for remapping of mutations) -- I think I'm going to need a bit more time to look at the C code here in detail, as it just takes a while to parse what's going on here, but I think you should go ahead and merge the PR. Thanks also for cleaning up that test suite of simplify-esque examples -- there are defs a lot of overlap between those that it's good to consolidate.

@jeromekelleher jeromekelleher force-pushed the simplify-keep-path-to-ancient branch from c9c7eb9 to 6f885de Compare August 27, 2020 08:27
@mergify mergify bot merged commit 1eabf28 into tskit-dev:master Aug 27, 2020
@jeromekelleher jeromekelleher deleted the simplify-keep-path-to-ancient branch August 27, 2020 08:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants