-
Notifications
You must be signed in to change notification settings - Fork 78
Keep path to ancient in simplify #782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Keep path to ancient in simplify #782
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I can see what you're trying to do here @jeromekelleher! I'll start with the more technical comments before they leave my brain -- doing it all in one step at the end of the simplification processes introduces some inefficiencies that you wouldn't have if you were keeping track of the uncoalesced segments dynamically. For, instance, you calculate the list of uncoalesced segments gaps anew independently for each node, but they are highly dependent on each other -- every interval of uncoalesced ancestry that you 'find' as you progress further up the tree sequence is an interval that can be removed from 1 or more of the nodes lower in the tree.
Here's a sketch of how I'd approach this dynamically:
There'd be some array of lists of uncoalesced segments, say self.uncoalesced_segs, with one array element per node.
As you loop over edges (L, R, P, C) in the edge table:
- Add the interval
(L, R)toself.uncoalesced_segs[u]if one of the following is true:
uis the child nodeCanduis a sampleuis the parent nodeP
You'll also need to check that(L, R)isn't already inself.uncoalesced_segs[u].
- Remove the intersection of the interval
(L, R)from segment(s) inself.uncoalesced_segs[C] - Do all the normal
simplifythings (recording new edges, nodes etc)
|
Does this make sense? Also, I was trying to think about how this PR is different from what I've been working on in #679 and #488. I think the main difference is that here, each segment has a state of being either coalesced or uncoalesced, and although this state changes as you work through |
|
Also also, I still feel this is a very complicated change to make just to solve the problem in #775. However, there would be other interesting information that you'd get for 'free' by doing it this way (assuming there aren't other costs to algorithmic speed etc). For instance, you'd be able to see how the total length of segments of uncoalesced ancestry changes as you move further back in time. I imagine this would have interesting/somewhat predictable theoretical properties, and would be very sensitive to aspects of demography |
|
Hmm, so it occurred to me that my approach in the review comment might negate one of the advantages you saw for this change -- the object holding the gaps in ancestry, |
|
This makes sense @gtsambos, but I'm actually not suggesting that we store the ancestry gaps like I did in this quick hack. My proposal is to snip out the used up segments of ancestry as we go through the algorithm. I'm having a go at it at the moment, let's see what it looks like. |
ddc1e68 to
7269401
Compare
|
Here's a working version of simplify with ancestry snipping and a Once this was done, the
What do you think @petrelharp - is this worth pushing through for the forthcoming SLiM release? I can spent maybe another day on it, if you think it's worth it, and should get most stuff sorted in that much time. |
|
@gtsambos - I've updated this with a good version of the ancestry snipping. This could be useful for what you're working on? |
I've discussed with @bhaller and we think it is worthwhile - if this works out, then we would ditch all the confusing "first generation" stuff, which would be quite nice for the users (and make the code nicer). |
|
I've had a close read of this - looks good! - and see what the problem with sorting is. I haven't thought of a solution, but you're right: for forwards simulations all the extra roots are (usually) at the same time, so no sorting will be necessary. (This isn't true if you add a new history-less population partway through, but that's quite uncommon, I think.) |
Well, I don't know how common it is, but it is certainly legal in SLiM, so we will need to think about how to handle it; I guess we'll need a flag to force the sort to happen in that generation, or something? |
The proposal isn't to leave it up to the user, but rather to have simplify not do more sorting than is necessary. The nice thing here is that in the usual case in SLiM, there won't be any necessary sorting. For instance: we can check if the youngest root is no older than the last parent in the edge table; if so, no sorting is necessary. |
|
The ancestry snipping process has basically no effect on performance as far as I can tell. Running simplify on a 1.8 gig unsimplified tree sequence with 100k samples (200 generations using the haploid_wright_fisher example program). The difference between the two runs here is < 1%, so I'm not going to get worried about that right now. Likewise RAM hasn't really changed (but I haven't tried to free the used-up ancestry segments - that could be worthwhile). This doesn't include the mapping roots bit (or any mutations) - I just wanted to get a like-for-like benchmark just crunching through lots of topology. Looks good so far. BEFORE AFTER |
|
Wow, that's great. Let me know when you want me to have a look, or to come up with some more tests. |
|
Will do @petrelharp - I'm going to spend another couple of hours polishing it up, and it'd be great if you could take a good look then. |
8152a64 to
a9f0f2b
Compare
Codecov Report
@@ Coverage Diff @@
## master #782 +/- ##
==========================================
+ Coverage 87.72% 88.09% +0.36%
==========================================
Files 24 24
Lines 19447 19529 +82
Branches 3653 3674 +21
==========================================
+ Hits 17060 17204 +144
+ Misses 1292 1223 -69
- Partials 1095 1102 +7
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
8f43743 to
d99c031
Compare
|
OK, I think this is ready for review now. I'd like to bump up the C test coverage a bit (we're not hitting a bunch of the awkward corner cases) but I'm not sure what's the best way to get at them. Should probably craft a "simplest...." example to do it. Probably the easiest way is to make a really fragmented small ts... Anyway, ready for review. Good to get your eyes on this @petrelharp @benjeffery @gtsambos! |
petrelharp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks excellent! The only loose ends I could find are having to do with the modification to sort_tables. Feel free to ignore my PR - the behavior is being tested elsewhere, although the test I wrote makes it more obvious, maybe.
hooray, exciting! I'll look over tonight. And yes, if this works as well as it look like it does in your tests, it should be able to replace the awkward "deleting bits of ancestral segments" stuff that's currently giving me some grief in #679. |
|
Also, what's the best way to submit revisions to this (like @petrelharp has done) if I want to suggest something? Do I fork your repo, make my changes to the branch and then submit a PR as usual (?) |
|
I did |
You need to make a PR against this branch in my fork (jeromekelleher:simplify-keep-path-to-ancient). If you go to my fork and open a PR, it should give you the options to select the base branch and the PR branch. Thanks! |
|
(Oops, I was quoting @petrelharp above who had already solved the problem!) |
|
TODO
|
benjeffery
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been through this without deep-diving the algorithm, looks really good.
Ping if you'd like me to do this one. |
Thanks, but I think I know how to do it, so I'll have a go soon. |
b2c2e84 to
a17a840
Compare
|
Huh - turns out the reason we couldn't get test coverage on the squash loop is that it never gets called, ever. So, we can delete it! Gotta love the test suite here, it's totally awesome. |
a17a840 to
c9c7eb9
Compare
|
OK, that's as much coverage as we're going to get - let's merge! |
|
Actuall, crud, no, let's not merge. Forgot to update the changelog. |
|
I had a brief look at this last night in bed, it looks really nice :) I'm going to look a bit more closely now at how you are dealing with the removal/modification of the removed segments in the ancestral map, but a general 👍 from me |
gtsambos
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey, awesome job @jeromekelleher, this looks great! So far I've focused on the Python code, and specifically on the topology-related stuff (haven't thought much about the implications for remapping of mutations) -- I think I'm going to need a bit more time to look at the C code here in detail, as it just takes a while to parse what's going on here, but I think you should go ahead and merge the PR. Thanks also for cleaning up that test suite of simplify-esque examples -- there are defs a lot of overlap between those that it's good to consolidate.
Also updates the internal algorithm to remove ancestry that has been processed.
stick test
c9c7eb9 to
6f885de
Compare
This is an initial prototype showing that we can solve #775 if we snip out the "used up" ancestry as we progress through simplify. This is initially just a proof of concept, as it does things in exactly the opposite way as would be intended.