-
Notifications
You must be signed in to change notification settings - Fork 78
Python implementation of IBD methods. #488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Great, thanks @gtsambos, looking forward go going through this with you tomorrow. |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look good @gtsambos, this is great progress. As we discussed, I think the next steps are:
- Add some more tests. Small examples with previously worked out answers would be ideal, across a few different types of funky topologies.
- Start C-ifying the main algorithm. This is mainly a case of (a) changing the encoding of pairs (0, 0), (0, 1) etc to indexes 0, 1, ..., n**2 - 1; (b) changing the structure from creating temporary mappings that are moved between functions to global state that is modified in place by methods.
python/tests/ibd.py
Outdated
| # Copyright (c) 2018-2020 Tskit Developers | ||
| # Copyright (c) 2015-2018 University of Oxford |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This didn't exist until now, so change to
# Copyright (c) 2020 Tskit Developers
|
This is great!! Let me know if I can help write some tests or something. |
|
Hi @jeromekelleher and @petrelharp (also cc @dvukcevic), this has come along a bit in the last few days after some weeks of muddy-headedness! I have one or two relatively easy things I'll do over the next few days -- cleaning and documenting the code, adding some other tests, making the CI tests work. Otherwise I think this should be ready to review quite soon, probably before next week.
(b) was pretty easy to do; (a) has been much more complicated. I have created a new object,
Cool, thanks @petrelharp! I have a few others to add in myself; I'll let you know once I've done these so you can see if they look comprehensive enough. |
|
Sounds good @gtsambos, please do ping me if you want any input in the meantime. |
|
Me, too! Looking forward to it. |
|
ready to review ✨ |
|
Oh yeah, I totally forgot that |
I think we can assume if msprime._version.version >= '0.7.5':
.... TEST |
|
okay will do, thanks @benjeffery! |
|
A good way to do this is to use |
|
Nice one! Wasn't aware of that decorator! |
| ts = treeSequence | ||
| ibd0 = ibd.IbdFinder(ts, samples=ts.samples()) | ||
| ibd0 = ibd0.find_ibd_segments() | ||
| ibd1 = get_ibd_all_pairs(ts, path_ibd=True, mrca_ibd=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a good place to test the samples argument, by also comparing output on a random subset of the samples.
|
I'll have a read-through - unless someone else is in the middle of it already? |
|
Looking through... looking good... I'll say when I'm done. In the meantime, a comment: I see that you're doing the thing that I did on my last PR, which is to make the python implementation as much like the C implementation as possible, including building linked lists explicitly with pointers to the next one, etcetera. BUT THEN, @jeromekelleher told me to be more pythonic, and leave that sort of thing up to python, thus leaving the basic structure of the algorithm more clear without having to worry about who's linked to who, etcetera. For instance, your I think he's right - it helps to separate the logic of the implementation from the fiddly bookkeeping details. But, on the other hand, I know for me it was helpful to do all this stuff in python before trying to do it in C. So, my suggestion is to at some point simplify these data structures to use more python - maybe now, but up to you. |
|
Thanks for looking at this so quickly @petrelharp! I had a call with @jeromekelleher last night and we went over this together then as well. I'll comment on some of the specific things up above. About this:
I was trying to make this pretty C-like, yes. I think I'll keep my SegmentLists as they are, because I'm using them in a pretty involved way. In this IBD algorithm, we need to be able to add lists together/do other operations on them, rather than just cycling through elements of the list as we do in Are there any things in this code other than the |
Ah, ok. I haven't gotten there yet.
I'm getting there! Had some important mucking about in the yard to do first, though. |
|
btw, some other things we talked about were in
in
|
|
Those sound good. Maybe I should wait 'til you've done those things to finish the careful read-through? |
|
sure :) hopefully will be done this week! I'd be really appreciative if you could look at the tests though, I'm not sure I put in as many wacky topologies as I could have |
|
Hmm, would it okay to add the |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @gtsambos, much simpler and cleaner! I think we need to do one final review of the data structures to make sure that it's all as simple as it can be - I'm not convinced the SegmentList structure is really needed now, for example. Maintaining the list of children nodes for each parent will be pretty tedious in C also, so if there was a less "nested array" way of doing that it would be nice. We currently don't have any hash tables or AVL trees in C in tskit, and it's a lot of hassle bringing them in, so it's worth really thinking about what we need data structure wise.
| # find_sample_pair_index method further down. | ||
|
|
||
| self.ibd_segments = {} | ||
| for key in self.sample_pairs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's much point in storing None for all the sample_pairs, right? We can just check to see if the key is already in the array or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's talk about this
| if self.ibd_segments[self.sample_pairs[index]] is None: | ||
| self.ibd_segments[self.sample_pairs[index]] = SegmentList( | ||
| head=seg, tail=seg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if self.ibd_segments[self.sample_pairs[index]] is None: | |
| self.ibd_segments[self.sample_pairs[index]] = SegmentList( | |
| head=seg, tail=seg | |
| index = self.find_sample_pair_index(sample0, sample1) | |
| key = self.sample_pairs[index] | |
| if key not in self.ibd_segments[key]: | |
| self.ibd_segments[key] = SegmentList(head=seg, tail=seg) |
Although, it would probably be better to map directly to the indexes, so that it'll be easier to compare with the C code (it's easy enough to convert the index to an actual tuple when printing out debug stuff.)
|
It looks a lot simpler, nice! I've got nothing else to add. I'm curious, though: where does this leave us for squashing edges - will require a refactor? |
|
Hi @jeromekelleher, about this:
I think we could do without it (and am trying to mod it like this now), but getting rid of it comes at the cost of adding a couple of new |
|
Hi @jeromekelleher, I'm just now thinking more deeply about if we were to get rid of SegmentLists. One issue is that the current output of this algorithm is currently SegmentLists (and it will have to be something like this, as this is basically what raw IBD information is). Updating these lists as the algorithm iterates isn't going to be quite as simple as before. Without having segment lists, we have two options:
(1) seems a little more ideal to me because of its efficiency? |
|
OK, great, let's keep the SegmentLists then @gtsambos - just wanted to make sure we still needed them. |
|
Hi @petrelharp @jeromekelleher , sorry for not responding to a lot of this -- I've been very busy with other time-critical projects but have set aside this afternoon to go through these!
I think that as long as we keep SegmentLists, the 'refactoring' will mostly involve adding new code rather than modifying this existing code base -- most immediately, I can see that we'd need a method for removing a segment from a SegmentList and patching it up (as opposed to just adding them in, as we need here), and we'd also need a SegmentList-type object or set of objects to hold the edges once they have been squashed.. I haven't thought much about the AVL/hash table stuff, and to be honest I feel a little out of my depth -- I think I'll need to read and learn more about these before I'm able to have a proper conversation about it. |
|
Hmmm, this build looks like it's failing for some reason unrelated to this pull request (something wrong with the |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor change here @gtsambos, and we can merge then I think. As discussed, let's follow up with with an initial C implementation, not worrying too much about optimum data structures.
python/tests/ibd.py
Outdated
| # Set up an iterator over the edges in the tree sequence. | ||
| edges_iter = iter(self.ts.edges()) | ||
| e = next(edges_iter) | ||
| PARENT_SHOULD_BE_ADDED = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this ALL_CAPS? It looks like a regular variable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh okay, I've changed this in the latest commit. I think I'm used to seeing boolean variables being capitalised, but it's no big deal to me either way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ALL_CAPS is usually reserved for constants, I think.
Great! I'll probably need to do some squashing and rebasing as well, but only once you're happy with the code itself. |
|
OK, great! Can you rebase and squash please? Don't worry about the travis builds, we'll merge once they've been sorted out. |
|
Cool -- done |
Baby commit of IBD tests. Working, but messy Python implementation of fast IBD algorithm. ibdFinder now returns segments rather than intervals. verify_equal_ibd function has been changed Bug fix. Rudimentary tests of IbdFinder for path_IBD=true, mrca_IBD=true seem to work.
min-length and max-time argument of IBDFinder should now be working. Added a new test class, TestIbdTopologies. Added functions to test for equality of ibd segments. extra test topologies in test_ibd.py Converted list of ancetral segments into more C-like objects.
Added to SegmentList class and changed all uses of A, S to use the SegmentList class. Fixed bug. Added more examples to test_ibd.py ibd_segments is now an attribute of the IbdFinder class Removed new_segs Infrastructure for converting sample pairs to index in struct array. ibd_segments now contains linked lists Fixed bug, modified tests to convert SegmentList output into normal python lists. Neatened the IBD tests. Added some documentation. Added more tests, code to deal with fringe cases. Jerome and Peter's latest comments. Peter's comments. Updated IBD tests with forward time DTWF tests.
small bugfix forward DTWF sims are now in tests Samples no longer need to be sorted, sample_id_map is part of the constructor. mygen renamed to edges_iter current_time, current_parent and PARENT_TO_BE_ADDED are now local to find_ibd_segments() More small changes Removed dependency on msprime v1.0 Removed parent_list, added simpler oldest_parent attribute Removed processed_nodes object Small change to variable name.
|
great, thanks @benjeffery! |
Codecov Report
@@ Coverage Diff @@
## master #488 +/- ##
=======================================
Coverage 87.75% 87.75%
=======================================
Files 23 23
Lines 17687 17687
Branches 3498 3498
=======================================
Hits 15522 15522
Misses 1062 1062
Partials 1103 1103
Continue to review full report at Codecov.
|
|
Great, thanks @gtsambos. Merging this much, so we can move on with the next phase of C implementation. |
Pulling this in now so that others can see my continuing progress on IBD-calculating methods for
tskit(cc @jeromekelleher for our call tomorrow). It's very basic so far but it should work:Command-line implementation:
should return IBD segments shared between all pairs of samples.
should return IBD segments shared between samples 0 and 1 (ensure the nodes are ordered).
Tests:
IBD definitions
A pair of sample nodes that have both inherited the interval
(left, right)from a single ancestor,node, are IBD over the interval(left, right).Further options:
path_ibd: IfTrue, the IBD segment must be inherited by two samples from a common ancestor without being broken by recombination since the time of that ancestor. Equivalently, the path of the tree connecting the sample pair must be the same along the whole span of the interval.mrca_ibd: IfTrue, only IBD segments at the MRCA of the sample pair are recorded.Cutoff criteria:
min_length: Only segments longer thanmin_lengthare returned.max_time: Only segments from common ancestors who lived more recently thanmax_timeare returned.Note: currently, the only implemented version is
path_ibd=True,mrca_ibd=True,min_length=0,max_time=0, but watch this space!