Python implementation of IBD methods. #488

gtsambos · 2020-03-18T08:13:05Z

Pulling this in now so that others can see my continuing progress on IBD-calculating methods for tskit (cc @jeromekelleher for our call tomorrow). It's very basic so far but it should work:

Command-line implementation:

python3 tests/ibd.py test.trees

should return IBD segments shared between all pairs of samples.

python3 tests/ibd.py test.trees 0 1

should return IBD segments shared between samples 0 and 1 (ensure the nodes are ordered).

Tests:

python3 -m nose -vs tests/test_ibd.py

IBD definitions

A pair of sample nodes that have both inherited the interval (left, right) from a single ancestor, node, are IBD over the interval (left, right).

Further options:

path_ibd: If True, the IBD segment must be inherited by two samples from a common ancestor without being broken by recombination since the time of that ancestor. Equivalently, the path of the tree connecting the sample pair must be the same along the whole span of the interval.
mrca_ibd: If True, only IBD segments at the MRCA of the sample pair are recorded.

Cutoff criteria:

min_length: Only segments longer than min_length are returned.
max_time: Only segments from common ancestors who lived more recently than max_time are returned.

Note: currently, the only implemented version is path_ibd=True, mrca_ibd=True, min_length=0, max_time=0, but watch this space!

jeromekelleher · 2020-03-18T09:19:50Z

Great, thanks @gtsambos, looking forward go going through this with you tomorrow.

python/tests/ibd.py

jeromekelleher

Look good @gtsambos, this is great progress. As we discussed, I think the next steps are:

Add some more tests. Small examples with previously worked out answers would be ideal, across a few different types of funky topologies.
Start C-ifying the main algorithm. This is mainly a case of (a) changing the encoding of pairs (0, 0), (0, 1) etc to indexes 0, 1, ..., n**2 - 1; (b) changing the structure from creating temporary mappings that are moved between functions to global state that is modified in place by methods.

jeromekelleher · 2020-03-19T10:10:39Z

python/tests/ibd.py

+# Copyright (c) 2018-2020 Tskit Developers
+# Copyright (c) 2015-2018 University of Oxford


This didn't exist until now, so change to

# Copyright (c) 2020 Tskit Developers

petrelharp · 2020-03-25T15:16:13Z

This is great!! Let me know if I can help write some tests or something.

gtsambos · 2020-04-22T02:21:18Z

Hi @jeromekelleher and @petrelharp (also cc @dvukcevic), this has come along a bit in the last few days after some weeks of muddy-headedness! I have one or two relatively easy things I'll do over the next few days -- cleaning and documenting the code, adding some other tests, making the CI tests work. Otherwise I think this should be ready to review quite soon, probably before next week.

Start C-ifying the main algorithm. This is mainly a case of (a) changing the encoding of pairs (0, 0), (0, 1) etc to indexes 0, 1, ..., n**2 - 1; (b) changing the structure from creating temporary mappings that are moved between functions to global state that is modified in place by methods.

(b) was pretty easy to do; (a) has been much more complicated. I have created a new object, SegmentList, to hold linked lists of Segment objects. (Confusingly, most of the algorithm works with lists of SegmentLists -- ie. lists of lists of segments). For now, I have retained the representation of the output, ibd_segments, as a Python dictionary, with the keys being the sample pairs and the values being the list of IBD segments for each pair. However, in the C code this will be a struct array holding SegmentList objects. I've added some methods that mimic the procedure that will be needed to 'look up' the correct array element in C, even though this wasn't strictly necessary to make the Python code work. I'm happy to talk through this on a call or elsewhere.

This is great!! Let me know if I can help write some tests or something.

Cool, thanks @petrelharp! I have a few others to add in myself; I'll let you know once I've done these so you can see if they look comprehensive enough.

jeromekelleher · 2020-04-22T09:28:06Z

Sounds good @gtsambos, please do ping me if you want any input in the meantime.

petrelharp · 2020-04-22T14:00:11Z

Me, too! Looking forward to it.

gtsambos · 2020-04-26T23:40:23Z

ready to review ✨

gtsambos · 2020-04-27T01:14:34Z

Oh yeah, I totally forgot that discrete=True won't work in the CI tests. Should I just comment these out until the new msprime release? I think that's the only reason why the CI is failing

benjeffery · 2020-04-27T10:24:21Z

Should I just comment these out until the new msprime release?

I think we can assume discrete=True will be in the next release so how about:

if msprime._version.version >= '0.7.5':
    .... TEST

gtsambos · 2020-04-27T10:32:25Z

okay will do, thanks @benjeffery!

jeromekelleher · 2020-04-27T11:13:52Z

A good way to do this is to use @unittest.skipIf(msprime.__version__ <= 0.7.4)

benjeffery · 2020-04-27T11:32:37Z

Nice one! Wasn't aware of that decorator!
This way also tells us that the test is being skipped in test output so is extra nice.

petrelharp · 2020-04-27T15:11:07Z

python/tests/test_ibd.py

+    ts = treeSequence
+    ibd0 = ibd.IbdFinder(ts, samples=ts.samples())
+    ibd0 = ibd0.find_ibd_segments()
+    ibd1 = get_ibd_all_pairs(ts, path_ibd=True, mrca_ibd=True)


This would be a good place to test the samples argument, by also comparing output on a random subset of the samples.

petrelharp · 2020-04-27T15:32:13Z

I'll have a read-through - unless someone else is in the middle of it already?

python/tests/ibd.py

python/tests/test_ibd.py

petrelharp · 2020-04-27T17:59:34Z

Looking through... looking good... I'll say when I'm done. In the meantime, a comment: I see that you're doing the thing that I did on my last PR, which is to make the python implementation as much like the C implementation as possible, including building linked lists explicitly with pointers to the next one, etcetera. BUT THEN, @jeromekelleher told me to be more pythonic, and leave that sort of thing up to python, thus leaving the basic structure of the algorithm more clear without having to worry about who's linked to who, etcetera. For instance, your SegmentList could probably just be a list, without defining a special class?

I think he's right - it helps to separate the logic of the implementation from the fiddly bookkeeping details. But, on the other hand, I know for me it was helpful to do all this stuff in python before trying to do it in C. So, my suggestion is to at some point simplify these data structures to use more python - maybe now, but up to you.

gtsambos · 2020-04-27T23:44:26Z

Thanks for looking at this so quickly @petrelharp! I had a call with @jeromekelleher last night and we went over this together then as well. I'll comment on some of the specific things up above. About this:

I see that you're doing the thing that I did on my last PR, which is to make the python implementation as much like the C implementation as possible

For instance, your SegmentList could probably just be a list, without defining a special class?

I was trying to make this pretty C-like, yes. I think I'll keep my SegmentLists as they are, because I'm using them in a pretty involved way. In this IBD algorithm, we need to be able to add lists together/do other operations on them, rather than just cycling through elements of the list as we do in simplify and link_to_ancestors. Also, the output of the C-based algorithm will have to be an array of SegmentList objects (each one listing the IBD ancestors/segments for a particular sample pair), so it makes sense to me that we'd have also it in Python in order to compare the output between the two implementations. There are quite a few methods I'll have to define in C to do these things, to the point where I thought it would look a bit unrecognisable if I omitted them here.

Are there any things in this code other than the SegmentList objects that could be simplified, though?

petrelharp · 2020-04-28T00:11:10Z

we need to be able to add lists together/do other operations on them, rather than just cycling through elements of the list

Ah, ok. I haven't gotten there yet.

Are there any things in this code other than the SegmentList objects that could be simplified, though?

I'm getting there! Had some important mucking about in the yard to do first, though.

gtsambos · 2020-04-28T00:11:46Z

btw, some other things we talked about were

in ibd.py:

changing this add method to the more explicit SegmentList().add(...) instead of using +
changing SegL_head and SegL_tail to be local variables inside the find_ibd_segments method
initialising ibd_segments with None rather than with SegmentList()

in test_ibd.py

Simplifying the 'naive' get_ibd method to only test MRCA ibd (and so making the function clearer to follow)
replacing assert statements with self.assert statements (look into defining a subclass of unittest.TestCase specific to these IBD tests)
Generating some weird random tree sequences with the forward-time DTWF simulator

petrelharp · 2020-04-28T00:17:49Z

Those sound good. Maybe I should wait 'til you've done those things to finish the careful read-through?

gtsambos · 2020-04-28T00:28:46Z

sure :) hopefully will be done this week! I'd be really appreciative if you could look at the tests though, I'm not sure I put in as many wacky topologies as I could have

gtsambos · 2020-05-18T06:42:49Z

Hmm, would it okay to add the packaging module to development.txt? It looks smallish, and I think I need this to safely do comparisons of versions like we discussed here. This is why the CI is currently failing btw.

jeromekelleher

Looks great @gtsambos, much simpler and cleaner! I think we need to do one final review of the data structures to make sure that it's all as simple as it can be - I'm not convinced the SegmentList structure is really needed now, for example. Maintaining the list of children nodes for each parent will be pretty tedious in C also, so if there was a less "nested array" way of doing that it would be nice. We currently don't have any hash tables or AVL trees in C in tskit, and it's a lot of hassle bringing them in, so it's worth really thinking about what we need data structure wise.

python/tests/ibd.py

jeromekelleher · 2020-05-18T09:42:15Z

python/tests/ibd.py

+        # find_sample_pair_index method further down.
+
+        self.ibd_segments = {}
+        for key in self.sample_pairs:


I don't think there's much point in storing None for all the sample_pairs, right? We can just check to see if the key is already in the array or not.

Let's talk about this

jeromekelleher · 2020-05-18T09:47:32Z

python/tests/ibd.py

+        if self.ibd_segments[self.sample_pairs[index]] is None:
+            self.ibd_segments[self.sample_pairs[index]] = SegmentList(
+                head=seg, tail=seg


Suggested change

if self.ibd_segments[self.sample_pairs[index]] is None:

self.ibd_segments[self.sample_pairs[index]] = SegmentList(

head=seg, tail=seg

index = self.find_sample_pair_index(sample0, sample1)

key = self.sample_pairs[index]

if key not in self.ibd_segments[key]:

self.ibd_segments[key] = SegmentList(head=seg, tail=seg)

Although, it would probably be better to map directly to the indexes, so that it'll be easier to compare with the C code (it's easy enough to convert the index to an actual tuple when printing out debug stuff.)

python/tests/test_ibd.py

python/tests/ibd.py

petrelharp · 2020-05-18T18:52:29Z

It looks a lot simpler, nice! I've got nothing else to add. I'm curious, though: where does this leave us for squashing edges - will require a refactor?

gtsambos · 2020-05-19T07:32:24Z

Hi @jeromekelleher, about this:

I'm not convinced the SegmentList structure is really needed now, for example.

I think we could do without it (and am trying to mod it like this now), but getting rid of it comes at the cost of adding a couple of new add_segment_to_list methods that are all basically doing the same thing

gtsambos · 2020-05-19T08:06:49Z

Hi @jeromekelleher, I'm just now thinking more deeply about if we were to get rid of SegmentLists. One issue is that the current output of this algorithm is currently SegmentLists (and it will have to be something like this, as this is basically what raw IBD information is).

Updating these lists as the algorithm iterates isn't going to be quite as simple as before. Without having segment lists, we have two options:

For each pair of samples, keep track of both the head and the tail segment. Each time a new IBD segment is found, add it to the tail. We'll need to make a struct array for the heads (indexed by the sample pair), and another for the tails (also indexed by the sample pair), and will have to update these array elements each time a new segment is found.
For each pair of IBD segments, we keep track of only the head segment. Everytime we want to add a new ibd segment, we will have to cycle through all of the segments starting from the first one.

(1) seems a little more ideal to me because of its efficiency?

jeromekelleher · 2020-05-19T13:20:21Z

OK, great, let's keep the SegmentLists then @gtsambos - just wanted to make sure we still needed them.

gtsambos · 2020-05-26T02:36:39Z

Hi @petrelharp @jeromekelleher , sorry for not responding to a lot of this -- I've been very busy with other time-critical projects but have set aside this afternoon to go through these!

I'm curious, though: where does this leave us for squashing edges - will require a refactor?

I think that as long as we keep SegmentLists, the 'refactoring' will mostly involve adding new code rather than modifying this existing code base -- most immediately, I can see that we'd need a method for removing a segment from a SegmentList and patching it up (as opposed to just adding them in, as we need here), and we'd also need a SegmentList-type object or set of objects to hold the edges once they have been squashed..

I haven't thought much about the AVL/hash table stuff, and to be honest I feel a little out of my depth -- I think I'll need to read and learn more about these before I'm able to have a proper conversation about it.

gtsambos · 2020-06-08T07:43:12Z

Hmmm, this build looks like it's failing for some reason unrelated to this pull request (something wrong with the pysam library?)

jeromekelleher · 2020-06-08T07:55:37Z

Yep, Pysam thing is unrelated @gtsambos - see #673

jeromekelleher

One minor change here @gtsambos, and we can merge then I think. As discussed, let's follow up with with an initial C implementation, not worrying too much about optimum data structures.

jeromekelleher · 2020-06-08T07:58:42Z

python/tests/ibd.py

+        # Set up an iterator over the edges in the tree sequence.
+        edges_iter = iter(self.ts.edges())
+        e = next(edges_iter)
+        PARENT_SHOULD_BE_ADDED = True


Why is this ALL_CAPS? It looks like a regular variable to me.

Oh okay, I've changed this in the latest commit. I think I'm used to seeing boolean variables being capitalised, but it's no big deal to me either way

ALL_CAPS is usually reserved for constants, I think.

gtsambos · 2020-06-08T09:51:00Z

One minor change here @gtsambos, and we can merge then I think. As discussed, let's follow up with with an initial C implementation, not worrying too much about optimum data structures.

Great! I'll probably need to do some squashing and rebasing as well, but only once you're happy with the code itself.

jeromekelleher · 2020-06-08T10:18:09Z

OK, great! Can you rebase and squash please? Don't worry about the travis builds, we'll merge once they've been sorted out.

gtsambos · 2020-06-08T12:22:14Z

Cool -- done

benjeffery · 2020-06-08T14:04:33Z

@gtsambos #673 is fixed and merged so please go ahead and rebase to clear the CI.

Baby commit of IBD tests. Working, but messy Python implementation of fast IBD algorithm. ibdFinder now returns segments rather than intervals. verify_equal_ibd function has been changed Bug fix. Rudimentary tests of IbdFinder for path_IBD=true, mrca_IBD=true seem to work.

min-length and max-time argument of IBDFinder should now be working. Added a new test class, TestIbdTopologies. Added functions to test for equality of ibd segments. extra test topologies in test_ibd.py Converted list of ancetral segments into more C-like objects.

Added to SegmentList class and changed all uses of A, S to use the SegmentList class. Fixed bug. Added more examples to test_ibd.py ibd_segments is now an attribute of the IbdFinder class Removed new_segs Infrastructure for converting sample pairs to index in struct array. ibd_segments now contains linked lists Fixed bug, modified tests to convert SegmentList output into normal python lists. Neatened the IBD tests. Added some documentation. Added more tests, code to deal with fringe cases. Jerome and Peter's latest comments. Peter's comments. Updated IBD tests with forward time DTWF tests.

small bugfix forward DTWF sims are now in tests Samples no longer need to be sorted, sample_id_map is part of the constructor. mygen renamed to edges_iter current_time, current_parent and PARENT_TO_BE_ADDED are now local to find_ibd_segments() More small changes Removed dependency on msprime v1.0 Removed parent_list, added simpler oldest_parent attribute Removed processed_nodes object Small change to variable name.

gtsambos · 2020-06-08T23:22:49Z

great, thanks @benjeffery!

codecov · 2020-06-08T23:46:01Z

Codecov Report

Merging #488 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #488   +/-   ##
=======================================
  Coverage   87.75%   87.75%           
=======================================
  Files          23       23           
  Lines       17687    17687           
  Branches     3498     3498           
=======================================
  Hits        15522    15522           
  Misses       1062     1062           
  Partials     1103     1103

Flag	Coverage Δ
#c_tests	`89.03% <ø> (ø)`
#python_c_tests	`91.23% <ø> (ø)`
#python_tests	`98.99% <ø> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e495fd7...bebf6f6. Read the comment docs.

jeromekelleher · 2020-06-09T06:54:48Z

Great, thanks @gtsambos. Merging this much, so we can move on with the next phase of C implementation.

jeromekelleher reviewed Mar 19, 2020

View reviewed changes

python/tests/ibd.py Outdated Show resolved Hide resolved

jeromekelleher reviewed Mar 19, 2020

View reviewed changes

gtsambos force-pushed the ibd branch from 8da1901 to e51a20d Compare March 23, 2020 00:17

gtsambos force-pushed the ibd branch from e51a20d to 946cf24 Compare April 22, 2020 02:04

gtsambos marked this pull request as draft April 22, 2020 07:03

gtsambos force-pushed the ibd branch from 946cf24 to bd7f066 Compare April 26, 2020 23:37

gtsambos marked this pull request as ready for review April 27, 2020 01:14

petrelharp reviewed Apr 27, 2020

View reviewed changes

python/tests/ibd.py Outdated Show resolved Hide resolved

petrelharp reviewed Apr 27, 2020

View reviewed changes

python/tests/ibd.py Outdated Show resolved Hide resolved

petrelharp reviewed Apr 27, 2020

View reviewed changes

python/tests/test_ibd.py Show resolved Hide resolved

petrelharp reviewed Apr 27, 2020

View reviewed changes

python/tests/test_ibd.py Show resolved Hide resolved

jeromekelleher reviewed May 18, 2020

View reviewed changes

petrelharp reviewed May 18, 2020

View reviewed changes

python/tests/ibd.py Show resolved Hide resolved

petrelharp reviewed May 18, 2020

View reviewed changes

python/tests/ibd.py Show resolved Hide resolved

jeromekelleher reviewed Jun 8, 2020

View reviewed changes

gtsambos force-pushed the ibd branch from 31d6233 to 94ca30c Compare June 8, 2020 12:21

gtsambos added 4 commits June 9, 2020 09:22

gtsambos force-pushed the ibd branch from 94ca30c to bebf6f6 Compare June 8, 2020 23:22

jeromekelleher approved these changes Jun 9, 2020

View reviewed changes

jeromekelleher added the AUTOMERGE-REQUESTED label Jun 9, 2020

mergify bot merged commit fc7473b into tskit-dev:master Jun 9, 2020

mergify bot removed the AUTOMERGE-REQUESTED label Jun 9, 2020

gtsambos mentioned this pull request Aug 24, 2020

Keep path to ancient in simplify #782

Merged

		# Copyright (c) 2018-2020 Tskit Developers
		# Copyright (c) 2015-2018 University of Oxford

-        if self.ibd_segments[self.sample_pairs[index]] is None:
-            self.ibd_segments[self.sample_pairs[index]] = SegmentList(
-                head=seg, tail=seg
+        index = self.find_sample_pair_index(sample0, sample1)
+        key = self.sample_pairs[index]
+        if key not in self.ibd_segments[key]:
+            self.ibd_segments[key] = SegmentList(head=seg, tail=seg)

Python implementation of IBD methods. #488

Python implementation of IBD methods. #488

Uh oh!

Conversation

gtsambos commented Mar 18, 2020

IBD definitions

Uh oh!

jeromekelleher commented Mar 18, 2020

Uh oh!

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

petrelharp commented Mar 25, 2020

Uh oh!

gtsambos commented Apr 22, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Apr 22, 2020

Uh oh!

petrelharp commented Apr 22, 2020

Uh oh!

gtsambos commented Apr 26, 2020

Uh oh!

gtsambos commented Apr 27, 2020

Uh oh!

benjeffery commented Apr 27, 2020

Uh oh!

gtsambos commented Apr 27, 2020

Uh oh!

jeromekelleher commented Apr 27, 2020

Uh oh!

benjeffery commented Apr 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

petrelharp commented Apr 27, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

petrelharp commented Apr 27, 2020

Uh oh!

gtsambos commented Apr 27, 2020

Uh oh!

petrelharp commented Apr 28, 2020

Uh oh!

gtsambos commented Apr 28, 2020

Uh oh!

petrelharp commented Apr 28, 2020

Uh oh!

gtsambos commented Apr 28, 2020

Uh oh!

gtsambos commented May 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

petrelharp commented May 18, 2020

gtsambos commented Apr 22, 2020 •

edited

Loading

benjeffery commented Apr 27, 2020 •

edited

Loading

gtsambos commented May 18, 2020 •

edited

Loading

gtsambos commented May 19, 2020 •

edited

Loading

codecov bot commented Jun 8, 2020 •

edited

Loading