Skip to content

Conversation

@gtsambos
Copy link
Member

Pulling this in now so that others can see my continuing progress on IBD-calculating methods for tskit (cc @jeromekelleher for our call tomorrow). It's very basic so far but it should work:

Command-line implementation:

python3 tests/ibd.py test.trees

should return IBD segments shared between all pairs of samples.

python3 tests/ibd.py test.trees 0 1

should return IBD segments shared between samples 0 and 1 (ensure the nodes are ordered).

Tests:

python3 -m nose -vs tests/test_ibd.py

IBD definitions

A pair of sample nodes that have both inherited the interval (left, right) from a single ancestor, node, are IBD over the interval (left, right).

Further options:

  • path_ibd: If True, the IBD segment must be inherited by two samples from a common ancestor without being broken by recombination since the time of that ancestor. Equivalently, the path of the tree connecting the sample pair must be the same along the whole span of the interval.
  • mrca_ibd: If True, only IBD segments at the MRCA of the sample pair are recorded.

Cutoff criteria:

  • min_length: Only segments longer than min_length are returned.
  • max_time: Only segments from common ancestors who lived more recently than max_time are returned.

Note: currently, the only implemented version is path_ibd=True, mrca_ibd=True, min_length=0, max_time=0, but watch this space!

@jeromekelleher
Copy link
Member

Great, thanks @gtsambos, looking forward go going through this with you tomorrow.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good @gtsambos, this is great progress. As we discussed, I think the next steps are:

  1. Add some more tests. Small examples with previously worked out answers would be ideal, across a few different types of funky topologies.
  2. Start C-ifying the main algorithm. This is mainly a case of (a) changing the encoding of pairs (0, 0), (0, 1) etc to indexes 0, 1, ..., n**2 - 1; (b) changing the structure from creating temporary mappings that are moved between functions to global state that is modified in place by methods.

Comment on lines 3 to 4
# Copyright (c) 2018-2020 Tskit Developers
# Copyright (c) 2015-2018 University of Oxford
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't exist until now, so change to

# Copyright (c) 2020 Tskit Developers

@petrelharp
Copy link
Contributor

This is great!! Let me know if I can help write some tests or something.

@gtsambos
Copy link
Member Author

gtsambos commented Apr 22, 2020

Hi @jeromekelleher and @petrelharp (also cc @dvukcevic), this has come along a bit in the last few days after some weeks of muddy-headedness! I have one or two relatively easy things I'll do over the next few days -- cleaning and documenting the code, adding some other tests, making the CI tests work. Otherwise I think this should be ready to review quite soon, probably before next week.

Start C-ifying the main algorithm. This is mainly a case of (a) changing the encoding of pairs (0, 0), (0, 1) etc to indexes 0, 1, ..., n**2 - 1; (b) changing the structure from creating temporary mappings that are moved between functions to global state that is modified in place by methods.

(b) was pretty easy to do; (a) has been much more complicated. I have created a new object, SegmentList, to hold linked lists of Segment objects. (Confusingly, most of the algorithm works with lists of SegmentLists -- ie. lists of lists of segments). For now, I have retained the representation of the output, ibd_segments, as a Python dictionary, with the keys being the sample pairs and the values being the list of IBD segments for each pair. However, in the C code this will be a struct array holding SegmentList objects. I've added some methods that mimic the procedure that will be needed to 'look up' the correct array element in C, even though this wasn't strictly necessary to make the Python code work. I'm happy to talk through this on a call or elsewhere.

This is great!! Let me know if I can help write some tests or something.

Cool, thanks @petrelharp! I have a few others to add in myself; I'll let you know once I've done these so you can see if they look comprehensive enough.

@gtsambos gtsambos marked this pull request as draft April 22, 2020 07:03
@jeromekelleher
Copy link
Member

Sounds good @gtsambos, please do ping me if you want any input in the meantime.

@petrelharp
Copy link
Contributor

Me, too! Looking forward to it.

@gtsambos
Copy link
Member Author

ready to review ✨

@gtsambos
Copy link
Member Author

Oh yeah, I totally forgot that discrete=True won't work in the CI tests. Should I just comment these out until the new msprime release? I think that's the only reason why the CI is failing

@gtsambos gtsambos marked this pull request as ready for review April 27, 2020 01:14
@benjeffery
Copy link
Member

Should I just comment these out until the new msprime release?

I think we can assume discrete=True will be in the next release so how about:

if msprime._version.version >= '0.7.5':
    .... TEST

@gtsambos
Copy link
Member Author

okay will do, thanks @benjeffery!

@jeromekelleher
Copy link
Member

A good way to do this is to use @unittest.skipIf(msprime.__version__ <= 0.7.4)

@benjeffery
Copy link
Member

benjeffery commented Apr 27, 2020

Nice one! Wasn't aware of that decorator!
This way also tells us that the test is being skipped in test output so is extra nice.

ts = treeSequence
ibd0 = ibd.IbdFinder(ts, samples=ts.samples())
ibd0 = ibd0.find_ibd_segments()
ibd1 = get_ibd_all_pairs(ts, path_ibd=True, mrca_ibd=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a good place to test the samples argument, by also comparing output on a random subset of the samples.

@petrelharp
Copy link
Contributor

I'll have a read-through - unless someone else is in the middle of it already?

@petrelharp
Copy link
Contributor

Looking through... looking good... I'll say when I'm done. In the meantime, a comment: I see that you're doing the thing that I did on my last PR, which is to make the python implementation as much like the C implementation as possible, including building linked lists explicitly with pointers to the next one, etcetera. BUT THEN, @jeromekelleher told me to be more pythonic, and leave that sort of thing up to python, thus leaving the basic structure of the algorithm more clear without having to worry about who's linked to who, etcetera. For instance, your SegmentList could probably just be a list, without defining a special class?

I think he's right - it helps to separate the logic of the implementation from the fiddly bookkeeping details. But, on the other hand, I know for me it was helpful to do all this stuff in python before trying to do it in C. So, my suggestion is to at some point simplify these data structures to use more python - maybe now, but up to you.

@gtsambos
Copy link
Member Author

Thanks for looking at this so quickly @petrelharp! I had a call with @jeromekelleher last night and we went over this together then as well. I'll comment on some of the specific things up above. About this:

I see that you're doing the thing that I did on my last PR, which is to make the python implementation as much like the C implementation as possible

For instance, your SegmentList could probably just be a list, without defining a special class?

I was trying to make this pretty C-like, yes. I think I'll keep my SegmentLists as they are, because I'm using them in a pretty involved way. In this IBD algorithm, we need to be able to add lists together/do other operations on them, rather than just cycling through elements of the list as we do in simplify and link_to_ancestors. Also, the output of the C-based algorithm will have to be an array of SegmentList objects (each one listing the IBD ancestors/segments for a particular sample pair), so it makes sense to me that we'd have also it in Python in order to compare the output between the two implementations. There are quite a few methods I'll have to define in C to do these things, to the point where I thought it would look a bit unrecognisable if I omitted them here.

Are there any things in this code other than the SegmentList objects that could be simplified, though?

@petrelharp
Copy link
Contributor

we need to be able to add lists together/do other operations on them, rather than just cycling through elements of the list

Ah, ok. I haven't gotten there yet.

Are there any things in this code other than the SegmentList objects that could be simplified, though?

I'm getting there! Had some important mucking about in the yard to do first, though.

@gtsambos
Copy link
Member Author

btw, some other things we talked about were

in ibd.py:

  • changing this add method to the more explicit SegmentList().add(...) instead of using +
  • changing SegL_head and SegL_tail to be local variables inside the find_ibd_segments method
  • initialising ibd_segments with None rather than with SegmentList()

in test_ibd.py

  • Simplifying the 'naive' get_ibd method to only test MRCA ibd (and so making the function clearer to follow)
  • replacing assert statements with self.assert statements (look into defining a subclass of unittest.TestCase specific to these IBD tests)
  • Generating some weird random tree sequences with the forward-time DTWF simulator

@petrelharp
Copy link
Contributor

Those sound good. Maybe I should wait 'til you've done those things to finish the careful read-through?

@gtsambos
Copy link
Member Author

sure :) hopefully will be done this week! I'd be really appreciative if you could look at the tests though, I'm not sure I put in as many wacky topologies as I could have

@gtsambos
Copy link
Member Author

gtsambos commented May 18, 2020

Hmm, would it okay to add the packaging module to development.txt? It looks smallish, and I think I need this to safely do comparisons of versions like we discussed here. This is why the CI is currently failing btw.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @gtsambos, much simpler and cleaner! I think we need to do one final review of the data structures to make sure that it's all as simple as it can be - I'm not convinced the SegmentList structure is really needed now, for example. Maintaining the list of children nodes for each parent will be pretty tedious in C also, so if there was a less "nested array" way of doing that it would be nice. We currently don't have any hash tables or AVL trees in C in tskit, and it's a lot of hassle bringing them in, so it's worth really thinking about what we need data structure wise.

# find_sample_pair_index method further down.

self.ibd_segments = {}
for key in self.sample_pairs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's much point in storing None for all the sample_pairs, right? We can just check to see if the key is already in the array or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's talk about this

Comment on lines +183 to +190
if self.ibd_segments[self.sample_pairs[index]] is None:
self.ibd_segments[self.sample_pairs[index]] = SegmentList(
head=seg, tail=seg
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if self.ibd_segments[self.sample_pairs[index]] is None:
self.ibd_segments[self.sample_pairs[index]] = SegmentList(
head=seg, tail=seg
index = self.find_sample_pair_index(sample0, sample1)
key = self.sample_pairs[index]
if key not in self.ibd_segments[key]:
self.ibd_segments[key] = SegmentList(head=seg, tail=seg)

Although, it would probably be better to map directly to the indexes, so that it'll be easier to compare with the C code (it's easy enough to convert the index to an actual tuple when printing out debug stuff.)

@petrelharp
Copy link
Contributor

It looks a lot simpler, nice! I've got nothing else to add. I'm curious, though: where does this leave us for squashing edges - will require a refactor?

@gtsambos
Copy link
Member Author

gtsambos commented May 19, 2020

Hi @jeromekelleher, about this:

I'm not convinced the SegmentList structure is really needed now, for example.

I think we could do without it (and am trying to mod it like this now), but getting rid of it comes at the cost of adding a couple of new add_segment_to_list methods that are all basically doing the same thing

@gtsambos
Copy link
Member Author

Hi @jeromekelleher, I'm just now thinking more deeply about if we were to get rid of SegmentLists. One issue is that the current output of this algorithm is currently SegmentLists (and it will have to be something like this, as this is basically what raw IBD information is).

Updating these lists as the algorithm iterates isn't going to be quite as simple as before. Without having segment lists, we have two options:

  1. For each pair of samples, keep track of both the head and the tail segment. Each time a new IBD segment is found, add it to the tail. We'll need to make a struct array for the heads (indexed by the sample pair), and another for the tails (also indexed by the sample pair), and will have to update these array elements each time a new segment is found.

  2. For each pair of IBD segments, we keep track of only the head segment. Everytime we want to add a new ibd segment, we will have to cycle through all of the segments starting from the first one.

(1) seems a little more ideal to me because of its efficiency?

@jeromekelleher
Copy link
Member

OK, great, let's keep the SegmentLists then @gtsambos - just wanted to make sure we still needed them.

@gtsambos
Copy link
Member Author

Hi @petrelharp @jeromekelleher , sorry for not responding to a lot of this -- I've been very busy with other time-critical projects but have set aside this afternoon to go through these!

I'm curious, though: where does this leave us for squashing edges - will require a refactor?

I think that as long as we keep SegmentLists, the 'refactoring' will mostly involve adding new code rather than modifying this existing code base -- most immediately, I can see that we'd need a method for removing a segment from a SegmentList and patching it up (as opposed to just adding them in, as we need here), and we'd also need a SegmentList-type object or set of objects to hold the edges once they have been squashed..

I haven't thought much about the AVL/hash table stuff, and to be honest I feel a little out of my depth -- I think I'll need to read and learn more about these before I'm able to have a proper conversation about it.

@gtsambos
Copy link
Member Author

gtsambos commented Jun 8, 2020

Hmmm, this build looks like it's failing for some reason unrelated to this pull request (something wrong with the pysam library?)

@jeromekelleher
Copy link
Member

Yep, Pysam thing is unrelated @gtsambos - see #673

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor change here @gtsambos, and we can merge then I think. As discussed, let's follow up with with an initial C implementation, not worrying too much about optimum data structures.

# Set up an iterator over the edges in the tree sequence.
edges_iter = iter(self.ts.edges())
e = next(edges_iter)
PARENT_SHOULD_BE_ADDED = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this ALL_CAPS? It looks like a regular variable to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay, I've changed this in the latest commit. I think I'm used to seeing boolean variables being capitalised, but it's no big deal to me either way

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ALL_CAPS is usually reserved for constants, I think.

@gtsambos
Copy link
Member Author

gtsambos commented Jun 8, 2020

One minor change here @gtsambos, and we can merge then I think. As discussed, let's follow up with with an initial C implementation, not worrying too much about optimum data structures.

Great! I'll probably need to do some squashing and rebasing as well, but only once you're happy with the code itself.

@jeromekelleher
Copy link
Member

OK, great! Can you rebase and squash please? Don't worry about the travis builds, we'll merge once they've been sorted out.

@gtsambos
Copy link
Member Author

gtsambos commented Jun 8, 2020

Cool -- done

@benjeffery
Copy link
Member

@gtsambos #673 is fixed and merged so please go ahead and rebase to clear the CI.

gtsambos added 4 commits June 9, 2020 09:22
Baby commit of IBD tests.

Working, but messy Python implementation of fast IBD algorithm.

ibdFinder now returns segments rather than intervals.

verify_equal_ibd function has been changed

Bug fix.

Rudimentary tests of IbdFinder for path_IBD=true, mrca_IBD=true seem to work.
min-length and max-time argument of IBDFinder should now be working.

Added a new test class, TestIbdTopologies.

Added functions to test for equality of ibd segments.

extra test topologies in test_ibd.py

Converted list of ancetral segments into more C-like objects.
Added to SegmentList class and changed all uses of A, S to use the SegmentList class.

Fixed bug.

Added more examples to test_ibd.py

ibd_segments is now an attribute of the IbdFinder class

Removed new_segs

Infrastructure for converting sample pairs to index in struct array.

ibd_segments now contains linked lists

Fixed bug, modified tests to convert SegmentList output into normal python lists.

Neatened the IBD tests.

Added some documentation.

Added more tests, code to deal with fringe cases.

Jerome and Peter's latest comments.

Peter's comments.

Updated IBD tests with forward time DTWF tests.
small bugfix

forward DTWF sims are now in tests

Samples no longer need to be sorted, sample_id_map is part of the constructor.

mygen renamed to edges_iter

current_time, current_parent and PARENT_TO_BE_ADDED are now local to find_ibd_segments()

More small changes

Removed dependency on msprime v1.0

Removed parent_list, added simpler oldest_parent attribute

Removed processed_nodes object

Small change to variable name.
@gtsambos
Copy link
Member Author

gtsambos commented Jun 8, 2020

great, thanks @benjeffery!

@codecov
Copy link

codecov bot commented Jun 8, 2020

Codecov Report

Merging #488 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #488   +/-   ##
=======================================
  Coverage   87.75%   87.75%           
=======================================
  Files          23       23           
  Lines       17687    17687           
  Branches     3498     3498           
=======================================
  Hits        15522    15522           
  Misses       1062     1062           
  Partials     1103     1103           
Flag Coverage Δ
#c_tests 89.03% <ø> (ø)
#python_c_tests 91.23% <ø> (ø)
#python_tests 98.99% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e495fd7...bebf6f6. Read the comment docs.

@jeromekelleher
Copy link
Member

Great, thanks @gtsambos. Merging this much, so we can move on with the next phase of C implementation.

@mergify mergify bot merged commit fc7473b into tskit-dev:master Jun 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants