-
Notifications
You must be signed in to change notification settings - Fork 78
Implement incremental KC Distance on tree sequences #548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #548 +/- ##
==========================================
+ Coverage 87.22% 87.27% +0.04%
==========================================
Files 21 21
Lines 16282 16459 +177
Branches 3195 3234 +39
==========================================
+ Hits 14202 14364 +162
- Misses 1020 1029 +9
- Partials 1060 1066 +6
Continue to review full report at Codecov.
|
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good @daniel-goldstein, I like it. I see two ways of going from here, which depend on how we choose to implement this at the C level:
-
Pull everything that we're doing into the KCIncrementalTree, and no longer inherit from the SampleListTree. That is, we do everything ourselves, within this class, which will then look very like the underlying C implementation (in which we do everything from scratch).
-
Stop using the SampleListTree and instead use the high level Tree class. That is, we'd zip together the edge_diffs and tree objects, and therefore use the Tree class to do a lot of the work for us. We could then do the same thing at the C level, where we use the edge_diff_iter (or whatever it's called) and two instances of
tsk_tree_twhich we iterate along.
I'm leaning towards 2, as it's a bit easier to implement and reuses a good lot of infrastructure that we have lying around. It's probably slightly less efficient, but not dramatically so, and it would be good enough to give us an indication on whether it's worth doing a more perf-tuned version or not.
I don't think there's much point in doing any perf measurements on the current Python version - you don't learn a lot from this in my experience, as you end up having to do a lot of tuning to get around the obvious Python bottlenecks before getting at the actual algorithm performance.
What do you think?
|
I agree option 2 seems the better way forward. It saves a lot of duplication and would at the very least get most of the C implementation done before fine-tuning. Same goes for the perf, let’s hold off until it’s in C. |
2b1202f to
ef529d9
Compare
|
@jeromekelleher Changed to use the trees and edge diffs iterator at the same time. Do you want to get this in before moving on to the C implementation? |
ef529d9 to
425304c
Compare
Thanks @daniel-goldstein, this is much nicer! I think we'll just keep this in the one PR for now if you don't mind - things might evolve a bit as we try it out in C, and it's easier keep the whole thing in my head if it's all in one diff. The PR isn't too big anyway. We should follow up with the full-stack Tree.depth() operation first though, there's no point in reinventing this wheel. |
|
Now stacked on #554 |
3ff94c3 to
97c8d10
Compare
|
@jeromekelleher Made the changes we discussed and ready for review |
a30f9d9 to
49b9fb8
Compare
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks @daniel-goldstein. Some final comments and we're ready to merge.
|
@jeromekelleher Addressed |
|
I think this is ready to merge @daniel-goldstein? |
|
Yes, it is. Forgot to tag you @jeromekelleher |
|
Done! Can you open an issue flagging the KC-internal-samples issue please @daniel-goldstein? |
|
Done #575 :) @jeromekelleher |
|
Hi @jeromekelleher and @daniel-goldstein In calculating KC metric for tree sequence, did you consider polytomies? |
|
Yes, the KC metric does allow polytomies (although not unary nodes). It's not clear, however, that the treatment of polytomies will result in a metric that is useful for your specific use-case. For example, Pier Palamara (ARGneedle PI) is very skeptical of using KC for this reason: he thinks it doesn't penalise polytomies enough. One way of thinking about this is to ask how the metric with a polytomy compares to the average of the metrics for all possible binary resolutions of that polytomy. Usually, I think that the average is higher than the with-polytomy case. However, it's not clear that a metric exists that allows these to be the same. You can randomly resolve the polytomies if you like, using Tree.split_polytomies, then recalculate the KC metric, which will give you an alternative (noisier) metric. |
A high-level implementation of the KC distance on Tree sequences. The tree sequence extension of KC walks a long both tree sequences, taking the KC distance between corresponding trees. The distance is the sum of the individual tree KC distances weighted by the fraction of the sequence spanned by the pair of trees.
The KC distance focuses on pairs of leaves, specifically the depth and time-from-root for the mrca of a given pair of leaves. To maintain the depth and time vectors used in KC incrementally, for each edge
e, we must do an upward and downward traversal. While traversing up toward the root, we update the pairs of leaves where one leaf is in the subtree affected byeand one is not. Traversing down frome, we update all pairs of leaves where both leaves are in the subtree. Pairs where both leaves are outside of the subtree underehaven't been affected by the insertion/removal of that edge.