Skip to content

Conversation

@daniel-goldstein
Copy link
Member

@daniel-goldstein daniel-goldstein commented Apr 20, 2020

A high-level implementation of the KC distance on Tree sequences. The tree sequence extension of KC walks a long both tree sequences, taking the KC distance between corresponding trees. The distance is the sum of the individual tree KC distances weighted by the fraction of the sequence spanned by the pair of trees.

The KC distance focuses on pairs of leaves, specifically the depth and time-from-root for the mrca of a given pair of leaves. To maintain the depth and time vectors used in KC incrementally, for each edge e, we must do an upward and downward traversal. While traversing up toward the root, we update the pairs of leaves where one leaf is in the subtree affected by e and one is not. Traversing down from e, we update all pairs of leaves where both leaves are in the subtree. Pairs where both leaves are outside of the subtree under e haven't been affected by the insertion/removal of that edge.

@daniel-goldstein daniel-goldstein changed the title Implement an incremental KC Distance on tree sequences Implement high-level incremental KC Distance on tree sequences Apr 20, 2020
@codecov
Copy link

codecov bot commented Apr 20, 2020

Codecov Report

Merging #548 into master will increase coverage by 0.04%.
The diff coverage is 90.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #548      +/-   ##
==========================================
+ Coverage   87.22%   87.27%   +0.04%     
==========================================
  Files          21       21              
  Lines       16282    16459     +177     
  Branches     3195     3234      +39     
==========================================
+ Hits        14202    14364     +162     
- Misses       1020     1029       +9     
- Partials     1060     1066       +6     
Flag Coverage Δ
#c_tests 88.44% <90.45%> (+0.04%) ⬆️
#python_c_tests 90.34% <93.33%> (+<0.01%) ⬆️
#python_tests ?
Impacted Files Coverage Δ
c/tskit/trees.c 90.91% <90.25%> (+0.02%) ⬆️
python/_tskitmodule.c 83.97% <92.30%> (+0.02%) ⬆️
c/tskit/core.c 90.60% <100.00%> (+0.06%) ⬆️
python/tskit/trees.py 98.64% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a86e661...b76bc5b. Read the comment docs.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good @daniel-goldstein, I like it. I see two ways of going from here, which depend on how we choose to implement this at the C level:

  1. Pull everything that we're doing into the KCIncrementalTree, and no longer inherit from the SampleListTree. That is, we do everything ourselves, within this class, which will then look very like the underlying C implementation (in which we do everything from scratch).

  2. Stop using the SampleListTree and instead use the high level Tree class. That is, we'd zip together the edge_diffs and tree objects, and therefore use the Tree class to do a lot of the work for us. We could then do the same thing at the C level, where we use the edge_diff_iter (or whatever it's called) and two instances of tsk_tree_t which we iterate along.

I'm leaning towards 2, as it's a bit easier to implement and reuses a good lot of infrastructure that we have lying around. It's probably slightly less efficient, but not dramatically so, and it would be good enough to give us an indication on whether it's worth doing a more perf-tuned version or not.

I don't think there's much point in doing any perf measurements on the current Python version - you don't learn a lot from this in my experience, as you end up having to do a lot of tuning to get around the obvious Python bottlenecks before getting at the actual algorithm performance.

What do you think?

@daniel-goldstein
Copy link
Member Author

I agree option 2 seems the better way forward. It saves a lot of duplication and would at the very least get most of the C implementation done before fine-tuning. Same goes for the perf, let’s hold off until it’s in C.

@daniel-goldstein daniel-goldstein marked this pull request as ready for review April 23, 2020 03:05
@daniel-goldstein
Copy link
Member Author

@jeromekelleher Changed to use the trees and edge diffs iterator at the same time. Do you want to get this in before moving on to the C implementation?

@jeromekelleher
Copy link
Member

@jeromekelleher Changed to use the trees and edge diffs iterator at the same time. Do you want to get this in before moving on to the C implementation?

Thanks @daniel-goldstein, this is much nicer! I think we'll just keep this in the one PR for now if you don't mind - things might evolve a bit as we try it out in C, and it's easier keep the whole thing in my head if it's all in one diff. The PR isn't too big anyway.

We should follow up with the full-stack Tree.depth() operation first though, there's no point in reinventing this wheel.

@daniel-goldstein
Copy link
Member Author

Now stacked on #554

@daniel-goldstein daniel-goldstein marked this pull request as draft April 24, 2020 13:26
@daniel-goldstein daniel-goldstein changed the title Implement high-level incremental KC Distance on tree sequences Implement incremental KC Distance on tree sequences Apr 24, 2020
@daniel-goldstein daniel-goldstein force-pushed the kc-seq branch 7 times, most recently from 3ff94c3 to 97c8d10 Compare May 1, 2020 16:30
@daniel-goldstein
Copy link
Member Author

daniel-goldstein commented May 1, 2020

@jeromekelleher Made the changes we discussed and ready for review

@daniel-goldstein daniel-goldstein marked this pull request as ready for review May 1, 2020 17:24
@daniel-goldstein daniel-goldstein force-pushed the kc-seq branch 2 times, most recently from a30f9d9 to 49b9fb8 Compare May 3, 2020 19:59
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @daniel-goldstein. Some final comments and we're ready to merge.

@daniel-goldstein
Copy link
Member Author

@jeromekelleher Addressed

@jeromekelleher
Copy link
Member

I think this is ready to merge @daniel-goldstein?

@daniel-goldstein
Copy link
Member Author

Yes, it is. Forgot to tag you @jeromekelleher

@jeromekelleher jeromekelleher merged commit b872e37 into tskit-dev:master May 5, 2020
@jeromekelleher
Copy link
Member

Done! Can you open an issue flagging the KC-internal-samples issue please @daniel-goldstein?

@daniel-goldstein
Copy link
Member Author

Done #575 :) @jeromekelleher

@frezaei98
Copy link

Hi @jeromekelleher and @daniel-goldstein

In calculating KC metric for tree sequence, did you consider polytomies?
I mean: if a tree has polytomies, how calculate KC metric?

@jeromekelleher
Copy link
Member

Polytomies are captured directly by the metric itself (see the paper), although the interpretation is somewhat arguable. This is discussed in the tsinfer and ARGneedle papers, I believe (@hyanwong may have further details)

@hyanwong
Copy link
Member

Yes, the KC metric does allow polytomies (although not unary nodes). It's not clear, however, that the treatment of polytomies will result in a metric that is useful for your specific use-case. For example, Pier Palamara (ARGneedle PI) is very skeptical of using KC for this reason: he thinks it doesn't penalise polytomies enough.

One way of thinking about this is to ask how the metric with a polytomy compares to the average of the metrics for all possible binary resolutions of that polytomy. Usually, I think that the average is higher than the with-polytomy case. However, it's not clear that a metric exists that allows these to be the same.

You can randomly resolve the polytomies if you like, using Tree.split_polytomies, then recalculate the KC metric, which will give you an alternative (noisier) metric.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants