Implement incremental KC Distance on tree sequences #548

daniel-goldstein · 2020-04-20T16:45:13Z

A high-level implementation of the KC distance on Tree sequences. The tree sequence extension of KC walks a long both tree sequences, taking the KC distance between corresponding trees. The distance is the sum of the individual tree KC distances weighted by the fraction of the sequence spanned by the pair of trees.

The KC distance focuses on pairs of leaves, specifically the depth and time-from-root for the mrca of a given pair of leaves. To maintain the depth and time vectors used in KC incrementally, for each edge e, we must do an upward and downward traversal. While traversing up toward the root, we update the pairs of leaves where one leaf is in the subtree affected by e and one is not. Traversing down from e, we update all pairs of leaves where both leaves are in the subtree. Pairs where both leaves are outside of the subtree under e haven't been affected by the insertion/removal of that edge.

codecov · 2020-04-20T17:24:34Z

Codecov Report

Merging #548 into master will increase coverage by 0.04%.
The diff coverage is 90.56%.

@@            Coverage Diff             @@
##           master     #548      +/-   ##
==========================================
+ Coverage   87.22%   87.27%   +0.04%     
==========================================
  Files          21       21              
  Lines       16282    16459     +177     
  Branches     3195     3234      +39     
==========================================
+ Hits        14202    14364     +162     
- Misses       1020     1029       +9     
- Partials     1060     1066       +6

Flag	Coverage Δ
#c_tests	`88.44% <90.45%> (+0.04%)`	⬆️
#python_c_tests	`90.34% <93.33%> (+<0.01%)`	⬆️
#python_tests	`?`

Impacted Files	Coverage Δ
c/tskit/trees.c	`90.91% <90.25%> (+0.02%)`	⬆️
python/_tskitmodule.c	`83.97% <92.30%> (+0.02%)`	⬆️
c/tskit/core.c	`90.60% <100.00%> (+0.06%)`	⬆️
python/tskit/trees.py	`98.64% <100.00%> (+<0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a86e661...b76bc5b. Read the comment docs.

jeromekelleher

This looks really good @daniel-goldstein, I like it. I see two ways of going from here, which depend on how we choose to implement this at the C level:

Pull everything that we're doing into the KCIncrementalTree, and no longer inherit from the SampleListTree. That is, we do everything ourselves, within this class, which will then look very like the underlying C implementation (in which we do everything from scratch).
Stop using the SampleListTree and instead use the high level Tree class. That is, we'd zip together the edge_diffs and tree objects, and therefore use the Tree class to do a lot of the work for us. We could then do the same thing at the C level, where we use the edge_diff_iter (or whatever it's called) and two instances of tsk_tree_t which we iterate along.

I'm leaning towards 2, as it's a bit easier to implement and reuses a good lot of infrastructure that we have lying around. It's probably slightly less efficient, but not dramatically so, and it would be good enough to give us an indication on whether it's worth doing a more perf-tuned version or not.

I don't think there's much point in doing any perf measurements on the current Python version - you don't learn a lot from this in my experience, as you end up having to do a lot of tuning to get around the obvious Python bottlenecks before getting at the actual algorithm performance.

What do you think?

python/tests/test_topology.py

daniel-goldstein · 2020-04-21T12:00:43Z

I agree option 2 seems the better way forward. It saves a lot of duplication and would at the very least get most of the C implementation done before fine-tuning. Same goes for the perf, let’s hold off until it’s in C.

daniel-goldstein · 2020-04-23T03:05:59Z

@jeromekelleher Changed to use the trees and edge diffs iterator at the same time. Do you want to get this in before moving on to the C implementation?

python/tests/test_topology.py

jeromekelleher · 2020-04-23T07:36:52Z

@jeromekelleher Changed to use the trees and edge diffs iterator at the same time. Do you want to get this in before moving on to the C implementation?

Thanks @daniel-goldstein, this is much nicer! I think we'll just keep this in the one PR for now if you don't mind - things might evolve a bit as we try it out in C, and it's easier keep the whole thing in my head if it's all in one diff. The PR isn't too big anyway.

We should follow up with the full-stack Tree.depth() operation first though, there's no point in reinventing this wheel.

daniel-goldstein · 2020-04-23T16:20:38Z

Now stacked on #554

python/tests/test_topology.py

daniel-goldstein · 2020-05-01T17:24:20Z

@jeromekelleher Made the changes we discussed and ready for review

jeromekelleher

Looks great, thanks @daniel-goldstein. Some final comments and we're ready to merge.

c/tskit/trees.c

python/_tskitmodule.c

python/tskit/trees.py

daniel-goldstein · 2020-05-04T14:43:28Z

@jeromekelleher Addressed

c/tskit/trees.c

jeromekelleher · 2020-05-05T14:55:54Z

I think this is ready to merge @daniel-goldstein?

daniel-goldstein · 2020-05-05T15:28:46Z

Yes, it is. Forgot to tag you @jeromekelleher

jeromekelleher · 2020-05-05T15:56:58Z

Done! Can you open an issue flagging the KC-internal-samples issue please @daniel-goldstein?

daniel-goldstein · 2020-05-05T15:59:54Z

Done #575 :) @jeromekelleher

frezaei98 · 2023-10-31T05:55:32Z

Hi @jeromekelleher and @daniel-goldstein

In calculating KC metric for tree sequence, did you consider polytomies?
I mean: if a tree has polytomies, how calculate KC metric?

jeromekelleher · 2023-10-31T09:03:20Z

Polytomies are captured directly by the metric itself (see the paper), although the interpretation is somewhat arguable. This is discussed in the tsinfer and ARGneedle papers, I believe (@hyanwong may have further details)

hyanwong · 2023-10-31T11:08:09Z

Yes, the KC metric does allow polytomies (although not unary nodes). It's not clear, however, that the treatment of polytomies will result in a metric that is useful for your specific use-case. For example, Pier Palamara (ARGneedle PI) is very skeptical of using KC for this reason: he thinks it doesn't penalise polytomies enough.

One way of thinking about this is to ask how the metric with a polytomy compares to the average of the metrics for all possible binary resolutions of that polytomy. Usually, I think that the average is higher than the with-polytomy case. However, it's not clear that a metric exists that allows these to be the same.

You can randomly resolve the polytomies if you like, using Tree.split_polytomies, then recalculate the KC metric, which will give you an alternative (noisier) metric.

daniel-goldstein changed the title ~~Implement an incremental KC Distance on tree sequences~~ Implement high-level incremental KC Distance on tree sequences Apr 20, 2020

jeromekelleher reviewed Apr 21, 2020

View reviewed changes

python/tests/test_topology.py Outdated Show resolved Hide resolved

python/tests/test_topology.py Outdated Show resolved Hide resolved

daniel-goldstein force-pushed the kc-seq branch from 2b1202f to ef529d9 Compare April 23, 2020 02:59

daniel-goldstein marked this pull request as ready for review April 23, 2020 03:05

daniel-goldstein commented Apr 23, 2020

View reviewed changes

python/tests/test_topology.py Outdated Show resolved Hide resolved

daniel-goldstein force-pushed the kc-seq branch from ef529d9 to 425304c Compare April 23, 2020 03:07

daniel-goldstein marked this pull request as draft April 24, 2020 13:26

daniel-goldstein changed the title ~~Implement high-level incremental KC Distance on tree sequences~~ Implement incremental KC Distance on tree sequences Apr 24, 2020

daniel-goldstein commented Apr 24, 2020

View reviewed changes

python/tests/test_topology.py Show resolved Hide resolved

daniel-goldstein force-pushed the kc-seq branch 7 times, most recently from 3ff94c3 to 97c8d10 Compare May 1, 2020 16:30

daniel-goldstein marked this pull request as ready for review May 1, 2020 17:24

daniel-goldstein force-pushed the kc-seq branch 2 times, most recently from a30f9d9 to 49b9fb8 Compare May 3, 2020 19:59

jeromekelleher reviewed May 4, 2020

View reviewed changes

daniel-goldstein force-pushed the kc-seq branch from 49b9fb8 to 95e6a4e Compare May 4, 2020 14:43

daniel-goldstein force-pushed the kc-seq branch from 95e6a4e to 9eb48b9 Compare May 4, 2020 14:45

jeromekelleher reviewed May 4, 2020

View reviewed changes

c/tskit/trees.c Outdated Show resolved Hide resolved

daniel-goldstein force-pushed the kc-seq branch from 9eb48b9 to 820f529 Compare May 4, 2020 17:28

jeromekelleher reviewed May 4, 2020

View reviewed changes

c/tskit/trees.c Outdated Show resolved Hide resolved

Implement KC Distance on tree sequences with an incremental algorithm

b76bc5b

daniel-goldstein force-pushed the kc-seq branch from 820f529 to b76bc5b Compare May 5, 2020 13:58

jeromekelleher merged commit b872e37 into tskit-dev:master May 5, 2020

brieuclehmann mentioned this pull request Oct 8, 2020

Add function for covariance matrix #898

Merged

brianzhang01 mentioned this pull request Nov 19, 2020

Random split polytomy #815

Merged

Implement incremental KC Distance on tree sequences #548

Implement incremental KC Distance on tree sequences #548

Uh oh!

Conversation

daniel-goldstein commented Apr 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Apr 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

daniel-goldstein commented Apr 21, 2020

Uh oh!

daniel-goldstein commented Apr 23, 2020

Uh oh!

Uh oh!

jeromekelleher commented Apr 23, 2020

Uh oh!

daniel-goldstein commented Apr 23, 2020

Uh oh!

Uh oh!

daniel-goldstein commented May 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

daniel-goldstein commented May 4, 2020

Uh oh!

Uh oh!

Uh oh!

jeromekelleher commented May 5, 2020

Uh oh!

daniel-goldstein commented May 5, 2020

Uh oh!

jeromekelleher commented May 5, 2020

Uh oh!

daniel-goldstein commented May 5, 2020

Uh oh!

frezaei98 commented Oct 31, 2023

Uh oh!

jeromekelleher commented Oct 31, 2023

Uh oh!

hyanwong commented Oct 31, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

daniel-goldstein commented Apr 20, 2020 •

edited

Loading

codecov bot commented Apr 20, 2020 •

edited

Loading

daniel-goldstein commented May 1, 2020 •

edited

Loading