Skip to content

Conversation

@daniel-goldstein
Copy link
Member

@daniel-goldstein daniel-goldstein commented Mar 18, 2020

Computing the KC distance requires building two vectors per tree, m and M, where each element records, for some leaves u & v, the distance of mrca(u, v) to the root in number of edges and time, respectively. The existing implementation was essentially the above explanation, computing the mrca for each pair, which for n leaves, takes O(n^2 log(n)) time.

This PR implements a slightly different approach, where we do a preorder traversal of the tree, accumulating the depth and time from the root. For each internal node, we can take the Cartesian product of the node's children's leaves and assign each pair the current depth and time. We use sample_lists to iterate through the leaves without any extra traversal overhead, erroring out when sample_lists is not available. Since each leaf pair is considered once, the overall runtime is O(n^2). This approach is similar to that used in the official R package for KC-distance, tree space.

@daniel-goldstein daniel-goldstein changed the title Make KC distance quadratic in number of nodes. Make KC distance linear in size of KC vectors Mar 18, 2020
@daniel-goldstein daniel-goldstein force-pushed the linear-kc branch 2 times, most recently from cf84b52 to cf6b68b Compare March 18, 2020 19:36
@codecov
Copy link

codecov bot commented Mar 18, 2020

Codecov Report

Merging #490 into master will decrease coverage by 0.02%.
The diff coverage is 87.39%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
- Coverage   86.84%   86.81%   -0.03%     
==========================================
  Files          21       21              
  Lines       15944    15969      +25     
  Branches     3095     3102       +7     
==========================================
+ Hits        13847    13864      +17     
- Misses       1050     1055       +5     
- Partials     1047     1050       +3     
Flag Coverage Δ
#c_tests 87.73% <87.39%> (-0.05%) ⬇️
#python_c_tests 90.52% <ø> (ø)
#python_tests 99.18% <ø> (ø)
Impacted Files Coverage Δ
c/tskit/trees.c 90.71% <87.39%> (-0.22%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cbeb129...53a4756. Read the comment docs.

@jeromekelleher
Copy link
Member

This looks great, thanks @daniel-goldstein. One basic question though: isn't the LeafTracker basically the same things as the SAMPLE_LISTS option for tsk_tree_t? For example, we use this in genotypes.c to quickly find the set of samples that descend from a given mutation when we're decoding genotypes. Ironically, though, it turns out that doing simple traversals like this are often more efficient than using the sample list feature. This is probably because the algorithm for maintaining sample lists is a little stupid: we basically traverse upwards each time and rebuild the parts of the linked list on each update. I tried really hard to make this better though and failed --- it's very tricky. Also, though, we're only looking for sample lists for a small number of nodes per tree when decoding mutations. We'll be doing lots more for KC.

Anyway, the short version is, I don't think there's much point in having an extra LeafTracker class and we might as well just use the SampleList feature of the trees. The simplest thing to do is to raise an error if the tree doesn't have this feature enabled when kc_distance is called.

Looking forward to talking through the rest of the details on Friday.

@daniel-goldstein
Copy link
Member Author

AH ok, this makes sense. I thought we were discussing the use of something like the leaf tracker over the sideways links in the quintuply linked tree structure. Either way, glad I can pare down this PR then, not a big deal to change how we iterate through the leaves. Will discuss more tomorrow then.

@daniel-goldstein
Copy link
Member Author

daniel-goldstein commented Mar 25, 2020

Still got a few more tweaks left. Running some benchmarks to see how KC performs.

@daniel-goldstein
Copy link
Member Author

Ran some benchmarks on holly comparing master and this branch. Below shows the runtime for a simple msprime simulation, no params except the number of samples and a fixed seed. Then run the KC distance of the first tree against itself.

Screen Shot 2020-03-26 at 11 39 53 AM

OOM errors start ~46k samples for both branches. Just under that, at 40k, this branch still runs at ~46s.

@jeromekelleher
Copy link
Member

Great work @daniel-goldstein. Is this ready for final review?

@daniel-goldstein
Copy link
Member Author

daniel-goldstein commented Mar 26, 2020

Great work @daniel-goldstein. Is this ready for final review?

Thanks @jeromekelleher! I left a few comments I’d like your thoughts on. Otherwise good for review.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @daniel-goldstein! I'd like to dig in to the perf side of things before moving on here. Can you try out the restrict pointer stuff for the all-pairs update please? This should make a bit of a difference. Also interesting to see where the time is spent - using perf record on holly for one of your big benchmarks should give good insights.

@daniel-goldstein daniel-goldstein force-pushed the linear-kc branch 4 times, most recently from 270561f to 37cc4c5 Compare March 27, 2020 18:16
@daniel-goldstein
Copy link
Member Author

Determined that a good chunk of time (~40%) is spent in moving data into the KC vectors. The leaf traversal order does not align well with the positions in the vectors (which are O(n^2) in space) and so gets very poor cache behavior. Leaving a comment in the code to explain this.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @daniel-goldstein! Minor comment above about the Python version we should keep. We're also missing a few corner cases in the C tests --- particularly, need to run this on trees with 0 and 1 samples, see what happens.

@daniel-goldstein daniel-goldstein force-pushed the linear-kc branch 2 times, most recently from b1dbc34 to 7a7b5c8 Compare March 31, 2020 15:56
@jeromekelleher
Copy link
Member

Awesome, thanks @daniel-goldstein! Can you update the CHANGELOG with this (Performance improvements in kc_distance) please? We're good to merge then.

@daniel-goldstein
Copy link
Member Author

@jeromekelleher All set

@jeromekelleher jeromekelleher merged commit 43300e8 into tskit-dev:master Apr 1, 2020
@jeromekelleher
Copy link
Member

Thanks @daniel-goldstein, this is great stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants