Make KC distance linear in size of KC vectors #490

daniel-goldstein · 2020-03-18T19:28:57Z

Computing the KC distance requires building two vectors per tree, m and M, where each element records, for some leaves u & v, the distance of mrca(u, v) to the root in number of edges and time, respectively. The existing implementation was essentially the above explanation, computing the mrca for each pair, which for n leaves, takes O(n^2 log(n)) time.

This PR implements a slightly different approach, where we do a preorder traversal of the tree, accumulating the depth and time from the root. For each internal node, we can take the Cartesian product of the node's children's leaves and assign each pair the current depth and time. We use sample_lists to iterate through the leaves without any extra traversal overhead, erroring out when sample_lists is not available. Since each leaf pair is considered once, the overall runtime is O(n^2). This approach is similar to that used in the official R package for KC-distance, tree space.

codecov · 2020-03-18T21:00:23Z

Codecov Report

Merging #490 into master will decrease coverage by 0.02%.
The diff coverage is 87.39%.

@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
- Coverage   86.84%   86.81%   -0.03%     
==========================================
  Files          21       21              
  Lines       15944    15969      +25     
  Branches     3095     3102       +7     
==========================================
+ Hits        13847    13864      +17     
- Misses       1050     1055       +5     
- Partials     1047     1050       +3

Flag	Coverage Δ
#c_tests	`87.73% <87.39%> (-0.05%)`	⬇️
#python_c_tests	`90.52% <ø> (ø)`
#python_tests	`99.18% <ø> (ø)`

Impacted Files	Coverage Δ
c/tskit/trees.c	`90.71% <87.39%> (-0.22%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cbeb129...53a4756. Read the comment docs.

jeromekelleher · 2020-03-19T10:56:46Z

This looks great, thanks @daniel-goldstein. One basic question though: isn't the LeafTracker basically the same things as the SAMPLE_LISTS option for tsk_tree_t? For example, we use this in genotypes.c to quickly find the set of samples that descend from a given mutation when we're decoding genotypes. Ironically, though, it turns out that doing simple traversals like this are often more efficient than using the sample list feature. This is probably because the algorithm for maintaining sample lists is a little stupid: we basically traverse upwards each time and rebuild the parts of the linked list on each update. I tried really hard to make this better though and failed --- it's very tricky. Also, though, we're only looking for sample lists for a small number of nodes per tree when decoding mutations. We'll be doing lots more for KC.

Anyway, the short version is, I don't think there's much point in having an extra LeafTracker class and we might as well just use the SampleList feature of the trees. The simplest thing to do is to raise an error if the tree doesn't have this feature enabled when kc_distance is called.

Looking forward to talking through the rest of the details on Friday.

daniel-goldstein · 2020-03-19T15:29:04Z

AH ok, this makes sense. I thought we were discussing the use of something like the leaf tracker over the sideways links in the quintuply linked tree structure. Either way, glad I can pare down this PR then, not a big deal to change how we iterate through the leaves. Will discuss more tomorrow then.

daniel-goldstein · 2020-03-25T15:36:55Z

Still got a few more tweaks left. Running some benchmarks to see how KC performs.

daniel-goldstein · 2020-03-26T15:58:19Z

Ran some benchmarks on holly comparing master and this branch. Below shows the runtime for a simple msprime simulation, no params except the number of samples and a fixed seed. Then run the KC distance of the first tree against itself.

OOM errors start ~46k samples for both branches. Just under that, at 40k, this branch still runs at ~46s.

jeromekelleher · 2020-03-26T16:43:44Z

Great work @daniel-goldstein. Is this ready for final review?

.pre-commit-config.yaml

c/tskit/trees.c

python/tests/test_topology.py

daniel-goldstein · 2020-03-26T16:59:37Z

Great work @daniel-goldstein. Is this ready for final review?

Thanks @jeromekelleher! I left a few comments I’d like your thoughts on. Otherwise good for review.

jeromekelleher

Looks great @daniel-goldstein! I'd like to dig in to the perf side of things before moving on here. Can you try out the restrict pointer stuff for the all-pairs update please? This should make a bit of a difference. Also interesting to see where the time is spent - using perf record on holly for one of your big benchmarks should give good insights.

.pre-commit-config.yaml

c/tskit/trees.c

daniel-goldstein · 2020-03-27T18:19:18Z

Determined that a good chunk of time (~40%) is spent in moving data into the KC vectors. The leaf traversal order does not align well with the positions in the vectors (which are O(n^2) in space) and so gets very poor cache behavior. Leaving a comment in the code to explain this.

jeromekelleher

Looks great, thanks @daniel-goldstein! Minor comment above about the Python version we should keep. We're also missing a few corner cases in the C tests --- particularly, need to run this on trees with 0 and 1 samples, see what happens.

python/tests/test_topology.py

jeromekelleher · 2020-04-01T09:02:27Z

Awesome, thanks @daniel-goldstein! Can you update the CHANGELOG with this (Performance improvements in kc_distance) please? We're good to merge then.

queries

daniel-goldstein · 2020-04-01T18:24:19Z

@jeromekelleher All set

jeromekelleher · 2020-04-01T19:42:46Z

Thanks @daniel-goldstein, this is great stuff.

daniel-goldstein force-pushed the linear-kc branch from 44ba977 to f9950e9 Compare March 18, 2020 19:32

daniel-goldstein changed the title ~~Make KC distance quadratic in number of nodes.~~ Make KC distance linear in size of KC vectors Mar 18, 2020

daniel-goldstein force-pushed the linear-kc branch 2 times, most recently from cf84b52 to cf6b68b Compare March 18, 2020 19:36

daniel-goldstein force-pushed the linear-kc branch from cf6b68b to f83ca41 Compare March 25, 2020 15:36

daniel-goldstein commented Mar 26, 2020

View reviewed changes

.pre-commit-config.yaml Show resolved Hide resolved

c/tskit/trees.c Outdated Show resolved Hide resolved

python/tests/test_topology.py Show resolved Hide resolved

jeromekelleher reviewed Mar 26, 2020

View reviewed changes

.pre-commit-config.yaml Show resolved Hide resolved

c/tskit/trees.c Show resolved Hide resolved

c/tskit/trees.c Outdated Show resolved Hide resolved

c/tskit/trees.c Outdated Show resolved Hide resolved

c/tskit/trees.c Outdated Show resolved Hide resolved

daniel-goldstein force-pushed the linear-kc branch 4 times, most recently from 270561f to 37cc4c5 Compare March 27, 2020 18:16

jeromekelleher reviewed Mar 30, 2020

View reviewed changes

python/tests/test_topology.py Outdated Show resolved Hide resolved

daniel-goldstein force-pushed the linear-kc branch 2 times, most recently from b1dbc34 to 7a7b5c8 Compare March 31, 2020 15:56

jeromekelleher reviewed Mar 31, 2020

View reviewed changes

python/tests/test_topology.py Outdated Show resolved Hide resolved

daniel-goldstein force-pushed the linear-kc branch from 7a7b5c8 to 79af42b Compare March 31, 2020 19:17

daniel-goldstein force-pushed the linear-kc branch from 79af42b to ef8c8c6 Compare April 1, 2020 18:21

make kc linear in the number of nodes by not doing pairwise mrca

53a4756

queries

daniel-goldstein force-pushed the linear-kc branch from ef8c8c6 to 53a4756 Compare April 1, 2020 18:22

jeromekelleher merged commit 43300e8 into tskit-dev:master Apr 1, 2020

Make KC distance linear in size of KC vectors #490

Make KC distance linear in size of KC vectors #490

Uh oh!

Conversation

daniel-goldstein commented Mar 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Mar 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher commented Mar 19, 2020

Uh oh!

daniel-goldstein commented Mar 19, 2020

Uh oh!

daniel-goldstein commented Mar 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

daniel-goldstein commented Mar 26, 2020

Uh oh!

jeromekelleher commented Mar 26, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

daniel-goldstein commented Mar 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

daniel-goldstein commented Mar 27, 2020

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jeromekelleher commented Apr 1, 2020

Uh oh!

daniel-goldstein commented Apr 1, 2020

Uh oh!

jeromekelleher commented Apr 1, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

daniel-goldstein commented Mar 18, 2020 •

edited

Loading

codecov bot commented Mar 18, 2020 •

edited

Loading

daniel-goldstein commented Mar 25, 2020 •

edited

Loading

daniel-goldstein commented Mar 26, 2020 •

edited

Loading