Add tree_pos to tsk_tree_t struct and change next, prev and seek_from_null to use tree_pos #2874

duncanMR · 2023-12-04T16:21:35Z

Description

This PR aims to add tree_pos to the tsk_tree_t struct and alter next, prev, first, last and seek_from_null to use the tree positioning code. This means that we can remove the separate tsk_tree_position iterator in LSHMM and KC distance code.

Fixes #2794, Fixes #2795 and Fixes #2825.

PR Checklist:

Tests that fully cover new/changed functionality.
Documentation including tutorial content if appropriate.
Changelogs, if there are API changes.

duncanMR · 2023-12-04T16:38:58Z

@jeromekelleher Unfortunately the CI issues remain on the fresh branch. Is there any way to force update the package cache?

codecov · 2023-12-04T17:00:16Z

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (813e361) 89.75% compared to head (a614344) 89.79%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2874      +/-   ##
==========================================
+ Coverage   89.75%   89.79%   +0.03%     
==========================================
  Files          30       30              
  Lines       30395    30399       +4     
  Branches     5912     5909       -3     
==========================================
+ Hits        27282    27296      +14     
+ Misses       1781     1778       -3     
+ Partials     1332     1325       -7

Flag	Coverage Δ
c-tests	`86.21% <93.37%> (+0.07%)`	⬆️
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`67.71% <ø> (ø)`
python-tests	`98.96% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
c/tskit/haplotype_matching.c	`84.55% <100.00%> (+0.07%)`	⬆️
c/tskit/trees.c	`90.42% <93.18%> (+0.20%)`	⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 813e361...a614344. Read the comment docs.

jeromekelleher · 2023-12-05T09:32:23Z

@benjeffery what'st the easiest way of dealing with this?

duncanMR · 2023-12-05T09:53:27Z

Ben suggested bumping the version number in the cache config, which I did in the last commit, but for some reason it seems to have not refreshed the cache.

benjeffery · 2023-12-05T10:05:52Z

Bumping the number will certainly have flushed the cache. Think there is a deeper issue here - checking it now.

benjeffery · 2023-12-05T10:12:02Z

BTW - The CI cache works across PRs, so a fresh one doesn't have any effect.

Hoping #2875 fixes this CI issue.

duncanMR · 2023-12-05T12:36:41Z

Ah, I realised that using Sync Fork on GitHub causes a merge. Will not use it again. Hopefully CI works now, just need to fix the coverage issue

benjeffery · 2023-12-05T12:47:46Z

Ah, I realised that using Sync Fork on GitHub causes a merge. Will not use it again. Hopefully CI works now, just need to fix the coverage issue

If you need to rebase to main and the PR is other wise complete you can use "_mergifyio rebase" replacing the _ with a @.

duncanMR · 2023-12-05T13:46:25Z

I think I've covered all the cases where tsk_tree_position_t was used outside of tsk_tree_t. I haven't figured out how to resolve the codecov warning: the tests implicitly check if tsk_tree_position_init returns a nonzero value within tsk_tree_init, and I've added checks that all the initialised values are correct.

jeromekelleher

I think you're doing this the wrong way around. You need to update the tree class to use the tree_pos_t first, then alter external code to use it directly.

jeromekelleher · 2023-12-05T13:59:29Z

c/tskit/haplotype_matching.c

@@ -247,8 +246,8 @@ tsk_ls_hmm_reset(tsk_ls_hmm_t *self, double value)
    tsk_memset(self->transition_parent, 0xff,
        self->max_transitions * sizeof(*self->transition_parent));

-    tsk_tree_position_free(&self->tree_pos);
-    ret = tsk_tree_position_init(&self->tree_pos, self->tree_sequence, 0);
+    tsk_tree_free(&self->tree);


I'm not sure we want to do this - we're already maintaining the state of the tree elsewhere, so can delete this "_init" and free

jeromekelleher · 2023-12-05T14:01:38Z

c/tskit/trees.c

@@ -7007,10 +7002,10 @@ tsk_treeseq_kc_distance(const tsk_treeseq_t *self, const tsk_treeseq_t *other,
    if (ret != 0) {
        goto out;
    }
-    tsk_tree_position_next(&tree_pos[0]);
-    tsk_bug_assert(tree_pos[0].index == 0);
+    tsk_tree_position_next(&trees[0].tree_pos);


Ah no, that's not the point - we need to use the internal calls to tree_pos_next. Calling externally like this should break all sorts of things.

jeromekelleher · 2023-12-05T14:03:12Z

I would drop the last two commits for now @duncanMR, and focus on changing tree methods to use the tree_pos

duncanMR · 2023-12-05T15:11:15Z

I've moved the last two commits into another branch for now. I'm going to have to pause this for now, but I'll continue revising the tree positioning code next year.

duncanMR · 2024-02-09T18:11:02Z

I've moved over all the tree positioning functions to use tree_pos; for now, I've made seek_from_null just call tsk_tree_next repeatedly for now, until we implement seek_forward. All tests are passing except for 21 in the class tests/test_highlevel.py::TestSeekDirection. I've narrowed down the cause: the left_child and right_sib arrays are calculated incorrectly when calling prev from the null state:

from tests import tsutil
from tests import test_tree_positioning
ts = tsutil.all_trees_ts(3)
tree_state = test_tree_positioning.StatefulTree(ts)
tree_state.prev()
print(tree_state)

parent: [3, 3, 4, 4, -1]
position:
	index: 3
	interval: Interval(left=3.0, right=4.0)
	direction: -1
	in_range: EdgeRange(start=8, stop=4, order=array([0, 5, 2, 6, 4, 8, 7, 3, 1], dtype=int32))
	out_range: EdgeRange(start=8, stop=8, order=array([0, 2, 4, 5, 8, 1, 6, 3, 7], dtype=int32))

So we will insert edges 1 ($3 \rightarrow 0$), 3 ($3 \rightarrow 1$), 7 ($4 \rightarrow 2$) and 8 ($4 \rightarrow 3$) in that order.
Because edge 7 is inserted first, we set left_child(4) = 2 and right_sib[2] = 3. This is incorrect, because it is supposed to be left_child(4) = 3 and right_sib(3)=2, as we can see from the tree plot:

I'm still trying to figure out how the system for node ordering works. I think we either need to change the in_range calculation, or we have to revise how left_child and right_sib are updated, but I'm not sure yet.

jeromekelleher · 2024-02-09T20:22:15Z

Hmm, the ordering of children for a parent is documented as arbitrary, and we can change it. If that's the only problem then I wouldn't consider it a bug.

Is it just those tests expect a particular ordering?

duncanMR · 2024-02-10T10:54:18Z

Yes, it seems that test_highlevel.py::TestSeekDirection is the only test class that checks the child and sib arrays directly with the assert_trees_identical function. The tests fail because the child and sib arrays are calculated differently when reaching the last tree in a sequence with prev or next. I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?

jeromekelleher · 2024-02-11T11:21:28Z

I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?

It's always been conditioned on the precise order in which edges are inserted and removed. For example, you will slightly different left_sib arrays when you seek to the same tree in the left and right directions. There's no avoiding this if we want to keep it efficient for large polytomies.

So, just to clarify, the arrays are correct here, but they are ordered slightly differently to the currently library code? I'm OK with that, if that's the case.

duncanMR · 2024-02-12T08:51:13Z

I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?

It's always been conditioned on the precise order in which edges are inserted and removed. For example, you will slightly different left_sib arrays when you seek to the same tree in the left and right directions. There's no avoiding this if we want to keep it efficient for large polytomies.

So, just to clarify, the arrays are correct here, but they are ordered slightly differently to the currently library code? I'm OK with that, if that's the case.

I see. Yes, the right_sib and left_child have just been permuted at one position; here, Tree 2 is what the current library code outputs and Tree 1 is the result of calling prev from the null tree, which used to give the same result prior to my changes:

Tree 1 left_child: [-1 -1 -1  0  2  4]
Tree 2 left_child: [-1 -1 -1  0  3  4]
Tree 1 left_sib:   [-1  0 -1  2 -1 -1]
Tree 2 left_sib:   [-1  0  3 -1 -1 -1]

The trees are identical otherwise. Should I just remove the checks for whether the left/right child/sib arrays are identical for this test?

jeromekelleher · 2024-02-12T13:13:24Z

Yes for now can just comment out those tests so we can see how the rest of the suite is doing (would come up with a better solution before merging)

duncanMR · 2024-02-21T15:09:36Z

I had some trouble replicating Kevin Thornton's performance plots for seek_from_null, since my timing data for each seek has higher variance and I don't have his benchmarking code to compare with. Strangely, the times I get for the new tree_pos implementation have tended to have higher variance; sometimes, a single seek takes an order of magnitude longer than normal. Overall, it seems like the new implementation is comparable in performance to the old method, but slightly slower on average due to the occasionally long seek times:

import stdpopsim
import tskit
import time
import pandas as pd

species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("OutOfAfrica_3G09")
contig = species.get_contig("chr1", length_multiplier=0.2)
samples = {"YRI": 0, "CHB": 0, "CEU": 2e4}
engine = stdpopsim.get_engine("msprime")
test_ts = engine.simulate(model, contig, samples, seed=1)

def seek_from_null_sweep(ts, method):
    data = []

    for n in range(0, ts.num_trees - 1, 500):
        tree = tskit.Tree(ts)
        before = time.perf_counter()
        tree.seek(n)
        duration = time.perf_counter() - before
        data.append([n, duration, method])

    # Convert the list to a DataFrame
    df = pd.DataFrame(data, columns=['Index', 'Time', 'Method'])
    return df

#lib_null_df = seek_from_null_sweep(test_ts, method="lib")
tree_pos_null_df = seek_from_null_sweep(test_ts, method="tree_pos")

We can't determine a clear difference between the methods from a single run, but if I repeatedly time how long it takes to seek from null to various points along the genome, the difference between the two implementations does seem significant in some cases. Here, I seeked to 1000 indices in the HGDP and msprime simulated trees, and only 10 indices in the sc2ts trees. This how I timed it:

def test_seek_from_null(ts, num_runs, num_indices):
    times = np.zeros(num_runs)
    tree = tskit.Tree(ts)
    indices = np.linspace(0, ts.num_trees - 1, num_indices, dtype=int)
    for run in range(num_runs):
        before = time.perf_counter()
        for n in indices:
            tree.clear()
            tree.seek(n)
        times[run] = time.perf_counter() - before
    return np.mean(times), np.std(times)

Here are the results, including the tests for other tree positioning methods (the times are averaged over 30 runs for seek_from_null and 50 runs for the other tests):

@jeromekelleher Do you think this warrants further investigation / profiling? My test setup is far from ideal: background processes and the complexity of WSL + Jupyter + Windows causes significant variation in the timings I calculate. In my first experiments, the performance of the new method was actually worse, but I couldn't replicate that later on.

molpopgen · 2024-02-21T17:12:58Z

I don't totally remember the details but my benchmarks were probably all done using a prototype written using tskit-rust and are now lost to history.

jeromekelleher · 2024-02-21T17:14:56Z

Interesting - what's the percentage difference in the average seek_from_null times? I guess we should run it through perf or something to see where the time is being spent in the low-level code.

jeromekelleher

There's an obvious pointer dereferencing that could be leading to the perf difference. Let's avoid using self-> within loops.

jeromekelleher · 2024-02-21T17:17:51Z

c/tskit/trees.c

+    valid = tsk_tree_position_next(&self->tree_pos);
+
+    if (valid) {
+        for (j = self->tree_pos.out.start; j != self->tree_pos.out.stop; j++) {


I think these pointer references might be adding up. Try something like

tree_pos = self->tree_pos; for (j = tree_pos.out.start; j != tree_pos.out.stop; j++) { ... }```

Looks like this fixes the problem. I'm just running a better perftest script now to make sure

duncanMR · 2024-02-21T19:35:41Z

I wrote a better performance test that calculates mean seek times for each of a subset of indices

def test_seek_from_null(ts, ts_name, num_runs, num_indices, method):
    times_matrix = np.zeros((num_runs, num_indices))
    indices = np.linspace(0, ts.num_trees - 1, num_indices, dtype=int)
    
    for run in range(num_runs):
        for i, n in enumerate(indices):
            tree = tskit.Tree(ts)
            before = time.perf_counter()
            tree.seek(n)
            after = time.perf_counter()
            times_matrix[run, i] = (after - before) * 1000000
    
    mean_times = np.mean(times_matrix, axis=0)
    std_times = np.std(times_matrix, axis=0)

    # Create a DataFrame to return
    results_df = pd.DataFrame({
        'index': indices,
        'mean': mean_times,
        'sd': std_times,
        'method': method,
        'ts': ts_name
    })

    return results_df

I computed the seek times for three TS, starting with a huge simulated ts:

The new algorithm performs substantially better toward the end of the sequence (results are averaged over 200 runs)

The variance in seek times is extremely high in the HGDP case for both methods, but the implementations perform within measurement error for each other

I had to reduce the number of runs to 50 for sc2ts because it takes so long. It's another clear win for the new implementation

I've updated the LSHMM and KC distance code. I removed the tsk_tree_position_step function because it is actually useless (we never want to update tsk_tree_position separately to the tree now). I think it's ready for review now. I just need to rebase

duncanMR · 2024-02-21T19:52:37Z

@Mergifyio rebase

mergify · 2024-02-21T19:53:00Z

rebase

❌ Unable to rebase: user `duncanMR` is unknown.

Please make sure duncanMR has logged in Mergify dashboard.

jeromekelleher

Fantastic. I've gone through this carefully and just spotted one detail. Ready to merge after that's fixed.

Can you keep the scripts etc that you use here handy, as it would be nice to do a tskit .dev news item with these perf numbers when we release this. Once you've implemented the seek_forward/backward functionality I think we'll want to do a joint Python and C API release, as the perf improvements are significant.

jeromekelleher · 2024-02-22T09:28:03Z

c/tskit/trees.c

-    // searching in the first or last 1/2
-    // of trees.
-    j = -1;
+    double interval_left = self->interval.left;


You're not actually using these values that interval_left and interval_right are being initialised to. Better to just declare,

double interval_left, interval_right;

jeromekelleher · 2024-02-22T09:35:33Z

c/tskit/trees.c

+        ret = tsk_tree_position_seek_forward(&self->tree_pos, index);
+        if (ret != 0) {
+            goto out;
+        }


It took me a couple of minutes to realise that because we're seeking from the null tree there are no edges to remove. It's worth adding a comment to that effect here I think.

duncanMR · 2024-02-22T11:01:09Z

Fantastic. I've gone through this carefully and just spotted one detail. Ready to merge after that's fixed.

Can you keep the scripts etc that you use here handy, as it would be nice to do a tskit .dev news item with these perf numbers when we release this. Once you've implemented the seek_forward/backward functionality I think we'll want to do a joint Python and C API release, as the perf improvements are significant.

Thanks for the feedback; I've corrected both issues and rebased. I'll definitely keep my scripts handy; I'll be using them to assess the new seek functionality soon.

duncanMR marked this pull request as ready for review December 5, 2023 10:22

duncanMR marked this pull request as draft December 5, 2023 10:22

duncanMR force-pushed the update_tree_struct branch from b3af6f8 to b08efa7 Compare December 5, 2023 12:40

duncanMR marked this pull request as ready for review December 5, 2023 13:41

jeromekelleher reviewed Dec 5, 2023

View reviewed changes

duncanMR marked this pull request as draft December 5, 2023 14:49

duncanMR changed the title ~~Add tree_pos to tsk_tree_t struct (updated)~~ Add tree_pos to tsk_tree_t struct Dec 5, 2023

duncanMR force-pushed the update_tree_struct branch from c514476 to bf2f3ac Compare December 5, 2023 15:09

duncanMR force-pushed the update_tree_struct branch 2 times, most recently from 6f7e74b to 895abed Compare February 9, 2024 16:23

duncanMR force-pushed the update_tree_struct branch 3 times, most recently from 724c5ce to e272ff3 Compare February 12, 2024 15:22

duncanMR force-pushed the update_tree_struct branch 2 times, most recently from 697c591 to ebf9159 Compare February 21, 2024 16:43

jeromekelleher reviewed Feb 21, 2024

View reviewed changes

duncanMR force-pushed the update_tree_struct branch from ebf9159 to 99a3d7d Compare February 21, 2024 19:31

duncanMR changed the title ~~Add tree_pos to tsk_tree_t struct~~ Add tree_pos to tsk_tree_t struct and change next, prev and seek_from_null to use tree_pos Feb 21, 2024

duncanMR force-pushed the update_tree_struct branch from 99a3d7d to d1a44d9 Compare February 21, 2024 19:51

duncanMR force-pushed the update_tree_struct branch 3 times, most recently from 272c1d7 to d88aff8 Compare February 21, 2024 20:02

duncanMR marked this pull request as ready for review February 21, 2024 20:04

jeromekelleher reviewed Feb 22, 2024

View reviewed changes

duncanMR added 5 commits February 22, 2024 10:57

Add tree_pos to tsk_tree_t struct

ae8283b

Update basic tree positioning functions with tree_pos versions

57748ed

Add tsk_tree_position_seek functions and revise seek_from_null

1998a80

Add tests for tsk_tree_position_seek functions

901b6aa

Remove tree_pos iterator from KC distance and haplotype matching

a614344

duncanMR force-pushed the update_tree_struct branch from d88aff8 to a614344 Compare February 22, 2024 10:57

jeromekelleher approved these changes Feb 22, 2024

View reviewed changes

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Feb 22, 2024

mergify bot merged commit f955bae into tskit-dev:main Feb 22, 2024
23 checks passed

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Feb 22, 2024

duncanMR deleted the update_tree_struct branch February 22, 2024 13:48

duncanMR mentioned this pull request Feb 26, 2024

Implement efficient seeking from non-null trees using tree_pos #2911

Draft

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tree_pos to tsk_tree_t struct and change next, prev and seek_from_null to use tree_pos #2874

Add tree_pos to tsk_tree_t struct and change next, prev and seek_from_null to use tree_pos #2874

duncanMR commented Dec 4, 2023 •

edited

Loading

duncanMR commented Dec 4, 2023 •

edited

Loading

codecov bot commented Dec 4, 2023 •

edited

Loading

jeromekelleher commented Dec 5, 2023

duncanMR commented Dec 5, 2023

benjeffery commented Dec 5, 2023

benjeffery commented Dec 5, 2023 •

edited

Loading

duncanMR commented Dec 5, 2023

benjeffery commented Dec 5, 2023

duncanMR commented Dec 5, 2023 •

edited

Loading

jeromekelleher left a comment

jeromekelleher Dec 5, 2023

jeromekelleher Dec 5, 2023

jeromekelleher commented Dec 5, 2023

duncanMR commented Dec 5, 2023

duncanMR commented Feb 9, 2024

jeromekelleher commented Feb 9, 2024

duncanMR commented Feb 10, 2024

jeromekelleher commented Feb 11, 2024

duncanMR commented Feb 12, 2024

jeromekelleher commented Feb 12, 2024

duncanMR commented Feb 21, 2024

molpopgen commented Feb 21, 2024

jeromekelleher commented Feb 21, 2024

jeromekelleher left a comment

jeromekelleher Feb 21, 2024

duncanMR Feb 21, 2024

duncanMR commented Feb 21, 2024

duncanMR commented Feb 21, 2024

mergify bot commented Feb 21, 2024

jeromekelleher left a comment

jeromekelleher Feb 22, 2024

jeromekelleher Feb 22, 2024

duncanMR commented Feb 22, 2024

Add tree_pos to tsk_tree_t struct and change next, prev and seek_from_null to use tree_pos #2874

Add tree_pos to tsk_tree_t struct and change next, prev and seek_from_null to use tree_pos #2874

Conversation

duncanMR commented Dec 4, 2023 • edited Loading

Description

PR Checklist:

duncanMR commented Dec 4, 2023 • edited Loading

codecov bot commented Dec 4, 2023 • edited Loading

Codecov Report

jeromekelleher commented Dec 5, 2023

duncanMR commented Dec 5, 2023

benjeffery commented Dec 5, 2023

benjeffery commented Dec 5, 2023 • edited Loading

duncanMR commented Dec 5, 2023

benjeffery commented Dec 5, 2023

duncanMR commented Dec 5, 2023 • edited Loading

jeromekelleher left a comment

Choose a reason for hiding this comment

jeromekelleher Dec 5, 2023

Choose a reason for hiding this comment

jeromekelleher Dec 5, 2023

Choose a reason for hiding this comment

jeromekelleher commented Dec 5, 2023

duncanMR commented Dec 5, 2023

duncanMR commented Feb 9, 2024

jeromekelleher commented Feb 9, 2024

duncanMR commented Feb 10, 2024

jeromekelleher commented Feb 11, 2024

duncanMR commented Feb 12, 2024

jeromekelleher commented Feb 12, 2024

duncanMR commented Feb 21, 2024

molpopgen commented Feb 21, 2024

jeromekelleher commented Feb 21, 2024

jeromekelleher left a comment

Choose a reason for hiding this comment

jeromekelleher Feb 21, 2024

Choose a reason for hiding this comment

duncanMR Feb 21, 2024

Choose a reason for hiding this comment

duncanMR commented Feb 21, 2024

duncanMR commented Feb 21, 2024

mergify bot commented Feb 21, 2024

❌ Unable to rebase: user duncanMR is unknown.

jeromekelleher left a comment

Choose a reason for hiding this comment

jeromekelleher Feb 22, 2024

Choose a reason for hiding this comment

jeromekelleher Feb 22, 2024

Choose a reason for hiding this comment

duncanMR commented Feb 22, 2024

duncanMR commented Dec 4, 2023 •

edited

Loading

duncanMR commented Dec 4, 2023 •

edited

Loading

codecov bot commented Dec 4, 2023 •

edited

Loading

benjeffery commented Dec 5, 2023 •

edited

Loading

duncanMR commented Dec 5, 2023 •

edited

Loading

❌ Unable to rebase: user `duncanMR` is unknown.