Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tree_pos to tsk_tree_t struct and change next, prev and seek_from_null to use tree_pos #2874

Merged
merged 5 commits into from
Feb 22, 2024

Conversation

duncanMR
Copy link
Contributor

@duncanMR duncanMR commented Dec 4, 2023

Description

This PR aims to add tree_pos to the tsk_tree_t struct and alter next, prev, first, last and seek_from_null to use the tree positioning code. This means that we can remove the separate tsk_tree_position iterator in LSHMM and KC distance code.

Fixes #2794, Fixes #2795 and Fixes #2825.

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@duncanMR
Copy link
Contributor Author

duncanMR commented Dec 4, 2023

@jeromekelleher Unfortunately the CI issues remain on the fresh branch. Is there any way to force update the package cache?

Copy link

codecov bot commented Dec 4, 2023

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (813e361) 89.75% compared to head (a614344) 89.79%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2874      +/-   ##
==========================================
+ Coverage   89.75%   89.79%   +0.03%     
==========================================
  Files          30       30              
  Lines       30395    30399       +4     
  Branches     5912     5909       -3     
==========================================
+ Hits        27282    27296      +14     
+ Misses       1781     1778       -3     
+ Partials     1332     1325       -7     
Flag Coverage Δ
c-tests 86.21% <93.37%> (+0.07%) ⬆️
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.71% <ø> (ø)
python-tests 98.96% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
c/tskit/haplotype_matching.c 84.55% <100.00%> (+0.07%) ⬆️
c/tskit/trees.c 90.42% <93.18%> (+0.20%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 813e361...a614344. Read the comment docs.

@jeromekelleher
Copy link
Member

@benjeffery what'st the easiest way of dealing with this?

@duncanMR
Copy link
Contributor Author

duncanMR commented Dec 5, 2023

Ben suggested bumping the version number in the cache config, which I did in the last commit, but for some reason it seems to have not refreshed the cache.

@benjeffery
Copy link
Member

Bumping the number will certainly have flushed the cache. Think there is a deeper issue here - checking it now.

@benjeffery
Copy link
Member

benjeffery commented Dec 5, 2023

BTW - The CI cache works across PRs, so a fresh one doesn't have any effect.

Hoping #2875 fixes this CI issue.

@duncanMR duncanMR marked this pull request as ready for review December 5, 2023 10:22
@duncanMR duncanMR marked this pull request as draft December 5, 2023 10:22
@duncanMR
Copy link
Contributor Author

duncanMR commented Dec 5, 2023

Ah, I realised that using Sync Fork on GitHub causes a merge. Will not use it again. Hopefully CI works now, just need to fix the coverage issue

@benjeffery
Copy link
Member

Ah, I realised that using Sync Fork on GitHub causes a merge. Will not use it again. Hopefully CI works now, just need to fix the coverage issue

If you need to rebase to main and the PR is other wise complete you can use "_mergifyio rebase" replacing the _ with a @.

@duncanMR duncanMR marked this pull request as ready for review December 5, 2023 13:41
@duncanMR
Copy link
Contributor Author

duncanMR commented Dec 5, 2023

I think I've covered all the cases where tsk_tree_position_t was used outside of tsk_tree_t. I haven't figured out how to resolve the codecov warning: the tests implicitly check if tsk_tree_position_init returns a nonzero value within tsk_tree_init, and I've added checks that all the initialised values are correct.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're doing this the wrong way around. You need to update the tree class to use the tree_pos_t first, then alter external code to use it directly.

@@ -247,8 +246,8 @@ tsk_ls_hmm_reset(tsk_ls_hmm_t *self, double value)
tsk_memset(self->transition_parent, 0xff,
self->max_transitions * sizeof(*self->transition_parent));

tsk_tree_position_free(&self->tree_pos);
ret = tsk_tree_position_init(&self->tree_pos, self->tree_sequence, 0);
tsk_tree_free(&self->tree);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want to do this - we're already maintaining the state of the tree elsewhere, so can delete this "_init" and free

c/tskit/trees.c Outdated
@@ -7007,10 +7002,10 @@ tsk_treeseq_kc_distance(const tsk_treeseq_t *self, const tsk_treeseq_t *other,
if (ret != 0) {
goto out;
}
tsk_tree_position_next(&tree_pos[0]);
tsk_bug_assert(tree_pos[0].index == 0);
tsk_tree_position_next(&trees[0].tree_pos);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no, that's not the point - we need to use the internal calls to tree_pos_next. Calling externally like this should break all sorts of things.

@jeromekelleher
Copy link
Member

I would drop the last two commits for now @duncanMR, and focus on changing tree methods to use the tree_pos

@duncanMR duncanMR marked this pull request as draft December 5, 2023 14:49
@duncanMR duncanMR changed the title Add tree_pos to tsk_tree_t struct (updated) Add tree_pos to tsk_tree_t struct Dec 5, 2023
@duncanMR
Copy link
Contributor Author

duncanMR commented Dec 5, 2023

I've moved the last two commits into another branch for now. I'm going to have to pause this for now, but I'll continue revising the tree positioning code next year.

@duncanMR duncanMR force-pushed the update_tree_struct branch 2 times, most recently from 6f7e74b to 895abed Compare February 9, 2024 16:23
@duncanMR
Copy link
Contributor Author

duncanMR commented Feb 9, 2024

I've moved over all the tree positioning functions to use tree_pos; for now, I've made seek_from_null just call tsk_tree_next repeatedly for now, until we implement seek_forward. All tests are passing except for 21 in the class tests/test_highlevel.py::TestSeekDirection. I've narrowed down the cause: the left_child and right_sib arrays are calculated incorrectly when calling prev from the null state:

from tests import tsutil
from tests import test_tree_positioning
ts = tsutil.all_trees_ts(3)
tree_state = test_tree_positioning.StatefulTree(ts)
tree_state.prev()
print(tree_state)
parent: [3, 3, 4, 4, -1]
position:
	index: 3
	interval: Interval(left=3.0, right=4.0)
	direction: -1
	in_range: EdgeRange(start=8, stop=4, order=array([0, 5, 2, 6, 4, 8, 7, 3, 1], dtype=int32))
	out_range: EdgeRange(start=8, stop=8, order=array([0, 2, 4, 5, 8, 1, 6, 3, 7], dtype=int32))

So we will insert edges 1 ($3 \rightarrow 0$), 3 ($3 \rightarrow 1$), 7 ($4 \rightarrow 2$) and 8 ($4 \rightarrow 3$) in that order.
Because edge 7 is inserted first, we set left_child(4) = 2 and right_sib[2] = 3. This is incorrect, because it is supposed to be left_child(4) = 3 and right_sib(3)=2, as we can see from the tree plot:

image

I'm still trying to figure out how the system for node ordering works. I think we either need to change the in_range calculation, or we have to revise how left_child and right_sib are updated, but I'm not sure yet.

@jeromekelleher
Copy link
Member

Hmm, the ordering of children for a parent is documented as arbitrary, and we can change it. If that's the only problem then I wouldn't consider it a bug.

Is it just those tests expect a particular ordering?

@duncanMR
Copy link
Contributor Author

Yes, it seems that test_highlevel.py::TestSeekDirection is the only test class that checks the child and sib arrays directly with the assert_trees_identical function. The tests fail because the child and sib arrays are calculated differently when reaching the last tree in a sequence with prev or next. I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?

@jeromekelleher
Copy link
Member

I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?

It's always been conditioned on the precise order in which edges are inserted and removed. For example, you will slightly different left_sib arrays when you seek to the same tree in the left and right directions. There's no avoiding this if we want to keep it efficient for large polytomies.

So, just to clarify, the arrays are correct here, but they are ordered slightly differently to the currently library code? I'm OK with that, if that's the case.

@duncanMR
Copy link
Contributor Author

I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?

It's always been conditioned on the precise order in which edges are inserted and removed. For example, you will slightly different left_sib arrays when you seek to the same tree in the left and right directions. There's no avoiding this if we want to keep it efficient for large polytomies.

So, just to clarify, the arrays are correct here, but they are ordered slightly differently to the currently library code? I'm OK with that, if that's the case.

I see. Yes, the right_sib and left_child have just been permuted at one position; here, Tree 2 is what the current library code outputs and Tree 1 is the result of calling prev from the null tree, which used to give the same result prior to my changes:

Tree 1 left_child: [-1 -1 -1  0  2  4]
Tree 2 left_child: [-1 -1 -1  0  3  4]
Tree 1 left_sib:   [-1  0 -1  2 -1 -1]
Tree 2 left_sib:   [-1  0  3 -1 -1 -1]

The trees are identical otherwise. Should I just remove the checks for whether the left/right child/sib arrays are identical for this test?

@jeromekelleher
Copy link
Member

Yes for now can just comment out those tests so we can see how the rest of the suite is doing (would come up with a better solution before merging)

@duncanMR duncanMR force-pushed the update_tree_struct branch 3 times, most recently from 724c5ce to e272ff3 Compare February 12, 2024 15:22
@duncanMR
Copy link
Contributor Author

I had some trouble replicating Kevin Thornton's performance plots for seek_from_null, since my timing data for each seek has higher variance and I don't have his benchmarking code to compare with. Strangely, the times I get for the new tree_pos implementation have tended to have higher variance; sometimes, a single seek takes an order of magnitude longer than normal. Overall, it seems like the new implementation is comparable in performance to the old method, but slightly slower on average due to the occasionally long seek times:

import stdpopsim
import tskit
import time
import pandas as pd

species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("OutOfAfrica_3G09")
contig = species.get_contig("chr1", length_multiplier=0.2)
samples = {"YRI": 0, "CHB": 0, "CEU": 2e4}
engine = stdpopsim.get_engine("msprime")
test_ts = engine.simulate(model, contig, samples, seed=1)

def seek_from_null_sweep(ts, method):
    data = []

    for n in range(0, ts.num_trees - 1, 500):
        tree = tskit.Tree(ts)
        before = time.perf_counter()
        tree.seek(n)
        duration = time.perf_counter() - before
        data.append([n, duration, method])

    # Convert the list to a DataFrame
    df = pd.DataFrame(data, columns=['Index', 'Time', 'Method'])
    return df

#lib_null_df = seek_from_null_sweep(test_ts, method="lib")
tree_pos_null_df = seek_from_null_sweep(test_ts, method="tree_pos")

image

We can't determine a clear difference between the methods from a single run, but if I repeatedly time how long it takes to seek from null to various points along the genome, the difference between the two implementations does seem significant in some cases. Here, I seeked to 1000 indices in the HGDP and msprime simulated trees, and only 10 indices in the sc2ts trees. This how I timed it:

def test_seek_from_null(ts, num_runs, num_indices):
    times = np.zeros(num_runs)
    tree = tskit.Tree(ts)
    indices = np.linspace(0, ts.num_trees - 1, num_indices, dtype=int)
    for run in range(num_runs):
        before = time.perf_counter()
        for n in indices:
            tree.clear()
            tree.seek(n)
        times[run] = time.perf_counter() - before
    return np.mean(times), np.std(times)

Here are the results, including the tests for other tree positioning methods (the times are averaged over 30 runs for seek_from_null and 50 runs for the other tests):

image

@jeromekelleher Do you think this warrants further investigation / profiling? My test setup is far from ideal: background processes and the complexity of WSL + Jupyter + Windows causes significant variation in the timings I calculate. In my first experiments, the performance of the new method was actually worse, but I couldn't replicate that later on.

@duncanMR duncanMR force-pushed the update_tree_struct branch 2 times, most recently from 697c591 to ebf9159 Compare February 21, 2024 16:43
@molpopgen
Copy link
Member

I don't totally remember the details but my benchmarks were probably all done using a prototype written using tskit-rust and are now lost to history.

@jeromekelleher
Copy link
Member

Interesting - what's the percentage difference in the average seek_from_null times? I guess we should run it through perf or something to see where the time is being spent in the low-level code.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an obvious pointer dereferencing that could be leading to the perf difference. Let's avoid using self-> within loops.

c/tskit/trees.c Outdated
valid = tsk_tree_position_next(&self->tree_pos);

if (valid) {
for (j = self->tree_pos.out.start; j != self->tree_pos.out.stop; j++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these pointer references might be adding up. Try something like

tree_pos = self->tree_pos;
for (j = tree_pos.out.start; j != tree_pos.out.stop; j++) {
...
}```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this fixes the problem. I'm just running a better perftest script now to make sure

@duncanMR
Copy link
Contributor Author

I wrote a better performance test that calculates mean seek times for each of a subset of indices

def test_seek_from_null(ts, ts_name, num_runs, num_indices, method):
    times_matrix = np.zeros((num_runs, num_indices))
    indices = np.linspace(0, ts.num_trees - 1, num_indices, dtype=int)
    
    for run in range(num_runs):
        for i, n in enumerate(indices):
            tree = tskit.Tree(ts)
            before = time.perf_counter()
            tree.seek(n)
            after = time.perf_counter()
            times_matrix[run, i] = (after - before) * 1000000
    
    mean_times = np.mean(times_matrix, axis=0)
    std_times = np.std(times_matrix, axis=0)

    # Create a DataFrame to return
    results_df = pd.DataFrame({
        'index': indices,
        'mean': mean_times,
        'sd': std_times,
        'method': method,
        'ts': ts_name
    })

    return results_df

I computed the seek times for three TS, starting with a huge simulated ts:

image

The new algorithm performs substantially better toward the end of the sequence (results are averaged over 200 runs)

image

The variance in seek times is extremely high in the HGDP case for both methods, but the implementations perform within measurement error for each other

image

I had to reduce the number of runs to 50 for sc2ts because it takes so long. It's another clear win for the new implementation

image

I've updated the LSHMM and KC distance code. I removed the tsk_tree_position_step function because it is actually useless (we never want to update tsk_tree_position separately to the tree now). I think it's ready for review now. I just need to rebase

@duncanMR duncanMR changed the title Add tree_pos to tsk_tree_t struct Add tree_pos to tsk_tree_t struct and change next, prev and seek_from_null to use tree_pos Feb 21, 2024
@duncanMR
Copy link
Contributor Author

@Mergifyio rebase

Copy link
Contributor

mergify bot commented Feb 21, 2024

rebase

❌ Unable to rebase: user duncanMR is unknown.

Please make sure duncanMR has logged in Mergify dashboard.

@duncanMR duncanMR force-pushed the update_tree_struct branch 3 times, most recently from 272c1d7 to d88aff8 Compare February 21, 2024 20:02
@duncanMR duncanMR marked this pull request as ready for review February 21, 2024 20:04
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic. I've gone through this carefully and just spotted one detail. Ready to merge after that's fixed.

Can you keep the scripts etc that you use here handy, as it would be nice to do a tskit .dev news item with these perf numbers when we release this. Once you've implemented the seek_forward/backward functionality I think we'll want to do a joint Python and C API release, as the perf improvements are significant.

c/tskit/trees.c Outdated
// searching in the first or last 1/2
// of trees.
j = -1;
double interval_left = self->interval.left;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not actually using these values that interval_left and interval_right are being initialised to. Better to just declare,

double interval_left, interval_right;

ret = tsk_tree_position_seek_forward(&self->tree_pos, index);
if (ret != 0) {
goto out;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a couple of minutes to realise that because we're seeking from the null tree there are no edges to remove. It's worth adding a comment to that effect here I think.

@duncanMR
Copy link
Contributor Author

Fantastic. I've gone through this carefully and just spotted one detail. Ready to merge after that's fixed.

Can you keep the scripts etc that you use here handy, as it would be nice to do a tskit .dev news item with these perf numbers when we release this. Once you've implemented the seek_forward/backward functionality I think we'll want to do a joint Python and C API release, as the perf improvements are significant.

Thanks for the feedback; I've corrected both issues and rebased. I'll definitely keep my scripts handy; I'll be using them to assess the new seek functionality soon.

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Feb 22, 2024
@mergify mergify bot merged commit f955bae into tskit-dev:main Feb 22, 2024
23 checks passed
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Feb 22, 2024
@duncanMR duncanMR deleted the update_tree_struct branch February 22, 2024 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants