Add tree_pos to tsk_tree_t struct and change next, prev and seek_from_null to use tree_pos #2874
@jeromekelleher Unfortunately the CI issues remain on the fresh branch. Is there any way to force update the package cache? |
Codecov Report
Coverage diff, main vs. #2874:
              main     #2874    +/-
Coverage     89.75%   89.79%   +0.03%
Files            30       30
Lines         30395    30399      +4
Branches       5912     5909      -3
Hits          27282    27296     +14
Misses         1781     1778      -3
Partials       1332     1325      -7
@benjeffery what's the easiest way of dealing with this? |
Ben suggested bumping the version number in the cache config, which I did in the last commit, but for some reason it seems to have not refreshed the cache. |
Bumping the number will certainly have flushed the cache. Think there is a deeper issue here - checking it now. |
BTW - The CI cache works across PRs, so a fresh one doesn't have any effect. Hoping #2875 fixes this CI issue. |
Ah, I realised that using Sync Fork on GitHub causes a merge. Will not use it again. Hopefully CI works now, just need to fix the coverage issue |
Force-pushed from b3af6f8 to b08efa7.
If you need to rebase to main and the PR is otherwise complete, you can use "_mergifyio rebase", replacing the _ with a @. |
I think I've covered all the cases where |
I think you're doing this the wrong way around. You need to update the tree class to use the tree_pos_t first, then alter external code to use it directly.
c/tskit/haplotype_matching.c
Outdated
@@ -247,8 +246,8 @@ tsk_ls_hmm_reset(tsk_ls_hmm_t *self, double value)
    tsk_memset(self->transition_parent, 0xff,
        self->max_transitions * sizeof(*self->transition_parent));

    tsk_tree_position_free(&self->tree_pos);
    ret = tsk_tree_position_init(&self->tree_pos, self->tree_sequence, 0);
    tsk_tree_free(&self->tree);
I'm not sure we want to do this - we're already maintaining the state of the tree elsewhere, so we can delete this "_init" and the free.
c/tskit/trees.c
Outdated
@@ -7007,10 +7002,10 @@ tsk_treeseq_kc_distance(const tsk_treeseq_t *self, const tsk_treeseq_t *other,
    if (ret != 0) {
        goto out;
    }
    tsk_tree_position_next(&tree_pos[0]);
    tsk_bug_assert(tree_pos[0].index == 0);
    tsk_tree_position_next(&trees[0].tree_pos);
Ah no, that's not the point - we need to use the internal calls to tree_pos_next. Calling externally like this should break all sorts of things.
I would drop the last two commits for now @duncanMR, and focus on changing tree methods to use the tree_pos |
Force-pushed from c514476 to bf2f3ac.
I've moved the last two commits into another branch for now. I'm going to have to pause this for now, but I'll continue revising the tree positioning code next year. |
Force-pushed from 6f7e74b to 895abed.
I've moved over all the tree positioning functions to use

    from tests import tsutil
    from tests import test_tree_positioning

    ts = tsutil.all_trees_ts(3)
    tree_state = test_tree_positioning.StatefulTree(ts)
    tree_state.prev()
    print(tree_state)

So we will insert edges 1 ( I'm still trying to figure out how the system for node ordering works. I think we either need to change the |
Hmm, the ordering of children for a parent is documented as arbitrary, and we can change it. If that's the only problem then I wouldn't consider it a bug. Is it just those tests expect a particular ordering? |
Yes, it seems that |
It's always been conditioned on the precise order in which edges are inserted and removed. For example, you will slightly different So, just to clarify, the arrays are correct here, but they are ordered slightly differently to the current library code? I'm OK with that, if that's the case. |
I see. Yes, the

    Tree 1 left_child: [-1 -1 -1 0 2 4]
    Tree 2 left_child: [-1 -1 -1 0 3 4]
    Tree 1 left_sib: [-1 0 -1 2 -1 -1]
    Tree 2 left_sib: [-1 0 3 -1 -1 -1]

The trees are identical otherwise. Should I just remove the checks for whether the left/right child/sib arrays are identical for this test? |
Yes, for now you can just comment out those tests so we can see how the rest of the suite is doing (we would come up with a better solution before merging). |
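One possible shape for an order-insensitive check, as a rough sketch only: rebuild each parent's set of children from the left_child and left_sib arrays quoted above and compare the sets rather than the raw arrays. The helper name `children_by_parent` is hypothetical and is not part of tskit or its test suite; the last array entry is assumed to be the virtual root.

```python
def children_by_parent(left_child, left_sib):
    """Map each parent to the set of its children, ignoring sibling order."""
    num = len(left_child)
    # Invert left_sib to obtain right_sib, so sibling lists can be walked.
    right_sib = [-1] * num
    for u, v in enumerate(left_sib):
        if v != -1:
            right_sib[v] = u
    children = {}
    for parent, child in enumerate(left_child):
        kids = set()
        while child != -1:
            kids.add(child)
            child = right_sib[child]
        if kids:
            children[parent] = frozenset(kids)
    return children


# Arrays copied from the comment above (hypothetical helper, toy usage).
tree1 = children_by_parent([-1, -1, -1, 0, 2, 4], [-1, 0, -1, 2, -1, -1])
tree2 = children_by_parent([-1, -1, -1, 0, 3, 4], [-1, 0, 3, -1, -1, -1])
print(tree1 == tree2)
```

With these arrays, node 4 has children 2 and 3 in both trees, just linked in the opposite order, so a set-based comparison treats the trees as equal.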
Force-pushed from 724c5ce to e272ff3.
I had some trouble replicating Kevin Thornton's performance plots for

    import stdpopsim
    import tskit
    import time
    import pandas as pd

    species = stdpopsim.get_species("HomSap")
    model = species.get_demographic_model("OutOfAfrica_3G09")
    contig = species.get_contig("chr1", length_multiplier=0.2)
    samples = {"YRI": 0, "CHB": 0, "CEU": 2e4}
    engine = stdpopsim.get_engine("msprime")
    test_ts = engine.simulate(model, contig, samples, seed=1)

    def seek_from_null_sweep(ts, method):
        # Time tree.seek(n) on a fresh tree for every 500th n.
        data = []
        for n in range(0, ts.num_trees - 1, 500):
            tree = tskit.Tree(ts)
            before = time.perf_counter()
            tree.seek(n)
            duration = time.perf_counter() - before
            data.append([n, duration, method])
        # Convert the list to a DataFrame
        df = pd.DataFrame(data, columns=['Index', 'Time', 'Method'])
        return df

    #lib_null_df = seek_from_null_sweep(test_ts, method="lib")
    tree_pos_null_df = seek_from_null_sweep(test_ts, method="tree_pos")

We can't determine a clear difference between the methods from a single run, but if I repeatedly time how long it takes to seek from null to various points along the genome, the difference between the two implementations does seem significant in some cases. Here, I seeked to 1000 indices in the HGDP and msprime simulated trees, and only 10 indices in the sc2ts trees. This is how I timed it:

    import numpy as np

    def test_seek_from_null(ts, num_runs, num_indices):
        times = np.zeros(num_runs)
        tree = tskit.Tree(ts)
        indices = np.linspace(0, ts.num_trees - 1, num_indices, dtype=int)
        for run in range(num_runs):
            before = time.perf_counter()
            for n in indices:
                # Reset to the null tree so every seek starts from scratch.
                tree.clear()
                tree.seek(n)
            times[run] = time.perf_counter() - before
        return np.mean(times), np.std(times)

Here are the results, including the tests for other tree positioning methods (the times are averaged over 30 runs for

@jeromekelleher Do you think this warrants further investigation / profiling? My test setup is far from ideal: background processes and the complexity of WSL + Jupyter + Windows cause significant variation in the timings I calculate. In my first experiments, the performance of the new method was actually worse, but I couldn't replicate that later on. |
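Since background load tends to inflate the mean and standard deviation, one option is to summarise repeated timings with the median and minimum instead. The following is only a sketch using the standard library: `time_repeated` is a hypothetical helper, and the `sum(range(...))` workload is a placeholder for a callable that builds a tree and seeks.

```python
import statistics
import time


def time_repeated(func, num_runs=30):
    """Return (median, min, max) wall-clock seconds over num_runs calls to func."""
    durations = []
    for _ in range(num_runs):
        before = time.perf_counter()
        func()
        durations.append(time.perf_counter() - before)
    return statistics.median(durations), min(durations), max(durations)


# Placeholder workload; swap in the seek being measured.
median, fastest, slowest = time_repeated(lambda: sum(range(10_000)), num_runs=20)
print(f"median={median:.6f}s min={fastest:.6f}s max={slowest:.6f}s")
```

The minimum approximates the cost with the least interference, while the gap between minimum and maximum gives a feel for how noisy the machine was during the run.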
Force-pushed from 697c591 to ebf9159.
I don't totally remember the details but my benchmarks were probably all done using a prototype written using tskit-rust and are now lost to history. |
Interesting - what's the percentage difference in the average |
There's an obvious pointer dereferencing that could be leading to the perf difference. Let's avoid using self-> within loops.
c/tskit/trees.c
Outdated
    valid = tsk_tree_position_next(&self->tree_pos);

    if (valid) {
        for (j = self->tree_pos.out.start; j != self->tree_pos.out.stop; j++) {
I think these pointer references might be adding up. Try something like
```
tree_pos = self->tree_pos;
for (j = tree_pos.out.start; j != tree_pos.out.stop; j++) {
    ...
}
```
Looks like this fixes the problem. I'm just running a better perftest script now to make sure
Force-pushed from ebf9159 to 99a3d7d.
I wrote a better performance test that calculates mean seek times for each of a subset of indices:

    def test_seek_from_null(ts, ts_name, num_runs, num_indices, method):
        times_matrix = np.zeros((num_runs, num_indices))
        indices = np.linspace(0, ts.num_trees - 1, num_indices, dtype=int)
        for run in range(num_runs):
            for i, n in enumerate(indices):
                # A fresh tree per measurement, so each seek starts from null.
                tree = tskit.Tree(ts)
                before = time.perf_counter()
                tree.seek(n)
                after = time.perf_counter()
                times_matrix[run, i] = (after - before) * 1000000  # microseconds
        mean_times = np.mean(times_matrix, axis=0)
        std_times = np.std(times_matrix, axis=0)
        # Create a DataFrame to return
        results_df = pd.DataFrame({
            'index': indices,
            'mean': mean_times,
            'sd': std_times,
            'method': method,
            'ts': ts_name
        })
        return results_df

I computed the seek times for three TS, starting with a huge simulated ts:

The new algorithm performs substantially better toward the end of the sequence (results are averaged over 200 runs).

The variance in seek times is extremely high in the HGDP case for both methods, but the implementations perform within measurement error of each other.

I had to reduce the number of runs to 50 for sc2ts because it takes so long. It's another clear win for the new implementation.

I've updated the LSHMM and KC distance code. I removed the |
Force-pushed from 99a3d7d to d1a44d9.
@Mergifyio rebase |
❌ Unable to rebase: user
|
Force-pushed from 272c1d7 to d88aff8.
Fantastic. I've gone through this carefully and just spotted one detail. Ready to merge after that's fixed.
Can you keep the scripts etc. that you use here handy, as it would be nice to do a tskit.dev news item with these perf numbers when we release this. Once you've implemented the seek_forward/backward functionality I think we'll want to do a joint Python and C API release, as the perf improvements are significant.
c/tskit/trees.c
Outdated
    // searching in the first or last 1/2
    // of trees.
    j = -1;
    double interval_left = self->interval.left;
You're not actually using the values that interval_left and interval_right are being initialised to. Better to just declare
double interval_left, interval_right;
    ret = tsk_tree_position_seek_forward(&self->tree_pos, index);
    if (ret != 0) {
        goto out;
    }
It took me a couple of minutes to realise that because we're seeking from the null tree there are no edges to remove. It's worth adding a comment to that effect here I think.
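The point about the null tree can be illustrated with a toy model, sketched here under loud assumptions: the edge tuples, the linear scan and the function name `seek_from_null` are illustrative only and are not tskit's implementation, which works from the index ranges held in tree_pos. Starting from an empty parent array, reaching the tree that covers a position only ever inserts edges.

```python
def seek_from_null(edges, num_nodes, position):
    """Build the parent array for the tree covering `position` from an empty tree."""
    parent = [-1] * num_nodes  # null tree: no node has a parent
    for left, right, p, c in edges:
        # Only insertions are needed: the null tree holds no edges to remove.
        if left <= position < right:
            parent[c] = p
    return parent


# Toy edge table (left, right, parent, child) over the interval [0, 10).
edges = [
    (0, 10, 3, 0),
    (0, 10, 3, 1),
    (0, 5, 4, 2),
    (0, 5, 4, 3),
    (5, 10, 3, 2),
]
print(seek_from_null(edges, 5, 2))  # tree covering position 2
print(seek_from_null(edges, 5, 7))  # tree covering position 7
```

Moving between two non-null trees would also need a removal step for edges that end before the target, which is exactly the step that can be skipped here.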
Force-pushed from d88aff8 to a614344.
Thanks for the feedback; I've corrected both issues and rebased. I'll definitely keep my scripts handy; I'll be using them to assess the new |
Description
This PR aims to add tree_pos to the tsk_tree_t struct and alter next, prev, first, last and seek_from_null to use the tree positioning code. This means that we can remove the separate tsk_tree_position iterator in LSHMM and KC distance code.
Fixes #2794, Fixes #2795 and Fixes #2825.
PR Checklist: