Characterise propagation up the tree performance with seeking #2792

jeromekelleher · 2023-07-12T16:51:06Z

#2786 added the new tsk_tree_position_t class which takes care of tracking the edges that need to go in and out of a tree in order to transform it into a target tree. This works really well for the basic next and prev operations, and there is an implementation of seek_forward, which gives a straightforward way of transforming a tree into any other tree in the sequence, touching the minimum number of edges. It's worth looking at some of the key bits of code here:

       # The range of edges we need consider for removal starts
        # at the current right index and ends at the first edge
        # where the right coordinate is equal to the new tree's
        # left coordinate.
        j = right_current_index
        self.out_range.start = j
        # TODO This could be done with binary search
        while j < M and right_coords[right_order[j]] <= left:
            j += 1
        self.out_range.stop = j

        # The range of edges we need to consider for the new tree
        # must have right coordinate > left
        j = left_current_index
        while j < M and right_coords[left_order[j]] <= left:
            j += 1
        self.in_range.start = j
        # TODO this could be done with a binary search
        while j < M and left_coords[left_order[j]] <= left:
            j += 1
        self.in_range.stop = j

        self.interval.left = left
        self.interval.right = breakpoints[index + 1]

Here we fill in the in_range and out_range ranges of indexes into the left and right edge sortings which must be examined. We can be sure that we don't need to look at any edges outside of these ranges. Then, on the client side, we do something like this:

       old_left, old_right = self.tree_pos.interval
        self.tree_pos.seek_forward(index)
        left, right = self.tree_pos.interval
        for j in range(self.tree_pos.out_range.start, self.tree_pos.out_range.stop):
            e = self.tree_pos.out_range.order[j]
            e_left = self.ts.edges_left[e]
            # We only need to remove an edge if it's in the current tree, which
            # can only happen if the edge's left coord is < the current tree's
            # right coordinate.
            if e_left < old_right:
                c = self.ts.edges_child[e]
                assert self.parent[c] != -1
                self.parent[c] = -1
            assert e_left < left
        for j in range(self.tree_pos.in_range.start, self.tree_pos.in_range.stop):
            e = self.tree_pos.in_range.order[j]
            if self.ts.edges_left[e] <= left < self.ts.edges_right[e]:
                c = self.ts.edges_child[e]
                p = self.ts.edges_parent[e]
                self.parent[c] = p

I think this works quite well, and should correspond to something that works well in practise.

HOWEVER there is a problem: the edges that are removed and inserted are not necessarily in time-sorted order. This is a basic assumption that we lean on for incremental algorithms because it guarantees postorder-like behaviour, where we do the minimum amount of work in order to propagate information up the tree. In pathological cases, I think this could lead to O(tree height^2) performance on incremental algorithms that propagate information up the tree (like sample counting).

Unfortunately, this also applies to the current implementation of seek_from_null. Since we see great performance for this in benchmarks, I guess this means that either we've turned off sample counting for these benchmarks (which I doubt), or we have the possibility for some extra performance. See #2661 for more details.

I think the simplest thing to do initially would be to implement seek_from_null using the tree_pos based approach outlined above, and to defer propagating sample counts until after all the edges have been inserted. We can then do it using a standard postorder algorithm (which is as good as you're going to do anyway, if you insert the edges in the right time-order way).

How we approach the problem for the general seeking around the tree case, is more subtle. I guess we could either

Live with the edges going in and out in unsorted order, because it's probably going to be rare that we happen to do them in an order that's genuinely bad
Collect the edges we're examining and sort them so they are in order (but it's not obvious that this is actually better than the worst case?)

The text was updated successfully, but these errors were encountered:

jeromekelleher added enhancement New feature or request Performance This issue addresses performance, either runtime or memory labels Jul 13, 2023

This was referenced Jul 13, 2023

Replace low-level tree positioning code with tsk_tree_position_t calls #2794

Closed

Support windows in stats API that do not span whole genome #2782

Open

duncanMR mentioned this issue Feb 26, 2024

Implement efficient seeking from non-null trees using tree_pos #2911

Draft

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Characterise propagation up the tree performance with seeking #2792

Characterise propagation up the tree performance with seeking #2792

jeromekelleher commented Jul 12, 2023

Characterise propagation up the tree performance with seeking #2792

Characterise propagation up the tree performance with seeking #2792

Comments

jeromekelleher commented Jul 12, 2023