Keep unary for coalescent nodes in simplify #2381

jeromekelleher · 2022-07-04T18:58:50Z

See #2127. See also #1190 for when keep_unary_in_individuals was added.

This is a quick prototype to show how it can be done. Essentially we just need to make a second pass through the overlapping segments to see whether a node participates in any coalescences. The rest is refactoring.

Ideally we would like to use an enumeration like keep_unary="all" | "individuals" | "coalescent", but we need to be a bit careful about how this might interact with people using "truthy" inputs to keep_unary at the moment.

Here's an example:

When keep_unary = True:                                                                                               
3.86┊             24      ┊            24       ┊                     ┊ 
    ┊         ┏━━━━┻━━━━┓ ┊        ┏━━━━┻━━━━┓  ┊                     ┊ 
2.40┊        23         ┃ ┊       23         ┃  ┊            23       ┊ 
    ┊         ┃         ┃ ┊        ┃         ┃  ┊        ┏━━━━┻━━━━┓  ┊ 
1.32┊         ┃        21 ┊        ┃        21  ┊        ┃        22  ┊ 
    ┊         ┃         ┃ ┊        ┃         ┃  ┊        ┃         ┃  ┊ 
0.48┊        20         ┃ ┊       20         ┃  ┊       20         ┃  ┊ 
    ┊     ┏━━━┻━━━┓     ┃ ┊     ┏━━┻━━┓      ┃  ┊     ┏━━┻━━┓      ┃  ┊ 
0.38┊     ┃       ┃    19 ┊     ┃     ┃     19  ┊     ┃     ┃     19  ┊ 
    ┊     ┃       ┃     ┃ ┊     ┃     ┃     ┏┻┓ ┊     ┃     ┃     ┏┻┓ ┊ 
0.20┊    18       ┃     ┃ ┊    18     ┃     ┃ ┃ ┊    18     ┃     ┃ ┃ ┊ 
    ┊   ┏━┻━━┓    ┃     ┃ ┊   ┏━┻━┓   ┃     ┃ ┃ ┊   ┏━┻━┓   ┃     ┃ ┃ ┊ 
0.20┊  17    ┃    ┃     ┃ ┊  17   ┃   ┃     ┃ ┃ ┊  17   ┃   ┃     ┃ ┃ ┊ 
    ┊  ┏┻━┓  ┃    ┃     ┃ ┊  ┏┻━┓ ┃   ┃     ┃ ┃ ┊  ┏┻━┓ ┃   ┃     ┃ ┃ ┊ 
0.17┊  ┃  ┃ 16    ┃     ┃ ┊  ┃  ┃16   ┃     ┃ ┃ ┊  ┃  ┃16   ┃     ┃ ┃ ┊ 
    ┊  ┃  ┃ ┏┻┓   ┃     ┃ ┊  ┃  ┃ ┃   ┃     ┃ ┃ ┊  ┃  ┃ ┃   ┃     ┃ ┃ ┊ 
0.11┊ 15  ┃ ┃ ┃   ┃     ┃ ┊ 15  ┃ ┃   ┃     ┃ ┃ ┊ 15  ┃ ┃   ┃     ┃ ┃ ┊ 
    ┊ ┏┻┓ ┃ ┃ ┃   ┃     ┃ ┊ ┏┻┓ ┃ ┃   ┃     ┃ ┃ ┊ ┏┻┓ ┃ ┃   ┃     ┃ ┃ ┊ 
0.08┊ ┃ ┃ ┃ ┃ ┃  14     ┃ ┊ ┃ ┃ ┃ ┃  14     ┃ ┃ ┊ ┃ ┃ ┃ ┃  14     ┃ ┃ ┊ 
    ┊ ┃ ┃ ┃ ┃ ┃ ┏━┻━┓   ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓   ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓   ┃ ┃ ┊ 
0.07┊ ┃ ┃ ┃ ┃ ┃ ┃  13   ┃ ┊ ┃ ┃ ┃ ┃ ┃  13   ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃  13   ┃ ┃ ┊ 
    ┊ ┃ ┃ ┃ ┃ ┃ ┃  ┏┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃  ┏┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃  ┏┻━┓ ┃ ┃ ┊ 
0.06┊ ┃ ┃ ┃ ┃ ┃ ┃ 12  ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 12  ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 12  ┃ ┃ ┃ ┊ 
    ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ 
0.06┊ ┃ ┃ ┃10 ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃11 ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃11 ┃ ┊ 
    ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┊ 
0.00┊ 0 4 1 2 7 5 6 9 8 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 
    0                     2                     4                    10 
                                                                                                                      
                                                                                                                      
When keep_unary_if_coalescent = True                                                                                  
3.86┊             20      ┊            20       ┊                     ┊ 
    ┊         ┏━━━━┻━━━━┓ ┊        ┏━━━━┻━━━━┓  ┊                     ┊ 
2.40┊        19         ┃ ┊       19         ┃  ┊            19       ┊ 
    ┊         ┃         ┃ ┊        ┃         ┃  ┊        ┏━━━━┻━━━━┓  ┊ 
0.48┊        18         ┃ ┊       18         ┃  ┊       18         ┃  ┊ 
    ┊     ┏━━━┻━━━┓     ┃ ┊     ┏━━┻━━┓      ┃  ┊     ┏━━┻━━┓      ┃  ┊ 
0.38┊     ┃       ┃    17 ┊     ┃     ┃     17  ┊     ┃     ┃     17  ┊ 
    ┊     ┃       ┃     ┃ ┊     ┃     ┃     ┏┻┓ ┊     ┃     ┃     ┏┻┓ ┊ 
0.20┊    16       ┃     ┃ ┊    16     ┃     ┃ ┃ ┊    16     ┃     ┃ ┃ ┊ 
    ┊   ┏━┻━━┓    ┃     ┃ ┊   ┏━┻━┓   ┃     ┃ ┃ ┊   ┏━┻━┓   ┃     ┃ ┃ ┊ 
0.20┊  15    ┃    ┃     ┃ ┊  15   ┃   ┃     ┃ ┃ ┊  15   ┃   ┃     ┃ ┃ ┊ 
    ┊  ┏┻━┓  ┃    ┃     ┃ ┊  ┏┻━┓ ┃   ┃     ┃ ┃ ┊  ┏┻━┓ ┃   ┃     ┃ ┃ ┊ 
0.17┊  ┃  ┃ 14    ┃     ┃ ┊  ┃  ┃14   ┃     ┃ ┃ ┊  ┃  ┃14   ┃     ┃ ┃ ┊ 
    ┊  ┃  ┃ ┏┻┓   ┃     ┃ ┊  ┃  ┃ ┃   ┃     ┃ ┃ ┊  ┃  ┃ ┃   ┃     ┃ ┃ ┊ 
0.11┊ 13  ┃ ┃ ┃   ┃     ┃ ┊ 13  ┃ ┃   ┃     ┃ ┃ ┊ 13  ┃ ┃   ┃     ┃ ┃ ┊ 
    ┊ ┏┻┓ ┃ ┃ ┃   ┃     ┃ ┊ ┏┻┓ ┃ ┃   ┃     ┃ ┃ ┊ ┏┻┓ ┃ ┃   ┃     ┃ ┃ ┊ 
0.08┊ ┃ ┃ ┃ ┃ ┃  12     ┃ ┊ ┃ ┃ ┃ ┃  12     ┃ ┃ ┊ ┃ ┃ ┃ ┃  12     ┃ ┃ ┊ 
    ┊ ┃ ┃ ┃ ┃ ┃ ┏━┻━┓   ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓   ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓   ┃ ┃ ┊ 
0.07┊ ┃ ┃ ┃ ┃ ┃ ┃  11   ┃ ┊ ┃ ┃ ┃ ┃ ┃  11   ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃  11   ┃ ┃ ┊ 
    ┊ ┃ ┃ ┃ ┃ ┃ ┃  ┏┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃  ┏┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃  ┏┻━┓ ┃ ┃ ┊ 
0.06┊ ┃ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃ ┃ ┊ 
    ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ 
0.00┊ 0 4 1 2 7 5 6 9 8 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 
    0                     2                     4                    10 


When keep_unary = False                                    
3.86┊             20      ┊            20       ┊                     ┊                                                
    ┊         ┏━━━━┻━━━━┓ ┊        ┏━━━━┻━━━━┓  ┊                     ┊                                                
2.40┊         ┃         ┃ ┊        ┃         ┃  ┊            19       ┊                                                
    ┊         ┃         ┃ ┊        ┃         ┃  ┊        ┏━━━━┻━━━━┓  ┊                                                
0.48┊        18         ┃ ┊       18         ┃  ┊       18         ┃  ┊                                                
    ┊     ┏━━━┻━━━┓     ┃ ┊     ┏━━┻━━┓      ┃  ┊     ┏━━┻━━┓      ┃  ┊                                                
0.38┊     ┃       ┃     ┃ ┊     ┃     ┃     17  ┊     ┃     ┃     17  ┊                                                
    ┊     ┃       ┃     ┃ ┊     ┃     ┃     ┏┻┓ ┊     ┃     ┃     ┏┻┓ ┊                                                
0.20┊    16       ┃     ┃ ┊    16     ┃     ┃ ┃ ┊    16     ┃     ┃ ┃ ┊                                                
    ┊   ┏━┻━━┓    ┃     ┃ ┊   ┏━┻━┓   ┃     ┃ ┃ ┊   ┏━┻━┓   ┃     ┃ ┃ ┊                                                
0.20┊  15    ┃    ┃     ┃ ┊  15   ┃   ┃     ┃ ┃ ┊  15   ┃   ┃     ┃ ┃ ┊                                                
    ┊  ┏┻━┓  ┃    ┃     ┃ ┊  ┏┻━┓ ┃   ┃     ┃ ┃ ┊  ┏┻━┓ ┃   ┃     ┃ ┃ ┊                                                
0.17┊  ┃  ┃ 14    ┃     ┃ ┊  ┃  ┃ ┃   ┃     ┃ ┃ ┊  ┃  ┃ ┃   ┃     ┃ ┃ ┊                                                
    ┊  ┃  ┃ ┏┻┓   ┃     ┃ ┊  ┃  ┃ ┃   ┃     ┃ ┃ ┊  ┃  ┃ ┃   ┃     ┃ ┃ ┊                                                
0.11┊ 13  ┃ ┃ ┃   ┃     ┃ ┊ 13  ┃ ┃   ┃     ┃ ┃ ┊ 13  ┃ ┃   ┃     ┃ ┃ ┊                                                
    ┊ ┏┻┓ ┃ ┃ ┃   ┃     ┃ ┊ ┏┻┓ ┃ ┃   ┃     ┃ ┃ ┊ ┏┻┓ ┃ ┃   ┃     ┃ ┃ ┊                                                
0.08┊ ┃ ┃ ┃ ┃ ┃  12     ┃ ┊ ┃ ┃ ┃ ┃  12     ┃ ┃ ┊ ┃ ┃ ┃ ┃  12     ┃ ┃ ┊                                                
    ┊ ┃ ┃ ┃ ┃ ┃ ┏━┻━┓   ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓   ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓   ┃ ┃ ┊                                                
0.07┊ ┃ ┃ ┃ ┃ ┃ ┃  11   ┃ ┊ ┃ ┃ ┃ ┃ ┃  11   ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃  11   ┃ ┃ ┊                                                
    ┊ ┃ ┃ ┃ ┃ ┃ ┃  ┏┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃  ┏┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃  ┏┻━┓ ┃ ┃ ┊                                                
0.06┊ ┃ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃ ┃ ┊                                                
    ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊                                                
0.00┊ 0 4 1 2 7 5 6 9 8 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊                                                
    0                     2                     4                    10

codecov · 2022-07-04T19:11:42Z

Codecov Report

Attention: Patch coverage is 96.29630% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.34%. Comparing base (06e3fb0) to head (daae563).
Report is 268 commits behind head on main.

Files with missing lines	Patch %	Lines
c/tskit/tables.c	95.00%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2381   +/-   ##
=======================================
  Coverage   93.33%   93.34%           
=======================================
  Files          28       28           
  Lines       26981    26997   +16     
  Branches     1246     1247    +1     
=======================================
+ Hits        25184    25201   +17     
+ Misses       1763     1762    -1     
  Partials       34       34

Flag	Coverage Δ
c-tests	`92.26% <95.00%> (-0.01%)`	⬇️
lwt-tests	`89.05% <ø> (ø)`
python-c-tests	`71.00% <85.71%> (+0.01%)`	⬆️
python-tests	`98.93% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
python/_tskitmodule.c	`90.70% <100.00%> (+0.03%)`	⬆️
python/tskit/tables.py	`98.96% <100.00%> (+<0.01%)`	⬆️
python/tskit/trees.py	`98.66% <ø> (ø)`
c/tskit/tables.c	`90.25% <95.00%> (+<0.01%)`	⬆️

petrelharp · 2022-07-04T20:09:27Z

If I've got it right, this is retaining all ancestry for any nodes that are coalescent nodes anywhere in the sequence. This is slightly different to what extend_edges does, which is to output any input edges on which somewhere there is a coalescence. So, we could have a node that is a coalscent node on [0, 10], and then we output an edge for which it's a parent on [0, 15], but not a different edge for which it's a parent on [50, 60]. The argument for doing the second thing but not the first is that (a) it's not clear if we have any information to infer the existence of these totally-unary edges; and (b) these totally unary edges only increase the size of the tree sequence (as opposed to the partially-coalescent ones, which reduce it).

Thoughts?

hyanwong · 2022-07-04T20:32:10Z

Just to get this right in my head, you mean that if the node has non-contiguous edges (i.e. "trapped material") where it exists in the topology from (say) 10-15 and 50-60, but is not on any lineage, even as a unary node, from 15-50, then the current implementation keeps the 50-60 edge too?

If so, I think that's right, and the implementation here is not unreasonable behaviour. For example you may want to keep the number of nodes unaffected, but preserve as much of the known ancestry as possible. Deleting these regions loses potentially useful ancestral information. It would be interesting to see if nodes like this are inferred by tsinfer. I'll check now, but I suspect there are some cases where they will be. I can see the rationale for sometimes wanting to deleting them too, though.

I suspect that you are right that they only increase the size of the tree sequence, but I'm not 100% convinced that this is always the case.

hyanwong · 2022-07-04T21:53:58Z

Following up on my comment above. we do have "split" edges like this in tsinfer (we also have quite a lot of nodes that are unary everywhere too). Here's an example. it may be worth putting something like this in the repo at https://github.com/petrelharp/num_edges

import collections

import numpy as np
import msprime
import tskit
import tsinfer

num_samples = 8
ts  = msprime.sim_ancestry(
    num_samples,
    ploidy=1,
    sequence_length=10e8,
    recombination_rate=1e-8,
    record_full_arg=True,
    random_seed=222
)

tsA = msprime.sim_mutations(ts, rate=1e-8, random_seed=123)
print(tsA.num_sites, "sites")

tsB = tsinfer.infer(tsinfer.SampleData.from_tree_sequence(tsA))
print(tsB.num_trees, "inferred trees")

def node_spans_max_children(ts):
    node_spans = collections.defaultdict(list)
    # node_id => [(left, right, [n_children1, n_children2,...]), ()]
    curr_parents = collections.defaultdict(set)

    for tree, diffs in zip(ts.trees(), ts.edge_diffs()):
        for e_in in diffs.edges_in:
            u = e_in.parent
            if len(curr_parents[u]) == 0:
                # node starts
                node_spans[u].append(
                    [diffs.interval.left, diffs.interval.right, tree.num_children(u)]
                )
            else:
                node_spans[u][-1][2] = max(node_spans[u][-1][2], tree.num_children(u))
            curr_parents[e_in.parent].add(e_in.id)
        for e_out in diffs.edges_out:
            u = e_out.parent
            curr_parents[u].remove(e_out.id)
            if len(curr_parents[u]) == 0:
                # node ends
                node_spans[u][-1][1] = diffs.interval.right


    for u, data in node_spans.items():
        max_children = max(contiguous[2] for contiguous in data)
        for contiguous in data:
            if max_children > 1 and contiguous[2] < 2:
                print("Node", u, "is contiguously unary from", contiguous)

node_spans_max_children(tsB)

43 sites
16 inferred trees
Node 16 is contiguously unary from [548658128.0, 993319578.0, 1]

tsB.draw_svg(time_scale="rank", style=".n16 > .sym {fill: red} .mut {display: none}")

petrelharp · 2022-07-04T23:18:28Z

Ah, great, nice example. I do agree that sometimes it'd be conceptually nice to have those nodes in their entireity, but the two operations are in principle different, and we should figure out whether we want one or both of them. If the unconnected, unary-only segments of a coalescent nodes are also inferrable, then we should also include them! But, I am not sure that they are? Do you have a good reason that they might be?

So, one question is: are these unary-only spans real or artifactual? Like, when tsinfer infers a unary-only bit, how often is that unary-only bit "correct"? In the case you've got here, the extra unary bit does not seem to be correct, at least it isn't a parent to the same samples. On my machine (tsinfer 0.2.3) the node in question is 14, not 16, but:

t = tsA.first()
orig_node = t.mrca(1, 3, 7)
t = tsB.first()
new_node = t.mrca(1, 3, 7)
assert new_node == 14

left = 0
sA = set()
sB = set()
for interval, tA, tB in tsA.coiterate(tsB):
    new_sA = set(tA.samples(orig_node))
    new_sB = set(tB.samples(new_node))
    if sA != new_sA or sB != new_sB:
        print(f"{left} to {interval[0]} the node is a parent to:")
        print(f"         truth: {sA}")
        print(f"      inferred: {sB}")
    left = interval[0]
    sA = new_sA
    sB = new_sB

gives

0 to 0 the node is a parent to:
         truth: set()
      inferred: set()
147684860.0 to 160570091.0 the node is a parent to:
         truth: {1, 3, 7}
      inferred: {1, 3, 7}
169968287.0 to 173906518.0 the node is a parent to:
         truth: {3, 7}
      inferred: {1, 3, 7}
173906518.0 to 178484736.0 the node is a parent to:
         truth: {3, 7}
      inferred: {1, 3, 5, 6, 7}
178484736.0 to 204789241.0 the node is a parent to:
         truth: {3, 7}
      inferred: {3, 5, 7}
269704499.0 to 320784505.0 the node is a parent to:
         truth: {3, 7}
      inferred: set()
480931907.0 to 488755800.0 the node is a parent to:
         truth: {3, 5, 7}
      inferred: set()
540120992.0 to 548658128.0 the node is a parent to:
         truth: {0, 3, 5, 7}
      inferred: set()
564756120.0 to 576356796.0 the node is a parent to:
         truth: {0, 3, 5, 7}
      inferred: {3}
576356796.0 to 577972638.0 the node is a parent to:
         truth: {0, 5, 7}
      inferred: {3}
861036781.0 to 884623975.0 the node is a parent to:
         truth: {0, 5, 7}
      inferred: {3, 6}
959461190.0 to 983144527.0 the node is a parent to:
         truth: {0, 5, 7}
      inferred: set()

Have you got any situations where there's clearly information to infer those unary-only bits?

hyanwong · 2022-07-05T07:29:23Z

Moving this to petrelharp/num_edges#2 so the discussion can continue easily even when the PR is closed.

jeromekelleher · 2022-07-05T08:42:25Z

I think what simplify is doing here is a reasonable operation, because it's not discarding any information. Given a full ARG, it keeps a full record of any topology for ancestors that are coalescent anywhere. This is a useful operation, and I think corresponds to what msprime may optionally output some day.

Implementing the exact mirror of edge_extend in simplify would be tricky I think, and much more complicated than this because you'd have to do a lot of reasoning about what's adjacent to what, and exclude edges depending on what they are adjacent to. I'm not enthusiastic about trying to do this - partly because it would be hard, but mostly because I don't see a lot of point (other than to provide the mirror-image of edge_extend). In practise the increase in size caused by extra unary-only edges will be negligible, I think, and not worth the substantially more complicated simplify algorithm (from say, the forward simulation perspective).

I guess we could do some experiments easily enough to see how significant the difference is by counting the number of unary-only edges output by this for a range of input parameters?

hyanwong · 2022-07-05T08:45:36Z

I agree 100% with Jerome here. The operation as currently defined is simple and easy to implement and explain. The one involving adjacency, although theoretically interesting, is much more complicated as a simplification method IMO.

As an example of an additional complication due to adjacency, what happens when all the current edges vanish and are replaced with a single different edge with the same parent. This would appear to be adjacent, but I'm not sure if it would be extended using @petrelharp's algorithm. The simplify method here can ignore all that stuff.

petrelharp · 2022-07-05T14:06:14Z

Sounds good - let's do some experiments and see what the difference is. I do worry though that the result will be harder to explain, though: to the question "why's that stuff in the tree sequence" it's less confusing to say "it's only what you can learn from the trees" than if we add "plus a bit more stuff that made the algorithm easier".

jeromekelleher · 2022-07-05T14:24:59Z

I've always seen edge_extend as being a method to infer a subset of the true ancestry (i.e., what's returned by this method), which is why I said it would be exact in the SMC case and an approximation for things like full Hudson and gene conversion. I would then put it the other way around - we can infer a subset (~the vast majority?) of the true unary nodes using these properties you're exploiting.

I think it's a pretty strong assertion to say that the current edge_extend algorithm is the best possible and no extra information about unary nodes could ever be derived from the trees.

hyanwong · 2022-07-21T09:35:07Z

This LGTM @jeromekelleher: it's nice it is relatively simple. I agree that a separate parameter is the way to go, in case people want to keep both unary in individuals and when coalescent (I don't think we test for combinations of the keep_unary_XXX params, do we?)

I assume that this would still remove a node which is unary in some places, and coalescent in others, but where the coalescent regions do not lead to more than descendant sample (because one of the lineages down from a child is a piece of "hanging" topology that does not lead to a sample). Should we test this too?

The only other comment I would have is to think about the name of the parameter. To someone who hasn't thought about it a lot, keep unary if coalescent could possibly seem like a contradiction in terms. It's more like keep_unary_if_coalescent_elsewhere, but that's too long. We could have keep_partially_unary, although it's nice to have all the params starting with keep_unary_XXXXX, which I presume is why we hit upon the current naming. It's probably fine as it is, if documented well.

hyanwong · 2022-07-21T10:03:25Z

The only other comment I would have is to think about the name of the parameter.

Another possible name: keep_unary_regions (to emphasise that it's not that we are keeping unary nodes per se, but the unary regions of nodes that we would have kept anyway).

hyanwong · 2022-07-21T10:24:09Z

Here's a possible docstring. It's a bit contorted, but the best I could do after a fair bit of rewording.

If True, keep unary regions of retained nodes. After simplification, unary nodes may continue to exist in local trees, but only if those nodes would have been retained by the simplification process anyway (e.g. because they represent coalescences between two or more lineages from sample nodes to the root). If False, these nodes will still be present in the tree sequence, but are subject to removal from lineages where they are not coalescent points. Default: False

petrelharp · 2022-07-21T14:23:21Z

I think it's a pretty strong assertion to say that the current edge_extend algorithm is the best possible and no extra information about unary nodes could ever be derived from the trees.

Totally agree - I wasn't trying to say that it was!

How about

If True, keep all regions over which retained nodes are ancestral to samples. By default, nodes not in samples are only retained on regions of the genome where they are coalescent (i.e., a MRCA to at least two nodes in samples); if this option is True then nodes not in samples are still only retained if they are a MRCA to at least two nodes in samples, but all regions of the genome on which they are ancestral to a node in sample is also retained. Any extra retained information will show up as unary nodes on local trees. Default: False

petrelharp · 2022-07-22T04:42:56Z

OK - I did some quick experiments. This keep_unary argument seems to usually increase the number of edges by about as much as extend_edges decreases it. It's very interesting how much of this stuff there is!

So - this operation might still be useful, but I was wanting to include it in SLiM, to produce smaller, faster files. It looks like this operation won't do that. I'm not sure about the bigger picture at the moment.

jeromekelleher · 2022-07-22T09:09:42Z

Huh, isn't that interesting... I wonder if that's still true if you have a larger sample size? 6 is small enough that I can imagine odd things happening.

hyanwong · 2022-07-22T11:47:14Z

I thought I had some code and plots that showed that the hacked version generally reduced the number of edges. It could have been with a bigger sample size: I can't remember.

petrelharp · 2022-07-22T15:27:23Z

Hah, you're right about the number of samples:

... why would this happen? (i.e., why would more samples make the proportion of edges that are isolated unary edges on coalescent nodes smaller?)

jeromekelleher · 2022-07-22T15:32:37Z

... why would this happen? (i.e., why would more samples make the proportion of edges that are isolated unary edges on coalescent nodes smaller?)

Not obvious to me...

petrelharp · 2022-07-22T15:40:11Z

I wonder how this will change with a longer genome, though...

petrelharp · 2022-07-22T20:14:33Z

Ack, here's results for a longer sequence (0.1M instead of 0.01M):

petrelharp · 2022-07-23T06:18:38Z

Here's a bit more of an experiment changing both the number of samples and sequence length; recomb rate is 1e-8. Left is 10 samples; center is 100 samples; right is 1000.

jeromekelleher · 2022-07-25T13:40:29Z

Very interesting! I really didn't see this coming - the plot makes a clear practical case for edge_extend.

I guess we do need to implement it in simplify to be truly useful for forward sims. I think the present operation is useful also, but we'd need some new names I think.

Maybe we should move away from the phrase keep_unary entirely? I'm finding it less helpful every time I look at this stuff.

hyanwong mentioned this pull request Jul 5, 2022

Keeping detached unary segments on simplify petrelharp/num_edges#2

Open

jeromekelleher force-pushed the simplify-unary-if-coalescent branch 2 times, most recently from 83a253c to 4d9e26f Compare July 18, 2022 08:17

jeromekelleher added 2 commits July 18, 2022 12:19

Prototype for keep_unary_if_coalescent

aa78fdf

Update

daae563

jeromekelleher force-pushed the simplify-unary-if-coalescent branch from 4d9e26f to daae563 Compare July 18, 2022 11:19

benjeffery closed this Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keep unary for coalescent nodes in simplify #2381

Keep unary for coalescent nodes in simplify #2381

jeromekelleher commented Jul 4, 2022 •

edited

Loading

codecov bot commented Jul 4, 2022 •

edited

Loading

petrelharp commented Jul 4, 2022

hyanwong commented Jul 4, 2022 •

edited

Loading

hyanwong commented Jul 4, 2022 •

edited

Loading

petrelharp commented Jul 4, 2022 •

edited

Loading

hyanwong commented Jul 5, 2022

jeromekelleher commented Jul 5, 2022

hyanwong commented Jul 5, 2022 •

edited

Loading

petrelharp commented Jul 5, 2022

jeromekelleher commented Jul 5, 2022

hyanwong commented Jul 21, 2022 •

edited

Loading

hyanwong commented Jul 21, 2022 •

edited

Loading

hyanwong commented Jul 21, 2022 •

edited

Loading

petrelharp commented Jul 21, 2022

petrelharp commented Jul 22, 2022

jeromekelleher commented Jul 22, 2022

hyanwong commented Jul 22, 2022

petrelharp commented Jul 22, 2022

jeromekelleher commented Jul 22, 2022

petrelharp commented Jul 22, 2022

petrelharp commented Jul 22, 2022

petrelharp commented Jul 23, 2022 •

edited

Loading

jeromekelleher commented Jul 25, 2022

Keep unary for coalescent nodes in simplify #2381

Keep unary for coalescent nodes in simplify #2381

Conversation

jeromekelleher commented Jul 4, 2022 • edited Loading

codecov bot commented Jul 4, 2022 • edited Loading

Codecov Report

petrelharp commented Jul 4, 2022

hyanwong commented Jul 4, 2022 • edited Loading

hyanwong commented Jul 4, 2022 • edited Loading

petrelharp commented Jul 4, 2022 • edited Loading

hyanwong commented Jul 5, 2022

jeromekelleher commented Jul 5, 2022

hyanwong commented Jul 5, 2022 • edited Loading

petrelharp commented Jul 5, 2022

jeromekelleher commented Jul 5, 2022

hyanwong commented Jul 21, 2022 • edited Loading

hyanwong commented Jul 21, 2022 • edited Loading

hyanwong commented Jul 21, 2022 • edited Loading

petrelharp commented Jul 21, 2022

petrelharp commented Jul 22, 2022

jeromekelleher commented Jul 22, 2022

hyanwong commented Jul 22, 2022

petrelharp commented Jul 22, 2022

jeromekelleher commented Jul 22, 2022

petrelharp commented Jul 22, 2022

petrelharp commented Jul 22, 2022

petrelharp commented Jul 23, 2022 • edited Loading

jeromekelleher commented Jul 25, 2022

jeromekelleher commented Jul 4, 2022 •

edited

Loading

codecov bot commented Jul 4, 2022 •

edited

Loading

hyanwong commented Jul 4, 2022 •

edited

Loading

hyanwong commented Jul 4, 2022 •

edited

Loading

petrelharp commented Jul 4, 2022 •

edited

Loading

hyanwong commented Jul 5, 2022 •

edited

Loading

hyanwong commented Jul 21, 2022 •

edited

Loading

hyanwong commented Jul 21, 2022 •

edited

Loading

hyanwong commented Jul 21, 2022 •

edited

Loading

petrelharp commented Jul 23, 2022 •

edited

Loading