-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Keep unary for coalescent nodes in simplify #2381
Keep unary for coalescent nodes in simplify #2381
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2381 +/- ##
=======================================
Coverage 93.33% 93.34%
=======================================
Files 28 28
Lines 26981 26997 +16
Branches 1246 1247 +1
=======================================
+ Hits 25184 25201 +17
+ Misses 1763 1762 -1
Partials 34 34
Flags with carried forward coverage won't be shown. Click here to find out more.
|
If I've got it right, this is retaining all ancestry for any nodes that are coalescent nodes anywhere in the sequence. This is slightly different to what Thoughts? |
Just to get this right in my head, you mean that if the node has non-contiguous edges (i.e. "trapped material") where it exists in the topology from (say) 10-15 and 50-60, but is not on any lineage, even as a unary node, from 15-50, then the current implementation keeps the 50-60 edge too? If so, I think that's right, and the implementation here is not unreasonable behaviour. For example you may want to keep the number of nodes unaffected, but preserve as much of the known ancestry as possible. Deleting these regions loses potentially useful ancestral information. It would be interesting to see if nodes like this are inferred by I suspect that you are right that they only increase the size of the tree sequence, but I'm not 100% convinced that this is always the case. |
Following up on my comment above. we do have "split" edges like this in tsinfer (we also have quite a lot of nodes that are unary everywhere too). Here's an example. it may be worth putting something like this in the repo at https://github.com/petrelharp/num_edges import collections
import numpy as np
import msprime
import tskit
import tsinfer
num_samples = 8
ts = msprime.sim_ancestry(
num_samples,
ploidy=1,
sequence_length=10e8,
recombination_rate=1e-8,
record_full_arg=True,
random_seed=222
)
tsA = msprime.sim_mutations(ts, rate=1e-8, random_seed=123)
print(tsA.num_sites, "sites")
tsB = tsinfer.infer(tsinfer.SampleData.from_tree_sequence(tsA))
print(tsB.num_trees, "inferred trees")
def node_spans_max_children(ts):
node_spans = collections.defaultdict(list)
# node_id => [(left, right, [n_children1, n_children2,...]), ()]
curr_parents = collections.defaultdict(set)
for tree, diffs in zip(ts.trees(), ts.edge_diffs()):
for e_in in diffs.edges_in:
u = e_in.parent
if len(curr_parents[u]) == 0:
# node starts
node_spans[u].append(
[diffs.interval.left, diffs.interval.right, tree.num_children(u)]
)
else:
node_spans[u][-1][2] = max(node_spans[u][-1][2], tree.num_children(u))
curr_parents[e_in.parent].add(e_in.id)
for e_out in diffs.edges_out:
u = e_out.parent
curr_parents[u].remove(e_out.id)
if len(curr_parents[u]) == 0:
# node ends
node_spans[u][-1][1] = diffs.interval.right
for u, data in node_spans.items():
max_children = max(contiguous[2] for contiguous in data)
for contiguous in data:
if max_children > 1 and contiguous[2] < 2:
print("Node", u, "is contiguously unary from", contiguous)
node_spans_max_children(tsB)
|
Ah, great, nice example. I do agree that sometimes it'd be conceptually nice to have those nodes in their entireity, but the two operations are in principle different, and we should figure out whether we want one or both of them. If the unconnected, unary-only segments of a coalescent nodes are also inferrable, then we should also include them! But, I am not sure that they are? Do you have a good reason that they might be? So, one question is: are these unary-only spans real or artifactual? Like, when tsinfer infers a unary-only bit, how often is that unary-only bit "correct"? In the case you've got here, the extra unary bit does not seem to be correct, at least it isn't a parent to the same samples. On my machine (tsinfer 0.2.3) the node in question is 14, not 16, but:
gives
Have you got any situations where there's clearly information to infer those unary-only bits? |
Moving this to petrelharp/num_edges#2 so the discussion can continue easily even when the PR is closed. |
I think what simplify is doing here is a reasonable operation, because it's not discarding any information. Given a full ARG, it keeps a full record of any topology for ancestors that are coalescent anywhere. This is a useful operation, and I think corresponds to what msprime may optionally output some day. Implementing the exact mirror of edge_extend in simplify would be tricky I think, and much more complicated than this because you'd have to do a lot of reasoning about what's adjacent to what, and exclude edges depending on what they are adjacent to. I'm not enthusiastic about trying to do this - partly because it would be hard, but mostly because I don't see a lot of point (other than to provide the mirror-image of I guess we could do some experiments easily enough to see how significant the difference is by counting the number of unary-only edges output by this for a range of input parameters? |
I agree 100% with Jerome here. The operation as currently defined is simple and easy to implement and explain. The one involving adjacency, although theoretically interesting, is much more complicated as a simplification method IMO. As an example of an additional complication due to adjacency, what happens when all the current edges vanish and are replaced with a single different edge with the same parent. This would appear to be adjacent, but I'm not sure if it would be extended using @petrelharp's algorithm. The simplify method here can ignore all that stuff. |
Sounds good - let's do some experiments and see what the difference is. I do worry though that the result will be harder to explain, though: to the question "why's that stuff in the tree sequence" it's less confusing to say "it's only what you can learn from the trees" than if we add "plus a bit more stuff that made the algorithm easier". |
I've always seen I think it's a pretty strong assertion to say that the current |
83a253c
to
4d9e26f
Compare
4d9e26f
to
daae563
Compare
This LGTM @jeromekelleher: it's nice it is relatively simple. I agree that a separate parameter is the way to go, in case people want to keep both unary in individuals and when coalescent (I don't think we test for combinations of the I assume that this would still remove a node which is unary in some places, and coalescent in others, but where the coalescent regions do not lead to more than descendant sample (because one of the lineages down from a child is a piece of "hanging" topology that does not lead to a sample). Should we test this too? The only other comment I would have is to think about the name of the parameter. To someone who hasn't thought about it a lot, |
Another possible name: |
Here's a possible docstring. It's a bit contorted, but the best I could do after a fair bit of rewording.
|
Totally agree - I wasn't trying to say that it was! How about
|
Huh, isn't that interesting... I wonder if that's still true if you have a larger sample size? 6 is small enough that I can imagine odd things happening. |
I thought I had some code and plots that showed that the hacked version generally reduced the number of edges. It could have been with a bigger sample size: I can't remember. |
Not obvious to me... |
I wonder how this will change with a longer genome, though... |
Very interesting! I really didn't see this coming - the plot makes a clear practical case for edge_extend. I guess we do need to implement it in simplify to be truly useful for forward sims. I think the present operation is useful also, but we'd need some new names I think. Maybe we should move away from the phrase |
See #2127. See also #1190 for when keep_unary_in_individuals was added.
This is a quick prototype to show how it can be done. Essentially we just need to make a second pass through the overlapping segments to see whether a node participates in any coalescences. The rest is refactoring.
Ideally we would like to use an enumeration like
keep_unary="all" | "individuals" | "coalescent"
, but we need to be a bit careful about how this might interact with people using "truthy" inputs tokeep_unary
at the moment.Here's an example: