-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Union of tree sequences #623
Conversation
ps. A bit of clarification on how we're specifying where to graft. The signature of the proposed TableCollection method is
Here
This means that:
Here are two use cases, with different sorts of overlap: (A) Simulate population (a) forwards in time for a while; then save the state, and start two independent simulations, each one starting a new population (call these (b) and (c)) with offspring of the final generation of the first simulation. The two resulting sets of tables should agree for everything in population (a), so we can graft together on all nodes in population (a). Their overlap should agree, as long as the two subsequent simulations don't do something funny. (B) Simulate population (a) for a while, recording everyone alive at time T along the way. Say there were N(T) individuals alive at that time. Then, simulate another population, (b), starting from N(T) individuals but no prior history, and you'd like to say that the first generation of population (b) is the same as the that time slice in (a). So, the shared nodes are only those alive at time T; not all of prior history. We can't check overlap, since for instance, some of those N(T) individuals might be parent-child pairs, but those relationships won't be there in (b). Since the "same" individuals are alive in both, we could also worry that they would eg accumulate mutations in one that aren't present in the other. Luckily, however, this won't happen in SLiM, since mutations occur at birth only (unless you add them manually); and it won't happen in msprime, since we'd do this by a census event. Note that we could also do recapitation by grafting an msprime simulation onto the top of a SLiM simulation.. As I think through these use cases more, I'm thinking we don't need |
This looks interesting! I'm not quite following what this is doing yet, but I'll think about it some more. Two high-level questions
|
This would also seem a bit easier to do/better defined if we expressed it as |
good idea, yes
Could do, although for me
Oh, so the idea is that the top part of all the table collections in |
I agree with Peter about the name. A graft operation is usually defined as: break tree A wherever and put the resulting subtree somewhere along a branch in tree B. This is basically what we are doing here: breaking the
I don't know enough about tree sequences to see how this would be easier to implement. But I feel like this would be more redundant. I agree it would be easier to understand, though. Basically we are replacing the |
I don't like the fact that consistency checking is something we do optionally in the current formulation and that we have to resort to using simplify() in order to do it. It seems like a sledgehammer cracking a nut. I think we'd end up spending quite a bit of time and effort trying to figure out what all the corner cases are under the current semantics and then trying to write test cases to exercise them. Simply stated semantics usually translate into simpler code once you've closed all the loopholes. Particularly since we'll (probably?) want to do this in C, for SLiM. So, I'd advocate for |
I don't like that either. I guess that the other use case I'm thinking of - sticking two tree sequences together that don't overlap - could be a different method, say, So, the proposed interface simplifies things because the user doesn't have to pass in this node mapping? That's good; I like that. It does mean they need the tree-sequence-up-to-the-split point handy, but that seems OK (if we had a table collection method to cut out a time chunk that'd be easy to get). |
Another thought. What if we made the The semantics are even simpler then, because there is no overlap between |
@mufernando and I have done a bunch more thinking about this. Observations:
However, it should be true that if you "subset" the two tables you want to graft by their shared nodes (see below), and possibly reorder them, then the resulting table collections should be equal, up to provenance. So, I'm back to our original proposal: The proposed subset-and-reorder method is:
Note that this would also provide the reordering operation that Yan wanted recently. |
This is
That sounds like a useful general tool to me, and wouldn't be too hard to do? I'm worried that we're writing something that's solving quite specific problem here, that is bound very tightly to how SLiM does things, but giving it a general name in |
I don't see this as being SLiM-specific, but definitely agree it is more useful for forward simulations. A graft operation only makes sense going forward-in-time. If you want to do something similar but back in time you would use msprime's |
It's quite a bit simpler than simplify; it is only re-indexing, mostly. Here's the start of a draft, to give you the idea: def subset(tables, nodes):
new = tables.copy()
n = tables.nodes
new.nodes.set_columns(
flags = n.flags[nodes],
population = n.flags[nodes],
individual = n.flags[nodes], # TODO: reindex
time = n.time[nodes],
**_subset_array(n.metadata, n.metadata_offset nodes))
node_map = np.arange(tables.nodes.num_rows)
node_map[nodes] = np.arange(new.nodes.num_rows)
keep_indivs = np.unique(new.nodes.individual) # TODO: put in order
i = tables.individuals
new.individuals.set_columns(
flags = i.flags[keep_indivs],
**_subset_array(i.location, i.location_offset, keep_indivs),
**_subset_array(i.metadata, i.metadata_offset), keep_indivs))
e = tables.edges
keep_edges = np.logical_and(np.isin(e.parent, nodes), np.isin(e.child, nodes))
new.edges.set_columns(
left = e.left[keep_edges],
right = e.right[keep_edges],
parent = node_map[e.parent[keep_edges]],
child = node_map[e.child[keep_edges]],
**_subset_array(e.metadata, e.metadata_offset, keep_edges))
# etcetera |
As for being SLiM-specific: it's true, we don't have other current use cases for this more restrictive version that requires checking overlap. But, what about, for instance, when someone working with decode's genealogy figures out that ancestor A in pedigree X is the same as ancestor B in pedigree Y, and thus wants to merge pedigree X and pedigree Y? (hm, I said "merge"...) |
It all sounds plausible @petrelharp - I guess I just don't have the time to think about this properly right now as there's a lot of stuff queuing up for release. Is this something you and @mufernando would like to see in 0.3.0? |
Well, on the one hand we can be flexible, but on the other, @mufernando is needing it basically now, so we're going to be developing it now, and so it'd make most sense to get it in while it's fresh. 0.3.0 or not doesn't matter, though. We could get input from others, perhaps? |
Update: @mufernando and I just spent quite a while talking through the various options. Here's what we think is the right way forward. We would like to move forward, if you're comfortable, @jeromekelleher - maybe @gtsambos and/or @hyanwong could give us a sanity check? I'll try to make this self-contained, so you don't need to wade through the stuff above. Goal: implement a method, provisionally called Use cases:
API options: Input should be two sets of tables, We think that option (a) is the simpler one, because the tool we need to check for overlap is straightforward ("subset"; see below), and constructing the edge tables needed for (b) is more error-prone. And, for (b) you need to pass in two things, not just one. So, the proposed method of a table collection is:
where
To do the sanity check we need a "reorder and subset" method of a table collection, which is described here, and is something that Yan has recently wanted (the reordering bit anyhow). We'll describe this in a different PR. Questions:
ps: changes from previous proposal:
pps: @jeromekelleher previously proposed doing this by assuming no overlap between the two sets of tables; this would be pretty easy using the "subset" method, and would require passing in edge tables, as in (b) above, which is why we went with this proposal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. The operation looks simple enough and makes sense (I think - once a few things have been clarified). There's a lot of work to be done in testing it though, as I'm sure there's a bunch of icky corner cases we've not considered and won't come out of the woodwork until we've tested it on a bunch of different simulated topologies.
I like this speculative idea. One possibility is that, if a migrant in population (a) comes from a different population (b), then the ts in (a) wouldn't actually need to track the migrant back in time until the MRCA: it would be tracked within population (b), and there would be no need to duplicate the history within (a) too. That would mean "cutting off" the migrant once it entered the other population, and once all its relevant genome was covered by parent edges in the other population. The migrant would appear to lead to a separate root in (a), and it would only be after the merge that a singly rooted tree would emerge. However, there's a fair bit of extra work here required for this, such as checking that span covered by the edges leading to this migrant is also covered by edges leading out of the migrant, etc.
Agree.
For comparison I guess we need to be careful about sorting order? Does this not, for instance, require that
Yes, I think so. A graphical representation would be nice. It would be trivial to also make this a non-in-place method of a tree sequence, as well as an in-place method on a TableCollection, just like we do for simplify, trim, etc.
Definitely
perhaps my comment above covers one of those. What about tree sequences for separate chromosomes in the same (haploid?) population? Might we want to combine those somehow (seems a little different, though)
I vote for
Down the line, it presumably would be useful to have a separate function which adds and re-indexes populations on this basis of the populations in another TS, to allow a merge to occur when a new population appears in one of the tree sequences. Presumably we would allow the node map to contain sample nodes too, so that the two tree sequences need not be describing different sample populations? |
By the way, is recapitation just a limited version of the functionality proposed here? I.e. instead of recapitating an existing TS, you could set up a new msprime simulation with the individuals from the start of the SLiM simulation, simulate in msprime, then graft the two together? It might be elegant to rephrase recapitation in these terms. |
Yes, good spotting. Strictly speaking this will be the case, but in practice this won't be a problem, although it's extra incentive to get a total order sort implemented. |
Hi all, sorry for the radio silence, it's been very busy down here for me 😫 just chiming in to say that I plan to get to grips with this sometime over the next day or two! |
Another possible name for this is |
You previously said "for me merge sounds like you'll get all the structure in other, which isn't necessarily true.". Union sounds even more like this, in a nod to set theory. Is this more like a intersection or a database-like LEFT JOIN? I've lost track of whether it is asymmetrical. |
Aha, interesting question! I think this doesn't characterize as a So |
Initially it was more asymmetrical. We've put this aside while implementing subset, which will make this easier, but here is I think the current plan: sanity check whether Ah, but you can't use it for #675 because the mutations would be associated with already-existing nodes. (I mean, maybe we could get it in there, but it would get in the way of this simple interpretation.) |
Hm, maybe this one is |
how did this happen? the tests had passed before and are not failing locally. I'll look into it. |
Gee, I thought it was passing, too! Well, it fails locally now, but it didn't just a bit ago, so something must have happened in the last few commits? |
I think the bot that rebased against upstream/master changed sth... |
well, now I tested with both gcc and clang and it worked fine. I'm pretty sure it was something about the AdminBot-tskit that effed it up. |
wtf is going on here... is it possible to rerun all the tests on this same commit? |
10b9eff
to
b030523
Compare
when I rebase bad stuff happens... but the rebase worked fine and there were no conflicts. so not sure what is going on. |
what sort of bad stuff? |
Ah-ha: the problem is because of actual behavior that changed upstream. This fixes it:
|
Looks like you should do the rebase yourself and check everything still works. |
5c12b7d
to
985644e
Compare
something about rebasing against upstream/master is royally fucked up. now I broke the tests for |
unless someone has a better idea, tomorrow morning I'll reset against upstream/master and manually add all my changes. |
You just changed a |
If we have two tree sequences which share part of their past histories, we might want to graft them together (also see #381). One use case could be that one population split into two and you might want to simulate their branches independently as a way of parallelising the simulations.
Here, @petrelharp and I implemented a method to
graft
two tables, a base table and its sister (details of dev process can be found in this repo). The main rationale is:node_map
;node_map
don't match. For that reason, we also implemented a method toadd_time
to the node times and migration times of a Table Collection;node_map
are added to new -- these are the target nodes;We implemented some tests, which rely on simplification. The idea is that if you simulate a history of one population splitting into two, simplify independently on the nodes of population 0 and 1, and graft them together you should get equivalent TreeSequences.
The implementation is not finalised, this is a slow -- but likely correct -- solution to the problem.
We would like any input on implementation/interface, but most importantly there are two points of improvement:
simplify
(see here)