Finish/remove simulate_from #572

jeromekelleher · 2018-08-15T10:03:07Z

#541 added a method to complete a simulation from an existing tree sequence that has partially coalesced. However, this method is a poor approximation of the standard coalescent, as it discards all information about how different root nodes are grouped together into individuals.

We either need to (a) figure out a better way to do it, or (b) remove this method.

cc @petrelharp, @bhaller.

jeromekelleher · 2018-08-15T10:28:52Z

Coming back to the example in #541, this is what we should have:

        [0][(1084-10000) 0] 
        [0][(2033-10000) 2] 
        [0][(0-6840) 12] [(6840-10000) 1] 
        [0][(6840-10000) 11] 
        [0][(0-3066) 13] [(3066-10000) 7] 
        [0][(5804-10000) 4] 
        [0][(0-5804) 4] 
        [0][(0-10000) 3] 
        [0][(0-2033) 2] 
        [0][(0-1084) 0] [(3066-10000) 5]

and this is what we get

        [0][(6840-10000) 1] 
        [0][(6840-10000) 11] 
        [0][(3066-10000) 5] 
        [0][(3066-10000) 7] 
        [0][(0-3066) 13] 
        [0][(0-10000) 4] 
        [0][(0-10000) 3] 
        [0][(0-10000) 2] 
        [0][(0-6840) 12] 
        [0][(0-10000) 0]

The only way I can see around this is to use the tables to record the state of these extant ancestors. For example, we might explicitly store the segment chain for each one as individual metadata, probably via a list of (double, double, int32) tuples. I'm sure this would work well for backwards-in-time simulations, and would be a lot simpler in the end as we wouldn't need to piece together the root segments for trees.

However, it's not clear to me whether the same approach could be applied to forward simulations. Do we have this information for forward-time simulations? If not, then it may be best to just forget about this idea, as we already have a good way to composing different backward-time simulation models within msprime.

petrelharp · 2018-08-15T18:48:56Z

Ah, I see! Yes, that is annoying. Let's see: I think that this would not be a problem if any nodes that contain not-totally-coalesced segments at the longest-ago-time are recorded in the tables, right? In forwards simulations, we can guarantee this just by marking the intial generation as samples (remembering them); then their identiy remains recorded through rounds of simplify, and we'll always know when more than one nonoverlapping segment is on the same chromosome at the time we start recapitation from.

Is this right? If the tables say there are two nonoverlapping segments inheriting from the same node at the time we start recapitating from, would these two segments end up in the same individual?

For reverse-time simulations, we would also have this information if we recorded in the tables nodes and edges corresponding to any not-yet-coalesced segments at the time we stop simulating.

Am I making sense?

petrelharp · 2018-08-16T05:26:11Z

I've come across another side effect of the same issue: recapitating a tree sequence with a lot of roots can take a LOT of memory.

I'm encountering it in trees produced by SLiM; trying to replicate it with just msprime produces a different, but I think related, issue:

import msprime
ts = msprime.simulate(int(1e6), recombination_rate=100.0, __tmp_max_time=1)
ts.num_trees
# 4898
sum([t.num_roots for t in ts.trees()])
# 21182
ots = msprime.simulate(from_ts=ts, start_time=1, Ne=100)
# _msprime.InputError: The simulation model supplied resulted in a parent node having a time value <= to its child. This can occur either as a result of multiple bottlenecks happening at the same time or because of numerical imprecision with very small population sizes.

The balooning memory happens in set-up, specifically in this loop. One issue here looks to be that the loop produces a new segment for each root of each tree, when many of these could be merged into a single segment.

petrelharp · 2018-08-16T05:29:21Z

Hm, I realized my suggestion earlier was foolish:

we can guarantee this just by marking the intial generation as samples (remembering them);

Yes, this would work, but it defeats the purpose of recapitation (we might as well start with a tree sequence). We could keep the required information around by (a) marking the initial generation as samples, but then (b) just before recapitation, removing them as samples and simplifying. Otherwise, we'll need simplify to record nodes and edges for not-totally-coalesced roots to retain this information.

jeromekelleher · 2018-08-16T09:03:35Z

Thanks @petrelharp, very helpful. The memory issue can surely be solved --- I was hoping I'd get away with this laziness, but I know how to tackle it.

I'll have a think about the deeper problem and try playing with some smaller forward sims to see what I can come up with.

petrelharp · 2018-08-21T21:46:05Z

Note: we should make sure to document that simulate from remaps samples, just as simplify does. (presumably because it calls simplify, right?)

jeromekelleher · 2018-08-22T07:50:43Z

Note: we should make sure to document that simulate from remaps samples, just as simplify does. (presumably because it calls simplify, right?)

I'm having second thoughts about this. Calling simplify is pretty ugly, as it does remap all the nodes and makes it quite tricky to compare the before-and-after tables. Maybe it's less ugly to just leave the unary edges in there, point out their presence in the documentation and explain that if you want to get rid of them you call simplify(). The only time they'll make a difference is if you are doing statistical comparisons of tree sequences based on the numbers of nodes, edges or trees --- in this case, you'll need to simplify to get the true values.

What do you think @petrelharp?

petrelharp · 2018-08-22T16:10:49Z

I'm having second thoughts about this. Calling simplify is pretty ugly, as it does remap all the nodes and makes it quite tricky to compare the before-and-after tables. Maybe it's less ugly to just leave the unary edges in there, point out their presence in the documentation and explain that if you want to get rid of them you call simplify(). The only time they'll make a difference is if you are doing statistical comparisons of tree sequences based on the numbers of nodes, edges or trees --- in this case, you'll need to simplify to get the true values.

I was thinking the same thing.

petrelharp · 2018-08-22T16:22:51Z

The only time they'll make a difference is if you are doing statistical comparisons of tree sequences based on the numbers of nodes, edges or trees

Or if you're going along getting the "tree heights" by getting the root times.

jeromekelleher · 2018-08-24T13:21:02Z

Closed in #581.

jeromekelleher added this to the Release 0.6.1 milestone Aug 15, 2018

jeromekelleher mentioned this issue Aug 15, 2018

Simulate from #541

Merged

jeromekelleher mentioned this issue Aug 15, 2018

Add from argument to simulate #503

Closed

jeromekelleher closed this as completed Aug 24, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finish/remove simulate_from #572

Finish/remove simulate_from #572

jeromekelleher commented Aug 15, 2018

jeromekelleher commented Aug 15, 2018

petrelharp commented Aug 15, 2018

petrelharp commented Aug 16, 2018

petrelharp commented Aug 16, 2018

jeromekelleher commented Aug 16, 2018

petrelharp commented Aug 21, 2018 •

edited

jeromekelleher commented Aug 22, 2018

petrelharp commented Aug 22, 2018

petrelharp commented Aug 22, 2018

jeromekelleher commented Aug 24, 2018

Finish/remove simulate_from #572

Finish/remove simulate_from #572

Comments

jeromekelleher commented Aug 15, 2018

jeromekelleher commented Aug 15, 2018

petrelharp commented Aug 15, 2018

petrelharp commented Aug 16, 2018

petrelharp commented Aug 16, 2018

jeromekelleher commented Aug 16, 2018

petrelharp commented Aug 21, 2018 • edited

jeromekelleher commented Aug 22, 2018

petrelharp commented Aug 22, 2018

petrelharp commented Aug 22, 2018

jeromekelleher commented Aug 24, 2018

petrelharp commented Aug 21, 2018 •

edited