Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finish/remove simulate_from #572

Closed
jeromekelleher opened this issue Aug 15, 2018 · 10 comments
Closed

Finish/remove simulate_from #572

jeromekelleher opened this issue Aug 15, 2018 · 10 comments
Milestone

Comments

@jeromekelleher
Copy link
Member

#541 added a method to complete a simulation from an existing tree sequence that has partially coalesced. However, this method is a poor approximation of the standard coalescent, as it discards all information about how different root nodes are grouped together into individuals.

We either need to (a) figure out a better way to do it, or (b) remove this method.

cc @petrelharp, @bhaller.

@jeromekelleher jeromekelleher added this to the Release 0.6.1 milestone Aug 15, 2018
@jeromekelleher
Copy link
Member Author

Coming back to the example in #541, this is what we should have:

        [0][(1084-10000) 0] 
        [0][(2033-10000) 2] 
        [0][(0-6840) 12] [(6840-10000) 1] 
        [0][(6840-10000) 11] 
        [0][(0-3066) 13] [(3066-10000) 7] 
        [0][(5804-10000) 4] 
        [0][(0-5804) 4] 
        [0][(0-10000) 3] 
        [0][(0-2033) 2] 
        [0][(0-1084) 0] [(3066-10000) 5] 

and this is what we get

        [0][(6840-10000) 1] 
        [0][(6840-10000) 11] 
        [0][(3066-10000) 5] 
        [0][(3066-10000) 7] 
        [0][(0-3066) 13] 
        [0][(0-10000) 4] 
        [0][(0-10000) 3] 
        [0][(0-10000) 2] 
        [0][(0-6840) 12] 
        [0][(0-10000) 0] 

The only way I can see around this is to use the tables to record the state of these extant ancestors. For example, we might explicitly store the segment chain for each one as individual metadata, probably via a list of (double, double, int32) tuples. I'm sure this would work well for backwards-in-time simulations, and would be a lot simpler in the end as we wouldn't need to piece together the root segments for trees.

However, it's not clear to me whether the same approach could be applied to forward simulations. Do we have this information for forward-time simulations? If not, then it may be best to just forget about this idea, as we already have a good way to composing different backward-time simulation models within msprime.

@petrelharp
Copy link
Contributor

Ah, I see! Yes, that is annoying. Let's see: I think that this would not be a problem if any nodes that contain not-totally-coalesced segments at the longest-ago-time are recorded in the tables, right? In forwards simulations, we can guarantee this just by marking the intial generation as samples (remembering them); then their identiy remains recorded through rounds of simplify, and we'll always know when more than one nonoverlapping segment is on the same chromosome at the time we start recapitation from.

Is this right? If the tables say there are two nonoverlapping segments inheriting from the same node at the time we start recapitating from, would these two segments end up in the same individual?

For reverse-time simulations, we would also have this information if we recorded in the tables nodes and edges corresponding to any not-yet-coalesced segments at the time we stop simulating.

Am I making sense?

@petrelharp
Copy link
Contributor

I've come across another side effect of the same issue: recapitating a tree sequence with a lot of roots can take a LOT of memory.

I'm encountering it in trees produced by SLiM; trying to replicate it with just msprime produces a different, but I think related, issue:

import msprime
ts = msprime.simulate(int(1e6), recombination_rate=100.0, __tmp_max_time=1)
ts.num_trees
# 4898
sum([t.num_roots for t in ts.trees()])
# 21182
ots = msprime.simulate(from_ts=ts, start_time=1, Ne=100)
# _msprime.InputError: The simulation model supplied resulted in a parent node having a time value <= to its child. This can occur either as a result of multiple bottlenecks happening at the same time or because of numerical imprecision with very small population sizes.

The balooning memory happens in set-up, specifically in this loop. One issue here looks to be that the loop produces a new segment for each root of each tree, when many of these could be merged into a single segment.

@petrelharp
Copy link
Contributor

Hm, I realized my suggestion earlier was foolish:

we can guarantee this just by marking the intial generation as samples (remembering them);

Yes, this would work, but it defeats the purpose of recapitation (we might as well start with a tree sequence). We could keep the required information around by (a) marking the initial generation as samples, but then (b) just before recapitation, removing them as samples and simplifying. Otherwise, we'll need simplify to record nodes and edges for not-totally-coalesced roots to retain this information.

@jeromekelleher
Copy link
Member Author

Thanks @petrelharp, very helpful. The memory issue can surely be solved --- I was hoping I'd get away with this laziness, but I know how to tackle it.

I'll have a think about the deeper problem and try playing with some smaller forward sims to see what I can come up with.

@petrelharp
Copy link
Contributor

petrelharp commented Aug 21, 2018

Note: we should make sure to document that simulate from remaps samples, just as simplify does. (presumably because it calls simplify, right?)

@jeromekelleher
Copy link
Member Author

Note: we should make sure to document that simulate from remaps samples, just as simplify does. (presumably because it calls simplify, right?)

I'm having second thoughts about this. Calling simplify is pretty ugly, as it does remap all the nodes and makes it quite tricky to compare the before-and-after tables. Maybe it's less ugly to just leave the unary edges in there, point out their presence in the documentation and explain that if you want to get rid of them you call simplify(). The only time they'll make a difference is if you are doing statistical comparisons of tree sequences based on the numbers of nodes, edges or trees --- in this case, you'll need to simplify to get the true values.

What do you think @petrelharp?

@petrelharp
Copy link
Contributor

I'm having second thoughts about this. Calling simplify is pretty ugly, as it does remap all the nodes and makes it quite tricky to compare the before-and-after tables. Maybe it's less ugly to just leave the unary edges in there, point out their presence in the documentation and explain that if you want to get rid of them you call simplify(). The only time they'll make a difference is if you are doing statistical comparisons of tree sequences based on the numbers of nodes, edges or trees --- in this case, you'll need to simplify to get the true values.

I was thinking the same thing.

@petrelharp
Copy link
Contributor

The only time they'll make a difference is if you are doing statistical comparisons of tree sequences based on the numbers of nodes, edges or trees

Or if you're going along getting the "tree heights" by getting the root times.

@jeromekelleher
Copy link
Member Author

Closed in #581.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants