-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More general mutation models #553
Comments
Further notes:
|
This makes senses for finite-site schemes, but one has to keep in mind that not all int/double conversions are simple when the numbers get really big. For example: #include <iostream>
#include <cstdint>
using namespace std;
int main(int argc, char ** argv)
{
double x = 9007199254000000; //A bit less than 2^53.
while (x+1 != x){++x;}
std::cout << static_cast<uint64_t>(x-1) << ' ' << static_cast<uint64_t>(x) << ' ' << static_cast<uint64_t>(x+1) << '\n';
} In practice, you'll probably never simulate a genome that large, but the fact that most simulations tend to handle positions as doubles means that the int/double casting becomes a nagging issue that may bite you eventually.
This may be way overkill? I assume you mean double and not the actual C float? In either case, the number of possible values in [0,1) is quite large. It seems simpler to store derived state as something like int32_t and, for each replicate, store a map of [alphabet entry, int32_t] for later decoding. That's the same size as a float, smaller than a double, and possibly more portable.
|
How do you propose doing this - taking
True, and we don't actually avoid the nonuniqueness problem anyhow. I was just trying to avoid keeping track of what alleles had been used at that site already, which we have to do because we've said that the derived state had to be different than the mutation.parent's derived state. So, I guess we'd have to have a way of looking up the alleles that have been assigned at each site already. |
Imagine the scenario where you evolve for some amount of time, and then do some analysis, and then evolve some more. There's a chance that the state will change at already-mutated sites, and it seems that this case should be supported. |
I'm not sure what |
Oh, right, duh. Well, that case would be fine, because the proposed workflow only gets messed up if there are out-of-order mutations coming in. So, I propose two flags:
|
Sounds good. My other suggestion wouldn't have helped anything. |
I think this works. For the case of selected mutations and someone is doing evolve-stop-evolve again, it is up to the forward simulation engine to properly update mutation/site tables w.r.to derived state at the relevant nodes. |
General picture: we cannot generate mutations twice on the same chunk of the trees. Our two use cases therefore either (a) generate mutations on a different time slice, or (b) generate mutations on a different sequence slice. So, it should be an error to: say |
Sounds good in general to me @petrelharp, but I'm a bit confused about some of the details. I'll need to think about it a bit more. One high-level point though: Why are we trying so hard to avoid using the trees? Having a valid and indexed tree sequence is a pre-requisite for |
I'm just trying to factor things out into simple steps that are re-usable between simulation methods. Maybe we should think out the alternative method. |
I think the scheme laid out by @petrelharp is the most straightforward for 'standard' cases. (I also think this is how seq-gen works, as it handles primarily standard models of molecular evolution, IIRC.) The much more difficult case involves mutation models where the transition matrix for a site depends upon the k-mer in which the site is centered. For these models, I think you have to generate the ancestral sequence for each node for each marginal tree and account for the complication that k-mers cross marginal tree (stop, start) boundaries. Further, these models appear to present a complication for the idea of not simulating neutral sites 'as you go', as the mutation matrix for the selected sites will depend on the (unsimulated) neutral sites. This case is potentially pathological for how we've all been thinking of using tree sequences. For @jeromekelleher's idea of using trees, under models that the matrix method can handle, the scheme would be to go site by site and assign a derived state to each node, taking care to avoid already-mutated (presumably selected?) sites along the way. This gets repeated from root to tip for each marginal. For models of intermediate complexity (Markov models of dinucleotides), I assume that there's something out there in the phylogenetic literature. I'd wager that these are handleable via the @petrelharp scheme, and that the matrices are just larger and there are probably some simplifying symmetry assumptions. |
OK, thanks for the clarification @petrelharp and @molpopgen, this confirms I'm just being thick. Peter's scheme is AOK with me. |
Yep. However, note that we could do these as above, only looking up haplotypes in step (3).
Ack, you are right. Well, but then the neighboring neutral sites are sorta not neutral. I could imagine sorting this out. But agreed: there are many complications we should ignore atm. |
Note: we should include #508 also when implementing this. |
Thinking about this again (while dealing with #711), I came up with the following scheme which I think must be pretty efficient (although perhaps not as general as @petrelharp's scheme?):
There may be probabilistic subtleties here which I'm missing, so I'm not certain the approach is correct, but I am sure it can be done efficiently.
So, the cost is something like O(log n) per mutation O(log n) per tree transition which seems about as good as you're going to do. This does treat sites as isolated (so not worrying about surrounding sites), but I guess you could add this complication later. |
A scheme like this is totally general, with the caveat that since mutation rate may differ by identity of the site, The advantage to my proposal is it let us put down mutations in just the same way we do currently - independently by edge - so we only need to traverse trees at those sites where we have more than one mutation. It would therefore require writing less new code. |
And your scheme is also more efficient, as we just have to look at it edge-by-edge. My only issue with your method is that I don't understand it, which makes me reluctant to start implementing it! I'm deeply suspicious of random programs, and I assume anything that's not "obviously right" will have subtle problems... But I think I'm slowly seeing how it works. We're basically generating a bunch of 'stateless' mutations at finite sites along the edges, and then in a second pass choosing what the states actually are, right? So, for every edge/slice of the mutation map we'd generate k mutations from poisson(rate), and then uniformly assign these into the site 'bins' along this continuous interval. In the second pass, we'd go through the trees and for every site (a) create an ancestral state; and (b) going down the tree, for each mutation generate a state according to the transition matrix. (We don't have to do a full traversal to implement b, we can sort the sites by time and traverse upwards to find previously assigned state.) Is this right? What are the uniform doubles assigned to the derived state in your summary above for then? |
Another question: what happens when we have multiple mutations on a branch for a give site? My take is that we should only have mutation (i.e., remove all the silent changes along the branch), although it would be best to do this afterwards in a postprocessing step, since the state changes matter. It'll make our lives easier in tskit algorithms if we can assume there's one mutation per branch, right? |
The methods are quite similar. Let me slightly rewrite yours to make this more clear:
So: suppose there's two states, To do the "tree" method (yours), you lay down To do the "edge" method (mine), you lay down abstract mutations on the whole tree sequence (edgewise) with rate So, it's the same method: both are rejection sampling; just we're choosing the proposals differently. |
Life would be easier if we could assume only one mutation per branch, but I don't think we can assume that. Well, we could, but then we'd need to deal with the cases when that's not true on input (e.g., from SLiM). That was the point of the whole mutation.parent thing. I vote to continue to allow multiple mutations per branch. Whether to leave multiple mutations in: probably we don't usually need them, but it would make it easier to compare to theory if we left them in? |
Regarding number of mutations per site/branch: the theory will make a statement about the state of the descendant node given the state of the ancestral node + branch length + transition matrix. The detailed mutation history wouldn't be helpful? |
That and being able to determine your state without having to do a full upwards traversal.
Hmmm, true. How about we provide a method |
That was helpful, thanks. OK, I think I get it now. I'll have a go at implementing soon --- I'm working |
I'm not sure if this is useful , or where it fits in, but @marianne-aspbury and I are using something like the following code to generate finite sites mutations. It's basically the original suggestion by @petrelharp but (since it's a simple model) we don't need to throw out mutations (we just pick one from the remaining "atgc" letters, after removing the previous derived state). The only difficult bit is assigning the derived states to mutations so that they aren't the same as the previous derived state on a lineage. There a bit of faffing with import msprime
import tskit
import itertools
import random
import math
import numpy as np
# Construct a tree sequence with integerized breakpoints
length = 100
recomb_map = msprime.RecombinationMap.uniform_map(length, rate=1, num_loci=length)
ts = msprime.simulate(30, mutation_rate=1, recombination_map=recomb_map)
# sentinel_allele = b"\0" # changed following comments below
null_allele = b"\0"
states = np.array(['a','t','g','c'], dtype='|S1')
tables = ts.dump_tables()
tables.sites.clear()
tables.mutations.clear()
for integer_pos, sites_at_pos in itertools.groupby(ts.sites(), lambda x: math.floor(x.position)):
ancestral_state = random.choice(states)
site_id = tables.sites.add_row(integer_pos, ancestral_state)
# Order mutations by time
mutations = []
for site in sites_at_pos:
mutations.extend(site.mutations)
mutations.sort(key = lambda x: ts.node(x.node).time, reverse=True)
for m in mutations:
# Assign mutations with null parents & derived_states (will be filled in later)
tables.mutations.add_row(site_id, m.node, derived_state=null_allele, parent=tskit.NULL)
# Assign parents
tables.compute_mutation_parents()
# Assign derived states (use an array of chars)
ancestral_states = tables.sites.ancestral_state.view(dtype='|S1')
mutation_derived_state = np.full(tables.mutations.num_rows, null_allele, dtype='|S1')
for i, m in enumerate(tables.mutations):
if m.parent == tskit.NULL:
prev_state = ancestral_states[m.site]
else:
prev_state = mutation_derived_state[m.parent]
# Pick a new state that is different from the old one
new_state = random.choice([s for s in states if s != prev_state])
mutation_derived_state[i] = new_state
tables.mutations.derived_state = mutation_derived_state.view(tables.mutations.derived_state.dtype)
finite_sites_ts = tables.tree_sequence()
# Try printing them out
for h in finite_sites_ts.haplotypes():
print(h) |
@jeromekelleher - I suspect that anyone using |
This is neat @hyanwong --- I haven't thought hard about the properties of the mutations it outputs, but it looks plausible to me. What's the issue with compute_mutation_parents --- we're not checking the allele values to see if they make sense, are we? I don't get why a string with an embedded null helps here, isn't it the same if you have an arbitrary one character string here? |
It's only that if for some reason the parents have changed, or we didn't know them in the first place (which is presumably why we are running
Yes, anything could go here. I just stuck a |
We can add in a warning I guess, but the operation is low-level and intended for uses in which the user is in charge of making sure the state is consistent. This is a GIGO situation I think.
It looks like it's doing something special then though, and the name |
Given that this function is indeed documented, I think a brief note in the docs would be helpful (it would have been helpful to us, and saved us 1/2 hour of thinking). I'll bung it in a PR.
Right - my bad naming. I guess I should change this to "null_char" or something. It was just meant to be the character equivalent of |
ping @carjed who just volunteered to do this! <3 <3 |
@carjed - any update on this. It would be great to get something kicked off? |
Closed in #946 |
We'd like to provide support for other models of mutation. Here is a proposal for how to do this, in a two-stage protocol, that separates generating locations of the mutations (which can be done independently on each edge) from allelic states. We'd need to do two things:
Finite sites, parent-independent mutation: First, we need to modify the current algorithm to produce mutations at a finite set of sites.
I'd propose for this to be only at integer locations; we could make the set of possible locations more general, but why bother?
These mutations would be generated independently on each edge, and derived state could be random floats in
[0,1)
to ensure uniqueness (and stored as bytes).We would not need to specify the mutation
parent
property, because it could be done afterwards withcompute_mutation_parent()
if we traversed the edges in order by forwards-time.Transformation to finite-state model: The next step is to convert the mutations produced in the first step -- or the mutations produced by the infinite-sites model -- to a finite-state model, e.g., Jukes-Cantor. To do this:
We'd be provided with an alphabet of length$m$ , a distribution $\pi_i$ for the root state, and a $m \times m$ matrix $Q_{ij}$ that gives the rate at which allele $i$ mutates into allele $j$ , with $Q_{ii} = 0$ . We then define $\mu = \max_i \sum_j Q_{ij}$ and $P_{ij} = Q_{ij} / \mu$ for $i \neq j$ and $P_{ii} = 1 - \sum_{j \neq i} P_{ij}$ .
We generate parent-independent mutations as above using mutation rate$\mu$ , and use
compute_mutation_parents
to find the mutation parents.We step through the sites:
Note that thanks to having mutation parents we don't need to generate the trees to do this. Also note that if the derived states produced by parent-independent mutation are random floats, we don't need to produce any more randomness to do this.
Advantages to this decoupled scheme:
The main advantage to this scheme is it simplifies bookkeeping:
This decoupling is totally general, and should make it easy to implement new models.
Disadvantages:
cc @molpopgen
The text was updated successfully, but these errors were encountered: