Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finite sites branchwise #946

Merged
merged 1 commit into from Apr 23, 2020

Conversation

petrelharp
Copy link
Contributor

I'm having a go at #553. The two methods sketched out there for doing this are roughly:

  • build the trees and step along site-by-site
  • throw down mutations and go back and deal with state later (including rejection sampling)

This PR is the latter. It doesn't work yet. Without fully explaining (yet), the idea is that:

  • the previous mutgen code can almost lay down multiple mutations at the same site...
  • however, it had no facilities for actually keeping track of those mutation; so I had to add that.
  • edges come in order by time-ago
  • compute_mutation_parents wants mutations to come in order by time (ie, reverse of the order that they are produced by mutgen)

So, once I've done this right, I should be able to

  • generate mutations
  • compute mutation parents
  • figure out state (and, there's python code in this PR to do this bit)

This is not yet right, and I've probably done something terribly wrong C-wise, but maybe you'll get the idea. Feel free to ignore for the moment (but suggestions appreciated).

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woohoo, this is great, thanks @petrelharp! I'm not fully following how you're doing this. One thing that might be useful is to update the mutate_map function in test_mutations.py to give us some confidence in the C implementation and hopefully also clarify what it's doing as a reference implementation.

The Python level post-processing is less clear to me, but I'll try and figure it out. It's not obvious to me how we can compute the states in general without building the trees.

lib/mutgen.c Outdated Show resolved Hide resolved
lib/mutgen.c Outdated Show resolved Hide resolved
lib/mutgen.c Outdated Show resolved Hide resolved
lib/mutgen.c Outdated Show resolved Hide resolved
lib/mutgen.c Outdated Show resolved Hide resolved
lib/mutgen.c Outdated Show resolved Hide resolved
lib/mutgen.c Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

It's not obvious to me how we can compute the states in general without building the trees.

We do build the trees - this happens using compute_mutation_parents - but given the parent attribute, everything is O(1).

But - I'll write out what's going on.

@petrelharp
Copy link
Contributor Author

There's a complication: suppose that an edge goes from time=0 to time=100, and it has a mutation on it, and we want to add mutations with finite sites with start_time=20 and end_time=40. We have no way of knowing where the existing mutation was on that branch. And yet, we definately want to be able to do this, because it totally makes sense to add mutations to the tree in subsequent time slices with finite sites, as long as those slices are moving forwards in time (e.g., if in the example above the existing mutation was added with start_time=40 and end_time=100).

This would be fixed if the mutation table had a time column. But, here's another solution.

This all is OK as long as mutations are added in order by time (ie longest-ago mutations first). We can specify in the documentation that if you're adding mutations to separate time chunks with finite sites then you have to do it that way. And in the code for mutate( ), we can (a) assume that existing mutations happen at the top of whatever branch they are on (ie. at the longest-ago possible time), and (b) throw an error if finite=True and there are any mutations whose thus-assigned time would be more recent than end_time.

It'll be a bit annoying to actually do that check, since finding the parent node for the edge that each mutation is on is not an operation we already have. But we can do it on a single pass through the tables, without building the trees.

How's that sound?

@jeromekelleher
Copy link
Member

jeromekelleher commented Apr 6, 2020

Sounds like a pragmatic solution to me @petrelharp, I like it. But, should we add a time column to mutations now(ish)? We're going through a format update in tskit at the moment anyway and this is a persistent issue. It should be possible to do in a backwards compatible way (if the column isn't specified, it defaults to the time of the mutation's node. Multiple mutations on a branch have the same time, but oh well)?

@petrelharp
Copy link
Contributor Author

should we add a time column to mutations now(ish)

My sense is "probably yes"? It will definately make our lives easier in the long run. It's a pretty big change, though. I'm also happy with this alternative proposal.

A question if we do add it is whether to keep the parent column, since IIRC we added parent because we didn't want to add time. The parent column is really useful to have around (it lets things be O(1)), so I kinda want to keep it, but it is overhead.

@jeromekelleher
Copy link
Member

A question if we do add it is whether to keep the parent column, since IIRC we added parent because we didn't want to add time. The parent column is really useful to have around (it lets things be O(1)), so I kinda want to keep it, but it is overhead.

We should totally keep the parent column. The overhead is minimal RAM wise, and we can make it optional for the file format (e.g., it's redundant if there's < 2 mutations at any site). I'll move this over to tskit and expore there. Let's assume we will have the time column for the purpose of this PR - I guess we just error out for the moment if we hit the situation you're talking about?

@petrelharp
Copy link
Contributor Author

petrelharp commented Apr 6, 2020

Ok, here's the description in words of how this would work:

What this relies on is the following method for simulating from a continuous-time Markov chain model of mutation: suppose we have a rate matrix, G, so that when you are in state i you mutate to state j at rate G[i,j], and let's say G[i,i] = 0. The total mutation rate for state i is sum(G[i,:]); let lambda be the maximum total mutation rate: mu = max([sum(G[i,:]) for i in range(shape(G)[0])]). Let P[i,j] = G[i,j] / mu and P[i,i] = 1 - sum(G[i,:]) / mu; note that rows of P sum to 1. Now, to simulate from this mutation process, we start at some state; and at the times of a Poisson(mu) process look at the current state and pick a new state with probabilties given in the corresponding row of P.

I'm proposing parameterizing this at the user interface with mu (i.e., "mutation rate"), P (which I'm calling the "mutation matrix"), and the distribution of states at the root (which I'm calling the "alphabet distribution").

To implement this, we'll:

  1. Throw down possible mutation locations on the edges, using mostly the existing algorithm, at rate mu.
  2. Build and traverse the trees to find out the dependencies for each possible mutation, and pick ancestral states (using the alphabet distribution) and derived states (with the mutation matrix, P). Some mutations and even some sites will be discarded because they are no-ops.
  3. Populate the tables with what's left.

It might be cleaner to throw down mutations while stepping through the trees, as described here? But, I started doing it this way because it seemed easier - closer to the existing code. And, this way is more modular?

To do (1), and to keep track of which sites/mutations are new and which are "kept", we keep a list of sites, and each site has a list of mutations, all kept in the same order they'll be in when we write them out to the tables.

@petrelharp
Copy link
Contributor Author

I've updated the code so that maybe it does finite sites actually at integer sites (untested!), and got it to pass all tests except some provenance tests that I totally don't understand. So, I think that my changes have not broken previously working things (but could you give me some pointers on the provenance, @benjeffery?), but doesn't yet actually do the new things.

A separate thing: I decided that start_time and end_time should really be arguments to _msprime.MutationGenerator.generate, not _msprime.MutationGenerator( ). Happy to change this back, but this made more sense to me.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for the explanation @petrelharp. I think it would be better to nail down the model in Python first before doing any more C coding. There's a lot of boiler plate to peer through otherwise.

lib/mutgen.c Outdated Show resolved Hide resolved
lib/mutgen.c Outdated Show resolved Hide resolved
lib/mutgen.c Show resolved Hide resolved
lib/mutgen.c Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
@benjeffery
Copy link
Member

I've made a commit over at benjeffery@dad4d07
that fixes up the provenance by refactoring a bit.

It seems a shame there is no longer a way to get uniform rate with a different alphabet without manually specifying the rate - for example https://github.com/tskit-dev/tskit/blob/master/python/tests/test_highlevel.py#L215 but it's no big deal.

@petrelharp
Copy link
Contributor Author

Thanks for the comments and PRs - I'll get to them!

But in the meantime, a design choice. How's this sound: suppose we add mutations with keep=True, discrete_sites=True. Then we want to be able to add new mutations coming after the previous ones; and it is an error if we end up trying to add a new mutation before an existing one (since you need to know the previous state to properly pick a new state). This also means that the existing ancestral states and derived states have to match the alphabet that's being used (e.g., BINARY or NUCLEOTIDES). I suppose we could disable this check for discrete_sites = False, but I'm inclined not to.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python implementation is really helpful, thanks @petrelharp! I've left a bunch of comments. Basically I think we need to split things up a bit more to simplify the code a bit.

The more I think about the rates the less convinced I am that we're doing it properly here too. I would have thought that the right thing to do would be to generate a poisson number of mutations proportional to the area of each edge (without discretising), and then throwing that number of "darts" uniformly along the extend of the edge. We then floor the position they fall at to decide which integer bin they fall in to. This works fine for very short edges with span << 1, whereas the current approach doesn't I think.

tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

Basically I think we need to split things up a bit more to simplify the code a bit.

Yes, sorry - I'm totally not done (should have said) - but thanks for the comments!

I would have thought that the right thing to do would be to generate a poisson number of mutations proportional to the area of each edge (without discretising), and then throwing that number of "darts" uniformly along the extend of the edge. We then floor the position they fall at to decide which integer bin they fall in to. This works fine for very short edges with span << 1, whereas the current approach doesn't I think.

So, the model I'm using here with the floor/ceiling is that mutations only occur at integer locations, and so the mutational "mass" of a span is equal to the number of integers falling in it. Since our intervals are left-closed, right-open, the integers in [left, right) are ceil(left), ceil(left) + 1, ..., ceil(right) - 1. This explains the ceil( )s.

And, I must be confused about your model, but I don't see how we can throw down positions as floats and then round them - they might end up outside of the current span (even outside of the current mutation map rate interval!). It's true that the current approach won't put any mutations in edges that don't cover an integer, but I think that's OK - really, we want this mutation model to be used in conjunction with discrete recombination as well, in wihch case all edges will contain at least one integer. Or, am I missing something?

@petrelharp
Copy link
Contributor Author

OK, I think the python version is in reasonably good shape. I've edited my description above to reflect the current state of things (traversing the tree; no longer using compute_mutation_parents). I will work on writing more tests and getting this into C next. The design decisions above - like exactly how discrete_sites works - can be easily changed.

But: something I'm unclear on is the right way to do the allele <-> index conversion in C; e.g., here. What's the right way to this?

@jeromekelleher
Copy link
Member

So, the model I'm using here with the floor/ceiling is that mutations only occur at integer locations, and so the mutational "mass" of a span is equal to the number of integers falling in it. Since our intervals are left-closed, right-open, the integers in [left, right) are ceil(left), ceil(left) + 1, ..., ceil(right) - 1. This explains the ceil( )s.

I see, that's a lot clearer, thanks. A quick comment would help here, as it's reasonably subtle. I still wonder if we should just error-out if a user tries to mix integer and floating point topologies and site positions, but maybe that's too stringent.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @petrelharp. Sorry for all the nitpicking here, but I really want to understand this and to make sure it's as simple as it possibly can be!

tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

I still wonder if we should just error-out if a user tries to mix integer and floating point topologies and site positions, but maybe that's too stringent.

I wondered this too, but I think the model is conceptually clear - the mutation mapp has unit mass at each integer site instead of spread uniformly - so I think it's OK to just strongly recommend in the docs that people use discrete recomb along with discrete mutation. This will be helped along if we just have a single "discrete sites" parameter to simulate() as we discussed elsewhere.

@petrelharp
Copy link
Contributor Author

petrelharp commented Apr 9, 2020

Ok - I think this refactor is close to as simple as it's going to be. Separating out building the mutation tree did not result in simpler code, as remarked above. Could you have a look, @jeromekelleher?

Besides "how's this look", a question I have about moving on to C is:

  • What is the right way to find the index of an allele in self.alleles in C, as I'm doing e.g. here?

@petrelharp
Copy link
Contributor Author

Hi, @jeromekelleher: the current status here is that:

  • I could be happy with the algorithm and how it's factored out as implemented in python in test_mutations.py; it is passing a bunch of tests I have put in there (although some statistical verification will be necessary at some point). Input (and even refactoring suggestions) appreciated (but no hurry).

  • I've translated the algorithm over into mutgen.c - no doubt with remaining errors. The major missing piece I'm aware of is getting the mutation model into mutgen, which I'm working on now.

I would like input on this: it seems like the right thing to do is to define a mutation model at the C level, and a python interface and everything. I think this is what was intended when model= was added to mutate( ). Does this sound right?

@jeromekelleher
Copy link
Member

Thanks @petrelharp - I'll have a look tomorrow morning. (Holiday here today)

@petrelharp
Copy link
Contributor Author

Holiday here today

No hurry! Happy spring!! 🥚 🌷 🌳

@petrelharp
Copy link
Contributor Author

Update: the C code is now passing the C tests!

@petrelharp
Copy link
Contributor Author

So - spelling things out a bit more: it's seeming to me that we should define in C a mutation_model_t (and have done so), and make this visible at the python level, and have mutate( ) take a model= argument, etcetera. This doesn't seem too hard but does involve a change to the API, which is currently passing in the alphabet as just a flag, and is calling a "mutation map" what I think was supposed to be the same thing as the "mutation model".

@petrelharp
Copy link
Contributor Author

petrelharp commented Apr 14, 2020

It now builds and passes many of the tests. However, I've made some memory management error somewhere: I keep getting strange errors that move when I insert printf statements.

In getting it to pass the tests, I've:

  • made it so existing mutations retain the ordering they came in with
  • throw an error when an ancestral or derived state is not in the alphabet only if we're actually trying to mutate it

I've not gotten to implementing the provenance fix that @benjeffery kindly provided above.

@petrelharp
Copy link
Contributor Author

Did a little squash'n'rebase here to get the coverage diff simpler.

_msprimemodule.c Outdated Show resolved Hide resolved
@jeromekelleher
Copy link
Member

I've gone through this again @petrelharp, and I think we should merge ASAP and follow up with some issues to track what needs to be done afterwards. I've run some of the statistical checks, and it seems nothing has been broken in existing code by what we're merging.

There's only one thing I don't like and don't want to get forgotten about: how does the second MSP_ERR_MUTATION_GENERATION_OUT_OF_ORDER error happen? I think we need to either craft a C test case to provoke it or prove to ourselves it can't happen and stick in an assert.

@petrelharp
Copy link
Contributor Author

I think we should merge ASAP

Sounds good. I've changed the second out of order error, which should have been an assert. I want to do a quick statistical test of the number of mutations; then I'll squash & rebase.

Did you notice the statistical tests? We didn't have any such tests in before, i don't think, so at first I was going to put them in verification.py, but I think this is the right place for them. We can adjust the p-value threshold to make false positives less likely.

@jeromekelleher
Copy link
Member

Did you notice the statistical tests? We didn't have any such tests in before, i don't think, so at first I was going to put them in verification.py, but I think this is the right place for them. We can adjust the p-value threshold to make false positives less likely.

Spotted them, but didn't think too hard about what they are doing. Once they're quick to run and don't result in random CI failures I'm all for it.

@jeromekelleher
Copy link
Member

Sounds good. I've changed the second out of order error, which should have been an assert. I want to do a quick statistical test of the number of mutations; then I'll squash & rebase.

Great! Please merge away whenever you're happy.

@petrelharp
Copy link
Contributor Author

Ok - I wrote those mutation rate tests, and am pretty happy with them. We're just testing if each tree looks like it has the right Poisson number of mutations on it - we could look by branch - but this seems fine. We're not doing a statistical test with a non-uniform mutation map, but again, that seems OK. Perhaps this should be merged? (oh gee, once I fix these errors??)

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment about the stats tests, since you're fixing problems anyway

tests/test_mutations.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

There, how's that look?

@petrelharp
Copy link
Contributor Author

petrelharp commented Apr 23, 2020

I don't know why codecov is unhappy - seems like we have improved coverage? But maybe I don't know how to parse this right. (oh, but: we are maybe not testing the MutationModel.asdict method, but it seemed like we should test that elsewhere?)

@petrelharp
Copy link
Contributor Author

Added a test for asdict..

@petrelharp
Copy link
Contributor Author

I think we can merge, but am bothered by codecov. Could you take a look and hit merge if it's good, @jeromekelleher?

@benjeffery
Copy link
Member

Codecov looks good to me - only OOMs and hard to do errors that aren't covered.

@petrelharp
Copy link
Contributor Author

thanks, @benjeffery! merging 🎉!!!

@petrelharp petrelharp merged commit 8fcdee0 into tskit-dev:master Apr 23, 2020
@jeromekelleher
Copy link
Member

Great, thanks @petrelharp! Hooray!!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants