Finite sites branchwise #946

petrelharp · 2020-04-06T06:06:55Z

I'm having a go at #553. The two methods sketched out there for doing this are roughly:

build the trees and step along site-by-site
throw down mutations and go back and deal with state later (including rejection sampling)

This PR is the latter. It doesn't work yet. Without fully explaining (yet), the idea is that:

the previous mutgen code can almost lay down multiple mutations at the same site...
however, it had no facilities for actually keeping track of those mutation; so I had to add that.
edges come in order by time-ago
compute_mutation_parents wants mutations to come in order by time (ie, reverse of the order that they are produced by mutgen)

So, once I've done this right, I should be able to

generate mutations
compute mutation parents
figure out state (and, there's python code in this PR to do this bit)

This is not yet right, and I've probably done something terribly wrong C-wise, but maybe you'll get the idea. Feel free to ignore for the moment (but suggestions appreciated).

jeromekelleher

Woohoo, this is great, thanks @petrelharp! I'm not fully following how you're doing this. One thing that might be useful is to update the mutate_map function in test_mutations.py to give us some confidence in the C implementation and hopefully also clarify what it's doing as a reference implementation.

The Python level post-processing is less clear to me, but I'll try and figure it out. It's not obvious to me how we can compute the states in general without building the trees.

lib/mutgen.c

msprime/mutations.py

petrelharp · 2020-04-06T14:12:51Z

It's not obvious to me how we can compute the states in general without building the trees.

We do build the trees - this happens using compute_mutation_parents - but given the parent attribute, everything is O(1).

But - I'll write out what's going on.

petrelharp · 2020-04-06T14:23:17Z

There's a complication: suppose that an edge goes from time=0 to time=100, and it has a mutation on it, and we want to add mutations with finite sites with start_time=20 and end_time=40. We have no way of knowing where the existing mutation was on that branch. And yet, we definately want to be able to do this, because it totally makes sense to add mutations to the tree in subsequent time slices with finite sites, as long as those slices are moving forwards in time (e.g., if in the example above the existing mutation was added with start_time=40 and end_time=100).

This would be fixed if the mutation table had a time column. But, here's another solution.

This all is OK as long as mutations are added in order by time (ie longest-ago mutations first). We can specify in the documentation that if you're adding mutations to separate time chunks with finite sites then you have to do it that way. And in the code for mutate( ), we can (a) assume that existing mutations happen at the top of whatever branch they are on (ie. at the longest-ago possible time), and (b) throw an error if finite=True and there are any mutations whose thus-assigned time would be more recent than end_time.

It'll be a bit annoying to actually do that check, since finding the parent node for the edge that each mutation is on is not an operation we already have. But we can do it on a single pass through the tables, without building the trees.

How's that sound?

jeromekelleher · 2020-04-06T16:23:20Z

Sounds like a pragmatic solution to me @petrelharp, I like it. But, should we add a time column to mutations now(ish)? We're going through a format update in tskit at the moment anyway and this is a persistent issue. It should be possible to do in a backwards compatible way (if the column isn't specified, it defaults to the time of the mutation's node. Multiple mutations on a branch have the same time, but oh well)?

petrelharp · 2020-04-06T17:06:17Z

should we add a time column to mutations now(ish)

My sense is "probably yes"? It will definately make our lives easier in the long run. It's a pretty big change, though. I'm also happy with this alternative proposal.

A question if we do add it is whether to keep the parent column, since IIRC we added parent because we didn't want to add time. The parent column is really useful to have around (it lets things be O(1)), so I kinda want to keep it, but it is overhead.

jeromekelleher · 2020-04-06T17:12:28Z

A question if we do add it is whether to keep the parent column, since IIRC we added parent because we didn't want to add time. The parent column is really useful to have around (it lets things be O(1)), so I kinda want to keep it, but it is overhead.

We should totally keep the parent column. The overhead is minimal RAM wise, and we can make it optional for the file format (e.g., it's redundant if there's < 2 mutations at any site). I'll move this over to tskit and expore there. Let's assume we will have the time column for the purpose of this PR - I guess we just error out for the moment if we hit the situation you're talking about?

petrelharp · 2020-04-06T17:50:47Z

Ok, here's the description in words of how this would work:

What this relies on is the following method for simulating from a continuous-time Markov chain model of mutation: suppose we have a rate matrix, G, so that when you are in state i you mutate to state j at rate G[i,j], and let's say G[i,i] = 0. The total mutation rate for state i is sum(G[i,:]); let lambda be the maximum total mutation rate: mu = max([sum(G[i,:]) for i in range(shape(G)[0])]). Let P[i,j] = G[i,j] / mu and P[i,i] = 1 - sum(G[i,:]) / mu; note that rows of P sum to 1. Now, to simulate from this mutation process, we start at some state; and at the times of a Poisson(mu) process look at the current state and pick a new state with probabilties given in the corresponding row of P.

I'm proposing parameterizing this at the user interface with mu (i.e., "mutation rate"), P (which I'm calling the "mutation matrix"), and the distribution of states at the root (which I'm calling the "alphabet distribution").

To implement this, we'll:

Throw down possible mutation locations on the edges, using mostly the existing algorithm, at rate mu.
Build and traverse the trees to find out the dependencies for each possible mutation, and pick ancestral states (using the alphabet distribution) and derived states (with the mutation matrix, P). Some mutations and even some sites will be discarded because they are no-ops.
Populate the tables with what's left.

It might be cleaner to throw down mutations while stepping through the trees, as described here? But, I started doing it this way because it seemed easier - closer to the existing code. And, this way is more modular?

To do (1), and to keep track of which sites/mutations are new and which are "kept", we keep a list of sites, and each site has a list of mutations, all kept in the same order they'll be in when we write them out to the tables.

petrelharp · 2020-04-07T05:36:31Z

I've updated the code so that maybe it does finite sites actually at integer sites (untested!), and got it to pass all tests except some provenance tests that I totally don't understand. So, I think that my changes have not broken previously working things (but could you give me some pointers on the provenance, @benjeffery?), but doesn't yet actually do the new things.

A separate thing: I decided that start_time and end_time should really be arguments to _msprime.MutationGenerator.generate, not _msprime.MutationGenerator( ). Happy to change this back, but this made more sense to me.

jeromekelleher

Looks good, thanks for the explanation @petrelharp. I think it would be better to nail down the model in Python first before doing any more C coding. There's a lot of boiler plate to peer through otherwise.

lib/mutgen.c

msprime/mutations.py

benjeffery · 2020-04-07T14:18:23Z

I've made a commit over at benjeffery@dad4d07
that fixes up the provenance by refactoring a bit.

It seems a shame there is no longer a way to get uniform rate with a different alphabet without manually specifying the rate - for example https://github.com/tskit-dev/tskit/blob/master/python/tests/test_highlevel.py#L215 but it's no big deal.

petrelharp · 2020-04-08T05:32:10Z

Thanks for the comments and PRs - I'll get to them!

But in the meantime, a design choice. How's this sound: suppose we add mutations with keep=True, discrete_sites=True. Then we want to be able to add new mutations coming after the previous ones; and it is an error if we end up trying to add a new mutation before an existing one (since you need to know the previous state to properly pick a new state). This also means that the existing ancestral states and derived states have to match the alphabet that's being used (e.g., BINARY or NUCLEOTIDES). I suppose we could disable this check for discrete_sites = False, but I'm inclined not to.

jeromekelleher

The Python implementation is really helpful, thanks @petrelharp! I've left a bunch of comments. Basically I think we need to split things up a bit more to simplify the code a bit.

The more I think about the rates the less convinced I am that we're doing it properly here too. I would have thought that the right thing to do would be to generate a poisson number of mutations proportional to the area of each edge (without discretising), and then throwing that number of "darts" uniformly along the extend of the edge. We then floor the position they fall at to decide which integer bin they fall in to. This works fine for very short edges with span << 1, whereas the current approach doesn't I think.

tests/test_mutations.py

petrelharp · 2020-04-08T14:22:55Z

Basically I think we need to split things up a bit more to simplify the code a bit.

Yes, sorry - I'm totally not done (should have said) - but thanks for the comments!

I would have thought that the right thing to do would be to generate a poisson number of mutations proportional to the area of each edge (without discretising), and then throwing that number of "darts" uniformly along the extend of the edge. We then floor the position they fall at to decide which integer bin they fall in to. This works fine for very short edges with span << 1, whereas the current approach doesn't I think.

So, the model I'm using here with the floor/ceiling is that mutations only occur at integer locations, and so the mutational "mass" of a span is equal to the number of integers falling in it. Since our intervals are left-closed, right-open, the integers in [left, right) are ceil(left), ceil(left) + 1, ..., ceil(right) - 1. This explains the ceil( )s.

And, I must be confused about your model, but I don't see how we can throw down positions as floats and then round them - they might end up outside of the current span (even outside of the current mutation map rate interval!). It's true that the current approach won't put any mutations in edges that don't cover an integer, but I think that's OK - really, we want this mutation model to be used in conjunction with discrete recombination as well, in wihch case all edges will contain at least one integer. Or, am I missing something?

petrelharp · 2020-04-09T03:47:55Z

OK, I think the python version is in reasonably good shape. I've edited my description above to reflect the current state of things (traversing the tree; no longer using compute_mutation_parents). I will work on writing more tests and getting this into C next. The design decisions above - like exactly how discrete_sites works - can be easily changed.

But: something I'm unclear on is the right way to do the allele <-> index conversion in C; e.g., here. What's the right way to this?

jeromekelleher · 2020-04-09T08:48:13Z

So, the model I'm using here with the floor/ceiling is that mutations only occur at integer locations, and so the mutational "mass" of a span is equal to the number of integers falling in it. Since our intervals are left-closed, right-open, the integers in [left, right) are ceil(left), ceil(left) + 1, ..., ceil(right) - 1. This explains the ceil( )s.

I see, that's a lot clearer, thanks. A quick comment would help here, as it's reasonably subtle. I still wonder if we should just error-out if a user tries to mix integer and floating point topologies and site positions, but maybe that's too stringent.

jeromekelleher

Looks good, thanks @petrelharp. Sorry for all the nitpicking here, but I really want to understand this and to make sure it's as simple as it possibly can be!

tests/test_mutations.py

petrelharp · 2020-04-09T13:59:46Z

I still wonder if we should just error-out if a user tries to mix integer and floating point topologies and site positions, but maybe that's too stringent.

I wondered this too, but I think the model is conceptually clear - the mutation mapp has unit mass at each integer site instead of spread uniformly - so I think it's OK to just strongly recommend in the docs that people use discrete recomb along with discrete mutation. This will be helped along if we just have a single "discrete sites" parameter to simulate() as we discussed elsewhere.

petrelharp · 2020-04-09T19:36:42Z

Ok - I think this refactor is close to as simple as it's going to be. Separating out building the mutation tree did not result in simpler code, as remarked above. Could you have a look, @jeromekelleher?

Besides "how's this look", a question I have about moving on to C is:

What is the right way to find the index of an allele in self.alleles in C, as I'm doing e.g. here?

petrelharp · 2020-04-13T13:54:45Z

Hi, @jeromekelleher: the current status here is that:

I could be happy with the algorithm and how it's factored out as implemented in python in test_mutations.py; it is passing a bunch of tests I have put in there (although some statistical verification will be necessary at some point). Input (and even refactoring suggestions) appreciated (but no hurry).
I've translated the algorithm over into mutgen.c - no doubt with remaining errors. The major missing piece I'm aware of is getting the mutation model into mutgen, which I'm working on now.

I would like input on this: it seems like the right thing to do is to define a mutation model at the C level, and a python interface and everything. I think this is what was intended when model= was added to mutate( ). Does this sound right?

jeromekelleher · 2020-04-13T16:25:45Z

Thanks @petrelharp - I'll have a look tomorrow morning. (Holiday here today)

petrelharp · 2020-04-13T17:08:53Z

Holiday here today

No hurry! Happy spring!! 🥚 🌷 🌳

petrelharp · 2020-04-13T18:54:44Z

Update: the C code is now passing the C tests!

petrelharp · 2020-04-14T00:06:33Z

So - spelling things out a bit more: it's seeming to me that we should define in C a mutation_model_t (and have done so), and make this visible at the python level, and have mutate( ) take a model= argument, etcetera. This doesn't seem too hard but does involve a change to the API, which is currently passing in the alphabet as just a flag, and is calling a "mutation map" what I think was supposed to be the same thing as the "mutation model".

petrelharp · 2020-04-14T05:28:24Z

It now builds and passes many of the tests. However, I've made some memory management error somewhere: I keep getting strange errors that move when I insert printf statements.

In getting it to pass the tests, I've:

made it so existing mutations retain the ordering they came in with
throw an error when an ancestral or derived state is not in the alphabet only if we're actually trying to mutate it

I've not gotten to implementing the provenance fix that @benjeffery kindly provided above.

petrelharp · 2020-04-22T18:12:56Z

Did a little squash'n'rebase here to get the coverage diff simpler.

msprime/provenance.py

_msprimemodule.c

lib/mutgen.c

jeromekelleher · 2020-04-23T10:02:16Z

I've gone through this again @petrelharp, and I think we should merge ASAP and follow up with some issues to track what needs to be done afterwards. I've run some of the statistical checks, and it seems nothing has been broken in existing code by what we're merging.

There's only one thing I don't like and don't want to get forgotten about: how does the second MSP_ERR_MUTATION_GENERATION_OUT_OF_ORDER error happen? I think we need to either craft a C test case to provoke it or prove to ourselves it can't happen and stick in an assert.

petrelharp · 2020-04-23T15:11:55Z

I think we should merge ASAP

Sounds good. I've changed the second out of order error, which should have been an assert. I want to do a quick statistical test of the number of mutations; then I'll squash & rebase.

Did you notice the statistical tests? We didn't have any such tests in before, i don't think, so at first I was going to put them in verification.py, but I think this is the right place for them. We can adjust the p-value threshold to make false positives less likely.

jeromekelleher · 2020-04-23T16:50:03Z

Did you notice the statistical tests? We didn't have any such tests in before, i don't think, so at first I was going to put them in verification.py, but I think this is the right place for them. We can adjust the p-value threshold to make false positives less likely.

Spotted them, but didn't think too hard about what they are doing. Once they're quick to run and don't result in random CI failures I'm all for it.

jeromekelleher · 2020-04-23T16:52:02Z

Sounds good. I've changed the second out of order error, which should have been an assert. I want to do a quick statistical test of the number of mutations; then I'll squash & rebase.

Great! Please merge away whenever you're happy.

petrelharp · 2020-04-23T17:22:24Z

Ok - I wrote those mutation rate tests, and am pretty happy with them. We're just testing if each tree looks like it has the right Poisson number of mutations on it - we could look by branch - but this seems fine. We're not doing a statistical test with a non-uniform mutation map, but again, that seems OK. Perhaps this should be merged? (oh gee, once I fix these errors??)

jeromekelleher

Minor comment about the stats tests, since you're fixing problems anyway

tests/test_mutations.py

petrelharp · 2020-04-23T17:34:34Z

There, how's that look?

petrelharp · 2020-04-23T17:45:41Z

I don't know why codecov is unhappy - seems like we have improved coverage? But maybe I don't know how to parse this right. (oh, but: we are maybe not testing the MutationModel.asdict method, but it seemed like we should test that elsewhere?)

petrelharp · 2020-04-23T17:56:03Z

Added a test for asdict..

mutations; and more.

petrelharp · 2020-04-23T18:18:28Z

I think we can merge, but am bothered by codecov. Could you take a look and hit merge if it's good, @jeromekelleher?

benjeffery · 2020-04-23T18:23:15Z

Codecov looks good to me - only OOMs and hard to do errors that aren't covered.

petrelharp · 2020-04-23T18:41:20Z

thanks, @benjeffery! merging 🎉!!!

jeromekelleher · 2020-04-23T18:43:47Z

Great, thanks @petrelharp! Hooray!!!!

jeromekelleher reviewed Apr 6, 2020

View reviewed changes

jeromekelleher mentioned this pull request Apr 6, 2020

Add MutationTable.time column tskit-dev/tskit#513

Closed

jeromekelleher reviewed Apr 7, 2020

View reviewed changes

lib/mutgen.c Outdated Show resolved Hide resolved

lib/mutgen.c Outdated Show resolved Hide resolved

lib/mutgen.c Show resolved Hide resolved

lib/mutgen.c Show resolved Hide resolved

msprime/mutations.py Outdated Show resolved Hide resolved

benjeffery reviewed Apr 7, 2020

View reviewed changes

msprime/mutations.py Show resolved Hide resolved

jeromekelleher reviewed Apr 8, 2020

View reviewed changes

tests/test_mutations.py Outdated Show resolved Hide resolved

jeromekelleher reviewed Apr 8, 2020

View reviewed changes

tests/test_mutations.py Show resolved Hide resolved

petrelharp force-pushed the finite_sites_branchwise branch from c81b332 to 8fb70e9 Compare April 9, 2020 04:31

jeromekelleher reviewed Apr 9, 2020

View reviewed changes

jeromekelleher mentioned this pull request Apr 14, 2020

Docs for finite-sites features #950

Closed

jeromekelleher reviewed Apr 22, 2020

View reviewed changes

msprime/provenance.py Outdated Show resolved Hide resolved

petrelharp mentioned this pull request Apr 22, 2020

document provenance requirements for new/existing code #968

Open

petrelharp force-pushed the finite_sites_branchwise branch from f3149c2 to e5a7fab Compare April 22, 2020 21:41

jeromekelleher reviewed Apr 23, 2020

View reviewed changes

_msprimemodule.c Outdated Show resolved Hide resolved

jeromekelleher reviewed Apr 23, 2020

View reviewed changes

lib/mutgen.c Show resolved Hide resolved

jeromekelleher force-pushed the finite_sites_branchwise branch from a81bcd3 to 52288dd Compare April 23, 2020 09:58

petrelharp force-pushed the finite_sites_branchwise branch from 9f9d8a9 to ab48e21 Compare April 23, 2020 17:20

jeromekelleher reviewed Apr 23, 2020

View reviewed changes

tests/test_mutations.py Outdated Show resolved Hide resolved

petrelharp force-pushed the finite_sites_branchwise branch from ab48e21 to 001188c Compare April 23, 2020 17:27

Finite sites mutations: added MutationModel; statistical tests for

550d689

mutations; and more.

petrelharp force-pushed the finite_sites_branchwise branch from fa37ffa to 550d689 Compare April 23, 2020 18:09

petrelharp merged commit 8fcdee0 into tskit-dev:master Apr 23, 2020

This was referenced Apr 23, 2020

API for 'finite_sites'? #880

Closed

finite-sites recapitation and mutation tskit-dev/pyslim#77

Closed

More general mutation models #553

Closed

Update mutation code once mutations have time #977

Closed

petrelharp deleted the finite_sites_branchwise branch April 30, 2020 14:33

Finite sites branchwise #946

Finite sites branchwise #946

Conversation

petrelharp commented Apr 6, 2020

jeromekelleher left a comment

Choose a reason for hiding this comment

petrelharp commented Apr 6, 2020

petrelharp commented Apr 6, 2020

jeromekelleher commented Apr 6, 2020 • edited

petrelharp commented Apr 6, 2020

jeromekelleher commented Apr 6, 2020

petrelharp commented Apr 6, 2020 • edited

petrelharp commented Apr 7, 2020

jeromekelleher left a comment

Choose a reason for hiding this comment

benjeffery commented Apr 7, 2020

petrelharp commented Apr 8, 2020

jeromekelleher left a comment

Choose a reason for hiding this comment

petrelharp commented Apr 8, 2020

petrelharp commented Apr 9, 2020

jeromekelleher commented Apr 9, 2020

jeromekelleher left a comment

Choose a reason for hiding this comment

petrelharp commented Apr 9, 2020

petrelharp commented Apr 9, 2020 • edited

petrelharp commented Apr 13, 2020

jeromekelleher commented Apr 13, 2020

petrelharp commented Apr 13, 2020

petrelharp commented Apr 13, 2020

petrelharp commented Apr 14, 2020

petrelharp commented Apr 14, 2020 • edited

petrelharp commented Apr 22, 2020

jeromekelleher commented Apr 23, 2020

petrelharp commented Apr 23, 2020

jeromekelleher commented Apr 23, 2020

jeromekelleher commented Apr 23, 2020

petrelharp commented Apr 23, 2020

jeromekelleher left a comment

Choose a reason for hiding this comment

petrelharp commented Apr 23, 2020

petrelharp commented Apr 23, 2020 • edited

petrelharp commented Apr 23, 2020

petrelharp commented Apr 23, 2020

benjeffery commented Apr 23, 2020

petrelharp commented Apr 23, 2020

jeromekelleher commented Apr 23, 2020

jeromekelleher commented Apr 6, 2020 •

edited

petrelharp commented Apr 6, 2020 •

edited

petrelharp commented Apr 9, 2020 •

edited

petrelharp commented Apr 14, 2020 •

edited

petrelharp commented Apr 23, 2020 •

edited