New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finite sites branchwise #946
Finite sites branchwise #946
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woohoo, this is great, thanks @petrelharp! I'm not fully following how you're doing this. One thing that might be useful is to update the mutate_map
function in test_mutations.py
to give us some confidence in the C implementation and hopefully also clarify what it's doing as a reference implementation.
The Python level post-processing is less clear to me, but I'll try and figure it out. It's not obvious to me how we can compute the states in general without building the trees.
We do build the trees - this happens using But - I'll write out what's going on. |
There's a complication: suppose that an edge goes from This would be fixed if the mutation table had a This all is OK as long as mutations are added in order by time (ie longest-ago mutations first). We can specify in the documentation that if you're adding mutations to separate time chunks with finite sites then you have to do it that way. And in the code for It'll be a bit annoying to actually do that check, since finding the parent node for the edge that each mutation is on is not an operation we already have. But we can do it on a single pass through the tables, without building the trees. How's that sound? |
Sounds like a pragmatic solution to me @petrelharp, I like it. But, should we add a |
My sense is "probably yes"? It will definately make our lives easier in the long run. It's a pretty big change, though. I'm also happy with this alternative proposal. A question if we do add it is whether to keep the |
We should totally keep the parent column. The overhead is minimal RAM wise, and we can make it optional for the file format (e.g., it's redundant if there's < 2 mutations at any site). I'll move this over to tskit and expore there. Let's assume we will have the time column for the purpose of this PR - I guess we just error out for the moment if we hit the situation you're talking about? |
Ok, here's the description in words of how this would work: What this relies on is the following method for simulating from a continuous-time Markov chain model of mutation: suppose we have a rate matrix, I'm proposing parameterizing this at the user interface with To implement this, we'll:
It might be cleaner to throw down mutations while stepping through the trees, as described here? But, I started doing it this way because it seemed easier - closer to the existing code. And, this way is more modular? To do (1), and to keep track of which sites/mutations are new and which are "kept", we keep a list of sites, and each site has a list of mutations, all kept in the same order they'll be in when we write them out to the tables. |
I've updated the code so that maybe it does finite sites actually at integer sites (untested!), and got it to pass all tests except some provenance tests that I totally don't understand. So, I think that my changes have not broken previously working things (but could you give me some pointers on the provenance, @benjeffery?), but doesn't yet actually do the new things. A separate thing: I decided that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for the explanation @petrelharp. I think it would be better to nail down the model in Python first before doing any more C coding. There's a lot of boiler plate to peer through otherwise.
I've made a commit over at benjeffery@dad4d07 It seems a shame there is no longer a way to get uniform rate with a different alphabet without manually specifying the rate - for example https://github.com/tskit-dev/tskit/blob/master/python/tests/test_highlevel.py#L215 but it's no big deal. |
Thanks for the comments and PRs - I'll get to them! But in the meantime, a design choice. How's this sound: suppose we add mutations with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Python implementation is really helpful, thanks @petrelharp! I've left a bunch of comments. Basically I think we need to split things up a bit more to simplify the code a bit.
The more I think about the rates the less convinced I am that we're doing it properly here too. I would have thought that the right thing to do would be to generate a poisson number of mutations proportional to the area of each edge (without discretising), and then throwing that number of "darts" uniformly along the extend of the edge. We then floor the position they fall at to decide which integer bin they fall in to. This works fine for very short edges with span << 1, whereas the current approach doesn't I think.
Yes, sorry - I'm totally not done (should have said) - but thanks for the comments!
So, the model I'm using here with the floor/ceiling is that mutations only occur at integer locations, and so the mutational "mass" of a span is equal to the number of integers falling in it. Since our intervals are left-closed, right-open, the integers in And, I must be confused about your model, but I don't see how we can throw down positions as floats and then round them - they might end up outside of the current span (even outside of the current mutation map rate interval!). It's true that the current approach won't put any mutations in edges that don't cover an integer, but I think that's OK - really, we want this mutation model to be used in conjunction with discrete recombination as well, in wihch case all edges will contain at least one integer. Or, am I missing something? |
OK, I think the python version is in reasonably good shape. I've edited my description above to reflect the current state of things (traversing the tree; no longer using compute_mutation_parents). I will work on writing more tests and getting this into C next. The design decisions above - like exactly how But: something I'm unclear on is the right way to do the allele <-> index conversion in C; e.g., here. What's the right way to this? |
c81b332
to
8fb70e9
Compare
I see, that's a lot clearer, thanks. A quick comment would help here, as it's reasonably subtle. I still wonder if we should just error-out if a user tries to mix integer and floating point topologies and site positions, but maybe that's too stringent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks @petrelharp. Sorry for all the nitpicking here, but I really want to understand this and to make sure it's as simple as it possibly can be!
I wondered this too, but I think the model is conceptually clear - the mutation mapp has unit mass at each integer site instead of spread uniformly - so I think it's OK to just strongly recommend in the docs that people use discrete recomb along with discrete mutation. This will be helped along if we just have a single "discrete sites" parameter to |
Ok - I think this refactor is close to as simple as it's going to be. Separating out building the mutation tree did not result in simpler code, as remarked above. Could you have a look, @jeromekelleher? Besides "how's this look", a question I have about moving on to C is:
|
Hi, @jeromekelleher: the current status here is that:
I would like input on this: it seems like the right thing to do is to define a mutation model at the C level, and a python interface and everything. I think this is what was intended when |
Thanks @petrelharp - I'll have a look tomorrow morning. (Holiday here today) |
No hurry! Happy spring!! 🥚 🌷 🌳 |
Update: the C code is now passing the C tests! |
So - spelling things out a bit more: it's seeming to me that we should define in C a |
It now builds and passes many of the tests. However, I've made some memory management error somewhere: I keep getting strange errors that move when I insert printf statements. In getting it to pass the tests, I've:
I've not gotten to implementing the provenance fix that @benjeffery kindly provided above. |
Did a little squash'n'rebase here to get the coverage diff simpler. |
f3149c2
to
e5a7fab
Compare
a81bcd3
to
52288dd
Compare
I've gone through this again @petrelharp, and I think we should merge ASAP and follow up with some issues to track what needs to be done afterwards. I've run some of the statistical checks, and it seems nothing has been broken in existing code by what we're merging. There's only one thing I don't like and don't want to get forgotten about: how does the second MSP_ERR_MUTATION_GENERATION_OUT_OF_ORDER error happen? I think we need to either craft a C test case to provoke it or prove to ourselves it can't happen and stick in an assert. |
Sounds good. I've changed the second out of order error, which should have been an assert. I want to do a quick statistical test of the number of mutations; then I'll squash & rebase. Did you notice the statistical tests? We didn't have any such tests in before, i don't think, so at first I was going to put them in |
Spotted them, but didn't think too hard about what they are doing. Once they're quick to run and don't result in random CI failures I'm all for it. |
Great! Please merge away whenever you're happy. |
9f9d8a9
to
ab48e21
Compare
Ok - I wrote those mutation rate tests, and am pretty happy with them. We're just testing if each tree looks like it has the right Poisson number of mutations on it - we could look by branch - but this seems fine. We're not doing a statistical test with a non-uniform mutation map, but again, that seems OK. Perhaps this should be merged? (oh gee, once I fix these errors??) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment about the stats tests, since you're fixing problems anyway
ab48e21
to
001188c
Compare
There, how's that look? |
I don't know why codecov is unhappy - seems like we have improved coverage? But maybe I don't know how to parse this right. (oh, but: we are maybe not testing the |
Added a test for asdict.. |
mutations; and more.
fa37ffa
to
550d689
Compare
I think we can merge, but am bothered by codecov. Could you take a look and hit merge if it's good, @jeromekelleher? |
Codecov looks good to me - only OOMs and hard to do errors that aren't covered. |
thanks, @benjeffery! merging 🎉!!! |
Great, thanks @petrelharp! Hooray!!!! |
I'm having a go at #553. The two methods sketched out there for doing this are roughly:
This PR is the latter. It doesn't work yet. Without fully explaining (yet), the idea is that:
So, once I've done this right, I should be able to
This is not yet right, and I've probably done something terribly wrong C-wise, but maybe you'll get the idea. Feel free to ignore for the moment (but suggestions appreciated).