Add ancestral_state to map_mutations #1550
jeromekelleher
left a comment
This looks basically right to me.
Codecov Report
@@ Coverage Diff @@
## main #1550 +/- ##
=======================================
Coverage 93.70% 93.70%
=======================================
Files 27 27
Lines 22759 22775 +16
Branches 1076 1076
=======================================
+ Hits 21326 21342 +16
Misses 1399 1399
Partials 34 34
I don't understand what you mean here - the actual site.ancestral_state is guaranteed to be the value you asked for, right? Once this holds, it's up to the algorithm to decide a parsimonious assignment, and putting a mutation over the root will often be the only way to do this (e.g., genotypes = [0,...,0] and specifying ancestral state = 1).
Yep, it just depends on whether by ancestral state you mean the state of the MRCA. If that were the case, then the most parsimonious explanation would be for every branch descending from the MRCA to have a separate switch. I guess it's just a question of making sure the documentation is clear that the ancestral state is not necessarily the one shown by the grand MRCA, if mutations above the root are allowed (perhaps that's obvious, but it bears saying somewhere).
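To make the point above concrete, here is a minimal sketch in pure Python. It is not tskit's Hartigan implementation: it swaps in a small Sankoff-style dynamic program with unit costs, and the function name `map_mutations_sketch`, the `children` dictionary and the `(node, derived_state)` tuples are all hypothetical names chosen for illustration. With all-zero genotypes and a requested ancestral state of 1, the cheapest assignment is a single mutation above the root rather than one switch per branch below it.

```python
import math


def map_mutations_sketch(children, genotypes, root, num_alleles, ancestral_state=None):
    """Toy parsimony (Sankoff, unit costs). Returns (ancestral_state, mutations),
    where each mutation is a hypothetical (node, derived_state) tuple."""
    cost = {}  # cost[u][a]: fewest mutations strictly below u if u has state a

    def fill(u):
        kids = children.get(u, [])
        if not kids:
            # A leaf can only take its observed genotype.
            cost[u] = [0 if a == genotypes[u] else math.inf for a in range(num_alleles)]
            return
        for c in kids:
            fill(c)
        cost[u] = [
            sum(min(cost[c][b] + (a != b) for b in range(num_alleles)) for c in kids)
            for a in range(num_alleles)
        ]

    fill(root)
    if ancestral_state is None:
        # No state requested: take the cheapest root state as the ancestral state.
        ancestral_state = min(range(num_alleles), key=lambda a: cost[root][a])
    # The root may differ from the requested ancestral state at the price of one
    # mutation above the root; ties prefer keeping the ancestral state.
    root_state = min(
        range(num_alleles),
        key=lambda r: (cost[root][r] + (r != ancestral_state), r != ancestral_state),
    )
    mutations = []
    if root_state != ancestral_state:
        mutations.append((root, root_state))
    stack = [(root, root_state)]
    while stack:
        u, a = stack.pop()
        for c in children.get(u, []):
            # Cheapest child state given the parent state; ties prefer no change.
            b = min(range(num_alleles), key=lambda s: (cost[c][s] + (a != s), a != s))
            if b != a:
                mutations.append((c, b))
            stack.append((c, b))
    return ancestral_state, mutations


# Toy tree: root 4 has children 0 and 3; node 3 has children 1 and 2.
children = {4: [0, 3], 3: [1, 2]}
genotypes = {0: 0, 1: 0, 2: 0}
print(map_mutations_sketch(children, genotypes, 4, 2, ancestral_state=1))
# prints (1, [(4, 0)])
```

The returned ancestral state is the requested one (1), while the state at the root node itself is 0, which is exactly the distinction the documentation would need to spell out.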
Thanks @jeromekelleher. It would be useful to get this in for the tsdate paper revisions, so I'll have a go at the C side. Are there any tips on how in C you are supposed to check for a variable that can get passed in as either an integer or
In the Python C API you'd make it an optional value. It's not obvious what the right approach is to getting the plumbing set up, so it might be simplest if I put in a commit here doing the C side and then perhaps you could write more tests?
If you have time, that would be great, thanks. Then I can see how it's done too.
16a340f to
ee6896c
Compare
Changes to tests now pushed - this now has compare_lib=True, so tests should fail until the C version is working.
Done - C stuff was slightly tricky because we already had an I think we should probably support specifying the ancestral state as a string as well, since it's easy enough to do. What do you think?
Also, it's not clear to me when we'll have mutations over roots and mutations over the children of the root - it would be good to test this on a few different oddball scenarios to make sure it's doing what we think it's doing.
Thanks - quick work.
If you don't mind variable type params, that's fine, yeah. I guess this would just be in Python though?
Yes, just at the top-level Python function. See the comment.
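A minimal sketch of what accepting either a string or an integer at the top-level Python function might look like. The helper name `resolve_ancestral_state` is hypothetical and is not claimed to be the code in this PR; the idea is only that the Python layer turns a string into an index into `alleles` before anything is handed to the C code, which knows nothing about allele strings.

```python
def resolve_ancestral_state(ancestral_state, alleles):
    """Hypothetical helper: return the ancestral state as an integer index
    into ``alleles``, or None if no ancestral state was requested."""
    if ancestral_state is None:
        return None
    if isinstance(ancestral_state, str):
        # tuple.index / list.index raise ValueError if the allele is absent.
        return alleles.index(ancestral_state)
    index = int(ancestral_state)
    if not 0 <= index < len(alleles):
        raise ValueError("ancestral_state must index into the alleles list")
    return index


alleles = ("A", "T", "C")
print(resolve_ancestral_state("C", alleles))  # prints 2
print(resolve_ancestral_state(1, alleles))    # prints 1
```

Keeping the type dispatch in one place means the lower-level code only ever sees an optional integer.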
I think that there's a slight wrinkle: at the moment we calculate the number of alleles as the maximum integer in the genotype array. But actually we should be calculating it as the length of the provided alleles list (or one less than that if the last item in the alleles list is For the Python code, we'd just have (instead of I guess something similar could happen in the C code? I assume it was done like that previously simply for efficiency reasons, e.g. if the user specified a long list of alleles but only a few were actually present in the genotype array? (edit) - or we could simply do
This is a quirk all right, and we could work around it by letting anything < 64 be the ancestral state (from the C code's perspective, which doesn't know anything about alleles). My thought, though, was that specifying a value of 2 (say) as the ancestral state when the genotypes array is full of 0s and 1s is probably a mistake. We can change it though, if you think there's a reasonable use case.
3420384 to
9c3c1e7
Compare
I think it needn't be a mistake to specify an ancestral state that isn't in the haplotype array. Indeed, it is quite likely to happen in user code. We have had a fair number of examples brought up by users where the ancestral state in the VCF is, say, "C", but the samples that are actually used in the analysis have only "A" and "T". It would be a mistake, IMO, not to be able to lay down variation on these triallelic sites via parsimony in tsinfer, which is one of the use-cases of this new functionality. I've just pushed a commit which changes both the C and Python versions as appropriate, along with a few more tests. Note that I haven't added to the C tests, so the check in the C code that ancestral_state < HARTIGAN_MAX_ALLELES is not tested.
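As an illustration of the triallelic case, here is a small sketch of how the allele count could be widened to cover an ancestral state that never appears in the genotypes. This is an assumption about one possible approach, not necessarily what the pushed commit does: the function name `num_alleles_for` is hypothetical, the value 64 for `HARTIGAN_MAX_ALLELES` is taken from the "< 64" remark earlier in the thread, and negative genotypes are assumed to mark missing data.

```python
HARTIGAN_MAX_ALLELES = 64  # assumed value, from the "< 64" remark in the discussion


def num_alleles_for(genotypes, ancestral_state=None):
    """Hypothetical helper: number of allele states the algorithm must consider.

    Starts from the largest genotype value (negative values, assumed to mean
    missing data, never raise the count) and widens it so that a requested
    ancestral state absent from the genotypes is still a valid state."""
    n = max(max(genotypes, default=-1) + 1, 0)
    if ancestral_state is not None:
        if not 0 <= ancestral_state < HARTIGAN_MAX_ALLELES:
            raise ValueError("ancestral_state out of range")
        n = max(n, ancestral_state + 1)
    return n


# alleles = ("A", "T", "C"); the samples carry only "A" (0) and "T" (1),
# while the VCF says the ancestral state is "C" (2).
print(num_alleles_for([0, 1, 1, 0]))                     # prints 2
print(num_alleles_for([0, 1, 1, 0], ancestral_state=2))  # prints 3
```

The range check mirrors the `ancestral_state < HARTIGAN_MAX_ALLELES` check mentioned above, which is the part noted as untested on the C side.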
9c3c1e7 to
c503dd2
Compare
Yep, looks good.
83b9631 to
1d886a4
Compare
Squashed and ready for review
This is failing CI because the C checker is failing to notice that
Just set ancestral_state to zero before the
1d886a4 to
1683fb8
Compare
Perfect, thanks. Now it's complaining about (error: conversion from ‘int’ to ‘int8_t’ {aka ‘signed char’} may change value) presumably because although both
Probably cast the 1 - either is fine though
d85a39b to
ace432b
Compare
jeromekelleher
left a comment
Some final things to clear up
f691f6a to
e6ab993
Compare
Seems to all be passing now, and all comments addressed. Thanks for the review @jeromekelleher
e6ab993 to
4a26304
Compare
Starts to address #1542. Currently this is a draft version of a change to the Python-only Hartigan map_mutations code. I think this is right, but it could do with checking, and we should also add a few round-trip tests with ancestral state specified (I'm not sure how comprehensive to make this, e.g. whether to patch it in to the round trip tests higher up). This doesn't code up the C version, but I guess that should be easy. It also doesn't add the ancestral state function to the fitch map_mutations code, but we don't use that except for testing anyway.
I guess it's OK that the algorithm is allowed to put a mutation above the root, so that the ancestral state for the whole tree switches immediately?