-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Are the HMM emission probabilities parameterised correctly? #403
Comments
Another thought: it would be better (at least in the match-samples phase) to use the actual allele frequency to determine the "uninformative" emission probabilities. If the derived allele is at freq This is a bit more difficult during the match_ancestors phase, because early in time, the derived allele will be at freq 0. But even if we used the current freq as a proxy for this, I suspect it would be better than using equal probabilities of 0.5/0.5. |
The emission proba is based on the existing treatments: see, .e.g, Rosen and Paten, Lunter and Stepens and Donelly. We have dropped the "coalescenty" terms from the emission proba in, e.g., S&D, as these are constants. But, we are, I think, consistent with the other derivations. How do you see what we're doing as different from these treatments? |
Thanks for the links @jeromekelleher. I guess this is a question of the interpretation of the "mutation" rate: we aren't parameterising it as a mutation but potentially as an "error", which might change the angle we come at it. Regardless, I can't see how it makes sense to allow negative emission probabilities in their formulations (e.g. for mu=1, A>2). I feel I should ask Gil about this anyway. |
Easiest it to think about the extreme case: take a mutation (or mismatch) probability of mu=1. Do we mean that this is a probability of 1 that we mutate to another state, or a probability of 1 that we are hit by a mutagenic (or error-prone) process? The parameterisation in Rosen and Paten (for example) assumes the first: I would argue that if we are thinking of this in terms of error, the maximum mismatch probability should correspond to the case where the current copying path provides no information about the emitted state. That's a different parameterisation of the emission probs. This feels like something to chat about face-to-face, TBH. Edit - we could, of course, change this by simply changing the meaning we attach to mu, and have out "mismatch probability" max out (where there are 2 alleles) at 0.5 (maximally uninformative), which would (coincidentally) match how the recombination probabilities are calculated from the recombination rate. |
Actually, reading though the papers, Equation 4.2 in As they say: "Notice that as θ → ∞ the alleles 0 and 1 at any given site become equally likely", which is definitely not true for the Rosen and Paten formulation. Personally, I think we could do even better by using the allele frequencies, as I commented above, but I think that's probably an extension of the idea that needs to be tested separately. |
I spent a long time working on this in the context of the general purpose code for tskit, and I'm happy with the independent-of-n parameterisation of the emission probabilities. The clearest implementation is here. There's plenty of precedent for doing it this way, and S&D do say that the mutation values don't matter much. We can always add an option to change the behaviour later if we wish. Unless there's an obvious actual error in what we're doing we shouldn't change things now - we just need to state that this is what we're doing. |
I think there is (essentially) an error for num_alleles > 2, but perhaps we don't care about that. As you say, Edit - another way to look at it: currently if we set the mismatch probability to 1, we force a recombination to occur, whatever the recombination rate. That seems.... wrong. |
I mean. basically, we should never allow p_e or p_t to be less than 0 or greater than 1, otherwise we break the algorithm. At the moment, especially because we (accidentally?) count missing data as an extra allele (see #406) , we can make p_e > 1 by inputting otherwise reasonable sounding mismatch probabilities (e.g. 0.8) |
You're right, the R&P parameterisation that I worked from is wrong for k > 2. Goddamit! Well spotted, thanks. Hmm, this is a mess. Perhaps it's not so bad though: we can just disallow sites with > 2 alleles for inference. We don't actually count missing data as an allele, AFAIK, so this should be OK. In practise, we don't use > biallelic sites for inference, anyway, right? This is as a short-term fix to get 0.2 out the door. |
That's OK. I feel like I've been bugging you too much about this, so happy it's not been a waste of time for you or me.
Yes, I think we can do this for the time being.
Hmm. I'm worried that we might. I'll check in a little bit, just to satisfy me, and if you're right, I'll update the docs.
Correct. So I think we're OK in that respect.
Yes, I agree. But I do think that we need to decide whether a mismatch probability of 1 means that the value being emitted is always the exact opposite of that suggested by the hidden state, or if it means that the hidden state is (basically) uninformative about what allele is emitted. The first seems a bit weird to me, and I'm strongly inclined to go for the latter, which means (I think) dividing by a factor of 2 (well num_alleles, but we're going to constraining that to 2 anyway). |
I think it's simplest to keep things parameterised as mu meaning "the probability of mutating to the other allele". If mu is 0 then we must have a match and if mu is 1 we must have a mismatch. Lunter also takes this view: "and pμ is the probability of a mutation to one of the three other nucleotides." - he hard-codes into the model the 4 nucleotide perspective. |
Hmm, in that case we need to document it really carefully, I think, because I (and I suspect others) would assume that a "mismatch" probability of 1 means that the emitted state is random, not that it cannot match the haplotype. If we do go down this route, I would rescale the mismatch ratio so that it could only ever produce "mismatch probabilities" from 0 to 1/num_alleles. It seems nonsensical that if we tweak the mismatch_ratio parameter up to high values, it will force everything to deliberately mismatch. |
By the way, that's not what the parameterisation of the recombination rate does, so we're being a little inconsistent here, not that anyone but ourselves would notice. A recombination probability of 1 means there's a chance of recombining onto any haplotype, including oneself (that's the 1/n bit). A mismatch/mutation probability of 1 means there's a chance of mutation to any other allele excluding oneself. |
We should also write this all down somewhere, for ourselves if nothing else, as we'll probably want to adjust for >2 alleles later, etc. |
As I suspected, we actually do. Fixed in #413 , but it defaults to counting them, except when passed to the tree seq builder. Perhaps we never want to count them (i.e. default to count_missing=False) |
Closed & documented by #404 |
Currently, the emission probabilities seem to be given by this line:
tsinfer/tsinfer/algorithm.py
Line 736 in 7f69d59
That means when
mu
is 1 (which I would take to mean that the emitted allele is completely unrelated to the haplotype we are copying from) then for a 2-allele model, p_e is1- (2-1) * 1 = 0
Can this be correct? Shouldn't the minimum emission probability be1/num_alleles
, @jeromekelleher ?The text was updated successfully, but these errors were encountered: