Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are the HMM emission probabilities parameterised correctly? #403

Closed
hyanwong opened this issue Dec 8, 2020 · 16 comments
Closed

Are the HMM emission probabilities parameterised correctly? #403

hyanwong opened this issue Dec 8, 2020 · 16 comments

Comments

@hyanwong
Copy link
Member

hyanwong commented Dec 8, 2020

Currently, the emission probabilities seem to be given by this line:

p_e = 1 - (num_alleles - 1) * mu

That means when mu is 1 (which I would take to mean that the emitted allele is completely unrelated to the haplotype we are copying from) then for a 2-allele model, p_e is 1- (2-1) * 1 = 0 Can this be correct? Shouldn't the minimum emission probability be 1/num_alleles, @jeromekelleher ?

@hyanwong
Copy link
Member Author

hyanwong commented Dec 8, 2020

Another thought: it would be better (at least in the match-samples phase) to use the actual allele frequency to determine the "uninformative" emission probabilities. If the derived allele is at freq p, then (assuming the underlying state is uninformative), the emissions should be p for the derived allele, and 1-p for the ancestral allele.

This is a bit more difficult during the match_ancestors phase, because early in time, the derived allele will be at freq 0. But even if we used the current freq as a proxy for this, I suspect it would be better than using equal probabilities of 0.5/0.5.

@jeromekelleher
Copy link
Member

The emission proba is based on the existing treatments: see, .e.g, Rosen and Paten, Lunter and Stepens and Donelly.

We have dropped the "coalescenty" terms from the emission proba in, e.g., S&D, as these are constants. But, we are, I think, consistent with the other derivations. How do you see what we're doing as different from these treatments?

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

Thanks for the links @jeromekelleher. I guess this is a question of the interpretation of the "mutation" rate: we aren't parameterising it as a mutation but potentially as an "error", which might change the angle we come at it.

Regardless, I can't see how it makes sense to allow negative emission probabilities in their formulations (e.g. for mu=1, A>2). I feel I should ask Gil about this anyway.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

Easiest it to think about the extreme case: take a mutation (or mismatch) probability of mu=1. Do we mean that this is a probability of 1 that we mutate to another state, or a probability of 1 that we are hit by a mutagenic (or error-prone) process? The parameterisation in Rosen and Paten (for example) assumes the first: 1-(A-1)mu, so that for A=2, if mu is 1, we cannot match any more, and are forced to switch (emission probability is 0 => this path becomes impossible). It's a bit like whether or not you allow mutation to the same state in tskit, I suppose.

I would argue that if we are thinking of this in terms of error, the maximum mismatch probability should correspond to the case where the current copying path provides no information about the emitted state. That's a different parameterisation of the emission probs.

This feels like something to chat about face-to-face, TBH.

Edit - we could, of course, change this by simply changing the meaning we attach to mu, and have out "mismatch probability" max out (where there are 2 alleles) at 0.5 (maximally uninformative), which would (coincidentally) match how the recombination probabilities are calculated from the recombination rate.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

Actually, reading though the papers, Equation 4.2 in Stephens and Donelly Donnelly & Leslie uses my initial suggestion, not that in Rosen and Paten, I think (replacing the mu with theta)?

Screenshot 2020-12-09 at 12 30 32

As they say: "Notice that as θ → ∞ the alleles 0 and 1 at any given site become equally likely", which is definitely not true for the Rosen and Paten formulation. Personally, I think we could do even better by using the allele frequencies, as I commented above, but I think that's probably an extension of the idea that needs to be tested separately.

@jeromekelleher
Copy link
Member

I spent a long time working on this in the context of the general purpose code for tskit, and I'm happy with the independent-of-n parameterisation of the emission probabilities. The clearest implementation is here. There's plenty of precedent for doing it this way, and S&D do say that the mutation values don't matter much. We can always add an option to change the behaviour later if we wish.

Unless there's an obvious actual error in what we're doing we shouldn't change things now - we just need to state that this is what we're doing.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

I think there is (essentially) an error for num_alleles > 2, but perhaps we don't care about that. As you say, n probably doesn't matter too much as we get to large numbers of haplotypes, but I think that there's a missing factor of num_alleles (i.e. 2) in the current parameterisation: it should be something like (1 - (num_alleles - 1) * mu) / num_alleles and mu/num_alleles. Otherwise we get nonsensical values for mu=1. The other option is to scale mu so that the maximum "mismatch_probability" is 0.5, rather than 1, but that seems a hack?

Edit - another way to look at it: currently if we set the mismatch probability to 1, we force a recombination to occur, whatever the recombination rate. That seems.... wrong.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

I mean. basically, we should never allow p_e or p_t to be less than 0 or greater than 1, otherwise we break the algorithm. At the moment, especially because we (accidentally?) count missing data as an extra allele (see #406) , we can make p_e > 1 by inputting otherwise reasonable sounding mismatch probabilities (e.g. 0.8)

@jeromekelleher
Copy link
Member

You're right, the R&P parameterisation that I worked from is wrong for k > 2. Goddamit! Well spotted, thanks.

Hmm, this is a mess. Perhaps it's not so bad though: we can just disallow sites with > 2 alleles for inference. We don't actually count missing data as an allele, AFAIK, so this should be OK. In practise, we don't use > biallelic sites for inference, anyway, right?

This is as a short-term fix to get 0.2 out the door.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

You're right, the R&P parameterisation that I worked from is wrong for k > 2. Goddamit! Well spotted, thanks.

That's OK. I feel like I've been bugging you too much about this, so happy it's not been a waste of time for you or me.

Hmm, this is a mess. Perhaps it's not so bad though: we can just disallow sites with > 2 alleles for inference.

Yes, I think we can do this for the time being.

We don't actually count missing data as an allele, AFAIK, so this should be OK.

Hmm. I'm worried that we might. I'll check in a little bit, just to satisfy me, and if you're right, I'll update the docs.

In practise, we don't use > biallelic sites for inference, anyway, right?

Correct. So I think we're OK in that respect.

This is as a short-term fix to get 0.2 out the door.

Yes, I agree. But I do think that we need to decide whether a mismatch probability of 1 means that the value being emitted is always the exact opposite of that suggested by the hidden state, or if it means that the hidden state is (basically) uninformative about what allele is emitted. The first seems a bit weird to me, and I'm strongly inclined to go for the latter, which means (I think) dividing by a factor of 2 (well num_alleles, but we're going to constraining that to 2 anyway).

@jeromekelleher
Copy link
Member

I think it's simplest to keep things parameterised as mu meaning "the probability of mutating to the other allele". If mu is 0 then we must have a match and if mu is 1 we must have a mismatch. Lunter also takes this view: "and pμ is the probability of a mutation to one of the three other nucleotides." - he hard-codes into the model the 4 nucleotide perspective.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

I think it's simplest to keep things parameterised as mu meaning "the probability of mutating to the other allele". If mu is 0 then we must have a match and if mu is 1 we must have a mismatch.

Hmm, in that case we need to document it really carefully, I think, because I (and I suspect others) would assume that a "mismatch" probability of 1 means that the emitted state is random, not that it cannot match the haplotype.

If we do go down this route, I would rescale the mismatch ratio so that it could only ever produce "mismatch probabilities" from 0 to 1/num_alleles. It seems nonsensical that if we tweak the mismatch_ratio parameter up to high values, it will force everything to deliberately mismatch.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

By the way, that's not what the parameterisation of the recombination rate does, so we're being a little inconsistent here, not that anyone but ourselves would notice. A recombination probability of 1 means there's a chance of recombining onto any haplotype, including oneself (that's the 1/n bit). A mismatch/mutation probability of 1 means there's a chance of mutation to any other allele excluding oneself.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

We should also write this all down somewhere, for ourselves if nothing else, as we'll probably want to adjust for >2 alleles later, etc.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

We don't actually count missing data as an allele, AFAIK, so this should be OK.

As I suspected, we actually do. Fixed in #413 , but it defaults to counting them, except when passed to the tree seq builder. Perhaps we never want to count them (i.e. default to count_missing=False)

@hyanwong
Copy link
Member Author

Closed & documented by #404

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants