Are the HMM emission probabilities parameterised correctly? #403

hyanwong · 2020-12-08T20:42:33Z

Currently, the emission probabilities seem to be given by this line:

Line 736 in 7f69d59

p_e = 1 - (num_alleles - 1) * mu

That means that when mu is 1 (which I would take to mean that the emitted allele is completely unrelated to the haplotype we are copying from), then for a 2-allele model p_e is 1 - (2 - 1) * 1 = 0. Can this be correct? Shouldn't the minimum emission probability be 1/num_alleles, @jeromekelleher?
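To make the concern concrete, here is a minimal sketch that evaluates the quoted formula at mu = 1 for a few allele counts and compares it with 1/num_alleles. The function name `current_match_probability` is hypothetical and only wraps the single line quoted above.

```python
def current_match_probability(mu, num_alleles):
    """Match emission probability under the formula quoted from the source line."""
    return 1 - (num_alleles - 1) * mu


# Compare the value at mu = 1 with the "uninformative" floor of 1 / num_alleles.
for num_alleles in (2, 3, 4):
    p_e = current_match_probability(1.0, num_alleles)
    print(f"num_alleles={num_alleles}: p_e={p_e}, 1/num_alleles={1 / num_alleles}")
```

With two alleles the value reaches exactly zero, and with more than two alleles it goes negative.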


hyanwong · 2020-12-08T20:54:58Z

Another thought: it would be better (at least in the match-samples phase) to use the actual allele frequency to determine the "uninformative" emission probabilities. If the derived allele is at freq p, then (assuming the underlying state is uninformative), the emissions should be p for the derived allele, and 1-p for the ancestral allele.

This is a bit more difficult during the match_ancestors phase, because early in time, the derived allele will be at freq 0. But even if we used the current freq as a proxy for this, I suspect it would be better than using equal probabilities of 0.5/0.5.
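A rough sketch of the frequency-based idea, for a single biallelic site. The function name `frequency_based_emissions` and the 0/1 encoding (0 = ancestral, 1 = derived) are assumptions for illustration, not part of the existing code.

```python
def frequency_based_emissions(genotypes):
    """Return "uninformative" emission probabilities for one biallelic site.

    genotypes: non-empty sequence of 0 (ancestral) or 1 (derived) calls.
    The derived allele is emitted with its sample frequency p, the
    ancestral allele with 1 - p.
    """
    p = sum(genotypes) / len(genotypes)
    return {0: 1 - p, 1: p}


# Derived allele carried by 3 of 4 samples.
print(frequency_based_emissions([0, 1, 1, 1]))
```

For a site where the derived allele has not yet arisen, p is 0 and the derived allele gets emission probability 0, which is the match_ancestors difficulty mentioned above.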

jeromekelleher · 2020-12-09T11:48:26Z

The emission probability is based on the existing treatments: see, e.g., Rosen and Paten, Lunter, and Stephens and Donnelly.

We have dropped the "coalescenty" terms from the emission probability in, e.g., S&D, as these are constants. But we are, I think, consistent with the other derivations. How do you see what we're doing as different from these treatments?

hyanwong · 2020-12-09T11:58:29Z

Thanks for the links @jeromekelleher. I guess this is a question of the interpretation of the "mutation" rate: we aren't parameterising it as a mutation but potentially as an "error", which might change the angle we come at it.

Regardless, I can't see how it makes sense to allow negative emission probabilities in their formulations (e.g. for mu=1, A>2). I feel I should ask Gil about this anyway.

hyanwong · 2020-12-09T12:13:00Z

It is easiest to think about the extreme case: take a mutation (or mismatch) probability of mu=1. Do we mean that this is a probability of 1 that we mutate to another state, or a probability of 1 that we are hit by a mutagenic (or error-prone) process? The parameterisation in Rosen and Paten (for example) assumes the first: 1-(A-1)mu, so that for A=2, if mu is 1, we cannot match any more, and are forced to switch (emission probability is 0 => this path becomes impossible). It's a bit like whether or not you allow mutation to the same state in tskit, I suppose.

I would argue that if we are thinking of this in terms of error, the maximum mismatch probability should correspond to the case where the current copying path provides no information about the emitted state. That's a different parameterisation of the emission probs.

This feels like something to chat about face-to-face, TBH.

Edit - we could, of course, change this by simply changing the meaning we attach to mu, and have our "mismatch probability" max out (where there are 2 alleles) at 0.5 (maximally uninformative), which would (coincidentally) match how the recombination probabilities are calculated from the recombination rate.
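The two readings of mu can be put side by side in a short sketch. Both function names are hypothetical, and the second parameterisation is one possible way to make mu = 1 mean "the copying path is uninformative"; it is an assumption, not the code in the repository.

```python
def emission_forced_change(mu, num_alleles):
    """mu is the probability of changing to each specific *other* allele.

    Returns (match, mismatch_per_other_allele), as in 1 - (A - 1) * mu.
    """
    return 1 - (num_alleles - 1) * mu, mu


def emission_uniform_redraw(mu, num_alleles):
    """mu is the probability of being hit by an error process that then
    redraws the allele uniformly from all alleles, including the original.

    Returns (match, mismatch_per_other_allele).
    """
    mismatch = mu / num_alleles
    return 1 - (num_alleles - 1) * mismatch, mismatch


for mu in (0.0, 0.5, 1.0):
    print(mu, emission_forced_change(mu, 2), emission_uniform_redraw(mu, 2))
```

Under the second reading the per-allele mismatch probability never exceeds 1/num_alleles, so for two alleles it maxes out at 0.5.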

hyanwong · 2020-12-09T12:31:38Z

Actually, reading through the papers, Equation 4.2 in ~~Stephens and Donelly~~ Donnelly & Leslie uses my initial suggestion, not that in Rosen and Paten, I think (replacing the mu with theta)?

As they say: "Notice that as θ → ∞ the alleles 0 and 1 at any given site become equally likely", which is definitely not true for the Rosen and Paten formulation. Personally, I think we could do even better by using the allele frequencies, as I commented above, but I think that's probably an extension of the idea that needs to be tested separately.
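As a hedged illustration of the limiting behaviour in that quote, the sketch below uses the theta-based biallelic emission form familiar from Li and Stephens style copying models. That this matches Equation 4.2 term for term is an assumption, and `theta_emissions` is a hypothetical name.

```python
def theta_emissions(theta, n):
    """Biallelic emissions in a Li and Stephens style model (assumed form).

    n: number of haplotypes being copied from; theta: scaled mutation rate.
    Returns (match, mismatch).
    """
    mismatch = 0.5 * theta / (n + theta)
    match = n / (n + theta) + mismatch
    return match, mismatch


# As theta grows, both emission probabilities approach 0.5.
for theta in (0.01, 1.0, 100.0, 1e9):
    print(theta, theta_emissions(theta, n=10))
```

Here match and mismatch always sum to 1, and neither can leave the interval [0, 1] for any non-negative theta.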

jeromekelleher · 2020-12-09T14:35:17Z

I spent a long time working on this in the context of the general purpose code for tskit, and I'm happy with the independent-of-n parameterisation of the emission probabilities. The clearest implementation is here. There's plenty of precedent for doing it this way, and S&D do say that the mutation values don't matter much. We can always add an option to change the behaviour later if we wish.

Unless there's an obvious actual error in what we're doing we shouldn't change things now - we just need to state that this is what we're doing.

hyanwong · 2020-12-09T14:51:37Z

I think there is (essentially) an error for num_alleles > 2, but perhaps we don't care about that. As you say, n probably doesn't matter too much as we get to large numbers of haplotypes, but I think that there's a missing factor of num_alleles (i.e. 2) in the current parameterisation: it should be something like (1 - (num_alleles - 1) * mu) / num_alleles and mu/num_alleles. Otherwise we get nonsensical values for mu=1. The other option is to scale mu so that the maximum "mismatch_probability" is 0.5, rather than 1, but that seems a hack?

Edit - another way to look at it: currently if we set the mismatch probability to 1, we force a recombination to occur, whatever the recombination rate. That seems.... wrong.
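A toy enumeration shows the forced recombination. Everything here is a simplified sketch: the two-haplotype panel, the query, and the transition form (stay with probability 1 - rho + rho/n, switch to each other haplotype with rho/n) are assumptions chosen for illustration, and `path_probability` is a hypothetical helper.

```python
from itertools import product


def path_probability(path, query, panel, mu, rho):
    """Joint probability of one copying path and the query (biallelic sites)."""
    n = len(panel)
    prob = 1 / n  # uniform choice of the first haplotype to copy from
    for site, hap in enumerate(path):
        if site > 0:
            stay = 1 - rho + rho / n
            switch = rho / n
            prob *= stay if hap == path[site - 1] else switch
        matches = panel[hap][site] == query[site]
        prob *= (1 - mu) if matches else mu
    return prob


panel = [[0, 0], [1, 1]]
query = [0, 1]
for rho in (0.1, 0.0):
    for path in product(range(len(panel)), repeat=len(query)):
        print(rho, path, path_probability(path, query, panel, mu=1.0, rho=rho))
```

With mu = 1 the only path with non-zero probability copies from the mismatching haplotype at every site, which requires a switch between the two sites. With rho = 0 no path has non-zero probability at all.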

hyanwong · 2020-12-09T14:54:35Z

I mean, basically, we should never allow p_e or p_t to be less than 0 or greater than 1, otherwise we break the algorithm. At the moment, especially because we (accidentally?) count missing data as an extra allele (see #406), we can make p_e > 1 by inputting otherwise reasonable sounding mismatch probabilities (e.g. 0.8)
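One way to guard against this is a range check at the point where the emission probability is computed. This is only a sketch: `checked_match_probability` is a hypothetical helper, and the example assumes missing data has pushed the allele count at a biallelic site to 3.

```python
def checked_match_probability(mu, num_alleles):
    """Compute p_e = 1 - (num_alleles - 1) * mu and reject values outside [0, 1]."""
    p_e = 1 - (num_alleles - 1) * mu
    if not 0 <= p_e <= 1:
        raise ValueError(
            f"p_e={p_e} is not a probability (mu={mu}, num_alleles={num_alleles})"
        )
    return p_e


print(checked_match_probability(0.8, 2))  # fine for a truly biallelic site
try:
    # Counting missing data as a third allele: 1 - 2 * 0.8 is negative.
    checked_match_probability(0.8, 3)
except ValueError as err:
    print("rejected:", err)
```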

jeromekelleher · 2020-12-09T15:30:11Z

You're right, the R&P parameterisation that I worked from is wrong for k > 2. Goddamit! Well spotted, thanks.

Hmm, this is a mess. Perhaps it's not so bad though: we can just disallow sites with > 2 alleles for inference. We don't actually count missing data as an allele, AFAIK, so this should be OK. In practice, we don't use > biallelic sites for inference, anyway, right?

This is as a short-term fix to get 0.2 out the door.

hyanwong · 2020-12-09T15:42:31Z

> You're right, the R&P parameterisation that I worked from is wrong for k > 2. Goddamit! Well spotted, thanks.

That's OK. I feel like I've been bugging you too much about this, so happy it's not been a waste of time for you or me.

> Hmm, this is a mess. Perhaps it's not so bad though: we can just disallow sites with > 2 alleles for inference.

Yes, I think we can do this for the time being.

> We don't actually count missing data as an allele, AFAIK, so this should be OK.

Hmm. I'm worried that we might. I'll check in a little bit, just to satisfy me, and if you're right, I'll update the docs.

> In practice, we don't use > biallelic sites for inference, anyway, right?

Correct. So I think we're OK in that respect.

> This is as a short-term fix to get 0.2 out the door.

Yes, I agree. But I do think that we need to decide whether a mismatch probability of 1 means that the value being emitted is always the exact opposite of that suggested by the hidden state, or if it means that the hidden state is (basically) uninformative about what allele is emitted. The first seems a bit weird to me, and I'm strongly inclined to go for the latter, which means (I think) dividing by a factor of 2 (well, num_alleles, but we're going to constrain that to 2 anyway).

jeromekelleher · 2020-12-09T16:01:07Z

I think it's simplest to keep things parameterised as mu meaning "the probability of mutating to the other allele". If mu is 0 then we must have a match and if mu is 1 we must have a mismatch. Lunter also takes this view: "and pμ is the probability of a mutation to one of the three other nucleotides." - he hard-codes into the model the 4 nucleotide perspective.

hyanwong · 2020-12-09T16:15:47Z

> I think it's simplest to keep things parameterised as mu meaning "the probability of mutating to the other allele". If mu is 0 then we must have a match and if mu is 1 we must have a mismatch.

Hmm, in that case we need to document it really carefully, I think, because I (and I suspect others) would assume that a "mismatch" probability of 1 means that the emitted state is random, not that it cannot match the haplotype.

If we do go down this route, I would rescale the mismatch ratio so that it could only ever produce "mismatch probabilities" from 0 to 1/num_alleles. It seems nonsensical that if we tweak the mismatch_ratio parameter up to high values, it will force everything to deliberately mismatch.

hyanwong · 2020-12-09T16:22:14Z

By the way, that's not what the parameterisation of the recombination rate does, so we're being a little inconsistent here, not that anyone but ourselves would notice. A recombination probability of 1 means there's a chance of recombining onto any haplotype, including oneself (that's the 1/n bit). A mismatch/mutation probability of 1 means there's a chance of mutation to any other allele excluding oneself.
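The asymmetry can be seen by writing out one row of each distribution. This is a sketch under the assumptions stated in this comment (recombination lands uniformly on any of the n haplotypes, including the current one; mutation goes only to other alleles); `transition_row` and `emission_row` are hypothetical names.

```python
def transition_row(r, n, current=0):
    """Probabilities of copying from each of n haplotypes at the next site."""
    row = [r / n] * n        # recombine onto any haplotype, including the current one
    row[current] += 1 - r    # or do not recombine at all
    return row


def emission_row(mu, num_alleles, current=0):
    """Probabilities of emitting each allele given the copied allele."""
    row = [mu] * num_alleles                      # mutate to each *other* allele
    row[current] = 1 - (num_alleles - 1) * mu     # remaining mass on a match
    return row


print(transition_row(1.0, 4))  # uniform: the current haplotype keeps weight 1/n
print(emission_row(1.0, 2))    # all weight on the other allele, none on a match
```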

hyanwong · 2020-12-09T16:29:06Z

We should also write this all down somewhere, for ourselves if nothing else, as we'll probably want to adjust for >2 alleles later, etc.

hyanwong · 2020-12-09T20:29:26Z

> We don't actually count missing data as an allele, AFAIK, so this should be OK.

As I suspected, we actually do. Fixed in #413, but it defaults to counting them, except when passed to the tree seq builder. Perhaps we never want to count them (i.e. default to count_missing=False).

hyanwong · 2020-12-11T12:57:03Z

Closed & documented by #404

hyanwong mentioned this issue Dec 8, 2020

Correct recombination probabilities #404

Merged

hyanwong mentioned this issue Dec 9, 2020

Use Donnelly & Leslie parameterization of LS emission probs #411

Closed

hyanwong mentioned this issue Dec 9, 2020

Do not include missing allele in tree seq builder num_alleles #413

Merged

hyanwong mentioned this issue Dec 9, 2020

Cannot match if mismatch probabilities are 1 (assertion fails in PY_ENGINE, path not found in C_ENGINE) #402

Closed

jeromekelleher mentioned this issue Dec 10, 2020

Emission probabilities in HMM wrong for > 2 alleles #415

Closed

hyanwong closed this as completed Dec 11, 2020

hyanwong mentioned this issue Oct 18, 2022

Fix HMM for > 2 alleles #437

Open
