Fix HMM for > 2 alleles #437

jeromekelleher · 2020-12-18T15:18:45Z

The emission probas in the HMM are not computed correctly when we have > 2 alleles. We currently raise a ValueError if the user tries to force inference at such sites.

See #415 and the linked issues for background.

jeromekelleher · 2022-10-17T09:07:37Z

Need this for #675

Check for tests in referring to #415 and enable.

hyanwong · 2022-10-18T16:42:21Z

I see that I raised this issue originally in #403. I'm happy to revisit it for this PR.

hyanwong · 2022-10-26T19:48:25Z

A summary is helpful here, I think. Internally we use recombination probabilities, calculated from a recombination rate combined with a the distance between sites (e.g. here). This calculated recombination probability is passed to the internal algorithm (as rho) and then used to calculate the transition probabilities, p_recomb and p_no_recomb e.g. here:

tsinfer/tsinfer/algorithm.py

Line 736 in 1290344

p_no_recomb = p_last * (1 - rho + rho / n)

We also specify mismatch probabilities (mu, often equated with the mutation probability in L&S HMMs). In tsinfer these are calculated by combining the mismatch ratio with the median recombination probability. We convert the mismatch probability into an emission probability e via the Rosen & Paton parameterization (their eqn 4): e = 1-(A-1) * µ if the hidden state is the same as the emitted state, or e=µ if the hidden state is different from the emitted state, here:

tsinfer/tsinfer/algorithm.py

Line 745 in 1290344

p_e = mu

For A=2 alleles, the emission probs are therefore 1-mu (if hidden and emitted states are the same) and mu (if different), This has the desirable feature that if mu=0, the probs are 1 and 0, or if mu=1 the probs are 0 and 1. We decided in #403 (comment) that we were happy with this definition of mismatch: i.e. if mu=1 it forces the hidden state to be different from the emitted state (this contrasts to the treatment in Donnelly & Leslie, eqn 4.2, where an infinite mutation rate results in the emitted state having an equal probability of being any of the allelic states).

The problem with this particular parameterisation is that with a high mismatch probability (mu=1) and more than 2 alleles, the emission probability becomes less than zero! So effectively, the Rosen & Paton parameterization requires mu to be less than 1/(A-1). Another way of thinking about this is that it treats the "mismatch probability" as the probability of mutation from one allele to a specific other allele, rather than to any other allele. This is the same in the Lunter parameterization, where A is hardcoded to 4, so that his mutation rate mu can never be set above 1/3.

I suggest that for L&S matching with >2 alleles we don't want to have this particular parameterization. I suggest 2 possibilities:

We could treat all non-matching alleles as identical for the purpose of matching (i.e. partition the alleles into match/non-match, rather than a set of allelic states). That means we simply have 1-mu and mu as the emission probabilities, and don't bother incorporating the number of alleles at all.
We could switch to the Donnelly & Leslie parameterization, and have mu=1 meaning that the emission probabilities are all equal (e.g. 1/num_alleles) and mu=0 meaning that the emission probabilities are 1 and 0, so something like
1/num_alleles * mu and 1/num_alleles + (1-mu) * (num_alleles - 1)/num_alleles (I think this is the equivalent to Donnelly & Leslie's eqn 4.2 but with mu going from 0..1 rather than theta going from 0..infinity).

Note that @astheeggeggs has thought about this, for example at https://github.com/astheeggeggs/lshmm/blob/9b15d8417e2856e13b28d59b4c8d078068fd8c0b/lshmm/api.py#L142 so we should ask him to look at any reparameterisation we do here to check that it makes sense.

jeromekelleher added this to the Release 0.3.0 milestone Oct 17, 2022

jeromekelleher modified the milestones: Release 0.3.0, 0.3.1 Oct 19, 2022

hyanwong mentioned this issue Oct 25, 2022

Make dependence on 0/1 mutations in ancestors ts explicit or remove #107

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix HMM for > 2 alleles #437

Fix HMM for > 2 alleles #437

jeromekelleher commented Dec 18, 2020

jeromekelleher commented Oct 17, 2022

hyanwong commented Oct 18, 2022

hyanwong commented Oct 26, 2022 •

edited

Fix HMM for > 2 alleles #437

Fix HMM for > 2 alleles #437

Comments

jeromekelleher commented Dec 18, 2020

jeromekelleher commented Oct 17, 2022

hyanwong commented Oct 18, 2022

hyanwong commented Oct 26, 2022 • edited

hyanwong commented Oct 26, 2022 •

edited