Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix HMM for > 2 alleles #437

Open
jeromekelleher opened this issue Dec 18, 2020 · 3 comments
Open

Fix HMM for > 2 alleles #437

jeromekelleher opened this issue Dec 18, 2020 · 3 comments
Milestone

Comments

@jeromekelleher
Copy link
Member

The emission probas in the HMM are not computed correctly when we have > 2 alleles. We currently raise a ValueError if the user tries to force inference at such sites.

See #415 and the linked issues for background.

@jeromekelleher jeromekelleher added this to the Release 0.3.0 milestone Oct 17, 2022
@jeromekelleher
Copy link
Member Author

Need this for #675

Check for tests in referring to #415 and enable.

@hyanwong
Copy link
Member

I see that I raised this issue originally in #403. I'm happy to revisit it for this PR.

@hyanwong
Copy link
Member

hyanwong commented Oct 26, 2022

A summary is helpful here, I think. Internally we use recombination probabilities, calculated from a recombination rate combined with a the distance between sites (e.g. here). This calculated recombination probability is passed to the internal algorithm (as rho) and then used to calculate the transition probabilities, p_recomb and p_no_recomb e.g. here:

p_no_recomb = p_last * (1 - rho + rho / n)

We also specify mismatch probabilities (mu, often equated with the mutation probability in L&S HMMs). In tsinfer these are calculated by combining the mismatch ratio with the median recombination probability. We convert the mismatch probability into an emission probability e via the Rosen & Paton parameterization (their eqn 4): e = 1-(A-1) * µ if the hidden state is the same as the emitted state, or e=µ if the hidden state is different from the emitted state, here:

p_e = mu

For A=2 alleles, the emission probs are therefore 1-mu (if hidden and emitted states are the same) and mu (if different), This has the desirable feature that if mu=0, the probs are 1 and 0, or if mu=1 the probs are 0 and 1. We decided in #403 (comment) that we were happy with this definition of mismatch: i.e. if mu=1 it forces the hidden state to be different from the emitted state (this contrasts to the treatment in Donnelly & Leslie, eqn 4.2, where an infinite mutation rate results in the emitted state having an equal probability of being any of the allelic states).

The problem with this particular parameterisation is that with a high mismatch probability (mu=1) and more than 2 alleles, the emission probability becomes less than zero! So effectively, the Rosen & Paton parameterization requires mu to be less than 1/(A-1). Another way of thinking about this is that it treats the "mismatch probability" as the probability of mutation from one allele to a specific other allele, rather than to any other allele. This is the same in the Lunter parameterization, where A is hardcoded to 4, so that his mutation rate mu can never be set above 1/3.

I suggest that for L&S matching with >2 alleles we don't want to have this particular parameterization. I suggest 2 possibilities:

  1. We could treat all non-matching alleles as identical for the purpose of matching (i.e. partition the alleles into match/non-match, rather than a set of allelic states). That means we simply have 1-mu and mu as the emission probabilities, and don't bother incorporating the number of alleles at all.
  2. We could switch to the Donnelly & Leslie parameterization, and have mu=1 meaning that the emission probabilities are all equal (e.g. 1/num_alleles) and mu=0 meaning that the emission probabilities are 1 and 0, so something like
    1/num_alleles * mu and 1/num_alleles + (1-mu) * (num_alleles - 1)/num_alleles (I think this is the equivalent to Donnelly & Leslie's eqn 4.2 but with mu going from 0..1 rather than theta going from 0..infinity).

Note that @astheeggeggs has thought about this, for example at https://github.com/astheeggeggs/lshmm/blob/9b15d8417e2856e13b28d59b4c8d078068fd8c0b/lshmm/api.py#L142 so we should ask him to look at any reparameterisation we do here to check that it makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants