Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct recombination probabilities #404

Merged

Conversation

hyanwong
Copy link
Member

@hyanwong hyanwong commented Dec 8, 2020

This is a template for the sort of thing I suggest we might use to fix #398 . It still suffers from pathological probabilities with extreme values of e.g. mismatch_ratio, though. Some of this might need to be discussed in conjunction with #403

@codecov
Copy link

codecov bot commented Dec 8, 2020

Codecov Report

Merging #404 (af69e48) into main (d74f0e1) will increase coverage by 0.01%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #404      +/-   ##
==========================================
+ Coverage   92.78%   92.80%   +0.01%     
==========================================
  Files          17       17              
  Lines        4948     4961      +13     
  Branches      909      913       +4     
==========================================
+ Hits         4591     4604      +13     
+ Misses        232      231       -1     
- Partials      125      126       +1     
Flag Coverage Δ
C 92.80% <88.88%> (+0.01%) ⬆️
python 95.93% <88.88%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
tsinfer/algorithm.py 98.29% <40.00%> (+<0.01%) ⬆️
tsinfer/inference.py 98.61% <100.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d74f0e1...af69e48. Read the comment docs.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 9, 2020

I have now updated this to bound both the recombination probabilities and mismatch probabilities at 0.5, as discussed in #403 (comment). For our own reference, I have written a reasonably extensive discussion of the parameterisations & meaning of the arrays in the docstrings of the Matcher class, which are not exposed publicly. I guess that's OK @jeromekelleher ?

This should fix #398

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Needs some tests and minor clarifications, but I think it's basically there.

docs/inference.rst Show resolved Hide resolved
``recombination`` probabilities measure the probability of a recombination event
between adjacent inference sites, used to calculate the HMM transition probabilities
in the L&S-like matching algorithm. When matching a haplotype against the ancestor
in the immediately previous generation, ``recombination`` probabilities should reach
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"generation" is a bit misleading here - we don't have any concept of time and I think this will confuse people. A recombination proba of > 0.5 is always a weird thing, for example when dealing with LD calculations. This is a pretty standard issue when thinking about these extreme values for probability values in mutation and recombination processes. I think we can just say something like "Note that values > 0.5 for the recombination and mutation parameters will likely lead to pathological behaviour - for example, a mismatch probability of 1 means that a mismatch is required at every site.".

But nailing these parameters down as probabilities is very valuable and a great clarification!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I suspect they are actually likelihoods (because they are conditional on a hypothesis, that this is the actual hidden state), but since everyone calls them probabilities, I guess we might as well stick with that.

tsinfer/inference.py Outdated Show resolved Hide resolved
tsinfer/inference.py Show resolved Hide resolved
@hyanwong
Copy link
Member Author

Looks great! Needs some tests and minor clarifications, but I think it's basically there.

Great. Do we want an assert (perhaps only in the python version of the algorithm) that we have only 2 alleles, so that we don't forget we need to change the emission probs for > 2?

@jeromekelleher
Copy link
Member

I'll follow up with something to take care of that (#415)

@hyanwong hyanwong force-pushed the correct-recombination-probabilities branch from e6666a3 to 9d276d6 Compare December 11, 2020 00:11
@hyanwong hyanwong marked this pull request as ready for review December 11, 2020 11:24
@hyanwong hyanwong force-pushed the correct-recombination-probabilities branch from 9d276d6 to af69e48 Compare December 11, 2020 11:47
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mergify mergify bot merged commit 8bba9ca into tskit-dev:main Dec 11, 2020
@hyanwong hyanwong deleted the correct-recombination-probabilities branch December 11, 2020 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert cM to probability of recombination
2 participants