Correct recombination probabilities #404
Codecov Report
@@ Coverage Diff @@
## main #404 +/- ##
==========================================
+ Coverage 92.78% 92.80% +0.01%
==========================================
Files 17 17
Lines 4948 4961 +13
Branches 909 913 +4
==========================================
+ Hits 4591 4604 +13
+ Misses 232 231 -1
- Partials 125 126 +1
f089107 to e6666a3
I have now updated this to bound both the recombination probabilities and mismatch probabilities at 0.5, as discussed in #403 (comment). For our own reference, I have written a reasonably extensive discussion of the parameterisations & meaning of the arrays in the docstrings of the
This should fix #398
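As a rough illustration of what bounding at 0.5 means, here is a minimal sketch. The helper name `bound_probabilities` and its `upper` argument are hypothetical and are not taken from tsinfer; the real change in this PR may compute the bound differently.

```python
def bound_probabilities(probs, upper=0.5):
    """Clamp each probability into [0, upper].

    Hypothetical helper for illustration only; not part of the tsinfer API.
    """
    bounded = []
    for p in probs:
        if p < 0:
            raise ValueError(f"Probabilities must be non-negative, got {p}")
        # Values above the bound are capped rather than rejected.
        bounded.append(min(p, upper))
    return bounded


# Per-site-interval recombination probabilities, one of which is too large.
recombination = [0.0, 0.1, 0.5, 0.9]
print(bound_probabilities(recombination))
```

Capping rather than raising an error is only one possible design; it keeps extreme inputs usable while avoiding values above 0.5.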
Looks great! Needs some tests and minor clarifications, but I think it's basically there.
tsinfer/inference.py
Outdated
``recombination`` probabilities measure the probability of a recombination event
between adjacent inference sites, used to calculate the HMM transition probabilities
in the L&S-like matching algorithm. When matching a haplotype against the ancestor
in the immediately previous generation, ``recombination`` probabilities should reach
"generation" is a bit misleading here: we don't have any concept of time, and I think this will confuse people. A recombination probability > 0.5 is always a weird thing, for example when dealing with LD calculations. This is a pretty standard issue with extreme probability values in mutation and recombination processes. I think we can just say something like "Note that values > 0.5 for the recombination and mutation parameters will likely lead to pathological behaviour - for example, a mismatch probability of 1 means that a mismatch is required at every site.".
But nailing these parameters down as probabilities is very valuable and a great clarification!
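To make the pathological case concrete, here is a minimal sketch of a two-allele emission model in the usual Li & Stephens style. It assumes a match has likelihood `1 - mismatch` and a mismatch has likelihood `mismatch`. The function names are hypothetical, and this is not the tsinfer implementation.

```python
def emission_likelihood(observed, copied, mismatch):
    """Likelihood of one observed allele given the allele being copied.

    Assumes a simple two-allele model (an illustration, not tsinfer's code).
    """
    return 1.0 - mismatch if observed == copied else mismatch


def path_likelihood(haplotype, ancestor, mismatch):
    """Product of per-site emission likelihoods when copying from one ancestor."""
    likelihood = 1.0
    for observed, copied in zip(haplotype, ancestor):
        likelihood *= emission_likelihood(observed, copied, mismatch)
    return likelihood


haplotype = [0, 1, 1]
identical = [0, 1, 1]
opposite = [1, 0, 0]

# With mismatch = 1, an identical ancestor is impossible and only an
# ancestor that differs at every site has non-zero likelihood.
print(path_likelihood(haplotype, identical, 1.0))
print(path_likelihood(haplotype, opposite, 1.0))

# With mismatch = 0.5, the data cannot distinguish between the two ancestors.
print(path_likelihood(haplotype, identical, 0.5))
print(path_likelihood(haplotype, opposite, 0.5))
```

Under this assumed model, 0.5 is the point at which observations stop carrying information, and anything above it actively favours mismatching ancestors.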
Yep. I suspect they are actually likelihoods (because they are conditional on a hypothesis, that this is the actual hidden state), but since everyone calls them probabilities, I guess we might as well stick with that.
Great. Do we want an assert (perhaps only in the python version of the algorithm) that we have only 2 alleles, so that we don't forget we need to change the emission probs for > 2?
I'll follow up with something to take care of that (#415)
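A check of this kind could look roughly like the following sketch. The function name `check_biallelic` and the input layout (one tuple of allele strings per site) are assumptions made for illustration; the actual follow-up in #415 may take a different form.

```python
def check_biallelic(alleles_per_site):
    """Assert that no site carries more than two distinct alleles.

    Hypothetical helper; the input layout is assumed, not taken from tsinfer.
    """
    for site_id, alleles in enumerate(alleles_per_site):
        num_alleles = len(set(alleles))
        # The two-allele emission probabilities would need changing for > 2.
        assert num_alleles <= 2, (
            f"Site {site_id} has {num_alleles} alleles; "
            "emission probabilities assume at most 2"
        )


# Passes silently: every site has at most two distinct alleles.
check_biallelic([("A", "T"), ("C",), ("G", "A")])
```

Note that plain `assert` statements are skipped when Python runs with `-O`, so an explicit exception may be preferable if the check must always run.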
e6666a3 to 9d276d6
9d276d6 to af69e48
LGTM!
This is a template for the sort of thing I suggest we might use to fix #398. It still suffers from pathological probabilities with extreme values of e.g. mismatch_ratio, though. Some of this might need to be discussed in conjunction with #403.