-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial extend #675
Initial extend #675
Conversation
6fe7d97
to
3cc5157
Compare
Codecov Report
@@ Coverage Diff @@
## main #675 +/- ##
==========================================
+ Coverage 93.40% 93.50% +0.10%
==========================================
Files 17 17
Lines 5354 5469 +115
Branches 984 1002 +18
==========================================
+ Hits 5001 5114 +113
- Misses 233 234 +1
- Partials 120 121 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
9fb1774
to
b299d66
Compare
b299d66
to
c86bef9
Compare
170af0a
to
61af151
Compare
Adding this to the 0.3.0 milestone as I'd like to get it in as undocumented functionality to enable some stuff downstream. |
a9a81d0
to
25bb6e6
Compare
I think this is ready to go in, so good to get some eyes on it. The basic idea is that this provides the basic infrastructure needed to do some inferences I'm working on, and which seem to be working reasonably well now. Most of the code is for working around awkward annoying problems which will hopefully go away in the 0.4.0 version of tsinfer with the more flexible API. So, basically I'd like to get this in as an undocumented API to enable some upstream code, but aim to remove it for 0.4.0 when the new APIs will make it unnecessary. @hyanwong - would you mind taking a quick look through please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite following the general gist of this PR (it looks like you are adding the "exemplar" sample nodes as ancestors into the and_ts, right?), but I'm happy to take it on trust, especially if it is a temporary fix for something that can be done properly later. Perhaps you can talk me through the general idea in person, @jeromekelleher ?
@@ -2179,3 +2298,93 @@ def minimise(ts): | |||
filter_individuals=False, | |||
filter_populations=False, | |||
) | |||
|
|||
|
|||
def solve_num_mismatches(ts, k): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite following here, but I think that calculating the low-level recombination probability depends on the distance between the two sites, and I can't see that being used here. That might be OK for your usage, however.
@@ -37,6 +37,9 @@ | |||
# file | |||
NODE_IS_HISTORICAL_SAMPLE = 1 << 20 | |||
|
|||
# Undocumented. | |||
NODE_IS_IDENTICAL_SAMPLE_ANCESTOR = 1 << 21 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we call these "proxy ancestors" in other code (i.e. an ancestor identical to a sample inserted a little earlier in time, to help matching)
raise ValueError( | ||
f"Mismatched time_units: {time_units} != ancestors_ts.time_units", | ||
) | ||
# Add in the existing haplotypes. Note - this will probably |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At scale, we probably don't want to match against the sample haplotypes in the next round of extend, if there is a simple "exemplar" we can match against instead. But there's no reason why we can't have these in the actual TS. Is there an argument here for a flag "NODE_HAS_OLDER_PROXY", which would simply cause the matching algorithm to skip matching against those nodes, since we are guaranteed that we can match against an older haplotype anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all LGTM, and given that's it's not documented, I think we should merge and released tsinfer 0.3. We can always improve iteratively, like the NODE_HAS_OLDER_PROXY
flag I suggested above.
My main reservation is actually the naming. I was originally confused by the term "extend", which could be tend to mean extending the ts along the genome. I'm not sure it's immediately obvious what "extending an existing inferred TS" means. Can we think of alternative terminology, perhaps? This needn't stop merging, however.
In the easy case where the set of sites is the same
25bb6e6
to
dcf7939
Compare
Great, thanks @hyanwong. I've updated the documentation a bit. I agree on the terminology, it's not great. It's not worth changing now though, as I think it'll be subsumed in to the more general "match" operation that's we're hoping to use eventually. Also agreed on the exemplars thing, we'll want to figure out a better way of doing that in general. |
Early WIP, not ready for comments.