
Initial extend #675

Merged: jeromekelleher merged 6 commits into tskit-dev:main from initial-extend on Oct 24, 2022

Conversation

jeromekelleher (Member)

Early WIP, not ready for comments.

codecov bot commented Jun 16, 2022

Codecov Report

Merging #675 (dcf7939) into main (1daa0d0) will increase coverage by 0.10%.
The diff coverage is 97.52%.

@@            Coverage Diff             @@
##             main     #675      +/-   ##
==========================================
+ Coverage   93.40%   93.50%   +0.10%     
==========================================
  Files          17       17              
  Lines        5354     5469     +115     
  Branches      984     1002      +18     
==========================================
+ Hits         5001     5114     +113     
- Misses        233      234       +1     
- Partials      120      121       +1     
Flag      Coverage Δ
C         93.50% <97.52%> (+0.10%) ⬆️
python    96.63% <97.52%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files         Coverage Δ
tsinfer/inference.py   98.71% <97.50%> (-0.06%) ⬇️
tsinfer/constants.py   100.00% <100.00%> (ø)

jeromekelleher (Member, Author)

Adding this to the 0.3.0 milestone, as I'd like to get it in as undocumented functionality to enable some downstream work.

jeromekelleher marked this pull request as ready for review on October 13, 2022 at 08:47
jeromekelleher (Member, Author)

I think this is ready to go in, so it would be good to get some eyes on it. The basic idea is that this provides the infrastructure needed for some inferences I'm working on, which now seem to be working reasonably well. Most of the code is for working around awkward, annoying problems that should hopefully go away in tsinfer 0.4.0 with the more flexible API.

So, basically, I'd like to get this in as an undocumented API to enable some upstream code, but aim to remove it in 0.4.0 when the new APIs make it unnecessary.

@hyanwong - would you mind taking a quick look through please?

hyanwong (Member) left a comment:

I'm not quite following the general gist of this PR (it looks like you are adding the "exemplar" sample nodes as ancestors into the ancestors_ts, right?), but I'm happy to take it on trust, especially if it's a temporary fix for something that can be done properly later. Perhaps you can talk me through the general idea in person, @jeromekelleher?

tsinfer/inference.py (review comment, resolved)
tsinfer/inference.py (review comment, outdated, resolved)
@@ -2179,3 +2298,93 @@ def minimise(ts):
filter_individuals=False,
filter_populations=False,
)


def solve_num_mismatches(ts, k):
hyanwong (Member) commented:

I'm not quite following here, but I think that calculating the low-level recombination probability depends on the distance between the two sites, and I can't see that being used here. That might be OK for your usage, however.
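
For readers following along, here is a minimal sketch of the relationship under discussion; it is an illustration under assumed names, not the code added in this PR. It assumes a Li & Stephens style model in which one recombination should cost the same as k mismatches, and contrasts that with a distance-aware recombination probability of the kind mentioned above.

import numpy as np

def solve_num_mismatches_sketch(rho, k):
    # Choose a per-site mismatch probability mu so that k mismatches have
    # the same likelihood cost as a single recombination event with
    # probability rho, i.e. mu ** k == rho.
    assert k >= 1 and 0 < rho < 1
    return rho ** (1 / k)

def distance_aware_rho(positions, recombination_rate):
    # The point raised above: the probability of recombining between
    # adjacent sites usually scales with the gap between them,
    # e.g. rho_j = 1 - exp(-r * d_j) for gap d_j.
    gaps = np.diff(positions)
    return 1 - np.exp(-recombination_rate * gaps)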

@@ -37,6 +37,9 @@
# file
NODE_IS_HISTORICAL_SAMPLE = 1 << 20

# Undocumented.
NODE_IS_IDENTICAL_SAMPLE_ANCESTOR = 1 << 21
hyanwong (Member) commented:

I think we call these "proxy ancestors" in other code (i.e. an ancestor identical to a sample inserted a little earlier in time, to help matching)
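
For reference, these constants are plain bit flags stored on tskit nodes, so downstream code can test for them with a bitwise AND. A minimal sketch (the flag value is copied from the diff above; the helper function is made up for illustration):

import tskit

NODE_IS_IDENTICAL_SAMPLE_ANCESTOR = 1 << 21  # value from the diff above

def identical_sample_ancestors(ts: tskit.TreeSequence):
    # Yield the IDs of nodes carrying the new, undocumented flag.
    for node in ts.nodes():
        if node.flags & NODE_IS_IDENTICAL_SAMPLE_ANCESTOR:
            yield node.id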

raise ValueError(
f"Mismatched time_units: {time_units} != ancestors_ts.time_units",
)
# Add in the existing haplotypes. Note - this will probably
hyanwong (Member) commented:

At scale, we probably don't want to match against the sample haplotypes in the next round of extend, if there is a simple "exemplar" we can match against instead. But there's no reason why we can't have these in the actual TS. Is there an argument here for a flag "NODE_HAS_OLDER_PROXY", which would simply cause the matching algorithm to skip matching against those nodes, since we are guaranteed that we can match against an older haplotype anyway?
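
To make that suggestion concrete, here is a sketch of what the skip might look like. NODE_HAS_OLDER_PROXY is hypothetical and not part of this PR, and the flag value chosen here is arbitrary:

NODE_HAS_OLDER_PROXY = 1 << 22  # hypothetical flag, not in this PR

def candidate_match_nodes(ts):
    # Exclude sample haplotypes that have an older, identical proxy
    # ancestor: matching against the proxy is guaranteed to be at least
    # as good, so these nodes can be skipped in the next round of extend.
    return [
        node.id for node in ts.nodes()
        if not (node.flags & NODE_HAS_OLDER_PROXY)
    ]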

hyanwong (Member) left a comment:

This all LGTM, and given that it's not documented, I think we should merge and release tsinfer 0.3. We can always improve iteratively, for example with the NODE_HAS_OLDER_PROXY flag I suggested above.

My main reservation is actually the naming. I was originally confused by the term "extend", which could be taken to mean extending the tree sequence along the genome, and I'm not sure it's immediately obvious what "extending an existing inferred tree sequence" means. Can we think of alternative terminology, perhaps? This needn't stop us merging, however.

jeromekelleher (Member, Author)

Great, thanks @hyanwong. I've updated the documentation a bit.

I agree on the terminology, it's not great. It's not worth changing now, though, as I think it'll be subsumed into the more general "match" operation that we're hoping to use eventually.

Also agreed on the exemplars thing, we'll want to figure out a better way of doing that in general.

jeromekelleher merged commit 910e0eb into tskit-dev:main on Oct 24, 2022
jeromekelleher deleted the initial-extend branch on October 24, 2022 at 10:01