Initial extend #675

jeromekelleher · 2022-06-10T14:42:43Z

Early WIP, not ready for comments.

codecov · 2022-06-16T14:05:19Z

Codecov Report

Merging #675 (dcf7939) into main (1daa0d0) will increase coverage by 0.10%.
The diff coverage is 97.52%.

@@            Coverage Diff             @@
##             main     #675      +/-   ##
==========================================
+ Coverage   93.40%   93.50%   +0.10%     
==========================================
  Files          17       17              
  Lines        5354     5469     +115     
  Branches      984     1002      +18     
==========================================
+ Hits         5001     5114     +113     
- Misses        233      234       +1     
- Partials      120      121       +1

| Flag | Coverage Δ | |
|---|---|---|
| C | `93.50% <97.52%> (+0.10%)` | ⬆️ |
| python | `96.63% <97.52%> (+0.05%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| tsinfer/inference.py | `98.71% <97.50%> (-0.06%)` | ⬇️ |
| tsinfer/constants.py | `100.00% <100.00%> (ø)` | |

jeromekelleher · 2022-10-12T18:40:14Z

Adding this to the 0.3.0 milestone as I'd like to get it in as undocumented functionality to enable some stuff downstream.

jeromekelleher · 2022-10-13T08:52:47Z

I think this is ready to go in, so it would be good to get some eyes on it. This provides the basic infrastructure needed for some inferences I'm working on, which seem to be working reasonably well now. Most of the code works around awkward problems that will hopefully go away in tsinfer 0.4.0 with its more flexible API.

So, basically I'd like to get this in as an undocumented API to enable some downstream code, but aim to remove it for 0.4.0 when the new APIs will make it unnecessary.

@hyanwong - would you mind taking a quick look through please?

hyanwong

I'm not quite following the general gist of this PR (it looks like you are adding the "exemplar" sample nodes as ancestors into the ancestors_ts, right?), but I'm happy to take it on trust, especially if it is a temporary fix for something that can be done properly later. Perhaps you can talk me through the general idea in person, @jeromekelleher?

hyanwong · 2022-10-17T09:22:30Z

tsinfer/inference.py

@@ -2179,3 +2298,93 @@ def minimise(ts):
        filter_individuals=False,
        filter_populations=False,
    )
+
+
+def solve_num_mismatches(ts, k):


I'm not quite following here, but I think that calculating the low-level recombination probability depends on the distance between the two sites, and I can't see that being used here. That might be OK for your usage, however.
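To make the distance dependence mentioned here concrete, below is a minimal sketch of one standard way to turn the spacing between adjacent sites into a recombination probability, using Haldane's mapping function. The function name, the uniform `rate` parameter, and the choice of mapping function are assumptions for illustration; this is not the code of `solve_num_mismatches` in this PR.

```python
import math


def recombination_probabilities(positions, rate):
    """
    Probability of recombination between each site and the one before it.

    Assumes a uniform per-base recombination rate and Haldane's mapping
    function r = (1 - exp(-2d)) / 2, where d is the genetic distance in
    Morgans. The first site has no predecessor, so its probability is 0.
    """
    probs = [0.0]
    for left, right in zip(positions, positions[1:]):
        d = rate * (right - left)  # expected number of crossovers
        probs.append((1 - math.exp(-2 * d)) / 2)
    return probs
```

With this form, widely spaced sites get a higher probability than closely spaced ones, and the value never reaches 0.5.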

hyanwong · 2022-10-24T08:48:44Z

tsinfer/constants.py

@@ -37,6 +37,9 @@
 # file
 NODE_IS_HISTORICAL_SAMPLE = 1 << 20

+# Undocumented.
+NODE_IS_IDENTICAL_SAMPLE_ANCESTOR = 1 << 21


I think we call these "proxy ancestors" in other code (i.e. an ancestor identical to a sample inserted a little earlier in time, to help matching)
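As a rough illustration of the "proxy ancestor" idea described here, the sketch below copies a sample's haplotype into a new node placed slightly further back in time and marks it with the new flag. The `Node` class, `make_proxy`, `is_proxy`, and the `time_increment` argument are hypothetical stand-ins rather than tsinfer or tskit APIs, and time is assumed to be measured as time ago, so an older node has a larger time value.

```python
from dataclasses import dataclass

# Values copied from the tsinfer/constants.py hunk above.
NODE_IS_HISTORICAL_SAMPLE = 1 << 20
NODE_IS_IDENTICAL_SAMPLE_ANCESTOR = 1 << 21


@dataclass
class Node:  # hypothetical stand-in, not a tskit or tsinfer class
    time: float
    flags: int
    haplotype: tuple


def make_proxy(sample, time_increment):
    """Return a copy of `sample` that is slightly older and flagged as a proxy."""
    return Node(
        time=sample.time + time_increment,
        flags=NODE_IS_IDENTICAL_SAMPLE_ANCESTOR,
        haplotype=sample.haplotype,
    )


def is_proxy(node):
    """Test the flag bit; other bits set on the node are ignored."""
    return bool(node.flags & NODE_IS_IDENTICAL_SAMPLE_ANCESTOR)
```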

hyanwong · 2022-10-24T09:35:28Z

tsinfer/inference.py

+                raise ValueError(
+                    f"Mismatched time_units: {time_units} != ancestors_ts.time_units",
+                )
+            # Add in the existing haplotypes. Note - this will probably


At scale, we probably don't want to match against the sample haplotypes in the next round of extend, if there is a simple "exemplar" we can match against instead. But there's no reason why we can't have these in the actual TS. Is there an argument here for a flag "NODE_HAS_OLDER_PROXY", which would simply cause the matching algorithm to skip matching against those nodes, since we are guaranteed that we can match against an older haplotype anyway?
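The suggested flag could be used roughly as follows. This is only a sketch of the proposal: `NODE_HAS_OLDER_PROXY` does not exist in tsinfer, the bit position is an arbitrary assumption, and `matchable_nodes` is a hypothetical helper operating on a plain list of integer node flags.

```python
# Hypothetical: this flag was only proposed in review; bit 22 is an arbitrary choice.
NODE_HAS_OLDER_PROXY = 1 << 22


def matchable_nodes(node_flags):
    """Return the IDs of nodes that the matcher should still consider."""
    return [
        node_id
        for node_id, flags in enumerate(node_flags)
        if not flags & NODE_HAS_OLDER_PROXY
    ]
```

Nodes carrying the flag stay in the tree sequence; they are only left out of the set of haplotypes that later rounds match against.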

hyanwong

This all LGTM, and given that it's not documented, I think we should merge and release tsinfer 0.3. We can always improve iteratively, for example with the NODE_HAS_OLDER_PROXY flag I suggested above.

My main reservation is actually the naming. I was originally confused by the term "extend", which could be taken to mean extending the ts along the genome. I'm not sure it's immediately obvious what "extending an existing inferred TS" means. Can we think of alternative terminology, perhaps? This needn't stop merging, however.

jeromekelleher · 2022-10-24T09:45:47Z

Great, thanks @hyanwong. I've updated the documentation a bit.

I agree on the terminology, it's not great. It's not worth changing now though, as I think it'll be subsumed into the more general "match" operation that we're hoping to use eventually.

Also agreed on the exemplars thing, we'll want to figure out a better way of doing that in general.

jeromekelleher force-pushed the initial-extend branch from 6fe7d97 to 3cc5157 Compare June 15, 2022 14:19

jeromekelleher force-pushed the initial-extend branch 3 times, most recently from 9fb1774 to b299d66 Compare June 17, 2022 15:11

jeromekelleher force-pushed the initial-extend branch from b299d66 to c86bef9 Compare June 27, 2022 14:26

jeromekelleher force-pushed the initial-extend branch from 170af0a to 61af151 Compare September 30, 2022 15:43

jeromekelleher mentioned this pull request Oct 6, 2022

Add an "ancestral_state_index" parameter to SampleData.add_sites()? #686

Closed

jeromekelleher added this to the Release 0.3.0 milestone Oct 12, 2022

jeromekelleher force-pushed the initial-extend branch from a9a81d0 to 25bb6e6 Compare October 13, 2022 08:47

jeromekelleher marked this pull request as ready for review October 13, 2022 08:47

jeromekelleher mentioned this pull request Oct 17, 2022

Fix HMM for > 2 alleles #437

Open

hyanwong approved these changes Oct 17, 2022

hyanwong reviewed Oct 24, 2022

hyanwong approved these changes Oct 24, 2022

jeromekelleher added 6 commits October 24, 2022 10:40

Experimental undocumented "extend" operation

62f7229

Try out new parameterisation from Duncan

3c7d229

Add ability to extend from given starting point

255a75e

In the easy case where the set of sites is the same

Add support for custom time-increment

c3c9cb3

Add support for time_units

0aa374d

Let downstream code handle provenance

dcf7939

jeromekelleher force-pushed the initial-extend branch from 25bb6e6 to dcf7939 Compare October 24, 2022 09:44

jeromekelleher merged commit 910e0eb into tskit-dev:main Oct 24, 2022

jeromekelleher deleted the initial-extend branch October 24, 2022 10:01
