More flexible definition of a tree root #462

jeromekelleher · 2020-02-07T17:51:12Z

The roots of a tree are currently defined as the set of unique path ends starting from samples. This is unhelpful in some cases with lots of missing data and we want to get the "real" roots, which actually subtend some topology.

This makes the concept of a root more flexible, and changes it to "the set of unique path ends starting from samples, that subtend at least k samples".

Still a WIP, I'll document more when the plumbing is in place.

codecov · 2020-02-07T18:10:36Z

Codecov Report

Merging #462 into master will decrease coverage by 0.04%.
The diff coverage is 94.46%.

@@            Coverage Diff             @@
##           master     #462      +/-   ##
==========================================
- Coverage   87.32%   87.28%   -0.05%     
==========================================
  Files          21       21              
  Lines       15423    15452      +29     
  Branches     2998     3004       +6     
==========================================
+ Hits        13468    13487      +19     
- Misses        969      974       +5     
- Partials      986      991       +5

Flag	Coverage Δ
#c_tests	`88.5% <94.46%> (-0.05%)`	⬇️
#python_c_tests	`90.39% <95.65%> (-0.06%)`	⬇️
#python_tests	`99.18% <100%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
c/tskit/haplotype_matching.c	`90.24% <100%> (ø)`	⬆️
c/tskit/stats.c	`71.48% <100%> (ø)`	⬆️
python/tskit/trees.py	`98.6% <100%> (-0.17%)`	⬇️
c/tskit/trees.c	`90.93% <94.29%> (-0.15%)`	⬇️
python/_tskitmodule.c	`83.85% <94.44%> (-0.02%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 143a348...b4b81ea. Read the comment docs.

petrelharp · 2020-02-09T17:42:11Z

It makes sense to be able to get roots without also getting the missing data. Here's my suggestion for the API. I think the way to deal with this is to add an argument, remove_missing, defaulting to False, that if True will omit the missing data from the calculation. Then this argument could be added to the (possibly many) other methods needing such an option. Under the hood, it could work by this "at least k samples" thing. But it'd be nice to have a unified interface for dealing with missingness.

Why default to False? It's a general principle that people should consciously make the choice to remove missing data, and not have it be hidden. Otherwise, it's too easy to miss possible problems with the data. This is by analogy to R, in which for instance if you want to take the mean of a vector with NAs in it, you do mean(x, na.rm=TRUE).

jeromekelleher · 2020-02-10T13:48:06Z

Good call on the general missing data API @petrelharp, that's an excellent idea. I'll open an issue to track the idea once this PR has been sorted out.

jeromekelleher · 2020-02-11T15:32:39Z

I could use some input here please @petrelharp.

As the new root tracking code requires sample counts, there doesn't seem like much point in maintaining the old TSK_SAMPLE_COUNTS option for trees. So, I added a new flag TSK_NO_SAMPLE_COUNTS which means that root tracking (and sample counts) is on by default, but if TSK_NO_SAMPLE_COUNTS is set, then we don't count samples and also don't track roots. So, left_root will always be -1. This seems fine, as there are times when you just want to do upward traversals, and the root tracking/sample counting code does cost something. If C API users knowingly turn it off, then I think that's OK. We won't break anyone's C code with this, as roots were always tracked before and now a new option is required to turn it off.

There's a problem with the Python API though. Currently it's set to that when sample_counts=False, we won't track roots and therefore the nodes iterator will be empty for any trees in the tree sequence. This is a (probably silent) breaking change, because if someone has specified sample_counts=False in their code to try and make it faster and they're doing tree traversals, then suddenly they won't be visiting any nodes.

I doubt there's many (if any) people using sample_counts=False, but still this seems like an evil way to break someone's code. I can think of a few options:

Always count samples/track roots and make sample_counts option a no-op in Python. (This really wouldn't affect anyone I think: if you're doing traversals in Python, you're not working with big trees and therefore there'll be no perf difference.)
Add a new keyword track_roots=True, and only turn off sample counting if this is False (again, ignoring the sample_counts option)
Keep the semantics I have now, bump the version number to 0.3.0, note this as a breaking change in the changelog and put in various warnings in the docs.

I'm leaning towards (1) I think. The low-level tree options really only matter when you're doing stuff in C, where we want this kind of control. It makes no difference when working with the trees in Python.

petrelharp · 2020-02-11T17:22:27Z

I'm on board with (1) as well. It's a bit confusing already figuring out which options to turn on or off to get the info you want with trees, and this would make it simpler, which is good. (I assume we'd deprecate the sample_counts option as well?)

jeromekelleher · 2020-02-11T18:07:58Z

I'm on board with (1) as well. It's a bit confusing already figuring out which options to turn on or off to get the info you want with trees, and this would make it simpler, which is good. (I assume we'd deprecate the sample_counts option as well?)

OK, good. Yes, we'd deprecate it (probably throw a warning, I guess).

Also refactored some high-level tests.

After deprecating SAMPLE_COUNTS.

jeromekelleher · 2020-02-12T18:33:41Z

OK, merging. sample_counts is now a deprecated no-op in the Python API.

hyanwong · 2020-02-14T10:41:34Z

@jeromekelleher is there a way in python to set a default root_threshold for a whole ts, not just a single tree? I.e. when set, any tree returned from that ts would have the default root_threshold set. It would be very useful to me, and I can't see an option at the moment...

jeromekelleher · 2020-02-17T09:30:09Z

@jeromekelleher is there a way in python to set a default root_threshold for a whole ts, not just a single tree? I.e. when set, any tree returned from that ts would have the default root_threshold set. It would be very useful to me, and I can't see an option at the moment...

I don't think we want to set this as a property of the ts, it seems like a tricky bit of state to be adding that could easily catch people out. I didn't add a root_threshold parameter to the trees iterator because this seemed like quite a niche application and we should use it for a bit before putting it into the main interface.

hyanwong · 2020-02-17T11:04:30Z

OK, thanks. Now I think about it, you are 100% right about adding state to the TS. But an addition to the trees() iterator would be useful, eventually.

jeromekelleher added 5 commits February 12, 2020 15:21

Move tsk_tree_kc_distance to correct location.

67cd714

Basic algorithm for defined root threshold.

e4eb253

Add initial support for root threshold in C.

2158b58

Basic Python interface for root threshold.

2ad7975

Also refactored some high-level tests.

Changed to TSK_NO_SAMPLE_COUNTS

86d27ba

jeromekelleher force-pushed the root-threshold branch from 5736b64 to d0a406c Compare February 12, 2020 15:21

Deprecated sample_counts.

d3a76ad

jeromekelleher force-pushed the root-threshold branch from d0a406c to dab6d08 Compare February 12, 2020 17:06

Bumped the C API version number.

b4b81ea

After deprecating SAMPLE_COUNTS.

jeromekelleher force-pushed the root-threshold branch from dab6d08 to b4b81ea Compare February 12, 2020 17:30

jeromekelleher merged commit 693c047 into tskit-dev:master Feb 12, 2020

jeromekelleher deleted the root-threshold branch February 12, 2020 18:33

hyanwong mentioned this pull request Feb 14, 2020

Add is_isolated and document use #443

Merged

hyanwong mentioned this pull request Feb 15, 2020

Simplify root iterations & ts checking using new root_threshold tskit-dev/tsdate#96

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

More flexible definition of a tree root #462

More flexible definition of a tree root #462

Uh oh!

jeromekelleher commented Feb 7, 2020

Uh oh!

codecov bot commented Feb 7, 2020 •

edited

Loading

Uh oh!

petrelharp commented Feb 9, 2020

Uh oh!

jeromekelleher commented Feb 10, 2020

Uh oh!

jeromekelleher commented Feb 11, 2020

Uh oh!

petrelharp commented Feb 11, 2020

Uh oh!

jeromekelleher commented Feb 11, 2020

Uh oh!

jeromekelleher commented Feb 12, 2020

Uh oh!

hyanwong commented Feb 14, 2020

Uh oh!

jeromekelleher commented Feb 17, 2020

Uh oh!

hyanwong commented Feb 17, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

More flexible definition of a tree root #462

More flexible definition of a tree root #462

Uh oh!

Conversation

jeromekelleher commented Feb 7, 2020

Uh oh!

codecov bot commented Feb 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

petrelharp commented Feb 9, 2020

Uh oh!

jeromekelleher commented Feb 10, 2020

Uh oh!

jeromekelleher commented Feb 11, 2020

Uh oh!

petrelharp commented Feb 11, 2020

Uh oh!

jeromekelleher commented Feb 11, 2020

Uh oh!

jeromekelleher commented Feb 12, 2020

Uh oh!

hyanwong commented Feb 14, 2020

Uh oh!

jeromekelleher commented Feb 17, 2020

Uh oh!

hyanwong commented Feb 17, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Feb 7, 2020 •

edited

Loading