Skip to content

Conversation

@jeromekelleher
Copy link
Member

The roots of a tree are currently defined as the set of unique path ends starting from samples. This is unhelpful in some cases with lots of missing data and we want to get the "real" roots, which actually subtend some topology.

This makes the concept of a root more flexible, and changes it to "the set of unique path ends starting from samples, that subtend at least k samples".

Still a WIP, I'll document more when the plumbing is in place.

@codecov
Copy link

codecov bot commented Feb 7, 2020

Codecov Report

Merging #462 into master will decrease coverage by 0.04%.
The diff coverage is 94.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #462      +/-   ##
==========================================
- Coverage   87.32%   87.28%   -0.05%     
==========================================
  Files          21       21              
  Lines       15423    15452      +29     
  Branches     2998     3004       +6     
==========================================
+ Hits        13468    13487      +19     
- Misses        969      974       +5     
- Partials      986      991       +5
Flag Coverage Δ
#c_tests 88.5% <94.46%> (-0.05%) ⬇️
#python_c_tests 90.39% <95.65%> (-0.06%) ⬇️
#python_tests 99.18% <100%> (-0.07%) ⬇️
Impacted Files Coverage Δ
c/tskit/haplotype_matching.c 90.24% <100%> (ø) ⬆️
c/tskit/stats.c 71.48% <100%> (ø) ⬆️
python/tskit/trees.py 98.6% <100%> (-0.17%) ⬇️
c/tskit/trees.c 90.93% <94.29%> (-0.15%) ⬇️
python/_tskitmodule.c 83.85% <94.44%> (-0.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 143a348...b4b81ea. Read the comment docs.

@petrelharp
Copy link
Contributor

It makes sense to be able to get roots without also getting the missing data. Here's my suggestion for the API. I think the way to deal with this is to add an argument, remove_missing, defaulting to False, that if True will omit the missing data from the calculation. Then this argument could be added to the (possibly many) other methods needing such an option. Under the hood, it could work by this "at least k samples" thing. But it'd be nice to have a unified interface for dealing with missingness.

Why default to False? It's a general principle that people should consciously make the choice to remove missing data, and not have it be hidden. Otherwise, it's too easy to miss possible problems with the data. This is by analogy to R, in which for instance if you want to take the mean of a vector with NAs in it, you do mean(x, na.rm=TRUE).

@jeromekelleher
Copy link
Member Author

Good call on the general missing data API @petrelharp, that's an excellent idea. I'll open an issue to track the idea once this PR has been sorted out.

@jeromekelleher
Copy link
Member Author

I could use some input here please @petrelharp.

As the new root tracking code requires sample counts, there doesn't seem like much point in maintaining the old TSK_SAMPLE_COUNTS option for trees. So, I added a new flag TSK_NO_SAMPLE_COUNTS which means that root tracking (and sample counts) is on by default, but if TSK_NO_SAMPLE_COUNTS is set, then we don't count samples and also don't track roots. So, left_root will always be -1. This seems fine, as there are times when you just want to do upward traversals, and the root tracking/sample counting code does cost something. If C API users knowingly turn it off, then I think that's OK. We won't break anyone's C code with this, as roots were always tracked before and now a new option is required to turn it off.

There's a problem with the Python API though. Currently it's set to that when sample_counts=False, we won't track roots and therefore the nodes iterator will be empty for any trees in the tree sequence. This is a (probably silent) breaking change, because if someone has specified sample_counts=False in their code to try and make it faster and they're doing tree traversals, then suddenly they won't be visiting any nodes.

I doubt there's many (if any) people using sample_counts=False, but still this seems like an evil way to break someone's code. I can think of a few options:

  1. Always count samples/track roots and make sample_counts option a no-op in Python. (This really wouldn't affect anyone I think: if you're doing traversals in Python, you're not working with big trees and therefore there'll be no perf difference.)

  2. Add a new keyword track_roots=True, and only turn off sample counting if this is False (again, ignoring the sample_counts option)

  3. Keep the semantics I have now, bump the version number to 0.3.0, note this as a breaking change in the changelog and put in various warnings in the docs.

I'm leaning towards (1) I think. The low-level tree options really only matter when you're doing stuff in C, where we want this kind of control. It makes no difference when working with the trees in Python.

@petrelharp
Copy link
Contributor

I'm on board with (1) as well. It's a bit confusing already figuring out which options to turn on or off to get the info you want with trees, and this would make it simpler, which is good. (I assume we'd deprecate the sample_counts option as well?)

@jeromekelleher
Copy link
Member Author

I'm on board with (1) as well. It's a bit confusing already figuring out which options to turn on or off to get the info you want with trees, and this would make it simpler, which is good. (I assume we'd deprecate the sample_counts option as well?)

OK, good. Yes, we'd deprecate it (probably throw a warning, I guess).

After deprecating SAMPLE_COUNTS.
@jeromekelleher
Copy link
Member Author

OK, merging. sample_counts is now a deprecated no-op in the Python API.

@jeromekelleher jeromekelleher merged commit 693c047 into tskit-dev:master Feb 12, 2020
@jeromekelleher jeromekelleher deleted the root-threshold branch February 12, 2020 18:33
@hyanwong
Copy link
Member

@jeromekelleher is there a way in python to set a default root_threshold for a whole ts, not just a single tree? I.e. when set, any tree returned from that ts would have the default root_threshold set. It would be very useful to me, and I can't see an option at the moment...

@jeromekelleher
Copy link
Member Author

@jeromekelleher is there a way in python to set a default root_threshold for a whole ts, not just a single tree? I.e. when set, any tree returned from that ts would have the default root_threshold set. It would be very useful to me, and I can't see an option at the moment...

I don't think we want to set this as a property of the ts, it seems like a tricky bit of state to be adding that could easily catch people out. I didn't add a root_threshold parameter to the trees iterator because this seemed like quite a niche application and we should use it for a bit before putting it into the main interface.

@hyanwong
Copy link
Member

OK, thanks. Now I think about it, you are 100% right about adding state to the TS. But an addition to the trees() iterator would be useful, eventually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants