-
Notifications
You must be signed in to change notification settings - Fork 78
Add is_isolated and document use #443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #443 +/- ##
==========================================
- Coverage 94.73% 86.81% -7.92%
==========================================
Files 17 20 +3
Lines 9662 14243 +4581
Branches 789 2769 +1980
==========================================
+ Hits 9153 12365 +3212
- Misses 477 964 +487
- Partials 32 914 +882
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. We should add an extra test in the tests/test_highlevel.py module. We don't need missing data: any node in the tree sequence that's not in tree.nodes() should be isolated. This can be done as part of the rest of the tree tests.
docs/data-model.rst
Outdated
| 4 above is still considered as being present in the tree. This means, for example, | ||
| that the node is still returned by the :meth:`Tree.nodes` and :meth:`Tree.samples` | ||
| methods. We can therefore identify isolated nodes, as well as nodes linked to the | ||
| main tree topology, when iterating using these methods: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is implying that we have one root: we often have many roots in forward simulations. Is there some way to rephrase that's not talking about "the tree topology"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I was trying to find the right wording for "nodes in one of the conventional trees". Do you think it's reasonable to call these "nodes in a topology" (without implying that there is only one topology per tree), so e.g.
We can therefore identify isolated nodes, as well as nodes in a topology,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's much point in talking about the negative --- isn't it enough to say that we're identifying the isolated nodes? The reader can reasonably infer that they also identify non-isolated nodes, which saves us getting into a confusing technical discussion of what non-isolated nodes are.
docs/data-model.rst
Outdated
|
|
||
| >>> [u for u in tree.nodes() if tree.is_isolated(u)] # list isolated_nodes | ||
| [4] | ||
| >>> [u for u in tree.nodes() if not tree.is_isolated(u)] # list nodes in topology |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably redundant, I think? These are non-isolated nodes, so pretty obvious, and avoids semantic issues with the description in your comment (see previous comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you mentioned in #442 (comment), it's possibly more useful to list the non-isolated nodes than the isolated ones. That's why I left this example in, really, even though it's a bit redundant. I could change he comment to simply "list non-isolated nodes" (although that makes it sound really redundant!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it is useful to point out that if someone wants to print out the topologically connected nodes, they can use the is_isolated method. It's not going to be obvious to most people who deal with trees that, when looking at trees in a tree sequence, there will be nodes which are not connected to the topology, but instead separate (isolated). Unless you know already about how missing data is encoded, this must come as a bit of a surprise. It certainly fooled me: I expected to be able to iterate over the nodes in a tree topology by doing tree.nodes(), and had to dig a bit to realise I had to actively exclude the "missing" ones.
Perhaps I'm confused by the work "missing". Nodes with "missing data" should not be thought of as nodes that are "missing" - they are still present in the output trees. It's simply that the STATE is missing (unknown).
All of which is to say that I find it useful to show an example of how to leave out "missing data" samples, so I've left the second example in, but commented that this gives "topologically connected nodes". Feel free to rephrase or delete as you see fit.
|
Still outstanding comments here @hyanwong, can you update please? I think we should keep the discussion to a minimum and not dwell too much on the relationship with missing data. |
43b3ce4 to
3a769d9
Compare
|
I think this can be merged @jeromekelleher - I wondered whether #462 made any difference to my additions to the "missing data" section, but I think not. However, we might want to add something in the missing data section about |
|
I think we probably want this function now that we've cleared up any ambiguity with the data model - would you mind bringing this up to date please @hyanwong? |
|
How is this? |
1549c69 to
db418b7
Compare
| self.assertTrue(0.0 < missing_to < 1.0) | ||
| return tables.tree_sequence(), missing_from, missing_to | ||
|
|
||
| def test_is_isolated(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is a very simple method - but it's always good to test a couple of error cases, e.g. a negative index, a index bigger than the number of nodes, an index of the wrong type, just to check we get the right error messages.
|
@benjeffery 's point about testing made me realise there is an ambiguity here. What about non-sample nodes? They are often missing from a particular tree. I had been assuming that the adjective "isolated" only ever applied to sample nodes, hence "isolated_as_missing", but I see that the test we use returns I'm conflicted here. It seems useful to be able to identify non-sample nodes that are missing from a tree, and to have a way to describe them. But using the term "isolated" to apply only to sample nodes makes it clearer what you are talking about, is less verbose, and also aligned with our much-debated |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor tweak and we're good to go.
I think retaining "isolated" to be a purely topological definition is correct, both samples and non-samples can be isolated. We want to retain a clear distinction with "missing" and saying only samples can be isolated muddies the waters for me. |
Agreed |
a142062 to
a19bdc6
Compare
|
@hyanwong Just one failing test from when non-samples couldn't be isolated, then I think this is good to go. |
a19bdc6 to
0406c0e
Compare
Oh yes, sorry! Forgot that I put that sentinel in to remind me that a decision needed to be made. Now fixed. |
0406c0e to
dea36c0
Compare
|
I think this is ready to merge @benjeffery ? |
dea36c0 to
06f07e2
Compare
benjeffery
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
@hyanwong Force pushed to resolve conflict |
06f07e2 to
f138cf5
Compare
Fixes #442