Skip to content

Conversation

@hyanwong
Copy link
Member

@hyanwong hyanwong commented Dec 5, 2019

Fixes #442

@codecov
Copy link

codecov bot commented Dec 5, 2019

Codecov Report

Merging #443 into master will decrease coverage by 7.91%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #443      +/-   ##
==========================================
- Coverage   94.73%   86.81%   -7.92%     
==========================================
  Files          17       20       +3     
  Lines        9662    14243    +4581     
  Branches      789     2769    +1980     
==========================================
+ Hits         9153    12365    +3212     
- Misses        477      964     +487     
- Partials       32      914     +882     
Flag Coverage Δ
#c_tests 87.90% <100.00%> (?)
#python_c_tests 90.65% <100.00%> (?)
#python_tests 99.24% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/trees.py 98.76% <100.00%> (+0.62%) ⬆️
python/_tskitmodule.c 83.80% <0.00%> (-7.54%) ⬇️
python/tskit/cli.py 96.29% <0.00%> (ø)
python/tskit/vcf.py 100.00% <0.00%> (ø)
python/tskit/util.py 100.00% <0.00%> (ø)
python/tskit/stats.py 100.00% <0.00%> (ø)
python/tskit/formats.py 100.00% <0.00%> (ø)
python/tskit/__init__.py 100.00% <0.00%> (ø)
python/tskit/_version.py 100.00% <0.00%> (ø)
python/tskit/exceptions.py 100.00% <0.00%> (ø)
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8252da5...f138cf5. Read the comment docs.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. We should add an extra test in the tests/test_highlevel.py module. We don't need missing data: any node in the tree sequence that's not in tree.nodes() should be isolated. This can be done as part of the rest of the tree tests.

4 above is still considered as being present in the tree. This means, for example,
that the node is still returned by the :meth:`Tree.nodes` and :meth:`Tree.samples`
methods. We can therefore identify isolated nodes, as well as nodes linked to the
main tree topology, when iterating using these methods:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implying that we have one root: we often have many roots in forward simulations. Is there some way to rephrase that's not talking about "the tree topology"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I was trying to find the right wording for "nodes in one of the conventional trees". Do you think it's reasonable to call these "nodes in a topology" (without implying that there is only one topology per tree), so e.g.

We can therefore identify isolated nodes, as well as nodes in a topology,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's much point in talking about the negative --- isn't it enough to say that we're identifying the isolated nodes? The reader can reasonably infer that they also identify non-isolated nodes, which saves us getting into a confusing technical discussion of what non-isolated nodes are.


>>> [u for u in tree.nodes() if tree.is_isolated(u)] # list isolated_nodes
[4]
>>> [u for u in tree.nodes() if not tree.is_isolated(u)] # list nodes in topology
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably redundant, I think? These are non-isolated nodes, so pretty obvious, and avoids semantic issues with the description in your comment (see previous comment).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mentioned in #442 (comment), it's possibly more useful to list the non-isolated nodes than the isolated ones. That's why I left this example in, really, even though it's a bit redundant. I could change he comment to simply "list non-isolated nodes" (although that makes it sound really redundant!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it is useful to point out that if someone wants to print out the topologically connected nodes, they can use the is_isolated method. It's not going to be obvious to most people who deal with trees that, when looking at trees in a tree sequence, there will be nodes which are not connected to the topology, but instead separate (isolated). Unless you know already about how missing data is encoded, this must come as a bit of a surprise. It certainly fooled me: I expected to be able to iterate over the nodes in a tree topology by doing tree.nodes(), and had to dig a bit to realise I had to actively exclude the "missing" ones.

Perhaps I'm confused by the work "missing". Nodes with "missing data" should not be thought of as nodes that are "missing" - they are still present in the output trees. It's simply that the STATE is missing (unknown).

All of which is to say that I find it useful to show an example of how to leave out "missing data" samples, so I've left the second example in, but commented that this gives "topologically connected nodes". Feel free to rephrase or delete as you see fit.

@jeromekelleher
Copy link
Member

Still outstanding comments here @hyanwong, can you update please? I think we should keep the discussion to a minimum and not dwell too much on the relationship with missing data.

@hyanwong hyanwong force-pushed the is-isolated branch 2 times, most recently from 43b3ce4 to 3a769d9 Compare January 13, 2020 14:19
@hyanwong
Copy link
Member Author

I think this can be merged @jeromekelleher - I wondered whether #462 made any difference to my additions to the "missing data" section, but I think not.

However, we might want to add something in the missing data section about root_threshold anyway.

@jeromekelleher
Copy link
Member

I think we probably want this function now that we've cleared up any ambiguity with the data model - would you mind bringing this up to date please @hyanwong?

@hyanwong
Copy link
Member Author

How is this?

@hyanwong hyanwong force-pushed the is-isolated branch 2 times, most recently from 1549c69 to db418b7 Compare August 27, 2020 13:46
self.assertTrue(0.0 < missing_to < 1.0)
return tables.tree_sequence(), missing_from, missing_to

def test_is_isolated(self):
Copy link
Member

@benjeffery benjeffery Aug 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is a very simple method - but it's always good to test a couple of error cases, e.g. a negative index, a index bigger than the number of nodes, an index of the wrong type, just to check we get the right error messages.

@hyanwong
Copy link
Member Author

hyanwong commented Aug 28, 2020

@benjeffery 's point about testing made me realise there is an ambiguity here. What about non-sample nodes? They are often missing from a particular tree. I had been assuming that the adjective "isolated" only ever applied to sample nodes, hence "isolated_as_missing", but I see that the test we use returns is_isolated(n) = True where n is an internal (non-sample) node which is missing from this tree, but present in others. I guess if we only want to use "isolated" to refer to a sample node, we could check if the node is in the tree, and is_isolated() could error out if not.

I'm conflicted here. It seems useful to be able to identify non-sample nodes that are missing from a tree, and to have a way to describe them. But using the term "isolated" to apply only to sample nodes makes it clearer what you are talking about, is less verbose, and also aligned with our much-debated ISOLATED_AS_MISSING notation.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor tweak and we're good to go.

@benjeffery
Copy link
Member

@benjeffery 's point about testing made me realise there is an ambiguity here. What about non-sample nodes? They are often missing from a particular tree. I had been assuming that the adjective "isolated" only ever applied to sample nodes, hence "isolated_as_missing", but I see that the test we use returns is_isolated(n) = True where n is an internal (non-sample) node which is missing from this tree, but present in others. I guess if we only want to use "isolated" to refer to a sample node, we could check if the node is in the tree, and is_isolated() could error out if not.

I'm conflicted here. It seems useful to be able to identify non-sample nodes that are missing from a tree, and to have a way to describe them. But using the term "isolated" to apply only to sample nodes makes it clearer what you are talking about, is less verbose, and also aligned with our much-debated ISOLATED_AS_MISSING notation.

I think retaining "isolated" to be a purely topological definition is correct, both samples and non-samples can be isolated. We want to retain a clear distinction with "missing" and saying only samples can be isolated muddies the waters for me.

@jeromekelleher
Copy link
Member

We want to retain a clear distinction with "missing" and saying only samples can be isolated muddies the waters for me.

Agreed

@benjeffery
Copy link
Member

@hyanwong Just one failing test from when non-samples couldn't be isolated, then I think this is good to go.

@hyanwong
Copy link
Member Author

hyanwong commented Sep 3, 2020

@hyanwong Just one failing test from when non-samples couldn't be isolated, then I think this is good to go.

Oh yes, sorry! Forgot that I put that sentinel in to remind me that a decision needed to be made. Now fixed.

@hyanwong
Copy link
Member Author

I think this is ready to merge @benjeffery ?

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benjeffery
Copy link
Member

@hyanwong Force pushed to resolve conflict

@mergify mergify bot merged commit cca8c98 into tskit-dev:master Sep 11, 2020
@hyanwong hyanwong deleted the is-isolated branch November 4, 2020 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add omit_missing to Tree.nodes()?

3 participants