-
Notifications
You must be signed in to change notification settings - Fork 79
Allow newicks to be output without branch lengths #931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
📖 Docs for this PR can be previewed here |
47be2a9 to
dbf4ee5
Compare
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few changes needed. The double underscore was deliberate I think, as a very-private indeed embarrassing recursive function.
dbf4ee5 to
d30e188
Compare
|
I've addressed all the changes now. |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better not to do the node_labels workaround, actually, so we don't have to test that the labels are the same in various corner cases.
| root = self.root | ||
| if not include_branch_lengths and node_labels is None: | ||
| # Force the python generator for simplicity, by specifying the default labels | ||
| node_labels = {i: str(i + 1) for i in self.leaves()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, leaves() or samples()? I think it should be samples(). Leaves is tricky, as we can have non-sample leaves.
What does the C engine do? We should add a test for this, since we are doing this workaround.
Maybe it would be better to not do this, and to change the condition on the line below instead to
if node_labels is None and include_branch_lengths:
Then it's completely explicit and we don't need any comments explaining what we're doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We claim it does leaves only (regardless of whether they are samples), and that seems sensible to me. Some newick readers only bother with names for tips anyway. In particular, the docs currently say:
By default, leaf nodes are labelled with their numerical ID + 1, and internal nodes are not labelled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it's completely explicit and we don't need any comments explaining what we're doing.
We would still need to set the node labels to the leaves only when node_labels == None for the python Newick generating code. But happy to move the logic around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, what you have sounds good then. We should add a test to verify that we get the same labels with and without include_branch_lengths in a corner case when we have non-sample leaves, though.
62d9018 to
3903936
Compare
|
This is ready for final review, @jeromekelleher. Note that to use the leaves != samples examples (specifically |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but let's not do the test refactoring.
python/tests/test_phylo_formats.py
Outdated
|
|
||
| def verify_newick_topology(self, tree, root=None, node_labels=None): | ||
| @staticmethod | ||
| def length_parser(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need a static method for this, do we? A static method implies this function is useful outside the immediate test context, whereas this is really just for newick parsing.
Seems like a good use for a lambda function:
# By default the newick lib outputs length 0.0 if no branch lengths are present
newick_tree = newick.loads(ns, length_parser=lambda x: None if x is None else float(x))[0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to use lambda
| yield tables.tree_sequence() | ||
|
|
||
|
|
||
| def get_internal_samples_examples(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to promote this code up to utils.py. The pattern of yielding a bunch of different examples isn't a great one as it leads to long-running tests where it's hard to distinguish which test has failed. I would prefer to leave this module as-is, and continue to factor tests out of it and into more appropriate places, rather than have this deprecated testing pattern diffuse out into other modules.
Creating a ts with a non-sample leaf is pretty easy:
tables = msprime.simulate(n, random_seed=1).dump_tables()
root = len(tables.nodes) - 1
leaf = tables.nodes.add_row(flags=0, time=0)
tables.edges.add_row(0, 1, root, leaf)
tables.sort()
ts = tables.tree_sequence()There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK. I do think we want a nice range of trees & tree sequences that we call upon from different test functions, rather than writing them all out again (e.g. the example above doesn't test internal samples, which would have been another good check). But this particular refactoring seems not to be the right way to go about it.
|
Can you bring this up to date please @hyanwong? A few small changes and we're good to merge I think. |
3903936 to
09c4dfc
Compare
In the latest push, I've split the |
|
The pytest way to do this is a parameterised fixture: @pytest.fixture(scope="module", params=[(None, None), (5,None), (5//2, n+5//2)], ids=['All', 'Internal', 'Mixed'])
def ts_with_varying_sample_nodes(sample_node_slice):
ts = msprime.simulate(n, random_seed=10, mutation_rate=5)
assert ts.num_mutations > 0
tables = ts.dump_tables()
nodes = tables.nodes
flags = nodes.flags
# Set a mixture of internal and leaf samples.
flags[:] = 0
flags[slice(sample_node_slice)] = tskit.NODE_IS_SAMPLE
nodes.flags = flags
return tables.tree_sequence()Then your tests use the fixture: def test_samples_differ_from_leaves(self, ts_with_varying_sample_nodes):
for t in ts.trees():
self.verify_newick_topology(t) |
|
Great, thanks @benjeffery - should we do this for the bevvy of tests in test_highlevel.py - or at least make the various test sequences available for all files in |
|
Suite-wide fixtures are in |
09c4dfc to
253f26e
Compare
As discussed on slack. This only changes the Python newick generator. I also took the opportunity to make all the parameters keyword-only (which is a breaking change), but it's really not obvious to me what
tree.newick(8)means - I think we should forcetree.newick(precision=8).Incidentally, I was wondering if
__build_newickshould be_build_newick(single underscore) for consistency. But perhaps the double underscore was deliberate?