Skip to content

Conversation

@hyanwong
Copy link
Member

As discussed on slack. This only changes the Python newick generator. I also took the opportunity to make all the parameters keyword-only (which is a breaking change), but it's really not obvious to me what tree.newick(8) means - I think we should force tree.newick(precision=8).

Incidentally, I was wondering if __build_newick should be _build_newick (single underscore) for consistency. But perhaps the double underscore was deliberate?

@AdminBot-tskit
Copy link
Collaborator

📖 Docs for this PR can be previewed here

@hyanwong hyanwong force-pushed the no-newick-branch-lengths branch from 47be2a9 to dbf4ee5 Compare October 24, 2020 14:52
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few changes needed. The double underscore was deliberate I think, as a very-private indeed embarrassing recursive function.

@hyanwong hyanwong force-pushed the no-newick-branch-lengths branch from dbf4ee5 to d30e188 Compare October 24, 2020 22:27
@hyanwong
Copy link
Member Author

hyanwong commented Oct 25, 2020

I've addressed all the changes now.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better not to do the node_labels workaround, actually, so we don't have to test that the labels are the same in various corner cases.

root = self.root
if not include_branch_lengths and node_labels is None:
# Force the python generator for simplicity, by specifying the default labels
node_labels = {i: str(i + 1) for i in self.leaves()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, leaves() or samples()? I think it should be samples(). Leaves is tricky, as we can have non-sample leaves.

What does the C engine do? We should add a test for this, since we are doing this workaround.

Maybe it would be better to not do this, and to change the condition on the line below instead to

if node_labels is None and include_branch_lengths:

Then it's completely explicit and we don't need any comments explaining what we're doing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We claim it does leaves only (regardless of whether they are samples), and that seems sensible to me. Some newick readers only bother with names for tips anyway. In particular, the docs currently say:

By default, leaf nodes are labelled with their numerical ID + 1, and internal nodes are not labelled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it's completely explicit and we don't need any comments explaining what we're doing.

We would still need to set the node labels to the leaves only when node_labels == None for the python Newick generating code. But happy to move the logic around.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, what you have sounds good then. We should add a test to verify that we get the same labels with and without include_branch_lengths in a corner case when we have non-sample leaves, though.

@hyanwong hyanwong force-pushed the no-newick-branch-lengths branch 3 times, most recently from 62d9018 to 3903936 Compare October 27, 2020 21:26
@hyanwong
Copy link
Member Author

This is ready for final review, @jeromekelleher. Note that to use the leaves != samples examples (specifically get_internal_samples_examples()) in test_highlevel.py, I've moved a load of tree-sequence-creating routines (e.g. get_example_tree_sequences) from test_highlevel.py to tsutil.py. I think that's probably a better place for them, as then they can all be used in different unit test files.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but let's not do the test refactoring.


def verify_newick_topology(self, tree, root=None, node_labels=None):
@staticmethod
def length_parser(x):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a static method for this, do we? A static method implies this function is useful outside the immediate test context, whereas this is really just for newick parsing.

Seems like a good use for a lambda function:

 # By default the newick lib outputs length 0.0 if no branch lengths are present
newick_tree = newick.loads(ns, length_parser=lambda x: None if x is None else float(x))[0]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to use lambda

yield tables.tree_sequence()


def get_internal_samples_examples():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to promote this code up to utils.py. The pattern of yielding a bunch of different examples isn't a great one as it leads to long-running tests where it's hard to distinguish which test has failed. I would prefer to leave this module as-is, and continue to factor tests out of it and into more appropriate places, rather than have this deprecated testing pattern diffuse out into other modules.

Creating a ts with a non-sample leaf is pretty easy:

tables = msprime.simulate(n, random_seed=1).dump_tables()
root = len(tables.nodes) - 1
leaf = tables.nodes.add_row(flags=0, time=0)
tables.edges.add_row(0, 1, root, leaf)
tables.sort()
ts = tables.tree_sequence()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK. I do think we want a nice range of trees & tree sequences that we call upon from different test functions, rather than writing them all out again (e.g. the example above doesn't test internal samples, which would have been another good check). But this particular refactoring seems not to be the right way to go about it.

@jeromekelleher
Copy link
Member

Can you bring this up to date please @hyanwong? A few small changes and we're good to merge I think.

@hyanwong hyanwong force-pushed the no-newick-branch-lengths branch from 3903936 to 09c4dfc Compare November 2, 2020 13:27
@hyanwong
Copy link
Member Author

hyanwong commented Nov 2, 2020

Can you bring this up to date please @hyanwong? A few small changes and we're good to merge I think.

In the latest push, I've split the get_internal_samples_examples() code into 3 separate functions (all_nodes_samples_example only_internal_samples_example mixed_node_samples_example)and put them at the start of the class, but this does lead to quite a lot of code duplication. Suggestions for improving this are welcome - it feels a bit wrong to me.

@benjeffery
Copy link
Member

The pytest way to do this is a parameterised fixture:

@pytest.fixture(scope="module", params=[(None, None), (5,None), (5//2, n+5//2)], ids=['All', 'Internal', 'Mixed'])
def ts_with_varying_sample_nodes(sample_node_slice):
    ts = msprime.simulate(n, random_seed=10, mutation_rate=5)
    assert ts.num_mutations > 0
    tables = ts.dump_tables()
    nodes = tables.nodes
    flags = nodes.flags
    # Set a mixture of internal and leaf samples.
    flags[:] = 0
    flags[slice(sample_node_slice)] = tskit.NODE_IS_SAMPLE
    nodes.flags = flags
    return tables.tree_sequence()

Then your tests use the fixture:

def test_samples_differ_from_leaves(self, ts_with_varying_sample_nodes):
        for t in ts.trees():
            self.verify_newick_topology(t)

@hyanwong
Copy link
Member Author

hyanwong commented Nov 2, 2020

Great, thanks @benjeffery - should we do this for the bevvy of tests in test_highlevel.py - or at least make the various test sequences available for all files in tests? Presumably that way we can stick them in tsutils.py and grab them from there in the various test_XXX files?

@benjeffery
Copy link
Member

Suite-wide fixtures are in conftest.py and where possible should be scope="session" to avoid running msprime a million times.

@AdminBot-tskit AdminBot-tskit force-pushed the no-newick-branch-lengths branch from 09c4dfc to 253f26e Compare November 2, 2020 17:28
@mergify mergify bot merged commit d373b79 into tskit-dev:main Nov 2, 2020
@hyanwong hyanwong deleted the no-newick-branch-lengths branch November 4, 2020 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants