@brianzhang01
Member

This is a first commit for #389. When this is in, we can talk about changing the tree drawing code itself.

I hope tests pass! On my machine, I was getting one failing test.

ERROR: test_nonrectangular_input (tests.test_util.TestNumpyArrayCasting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/brian/code/python/tskit/python/tests/test_util.py", line 113, in test_nonrectangular_input
    util.safe_np_int_cast(bad_input, dtype)
  File "/Users/brian/code/python/tskit/python/tskit/util.py", line 41, in safe_np_int_cast
    int_array = np.array(int_array)
ValueError: setting an array element with a sequence.

----------------------------------------------------------------------

The test expects a TypeError but gets a ValueError instead. I'm using Python 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:30:03) and numpy 1.15.0.

@brianzhang01
Member Author

I just commented on issue #389. Here's a relevant excerpt for discussion.

I've now decided that the right way to do this is to output a minimum lexicographic order by leaves, not samples, so this is resolved. However, another issue is that sometimes there can apparently be multiple roots for TreeSequence / Tree objects. Right now I only do minlex ordering within each root, and the roots are iterated over using for u in roots: (trees.py, nodes() function). Should we add separate logic to also go over the roots in a minlex order?

Also, for the numpy-related test that fails for me, we may want to consider broadening the test to be fine with either a ValueError or TypeError. The test is running on Mac OS X.

@jeromekelleher
Member

This looks great, thanks @brianzhang01! We're currently tidying things up to release 0.2.3 tomorrow, so do you mind if we pick this up after that?

One thing that would be nice here is to make a test that explicitly shows what the ordering is for a specific tree, using something like this. We could add a test class called TestTraversalOrder to test_topology.py which tests this for an example tree, and just checks that we get the expected ordering.

We haven't seen the numpy failure on Travis, but sure, accepting either a TypeError or a ValueError would be fine (feel free to add a commit here to do it).
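
As an illustration of the broadened check discussed above, here is a minimal sketch using only the standard library. The helper `cast_rows_to_ints` is a hypothetical stand-in for a casting function such as `util.safe_np_int_cast` (it is not the tskit code, and it does not use numpy); the point is only that `assertRaises` accepts a tuple of exception types.

```python
import unittest


def cast_rows_to_ints(rows):
    """Hypothetical stand-in for a casting helper (not the tskit function).

    Raises ValueError for ragged rows and TypeError for non-integer entries,
    mimicking a helper whose exception type can vary with the input or with
    the version of an underlying library.
    """
    width = len(rows[0])
    out = []
    for row in rows:
        if len(row) != width:
            raise ValueError("rows have different lengths")
        for x in row:
            if not isinstance(x, int):
                raise TypeError("entries must be integers")
        out.append(list(row))
    return out


class TestNonRectangularInput(unittest.TestCase):
    def test_nonrectangular_input(self):
        bad_input = [[0, 1], [2]]
        # Passing a tuple makes the assertion accept either exception type.
        with self.assertRaises((TypeError, ValueError)):
            cast_rows_to_ints(bad_input)
```

Running the module with `python -m unittest` would exercise the test; the same tuple form works with `pytest.raises`.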

@codecov

codecov bot commented Nov 21, 2019

Codecov Report

Merging #411 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #411      +/-   ##
==========================================
+ Coverage   87.16%   87.18%   +0.02%     
==========================================
  Files          21       21              
  Lines       16200    16233      +33     
  Branches     3172     3184      +12     
==========================================
+ Hits        14120    14153      +33     
  Misses       1020     1020              
  Partials     1060     1060              
Flag               Coverage Δ
#c_tests           88.36% <100.00%> (+0.03%) ⬆️
#python_c_tests    90.28% <100.00%> (+0.04%) ⬆️
#python_tests      99.19% <100.00%> (+<0.01%) ⬆️

Impacted Files           Coverage Δ
python/tskit/trees.py    98.64% <100.00%> (+0.03%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64ff36b...1fabda7. Read the comment docs.

@brianzhang01
Member Author

I just added the tests with an explicit tree. Yes of course, feel free to punt this until after the release.

I noticed while going through things that in the tests (test_highlevel.py:TestTree, test_traversals and verify_traversals), timeasc and timedesc are treated differently than the other orders because the criterion is loose -- i.e. there can be multiple orderings that fulfill it. This also meant that I didn't test these two orders in test_topology.py. However, we could consider sharpening the specification for timeasc so that it sorts by increasing time first and breaks ties by increasing node ID. It shouldn't be a big change.

Also, there is still an open question from earlier about whether minlex_postorder should also try to go over the multiple roots in a tree in a minlex way, based on the minimum leaf under that root. I'd prefer to not do it because it's not really needed for the drawing code, but it would make more sense.

@jeromekelleher
Member

> Also, there is still an open question from earlier about whether minlex_postorder should also try to go over the multiple roots in a tree in a minlex way, based on the minimum leaf under that root. I'd prefer to not do it because it's not really needed for the drawing code, but it would make more sense.

This would be just a case of sorting the roots before going through them though, wouldn't it? We might as well in that case.

> I noticed while going through things that in the tests (test_highlevel.py:TestTree, test_traversals and verify_traversals), timeasc and timedesc are treated differently than the other orders because the criterion is loose -- i.e. there can be multiple orderings that fulfill it. This also meant that I didn't test these two orders in test_topology.py. However, we could consider sharpening the specification for timeasc so that it sorts by increasing time first and breaks ties by increasing node ID. It shouldn't be a big change.

Well spotted. I guess we should order by node ID as well, as you say. It's not a big change and it's good to be concrete about these things.
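
A minimal sketch of the tie-breaking rule being agreed on here, assuming node times are available in a dict called `time` (the function names and the example data are illustrative, not the tskit implementation). Sorting on the pair (time, node ID) makes the order unique even when several nodes share a time, and it does not rely on node IDs already being in time order.

```python
def timeasc(nodes, time):
    """Sort by increasing time, breaking ties by increasing node ID."""
    return sorted(nodes, key=lambda u: (time[u], u))


def timedesc(nodes, time):
    """Sort by decreasing time, breaking ties by decreasing node ID."""
    return sorted(nodes, key=lambda u: (time[u], u), reverse=True)


# Example data: node 1 is the oldest node, so IDs are not in time order.
time = {0: 0.0, 1: 2.0, 2: 0.0, 3: 1.0, 4: 1.0}
nodes = [4, 1, 3, 0, 2]

print(timeasc(nodes, time))   # [0, 2, 3, 4, 1]
print(timedesc(nodes, time))  # [1, 4, 3, 2, 0]
```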

@brianzhang01
Member Author

Cool, I'll work on adding both of these features. Just a few more notes on the timeasc / timedesc orderings. First of all, if I'm not mistaken, shouldn't sorting by node ID alone suffice to sort by time as well, since I thought the nodes were ordered by increasing time? Second, the current implementations of these are O(N log N) if N is the number of nodes in a tree, which is slower than the other traversal methods which are all O(N) I think. Instead of a full sort, one can also use a priority queue from root, but I think that would also end up being O(N log N).

Third, for timeasc / timedesc, we might also want to reorder the results of searching from each root. For instance, if there are 3 roots, timeasc / timedesc should maybe get all the nodes under all the roots and then sort by time / node ID. As a reminder, here's what the code for nodes() roughly looks like:

        methods = {
            "preorder": self._preorder_traversal,
            "inorder": self._inorder_traversal,
            "postorder": self._postorder_traversal,
            "levelorder": self._levelorder_traversal,
            "breadthfirst": self._levelorder_traversal,
            "timeasc": self._timeasc_traversal,
            "timedesc": self._timedesc_traversal,
            "minlex_postorder": self._minlex_postorder_traversal
        }
        iterator = methods[order]
        for u in roots:
            for v in iterator(u):
                yield v

So for minlex_postorder, and possibly for timeasc / timedesc, we need to add some logic in the nodes() function, not just in the specific methods for each traversal.

@jeromekelleher
Member

> First of all, if I'm not mistaken, shouldn't sorting by node ID alone suffice to sort by time as well, since I thought the nodes were ordered by increasing time?

No, it's not a requirement that nodes are in time order, they can be mixed up arbitrarily.

> Second, the current implementations of these are O(N log N) if N is the number of nodes in a tree, which is slower than the other traversal methods which are all O(N) I think. Instead of a full sort, one can also use a priority queue from root, but I think that would also end up being O(N log N).

Yes, but in practice it's surprisingly fast, see the plots in #246. We tried various clever algorithms which ended up being slower than the simple sort.
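
For reference, the priority-queue alternative mentioned in the quote could look roughly like the sketch below. It is an illustration only, not code from tskit: the `children` and `time` dicts and the function name are assumptions, and it relies on every parent being strictly older than its children. Each node is pushed and popped once, so the cost is still O(N log N), which matches the point that it offers no asymptotic gain over a plain sort.

```python
import heapq


def timedesc_heap(children, time, root):
    """Visit nodes under `root` by decreasing time, ties by decreasing ID.

    Assumes every parent is strictly older than each of its children, so a
    node only needs to enter the heap once its parent has been visited.
    """
    # heapq is a min-heap, so negate both keys to pop the largest first.
    heap = [(-time[root], -root, root)]
    order = []
    while heap:
        _, _, u = heapq.heappop(heap)
        order.append(u)
        for c in children.get(u, []):
            heapq.heappush(heap, (-time[c], -c, c))
    return order


# Example tree: 6 is the root, 4 and 5 are internal nodes, 0-3 are leaves.
children = {6: [4, 5], 4: [3, 1], 5: [2, 0]}
time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 1.0, 6: 2.0}

print(timedesc_heap(children, time, 6))  # [6, 5, 4, 3, 2, 1, 0]
```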

> Third, for timeasc / timedesc, we might also want to reorder the results of searching from each root. For instance, if there are 3 roots, timeasc / timedesc should maybe get all the nodes under all the roots and then sort by time / node ID. As a reminder, here's what the code for nodes() roughly looks like:

I'm not sure about this one. It's certainly more convenient to do it root-by-root, but I wonder if this is a particularly useful property for the other orderings. It's not going to complicate things much if we pull iterating over the roots into the ordered traversal methods, and if you do want to have separate ordered traversals for each root, then you can always do:

for root in tree.roots:
    for u in tree.nodes(root, order="timeasc"):
        ...  # do your thing.

@brianzhang01
Member Author

brianzhang01 commented Dec 12, 2019

So a few changes just went in:

  • Calling nodes() with minlex_postorder now first gets the minlex postorder ordering under each root, then sorts among the roots, before finally outputting a global minlex postorder.
  • timeasc and timedesc traversals sort first using ascending / descending time, but fall back on ascending / descending node ID if two nodes have the same time.
  • The TestTraversalOrder set of tests now includes tests for timeasc and timedesc in the method test_traversal_order (which tests all traversals), as well as a new test function test_minlex_postorder_multiple_roots designed to test the minlex postorder with multiple roots case.

I have left out the last thing we were discussing (whether timeasc and timedesc should be global, not just under each root) because it didn't seem like there was a strong argument for it.
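
The multi-root minlex behaviour described in the first bullet can be sketched as follows. This is a simplified illustration rather than the code in trees.py: the `children` dict, the `roots` list, and the function name are assumptions made for the example, and recursion is used only for brevity.

```python
def minlex_postorder(children, roots):
    """Return a postorder that visits leaves in minimum lexicographic order.

    `children` maps a node ID to a list of child IDs; nodes that are absent
    or map to an empty list are treated as leaves.
    """

    def visit(u):
        kids = children.get(u, [])
        if not kids:
            return u, [u]
        # Order the subtrees by the smallest leaf found under each child.
        results = sorted((visit(c) for c in kids), key=lambda r: r[0])
        order = []
        for _, sub in results:
            order.extend(sub)
        order.append(u)
        return results[0][0], order

    # Order the roots the same way: by the smallest leaf under each root.
    per_root = sorted((visit(r) for r in roots), key=lambda r: r[0])
    return [u for _, sub in per_root for u in sub]


children = {6: [4, 5], 4: [3, 1], 5: [2, 0]}

print(minlex_postorder(children, [6]))     # [0, 2, 5, 1, 3, 4, 6]
print(minlex_postorder(children, [4, 5]))  # [0, 2, 5, 1, 3, 4]
```

In the second call the tree has two roots (4 and 5); root 5 is visited first because its smallest leaf, 0, is lower than the smallest leaf under root 4.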

@jeromekelleher
Member

Great, thanks @brianzhang01. @hyanwong, can you take a look and review here please?

Member

@jeromekelleher jeromekelleher left a comment

This is great, thanks @brianzhang01, and sincere apologies that it's taken this long to get back to you. I've added some comments above, most of which are fairly straightforward.

To get this merged, we'll want to rebase and squash to bring it up to date and remove some of the commits. We should also mention the clarification of the semantics of the time orderings somewhere in the commit history.

@brianzhang01
Member Author

We've agreed for this PR to include a fix for #401 as well

@brianzhang01
Member Author

I'll be going through this soon and will push an update in a bit, but I thought @benjeffery might be interested in staying looped in?

@jeromekelleher
Member

Thanks @brianzhang01, it would be great to get this one squared away! One thing to note is that we've added some automated code formatting since this started. It's worth having a quick look at the developer docs so you can get pre-commit working.

Feel free to ping us here or on Slack (did you get an invite?) if you hit issues bringing the PR up to date.

@hyanwong
Member

hyanwong commented Apr 9, 2020

Sorry - just seen that I was asked to review this in Dec. It seems like most of it is sorted, though. Do you need any more feedback, @brianzhang01?

for _, child_minlex_postorder in children_return:
    minlex_postorder.extend(child_minlex_postorder)
minlex_postorder.extend([u])
return (children_return[0][0], minlex_postorder)
Member Author

Note that we can't use sorted(children_return) above because we need to get the first element here as well

)
else:
    # The second time visiting a node, we pop and yield it, and
    # we update the parent variable
Member Author

Note the copious comments that have been added to this function

Member

Very nice, thanks!

a ``TreeSequence``, as it leads to more consistency between adjacent
trees. Note that internal non-leaf nodes are not counted in
assessing the lexicographic order.
Member Author

I went with a bulleted list style. However, there is another style used in the documentation that I found: https://tskit.readthedocs.io/en/latest/python-api.html#tskit.TreeSequence.allele_frequency_spectrum. Is it fine to keep it as a bulleted list? Is the documentation too verbose?

Member

Better too much detail than too little. The Wikipedia pages are very helpful for tree traversal order explanations, so I'm happy those are in there.

Member

This is excellent, thanks @brianzhang01!

yield from iterator(u)
if order == "minlex_postorder" and len(roots) > 1:
    # we need to visit the roots in minlex order as well
    # we first visit all the roots and then sort by the min value
Member Author

This multi-root case is not documented prominently in the nodes() docstring; should we highlight it? It's unlikely that users would be using the function in the multiple roots case.

Member

I don't think we need to worry about it, as you say.

@brianzhang01
Member Author

@hyanwong I think it's all right -- there's a lot here to be brought up to speed with. But if Ben wants to have a look I think that could be worthwhile as I've built off some of the edits he recently made.

I'll work on fixing the CI errors.

@hyanwong
Member

hyanwong commented Apr 9, 2020

More generally, there are no tests for the order in trees with multiple roots under any of the other traversal orders. Should there be? I guess that would not be strictly part of this PR though?

@brianzhang01
Member Author

I'm ready for another review pass on this, @jeromekelleher . I still haven't rebased but will do that once I get an OK on the PR.

@jeromekelleher
Member

Sorry I've been slow to get to this @brianzhang01, will look tomorrow.

Member

@jeromekelleher jeromekelleher left a comment

OK, this looks great, thanks @brianzhang01! I think we just need to rebase and squash and we're good to merge.

@jeromekelleher
Member

When you're updating, you need to bring the kastore submodule up to date. I think the simplest way to do this is to run git submodule update. It should disappear from your diff then.

Also, can you update the CHANGELOG to include this new feature please? (Probably easiest to do this after you've rebased to avoid conflicts, you can then add it to your squashed commit using git commit --amend)

@brianzhang01
Member Author

brianzhang01 commented Apr 16, 2020

Thanks @jeromekelleher ! I'll hopefully rebase and squash and add a line to the CHANGELOG tonight and then you can merge at your convenience. Will ping on the Slack (which I've been invited to but not yet joined) or email if I have any questions.

You mentioned adding the new behaviour for "time_asc" / "time_desc" into the commit description. Here's a commit description I sketched out. For the first line, should I go more general and say "Update and document traversal orders for nodes in a Tree"? Similarly, should I update the PR description to something more general? For the CHANGELOG description, I will focus on the minimum lexicographic order and leave out the "time_asc" / "time_desc" change, unless told otherwise.


Add minimum lexicographic order traversal for nodes in a Tree

We add a new traversal order in the Tree.nodes() function called "minlex_postorder". "minlex" is short for "minimum lexicographic", and this traversal generates a postorder which, out of all possible postorders, visits the leaves in minimum lexicographic order. This is helpful for drawing neighbouring trees in as consistent a way as possible; see #389.

The "time_asc" and "time_desc" traversal orders in Tree.nodes() have also been updated to sort first by time, then fall back to sorting by increasing or decreasing node ID respectively. This helps make the orderings deterministic.

With the updated orderings, we also add explicit tests for the traversal orders of pre-defined trees, including some corner cases like "inorder" with polytomies and "minlex_postorder" with multiple roots.

Finally, we add documentation for all the traversal orders, with links to Wikipedia as appropriate. This closes #401 .

@jeromekelleher
Member

jeromekelleher commented Apr 16, 2020

Thanks @brianzhang01. The notes in the commit messages can be pretty cursory, it's just a short record for the future. No need to update the PR notes above.

We add a new traversal order in the Tree.nodes() function called "minlex_postorder",
described in tskit-dev#389. Additional changes in this commit:

* Update "time_asc" and "time_desc" traversal orders to fall back to sorting by ID
* Add explicit tests for traversal orders
* Add documentation for all traversal orders. This closes tskit-dev#401.
@jeromekelleher jeromekelleher merged commit 4e707ea into tskit-dev:master Apr 20, 2020
@jeromekelleher
Member

Merged, thanks @brianzhang01 !

@hyanwong
Member

@brianzhang01 - is there any reason why we don't have minlex_preorder? So that after visiting a node we visit its children in minlex order? I think that might be useful for my svg output.

@brianzhang01
Member Author

@hyanwong Jerome suggested adding different traversals in the .nodes() function. I took a look and saw that the tskit drawing code only used postorder, and since we wanted to just replace that behaviour, I decided to only do minlex_postorder. If you want to add minlex_preorder, feel free to do so.
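
For anyone picking this up, a hypothetical minlex_preorder could be sketched along the following lines. Nothing here exists in tskit as part of this PR: the function name, the `children` dict, and the approach (compute the smallest leaf under every node first, then run an ordinary preorder that visits children in that order) are all assumptions for illustration.

```python
def minlex_preorder(children, root):
    """Hypothetical preorder that visits children in minlex order.

    `children` maps a node ID to a list of child IDs; nodes that are absent
    or map to an empty list are treated as leaves.
    """
    min_leaf = {}

    def fill(u):
        # Smallest leaf ID under u; a leaf is its own minimum.
        kids = children.get(u, [])
        min_leaf[u] = min((fill(c) for c in kids), default=u)
        return min_leaf[u]

    fill(root)

    order = []
    stack = [root]
    while stack:
        u = stack.pop()
        order.append(u)
        # Push in reverse so the child with the smallest leaf is popped first.
        kids = sorted(children.get(u, []), key=min_leaf.__getitem__, reverse=True)
        stack.extend(kids)
    return order


children = {6: [4, 5], 4: [3, 1], 5: [2, 0]}

print(minlex_preorder(children, 6))  # [6, 5, 0, 2, 4, 1, 3]
```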

@hyanwong
Member

OK, thanks.
