-
Notifications
You must be signed in to change notification settings - Fork 78
Add minimum lexicographic order traversal for nodes #411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I just commented on issue #389 . Here's a relevant excerpt for discussion.
Also, for the numpy-related test that fails for me, we may want to consider broadening the test to be fine with either a ValueError or TypeError. The test is running on Mac OS X. |
|
This looks great, thanks @brianzhang01! We're currently tidying things up to release 0.2.3 tomorrow, so do you mind if we pick this up after that? One thing that would be nice here is to make a test that explicitly shows what the ordering is for a specific tree, using something like this. We could add a test class called We haven't seen the numpy failure on Travis, but sure, checking for a TypeError as well as ValueError would be fine (feel free to add a commit in here do it). |
Codecov Report
@@ Coverage Diff @@
## master #411 +/- ##
==========================================
+ Coverage 87.16% 87.18% +0.02%
==========================================
Files 21 21
Lines 16200 16233 +33
Branches 3172 3184 +12
==========================================
+ Hits 14120 14153 +33
Misses 1020 1020
Partials 1060 1060
Continue to review full report at Codecov.
|
|
I just added the tests with an explicit tree. Yes of course, feel free to punt this until after the release. I noticed while going through things that in the tests (test_highlevel.py:TestTree, test_traversals and verify_traversals), Also, there is still an open question from earlier about whether minlex_postorder should also try to go over the multiple roots in a tree in a minlex way, based on the minimum leaf under that root. I'd prefer to not do it because it's not really needed for the drawing code, but it would make more sense. |
This would be just a case of sorting the roots before going through them though, wouldn't it? We might as well in that case.
Well spotted. I guess we should order by node ID as well, as you say. It's not a big change and it's good to be concrete about these things. |
|
Cool, I'll work on adding both of these features. Just a few more notes on the timeasc / timedesc orderings. First of all, if I'm not mistaken, shouldn't sorting by node ID alone suffice to sort by time as well, since I thought the nodes were ordered by increasing time? Second, the current implementations of these are O(N log N) if N is the number of nodes in a tree, which is slower than the other traversal methods which are all O(N) I think. Instead of a full sort, one can also use a priority queue from root, but I think that would also end up being O(N log N). Third, for timeasc / timedesc, we might also want to reorder the results of searching from each root. For instance, if there are 3 roots, timeasc / timedesc should maybe get all the nodes under all the roots and then sort by time / node ID. As a reminder, here's what the code for So for minlex_postorder, and possibly for timeasc / timedesc, we need to add some logic in the nodes() function, not just in the specific methods for each traversal. |
No, it's not a requirement that nodes are in time order, they can be mixed up arbitrarily.
Yes, but in practise it's surprisingly fast, see the plots in #246. We tried various clever algorithms which ended up being slower than the simple sort.
I'm not sure about this one. It's certainly more convenient to do it root-by-root, but I wonder if this is a particularly useful property for the other orderings. It's not going to complicate things much if we pull iterating over the roots into the ordered traversal methods, and if you do want to have separate ordered traversals for each root, then you can always do: |
|
So a few changes just went in:
I have left out the last thing we were discussing (whether timeasc and timedesc should be global, not just under each root) because it didn't seem like there was a strong argument for it. |
|
Great, thanks @brianzhang01. @hyanwong, can you take a look and review here please? |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks @brianzhang01 and sincere apologies it's take this long to get back to you. I've added some comments above, most of which are fairly straightforward.
To get this merged, we'll want to rebase and squash to bring it up to date and remove some of the commits. We should mention the clarification of the semantics of the time orderings also in the commit history somewhere.
|
We've agreed for this PR to include a fix for #401 as well |
|
I'll be going through this soon and will push an update in a bit, but I thought @benjeffery might be interested in staying looped in? |
|
Thanks @brianzhang01, it would be great to get this one squared away! One thing to note is that we've added some automated code formatting since this started. It's worth having a quick look at the developer docs so you can get Feel free to ping us here or on Slack (did you get an invite?) if you hit issues bringing the PR up to date. |
|
Sorry - just seen that I was asked to review this in Dec. it seems like most is sorted, though. Do you need any more feedback @brianzhang01 ? |
| for _, child_minlex_postorder in children_return: | ||
| minlex_postorder.extend(child_minlex_postorder) | ||
| minlex_postorder.extend([u]) | ||
| return (children_return[0][0], minlex_postorder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we can't use sorted(children_return) above because we need to get the first element here as well
| ) | ||
| else: | ||
| # The second time visiting a node, we pop and yield it, and | ||
| # we update the parent variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the copious comments that have been added to this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, thanks!
| a ``TreeSequence``, as it leads to more consistency between adjacent | ||
| trees. Note that internal non-leaf nodes are not counted in | ||
| assessing the lexicographic order. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with a bulleted list style. However, there is another style used in the documentation that I found: https://tskit.readthedocs.io/en/latest/python-api.html#tskit.TreeSequence.allele_frequency_spectrum. Is it fine to keep it as a bulleted list? Is the documentation too verbose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better too much detail than too little. The wikipedia pages are very helpful for tree traversal order explanations, so I'm happy those are in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is excellent, thanks @brianzhang01!
| yield from iterator(u) | ||
| if order == "minlex_postorder" and len(roots) > 1: | ||
| # we need to visit the roots in minlex order as well | ||
| # we first visit all the roots and then sort by the min value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This multi-root case is not documented prominently in the nodes() docstring, should we highlight it? It's unlikely that users would be using the function in the multiple roots case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to worry about it, as you say.
|
@hyanwong I think it's all right -- there's a lot here to be brought up to speed with. But if Ben wants to have a look I think that could be worthwhile as I've built off some of the edits he recently made. I'll work on fixing the CI errors. |
|
More generally, there are no tests for the order when there are multiple root trees under any of the other sorting methods. Should there be? I guess that would not be strictly part of this PR though? |
|
I'm ready for another review pass on this, @jeromekelleher . I still haven't rebased but will do that once I get an OK on the PR. |
|
Sorry I've been slow to get to this @brianzhang01, will look tomorrow. |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this looks great, thanks @brianzhang01! I think we just need to rebase and squash and we're good to merge.
|
When you're updating, you need to bring the kastore submodule up to date. I think the simplest way to do this is to run Also, can you update the CHANGELOG to include this new feature please? (Probably easiest to do this after you've rebased to avoid conflicts, you can then add it to your squashed commit using |
|
Thanks @jeromekelleher ! I'll hopefully rebase and squash and add a line to the CHANGELOG tonight and then you can merge at your convenience. Will ping on the Slack (which I've been invited to but not yet joined) or email if I have any questions. You mentioned adding the new behaviour for "time_asc" / "time_desc" into the commit description. Here's a commit description I sketched out. For the first line, should I go more general and say "Update and document traversal orders for nodes in a Tree"? Similarly, should I update the PR description to something more general? For the CHANGELOG description, I will focus on the minimum lexicographic order and leave out the "time_asc" / "time_desc" change, unless told otherwise. Add minimum lexicographic order traversal for nodes in a Tree We add a new traversal order in the Tree.nodes() function called "minlex_postorder". "minlex" is short for "minimum lexicographic", and this traversal generates a postorder which out of all possible postorders, visits the leaves in minimum lexicographic order. This is helpful for drawing neighbouring trees in as consistent a way as possible, see #389. The "time_asc" and "time_desc" traversal orders in Tree.nodes() have also been updated to sort first by time, then fall back to sorting by increasing or decreasing node ID respectively. This helps make the orderings deterministic. With the updated orderings, we also add explicit tests for the traversal orders of pre-defined trees, including some corner cases like "inorder" with polytomies and "minlex_postorder" with multiple roots. Finally, we add documentation for all the traversal orders, with links to Wikipedia as appropriate. This closes #401 . |
|
Thanks @brianzhang01. The notes in the commit messages can be pretty cursory, it's just a short record for the future. No need to update the PR notes above. |
We add a new traversal order in the Tree.nodes() function called "minlex_postorder", described in tskit-dev#389. Additional changes in this commit: * Update "time_asc" and "time_desc" traversal orders to fall back to sorting by ID * Add explicit tests for traversal orders * Add documentation for all traversal orders. This closes tskit-dev#401.
|
Merged, thanks @brianzhang01 ! |
|
@brianzhang01 - is there any reason why we don't have minlex_preorder? So that after visiting a node we visit its children in minlex order? I think that might be useful for my svg output. |
|
@hyanwong Jerome suggested adding different traversals in the .nodes() function. I took a look and saw that the tskit drawing code only used postorder, and since we wanted to just replace that behaviour, I decided to only do minlex_postorder. If you want to add minlex_preorder, feel free to do so. |
|
OK, thanks. |
This is a first commit for #389 . When this is in, we can talk about changing the tree drawing code itself.
I hope tests pass! On my machine, I was getting one failing test.
The test expects a
TypeErrorbut gets aValueErrorinstead. I'm using Python 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:30:03) and numpy 1.15.0.