Add minimum lexicographic order traversal for nodes #411

brianzhang01 · 2019-11-21T11:58:05Z

This is a first commit for #389 . When this is in, we can talk about changing the tree drawing code itself.

I hope tests pass! On my machine, I was getting one failing test.

ERROR: test_nonrectangular_input (tests.test_util.TestNumpyArrayCasting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/brian/code/python/tskit/python/tests/test_util.py", line 113, in test_nonrectangular_input
    util.safe_np_int_cast(bad_input, dtype)
  File "/Users/brian/code/python/tskit/python/tskit/util.py", line 41, in safe_np_int_cast
    int_array = np.array(int_array)
ValueError: setting an array element with a sequence.

----------------------------------------------------------------------

The test expects a TypeError but gets a ValueError instead. I'm using Python 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:30:03) and numpy 1.15.0.

brianzhang01 · 2019-11-21T12:16:27Z

I just commented on issue #389 . Here's a relevant excerpt for discussion.

I've now decided that the right way to do this is to output a minimum lexicographic order by leaves, not samples, so this is resolved. However, another issue is that sometimes there can apparently be multiple roots for TreeSequence / Tree objects. Right now I only do minlex ordering within each root, and the roots are iterated over using for u in roots: (trees.py, nodes() function). Should we add separate logic to also go over the roots in a minlex order?

Also, for the numpy-related test that fails for me, we may want to consider broadening the test to be fine with either a ValueError or TypeError. The test is running on Mac OS X.

jeromekelleher · 2019-11-21T12:17:59Z

This looks great, thanks @brianzhang01! We're currently tidying things up to release 0.2.3 tomorrow, so do you mind if we pick this up after that?

One thing that would be nice here is to make a test that explicitly shows what the ordering is for a specific tree, using something like this. We could add a test class called TestTraversalOrder to test_topology.py which tests this for an example tree, and just checks that we get the expected ordering.

We haven't seen the numpy failure on Travis, but sure, checking for a TypeError as well as a ValueError would be fine (feel free to add a commit in here to do it).
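A minimal sketch of how a test can accept either exception type, assuming a unittest-style test case. The helper `cast_rectangular` is a hypothetical stand-in written for this illustration; it is not tskit's `util.safe_np_int_cast`, and it raises explicitly so the example does not depend on the installed numpy version.

```python
import unittest


def cast_rectangular(rows):
    """Hypothetical stand-in for a casting helper; raises on ragged input."""
    lengths = {len(row) for row in rows}
    if len(lengths) > 1:
        # Mimics the message seen in the traceback above.
        raise ValueError("setting an array element with a sequence.")
    return [[int(x) for x in row] for row in rows]


class TestNonRectangularInput(unittest.TestCase):
    def test_nonrectangular_input(self):
        bad_input = [[0, 1], [2]]
        # Passing a tuple to assertRaises accepts any of the listed types.
        with self.assertRaises((TypeError, ValueError)):
            cast_rectangular(bad_input)
```

`assertRaises` accepts a tuple of exception classes, so the same test passes whichever of the two errors the underlying library raises.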

codecov · 2019-11-21T12:31:38Z

Codecov Report

Merging #411 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #411      +/-   ##
==========================================
+ Coverage   87.16%   87.18%   +0.02%     
==========================================
  Files          21       21              
  Lines       16200    16233      +33     
  Branches     3172     3184      +12     
==========================================
+ Hits        14120    14153      +33     
  Misses       1020     1020              
  Partials     1060     1060

Flag	Coverage Δ
#c_tests	`88.36% <100.00%> (+0.03%)`	⬆️
#python_c_tests	`90.28% <100.00%> (+0.04%)`	⬆️
#python_tests	`99.19% <100.00%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
python/tskit/trees.py	`98.64% <100.00%> (+0.03%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64ff36b...1fabda7. Read the comment docs.

brianzhang01 · 2019-11-21T15:31:29Z

I just added the tests with an explicit tree. Yes of course, feel free to punt this until after the release.

I noticed while going through things that in the tests (test_highlevel.py:TestTree, test_traversals and verify_traversals), timeasc and timedesc are treated differently from the other orders because the criterion is loose, i.e. there can be multiple orderings that fulfill it. This also meant that I didn't test these two orders in test_topology.py. However, we could consider sharpening the specification for timeasc, so that it sorts first by increasing time and then by increasing node ID if two times are the same. It shouldn't be a big change.

Also, there is still an open question from earlier about whether minlex_postorder should also try to go over the multiple roots in a tree in a minlex way, based on the minimum leaf under that root. I'd prefer to not do it because it's not really needed for the drawing code, but it would make more sense.

jeromekelleher · 2019-11-22T13:01:22Z

> Also, there is still an open question from earlier about whether minlex_postorder should also try to go over the multiple roots in a tree in a minlex way, based on the minimum leaf under that root. I'd prefer to not do it because it's not really needed for the drawing code, but it would make more sense.

This would be just a case of sorting the roots before going through them though, wouldn't it? We might as well in that case.

> I noticed while going through things that in the tests (test_highlevel.py:TestTree, test_traversals and verify_traversals), timeasc and timedesc are treated differently from the other orders because the criterion is loose, i.e. there can be multiple orderings that fulfill it. This also meant that I didn't test these two orders in test_topology.py. However, we could consider sharpening the specification for timeasc, so that it sorts first by increasing time and then by increasing node ID if two times are the same. It shouldn't be a big change.

Well spotted. I guess we should order by node ID as well, as you say. It's not a big change and it's good to be concrete about these things.
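The tie-breaking rule discussed here can be sketched with a composite sort key. This is an illustration on plain Python data, not the tskit implementation; the function names and the `time` mapping are assumptions made for the example.

```python
def timeasc(nodes, time):
    """Sort by increasing time, breaking ties by increasing node ID."""
    return sorted(nodes, key=lambda u: (time[u], u))


def timedesc(nodes, time):
    """Sort by decreasing time, breaking ties by decreasing node ID."""
    return sorted(nodes, key=lambda u: (time[u], u), reverse=True)


# Toy data: three nodes share time 0, so only the ID decides their order.
time = {0: 0, 1: 0, 2: 1, 3: 0, 4: 2}
nodes = [4, 2, 3, 1, 0]
print(timeasc(nodes, time))   # [0, 1, 3, 2, 4]
print(timedesc(nodes, time))  # [4, 2, 3, 1, 0]
```

With the ID included in the key there is exactly one valid output for any input, which is what allows an exact-order test.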

brianzhang01 · 2019-11-22T14:29:19Z

Cool, I'll work on adding both of these features. Just a few more notes on the timeasc / timedesc orderings. First of all, if I'm not mistaken, shouldn't sorting by node ID alone suffice to sort by time as well, since I thought the nodes were ordered by increasing time? Second, the current implementations of these are O(N log N) if N is the number of nodes in a tree, which is slower than the other traversal methods which are all O(N) I think. Instead of a full sort, one can also use a priority queue from root, but I think that would also end up being O(N log N).

Third, for timeasc / timedesc, we might also want to reorder the results of searching from each root. For instance, if there are 3 roots, timeasc / timedesc should maybe get all the nodes under all the roots and then sort by time / node ID. As a reminder, here's what the code for nodes() roughly looks like:

        methods = {
            "preorder": self._preorder_traversal,
            "inorder": self._inorder_traversal,
            "postorder": self._postorder_traversal,
            "levelorder": self._levelorder_traversal,
            "breadthfirst": self._levelorder_traversal,
            "timeasc": self._timeasc_traversal,
            "timedesc": self._timedesc_traversal,
            "minlex_postorder": self._minlex_postorder_traversal
        }
        iterator = methods[order]
        for u in roots:
            for v in iterator(u):
                yield v

So for minlex_postorder, and possibly for timeasc / timedesc, we need to add some logic in the nodes() function, not just in the specific methods for each traversal.

jeromekelleher · 2019-11-22T14:54:21Z

> First of all, if I'm not mistaken, shouldn't sorting by node ID alone suffice to sort by time as well, since I thought the nodes were ordered by increasing time?

No, it's not a requirement that nodes are in time order, they can be mixed up arbitrarily.

> Second, the current implementations of these are O(N log N) if N is the number of nodes in a tree, which is slower than the other traversal methods which are all O(N) I think. Instead of a full sort, one can also use a priority queue from root, but I think that would also end up being O(N log N).

Yes, but in practise it's surprisingly fast, see the plots in #246. We tried various clever algorithms which ended up being slower than the simple sort.
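For reference, the priority-queue alternative mentioned in the quoted comment might look like the sketch below. It assumes every parent is strictly older than its children and uses a plain dict of child lists; `timedesc_heap` is a hypothetical name, and this is not one of the algorithms benchmarked in #246.

```python
import heapq


def timedesc_heap(children, time, root):
    """Visit nodes below root by decreasing time, ties by decreasing ID."""
    # heapq is a min-heap, so both time and ID are negated.
    heap = [(-time[root], -root)]
    order = []
    while heap:
        _, neg_u = heapq.heappop(heap)
        u = -neg_u
        order.append(u)
        # Children only become candidates once their parent has been visited.
        for v in children.get(u, []):
            heapq.heappush(heap, (-time[v], -v))
    return order


children = {4: [2, 3], 2: [0, 1]}
time = {0: 0, 1: 0, 2: 1, 3: 0, 4: 2}
print(timedesc_heap(children, time, 4))  # [4, 2, 3, 1, 0]
```

Each node is pushed and popped once, so the cost is still O(N log N), the same bound as sorting the full node list.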

> Third, for timeasc / timedesc, we might also want to reorder the results of searching from each root. For instance, if there are 3 roots, timeasc / timedesc should maybe get all the nodes under all the roots and then sort by time / node ID. As a reminder, here's what the code for nodes() roughly looks like:

I'm not sure about this one. It's certainly more convenient to do it root-by-root, but I wonder if this is a particularly useful property for the other orderings. It's not going to complicate things much if we pull iterating over the roots into the ordered traversal methods, and if you do want to have separate ordered traversals for each root, then you can always do:

for root in tree.roots:
    for u in tree.nodes(root, order="timeasc"):
        ...  # do your thing.
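The difference between the two behaviours under discussion can be illustrated on a toy forest. This is a sketch using plain dicts rather than a tskit `Tree`; all function names here are hypothetical.

```python
def subtree_nodes(children, root):
    """Collect every node at or below root (order is irrelevant here)."""
    stack, found = [root], []
    while stack:
        u = stack.pop()
        found.append(u)
        stack.extend(children.get(u, []))
    return found


def timeasc_per_root(children, time, roots):
    """Sort within each root separately, then concatenate root by root."""
    order = []
    for root in roots:
        nodes = subtree_nodes(children, root)
        order.extend(sorted(nodes, key=lambda u: (time[u], u)))
    return order


def timeasc_global(children, time, roots):
    """Pool the nodes under all roots and sort them once."""
    nodes = [u for root in roots for u in subtree_nodes(children, root)]
    return sorted(nodes, key=lambda u: (time[u], u))


children = {4: [0, 1], 5: [2, 3]}
time = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 2}
print(timeasc_per_root(children, time, [4, 5]))  # [0, 1, 4, 2, 3, 5]
print(timeasc_global(children, time, [4, 5]))    # [0, 1, 2, 3, 4, 5]
```

In the per-root output, node 4 (time 1) appears before leaves 2 and 3 (time 0), so the combined list is not globally sorted by time.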

brianzhang01 · 2019-12-12T17:25:11Z

So a few changes just went in:

* Calling nodes() with minlex_postorder now first gets the minlex postorder ordering under each root, then sorts among the roots, before finally outputting a global minlex postorder.
* timeasc and timedesc traversals sort first using ascending / descending time, but fall back on ascending / descending node ID if two nodes have the same time.
* The TestTraversalOrder set of tests now includes tests for timeasc and timedesc in the method test_traversal_order (which tests all traversals), as well as a new test function test_minlex_postorder_multiple_roots designed to test the minlex postorder with multiple roots case.

I have left out the last thing we were discussing (whether timeasc and timedesc should be global, not just under each root) because it didn't seem like there was a strong argument for it.

python/tests/test_highlevel.py

jeromekelleher · 2019-12-13T08:30:53Z

Great, thanks @brianzhang01. @hyanwong, can you take a look and review here please?

jeromekelleher

This is great, thanks @brianzhang01, and sincere apologies it's taken this long to get back to you. I've added some comments above, most of which are fairly straightforward.

To get this merged, we'll want to rebase and squash to bring it up to date and remove some of the commits. We should also mention the clarification of the semantics of the time orderings somewhere in the commit history.

python/tskit/trees.py

python/tests/__init__.py

brianzhang01 · 2020-04-08T17:48:37Z

We've agreed for this PR to include a fix for #401 as well

brianzhang01 · 2020-04-08T17:50:46Z

I'll be going through this soon and will push an update in a bit, but I thought @benjeffery might be interested in staying looped in?

jeromekelleher · 2020-04-09T08:35:12Z

Thanks @brianzhang01, it would be great to get this one squared away! One thing to note is that we've added some automated code formatting since this started. It's worth having a quick look at the developer docs so you can get pre-commit working.

Feel free to ping us here or on Slack (did you get an invite?) if you hit issues bringing the PR up to date.

hyanwong · 2020-04-09T13:25:11Z

Sorry - I've just seen that I was asked to review this in December. It seems like most of it is sorted, though. Do you need any more feedback, @brianzhang01?

brianzhang01 · 2020-04-09T13:21:54Z

python/tests/__init__.py

+            for _, child_minlex_postorder in children_return:
+                minlex_postorder.extend(child_minlex_postorder)
+            minlex_postorder.extend([u])
+            return (children_return[0][0], minlex_postorder)


Note that we can't use sorted(children_return) above because we need to get the first element here as well
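The idea behind the hunk above can be sketched as a small recursive function over a dict of child lists. This is an illustration under that assumed data layout, not the code from the PR: each call returns the minimum leaf below the node together with the subtree's postorder, and children are visited in order of their minimum leaf. The sorted result is bound to a name so that its first element is still available for the return value.

```python
def minlex_postorder(children, u):
    """Return (min_leaf, postorder) for the subtree rooted at u."""
    kids = children.get(u, [])
    if not kids:
        # A leaf is its own minimum leaf and its own postorder.
        return (u, [u])
    children_return = sorted(
        (minlex_postorder(children, v) for v in kids),
        key=lambda pair: pair[0],
    )
    order = []
    for _, child_order in children_return:
        order.extend(child_order)
    order.append(u)
    # The first entry has the smallest leaf among all children.
    return (children_return[0][0], order)


# Leaves are 0..3; children are deliberately listed out of order.
children = {6: [5, 4], 5: [3, 1], 4: [2, 0]}
print(minlex_postorder(children, 6))  # (0, [0, 2, 4, 1, 3, 5, 6])
```

Only leaf IDs influence the ordering; internal node IDs such as 4 and 5 are never compared.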

brianzhang01 · 2020-04-09T13:22:52Z

python/tskit/trees.py

+                )
+            else:
+                # The second time visiting a node, we pop and yield it, and
+                # we update the parent variable


Note the copious comments that have been added to this function

Very nice, thanks!
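The "second visit" idea in the comment above is a common way to write postorder without recursion. The sketch below uses an explicit flag on each stack entry instead of the parent variable that the PR's function tracks, so it is a related technique rather than a copy of that code; the dict-of-children layout is an assumption.

```python
def postorder(children, root):
    """Yield nodes below root in postorder using an explicit stack."""
    stack = [(root, False)]
    while stack:
        u, expanded = stack.pop()
        if expanded or not children.get(u):
            # Second visit to an internal node, or first visit to a leaf.
            yield u
        else:
            # First visit: revisit u after all of its children.
            stack.append((u, True))
            for v in reversed(children[u]):
                stack.append((v, False))


children = {4: [2, 3], 2: [0, 1]}
print(list(postorder(children, 4)))  # [0, 1, 2, 3, 4]
```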

brianzhang01 · 2020-04-09T13:24:05Z

python/tskit/trees.py

+              a ``TreeSequence``, as it leads to more consistency between adjacent
+              trees. Note that internal non-leaf nodes are not counted in
+              assessing the lexicographic order.
+


I went with a bulleted list style. However, there is another style used in the documentation that I found: https://tskit.readthedocs.io/en/latest/python-api.html#tskit.TreeSequence.allele_frequency_spectrum. Is it fine to keep it as a bulleted list? Is the documentation too verbose?

Better too much detail than too little. The wikipedia pages are very helpful for tree traversal order explanations, so I'm happy those are in there.

This is excellent, thanks @brianzhang01!

brianzhang01 · 2020-04-09T13:25:07Z

python/tskit/trees.py

-            yield from iterator(u)
+        if order == "minlex_postorder" and len(roots) > 1:
+            # we need to visit the roots in minlex order as well
+            # we first visit all the roots and then sort by the min value


This multi-root case is not documented prominently in the nodes() docstring, should we highlight it? It's unlikely that users would be using the function in the multiple roots case.

I don't think we need to worry about it, as you say.
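For readers who do hit the multi-root case, the behaviour described in the code comment can be sketched as follows. The input is assumed to be one `(min_leaf, postorder)` pair per root, already computed; `merge_root_orders` is a hypothetical helper name, not a tskit function.

```python
def merge_root_orders(per_root):
    """Concatenate per-root postorders, roots ordered by minimum leaf."""
    merged = []
    for _, order in sorted(per_root, key=lambda pair: pair[0]):
        merged.extend(order)
    return merged


# Two roots: 5 (over leaves 1 and 3) and 4 (over leaves 0 and 2).
per_root = [(1, [1, 3, 5]), (0, [0, 2, 4])]
print(merge_root_orders(per_root))  # [0, 2, 4, 1, 3, 5]
```

Distinct roots have disjoint leaf sets, so their minimum leaves never tie and the resulting order is unambiguous.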

brianzhang01 · 2020-04-09T13:27:25Z

@hyanwong I think it's all right -- there's a lot here to be brought up to speed with. But if Ben wants to have a look I think that could be worthwhile as I've built off some of the edits he recently made.

I'll work on fixing the CI errors.

hyanwong · 2020-04-09T13:41:47Z

More generally, there are no tests for the order when there are multiple root trees under any of the other sorting methods. Should there be? I guess that would not be strictly part of this PR though?

brianzhang01 · 2020-04-13T11:45:03Z

I'm ready for another review pass on this, @jeromekelleher . I still haven't rebased but will do that once I get an OK on the PR.

jeromekelleher · 2020-04-15T18:33:10Z

Sorry I've been slow to get to this @brianzhang01, will look tomorrow.

jeromekelleher

OK, this looks great, thanks @brianzhang01! I think we just need to rebase and squash and we're good to merge.

jeromekelleher · 2020-04-16T08:46:35Z

When you're updating, you need to bring the kastore submodule up to date. I think the simplest way to do this is to run git submodule update. It should disappear from your diff then.

Also, can you update the CHANGELOG to include this new feature please? (Probably easiest to do this after you've rebased to avoid conflicts, you can then add it to your squashed commit using git commit --amend)

brianzhang01 · 2020-04-16T09:27:55Z

Thanks @jeromekelleher ! I'll hopefully rebase and squash and add a line to the CHANGELOG tonight and then you can merge at your convenience. Will ping on the Slack (which I've been invited to but not yet joined) or email if I have any questions.

You mentioned adding the new behaviour for "timeasc" / "timedesc" into the commit description. Here's a commit description I sketched out. For the first line, should I go more general and say "Update and document traversal orders for nodes in a Tree"? Similarly, should I update the PR description to something more general? For the CHANGELOG description, I will focus on the minimum lexicographic order and leave out the "timeasc" / "timedesc" change, unless told otherwise.

Add minimum lexicographic order traversal for nodes in a Tree

We add a new traversal order in the Tree.nodes() function called "minlex_postorder". "minlex" is short for "minimum lexicographic", and this traversal generates a postorder which out of all possible postorders, visits the leaves in minimum lexicographic order. This is helpful for drawing neighbouring trees in as consistent a way as possible, see #389.

The "timeasc" and "timedesc" traversal orders in Tree.nodes() have also been updated to sort first by time, then fall back to sorting by increasing or decreasing node ID respectively. This helps make the orderings deterministic.

With the updated orderings, we also add explicit tests for the traversal orders of pre-defined trees, including some corner cases like "inorder" with polytomies and "minlex_postorder" with multiple roots.

Finally, we add documentation for all the traversal orders, with links to Wikipedia as appropriate. This closes #401 .

jeromekelleher · 2020-04-16T14:37:03Z

Thanks @brianzhang01. The notes in the commit messages can be pretty cursory, it's just a short record for the future. No need to update the PR notes above.

We add a new traversal order in the Tree.nodes() function called "minlex_postorder", described in tskit-dev#389.

Additional changes in this commit:
* Update "timeasc" and "timedesc" traversal orders to fall back to sorting by ID
* Add explicit tests for traversal orders
* Add documentation for all traversal orders.

This closes tskit-dev#401.

jeromekelleher · 2020-04-20T13:20:19Z

Merged, thanks @brianzhang01 !

hyanwong · 2020-04-30T09:26:46Z

@brianzhang01 - is there any reason why we don't have minlex_preorder? So that after visiting a node we visit its children in minlex order? I think that might be useful for my svg output.

brianzhang01 · 2020-04-30T11:08:42Z

@hyanwong Jerome suggested adding different traversals in the .nodes() function. I took a look and saw that the tskit drawing code only used postorder, and since we wanted to just replace that behaviour, I decided to only do minlex_postorder. If you want to add minlex_preorder, feel free to do so.
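A possible shape for the minlex_preorder variant raised above, sketched on a dict of child lists. This is hypothetical: the thread does not define this order, and the function names are invented for the illustration. The parent is emitted first, then its children in order of their minimum leaf.

```python
def min_leaf(children, u):
    """Smallest leaf ID at or below u."""
    kids = children.get(u, [])
    if not kids:
        return u
    return min(min_leaf(children, v) for v in kids)


def minlex_preorder(children, u):
    """Preorder in which children are visited by increasing minimum leaf."""
    order = [u]
    for v in sorted(children.get(u, []), key=lambda v: min_leaf(children, v)):
        order.extend(minlex_preorder(children, v))
    return order


children = {6: [5, 4], 5: [3, 1], 4: [2, 0]}
print(minlex_preorder(children, 6))  # [6, 4, 0, 2, 5, 1, 3]
```

Recomputing `min_leaf` at every level keeps the sketch short but is quadratic in the worst case; returning the minimum leaf alongside the order would avoid that.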

hyanwong · 2020-04-30T11:16:31Z

OK, thanks.

brianzhang01 force-pushed the minlex branch from e4d8081 to b46ae8e Compare November 21, 2019 12:14

brianzhang01 force-pushed the minlex branch from 5bdb24c to 105d3e1 Compare December 12, 2019 17:10

brianzhang01 commented Dec 12, 2019

jeromekelleher reviewed Jan 10, 2020

brianzhang01 mentioned this pull request Mar 2, 2020

Document traversal orders #401

Closed

brianzhang01 commented Apr 9, 2020

brianzhang01 requested a review from jeromekelleher April 13, 2020 11:45

jeromekelleher reviewed Apr 16, 2020

brianzhang01 force-pushed the minlex branch from cd27197 to 01bba0a Compare April 20, 2020 12:08

brianzhang01 force-pushed the minlex branch from 01bba0a to 1fabda7 Compare April 20, 2020 12:22

jeromekelleher merged commit 4e707ea into tskit-dev:master Apr 20, 2020

jeromekelleher mentioned this pull request Apr 21, 2020

Given a tree sequence, draw marginal trees emphasising similarity between neighbouring trees #389

Closed
