Preliminary PR for random tree generation #1037

hyanwong · 2020-11-25T22:55:26Z

Description

A draft PR to create random trees. No changelog yet, sorry!

Fixes #1033

PR Checklist:

[x ] Tests that fully cover new/changed functionality.
Documentation including tutorial content if appropriate.
Changelogs, if there are API changes.

AdminBot-tskit · 2020-11-25T22:56:55Z

📖 Docs for this PR can be previewed here

jeromekelleher · 2020-11-26T00:15:27Z

I've made some updates here @hyanwong, mainly just renamed the method to generate_random_binary. My original proposal was to force users to specify arity=2, since i would expect arity=None to uniformly sample from all trees, not just the binary ones. But, this was fiddly, and it seems like the binary case is worth its own method.

It's mostly working, except for a test failure where the internal node labellings of the unranked tree disagrees with the generated one. I don't follow why this is, so I'd like to get to the bottom of it - this isn't super urgent anyway, it's just a nice-to-have at some point.

The disagreement is here: the first tree is the one we generate, and the second one is the result of unranking the rank of the first one:

tests/test_combinatorics.py::TestGenerateRandomBinary::test_rank_unrank_round_trip[8] rank =  (212, 3328)
5.00┊          14     ┊                                                                                                       
    ┊        ┏━━┻━━┓  ┊                                                                                                       
4.00┊       12     ┃  ┊  
    ┊     ┏━━┻━━┓  ┃  ┊                                                                                                       
3.00┊    11     ┃  ┃  ┊  
    ┊   ┏━┻━━┓  ┃  ┃  ┊                                        
2.00┊   9    ┃  ┃  ┃  ┊                                        
    ┊  ┏┻━┓  ┃  ┃  ┃  ┊       
1.00┊  8  ┃ 10  ┃ 13  ┊    
    ┊ ┏┻┓ ┃ ┏┻┓ ┃ ┏┻┓ ┊                                        
0.00┊ 0 5 1 6 7 2 3 4 ┊       
  0.00              1.00                                       
                                                               
5.00┊          14     ┊                                        
    ┊        ┏━━┻━━┓  ┊                                                                                                       
4.00┊       13     ┃  ┊                                                                                                       
    ┊     ┏━━┻━━┓  ┃  ┊        
3.00┊    12     ┃  ┃  ┊                                                                                                           ┊   ┏━┻━━┓  ┃  ┃  ┊                                                                                                       
2.00┊  11    ┃  ┃  ┃  ┊                                                                                                           ┊  ┏┻━┓  ┃  ┃  ┃  ┊                                        
1.00┊ 10  ┃  9  ┃  8  ┊       
    ┊ ┏┻┓ ┃ ┏┻┓ ┃ ┏┻┓ ┊                                        
0.00┊ 0 5 1 6 7 2 3 4 ┊                                                                                                       
  0.00              1.00

Both of these are assigning internal node labels via postorder, but the unrank one has flipped the order of children among parents for some reason. The first tree is clearly "right", so I'd like to understand what the unranked tree is doing here. @daniel-goldstein, and chance you could take a look here please?

daniel-goldstein · 2020-11-26T05:20:47Z

So I believe this has to do with a conflict in the canonical orientation between the ranking code and code to generate random binary trees. If you don't use the the minlex ordering when printing the unranked tree you should get the tree printed in a different layout where the assignment of the internal nodes looks "right". It is unfortunate that there isn't a single canonical orientation, but at least assigning only occurs when producing a tskit.Tree from a RankTree, so we could potentially impose a minlex ordering at that boundary.

hyanwong · 2020-11-26T12:22:51Z

I've made some updates here @hyanwong, mainly just renamed the method to generate_random_binary. My original proposal was to force users to specify arity=2, since i would expect arity=None to uniformly sample from all trees, not just the binary ones. But, this was fiddly, and it seems like the binary case is worth its own method.

I liked your previous idea of specifying arity=2. Why not create the generate_random function anyway, which just returns a NotImplementedError unless arity=2, in which case it uses your specific binary method? You could even implement a temporary hack which just uses unrank for the arity=None case, and raises an error if it is called with more than (say) 8 tips?

jeromekelleher · 2020-11-26T12:27:07Z

I liked your previous idea of specifying arity=2. Why not create the generate_random function anyway, which just returns a NotImplementedError unless arity=2, in which case it uses your specific binary method? You could even implement a temporary hack which just uses unrank for the arity=None case, and raises an error if it is called with more than (say) 8 tips?

Just doesn't seem worth the effort - there's a few hours work involved, and generate_random_binary is useful enough on its own. Getting the details right is time consuming, and I've ended up sinking a fair bit of time into this stuff already...

jeromekelleher · 2020-11-26T12:27:40Z

Thanks @daniel-goldstein, I'll take a look at this later.

jeromekelleher · 2020-11-26T21:56:05Z

Aha, thanks @daniel-goldstein! I think I've fixed the issue now by canonicalising the order of the children within a node by sorting them by (num_leaves, min_label). This is slightly different to the canonical order used in the rank tree, which sorts by (num_leaves, rank, min_label) - do you think what I've done is correct? (Presumably we can get away with a simpler rule because these are binary trees?)

I've marked this as ready for review, but we should keep it back until after 0.3.3 is released; we can update the changelog and merge then.

codecov · 2020-11-26T21:56:44Z

Codecov Report

Merging #1037 (65f9c88) into main (a04a06a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1037   +/-   ##
=======================================
  Coverage   93.68%   93.69%           
=======================================
  Files          26       26           
  Lines       21005    21030   +25     
  Branches      880      886    +6     
=======================================
+ Hits        19679    19704   +25     
  Misses       1289     1289           
  Partials       37       37

Flag	Coverage Δ
c-tests	`92.49% <ø> (ø)`
lwt-tests	`92.80% <ø> (ø)`
python-c-tests	`94.84% <100.00%> (+0.01%)`	⬆️
python-tests	`98.63% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
python/tskit/combinatorics.py	`99.20% <100.00%> (+0.02%)`	⬆️
python/tskit/trees.py	`97.47% <100.00%> (+<0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a04a06a...65f9c88. Read the comment docs.

python/tskit/trees.py

jeromekelleher · 2020-12-15T16:40:35Z

I think this is ready to go - @hyanwong, can you take a final look please?

hyanwong

LGTM. One typo, and a possible addition to tests (but this may not be necessary, and could be done later anyway)

hyanwong · 2020-12-15T16:44:59Z

python/tests/test_combinatorics.py

+            ranks[random_tree.rank()] += 1
+        # There are N possible binary trees here, we should have seen them
+        # all with high probability after 20 N attempts.
+        assert len(ranks) == N


This is a reasonable test, but I'm not sure it's very sensitive to the uniformity of the distribution of topologies. It may be that we don't care about testing that too much, but lack of testing this has caught out other phylogenetic libraries (e.g. R & Dendropy) so it would, I suppose, be good practice to have something (I think I had something like that in the previous tests, but it can turn out to be rather slow).

It's very difficult to do this sort of test rigorously, robustly and quickly enough to be a unit test. We'd need some sort of validation suite for this. Please open an issue to track if you think it's needed.

Yes, agreed. Probably only worth it if we require a validation suite for other reasons, I suppose. I can't think of any off hand, though.

python/tskit/combinatorics.py

hyanwong · 2020-12-15T16:56:35Z

NB: Name-wise, I sort-of prefer the conciseness of generate_random(self, arity=2), and raising a NotImplementedError if anything other than 2 is specified, but we could make such a thing later if need be.

jeromekelleher · 2020-12-15T17:18:01Z

NB: Name-wise, I sort-of prefer the conciseness of generate_random(self, arity=2), and raising a NotImplementedError if anything other than 2 is specified, but we could make such a thing later if need be.

I started with this, and it turned out to be quite tedious in practise having to type arity=2 every time. This is a big enough special case to warrant it's own method, I think. Anyway, I don't imagine anyone is going to implement an efficient general tree sampler any time soon, so there's no much point in making the interface more clumsy for it.

hyanwong · 2020-12-15T17:20:16Z

I started with this, and it turned out to be quite tedious in practise having to type arity=2 every time. This is a big enough special case to warrant it's own method, I think. Anyway, I don't imagine anyone is going to implement an efficient general tree sampler any time soon, so there's no much point in making the interface more clumsy for it.

Sorry, I meant, with arity=2 being the default, so you don't have to actually type it. That strikes me as the value for arity that's most likely to be expected as a default, no?

jeromekelleher · 2020-12-15T17:23:09Z

No, I disagree with that. I wouldn't expect a general random_tree method to just choose from the binary trees by default. That would be weird.

hyanwong · 2020-12-15T17:26:13Z

No, I disagree with that. I wouldn't expect a general random_tree method to just choose from the binary trees by default. That would be weird.

Ah, I see, you mean if arity was none you would expect to pick from all trees, of any arity, with equal probability. Yes, I can see that as a more sensible default if arity is not specified (I assumed you would require an integer arity all the time, even for a default value). So don't mind me.

hyanwong mentioned this pull request Nov 25, 2020

Add Tree.generate_random() method #1033

Closed

jeromekelleher force-pushed the random-tree branch from 7b48f3a to 3cd7b09 Compare November 26, 2020 21:50

jeromekelleher marked this pull request as ready for review November 26, 2020 21:52

jeromekelleher reviewed Nov 26, 2020

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

Preliminary PR for random tree generation

8778293

jeromekelleher force-pushed the random-tree branch 2 times, most recently from d5d6d63 to 890bd77 Compare December 15, 2020 16:39

jeromekelleher approved these changes Dec 15, 2020

View reviewed changes

hyanwong commented Dec 15, 2020

View reviewed changes

Rename to generate_random_binary and improve tests.

65f9c88

jeromekelleher force-pushed the random-tree branch from 890bd77 to 65f9c88 Compare December 15, 2020 17:27

jeromekelleher added the AUTOMERGE-REQUESTED label Dec 15, 2020

mergify bot merged commit c9b2c92 into tskit-dev:main Dec 15, 2020

mergify bot removed the AUTOMERGE-REQUESTED label Dec 15, 2020

Preliminary PR for random tree generation #1037

Preliminary PR for random tree generation #1037

Uh oh!

Conversation

hyanwong commented Nov 25, 2020 • edited by jeromekelleher Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

PR Checklist:

Uh oh!

AdminBot-tskit commented Nov 25, 2020

Uh oh!

jeromekelleher commented Nov 26, 2020

Uh oh!

daniel-goldstein commented Nov 26, 2020

Uh oh!

hyanwong commented Nov 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Nov 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Nov 26, 2020

Uh oh!

jeromekelleher commented Nov 26, 2020

Uh oh!

codecov bot commented Nov 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

jeromekelleher commented Dec 15, 2020

Uh oh!

hyanwong left a comment

Choose a reason for hiding this comment

Uh oh!

hyanwong Dec 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jeromekelleher Dec 15, 2020

Choose a reason for hiding this comment

Uh oh!

hyanwong Dec 15, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

hyanwong commented Dec 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Dec 15, 2020

Uh oh!

hyanwong commented Dec 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Dec 15, 2020

Uh oh!

hyanwong commented Dec 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

hyanwong commented Nov 25, 2020 •

edited by jeromekelleher

Loading

hyanwong commented Nov 26, 2020 •

edited

Loading

jeromekelleher commented Nov 26, 2020 •

edited

Loading

codecov bot commented Nov 26, 2020 •

edited

Loading

hyanwong Dec 15, 2020 •

edited

Loading

hyanwong commented Dec 15, 2020 •

edited

Loading

hyanwong commented Dec 15, 2020 •

edited

Loading

hyanwong commented Dec 15, 2020 •

edited

Loading