Skip to content

Conversation

@hyanwong
Copy link
Member

@hyanwong hyanwong commented Nov 25, 2020

Description

A draft PR to create random trees. No changelog yet, sorry!

Fixes #1033

PR Checklist:

  • [x ] Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@AdminBot-tskit
Copy link
Collaborator

📖 Docs for this PR can be previewed here

@jeromekelleher
Copy link
Member

I've made some updates here @hyanwong, mainly just renamed the method to generate_random_binary. My original proposal was to force users to specify arity=2, since i would expect arity=None to uniformly sample from all trees, not just the binary ones. But, this was fiddly, and it seems like the binary case is worth its own method.

It's mostly working, except for a test failure where the internal node labellings of the unranked tree disagrees with the generated one. I don't follow why this is, so I'd like to get to the bottom of it - this isn't super urgent anyway, it's just a nice-to-have at some point.

The disagreement is here: the first tree is the one we generate, and the second one is the result of unranking the rank of the first one:

tests/test_combinatorics.py::TestGenerateRandomBinary::test_rank_unrank_round_trip[8] rank =  (212, 3328)
5.00┊          14     ┊                                                                                                       
    ┊        ┏━━┻━━┓  ┊                                                                                                       
4.00┊       12     ┃  ┊  
    ┊     ┏━━┻━━┓  ┃  ┊                                                                                                       
3.00┊    11     ┃  ┃  ┊  
    ┊   ┏━┻━━┓  ┃  ┃  ┊                                        
2.00┊   9    ┃  ┃  ┃  ┊                                        
    ┊  ┏┻━┓  ┃  ┃  ┃  ┊       
1.00┊  8  ┃ 10  ┃ 13  ┊    
    ┊ ┏┻┓ ┃ ┏┻┓ ┃ ┏┻┓ ┊                                        
0.00┊ 0 5 1 6 7 2 3 4 ┊       
  0.00              1.00                                       
                                                               
5.00┊          14     ┊                                        
    ┊        ┏━━┻━━┓  ┊                                                                                                       
4.00┊       13     ┃  ┊                                                                                                       
    ┊     ┏━━┻━━┓  ┃  ┊        
3.00┊    12     ┃  ┃  ┊                                                                                                           ┊   ┏━┻━━┓  ┃  ┃  ┊                                                                                                       
2.00┊  11    ┃  ┃  ┃  ┊                                                                                                           ┊  ┏┻━┓  ┃  ┃  ┃  ┊                                        
1.00┊ 10  ┃  9  ┃  8  ┊       
    ┊ ┏┻┓ ┃ ┏┻┓ ┃ ┏┻┓ ┊                                        
0.00┊ 0 5 1 6 7 2 3 4 ┊                                                                                                       
  0.00              1.00              

Both of these are assigning internal node labels via postorder, but the unrank one has flipped the order of children among parents for some reason. The first tree is clearly "right", so I'd like to understand what the unranked tree is doing here. @daniel-goldstein, and chance you could take a look here please?

@daniel-goldstein
Copy link
Member

So I believe this has to do with a conflict in the canonical orientation between the ranking code and code to generate random binary trees. If you don't use the the minlex ordering when printing the unranked tree you should get the tree printed in a different layout where the assignment of the internal nodes looks "right". It is unfortunate that there isn't a single canonical orientation, but at least assigning only occurs when producing a tskit.Tree from a RankTree, so we could potentially impose a minlex ordering at that boundary.

@hyanwong
Copy link
Member Author

hyanwong commented Nov 26, 2020

I've made some updates here @hyanwong, mainly just renamed the method to generate_random_binary. My original proposal was to force users to specify arity=2, since i would expect arity=None to uniformly sample from all trees, not just the binary ones. But, this was fiddly, and it seems like the binary case is worth its own method.

I liked your previous idea of specifying arity=2. Why not create the generate_random function anyway, which just returns a NotImplementedError unless arity=2, in which case it uses your specific binary method? You could even implement a temporary hack which just uses unrank for the arity=None case, and raises an error if it is called with more than (say) 8 tips?

@jeromekelleher
Copy link
Member

jeromekelleher commented Nov 26, 2020

I liked your previous idea of specifying arity=2. Why not create the generate_random function anyway, which just returns a NotImplementedError unless arity=2, in which case it uses your specific binary method? You could even implement a temporary hack which just uses unrank for the arity=None case, and raises an error if it is called with more than (say) 8 tips?

Just doesn't seem worth the effort - there's a few hours work involved, and generate_random_binary is useful enough on its own. Getting the details right is time consuming, and I've ended up sinking a fair bit of time into this stuff already...

@jeromekelleher
Copy link
Member

Thanks @daniel-goldstein, I'll take a look at this later.

@jeromekelleher jeromekelleher marked this pull request as ready for review November 26, 2020 21:52
@jeromekelleher
Copy link
Member

Aha, thanks @daniel-goldstein! I think I've fixed the issue now by canonicalising the order of the children within a node by sorting them by (num_leaves, min_label). This is slightly different to the canonical order used in the rank tree, which sorts by (num_leaves, rank, min_label) - do you think what I've done is correct? (Presumably we can get away with a simpler rule because these are binary trees?)

I've marked this as ready for review, but we should keep it back until after 0.3.3 is released; we can update the changelog and merge then.

@codecov
Copy link

codecov bot commented Nov 26, 2020

Codecov Report

Merging #1037 (65f9c88) into main (a04a06a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1037   +/-   ##
=======================================
  Coverage   93.68%   93.69%           
=======================================
  Files          26       26           
  Lines       21005    21030   +25     
  Branches      880      886    +6     
=======================================
+ Hits        19679    19704   +25     
  Misses       1289     1289           
  Partials       37       37           
Flag Coverage Δ
c-tests 92.49% <ø> (ø)
lwt-tests 92.80% <ø> (ø)
python-c-tests 94.84% <100.00%> (+0.01%) ⬆️
python-tests 98.63% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/combinatorics.py 99.20% <100.00%> (+0.02%) ⬆️
python/tskit/trees.py 97.47% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a04a06a...65f9c88. Read the comment docs.

@jeromekelleher jeromekelleher force-pushed the random-tree branch 2 times, most recently from d5d6d63 to 890bd77 Compare December 15, 2020 16:39
@jeromekelleher
Copy link
Member

I think this is ready to go - @hyanwong, can you take a final look please?

Copy link
Member Author

@hyanwong hyanwong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One typo, and a possible addition to tests (but this may not be necessary, and could be done later anyway)

ranks[random_tree.rank()] += 1
# There are N possible binary trees here, we should have seen them
# all with high probability after 20 N attempts.
assert len(ranks) == N
Copy link
Member Author

@hyanwong hyanwong Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a reasonable test, but I'm not sure it's very sensitive to the uniformity of the distribution of topologies. It may be that we don't care about testing that too much, but lack of testing this has caught out other phylogenetic libraries (e.g. R & Dendropy) so it would, I suppose, be good practice to have something (I think I had something like that in the previous tests, but it can turn out to be rather slow).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very difficult to do this sort of test rigorously, robustly and quickly enough to be a unit test. We'd need some sort of validation suite for this. Please open an issue to track if you think it's needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed. Probably only worth it if we require a validation suite for other reasons, I suppose. I can't think of any off hand, though.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 15, 2020

NB: Name-wise, I sort-of prefer the conciseness of generate_random(self, arity=2), and raising a NotImplementedError if anything other than 2 is specified, but we could make such a thing later if need be.

@jeromekelleher
Copy link
Member

NB: Name-wise, I sort-of prefer the conciseness of generate_random(self, arity=2), and raising a NotImplementedError if anything other than 2 is specified, but we could make such a thing later if need be.

I started with this, and it turned out to be quite tedious in practise having to type arity=2 every time. This is a big enough special case to warrant it's own method, I think. Anyway, I don't imagine anyone is going to implement an efficient general tree sampler any time soon, so there's no much point in making the interface more clumsy for it.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 15, 2020

I started with this, and it turned out to be quite tedious in practise having to type arity=2 every time. This is a big enough special case to warrant it's own method, I think. Anyway, I don't imagine anyone is going to implement an efficient general tree sampler any time soon, so there's no much point in making the interface more clumsy for it.

Sorry, I meant, with arity=2 being the default, so you don't have to actually type it. That strikes me as the value for arity that's most likely to be expected as a default, no?

@jeromekelleher
Copy link
Member

No, I disagree with that. I wouldn't expect a general random_tree method to just choose from the binary trees by default. That would be weird.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 15, 2020

No, I disagree with that. I wouldn't expect a general random_tree method to just choose from the binary trees by default. That would be weird.

Ah, I see, you mean if arity was none you would expect to pick from all trees, of any arity, with equal probability. Yes, I can see that as a more sensible default if arity is not specified (I assumed you would require an integer arity all the time, even for a default value). So don't mind me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Tree.generate_random() method

4 participants