Generate balanced tree #944

hyanwong · 2020-10-27T18:47:59Z

As per discussion in #934 , it would be nice to be able to generate both "fully balanced" (for num_leaves = 2 / 4/ 8/ 16, etc) and "semi-balanced" trees using tskit.Tree.generate_balanced(num_leaves, ...). The current PR gives one way to do this, but for semi-balanced trees it fails the unit tests because it doesn't always generate trees that are topologically equivalent to tskit.Tree.unrank((tskit.combinatorics.num_shapes(num_leaves)-1, 0), num_leaves), for instance for num_leaves=6:

>>> print(tskit.Tree.generate_balanced(6).draw_text())
     10    
   ┏━━┻━━┓ 
   9     ┃ 
 ┏━┻━┓   ┃ 
 6   7   8 
┏┻┓ ┏┻┓ ┏┻┓
0 1 2 3 4 5

>>> print(tskit.Tree.unrank((tskit.combinatorics.num_shapes(6)-1, 0), 6).draw_text())
    10     
  ┏━━┻━━┓  
  7     9  
┏━┻┓  ┏━┻┓ 
┃  6  ┃  8 
┃ ┏┻┓ ┃ ┏┻┓
0 1 2 3 4 5

This is probably because the unranked tree is generated using recursion (which fails with too great a recursion depth for large num_leaves), whereas the version here simply does a level-order traversal. It's not clear to me if one of these two trees is more "equally balanced" than another (they both have the same colless_tree_imbalance metric), so I don't know which we prefer.

Re comparing to unranked, it's also not obvious to me how the internal node numbering in unranked works (and numbering order isn't guaranteed anyway, according to the docs), so it might take some work to get semi-balanced trees to match between the two methods.

AdminBot-tskit · 2020-10-27T18:49:22Z

📖 Docs for this PR can be previewed here

jeromekelleher · 2020-10-28T10:56:55Z

I think the unrank(N - 1) tree is more balanced here, as there's equal numbers of nodes on either side of the root. The node numbering thing would be tricky all right, but we could weaken the requirement a bit and just say that the KC distance between the result of this function and unrank(N - 1) is equal to 0.

It's not the end of the world if we have a recursive algorithm to do this, it's not exactly performance critical anyway (unrank(N - 1) is vastly more work ). I doubt we'd need to do this in any case - aren't we just successively splitting the range(n) list?

hyanwong · 2020-10-28T11:06:39Z

I think the unrank(N - 1) tree is more balanced here, as there's equal numbers of nodes on either side of the root.

OK. I guess we should have a published balance measure that we point to, though?

The node numbering thing would be tricky all right, but we could weaken the requirement a bit and just say that the KC distance between the result of this function and unrank(N - 1) is equal to 0.

Ah, but it's not just the internal nodes. The tip numbering in unrank is a bit obscure to me, e.g. in the following tree, it's not either the 0th or the 16th tip that is asymmetrical:

>>> print(tskit.Tree.unrank((tskit.combinatorics.num_shapes(17)-1, 0), 17).draw_text())
               32                       
       ┏━━━━━━━━┻━━━━━━━━━┓             
       ┃                 31             
       ┃            ┏━━━━━┻━━━━━┓       
      23            ┃          30       
   ┏━━━┻━━━┓        ┃       ┏━━━┻━━━┓   
  19      22       26       ┃      29   
 ┏━┻━┓   ┏━┻━┓   ┏━━┻━┓     ┃    ┏━━┻━┓ 
17  18  20  21  24   25    27    ┃   28 
┏┻┓ ┏┻┓ ┏┻┓ ┏┻┓ ┏┻┓  ┏┻━┓  ┏┻━┓  ┃  ┏━┻┓
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

We could try to work around that by comparing the (unlabelled) rank of the generate_balanced Tree.

It's not the end of the world if we have a recursive algorithm to do this, it's not exactly performance critical anyway (unrank(N - 1) is vastly more work ).

I hit Python recursion depth problems when doing this using unrank on my 1500 tip trees that I inferred. So I'd prefer to avoid that.

I doubt we'd need to do this in any case - aren't we just successively splitting the range(n) list?

Ah, possibly. I haven't thought about it in great detail.

jeromekelleher · 2020-10-28T11:54:12Z

OK. I guess we should have a published balance measure that we point to, though?

Well you said it had the same balanced-ness metric as the version you made? So, why choose something different to unrank(N - 1)?

I hit Python recursion depth problems when doing this using unrank on my 1500 tip trees that I inferred. So I'd prefer to avoid that.

Yes, that's what I meant - unrank is vastly more work than a direct algorithm, which should have depth log(n) and therefore very unlikely to hit any recursion limits.

Unrank is really expensive for large n.

jeromekelleher · 2020-10-28T11:58:05Z

Ah, but it's not just the internal nodes. The tip numbering in unrank is a bit obscure to me, e.g. in the following tree, it's not either the 0th or the 16th tip that is asymmetrical:

This is a good example, and yes, this isn't obvious. Maybe we could get @daniel-goldstein's thoughts here. Daniel, do you think unrank(N - 1) should be our choice for the balanced tree for a given number of leaves? If so, can you explain what the node labelling is doing?

jeromekelleher · 2020-10-28T11:58:38Z

Can you update this PR so it's a clean diff please @hyanwong?

daniel-goldstein · 2020-10-28T12:57:28Z

I personally do think the last rank should be the most balanced, like Jerome said because it's the most balanced from the perspective of each internal node.

Ah, but it's not just the internal nodes. The tip numbering in unrank is a bit obscure to me, e.g. in the following tree, it's not either the 0th or the 16th tip that is asymmetrical:

>>> print(tskit.Tree.unrank((tskit.combinatorics.num_shapes(17)-1, 0), 17).draw_text())
               32                       
       ┏━━━━━━━━┻━━━━━━━━━┓             
       ┃                 31             
       ┃            ┏━━━━━┻━━━━━┓       
      23            ┃          30       
   ┏━━━┻━━━┓        ┃       ┏━━━┻━━━┓   
  19      22       26       ┃      29   
 ┏━┻━┓   ┏━┻━┓   ┏━━┻━┓     ┃    ┏━━┻━┓ 
17  18  20  21  24   25    27    ┃   28 
┏┻┓ ┏┻┓ ┏┻┓ ┏┻┓ ┏┻┓  ┏┻━┓  ┏┻━┓  ┃  ┏━┻┓
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

When unranking, the labels allocated to each subtree are sorted according to whatever label rank (which when given 0 is identity), and then chunked and passed down to children from left to right. This is why (14) is given to the left subtree and (15,16) given to the right. It could be isomorphic, and more reasonable, to allocate labels to subtrees in order of number of leaves in each subtree descending, so instead pass (14,15) to the largest subtree and (16) to the remaining, but I would have to think about this more closely.

The reason the single leaf is not on the right-most edge of the tree is that the canonical ordering on tree shapes requires children of an internal node be ordered by number of leaves, and then rank, ascending. Note that if you add a leaf to your 6-leaf tree example you get a similar topology.

jeromekelleher · 2020-10-28T14:03:13Z

Nice, thanks @daniel-goldstein. Can you think of a simple algorithm for generating the unrank(-1) tree directly off the top of your head?

daniel-goldstein · 2020-10-28T17:02:51Z

Hm do you care about the way the interior nodes are labelled? For ranking those aren't really important and are assigned in the simple way of splitting the list of interior node labels and handing the lower half to the left and greater to the right. So not sure just from looking at it what's the easiest way of turning it iterative. Otherwise you can use a simple queue starting with just the leaves and repeatedly pop 2 and enqueue the parent (and do a little extra work for the 3 leaves at the end) until you're done. Also if this functionality is meant to be illustrative and easy to understand the way Yan labelled his interior nodes feels simpler to me.

jeromekelleher · 2020-10-28T19:02:42Z

It'd be nice to have the internal nodes have the same labels, but I guess it's not that important. Basically we want something which will have kc distance of zero from unrank(-1).

A recursive algorithm would be fine if it's a lot simpler, this isn't really going to be used in production code for large n I would guess. (If it is, we can always make a C version)

hyanwong · 2020-10-29T10:22:48Z

OK. I guess we should have a published balance measure that we point to, though?

Well you said it had the same balanced-ness metric as the version you made? So, why choose something different to unrank(N - 1)?

FWIW both the Colless and Sackin balance metrics rank these two as the same, whereas the Total Cophenetic Index gives the unrank tree a more balanced score (indeed, the maximum for 6 tips)

jeromekelleher · 2020-10-29T10:58:56Z

Very nice, that's another really good reason for using the Tree.unrank((-1, 0), n) tree.

Regarding the node labelling, is there any particular reason for using the zero'th labelling? It would be even more pleasing an symmetrical if we used the Tree.unrank((-1, -1), n)

jeromekelleher · 2020-10-29T11:20:08Z

Hmm, no, looks like the zeroth labelling is probably simplest:

>>> N = tskit.combinatorics.num_shapes(6)
>>> M = tskit.combinatorics.num_labellings(N - 1, 6)
>>> print(tskit.Tree.unrank((N - 1, M - 1), 6).draw_text()

    10     
  ┏━━┻━━┓  
  7     9  
 ┏┻━┓  ┏┻━┓
 6  ┃  8  ┃
┏┻┓ ┃ ┏┻┓ ┃
0 4 5 1 2 3

jeromekelleher · 2020-11-25T14:36:02Z

closing this as superseded by #1026

Trial for creating a balanced tree

4ce5538

hyanwong force-pushed the generate-balanced-tree branch from a35ba71 to 4ce5538 Compare October 28, 2020 12:00

jeromekelleher mentioned this pull request Oct 29, 2020

Support negative indexing in Tree.unrank #949

Closed

jeromekelleher mentioned this pull request Nov 21, 2020

More combinatorial functions #1026

Merged

3 tasks

jeromekelleher closed this Nov 25, 2020

Generate balanced tree #944

Generate balanced tree #944

Uh oh!

Conversation

hyanwong commented Oct 27, 2020

Uh oh!

AdminBot-tskit commented Oct 27, 2020

Uh oh!

jeromekelleher commented Oct 28, 2020

Uh oh!

hyanwong commented Oct 28, 2020

Uh oh!

jeromekelleher commented Oct 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Oct 28, 2020

Uh oh!

jeromekelleher commented Oct 28, 2020

Uh oh!

daniel-goldstein commented Oct 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Oct 28, 2020

Uh oh!

daniel-goldstein commented Oct 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Oct 28, 2020

Uh oh!

hyanwong commented Oct 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Oct 29, 2020

Uh oh!

jeromekelleher commented Oct 29, 2020

Uh oh!

jeromekelleher commented Nov 25, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jeromekelleher commented Oct 28, 2020 •

edited

Loading

daniel-goldstein commented Oct 28, 2020 •

edited

Loading

daniel-goldstein commented Oct 28, 2020 •

edited

Loading

hyanwong commented Oct 29, 2020 •

edited

Loading