Skip to content

Conversation

@hyanwong
Copy link
Member

As per discussion in #934 , it would be nice to be able to generate both "fully balanced" (for num_leaves = 2 / 4/ 8/ 16, etc) and "semi-balanced" trees using tskit.Tree.generate_balanced(num_leaves, ...). The current PR gives one way to do this, but for semi-balanced trees it fails the unit tests because it doesn't always generate trees that are topologically equivalent to tskit.Tree.unrank((tskit.combinatorics.num_shapes(num_leaves)-1, 0), num_leaves), for instance for num_leaves=6:

>>> print(tskit.Tree.generate_balanced(6).draw_text())
     10    
   ┏━━┻━━┓ 
   9     ┃ 
 ┏━┻━┓   ┃ 
 6   7   8 
┏┻┓ ┏┻┓ ┏┻┓
0 1 2 3 4 5

>>> print(tskit.Tree.unrank((tskit.combinatorics.num_shapes(6)-1, 0), 6).draw_text())
    10     
  ┏━━┻━━┓  
  7     9  
┏━┻┓  ┏━┻┓ 
┃  6  ┃  8 
┃ ┏┻┓ ┃ ┏┻┓
0 1 2 3 4 5

This is probably because the unranked tree is generated using recursion (which fails with too great a recursion depth for large num_leaves), whereas the version here simply does a level-order traversal. It's not clear to me if one of these two trees is more "equally balanced" than another (they both have the same colless_tree_imbalance metric), so I don't know which we prefer.

Re comparing to unranked, it's also not obvious to me how the internal node numbering in unranked works (and numbering order isn't guaranteed anyway, according to the docs), so it might take some work to get semi-balanced trees to match between the two methods.

@AdminBot-tskit
Copy link
Collaborator

📖 Docs for this PR can be previewed here

@jeromekelleher
Copy link
Member

I think the unrank(N - 1) tree is more balanced here, as there's equal numbers of nodes on either side of the root. The node numbering thing would be tricky all right, but we could weaken the requirement a bit and just say that the KC distance between the result of this function and unrank(N - 1) is equal to 0.

It's not the end of the world if we have a recursive algorithm to do this, it's not exactly performance critical anyway (unrank(N - 1) is vastly more work ). I doubt we'd need to do this in any case - aren't we just successively splitting the range(n) list?

@hyanwong
Copy link
Member Author

I think the unrank(N - 1) tree is more balanced here, as there's equal numbers of nodes on either side of the root.

OK. I guess we should have a published balance measure that we point to, though?

The node numbering thing would be tricky all right, but we could weaken the requirement a bit and just say that the KC distance between the result of this function and unrank(N - 1) is equal to 0.

Ah, but it's not just the internal nodes. The tip numbering in unrank is a bit obscure to me, e.g. in the following tree, it's not either the 0th or the 16th tip that is asymmetrical:

>>> print(tskit.Tree.unrank((tskit.combinatorics.num_shapes(17)-1, 0), 17).draw_text())
               32                       
       ┏━━━━━━━━┻━━━━━━━━━┓             
       ┃                 31             
       ┃            ┏━━━━━┻━━━━━┓       
      23            ┃          30       
   ┏━━━┻━━━┓        ┃       ┏━━━┻━━━┓   
  19      22       26       ┃      29   
 ┏━┻━┓   ┏━┻━┓   ┏━━┻━┓     ┃    ┏━━┻━┓ 
17  18  20  21  24   25    27    ┃   28 
┏┻┓ ┏┻┓ ┏┻┓ ┏┻┓ ┏┻┓  ┏┻━┓  ┏┻━┓  ┃  ┏━┻┓
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

We could try to work around that by comparing the (unlabelled) rank of the generate_balanced Tree.

It's not the end of the world if we have a recursive algorithm to do this, it's not exactly performance critical anyway (unrank(N - 1) is vastly more work ).

I hit Python recursion depth problems when doing this using unrank on my 1500 tip trees that I inferred. So I'd prefer to avoid that.

I doubt we'd need to do this in any case - aren't we just successively splitting the range(n) list?

Ah, possibly. I haven't thought about it in great detail.

@jeromekelleher
Copy link
Member

jeromekelleher commented Oct 28, 2020

OK. I guess we should have a published balance measure that we point to, though?

Well you said it had the same balanced-ness metric as the version you made? So, why choose something different to unrank(N - 1)?

I hit Python recursion depth problems when doing this using unrank on my 1500 tip trees that I inferred. So I'd prefer to avoid that.

Yes, that's what I meant - unrank is vastly more work than a direct algorithm, which should have depth log(n) and therefore very unlikely to hit any recursion limits.

Unrank is really expensive for large n.

@jeromekelleher
Copy link
Member

Ah, but it's not just the internal nodes. The tip numbering in unrank is a bit obscure to me, e.g. in the following tree, it's not either the 0th or the 16th tip that is asymmetrical:

This is a good example, and yes, this isn't obvious. Maybe we could get @daniel-goldstein's thoughts here. Daniel, do you think unrank(N - 1) should be our choice for the balanced tree for a given number of leaves? If so, can you explain what the node labelling is doing?

@jeromekelleher
Copy link
Member

Can you update this PR so it's a clean diff please @hyanwong?

@hyanwong hyanwong force-pushed the generate-balanced-tree branch from a35ba71 to 4ce5538 Compare October 28, 2020 12:00
@daniel-goldstein
Copy link
Member

daniel-goldstein commented Oct 28, 2020

I personally do think the last rank should be the most balanced, like Jerome said because it's the most balanced from the perspective of each internal node.

Ah, but it's not just the internal nodes. The tip numbering in unrank is a bit obscure to me, e.g. in the following tree, it's not either the 0th or the 16th tip that is asymmetrical:

>>> print(tskit.Tree.unrank((tskit.combinatorics.num_shapes(17)-1, 0), 17).draw_text())
               32                       
       ┏━━━━━━━━┻━━━━━━━━━┓             
       ┃                 31             
       ┃            ┏━━━━━┻━━━━━┓       
      23            ┃          30       
   ┏━━━┻━━━┓        ┃       ┏━━━┻━━━┓   
  19      22       26       ┃      29   
 ┏━┻━┓   ┏━┻━┓   ┏━━┻━┓     ┃    ┏━━┻━┓ 
17  18  20  21  24   25    27    ┃   28 
┏┻┓ ┏┻┓ ┏┻┓ ┏┻┓ ┏┻┓  ┏┻━┓  ┏┻━┓  ┃  ┏━┻┓
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

When unranking, the labels allocated to each subtree are sorted according to whatever label rank (which when given 0 is identity), and then chunked and passed down to children from left to right. This is why (14) is given to the left subtree and (15,16) given to the right. It could be isomorphic, and more reasonable, to allocate labels to subtrees in order of number of leaves in each subtree descending, so instead pass (14,15) to the largest subtree and (16) to the remaining, but I would have to think about this more closely.

The reason the single leaf is not on the right-most edge of the tree is that the canonical ordering on tree shapes requires children of an internal node be ordered by number of leaves, and then rank, ascending. Note that if you add a leaf to your 6-leaf tree example you get a similar topology.

@jeromekelleher
Copy link
Member

Nice, thanks @daniel-goldstein. Can you think of a simple algorithm for generating the unrank(-1) tree directly off the top of your head?

@daniel-goldstein
Copy link
Member

daniel-goldstein commented Oct 28, 2020

Hm do you care about the way the interior nodes are labelled? For ranking those aren't really important and are assigned in the simple way of splitting the list of interior node labels and handing the lower half to the left and greater to the right. So not sure just from looking at it what's the easiest way of turning it iterative. Otherwise you can use a simple queue starting with just the leaves and repeatedly pop 2 and enqueue the parent (and do a little extra work for the 3 leaves at the end) until you're done. Also if this functionality is meant to be illustrative and easy to understand the way Yan labelled his interior nodes feels simpler to me.

@jeromekelleher
Copy link
Member

It'd be nice to have the internal nodes have the same labels, but I guess it's not that important. Basically we want something which will have kc distance of zero from unrank(-1).

A recursive algorithm would be fine if it's a lot simpler, this isn't really going to be used in production code for large n I would guess. (If it is, we can always make a C version)

@hyanwong
Copy link
Member Author

hyanwong commented Oct 29, 2020

OK. I guess we should have a published balance measure that we point to, though?

Well you said it had the same balanced-ness metric as the version you made? So, why choose something different to unrank(N - 1)?

FWIW both the Colless and Sackin balance metrics rank these two as the same, whereas the Total Cophenetic Index gives the unrank tree a more balanced score (indeed, the maximum for 6 tips)

@jeromekelleher
Copy link
Member

Very nice, that's another really good reason for using the Tree.unrank((-1, 0), n) tree.

Regarding the node labelling, is there any particular reason for using the zero'th labelling? It would be even more pleasing an symmetrical if we used the Tree.unrank((-1, -1), n)

@jeromekelleher
Copy link
Member

Hmm, no, looks like the zeroth labelling is probably simplest:

>>> N = tskit.combinatorics.num_shapes(6)
>>> M = tskit.combinatorics.num_labellings(N - 1, 6)
>>> print(tskit.Tree.unrank((N - 1, M - 1), 6).draw_text()
    10     
  ┏━━┻━━┓  
  7     9  
 ┏┻━┓  ┏┻━┓
 6  ┃  8  ┃
┏┻┓ ┃ ┏┻┓ ┃
0 4 5 1 2 3

@jeromekelleher jeromekelleher mentioned this pull request Nov 21, 2020
3 tasks
@jeromekelleher
Copy link
Member

closing this as superseded by #1026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants