Generate balanced tree #944
I think the It's not the end of the world if we have a recursive algorithm to do this, it's not exactly performance critical anyway ( |
OK. I guess we should have a published balance measure that we point to, though?
Ah, but it's not just the internal nodes. The tip numbering in We could try to work around that by comparing the (unlabelled) rank of the
I hit Python recursion depth problems when doing this using
Ah, possibly. I haven't thought about it in great detail. |
Well you said it had the same balanced-ness metric as the version you made? So, why choose something different to
Yes, that's what I meant - Unrank is really expensive for large n. |
This is a good example, and yes, this isn't obvious. Maybe we could get @daniel-goldstein's thoughts here. Daniel, do you think |
|
Can you update this PR so it's a clean diff please @hyanwong? |
a35ba71 to 4ce5538
|
I personally do think the last rank should be the most balanced, as Jerome said, because it's the most balanced from the perspective of each internal node.
When unranking, the labels allocated to each subtree are ordered according to the label rank (rank 0 gives the identity ordering), then chunked and passed down to the children from left to right. This is why (14) is given to the left subtree and (15,16) to the right. It could be isomorphic, and more reasonable, to allocate labels to subtrees in descending order of the number of leaves in each subtree, so instead pass (14,15) to the largest subtree and (16) to the remaining one, but I would have to think about this more closely. The reason the single leaf is not on the right-most edge of the tree is that the canonical ordering on tree shapes requires the children of an internal node to be ordered by number of leaves, and then by rank, ascending. Note that if you add a leaf to your 6-leaf tree example you get a similar topology.
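The chunking described above can be sketched in a few lines of plain Python. This is only an illustration, not tskit's implementation: `balanced_split` is a hypothetical helper, trees are represented as nested tuples, and it assumes that each internal node splits its leaves as evenly as possible with the smaller chunk passed to the left child.

```python
def balanced_split(labels):
    """Build a nested-tuple binary tree from an already-ordered list of labels.

    Hypothetical helper for illustration only; assumes each internal node
    splits its labels as evenly as possible, smaller chunk on the left.
    """
    n = len(labels)
    if n == 1:
        return labels[0]
    k = n // 2  # the smaller (or equal) chunk goes to the left child
    return (balanced_split(labels[:k]), balanced_split(labels[k:]))


print(balanced_split([14, 15, 16]))    # (14, (15, 16))
print(balanced_split(list(range(6))))  # ((0, (1, 2)), (3, (4, 5)))
```

Because the label list is halved at every level, the recursion depth of this sketch grows only logarithmically with the number of leaves.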
|
Nice, thanks @daniel-goldstein. Can you think of a simple algorithm for generating the |
|
Hm, do you care about the way the interior nodes are labelled? For ranking those aren't really important, and they are assigned in the simple way of splitting the list of interior node labels and handing the lower half to the left and the greater half to the right. So I'm not sure, just from looking at it, what the easiest way of turning it iterative is. Otherwise you can use a simple queue starting with just the leaves and repeatedly pop 2 and enqueue the parent (and do a little extra work for the 3 leaves at the end) until you're done. Also, if this functionality is meant to be illustrative and easy to understand, the way Yan labelled his interior nodes feels simpler to me.
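A minimal sketch of the queue idea, using nested tuples instead of tskit tables. The function name `queue_balanced` is hypothetical, and the sketch deliberately omits the "extra work for the 3 leaves at the end" mentioned above, since the thread does not spell out what that step involves.

```python
from collections import deque


def queue_balanced(num_leaves):
    """Iteratively pair up nodes from a FIFO queue until one root remains.

    Hypothetical illustration: leaves are the integers 0..num_leaves-1 and
    each internal node is a (left, right) tuple. No recursion is used.
    """
    if num_leaves < 1:
        raise ValueError("num_leaves must be at least 1")
    queue = deque(range(num_leaves))
    while len(queue) > 1:
        left = queue.popleft()
        right = queue.popleft()
        queue.append((left, right))  # the new parent joins the back of the queue
    return queue[0]


print(queue_balanced(4))  # ((0, 1), (2, 3))
print(queue_balanced(6))  # ((4, 5), ((0, 1), (2, 3)))
```

For power-of-two leaf counts this yields a fully balanced tree. For 6 leaves the unmodified queue gives a 2 + 4 split at the root rather than 3 + 3, which may be the kind of difference the end-of-queue special case is meant to address.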
|
It'd be nice to have the internal nodes have the same labels, but I guess it's not that important. Basically we want something which will have a KC distance of zero from unrank(-1). A recursive algorithm would be fine if it's a lot simpler; this isn't really going to be used in production code for large n, I would guess. (If it is, we can always make a C version.)
FWIW both the Colless and Sackin balance metrics rank these two as the same, whereas the Total Cophenetic Index gives the |
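To make the comparison concrete, here is a small sketch that computes the three indices on nested-tuple binary trees. The two 6-leaf shapes below are assumptions chosen for illustration (one with a 2 + 4 root split, one with a 3 + 3 root split); they are not taken from the PR's output, and the functions are not tskit APIs.

```python
from math import comb


def num_leaves(tree):
    return 1 if not isinstance(tree, tuple) else sum(num_leaves(c) for c in tree)


def colless(tree):
    """Sum over internal nodes of |leaves(left) - leaves(right)|."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    return abs(num_leaves(left) - num_leaves(right)) + colless(left) + colless(right)


def sackin(tree, depth=0):
    """Sum of leaf depths (root at depth 0)."""
    if not isinstance(tree, tuple):
        return depth
    return sum(sackin(c, depth + 1) for c in tree)


def total_cophenetic(tree, is_root=True):
    """Sum over leaf pairs of the depth of their most recent common ancestor.

    Computed as the sum of C(n_v, 2) over all non-root internal nodes v.
    """
    if not isinstance(tree, tuple):
        return 0
    own = 0 if is_root else comb(num_leaves(tree), 2)
    return own + sum(total_cophenetic(c, False) for c in tree)


# Assumed example shapes, both with 6 leaves.
split_2_4 = ((4, 5), ((0, 1), (2, 3)))
split_3_3 = ((0, (1, 2)), (3, (4, 5)))

for name, t in [("2+4", split_2_4), ("3+3", split_3_3)]:
    print(name, colless(t), sackin(t), total_cophenetic(t))
# 2+4 2 16 9
# 3+3 2 16 8
```

On these two shapes the Colless and Sackin indices tie, while the Total Cophenetic Index is lower (more balanced) for the tree with the even split at the root.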
|
Very nice, that's another really good reason for using the Regarding the node labelling, is there any particular reason for using the zeroth labelling? It would be even more pleasing and symmetrical if we used the |
|
Hmm, no, looks like the zeroth labelling is probably simplest:
>>> N = tskit.combinatorics.num_shapes(6)
>>> M = tskit.combinatorics.num_labellings(N - 1, 6)
>>> print(tskit.Tree.unrank((N - 1, M - 1), 6).draw_text())
|
closing this as superseded by #1026 |
As per the discussion in #934, it would be nice to be able to generate both "fully balanced" (for num_leaves = 2, 4, 8, 16, etc.) and "semi-balanced" trees using tskit.Tree.generate_balanced(num_leaves, ...). The current PR gives one way to do this, but for semi-balanced trees it fails the unit tests because it doesn't always generate trees that are topologically equivalent to tskit.Tree.unrank((tskit.combinatorics.num_shapes(num_leaves)-1, 0), num_leaves), for instance for num_leaves=6:
This is probably because the unranked tree is generated using recursion (which fails with too great a recursion depth for large num_leaves), whereas the version here simply does a level-order traversal. It's not clear to me whether one of these two trees is more "equally balanced" than the other (they both have the same colless_tree_imbalance metric), so I don't know which we prefer.
Re comparing to unranked, it's also not obvious to me how the internal node numbering in unranked works (and numbering order isn't guaranteed anyway, according to the docs), so it might take some work to get semi-balanced trees to match between the two methods.