Sort individuals #1199
Codecov Report
@@            Coverage Diff             @@
##             main    #1199      +/-   ##
==========================================
- Coverage   93.72%   93.71%   -0.01%
==========================================
  Files          26       26
  Lines       21535    21618      +83
  Branches      909      909
==========================================
+ Hits        20184    20260      +76
- Misses       1312     1319       +7
  Partials       39       39
Looks very nice, eminently C-able!
Was hoping to have this done yesterday, but it's taking longer than expected: the initial implementation's sort wasn't stable; that's fixed now. Currently fixing up tests on the Python side.
It's the gift that keeps on giving! Well done for spotting early.
Nicely done. So the sort here is decreasing by "minimum number of hops to a childless descendant", and then by "smallest position in the table among the closest childless descendants".
@petrelharp Yes! That's a great way of stating it. Should have this polished off today.
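For readers following along, here is a minimal sketch of a linear-time pass that guarantees parents come before children, by peeling off childless individuals layer by layer. It is not necessarily the exact sort key described above, and it is not the code in this PR: `sort_individuals`, `NULL`, and the list-of-lists `parents` layout are hypothetical stand-ins for the real table columns.

```python
from collections import deque

NULL = -1  # hypothetical stand-in for tskit.NULL


def sort_individuals(parents):
    """Return individual IDs ordered so that parents precede children.

    parents[i] is the list of parent IDs of individual i; NULL entries
    are skipped. Raises ValueError if the pedigree contains a cycle.
    """
    n = len(parents)
    # Count, for each individual, how many parent slots refer to it.
    num_children = [0] * n
    for plist in parents:
        for p in plist:
            if p != NULL:
                num_children[p] += 1

    # Start from the childless individuals, highest ID first.
    queue = deque(i for i in reversed(range(n)) if num_children[i] == 0)
    emitted = []
    while queue:
        i = queue.popleft()
        emitted.append(i)
        for p in parents[i]:
            if p != NULL:
                num_children[p] -= 1
                # A parent is released once all of its children are emitted.
                if num_children[p] == 0:
                    queue.append(p)

    if len(emitted) != n:
        raise ValueError("Individual pedigree has cycles")

    # Children were emitted before parents, so reverse the list.
    emitted.reverse()
    return emitted
```

Each individual and each parent slot is visited a constant number of times, so the pass is O(n) in the size of the table. Note that the result depends on the order in which each individual lists its parents.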
While working on the Python tests I realised we also need to tweak the canonicalisation, as the sort there no longer meets the requirements of a tree sequence.
We also have the classic msprime chicken-and-egg problem, as all the sort testing code uses msprime, which doesn't store the pedigree. Will add some
But
Ah, so I think the sort of individuals will be valid then. However, their sort order will depend on the ordering of the parents in each individual - is that ok for canonicalisation or do we need to have a canonical ordering of the parent arrays?
Oh, right - canonicalize uses I guess that now
Ah, that's unfortunate...
Yes, this is true, I guess it can just sort the individuals as a last step? It doesn't seem like Also I can't see where
Ok, how about this:
As for the order: modifying what you've got here it'd be easy to efficiently compute the total number of descendants of each individual (by recursively adding up the values of all children), so then we could have (num_descendants, smallest referring node) as a canonical sort key. I dunno whether to swap that in here or just do it in canonicalize - what you have here is nice because it doesn't use
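A rough sketch of the (num_descendants, smallest referring node) key suggested above, under stated assumptions: `descendant_counts`, `canonical_order`, `NULL`, and `NO_NODE` are hypothetical names, `parents` is a list of parent-ID lists, and `node_individual[u]` is the individual that node `u` refers to. Adding up children's values counts a descendant once per path that reaches it, so shared descendants are counted more than once; a parent's value is still strictly larger than each of its children's, which means sorting by decreasing count keeps parents first. The sort direction and the treatment of individuals with no nodes are assumptions here, not something settled in this thread.

```python
NULL = -1  # hypothetical stand-in for tskit.NULL
NO_NODE = float("inf")  # hypothetical sentinel: individual has no nodes


def descendant_counts(parents):
    """Sum of (1 + child's value) over all children, for each individual.

    Assumes the pedigree is acyclic. Uses memoised recursion, so very
    deep pedigrees would need an iterative version.
    """
    n = len(parents)
    children = [[] for _ in range(n)]
    for child, plist in enumerate(parents):
        for p in plist:
            if p != NULL:
                children[p].append(child)
    counts = [None] * n

    def visit(i):
        if counts[i] is None:
            counts[i] = sum(1 + visit(c) for c in children[i])
        return counts[i]

    for i in range(n):
        visit(i)
    return counts


def canonical_order(parents, node_individual):
    """Order individuals by decreasing count, then smallest referring node."""
    counts = descendant_counts(parents)
    first_node = [NO_NODE] * len(parents)
    for node, ind in enumerate(node_individual):
        if ind != NULL:
            first_node[ind] = min(first_node[ind], node)
    return sorted(range(len(parents)), key=lambda i: (-counts[i], first_node[i]))
```

Because the key is built from counts and node IDs only, it does not depend on the order in which an individual's parents are listed.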
Well, one argument would be one is O(n) and the other is O(n log n). I don't have a strong opinion about it though.
As another option, I think if I don't have enough information to know how to prioritise the speed of sort vs the complexity of having two sorts. Sorting is a very common operation though, so I lean towards having that be O(n). If it makes sense I'd like to finish this PR as-is and then fix the canonical issues in a separate one?
Um, I guess this is true? I'm a little unsure about this, given that the outcome of this sort depends on the order of parents.
That makes sense, but I think we should make sure we have a good path forward: we'd like to at least make sure that if you canonicalise, a subsequent sort doesn't change anything, for instance. And that the order induced by canonicalise does not depend on the original order of the parent table, just on the order of the node table. And probably also not on the order in which the different parents of a given individual are specified. I'll think some more about this (although I'm ready to be done with sorting... =)
Agree with all the above, but:
I wouldn't get too hung up on this; once the output of sort is guaranteed to be legal input for tree sequences that's all that really matters. Canonical must be canonical though, there can be only one output from
Ok, wise words there. Here's my proposal, then:
I've given it a fair bit of thought and come up with no better solution.
Sounds like a plan, thanks @petrelharp. So, I guess we merge the sort stuff here and file some issues to track the other two (which we resolve before release)?
I've made a good start at
Needs a squash and a changelog, but I finally think this is ready for review. I had to do the subset individual re-mapping (#1223) to get the tests to pass.
c/tskit/tables.c
    tsk_id_t i;
    tsk_individual_table_t copy;
    tsk_individual_t individual;
    tsk_individual_t copy_individual;
You don't use `individual` and `copy_individual` at the same time, so maybe they should both be `individual`?
        copy.migrations[mig_id].dest,
        copy.migrations[mig_id].time,
        copy.migrations[mig_id].metadata,
    )
Did you mean to put the migration sort stuff in here?
Yes, otherwise a comparison of the Python and C sorting fails.
petrelharp
left a comment
Looks good! Very small comments, and: how about throwing in a test that inserts some NULL parents in addition to the actual parents? (Or, maybe we should just disallow them.)
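A test along those lines might use a small checker like the following. This is a sketch only: `parents_precede_children` and `NULL` are hypothetical names, and `parents` is a plain list of parent-ID lists standing in for the individual table after sorting. NULL entries are simply skipped, which is the behaviour to pin down if they stay allowed.

```python
NULL = -1  # hypothetical stand-in for tskit.NULL


def parents_precede_children(parents):
    """Return True if every non-NULL parent ID refers to an earlier row."""
    for child, plist in enumerate(parents):
        for p in plist:
            if p == NULL:
                continue  # a missing parent places no constraint on the order
            if not (0 <= p < child):
                return False
    return True


# NULL parents mixed in with real ones, in a valid order.
mixed = [[NULL, NULL], [0, NULL], [NULL, 0, 1]]
# Individual 0 lists individual 1 as a parent, so this order is invalid.
unsorted = [[1, NULL], [NULL]]
```

Here `parents_precede_children(mixed)` is true and `parents_precede_children(unsorted)` is false.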
jeromekelleher
left a comment
Very nice, a few minor comments. The algorithm is quite beautiful!
def cmp_migration(i, j, tables):
    ret = tables.migrations.time[i] - tables.migrations.time[j]
It doesn't really matter I'd imagine, but
migration_i = tables.migrations[i]
migration_j = tables.migrations[j]
ret = migration_i.time - migration_j.time
# etc
would avoid repeatedly building and returning numpy arrays for each of the columns.
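Spelled out a little further, the row-based comparison could look like the sketch below. The `Migration` namedtuple is a hypothetical stand-in for a table row, and the field order in `SORT_FIELDS` is an assumption that should be checked against the actual sort requirements; with a real table, each row would be fetched once, for example with `[tables.migrations[i] for i in range(tables.migrations.num_rows)]`.

```python
from collections import namedtuple
from functools import cmp_to_key

# Hypothetical stand-in for a migration table row.
Migration = namedtuple(
    "Migration", ["left", "right", "node", "source", "dest", "time"]
)

# Assumed comparison order; verify against the real sort requirements.
SORT_FIELDS = ("time", "source", "dest", "left", "node")


def cmp_migration(a, b):
    """Three-way comparison of two rows, field by field."""
    for field in SORT_FIELDS:
        x, y = getattr(a, field), getattr(b, field)
        if x != y:
            return -1 if x < y else 1
    return 0


def sorted_migration_ids(rows):
    """Return row indexes in sorted order, comparing pre-fetched rows."""
    return sorted(
        range(len(rows)),
        key=cmp_to_key(lambda i, j: cmp_migration(rows[i], rows[j])),
    )
```

Comparing pre-fetched rows means no column arrays are rebuilt inside the comparison function.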
I think we should allow NULL parents - there will be times when having parents
Sort individuals. Fixes #1197 #1138.
WIP!
Currently only a Python implementation (deliberately written in a style to aid translation to C). Needs more tests etc., but thought it worth sharing progress.