Sort individuals #1199

benjeffery · 2021-02-15T16:33:14Z

Sort individuals. Fixes #1197 #1138.

WIP!

Currently only a python implementation (deliberately written in a style to aid translation to C). Needs more tests etc, but thought it worth sharing progress.

codecov · 2021-02-15T16:52:27Z

Codecov Report

Merging #1199 (c26caec) into main (8ac758f) will decrease coverage by 0.00%.
The diff coverage is 91.46%.

@@            Coverage Diff             @@
##             main    #1199      +/-   ##
==========================================
- Coverage   93.72%   93.71%   -0.01%     
==========================================
  Files          26       26              
  Lines       21535    21618      +83     
  Branches      909      909              
==========================================
+ Hits        20184    20260      +76     
- Misses       1312     1319       +7     
  Partials       39       39

Flag	Coverage Δ
c-tests	`92.50% <91.46%> (-0.01%)`	⬇️
lwt-tests	`92.97% <ø> (ø)`
python-c-tests	`94.91% <ø> (+<0.01%)`	⬆️
python-tests	`98.62% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
c/tskit/core.h	`100.00% <ø> (ø)`
c/tskit/tables.c	`90.84% <91.13%> (+<0.01%)`	⬆️
c/tskit/core.c	`97.05% <100.00%> (+0.02%)`	⬆️
python/lwt_interface/tskit_lwt_interface.h	`95.05% <0.00%> (+<0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8ac758f...c26caec. Read the comment docs.

jeromekelleher · 2021-02-15T17:16:21Z

Looks very nice, eminently C-able!

benjeffery · 2021-02-23T08:58:12Z

Was hoping to have this done yesterday, but taking longer than expected - the initial implementation's sort wasn't stable, that's fixed now. Currently fixing up tests in python side.

jeromekelleher · 2021-02-23T09:35:08Z

the initial implementation's sort wasn't stable,

It's the gift that keeps on giving! Well done for spotting early.

petrelharp · 2021-02-26T05:17:48Z

Nicely done. So the sort here is decreasing by "minimum number of hops to a childless descendant, and then smallest position in the table among the closest childless descendant".

benjeffery · 2021-02-26T10:01:09Z

@petrelharp Yes! That's a great way of stating it. Should have this polished off today.

benjeffery · 2021-02-26T12:12:53Z

While working on the python tests I realised we also need to tweak the canonicalisation as the sort there no longer meets the requirements of a tree sequence.

benjeffery · 2021-02-26T12:22:00Z

We also have the classic msprime chicken and egg as all the sort testing code uses msprime that doesn't store the pedigree. Will add some wf_sim tree sequences to the sort tests for now.

petrelharp · 2021-02-26T15:06:01Z

While working on the python tests I realised we also need to tweak the canonicalisation as the sort there no longer meets the requirements of a tree sequence.

But canonicalize just uses the standard sort with an enhanced mutation sort so it'll get this change?

benjeffery · 2021-02-26T15:14:20Z

While working on the python tests I realised we also need to tweak the canonicalisation as the sort there no longer meets the requirements of a tree sequence.

But canonicalize just uses the standard sort with an enhanced mutation sort so it'll get this change?

Ah, so I think the sort of individuals will be valid then. However, their sort order will depend on the ordering of the parents in each individual though - is that ok for canonicalisation or do we need to have a canonical ordering of the parent arrays?

petrelharp · 2021-02-26T15:23:03Z

Ah, so I think the sort of individuals will be valid then. However, their sort order will depend on the ordering of the parents in each individual though - is that ok for canonicalisation or do we need to have a canonical ordering of the parent arrays?

Oh, right - canonicalize uses subset to put the parents in order by node. That happens before sort, so I guess that with this change, canonicalise will sort individuals by "minimum number of hops to a childless descendant, and then smallest node among the closest childless descendant". It's still canonical, given the node ordering, I think? This will need some modifications to canonicalise's docs, anyhow.

I guess that now subset can produce invalid tree sequences, though, since it reorders individuals by their nodes.

jeromekelleher · 2021-02-26T15:34:24Z

I guess that now subset can produce invalid tree sequences, though, since it reorders individuals by their nodes.

Ah, that's unfortunate...

benjeffery · 2021-02-26T15:35:51Z

I guess that now subset can produce invalid tree sequences, though, since it reorders individuals by their nodes.

Yes, this is true, I guess it can just sort the individuals as a last step?

It doesn't seem like subset reorders the parent array here, so the final sort order is still dependant on the order the parents were specified in.

Also I can't see where subset remaps the ids in individual.parent?

petrelharp · 2021-02-26T16:35:47Z

Ok, how about this:

have subset leave the individual table ordering alone (so, still subset individuals, but don't re-order them), and
come up with a canonical ordering for individuals, that either happens here, or if it's difficult, that canonicalise does.

petrelharp · 2021-02-26T16:40:38Z

As for the order: modifying what you've got here it'd be easy to efficiently compute the total number of descendants of each individual (by recursively adding up the values of all children), so then we could have (num_descendants, smallest referring node) as a canonical sort key. I dunno whether to swap that in here or just do it in canonicalize - what you have here is nice because it doesn't use qsort, but then why would we bother implementing two different sorts in different places?

jeromekelleher · 2021-02-26T16:59:21Z

I dunno whether to swap that in here or just do it in canonicalize - what you have here is nice because it doesn't use qsort, but then why would we bother implementing two different sorts in different places?

Well, one argument would be one is O(n) and the other is O(n log n). I don't have a strong opinion about it though.

benjeffery · 2021-02-26T18:03:26Z

As another option, I think if subset also sorts the parents in individual.parents then applying this sort will be canonical.

I don't have enough information to know how to prioritise the speed of sort vs the complexity of having two sorts. Sorting is a very common operation though so I err to having that be O(n).

If it makes sense I'd like to finish this PR as-is and then fix the canonical issues in a separate one?

petrelharp · 2021-02-26T21:04:16Z

I think if subset also sorts the parents in individual.parents then applying this sort will be canonical.

Um, I guess this is true? I'm a little unsure about this, given that the outcome of this sort depends on the order of parents.

If it makes sense I'd like to finish this PR as-is and then fix the canonical issues in a separate one?

That makes sense, but I think we should make sure we have good path forward: we'd like to at least make sure that if you canonicalise, a subsequent sort doesn't change anything, for instance. And, that the order induced by canonicalise does not depend on the original order of the parent table, just on the order of the node table. And probably also not on the order that the different parents of a given individual are specified.

I'll think some more about this (although I'm ready to be done with sorting... =)

jeromekelleher · 2021-02-27T08:26:19Z

Agree with all the above, but:

: we'd like to at least make sure that if you canonicalise, a subsequent sort doesn't change anything, for instance.

I wouldn't get too hung up on this; once the output of sort is guaranteed to be legal input for tree sequences that's all that really matters. Canonical must be canonical though, there can be only one output from canonicalise.

petrelharp · 2021-02-28T03:31:22Z

I wouldn't get too hung up on this; once the output of sort is guaranteed to be legal input for tree sequences that's all that really matters. Canonical must be canonical though, there can be only one output from canonicalise.

Ok, wise words there. Here's my proposal, then:

make it so subset doesn't reorder the individual table (this is easy and makes it easier to understand anyhow)
let sort do whatever (so: this PR is fine)
change canonicalize so that it sorts individuals on (num descendants, id of first referring node) - this will be a bit annoying but straightforward

I've given it a fair bit of thought and come up with no better solution.

jeromekelleher · 2021-02-28T08:44:53Z

Sounds like a plan, thanks @petrelharp.

So, I guess we merge the sort stuff here and file some issues to track the other two (which we resolve before release)?

petrelharp · 2021-02-28T18:58:01Z

I've made a good start at subset and canonicalise (in #1226), and turned up another issue (union); I think I'll wait until this is merged to finish since I'm starting to step on the toes of this PR.

benjeffery · 2021-03-01T14:33:37Z

Needs a squash and a changelog, but I finally think this is ready for review. I had to do the subset individual re-mapping (#1223 ) to get the tests to pass.

petrelharp · 2021-03-01T14:48:12Z

c/tskit/tables.c

+    tsk_id_t i;
+    tsk_individual_table_t copy;
+    tsk_individual_t individual;
+    tsk_individual_t copy_individual;


you don't use individual and copy_individual at the same time, so maybe they should both be individual?

petrelharp · 2021-03-01T14:53:31Z

python/tests/tsutil.py

+            copy.migrations[mig_id].dest,
+            copy.migrations[mig_id].time,
+            copy.migrations[mig_id].metadata,
+        )


did you mean to put the migration sort stuff in here?

Yes, otherwise a comparison of the python and c sorting fails.

petrelharp

Looks good! Very small comments, and: how about throwing in a test that inserts some NULL parents in addition to the actual parents? (Or, maybe we should just disallow them.)

jeromekelleher

Very nice, a few minor comments. The algorithm is quite beautiful!

c/tskit/tables.c

jeromekelleher · 2021-03-02T08:41:25Z

python/tests/tsutil.py



+def cmp_migration(i, j, tables):
+    ret = tables.migrations.time[i] - tables.migrations.time[j]


It doesn't really matter I'd imagine, but

migration_i = tables.migrations[i] migration_j = tables.migrations[j] ret = migration_i.time - migration_j.time # etc

would avoid repeatedly building and returning numpy arrays for each of the columns.

python/tests/tsutil.py

jeromekelleher · 2021-03-02T08:46:13Z

Looks good! Very small comments, and: how about throwing in a test that inserts some NULL parents in addition to the actual parents? (Or, maybe we should just disallow them.)

I think we should allow NULL parents - there will be times when having parents [-1, -1] will be handy and others when parents=[] is more useful. We should allow both, I think.

benjeffery force-pushed the sort-individuals branch from d36efe5 to edd2430 Compare February 21, 2021 00:51

petrelharp mentioned this pull request Feb 26, 2021

subset (and maybe other methods?) need to remap individual.parent #1223

Closed

petrelharp mentioned this pull request Feb 28, 2021

make subset and canonicalise deal with individual parents #1226

Merged

benjeffery marked this pull request as ready for review March 1, 2021 14:31

petrelharp reviewed Mar 1, 2021

View reviewed changes

petrelharp approved these changes Mar 1, 2021

View reviewed changes

jeromekelleher approved these changes Mar 2, 2021

View reviewed changes

benjeffery force-pushed the sort-individuals branch from 0da7f51 to c8c6432 Compare March 3, 2021 11:37

benjeffery added the AUTOMERGE-REQUESTED label Mar 3, 2021

AdminBot-tskit force-pushed the sort-individuals branch from c8c6432 to e065d39 Compare March 3, 2021 12:03

benjeffery mentioned this pull request Mar 4, 2021

Consider ordering requirements for individual table. #1138

Closed

Sort individuals

c26caec

AdminBot-tskit force-pushed the sort-individuals branch from e065d39 to c26caec Compare March 4, 2021 13:48

mergify bot merged commit c17f179 into tskit-dev:main Mar 4, 2021

mergify bot removed the AUTOMERGE-REQUESTED label Mar 4, 2021

benjeffery deleted the sort-individuals branch March 4, 2021 15:50

petrelharp mentioned this pull request Mar 4, 2021

sort so mutation parents are above children #1232

Closed

This was referenced Jul 8, 2021

tskit tree format integration juntyr/necsim-rust#18

Merged

Correct docs to point out that individuals are sorted #1562

Merged

petrelharp mentioned this pull request Sep 2, 2021

Unstable individual table topological sort #1653

Open



		def cmp_migration(i, j, tables):
		ret = tables.migrations.time[i] - tables.migrations.time[j]

Sort individuals #1199

Sort individuals #1199

Uh oh!

Conversation

benjeffery commented Feb 15, 2021

Uh oh!

codecov bot commented Feb 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher commented Feb 15, 2021

Uh oh!

benjeffery commented Feb 23, 2021

Uh oh!

jeromekelleher commented Feb 23, 2021

Uh oh!

petrelharp commented Feb 26, 2021

Uh oh!

benjeffery commented Feb 26, 2021

Uh oh!

benjeffery commented Feb 26, 2021

Uh oh!

benjeffery commented Feb 26, 2021

Uh oh!

petrelharp commented Feb 26, 2021

Uh oh!

benjeffery commented Feb 26, 2021

Uh oh!

petrelharp commented Feb 26, 2021

Uh oh!

jeromekelleher commented Feb 26, 2021

Uh oh!

benjeffery commented Feb 26, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

petrelharp commented Feb 26, 2021

Uh oh!

petrelharp commented Feb 26, 2021

Uh oh!

jeromekelleher commented Feb 26, 2021

Uh oh!

benjeffery commented Feb 26, 2021

Uh oh!

petrelharp commented Feb 26, 2021

Uh oh!

jeromekelleher commented Feb 27, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

petrelharp commented Feb 28, 2021

Uh oh!

jeromekelleher commented Feb 28, 2021

Uh oh!

petrelharp commented Feb 28, 2021

Uh oh!

benjeffery commented Mar 1, 2021

Uh oh!

petrelharp Mar 1, 2021

Choose a reason for hiding this comment

Uh oh!

petrelharp Mar 1, 2021

Choose a reason for hiding this comment

Uh oh!

benjeffery Mar 1, 2021

Choose a reason for hiding this comment

Uh oh!

petrelharp left a comment

Choose a reason for hiding this comment

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jeromekelleher Mar 2, 2021

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov bot commented Feb 15, 2021 •

edited

Loading

benjeffery commented Feb 26, 2021 •

edited

Loading

jeromekelleher commented Feb 27, 2021 •

edited

Loading