Skip to content

Conversation

@benjeffery
Copy link
Member

Sort individuals. Fixes #1197 #1138.

WIP!

Currently only a python implementation (deliberately written in a style to aid translation to C). Needs more tests etc, but thought it worth sharing progress.

@codecov
Copy link

codecov bot commented Feb 15, 2021

Codecov Report

Merging #1199 (c26caec) into main (8ac758f) will decrease coverage by 0.00%.
The diff coverage is 91.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1199      +/-   ##
==========================================
- Coverage   93.72%   93.71%   -0.01%     
==========================================
  Files          26       26              
  Lines       21535    21618      +83     
  Branches      909      909              
==========================================
+ Hits        20184    20260      +76     
- Misses       1312     1319       +7     
  Partials       39       39              
Flag Coverage Δ
c-tests 92.50% <91.46%> (-0.01%) ⬇️
lwt-tests 92.97% <ø> (ø)
python-c-tests 94.91% <ø> (+<0.01%) ⬆️
python-tests 98.62% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
c/tskit/core.h 100.00% <ø> (ø)
c/tskit/tables.c 90.84% <91.13%> (+<0.01%) ⬆️
c/tskit/core.c 97.05% <100.00%> (+0.02%) ⬆️
python/lwt_interface/tskit_lwt_interface.h 95.05% <0.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8ac758f...c26caec. Read the comment docs.

@jeromekelleher
Copy link
Member

Looks very nice, eminently C-able!

@benjeffery
Copy link
Member Author

Was hoping to have this done yesterday, but taking longer than expected - the initial implementation's sort wasn't stable, that's fixed now. Currently fixing up tests in python side.

@jeromekelleher
Copy link
Member

the initial implementation's sort wasn't stable,

It's the gift that keeps on giving! Well done for spotting early.

@petrelharp
Copy link
Contributor

Nicely done. So the sort here is decreasing by "minimum number of hops to a childless descendant, and then smallest position in the table among the closest childless descendant".

@benjeffery
Copy link
Member Author

@petrelharp Yes! That's a great way of stating it. Should have this polished off today.

@benjeffery
Copy link
Member Author

While working on the python tests I realised we also need to tweak the canonicalisation as the sort there no longer meets the requirements of a tree sequence.

@benjeffery
Copy link
Member Author

We also have the classic msprime chicken and egg as all the sort testing code uses msprime that doesn't store the pedigree. Will add some wf_sim tree sequences to the sort tests for now.

@petrelharp
Copy link
Contributor

While working on the python tests I realised we also need to tweak the canonicalisation as the sort there no longer meets the requirements of a tree sequence.

But canonicalize just uses the standard sort with an enhanced mutation sort so it'll get this change?

@benjeffery
Copy link
Member Author

While working on the python tests I realised we also need to tweak the canonicalisation as the sort there no longer meets the requirements of a tree sequence.

But canonicalize just uses the standard sort with an enhanced mutation sort so it'll get this change?

Ah, so I think the sort of individuals will be valid then. However, their sort order will depend on the ordering of the parents in each individual though - is that ok for canonicalisation or do we need to have a canonical ordering of the parent arrays?

@petrelharp
Copy link
Contributor

Ah, so I think the sort of individuals will be valid then. However, their sort order will depend on the ordering of the parents in each individual though - is that ok for canonicalisation or do we need to have a canonical ordering of the parent arrays?

Oh, right - canonicalize uses subset to put the parents in order by node. That happens before sort, so I guess that with this change, canonicalise will sort individuals by "minimum number of hops to a childless descendant, and then smallest node among the closest childless descendant". It's still canonical, given the node ordering, I think? This will need some modifications to canonicalise's docs, anyhow.

I guess that now subset can produce invalid tree sequences, though, since it reorders individuals by their nodes.

@jeromekelleher
Copy link
Member

I guess that now subset can produce invalid tree sequences, though, since it reorders individuals by their nodes.

Ah, that's unfortunate...

@benjeffery
Copy link
Member Author

benjeffery commented Feb 26, 2021

I guess that now subset can produce invalid tree sequences, though, since it reorders individuals by their nodes.

Yes, this is true, I guess it can just sort the individuals as a last step?

It doesn't seem like subset reorders the parent array here, so the final sort order is still dependant on the order the parents were specified in.

Also I can't see where subset remaps the ids in individual.parent?

@petrelharp
Copy link
Contributor

Ok, how about this:

  • have subset leave the individual table ordering alone (so, still subset individuals, but don't re-order them), and
  • come up with a canonical ordering for individuals, that either happens here, or if it's difficult, that canonicalise does.

@petrelharp
Copy link
Contributor

As for the order: modifying what you've got here it'd be easy to efficiently compute the total number of descendants of each individual (by recursively adding up the values of all children), so then we could have (num_descendants, smallest referring node) as a canonical sort key. I dunno whether to swap that in here or just do it in canonicalize - what you have here is nice because it doesn't use qsort, but then why would we bother implementing two different sorts in different places?

@jeromekelleher
Copy link
Member

I dunno whether to swap that in here or just do it in canonicalize - what you have here is nice because it doesn't use qsort, but then why would we bother implementing two different sorts in different places?

Well, one argument would be one is O(n) and the other is O(n log n). I don't have a strong opinion about it though.

@benjeffery
Copy link
Member Author

As another option, I think if subset also sorts the parents in individual.parents then applying this sort will be canonical.

I don't have enough information to know how to prioritise the speed of sort vs the complexity of having two sorts. Sorting is a very common operation though so I err to having that be O(n).

If it makes sense I'd like to finish this PR as-is and then fix the canonical issues in a separate one?

@petrelharp
Copy link
Contributor

I think if subset also sorts the parents in individual.parents then applying this sort will be canonical.

Um, I guess this is true? I'm a little unsure about this, given that the outcome of this sort depends on the order of parents.

If it makes sense I'd like to finish this PR as-is and then fix the canonical issues in a separate one?

That makes sense, but I think we should make sure we have good path forward: we'd like to at least make sure that if you canonicalise, a subsequent sort doesn't change anything, for instance. And, that the order induced by canonicalise does not depend on the original order of the parent table, just on the order of the node table. And probably also not on the order that the different parents of a given individual are specified.

I'll think some more about this (although I'm ready to be done with sorting... =)

@jeromekelleher
Copy link
Member

jeromekelleher commented Feb 27, 2021

Agree with all the above, but:

: we'd like to at least make sure that if you canonicalise, a subsequent sort doesn't change anything, for instance.

I wouldn't get too hung up on this; once the output of sort is guaranteed to be legal input for tree sequences that's all that really matters. Canonical must be canonical though, there can be only one output from canonicalise.

@petrelharp
Copy link
Contributor

I wouldn't get too hung up on this; once the output of sort is guaranteed to be legal input for tree sequences that's all that really matters. Canonical must be canonical though, there can be only one output from canonicalise.

Ok, wise words there. Here's my proposal, then:

  1. make it so subset doesn't reorder the individual table (this is easy and makes it easier to understand anyhow)
  2. let sort do whatever (so: this PR is fine)
  3. change canonicalize so that it sorts individuals on (num descendants, id of first referring node) - this will be a bit annoying but straightforward

I've given it a fair bit of thought and come up with no better solution.

@jeromekelleher
Copy link
Member

Sounds like a plan, thanks @petrelharp.

So, I guess we merge the sort stuff here and file some issues to track the other two (which we resolve before release)?

@petrelharp
Copy link
Contributor

I've made a good start at subset and canonicalise (in #1226), and turned up another issue (union); I think I'll wait until this is merged to finish since I'm starting to step on the toes of this PR.

@benjeffery benjeffery marked this pull request as ready for review March 1, 2021 14:31
@benjeffery
Copy link
Member Author

Needs a squash and a changelog, but I finally think this is ready for review. I had to do the subset individual re-mapping (#1223 ) to get the tests to pass.

c/tskit/tables.c Outdated
tsk_id_t i;
tsk_individual_table_t copy;
tsk_individual_t individual;
tsk_individual_t copy_individual;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't use individual and copy_individual at the same time, so maybe they should both be individual?

copy.migrations[mig_id].dest,
copy.migrations[mig_id].time,
copy.migrations[mig_id].metadata,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean to put the migration sort stuff in here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, otherwise a comparison of the python and c sorting fails.

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Very small comments, and: how about throwing in a test that inserts some NULL parents in addition to the actual parents? (Or, maybe we should just disallow them.)

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, a few minor comments. The algorithm is quite beautiful!



def cmp_migration(i, j, tables):
ret = tables.migrations.time[i] - tables.migrations.time[j]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really matter I'd imagine, but

migration_i = tables.migrations[i]
migration_j = tables.migrations[j]
ret = migration_i.time - migration_j.time
# etc

would avoid repeatedly building and returning numpy arrays for each of the columns.

@jeromekelleher
Copy link
Member

Looks good! Very small comments, and: how about throwing in a test that inserts some NULL parents in addition to the actual parents? (Or, maybe we should just disallow them.)

I think we should allow NULL parents - there will be times when having parents [-1, -1] will be handy and others when parents=[] is more useful. We should allow both, I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sort should ensure ordering of individual parents (also, mutation parents?)

3 participants