-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
updates to union, subset, and sort #1108
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1108 +/- ##
==========================================
- Coverage 93.73% 93.70% -0.04%
==========================================
Files 26 26
Lines 21254 21404 +150
Branches 902 902
==========================================
+ Hits 19922 20056 +134
- Misses 1294 1310 +16
Partials 38 38
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
c/tskit/tables.c
Outdated
} | ||
|
||
static int | ||
tsk_table_sorter_sort_mutations_canonical(tsk_table_sorter_t *self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has a lot of code overlap with tsk_table_sorter_sort_mutations
, and we could more cleanly just make canonical
a flag that affects its behavior.
c/tskit/tables.c
Outdated
} | ||
subset_options | ||
|= (TSK_SUBSET_KEEP_UNUSED_POPULATIONS | TSK_SUBSET_KEEP_UNUSED_INDIVIDUALS | ||
| TSK_SUBSET_KEEP_UNUSED_SITES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am tempted to make a single macro that is TSK_SUBSET_KEEP_UNUSED = TSK_SUBSET_KEEP_UNUSED_POPULATIONS | TSK_SUBSET_KEEP_UNUSED_INDIVIDUALS | TSK_SUBSET_KEEP_UNUSED_SITES
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
however, I think someone's going to want to use only some of these options at some point, so I vote to keep them separate.
c/tskit/tables.c
Outdated
@@ -9206,36 +9385,56 @@ tsk_table_collection_subset( | |||
} | |||
|
|||
// mutations and sites | |||
i = 0; | |||
// make a first pass through to build the mutation_map so that | |||
// mutation parent can be remapped even if the table is not in order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This next bit got more complicated, but subset
is as a result a lot more robust, not requiring any sort of sorting whatsoever.
c/tskit/tables.h
Outdated
@@ -684,6 +686,9 @@ typedef struct { | |||
/** @brief Do not run integrity checks before performing an operation. */ | |||
#define TSK_NO_CHECK_INTEGRITY (1u << 29) | |||
|
|||
/** @brief Do not reorder the population table. */ | |||
#define TSK_UNCHANGED_POPULATIONS (1u << 28) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might want to be a common option - and myabe even default, since changing population references could be a big gotcha.
python/tskit/tables.py
Outdated
@@ -2648,6 +2668,9 @@ def deduplicate_sites(self): | |||
duplicate ``position`` (and keeping only the *first* entry for each | |||
site), and renumbering the ``site`` column of the mutation table | |||
appropriately. This requires the site table to be sorted by position. | |||
|
|||
This method does not sort the tables afterwards, so mutations may no longer | |||
be sorted by time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may need to do something about this, but it's a job for a different PR.
Ok - this is ready for someone to take a look at! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good @petrelharp, although I haven't gone through everything in detail. A couple of high-level points, though.
python/tskit/trees.py
Outdated
nodes, | ||
record_provenance=True, | ||
unchanged_populations=False, | ||
keep_unused_populations=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to mirror the options that we have in simplify
, which are filter_populations=True
, filter_individuals=True
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, we can't exactly mirror simplify
. Here we've got three options:
- remove unused pops
- reorder pops, with unused ones at the end (
keep_unused_populations=True
) - keep pops the same (
unchanged_populations=True
)
This is because for canonicalize
we need to reorder things, but not discard information - that's option (2). Similarly, the option for individuals
toggles between options (1) and (2). But the filter_X
flags for simplify
toggle between options (1) and (3).
Maybe the best thing to do is to mirror simplify
, and provide an extra flag that does option (2), e.g., TSK_CANONICALIZE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone ahead and done this change.
python/tskit/tables.py
Outdated
@@ -2558,7 +2558,7 @@ def map_ancestors(self, *args, **kwargs): | |||
# A deprecated alias for link_ancestors() | |||
return self.link_ancestors(*args, **kwargs) | |||
|
|||
def sort(self, edge_start=0): | |||
def sort(self, edge_start=0, canonical=False, unchanged_populations=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the right way to do this. Wouldn't it be better to add a separate function canonicalise
, which does the sorting and optional filtering of unreferenced stuff? Is there a good reason to not make a separate method? It would make the C implementation clearer I think, as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking that canonicalize
would have a separate implementation, or would end up calling sort
? I did it this way because there'd be a lot of code overlap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm imagining that it would look something like (ignore crappy indentation)
int TSK_WARN_UNUSED
tsk_table_collection_canonicalise(tsk_table_collection_t *self, tsk_flags_t options)
{
int ret = 0;
tsk_id_t k;
tsk_flags_t subset_options = options & TSK_UNCHANGED_POPULATIONS;
tsk_id_t *nodes = NULL;
tsk_table_sorter_t sorter;
ret = tsk_table_sorter_init(&sorter, self, options);
if (ret != 0) {
goto out;
}
nodes = malloc(self->nodes.num_rows * sizeof(*nodes));
if (nodes == NULL) {
ret = TSK_ERR_NO_MEMORY;
goto out;
}
for (k = 0; k < (tsk_id_t) self->nodes.num_rows; k++) {
nodes[k] = k;
}
subset_options
|= (TSK_SUBSET_KEEP_UNUSED_POPULATIONS | TSK_SUBSET_KEEP_UNUSED_INDIVIDUALS
| TSK_SUBSET_KEEP_UNUSED_SITES);
ret = tsk_table_collection_subset(
self, nodes, self->nodes.num_rows, subset_options);
if (ret != 0) {
goto out;
}
sorter.sort_mutations = tsk_table_sorter_sort_mutations_canonical;
ret = tsk_table_sorter_run(&sorter, start);
if (ret != 0) {
goto out;
}
out:
tsk_safe_free(nodes);
tsk_table_sorter_free(&sorter);
return ret;
}
I think we can add a flag to the tsk_table_sorter_t
struct to decide whether we're sorting mutations in canonical order or just the standard order (i.e., I agree with you that the current duplication in tsk_table_sorter_sort_mutations_canonical isn't very nice).
And there's no changes to the current tsk_table_collection_sort
.
How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I had this in mind as an alternative also. I will make the change (and the other suggestions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add a flag to the tsk_table_sorter_t struct to decide whether we're sorting mutations in canonical order or just the standard order (i.e., I agree with you that the current duplication in tsk_table_sorter_sort_mutations_canonical isn't very nice).
Ok, I've put back sort
and made canonicalise
, but I've kept tsk_table_sorter_sort_mutations_canonical
separate from tsk_table_sorter_sort_mutations
because the two functions use different types for sorted_mutations
, and I don't know how to write the code in a way that removes the duplications.
c/tskit/tables.c
Outdated
if (keep_individuals) { | ||
for (k = 0; k < (tsk_id_t) tables.individuals.num_rows; k++) { | ||
if (individual_map[k] == TSK_NULL) { | ||
ret = tsk_individual_table_get_row(&tables.individuals, k, &ind); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use the unsafe
versions of the the get_row functions here since we're in charge of the ID values.
The idea of #1076 is to add this to the public Python API so that downstream projects like msprime etc can use it in their test cases, and get informative errors when the tables aren't equal. We don't want to do this at the C level specifically because we want to provide good errors. A good approach might be to first try to see if the tables are equal using the low-level C call, and then if they are not, do some table-by-table and maybe row-by-row comparisons in Python to figure out exactly where the tables differ. |
Is a good idea and definitely useful, but I'm not sure about the name. It's in a different tense to the rest of the options (or something). The rest of the options like say TSK_FILTER_SITES describe actions, but this is describing a desired state. Something like |
c/tskit/tables.h
Outdated
@@ -717,6 +722,11 @@ typedef struct { | |||
/* Flags for table init. */ | |||
#define TSK_NO_METADATA (1 << 0) | |||
|
|||
/* Flags for subset() */ | |||
#define TSK_SUBSET_KEEP_UNUSED_POPULATIONS (1 << 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could reuse the TSK_FILTER_POPULATIONS options etc for simplify?
Ok - up-to-date now. I'm not sure how to de-duplicate the mutation sorting code without making the standard case less efficient. Have a look? |
Sorry I've been slow to get back to this @petrelharp, I've been tied up with PopGroup. Will take a good look tomorrow! |
No worries! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a good look at the Python end of this and it looks good @petrelharp, but I think I'm confused about the entry point here. This started out as you wanting to do something specific, and has now turned into us defining a canonical form for tree sequences (which is great!). I'm not sure that the canonical form that I have in mind (i.e., no redundancy, at least by default) actually does what you want, though.
You want to keep population IDs stable through some union/subset etc operations, right?
python/tests/tsutil.py
Outdated
) | ||
k += 1 | ||
tables.sort() | ||
tables.compute_mutation_parents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing (), currently doesn't to anything.
python/tests/tsutil.py
Outdated
return tables.tree_sequence() | ||
|
||
|
||
def insert_discrete_time_mutations(ts, num_times=4): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick description of what the function does would be helpful
""" | ||
rng = random.Random(seed) | ||
orig = tables.copy() | ||
tables.nodes.clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to raise an error here if migrations
is non-empty (or implement?)
python/tskit/tables.py
Outdated
filter_individuals=True, | ||
filter_sites=True, | ||
canonicalise=False, | ||
): | ||
""" | ||
Modifies the tables in place to contain only the entries referring to | ||
the provided list of nodes, with nodes reordered according to the order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More precisely "list of node IDs" - someone might thing these are Node objects.
python/tskit/tables.py
Outdated
""" | ||
Modifies the tables in place to contain only the entries referring to | ||
the provided list of nodes, with nodes reordered according to the order | ||
they appear in the list. See :meth:`TreeSequence.subset` for a more | ||
detailed description. | ||
|
||
Note: the tables can be completely unsorted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "There are no sortedness requirements for the tables".
python/tskit/tables.py
Outdated
:param list nodes: The list of nodes for which to retain information. This | ||
may be a numpy array (or array-like) object (dtype=np.int32). | ||
:param bool record_provenance: Whether to record a provenance entry | ||
in the provenance table for this operation. | ||
:param bool filter_populations: Whether to remove populations not referenced by | ||
retained nodes. If False, the population table will remain unchanged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They will be reordered, though, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No! I changed the behavior, to match simplify
.
python/tskit/tables.py
Outdated
retained nodes. If False, the individuals table will remain unchanged. | ||
:param bool filter_sites: Whether to remove sites not referenced by | ||
retained mutations. If False, the site table will remain unchanged. | ||
:param bool canonicalise: If True, retains all unused entries, putting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the canonical form should include any redundant information (see note above), so I don't think this is the right name for the parameter. Seems like we're getting into a bit of a mess here just to keep unreferenced objects in the data model. Is this something that we really need?
Ok, the question is: should Here's some use cases:
I do like the idea of dropping unreferenced things, since we aren't actually canonically sorting these. But for (2-4) it does seem good to leave them in. So, maybe it should be optional, defaulting to Right now, Or, maybe these test cases are unconvincing? |
I got the rest of those changes in, btw. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies it's taken so long to get to this @petrelharp. I'm still confused though, unfortunately. My takes:
- The
TableCollection.canonicalise()
method is a useful operation which we should support. By default it should remove redundant information, but we should add akeep_unreferenced=False
option to allow a user to keep unreferenced information at the end of the tables. It's probably not worth havingkeep_unreferenced_populations
,keep_unreferenced_individuals
etc, but we can imagine doing this, if needed. Unreferenced stuff should definitely go at the end of the table, though, as otherwise we're not canonicalising in any meaninful way. - The
subset(nodes)
operation no longer requires sorted input - great! - The
subset(nodes)
operation now has a bunch of options and complicated semantics I don't understand 😿 - We might be able to replace the complicated semantics of
subset(nodes)
withreorder_nodes(nodes)
operation which rejiggles the node table a bit but doesn't change any other tables?
I've made some comments above, but I think they probably reflect my confusion in that some are probably contradictory - sorry about that.
c/tskit/tables.h
Outdated
#define TSK_FILTER_SITES (1 << 0) | ||
#define TSK_FILTER_POPULATIONS (1 << 1) | ||
#define TSK_FILTER_INDIVIDUALS (1 << 2) | ||
#define TSK_REDUCE_TO_SITE_TOPOLOGY (1 << 3) | ||
#define TSK_KEEP_UNARY (1 << 4) | ||
#define TSK_KEEP_INPUT_ROOTS (1 << 5) | ||
#define TSK_CANONICALISE (1 << 6) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems problematic to me to add a TSK_CANONICALISE
flag to any operation, since (conceptually) all it will be doing is calling canonicalise()
. Rather than,
tables.operation(canonicalise=True)
wouldn't it be more flexible (and involve less code duplication for us) to do
tables.operation()
tables.canonicalise()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not what this flag does - this flag is the thing we use within tsk_table_collection_canonicalise. (I will explain later.)
c/tskit/tables.h
Outdated
@@ -2629,6 +2632,22 @@ TSK_NO_CHECK_INTEGRITY | |||
int tsk_table_collection_sort( | |||
tsk_table_collection_t *self, const tsk_bookmark_t *start, tsk_flags_t options); | |||
|
|||
/** | |||
@brief Puts the tables into canonical order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer "canonical form" rather than "canonical order" but 🤷
c/tskit/tables.h
Outdated
3. Edges: if both parent and child are retained nodes. | ||
4. Mutations: if the mutation's node is a retained node. | ||
5. Sites: if any mutations remain at the site after removing mutations. | ||
2. Individuals, if `TSK_FILTER_INDIVIDUALS`: if referred to by a retained node, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the documentation for the flags makes sense now. How about:
- Individuals: In the order they are referred to by retained nodes, with any remaining unreferenced individuals appended afterwards in their original order. If the
TSK_FILTER_INDIVIDUALS
flag is provided, unreferenced individuals are not retained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wait, that's not what this does at the moment...
c/tskit/tables.h
Outdated
order as in the original tables. | ||
order as in the original tables. If any of `TSK_FILTER_INDIVIDUALS`, | ||
`TSK_FILTER_POPULATIONS`, or `TSK_FILTER_SITES` are *not* provided, | ||
then the respective tables will be *unchanged*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you want the other tables entirely unchanged? Ah, that's what I've been confused about. Right, you don't want to reorder the populations because you want stable population/individual IDs.
Well, what if we add a flag TSK_STABLE_IDs
or something to subset
then which does the least amount of ID reshuffling as possible, which still doing the basic node-subset operation? This could be mutually exclusive to the FILTER_X operations, or we could drop them entirely if they're not useful.
Or maybe your original idea of a reorder_nodes
function is better? I'm finding it hard to understand the desired semantics of the subset
function at this point, so maybe another function is a better idea?
TODO:
And, hm, let's see. Dredging this back up from my memory... We have three behaviors for
I do think we need to do all of these things. You're suggesting separating out (2) as a different method, Does that make things clear? What do you think? |
This is very helpful, thanks @petrelharp. Also very tricky! It would be nice if we could separate out the operations of Then, we have a So, you're operations above become:
I don't think it's that much less efficient to split up the subset() and ordering() concerns really. I guess one concern though is that we've released subset() under the current semantics, so we shouldn't change them too lightly. |
That also seems fine, except for changing the behavior of |
I don't know, it's all very tricky. On the one hand it's nice that What if there was just two options to subset then? By default we reorder and filter unreferenced; if KEEP_UNREFERENCED we keep unreferenced entitites at the end of the tables; and if MAINTAIN_REFERENCES we keep all other references besides nodes and edges stable. The Would this cover all the requirements? @mufernando, any input here? |
Hey, what if we made a method called |
gee, this is way more complicated than I initially thought!
I agree with this, we defined I'm not exactly clear on what the MAINTAIN_REFERENCES would do though. |
I'll have a look and try something out. Maybe if I just put some more effort into the documentation, it'll all be clear and natural as-is. =) |
Another data point on how we ended up with the current set of options: |
Looking above, @jeromekelleher, I see that you maybe weren't entirely happy with this being what "canonicalise" means. If you had something else in mind, we could reserve that keyword for some other yet-to-be-defined operation, and implement this as |
Looking back, one option is to remove (3) entirely: we don't have a particular use case for "keep individual/population/site tables unchanged". Here's why I implemented it, though: I wanted to write a test where I (1) used So: I'd be willing to remove the I think the reason this is difficult is because of how things are named in So, here's a proposal:
In this case, the three operations above are:
|
Update:
We don't need to retain unreferenced things for its use in
|
I like the new signature for |
Apologies for sending things in the wrong direction with the misguided FILTER_X options, btw. |
(Note: subset has not been released with flags, so this will be an API change.) |
11a7cee
to
554dee8
Compare
I think this is ready to go, and passing all the tests, despite what github and codecov say (!?!?). Want to take a final (?) look, @jeromekelleher? Note: I thought about calling the option |
2656168
to
841c11a
Compare
I've gone through this in detail @petrelharp, and I happy to merge. I've rebased to fixup recent breakages, and made a few small changes. Can you have a look at the last two commits please to make sure you're OK with the updates? Some notes:
|
LGTM! Thanks for fixing those things up. |
I can squash & rebase later, but not right now - go ahead if you want. |
92c359d
to
841c11a
Compare
Whoops, sorry, almost had a git mishap there. |
Um, I guess I could click the "rebase and merge" button, but I'm nervous to since last time. (I guess we need to adjust our threshold on the codecov again, eh?) |
I think it's fine having multiple commits on something big like this. The codecov levels aren't part of the mergify rules, so we don't have to worry about them. We should really use the AUTOMERGE-REQUESTED label here on tskit for merging stuff rather than the buttons (I've not been using it recently on msprime because there's too much stuff going on, and there isn't really anyone to do the reviewing; and someone can't approve their own PRs) |
assert equal method (closes tskit-dev#1076) add keep_unused arguments to subset py sort and tsutil to disorder ts verifying C sort against Py sort shuffle tables sphinx fixup for subset options extend subset to work on unsorted tables canonicalalise (closes tskit-dev#705) test unrestricted union
Fix some other merge issues.
- Move C code to more logical locations. - Update new method signatures to avoid hard-coding defaults. - Various minor rephrasings. - Fill in gap in C test coverage.
841c11a
to
7176062
Compare
We could add a label to merge without review if you are the PR author? |
It's not worth the bother - things will settle down soon enough in msprime, and it's bad practise anyway. |
There's a few things in this PR:
.union
(closes tidy up union #1095)shuffle_tables
andassert_table_collection_equals
methods totests/tsutil.py
Notes:
Canonical sort: This is an option, e.g.
tables.sort(canonical=True)
andtsk_table_collection_sort(tables, TSK_SORT_CANONICAL)
. In addition to providing a method to canonicalize tables, this will ensure that mutation parents come before their children, something we didn't previously have. This first runssubset
to do the reordering of individuals and populations. It turned out our original proposal to sort mutations canonically (sort by mutation.parent) was just wrong, so instead we are sorting by number of "mutation descendants". (This is necessary when mutation times are not unique.) I was tempted to add this to the standard.sort()
, but given that sort is a bottleneck in practice and we have no use case which produces tables with mutations out of parent-sorted order, I did not.keep_unused options: To allow
subset
to be used to reorder tables and not discard information, I had to addkeep_unused_X
arguments to it, forX
insites
,individuals
, andpopulations
. This seems fine, although I'm tempted to (a) either just have a singlekeep_unused
argument that does all three, or (b) remove these from the python subset and write a separate python method calledreorder_ndoes( )
that's the python front-end.Unchanged populations: Something I ran across in the process is that it's annoying to keep populations consistent. I added
unchanged_populations
arguments (and a correspondingTSK_UNCHANGED_POPULATIONS
flag) to both.subset( )
and.sort( )
that don't reorder populations. However, we should probably writereorder_populations
(andreorder_individuals
) methods to make this sort of thing easier to deal with.assert table collections equal: I thought that this would close #1076, but now I wonder if that was meant to be a full method (including in the C library), rather than just in
tsutil
?deduplicate_sites: turns out this method, with the addition of mutation times, needs a sort before and after it, strictly speaking.
This clearly needs a squash, but I think (hope?) everything works and is (possibly) good to go.