updates to union, subset, and sort #1108

petrelharp · 2021-01-02T02:45:52Z

There's a few things in this PR:

implements canonical sort (closes make sort stricter #705 and incorporates canonical sorting #715)
removes the restriction on shared history for .union (closes tidy up union #1095)
extends subset to work on completely unsorted tables
added shuffle_tables and assert_table_collection_equals methods to tests/tsutil.py
added more tests for subset, union, and sort generally

Notes:

Canonical sort: This is an option, e.g. tables.sort(canonical=True) and tsk_table_collection_sort(tables, TSK_SORT_CANONICAL). In addition to providing a method to canonicalize tables, this will ensure that mutation parents come before their children, something we didn't previously have. This first runs subset to do the reordering of individuals and populations. It turned out our original proposal to sort mutations canonically (sort by mutation.parent) was just wrong, so instead we are sorting by number of "mutation descendants". (This is necessary when mutation times are not unique.) I was tempted to add this to the standard .sort(), but given that sort is a bottleneck in practice and we have no use case which produces tables with mutations out of parent-sorted order, I did not.
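To make the ordering concrete, here is a minimal pure-Python sketch of sorting mutations by number of "mutation descendants". It is not the C implementation in this PR: the helper names (`count_mutation_descendants`, `canonical_mutation_order`), the plain-list column layout, and the exact sort key (site, then older times first, then more descendants first, then original ID) are assumptions for illustration.

```python
def count_mutation_descendants(parent):
    """Count, for each mutation, how many mutations descend from it.

    parent[j] is the ID of mutation j's parent, or -1 if it has none.
    (Hypothetical helper; assumes the parent pointers contain no cycles.)
    """
    counts = [0] * len(parent)
    for j in range(len(parent)):
        # Walk up the chain of ancestors, crediting each one with j.
        u = parent[j]
        while u != -1:
            counts[u] += 1
            u = parent[u]
    return counts


def canonical_mutation_order(site, time, parent):
    """Return mutation IDs in the assumed canonical order."""
    counts = count_mutation_descendants(parent)
    return sorted(
        range(len(parent)),
        # Older (larger) times first; ties broken by more descendants first.
        key=lambda j: (site[j], -time[j], -counts[j], j),
    )


# Three mutations at one site with identical times: 2 -> 1 -> 0.
order = canonical_mutation_order(
    site=[0, 0, 0], time=[1.0, 1.0, 1.0], parent=[1, 2, -1]
)
```

In this example, sorting by the parent column alone would give [2, 0, 1], which puts mutation 0 ahead of its own parent (mutation 1). Ordering by descendant count gives [2, 1, 0], because a parent always has strictly more descendants than any of its children.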

keep_unused options: To allow subset to be used to reorder tables and not discard information, I had to add keep_unused_X arguments to it, for X in sites, individuals, and populations. This seems fine, although I'm tempted to either (a) just have a single keep_unused argument that does all three, or (b) remove these from the python subset and write a separate python method called reorder_nodes() that's the python front-end.

Unchanged populations: Something I ran across in the process is that it's annoying to keep populations consistent. I added unchanged_populations arguments (and a corresponding TSK_UNCHANGED_POPULATIONS flag) to both .subset( ) and .sort( ) that don't reorder populations. However, we should probably write reorder_populations (and reorder_individuals) methods to make this sort of thing easier to deal with.

assert table collections equal: I thought that this would close #1076, but now I wonder if that was meant to be a full method (including in the C library), rather than just in tsutil?

deduplicate_sites: turns out this method, with the addition of mutation times, needs a sort before and after it, strictly speaking.

This clearly needs a squash, but I think (hope?) everything works and is (possibly) good to go.

codecov · 2021-01-02T04:32:35Z

Codecov Report

Merging #1108 (7176062) into main (4f51148) will decrease coverage by 0.03%.
The diff coverage is 90.10%.

@@            Coverage Diff             @@
##             main    #1108      +/-   ##
==========================================
- Coverage   93.73%   93.70%   -0.04%     
==========================================
  Files          26       26              
  Lines       21254    21404     +150     
  Branches      902      902              
==========================================
+ Hits        19922    20056     +134     
- Misses       1294     1310      +16     
  Partials       38       38

Flag	Coverage Δ
c-tests	`92.45% <87.66%> (-0.07%)`	⬇️
lwt-tests	`92.97% <ø> (ø)`
python-c-tests	`94.90% <100.00%> (+0.01%)`	⬆️
python-tests	`98.63% <100.00%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
c/tskit/core.c	`96.97% <ø> (-0.03%)`	⬇️
c/tskit/core.h	`100.00% <ø> (ø)`
c/tskit/tables.c	`90.73% <87.66%> (-0.09%)`	⬇️
python/_tskitmodule.c	`91.42% <100.00%> (+0.04%)`	⬆️
python/tskit/tables.py	`99.60% <100.00%> (+<0.01%)`	⬆️
python/tskit/trees.py	`97.42% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

petrelharp · 2021-01-02T05:26:04Z

c/tskit/tables.c

+}
+
+static int
+tsk_table_sorter_sort_mutations_canonical(tsk_table_sorter_t *self)


This has a lot of code overlap with tsk_table_sorter_sort_mutations, and we could more cleanly just make canonical a flag that affects its behavior.

petrelharp · 2021-01-02T05:27:13Z

c/tskit/tables.c

+        }
+        subset_options
+            |= (TSK_SUBSET_KEEP_UNUSED_POPULATIONS | TSK_SUBSET_KEEP_UNUSED_INDIVIDUALS
+                | TSK_SUBSET_KEEP_UNUSED_SITES);


I am tempted to make a single macro that is TSK_SUBSET_KEEP_UNUSED = TSK_SUBSET_KEEP_UNUSED_POPULATIONS | TSK_SUBSET_KEEP_UNUSED_INDIVIDUALS | TSK_SUBSET_KEEP_UNUSED_SITES.

However, I think someone's going to want to use only some of these options at some point, so I vote to keep them separate.

petrelharp · 2021-01-02T05:28:58Z

c/tskit/tables.c

@@ -9206,36 +9385,56 @@ tsk_table_collection_subset(
    }

    // mutations and sites
-    i = 0;
+    // make a first pass through to build the mutation_map so that
+    // mutation parent can be remapped even if the table is not in order


This next bit got more complicated, but subset is as a result a lot more robust, not requiring any sort of sorting whatsoever.
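The two-pass idea can be sketched in a few lines of Python. This is only an illustration of the approach described in the diff comment, not the C code: the function name `subset_mutations`, the list-based columns, and the use of -1 for "null" are assumptions.

```python
def subset_mutations(mutation_node, mutation_parent, node_map):
    """Remap mutations after subsetting nodes, without assuming any ordering.

    node_map[old_node] is the new node ID, or -1 if the node is dropped.
    Returns a list of (new_node, new_parent) rows for retained mutations.
    """
    # First pass: decide which mutations survive and assign their new IDs,
    # so that a parent's new ID is known even if it appears after its child.
    mutation_map = [-1] * len(mutation_node)
    next_id = 0
    for j, node in enumerate(mutation_node):
        if node_map[node] != -1:
            mutation_map[j] = next_id
            next_id += 1

    # Second pass: write out the retained rows with remapped references.
    rows = []
    for j, node in enumerate(mutation_node):
        if mutation_map[j] != -1:
            p = mutation_parent[j]
            new_parent = -1 if p == -1 else mutation_map[p]
            rows.append((node_map[node], new_parent))
    return rows


# Mutation 0's parent is mutation 2, which comes *later* in the table.
rows = subset_mutations(
    mutation_node=[0, 1, 2],
    mutation_parent=[2, -1, -1],
    node_map=[1, -1, 0],  # keep nodes 2 and 0 (in that order), drop node 1
)
```

A single pass would have to look up the new ID of mutation 2 before it had been assigned, which is why the map is built up front.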

petrelharp · 2021-01-02T05:31:27Z

c/tskit/tables.h

@@ -684,6 +686,9 @@ typedef struct {
 /** @brief Do not run integrity checks before performing an operation. */
 #define TSK_NO_CHECK_INTEGRITY (1u << 29)

+/** @brief Do not reorder the population table. */
+#define TSK_UNCHANGED_POPULATIONS (1u << 28)


I think this might want to be a common option - and maybe even default, since changing population references could be a big gotcha.

petrelharp · 2021-01-02T05:34:13Z

python/tskit/tables.py

@@ -2648,6 +2668,9 @@ def deduplicate_sites(self):
        duplicate ``position`` (and keeping only the *first* entry for each
        site), and renumbering the ``site`` column of the mutation table
        appropriately.  This requires the site table to be sorted by position.
+
+        This method does not sort the tables afterwards, so mutations may no longer
+        be sorted by time.


I think we may need to do something about this, but it's a job for a different PR.

petrelharp · 2021-01-02T05:34:59Z

Ok - this is ready for someone to take a look at!

jeromekelleher

Overall looks good @petrelharp, although I haven't gone through everything in detail. A couple of high-level points, though.

jeromekelleher · 2021-01-04T12:14:46Z

python/tskit/trees.py

+        nodes,
+        record_provenance=True,
+        unchanged_populations=False,
+        keep_unused_populations=False,


It would be good to mirror the options that we have in simplify, which are filter_populations=True, filter_individuals=True, etc.

Unfortunately, we can't exactly mirror simplify. Here we've got three options:

1. remove unused pops
2. reorder pops, with unused ones at the end (keep_unused_populations=True)
3. keep pops the same (unchanged_populations=True)

This is because for canonicalize we need to reorder things, but not discard information - that's option (2). Similarly, the option for individuals toggles between options (1) and (2). But the filter_X flags for simplify toggle between options (1) and (3).

Maybe the best thing to do is to mirror simplify, and provide an extra flag that does option (2), e.g., TSK_CANONICALIZE.

I've gone ahead and done this change.

jeromekelleher · 2021-01-04T12:20:43Z

python/tskit/tables.py

@@ -2558,7 +2558,7 @@ def map_ancestors(self, *args, **kwargs):
        # A deprecated alias for link_ancestors()
        return self.link_ancestors(*args, **kwargs)

-    def sort(self, edge_start=0):
+    def sort(self, edge_start=0, canonical=False, unchanged_populations=False):


I'm not sure this is the right way to do this. Wouldn't it be better to add a separate function canonicalise, which does the sorting and optional filtering of unreferenced stuff? Is there a good reason to not make a separate method? It would make the C implementation clearer I think, as well.

Are you thinking that canonicalize would have a separate implementation, or would end up calling sort? I did it this way because there'd be a lot of code overlap.

I'm imagining that it would look something like (ignore crappy indentation)

int TSK_WARN_UNUSED
tsk_table_collection_canonicalise(tsk_table_collection_t *self, tsk_flags_t options)
{
    int ret = 0;
    tsk_id_t k;
    tsk_flags_t subset_options = options & TSK_UNCHANGED_POPULATIONS;
    tsk_id_t *nodes = NULL;
    tsk_table_sorter_t sorter;

    ret = tsk_table_sorter_init(&sorter, self, options);
    if (ret != 0) {
        goto out;
    }
    nodes = malloc(self->nodes.num_rows * sizeof(*nodes));
    if (nodes == NULL) {
        ret = TSK_ERR_NO_MEMORY;
        goto out;
    }
    for (k = 0; k < (tsk_id_t) self->nodes.num_rows; k++) {
        nodes[k] = k;
    }
    subset_options |= (TSK_SUBSET_KEEP_UNUSED_POPULATIONS
                       | TSK_SUBSET_KEEP_UNUSED_INDIVIDUALS
                       | TSK_SUBSET_KEEP_UNUSED_SITES);
    ret = tsk_table_collection_subset(
        self, nodes, self->nodes.num_rows, subset_options);
    if (ret != 0) {
        goto out;
    }
    sorter.sort_mutations = tsk_table_sorter_sort_mutations_canonical;
    ret = tsk_table_sorter_run(&sorter, start);
    if (ret != 0) {
        goto out;
    }
out:
    tsk_safe_free(nodes);
    tsk_table_sorter_free(&sorter);
    return ret;
}

I think we can add a flag to the tsk_table_sorter_t struct to decide whether we're sorting mutations in canonical order or just the standard order (i.e., I agree with you that the current duplication in tsk_table_sorter_sort_mutations_canonical isn't very nice).

And there's no changes to the current tsk_table_collection_sort.

How does that sound?

Sounds good, I had this in mind as an alternative also. I will make the change (and the other suggestions).

I think we can add a flag to the tsk_table_sorter_t struct to decide whether we're sorting mutations in canonical order or just the standard order (i.e., I agree with you that the current duplication in tsk_table_sorter_sort_mutations_canonical isn't very nice).

Ok, I've put back sort and made canonicalise, but I've kept tsk_table_sorter_sort_mutations_canonical separate from tsk_table_sorter_sort_mutations because the two functions use different types for sorted_mutations, and I don't know how to write the code in a way that removes the duplications.

jeromekelleher · 2021-01-04T12:24:18Z

c/tskit/tables.c

+    if (keep_individuals) {
+        for (k = 0; k < (tsk_id_t) tables.individuals.num_rows; k++) {
+            if (individual_map[k] == TSK_NULL) {
+                ret = tsk_individual_table_get_row(&tables.individuals, k, &ind);


Can use the unsafe versions of the get_row functions here since we're in charge of the ID values.

jeromekelleher · 2021-01-04T12:30:07Z

assert table collections equal: I thought that this would close #1076, but now I wonder if that was meant to be a full method (including in the C library), rather than just in tsutil?

The idea of #1076 is to add this to the public Python API so that downstream projects like msprime etc can use it in their test cases, and get informative errors when the tables aren't equal. We don't want to do this at the C level specifically because we want to provide good errors. A good approach might be to first try to see if the tables are equal using the low-level C call, and then if they are not, do some table-by-table and maybe row-by-row comparisons in Python to figure out exactly where the tables differ.
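As a rough sketch of that approach (fast whole-collection check first, then progressively finer comparisons to build the error message), here is a stand-alone Python version. It uses a plain dict of row lists in place of a real table collection, and `assert_tables_equal` is a hypothetical name, so none of this is tskit's actual API.

```python
def assert_tables_equal(t1, t2):
    """Raise an informative AssertionError if two table collections differ.

    Each collection is modelled as {table_name: [row, row, ...]}.
    """
    # Cheap check first; in the real thing this would be the low-level
    # C equality call.
    if t1 == t2:
        return
    # Only on failure do the slower table-by-table, row-by-row comparison.
    for name in sorted(set(t1) | set(t2)):
        rows1 = t1.get(name, [])
        rows2 = t2.get(name, [])
        if rows1 == rows2:
            continue
        if len(rows1) != len(rows2):
            raise AssertionError(
                f"{name} table: {len(rows1)} rows != {len(rows2)} rows"
            )
        for j, (r1, r2) in enumerate(zip(rows1, rows2)):
            if r1 != r2:
                raise AssertionError(
                    f"{name} table differs at row {j}: {r1} != {r2}"
                )
    # Reached only when the collections differ in which tables are present.
    raise AssertionError("table collections differ")
```

The equal case stays cheap because the detailed comparison only runs once the fast check has already failed.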

jeromekelleher · 2021-01-04T12:35:04Z

TSK_UNCHANGED_POPULATIONS

Is a good idea and definitely useful, but I'm not sure about the name. It's in a different tense to the rest of the options (or something). The rest of the options, like say TSK_FILTER_SITES, describe actions, but this is describing a desired state. Something like TSK_NO_REORDER_POPULATIONS might be more consistent?

jeromekelleher · 2021-01-04T12:36:16Z

c/tskit/tables.h

@@ -717,6 +722,11 @@ typedef struct {
 /* Flags for table init. */
 #define TSK_NO_METADATA (1 << 0)

+/* Flags for subset() */
+#define TSK_SUBSET_KEEP_UNUSED_POPULATIONS (1 << 0)


We could reuse the TSK_FILTER_POPULATIONS options etc for simplify?

petrelharp · 2021-01-06T05:01:39Z

Ok - up-to-date now. I'm not sure how to de-duplicate the mutation sorting code without making the standard case less efficient. Have a look?

jeromekelleher · 2021-01-06T17:35:12Z

Sorry I've been slow to get back to this @petrelharp, I've been tied up with PopGroup. Will take a good look tomorrow!

petrelharp · 2021-01-06T17:36:17Z

No worries!

jeromekelleher

I've had a good look at the Python end of this and it looks good @petrelharp, but I think I'm confused about the entry point here. This started out as you wanting to do something specific, and has now turned into us defining a canonical form for tree sequences (which is great!). I'm not sure that the canonical form that I have in mind (i.e., no redundancy, at least by default) actually does what you want, though.

You want to keep population IDs stable through some union/subset etc operations, right?

python/tests/tsutil.py

jeromekelleher · 2021-01-07T12:26:20Z

python/tests/tsutil.py

+                    )
+                    k += 1
+    tables.sort()
+    tables.compute_mutation_parents


Missing (), currently doesn't do anything.
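For anyone wondering why this passes silently: in Python, naming a bound method without parentheses is just an expression that evaluates to the method object and is then discarded. A tiny stand-in shows the difference (the `FakeTables` class is hypothetical and only mimics the method name, it is not tskit's TableCollection):

```python
class FakeTables:
    """Hypothetical stand-in that records whether the method actually ran."""

    def __init__(self):
        self.parents_computed = False

    def compute_mutation_parents(self):
        self.parents_computed = True


tables = FakeTables()

tables.compute_mutation_parents  # evaluates to a method object; nothing runs
ran_without_parens = tables.parents_computed

tables.compute_mutation_parents()  # this is the actual call
ran_with_parens = tables.parents_computed
```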

jeromekelleher · 2021-01-07T12:27:15Z

python/tests/tsutil.py

+    return tables.tree_sequence()
+
+
+def insert_discrete_time_mutations(ts, num_times=4):


A quick description of what the function does would be helpful

jeromekelleher · 2021-01-07T12:42:44Z

python/tests/tsutil.py

+    """
+    rng = random.Random(seed)
+    orig = tables.copy()
+    tables.nodes.clear()


Need to raise an error here if migrations is non-empty (or implement?)
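A minimal sketch of the kind of guard meant here, on a toy version of node shuffling. The dict-of-lists layout, the function name `shuffle_nodes`, and the choice of ValueError are assumptions for illustration and not the tsutil code under review.

```python
import random


def shuffle_nodes(tables, seed):
    """Return a copy of the toy tables with nodes in random order.

    tables is {"nodes": [...], "edges": [(parent, child), ...],
    "migrations": [...]}. Edge references are remapped to the new node IDs.
    """
    if len(tables["migrations"]) > 0:
        # Migration rows also refer to nodes; rather than silently leaving
        # them pointing at stale IDs, refuse to run.
        raise ValueError("shuffling tables with migrations is not supported")
    rng = random.Random(seed)
    order = list(range(len(tables["nodes"])))
    rng.shuffle(order)  # order[new_id] == old_id
    node_map = [0] * len(order)
    for new_id, old_id in enumerate(order):
        node_map[old_id] = new_id
    return {
        "nodes": [tables["nodes"][old_id] for old_id in order],
        "edges": [(node_map[p], node_map[c]) for p, c in tables["edges"]],
        "migrations": [],
    }
```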

python/tskit/tables.py

jeromekelleher · 2021-01-07T13:10:26Z

python/tskit/tables.py

+        filter_individuals=True,
+        filter_sites=True,
+        canonicalise=False,
+    ):
        """
        Modifies the tables in place to contain only the entries referring to
        the provided list of nodes, with nodes reordered according to the order


More precisely "list of node IDs" - someone might think these are Node objects.

jeromekelleher · 2021-01-07T13:11:23Z

python/tskit/tables.py

        """
        Modifies the tables in place to contain only the entries referring to
        the provided list of nodes, with nodes reordered according to the order
        they appear in the list. See :meth:`TreeSequence.subset` for a more
        detailed description.

+        Note: the tables can be completely unsorted.


Maybe "There are no sortedness requirements for the tables".

jeromekelleher · 2021-01-07T13:16:40Z

python/tskit/tables.py

        :param list nodes: The list of nodes for which to retain information. This
            may be a numpy array (or array-like) object (dtype=np.int32).
        :param bool record_provenance: Whether to record a provenance entry
            in the provenance table for this operation.
+        :param bool filter_populations: Whether to remove populations not referenced by
+            retained nodes. If False, the population table will remain unchanged.


They will be reordered, though, right?

No! I changed the behavior, to match simplify.

jeromekelleher · 2021-01-07T13:17:18Z

python/tskit/tables.py

+            retained nodes. If False, the individuals table will remain unchanged.
+        :param bool filter_sites: Whether to remove sites not referenced by
+            retained mutations. If False, the site table will remain unchanged.
+        :param bool canonicalise: If True, retains all unused entries, putting


I don't think the canonical form should include any redundant information (see note above), so I don't think this is the right name for the parameter. Seems like we're getting into a bit of a mess here just to keep unreferenced objects in the data model. Is this something that we really need?

petrelharp · 2021-01-08T05:35:24Z

Ok, the question is: should canonicalise remove unreferenced individuals and populations? (See comment and comment.) Gee, I'm not sure.

Here's some use cases:

1. Testing for shared overlap in union: we want to know that everything relevant to the nodes in question is equivalent. So, removing unreferenced nodes/individuals is OK.
2. Use in test suites to check that the C and python versions do exactly the same thing (e.g., for subset or simplify). In these cases, I could imagine it allowing bugs to slip through the cracks (e.g., if we accidentally leave some extra populations in but they're not referenced?) but it seems kinda unlikely.
3. Use in test suites for operations that should be "no change", like t2 = ts.copy(); no_op(t2); t1.canonicalise(); t2.canonicalise(); assert t1 == t2. Dropping unreferenced inds/pops would mean we couldn't test no_op on tables with unreferenced things.
4. Putting totally out-of-order tables in order so mutation parents come before children. Here it seems bad to drop unreferenced things, although arguably this functionality should be provided by a different method.

I do like the idea of dropping unreferenced things, since we aren't actually canonically sorting these. But for (2-4) it does seem good to leave them in. So, maybe it should be optional, defaulting to False?

Right now, subset has an option to retain unreferenced nodes and individuals. I could rename this from canonicalise to keep_unused_individuals and keep_unused_populations, and add the same options to TableCollection.canonicalise to control this behavior?

Or, maybe these test cases are unconvincing?

petrelharp · 2021-01-08T05:39:40Z

I got the rest of those changes in, btw.

jeromekelleher

Apologies it's taken so long to get to this @petrelharp. I'm still confused though, unfortunately. My takes:

The TableCollection.canonicalise() method is a useful operation which we should support. By default it should remove redundant information, but we should add a keep_unreferenced=False option to allow a user to keep unreferenced information at the end of the tables. It's probably not worth having keep_unreferenced_populations, keep_unreferenced_individuals etc, but we can imagine doing this, if needed. Unreferenced stuff should definitely go at the end of the table, though, as otherwise we're not canonicalising in any meaningful way.
The subset(nodes) operation no longer requires sorted input - great!
The subset(nodes) operation now has a bunch of options and complicated semantics I don't understand 😿
We might be able to replace the complicated semantics of subset(nodes) with a reorder_nodes(nodes) operation which rejiggles the node table a bit but doesn't change any other tables?

I've made some comments above, but I think they probably reflect my confusion in that some are probably contradictory - sorry about that.

jeromekelleher · 2021-01-25T11:47:21Z

c/tskit/tables.h

 #define TSK_FILTER_SITES (1 << 0)
 #define TSK_FILTER_POPULATIONS (1 << 1)
 #define TSK_FILTER_INDIVIDUALS (1 << 2)
 #define TSK_REDUCE_TO_SITE_TOPOLOGY (1 << 3)
 #define TSK_KEEP_UNARY (1 << 4)
 #define TSK_KEEP_INPUT_ROOTS (1 << 5)
+#define TSK_CANONICALISE (1 << 6)


Seems problematic to me to add a TSK_CANONICALISE flag to any operation, since (conceptually) all it will be doing is calling canonicalise(). Rather than,

tables.operation(canonicalise=True)

wouldn't it be more flexible (and involve less code duplication for us) to do

tables.operation()
tables.canonicalise()

That's not what this flag does - this flag is the thing we use within tsk_table_collection_canonicalise. (I will explain later.)

jeromekelleher · 2021-01-25T11:48:10Z

c/tskit/tables.h

@@ -2629,6 +2632,22 @@ TSK_NO_CHECK_INTEGRITY
 int tsk_table_collection_sort(
    tsk_table_collection_t *self, const tsk_bookmark_t *start, tsk_flags_t options);

+/**
+@brief Puts the tables into canonical order.


I'd prefer "canonical form" rather than "canonical order" but 🤷

jeromekelleher · 2021-01-25T11:50:44Z

c/tskit/tables.h

-3. Edges: if both parent and child are retained nodes.
-4. Mutations: if the mutation's node is a retained node.
-5. Sites: if any mutations remain at the site after removing mutations.
+2. Individuals, if `TSK_FILTER_INDIVIDUALS`: if referred to by a retained node,


I don't think the documentation for the flags makes sense now. How about:

Individuals: In the order they are referred to by retained nodes, with any remaining unreferenced individuals appended afterwards in their original order. If the TSK_FILTER_INDIVIDUALS flag is provided, unreferenced individuals are not retained.

sounds good

oh wait, that's not what this does at the moment...

jeromekelleher · 2021-01-25T12:04:51Z

c/tskit/tables.h

-order as in the original tables.
+order as in the original tables. If any of `TSK_FILTER_INDIVIDUALS`,
+`TSK_FILTER_POPULATIONS`, or `TSK_FILTER_SITES` are *not* provided,
+then the respective tables will be *unchanged*.


Oh, you want the other tables entirely unchanged? Ah, that's what I've been confused about. Right, you don't want to reorder the populations because you want stable population/individual IDs.

Well, what if we add a flag TSK_STABLE_IDs or something to subset then, which does the least amount of ID reshuffling possible while still doing the basic node-subset operation? This could be mutually exclusive with the FILTER_X operations, or we could drop them entirely if they're not useful.

Or maybe your original idea of a reorder_nodes function is better? I'm finding it hard to understand the desired semantics of the subset function at this point, so maybe another function is a better idea?

petrelharp · 2021-01-25T15:57:28Z

TODO:

collapse the two keep_unreferenced_X options into one
add keep_unreferenced as an option to canonicalise
figure out what to do with subset's many options

And, hm, let's see. Dredging this back up from my memory... We have three behaviors for subset:

1. reorder and remove unreferenced individuals and populations - needed for check_subset_equality
2. reorder but don't remove unreferenced - needed for canonicalise (with keep_unreferenced)
3. don't touch the individual and population tables - useful to keep population references stable (the motivation for keeping individuals is less obvious); and to match the behavior of simplify with filter_X=False

I do think we need to do all of these things. You're suggesting separating out (2) as a different method, tsk_table_collection_reorder_nodes, maybe? I like this option? It would make the "reorder nodes" operation more discoverable. I may not be up for the refactoring required to get reorder to not have a bunch of code duplicated from subset right now, though.

Does that make things clear? What do you think?

jeromekelleher · 2021-01-25T18:49:36Z

This is very helpful, thanks @petrelharp. Also very tricky!

It would be nice if we could separate out the operations of subsetting from ordering, but I see how these are intrinsically tied up at the moment. Suppose we defined tables.subset(nodes) so that it only touches the node table and keeps all the other tables as-is.

Then, we have a canonicalise(keep_unreferenced=False) function which puts the tables in a canonical form, which can optionally not remove unreferenced items.

So, your operations above become:

1. tables.subset(nodes); tables.canonicalise()
2. (part of the implementation of canonicalise)
3. tables.canonicalise(keep_unreferenced=False)

I don't think it's that much less efficient to split up the subset() and ordering() concerns really. I guess one concern though is that we've released subset() under the current semantics, so we shouldn't change them too lightly.

petrelharp · 2021-01-25T19:04:37Z

That also seems fine, except for changing the behavior of subset post-release. Probably no-one is relying on it? If someone is, it's @mufernando ... could you have a look at the last two comments and see what you think?

jeromekelleher · 2021-01-25T19:35:55Z

I don't know, it's all very tricky. On the one hand it's nice that subset() sort of "radiates out" from the given set of nodes and, optionally discards or not unreferenced entities. Nodes are the focal part of the data model, so it makes sense to make them the central entity we subset by. On the other hand, all the different keep/don't keep/reorder/don't reorder options are very hard to get your head around and keep track of.

What if there were just two options to subset then? By default we reorder and filter unreferenced; if KEEP_UNREFERENCED we keep unreferenced entities at the end of the tables; and if MAINTAIN_REFERENCES we keep all other references besides nodes and edges stable. The canonicalise function can then call subset() and do some extra sorting.

Would this cover all the requirements?

@mufernando, any input here?

jeromekelleher · 2021-01-26T08:58:11Z

Hey, what if we made a method called rejigger_ids, that would solve all our problems! (:zany_face: )

mufernando · 2021-01-26T13:40:38Z

gee, this is way more complicated than I initially thought!

I don't know, it's all very tricky. On the one hand it's nice that subset() sort of "radiates out" from the given set of nodes and, optionally discards or not unreferenced entities. Nodes are the focal part of the data model, so it makes sense to make them the central entity we subset by. On the other hand, all the different keep/don't keep/reorder/don't reorder options are very hard to get your head around and keep track of.

What if there were just two options to subset then? By default we reorder and filter unreferenced; if KEEP_UNREFERENCED we keep unreferenced entities at the end of the tables; and if MAINTAIN_REFERENCES we keep all other references besides nodes and edges stable. The canonicalise function can then call subset() and do some extra sorting.

Would this cover all the requirements?

@mufernando, any input here?

I agree with this, we defined subset as an operation on a set of nodes, but that also affects the other tables and we should stick to that. Some of the other, more "simple" behavior that the filter_X flags are doing should really be relegated to a different operation such as reorder_nodes, which we can implement later on and refactor subset to use it.

I'm not exactly clear on what the MAINTAIN_REFERENCES would do though.

petrelharp · 2021-01-26T21:48:29Z

I'll have a look and try something out. Maybe if I just put some more effort into the documentation, it'll all be clear and natural as-is. =)

petrelharp · 2021-01-26T23:55:44Z

Another data point on how we ended up with the current set of options:

petrelharp · 2021-01-27T00:20:42Z

Looking above, @jeromekelleher, I see that you maybe weren't entirely happy with this being what "canonicalise" means. If you had something else in mind, we could reserve that keyword for some other yet-to-be-defined operation, and implement this as sort(..., complete=True) or something.

petrelharp · 2021-01-27T06:02:53Z

Looking back, one option is to remove (3) entirely: we don't have a particular use case for "keep individual/population/site tables unchanged". Here's why I implemented it, though: I wanted to write a test where I (1) used subset to extract from a tree sequence both (a) all the relationships to a single individual; and (b) everything else, and then (2) use union to put (a) and (b) back together to get the original tree sequence. However, this runs into problems if the population table changes: if everything in (a) is from population 1, for instance, then what was population (1) becomes population (0) in (a), and so the pieces we union back together are no longer compatible (since we don't have a population_map argument to union). I thought that maintaining sites might be useful to people who, say, want to have genotypes comparable across different subsets. I don't have any use case for keeping individual tables unchanged, and implemented it for symmetry with simplify.

So: I'd be willing to remove the filter_sites, filter_individuals and filter_populations flags entirely. It seems a bit of a shame, because subset is a fairly low-level tool, and having it be more flexible seems helpful (as it was for the task above).

I think the reason this is difficult is because of how things are named in simplify: since it uses filter_X to toggle between "keep unchanged" and "reorder and remove", it's hard to come up with another option that does the halfway operation of "just reorder, but don't remove". Also, maybe we should not mirror simplify, as we want to avoid conflating the two in people's minds (they do pretty different things but are easy to conflate!).

So, here's a proposal:

def subset(self, nodes, record_provenance=True,
                  reorder_populations=True,
                  remove_unreferenced=True):
    """
    Returns a tree sequence containing only information directly referencing the provided list of nodes to retain.
    The result will retain only the nodes whose IDs are listed in ``nodes``, only edges for which both parent
    and child are in ``nodes``, only mutations whose node is in ``nodes``, and only individuals that are referred to
    by one of the retained nodes.

    This has the side effect of reordering the nodes, individuals, and populations in the tree sequence: the nodes
    in the new tree sequence will be in the order provided in ``nodes``, and both individuals and populations will be
    ordered by the earliest retained node that refers to them. (However, ``reorder_populations`` may be set to False
    to keep the population table unchanged.)

    By default, the method removes all individuals and populations not referenced by any nodes, and all sites not
    referenced by any mutations. To retain these unreferenced individuals, populations, and sites, pass
    ``remove_unreferenced=False``. If this is done, the site table will remain unchanged, unreferenced individuals
    will appear at the end of the individuals table (and in their original order), and unreferenced populations will
    appear at the end of the population table (unless ``reorder_populations=False``).
    """

In this case, the three operations above are:

1. reorder and remove unreferenced individuals and populations - subset( )
2. reorder but don't remove unreferenced - subset(remove_unreferenced=False)
3. don't touch the population tables - subset(reorder_populations=False)

I've removed "don't touch individual tables" from (3) because I don't think there's really a use case for that. If there is, though, we could add reorder_individuals later.
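To check that the three behaviours really fall out of these two arguments, here is a small pure-Python sketch of how the new ordering of a referenced table (populations, say) could be derived. The function `build_reference_map` and its arguments are hypothetical and only model the semantics described in the proposed docstring, not the C implementation.

```python
def build_reference_map(refs, num_rows, remove_unreferenced=True, reorder=True):
    """Work out the new row order for a table referenced by retained nodes.

    refs[k] is the row referenced by the k-th retained node (in the new node
    order), or -1 for no reference. Returns (order, id_map) where order lists
    old row IDs in their new order and id_map[old] is the new ID (or -1 if
    the row is removed).
    """
    # Rows in order of the earliest retained node that refers to them.
    referenced = []
    seen = set()
    for r in refs:
        if r != -1 and r not in seen:
            seen.add(r)
            referenced.append(r)

    if not reorder:
        # Behaviour 3: leave the table exactly as it is.
        order = list(range(num_rows))
    else:
        order = list(referenced)
        if not remove_unreferenced:
            # Behaviour 2: unreferenced rows go at the end, original order.
            order += [j for j in range(num_rows) if j not in seen]

    id_map = [-1] * num_rows
    for new_id, old_id in enumerate(order):
        id_map[old_id] = new_id
    return order, id_map


# Retained nodes refer to populations 2, 2, 0 out of four populations.
refs = [2, 2, 0]
default = build_reference_map(refs, 4)
keep = build_reference_map(refs, 4, remove_unreferenced=False)
stable = build_reference_map(refs, 4, reorder=False)
```

With the defaults, population 2 becomes population 0, which is the kind of renumbering that makes the subset pieces incompatible for union; passing reorder=False keeps the IDs stable.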

petrelharp · 2021-01-27T17:03:50Z

Update:

my second proposal above sounds good for subset
it remains to decide if canonicalise removes unreferenced things or not.

We don't need to retain unreferenced things for its use in union (and arguably it'd be better if it didn't, actually). So, here's my proposal:

add a remove_unreferenced=True argument to canonicalise.

jeromekelleher · 2021-01-27T17:15:12Z

I like the new signature for subset, and canonicalise(remove_unreferenced=True) seems good to me. (Also keep_unreferenced=False is fine too, if this works better in terms of syncing with the C flags. I guess you'd have to do it as TSK_KEEP_UNREFERENCED, for options=0 to mean "remove unreferenced", right?)
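The flag polarity point can be sketched as follows: the Python keyword stays remove_unreferenced=True, while the low-level bit is the negation, so that passing no options means the default behaviour. The constant's bit position and the helper name `canonicalise_options` are assumptions for illustration only.

```python
# Hypothetical bit value; the real constant would be defined in the C header.
KEEP_UNREFERENCED = 1 << 0


def canonicalise_options(remove_unreferenced=True):
    """Translate the Python keyword into a C-style options bitmask."""
    options = 0
    if not remove_unreferenced:
        # Only the non-default behaviour sets a bit, so options == 0
        # always means "remove unreferenced".
        options |= KEEP_UNREFERENCED
    return options
```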

jeromekelleher · 2021-01-27T17:15:47Z

Apologies for sending things in the wrong direction with the misguided FILTER_X options, btw.

petrelharp · 2021-01-29T15:48:51Z

(Note: subset has not been released with flags, so this will be an API change.)

c/tskit/tables.c

petrelharp · 2021-01-30T00:11:40Z

I think this is ready to go, and passing all the tests, despite what github and codecov say (!?!?). Want to take a final (?) look, @jeromekelleher?

Note: I thought about calling the option keep_unreferenced rather than remove_unreferenced, but thought the double-negative default of keep_unreferenced=False was more confusing.

jeromekelleher · 2021-02-02T11:32:45Z

I've gone through this in detail @petrelharp, and I'm happy to merge. I've rebased to fixup recent breakages, and made a few small changes. Can you have a look at the last two commits please to make sure you're OK with the updates?

Some notes:

For new additions to the API, it really does help if we don't hard-code defaults into the method signature unless we're absolutely certain they will never change (i.e., in the presence of all possible future parameters).
Also good to make sure that new additions like this are keyword only.
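
As a minimal sketch of both points, assuming the usual pattern of a None sentinel: the signature below makes the new option keyword-only and resolves its default inside the function body, so the default can change later without changing the signature. `canonicalise_settings` is a hypothetical stand-in for illustration, not the actual method.

```python
def canonicalise_settings(*, remove_unreferenced=None):
    """Resolve options for a hypothetical canonicalise call.

    The bare * makes remove_unreferenced keyword-only, and None means
    "use whatever the current default is".
    """
    if remove_unreferenced is None:
        # The default lives here, not in the signature.
        remove_unreferenced = True
    return {"remove_unreferenced": bool(remove_unreferenced)}


print(canonicalise_settings())
print(canonicalise_settings(remove_unreferenced=False))

try:
    canonicalise_settings(False)  # positional use is rejected
except TypeError as err:
    print("TypeError:", err)
```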

petrelharp · 2021-02-02T20:00:17Z

LGTM! Thanks for fixing those things up.

petrelharp · 2021-02-02T20:03:24Z

I can squash & rebase later, but not right now - go ahead if you want.

petrelharp · 2021-02-02T22:53:12Z

Whoops, sorry, almost had a git mishap there.

petrelharp · 2021-02-02T22:57:21Z

Um, I guess I could click the "rebase and merge" button, but I'm nervous to since last time. (I guess we need to adjust our threshold on the codecov again, eh?)

jeromekelleher · 2021-02-03T07:00:57Z

> Um, I guess I could click the "rebase and merge" button, but I'm nervous to since last time. (I guess we need to adjust our threshold on the codecov again, eh?)

I think it's fine having multiple commits on something big like this. The codecov levels aren't part of the mergify rules, so we don't have to worry about them. We should really use the AUTOMERGE-REQUESTED label here on tskit for merging stuff rather than the buttons. (I've not been using it recently on msprime because there's too much stuff going on, and there isn't really anyone to do the reviewing; and someone can't approve their own PRs.)

assert equal method (closes tskit-dev#1076)
add keep_unused arguments to subset
py sort and tsutil to disorder ts
verifying C sort against Py sort
shuffle tables
sphinx fixup for subset options
extend subset to work on unsorted tables
canonicalise (closes tskit-dev#705)
test unrestricted union

Fix some other merge issues.

- Move C code to more logical locations.
- Update new method signatures to avoid hard-coding defaults.
- Various minor rephrasings.
- Fill in gap in C test coverage.

benjeffery · 2021-02-03T11:12:55Z

> (I've not been using it recently on msprime because there's too much stuff going on, and there isn't really anyone to do the reviewing; and someone can't approve their own PRs)

We could add a label to merge without review if you are the PR author?

jeromekelleher · 2021-02-03T11:18:29Z

> We could add a label to merge without review if you are the PR author?

It's not worth the bother - things will settle down soon enough in msprime, and it's bad practise anyway.

jeromekelleher reviewed Jan 4, 2021

jeromekelleher reviewed Jan 7, 2021

jeromekelleher reviewed Jan 25, 2021

jeromekelleher mentioned this pull request Jan 26, 2021

Support migrations in sort. #1131

Merged

petrelharp force-pushed the update_union branch from 11a7cee to 554dee8 Compare January 29, 2021 17:43

jeromekelleher force-pushed the update_union branch 2 times, most recently from 2656168 to 841c11a Compare February 2, 2021 11:29

petrelharp force-pushed the update_union branch 2 times, most recently from 92c359d to 841c11a Compare February 2, 2021 22:52

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Feb 3, 2021

jeromekelleher approved these changes Feb 3, 2021

petrelharp and others added 6 commits February 3, 2021 07:02

canonical form

6086af3

fixup subset API

10b21b3

added argument to canonicalise

0f3687a

Make individuals optional in C test parser.

357c45a

Fix some other merge issues.

Minor updates to subset

7176062

AdminBot-tskit force-pushed the update_union branch from 841c11a to 7176062 Compare February 3, 2021 07:02

mergify bot merged commit 5770b38 into tskit-dev:main Feb 3, 2021

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Feb 3, 2021

benjeffery mentioned this pull request Mar 16, 2021

canonical sorting #715

Closed
