Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updates to union, subset, and sort #1108

Merged
merged 6 commits into from
Feb 3, 2021
Merged

Conversation

petrelharp
Copy link
Contributor

@petrelharp petrelharp commented Jan 2, 2021

There's a few things in this PR:

  • implements canonical sort (closes make sort stricter #705 and incorporates canonical sorting #715)
  • removes the restriction on shared history for .union (closes tidy up union #1095)
  • extends subset to work on completely unsorted tables
  • added shuffle_tables and assert_table_collection_equals methods to tests/tsutil.py
  • added more tests for subset, union, and sort generally

Notes:

Canonical sort: This is an option, e.g. tables.sort(canonical=True) and tsk_table_collection_sort(tables, TSK_SORT_CANONICAL). In addition to providing a method to canonicalize tables, this will ensure that mutation parents come before their children, something we didn't previously have. This first runs subset to do the reordering of individuals and populations. It turned out our original proposal to sort mutations canonically (sort by mutation.parent) was just wrong, so instead we are sorting by number of "mutation descendants". (This is necessary when mutation times are not unique.) I was tempted to add this to the standard .sort(), but given that sort is a bottleneck in practice and we have no use case which produces tables with mutations out of parent-sorted order, I did not.

keep_unused options: To allow subset to be used to reorder tables and not discard information, I had to add keep_unused_X arguments to it, for X in sites, individuals, and populations. This seems fine, although I'm tempted to (a) either just have a single keep_unused argument that does all three, or (b) remove these from the python subset and write a separate python method called reorder_ndoes( ) that's the python front-end.

Unchanged populations: Something I ran across in the process is that it's annoying to keep populations consistent. I added unchanged_populations arguments (and a corresponding TSK_UNCHANGED_POPULATIONS flag) to both .subset( ) and .sort( ) that don't reorder populations. However, we should probably write reorder_populations (and reorder_individuals) methods to make this sort of thing easier to deal with.

assert table collections equal: I thought that this would close #1076, but now I wonder if that was meant to be a full method (including in the C library), rather than just in tsutil?

deduplicate_sites: turns out this method, with the addition of mutation times, needs a sort before and after it, strictly speaking.

This clearly needs a squash, but I think (hope?) everything works and is (possibly) good to go.

@codecov
Copy link

codecov bot commented Jan 2, 2021

Codecov Report

Merging #1108 (7176062) into main (4f51148) will decrease coverage by 0.03%.
The diff coverage is 90.10%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1108      +/-   ##
==========================================
- Coverage   93.73%   93.70%   -0.04%     
==========================================
  Files          26       26              
  Lines       21254    21404     +150     
  Branches      902      902              
==========================================
+ Hits        19922    20056     +134     
- Misses       1294     1310      +16     
  Partials       38       38              
Flag Coverage Δ
c-tests 92.45% <87.66%> (-0.07%) ⬇️
lwt-tests 92.97% <ø> (ø)
python-c-tests 94.90% <100.00%> (+0.01%) ⬆️
python-tests 98.63% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
c/tskit/core.c 96.97% <ø> (-0.03%) ⬇️
c/tskit/core.h 100.00% <ø> (ø)
c/tskit/tables.c 90.73% <87.66%> (-0.09%) ⬇️
python/_tskitmodule.c 91.42% <100.00%> (+0.04%) ⬆️
python/tskit/tables.py 99.60% <100.00%> (+<0.01%) ⬆️
python/tskit/trees.py 97.42% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f51148...7176062. Read the comment docs.

c/tskit/tables.c Outdated
}

static int
tsk_table_sorter_sort_mutations_canonical(tsk_table_sorter_t *self)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has a lot of code overlap with tsk_table_sorter_sort_mutations, and we could more cleanly just make canonical a flag that affects its behavior.

c/tskit/tables.c Outdated
}
subset_options
|= (TSK_SUBSET_KEEP_UNUSED_POPULATIONS | TSK_SUBSET_KEEP_UNUSED_INDIVIDUALS
| TSK_SUBSET_KEEP_UNUSED_SITES);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am tempted to make a single macro that is TSK_SUBSET_KEEP_UNUSED = TSK_SUBSET_KEEP_UNUSED_POPULATIONS | TSK_SUBSET_KEEP_UNUSED_INDIVIDUALS | TSK_SUBSET_KEEP_UNUSED_SITES.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however, I think someone's going to want to use only some of these options at some point, so I vote to keep them separate.

c/tskit/tables.c Outdated
@@ -9206,36 +9385,56 @@ tsk_table_collection_subset(
}

// mutations and sites
i = 0;
// make a first pass through to build the mutation_map so that
// mutation parent can be remapped even if the table is not in order
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This next bit got more complicated, but subset is as a result a lot more robust, not requiring any sort of sorting whatsoever.

c/tskit/tables.h Outdated
@@ -684,6 +686,9 @@ typedef struct {
/** @brief Do not run integrity checks before performing an operation. */
#define TSK_NO_CHECK_INTEGRITY (1u << 29)

/** @brief Do not reorder the population table. */
#define TSK_UNCHANGED_POPULATIONS (1u << 28)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might want to be a common option - and myabe even default, since changing population references could be a big gotcha.

@@ -2648,6 +2668,9 @@ def deduplicate_sites(self):
duplicate ``position`` (and keeping only the *first* entry for each
site), and renumbering the ``site`` column of the mutation table
appropriately. This requires the site table to be sorted by position.

This method does not sort the tables afterwards, so mutations may no longer
be sorted by time.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may need to do something about this, but it's a job for a different PR.

@petrelharp
Copy link
Contributor Author

Ok - this is ready for someone to take a look at!

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good @petrelharp, although I haven't gone through everything in detail. A couple of high-level points, though.

nodes,
record_provenance=True,
unchanged_populations=False,
keep_unused_populations=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to mirror the options that we have in simplify, which are filter_populations=True, filter_individuals=True, etc.

Copy link
Contributor Author

@petrelharp petrelharp Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we can't exactly mirror simplify. Here we've got three options:

  1. remove unused pops
  2. reorder pops, with unused ones at the end (keep_unused_populations=True)
  3. keep pops the same (unchanged_populations=True)

This is because for canonicalize we need to reorder things, but not discard information - that's option (2). Similarly, the option for individuals toggles between options (1) and (2). But the filter_X flags for simplify toggle between options (1) and (3).

Maybe the best thing to do is to mirror simplify, and provide an extra flag that does option (2), e.g., TSK_CANONICALIZE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone ahead and done this change.

@@ -2558,7 +2558,7 @@ def map_ancestors(self, *args, **kwargs):
# A deprecated alias for link_ancestors()
return self.link_ancestors(*args, **kwargs)

def sort(self, edge_start=0):
def sort(self, edge_start=0, canonical=False, unchanged_populations=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the right way to do this. Wouldn't it be better to add a separate function canonicalise, which does the sorting and optional filtering of unreferenced stuff? Is there a good reason to not make a separate method? It would make the C implementation clearer I think, as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking that canonicalize would have a separate implementation, or would end up calling sort? I did it this way because there'd be a lot of code overlap.

Copy link
Member

@jeromekelleher jeromekelleher Jan 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm imagining that it would look something like (ignore crappy indentation)

int TSK_WARN_UNUSED
tsk_table_collection_canonicalise(tsk_table_collection_t *self,  tsk_flags_t options)
{
    int ret = 0;
    tsk_id_t k;
    tsk_flags_t subset_options = options & TSK_UNCHANGED_POPULATIONS;
    tsk_id_t *nodes = NULL;
    tsk_table_sorter_t sorter;

    ret = tsk_table_sorter_init(&sorter, self, options);
    if (ret != 0) {
        goto out;
    }

        nodes = malloc(self->nodes.num_rows * sizeof(*nodes));
        if (nodes == NULL) {
            ret = TSK_ERR_NO_MEMORY;
            goto out;
        }
        for (k = 0; k < (tsk_id_t) self->nodes.num_rows; k++) {
            nodes[k] = k;
        }
        subset_options
            |= (TSK_SUBSET_KEEP_UNUSED_POPULATIONS | TSK_SUBSET_KEEP_UNUSED_INDIVIDUALS
                | TSK_SUBSET_KEEP_UNUSED_SITES);
        ret = tsk_table_collection_subset(
            self, nodes, self->nodes.num_rows, subset_options);
        if (ret != 0) {
            goto out;
        }
        sorter.sort_mutations = tsk_table_sorter_sort_mutations_canonical;
        
 
    ret = tsk_table_sorter_run(&sorter, start);
    if (ret != 0) {
        goto out;
    }
out:
    tsk_safe_free(nodes);
    tsk_table_sorter_free(&sorter);
    return ret;
}

I think we can add a flag to the tsk_table_sorter_t struct to decide whether we're sorting mutations in canonical order or just the standard order (i.e., I agree with you that the current duplication in tsk_table_sorter_sort_mutations_canonical isn't very nice).

And there's no changes to the current tsk_table_collection_sort.

How does that sound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I had this in mind as an alternative also. I will make the change (and the other suggestions).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add a flag to the tsk_table_sorter_t struct to decide whether we're sorting mutations in canonical order or just the standard order (i.e., I agree with you that the current duplication in tsk_table_sorter_sort_mutations_canonical isn't very nice).

Ok, I've put back sort and made canonicalise, but I've kept tsk_table_sorter_sort_mutations_canonical separate from tsk_table_sorter_sort_mutations because the two functions use different types for sorted_mutations, and I don't know how to write the code in a way that removes the duplications.

c/tskit/tables.c Outdated
if (keep_individuals) {
for (k = 0; k < (tsk_id_t) tables.individuals.num_rows; k++) {
if (individual_map[k] == TSK_NULL) {
ret = tsk_individual_table_get_row(&tables.individuals, k, &ind);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use the unsafe versions of the the get_row functions here since we're in charge of the ID values.

@jeromekelleher
Copy link
Member

jeromekelleher commented Jan 4, 2021

assert table collections equal: I thought that this would close #1076, but now I wonder if that was meant to be a full method (including in the C library), rather than just in tsutil?

The idea of #1076 is to add this to the public Python API so that downstream projects like msprime etc can use it in their test cases, and get informative errors when the tables aren't equal. We don't want to do this at the C level specifically because we want to provide good errors. A good approach might be to first try to see if the tables are equal using the low-level C call, and then if they are not, do some table-by-table and maybe row-by-row comparisons in Python to figure out exactly where the tables differ.

@jeromekelleher
Copy link
Member

TSK_UNCHANGED_POPULATIONS

Is a good idea and definitely useful, but I'm not sure about the name. It's in a different tense to the rest of the options (or something). The rest of the options like say TSK_FILTER_SITES describe actions, but this is describing a desired state. Something like TSK_NO_REORDER_POPULATIONS or something might be more consistent?

c/tskit/tables.h Outdated
@@ -717,6 +722,11 @@ typedef struct {
/* Flags for table init. */
#define TSK_NO_METADATA (1 << 0)

/* Flags for subset() */
#define TSK_SUBSET_KEEP_UNUSED_POPULATIONS (1 << 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could reuse the TSK_FILTER_POPULATIONS options etc for simplify?

@petrelharp
Copy link
Contributor Author

Ok - up-to-date now. I'm not sure how to de-duplicate the mutation sorting code without making the standard case less efficient. Have a look?

@jeromekelleher
Copy link
Member

Sorry I've been slow to get back to this @petrelharp, I've been tied up with PopGroup. Will take a good look tomorrow!

@petrelharp
Copy link
Contributor Author

No worries!

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a good look at the Python end of this and it looks good @petrelharp, but I think I'm confused about the entry point here. This started out as you wanting to do something specific, and has now turned into us defining a canonical form for tree sequences (which is great!). I'm not sure that the canonical form that I have in mind (i.e., no redundancy, at least by default) actually does what you want, though.

You want to keep population IDs stable through some union/subset etc operations, right?

python/tests/tsutil.py Outdated Show resolved Hide resolved
python/tests/tsutil.py Show resolved Hide resolved
)
k += 1
tables.sort()
tables.compute_mutation_parents
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing (), currently doesn't to anything.

return tables.tree_sequence()


def insert_discrete_time_mutations(ts, num_times=4):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick description of what the function does would be helpful

"""
rng = random.Random(seed)
orig = tables.copy()
tables.nodes.clear()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to raise an error here if migrations is non-empty (or implement?)

python/tskit/tables.py Outdated Show resolved Hide resolved
filter_individuals=True,
filter_sites=True,
canonicalise=False,
):
"""
Modifies the tables in place to contain only the entries referring to
the provided list of nodes, with nodes reordered according to the order
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More precisely "list of node IDs" - someone might thing these are Node objects.

"""
Modifies the tables in place to contain only the entries referring to
the provided list of nodes, with nodes reordered according to the order
they appear in the list. See :meth:`TreeSequence.subset` for a more
detailed description.

Note: the tables can be completely unsorted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "There are no sortedness requirements for the tables".

:param list nodes: The list of nodes for which to retain information. This
may be a numpy array (or array-like) object (dtype=np.int32).
:param bool record_provenance: Whether to record a provenance entry
in the provenance table for this operation.
:param bool filter_populations: Whether to remove populations not referenced by
retained nodes. If False, the population table will remain unchanged.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will be reordered, though, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No! I changed the behavior, to match simplify.

retained nodes. If False, the individuals table will remain unchanged.
:param bool filter_sites: Whether to remove sites not referenced by
retained mutations. If False, the site table will remain unchanged.
:param bool canonicalise: If True, retains all unused entries, putting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the canonical form should include any redundant information (see note above), so I don't think this is the right name for the parameter. Seems like we're getting into a bit of a mess here just to keep unreferenced objects in the data model. Is this something that we really need?

@petrelharp
Copy link
Contributor Author

Ok, the question is: should canonicalise remove unreferenced individuals and populations? (see comment and comment. Gee, I'm not sure.

Here's some use cases:

  1. Testing for shared overlap in union: we want to know that everything relevant to the nodes in question is equivalent. So, removing unreferenced nodes/individuals is OK.
  2. Use in test suites that the C and python versions do exactly the same thing (e.g., for subset or simplify). In these cases, I could imagine it allowing bugs to slip through the cracks (eg if we accidentally leave some extra populations in but they're not referenced?) but it seems kinda unlikely.
  3. Use in test suites for operations that should be "no change"; like t2 = ts.copy(); no_op(t2); t1.canonicalise(); t2.canonicalse(); assert t1 == t2. Dropping unreferenced inds/pops would mean we couldn't test no_op on tables with unreferenced things.
  4. Putting totally out-of-order tables in order so mutation parents come before children. Here it seems bad to drop unreferenced things, although arguably this functionality should be provided by a different method.

I do like the idea of dropping unreferenced things, since we aren't actually canonically sorting these. But for (2-4) it does seem good to leave them in. So, maybe it should be optional, defaulting to False?

Right now, subset has an option to retain unreferenced nodes and individuals, I could rename this from canonicalise to keep_unused_individuals and keep_unused_populations, and add the same options to TableCollection.canonicalise that controls this behavior?

Or, maybe these test cases are unconvincing?

@petrelharp
Copy link
Contributor Author

I got the rest of those changes in, btw.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies it's taken so long to get to this @petrelharp. I'm still confused though, unfortunately. My takes:

  1. The TableCollection.canonicalise() method is a useful operation which we should support. By default it should remove redundant information, but we should add a keep_unreferenced=False option to allow a user to keep unreferenced information at the end of the tables. It's probably not worth having keep_unreferenced_populations, keep_unreferenced_individuals etc, but we can imagine doing this, if needed. Unreferenced stuff should definitely go at the end of the table, though, as otherwise we're not canonicalising in any meaninful way.
  2. The subset(nodes) operation no longer requires sorted input - great!
  3. The subset(nodes) operation now has a bunch of options and complicated semantics I don't understand 😿
  4. We might be able to replace the complicated semantics of subset(nodes) with reorder_nodes(nodes) operation which rejiggles the node table a bit but doesn't change any other tables?

I've made some comments above, but I think they probably reflect my confusion in that some are probably contradictory - sorry about that.

c/tskit/tables.h Outdated
#define TSK_FILTER_SITES (1 << 0)
#define TSK_FILTER_POPULATIONS (1 << 1)
#define TSK_FILTER_INDIVIDUALS (1 << 2)
#define TSK_REDUCE_TO_SITE_TOPOLOGY (1 << 3)
#define TSK_KEEP_UNARY (1 << 4)
#define TSK_KEEP_INPUT_ROOTS (1 << 5)
#define TSK_CANONICALISE (1 << 6)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems problematic to me to add a TSK_CANONICALISE flag to any operation, since (conceptually) all it will be doing is calling canonicalise(). Rather than,

tables.operation(canonicalise=True)

wouldn't it be more flexible (and involve less code duplication for us) to do

tables.operation()
tables.canonicalise()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not what this flag does - this flag is the thing we use within tsk_table_collection_canonicalise. (I will explain later.)

c/tskit/tables.h Outdated
@@ -2629,6 +2632,22 @@ TSK_NO_CHECK_INTEGRITY
int tsk_table_collection_sort(
tsk_table_collection_t *self, const tsk_bookmark_t *start, tsk_flags_t options);

/**
@brief Puts the tables into canonical order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer "canonical form" rather than "canonical order" but 🤷

c/tskit/tables.h Outdated
3. Edges: if both parent and child are retained nodes.
4. Mutations: if the mutation's node is a retained node.
5. Sites: if any mutations remain at the site after removing mutations.
2. Individuals, if `TSK_FILTER_INDIVIDUALS`: if referred to by a retained node,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the documentation for the flags makes sense now. How about:

  1. Individuals: In the order they are referred to by retained nodes, with any remaining unreferenced individuals appended afterwards in their original order. If the TSK_FILTER_INDIVIDUALS flag is provided, unreferenced individuals are not retained.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait, that's not what this does at the moment...

c/tskit/tables.h Outdated
order as in the original tables.
order as in the original tables. If any of `TSK_FILTER_INDIVIDUALS`,
`TSK_FILTER_POPULATIONS`, or `TSK_FILTER_SITES` are *not* provided,
then the respective tables will be *unchanged*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you want the other tables entirely unchanged? Ah, that's what I've been confused about. Right, you don't want to reorder the populations because you want stable population/individual IDs.

Well, what if we add a flag TSK_STABLE_IDs or something to subset then which does the least amount of ID reshuffling as possible, which still doing the basic node-subset operation? This could be mutually exclusive to the FILTER_X operations, or we could drop them entirely if they're not useful.

Or maybe your original idea of a reorder_nodes function is better? I'm finding it hard to understand the desired semantics of the subset function at this point, so maybe another function is a better idea?

@petrelharp
Copy link
Contributor Author

TODO:

  • collapse the two keep_unreferenced_X options into one
  • add keep_unreferenced as an option to canonicalise
  • figure out what do do with subset's many options

And, hm, let's see. Dredging this back up from my memory... We have three behaviors for subset:

  1. reorder and remove unreferenced individuals and populations - needed for check_subset_equality
  2. reorder but don't remove unreferenced - needed for canonicalise (with keep_unreferenced)
  3. don't touch the individual and population tables - useful to keep population references stable (the motivation for keeping individuals is less obvious); and to match the behavior of simplify with filter_X=False

I do think we need to do all of these things. You're suggesting separating out (2) as a different method, tsk_table_collection_reorder_nodes, maybe? I like this option? It would make the "reorder nodes" operation more discoverable. I may not be up for the refactoring required to get reorder to not have a bunch of code duplicated from subset right now, though.

Does that make things clear? What do you think?

@jeromekelleher
Copy link
Member

This is very helpful, thanks @petrelharp. Also very tricky!

It would be nice if we could separate out the operations of subseting from ordering, but I see how these are intrinsically tied up at the moment. Suppose we defined tables.subset(nodes)so that it only touches the node table and keeps all the other tables as-is.

Then, we have a canonicalise(keep_unreferenced=False) function which puts the tables in a canonical form, which can optionally not remove unreferenced items.

So, you're operations above become:

  1. tables.subset(nodes); tables.canonicalise()
  2. (part of the implementation of canonicalise)
  3. tables.canonicalise(keep_unreferenced=False)

I don't think it's that much less efficient to split up the subset() and ordering() concerns really. I guess one concern though is that we've released subset() under the current semantics, so we shouldn't change them too lightly.

@petrelharp
Copy link
Contributor Author

That also seems fine, except for changing the behavior of subset post-release. Probably no-one is relying on it? If someone is, it's @mufernando ... could you have a look at the last two comments and see what you think?

@jeromekelleher
Copy link
Member

I don't know, it's all very tricky. On the one hand it's nice that subset() sort of "radiates out" from the given set of nodes and, optionally discards or not unreferenced entities. Nodes are the focal part of the data model, so it makes sense to make them the central entity we subset by. On the other hand, all the different keep/don't keep/reorder/don't reorder options are very hard to get your head around and keep track of.

What if there was just two options to subset then? By default we reorder and filter unreferenced; if KEEP_UNREFERENCED we keep unreferenced entitites at the end of the tables; and if MAINTAIN_REFERENCES we keep all other references besides nodes and edges stable. The canonicalise function can then call subset() and do some extra sorting.

Would this cover all the requirements?

@mufernando, any input here?

@jeromekelleher
Copy link
Member

Hey, what if we made a method called rejigger_ids, that would solve all our problems! (:zany_face: )

@jeromekelleher jeromekelleher mentioned this pull request Jan 26, 2021
3 tasks
@mufernando
Copy link
Member

gee, this is way more complicated than I initially thought!

I don't know, it's all very tricky. On the one hand it's nice that subset() sort of "radiates out" from the given set of nodes and, optionally discards or not unreferenced entities. Nodes are the focal part of the data model, so it makes sense to make them the central entity we subset by. On the other hand, all the different keep/don't keep/reorder/don't reorder options are very hard to get your head around and keep track of.

What if there was just two options to subset then? By default we reorder and filter unreferenced; if KEEP_UNREFERENCED we keep unreferenced entitites at the end of the tables; and if MAINTAIN_REFERENCES we keep all other references besides nodes and edges stable. The canonicalise function can then call subset() and do some extra sorting.

Would this cover all the requirements?

@mufernando, any input here?

I agree with this, we defined subset as an operation on a set of nodes, but that also affects the other tables and we should stick to that. Some of the other, more "simple" behavior that the filter_X flags are doing should really be relegated to a different operation such as reorder_nodes, which we can implement later on and refactor subset to use it.

I'm not exactly clear on what the MAINTAIN_REFERENCES would do though.

@petrelharp
Copy link
Contributor Author

I'll have a look and try something out. Maybe if I just put some more effort into the documentation, it'll all be clear and natural as-is. =)

@petrelharp
Copy link
Contributor Author

petrelharp commented Jan 26, 2021

Another data point on how we ended up with the current set of options:
Screenshot from 2021-01-26 15-55-11

@petrelharp
Copy link
Contributor Author

Looking above, @jeromekelleher, I see that you maybe weren't entirely happy with this being what "canonicalise" means. If you had something else in mind, we could reserve that keyword for some other yet-to-be-defined operation, and implement this as sort(..., complete=True) or something.

@petrelharp
Copy link
Contributor Author

Looking back, one option is to remove (3) entirely: we don't have a particular use case for "keep individual/population/site tables unchanged". Here's why I implemented it, though: I wanted to write a test where I (1) used subset to extract from a tree sequence both (a) all the relationships to a single individual; and (b) everything else, and then (2) use union to put (a) and (b) back together to get the original tree sequence. However, this runs into problems if the population table changes: if everything in (a) is from population 1, for instance, then what was population (1) becomes population (0) in (a), and so the pieces we union back together are no longer compatible (since we don't have a population_map argument to union). I thought that maintaining sites might be useful to people who, say, want to have genotypes comparable across different subsets. I don't have any use case for keeping individual tables unchanged, and implemented it for symmetry with simplify.

So: I'd be willing to remove the filter_sites, filter_individuals and filter_populations flags entirely. It seems a bit of a shame, because subset is a fairly low-level tool, and having it be more flexible seems helpful (as it was for the task above).

I think the reason this is difficult is because of how things are named in simplify: since it uses filter_X to toggle between "keep unchanged" and "reorder and remove", it's hard to come up with another option that does the halfway operation of "just reorder, but don't remove". Also, maybe we should not mirror simplify, as we want to avoid conflating the two in people's minds (they do pretty different things but easy to conflate!).

So, here's a proposal:

def subset(self, nodes, record_provenance=True,
                  reorder_populations=True,
                  remove_unreferenced=True):
    """
    Returns a tree sequence containing only information directly referencing the provided list of nodes to retain.
    The result will retain only the nodes whose IDs are listed in ``nodes``, only edges for which both parent
    and child are in ``nodes```, only mutations whose node is in ``nodes``, and only individuals that are referred to
    by one of the retained nodes.

    This has the side effect of reordering the nodes, individuals, and populations in the tree sequence: the nodes
    in the new tree sequence will be in the order provided in ``nodes``, and both individuals and populations will be
    ordered by the earliest retained node that refers to them. (However, ``reorder_populations`` may be set to False
    to keep the population table unchanged.)

    By default, the method removes all individuals and populations not referenced by any nodes, and all sites not
    referenced by any mutations. To retain these unreferencd individuals, populations, and sites, pass
    ``remove_unreferenced=False``. If this is done, the site table will remain unchanged, unreferenced individuals
    will appear at the end of the individuals table (and in their original order), and unreferenced populations will
    appear at the end of the population table (unless ``reorder_populations=False``).
    """

In this case, the three operations above are:

  1. reorder and remove unreferenced individuals and populations - subset( )
  2. reorder but don't remove unreferenced - subset(remove_unreferenced=False)
  3. don't touch the population tables - subset(reorder_populations=False)
    I've removed "don't touch individual tables" from (3) because I don't think there's really a use case for that. If there is, though, we could add reorder_individuals later.

@petrelharp
Copy link
Contributor Author

Update:

  • my second proposal above sounds good for subset
  • it remains to decide if canonicalise removes unreferenced things or not.

We don't need to retain unreferenced things for its use in union (and arguably it'd be better if it didn't, actually). So, here's my proposal:

  • add a remove_unreferenced=True argument to canonicalise.

@jeromekelleher
Copy link
Member

I like the new signature for subset, and canonicalise(remove_unreferenced=True) seems good to me. (Also keep_unreferenced=False is fine too, if this works better in terms of syncing with the C flags. I guess you'd have to do it as TSK_KEEP_UNREFERENCED, for options=0 to mean "remove unreferenced, right?)

@jeromekelleher
Copy link
Member

Apologies for sending things in the wrong direction with the misguided FILTER_X options, btw.

@petrelharp
Copy link
Contributor Author

(Note: subset has not been released with flags, so this will be an API change.)

c/tskit/tables.c Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

I think this is ready to go, and passing all the tests, despite what github and codecov say (!?!?). Want to take a final (?) look, @jeromekelleher?

Note: I thought about calling the option keep_unreferenced rather than remove_unreferenced, but thought the double-negative default of keep_unreferenced=False was more confusing.

@jeromekelleher jeromekelleher force-pushed the update_union branch 2 times, most recently from 2656168 to 841c11a Compare February 2, 2021 11:29
@jeromekelleher
Copy link
Member

I've gone through this in detail @petrelharp, and I happy to merge. I've rebased to fixup recent breakages, and made a few small changes. Can you have a look at the last two commits please to make sure you're OK with the updates?

Some notes:

  • For new additions to the API, it really does help if we don't hard-code defaults into the method signature unless we're absolutely certain they will never change (i.e., in the presence of all possible future parameters).
  • Also good to make sure that new additions like this are keyword only.

@petrelharp
Copy link
Contributor Author

LGTM! Thanks for fixing those things up.

@petrelharp
Copy link
Contributor Author

I can squash & rebase later, but not right now - go ahead if you want.

@petrelharp petrelharp force-pushed the update_union branch 2 times, most recently from 92c359d to 841c11a Compare February 2, 2021 22:52
@petrelharp
Copy link
Contributor Author

Whoops, sorry, almost had a git mishap there.

@petrelharp
Copy link
Contributor Author

Um, I guess I could click the "rebase and merge" button, but I'm nervous to since last time. (I guess we need to adjust our threshold on the codecov again, eh?)

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Feb 3, 2021
@jeromekelleher
Copy link
Member

Um, I guess I could click the "rebase and merge" button, but I'm nervous to since last time. (I guess we need to adjust our threshold on the codecov again, eh?)

I think it's fine having multiple commits on something big like this. The codecov levels aren't part of the mergify rules, so we don't have to worry about them. We should really use the AUTOMERGE-REQUESTED label here on tskit for merging stuff rather than the buttons (I've not been using it recently on msprime because there's too much stuff going on, and there isn't really anyone to do the reviewing; and someone can't approve their own PRs)

petrelharp and others added 6 commits February 3, 2021 07:02
assert equal method (closes tskit-dev#1076)

add keep_unused arguments to subset

py sort and tsutil to disorder ts

verifying C sort against Py sort

shuffle tables

sphinx fixup for subset options

extend subset to work on unsorted tables

canonicalalise (closes tskit-dev#705)

test unrestricted union
Fix some other merge issues.
- Move C code to more logical locations.
- Update new method signatures to avoid hard-coding defaults.
- Various minor rephrasings.
- Fill in gap in C test coverage.
@mergify mergify bot merged commit 5770b38 into tskit-dev:main Feb 3, 2021
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Feb 3, 2021
@benjeffery
Copy link
Member

(I've not been using it recently on msprime because there's too much stuff going on, and there isn't really anyone to do the reviewing; and someone can't approve their own PRs)

We could add a label to merge without review if you are the PR author?

@jeromekelleher
Copy link
Member

We could add a label to merge without review if you are the PR author?

It's not worth the bother - things will settle down soon enough in msprime, and it's bad practise anyway.

@benjeffery benjeffery mentioned this pull request Mar 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tidy up union Add assert_equal to TableCollection and TreeSequence make sort stricter
4 participants