Skip to content

Conversation

@petrelharp
Copy link
Contributor

@petrelharp petrelharp commented Sep 24, 2025

Here's the start at what we discussed in #3183. Might you have a go at putting in the python tests, @hyanwong?

Something I haven't done that we said we might in the other PR is require that if all_edges or all_mutations are True then check_shared_overlap is False. I don't think we actually need to require this (since in some cases all three might be true and it's fine!) but we should probably say in the docstring that the user probably wants this to be the case. (I just don't remember why that is right now.)

Still TODO:

  • python tests:
    • check if other has no edges that we can use this to add mutations to a subset of nodes in an existing ts
    • check if we trim ts down to two disjoint spans and then union these together we get back ts
  • CHANGELOG
  • say something about checking shared overlap in the docs
  • explain about all_mutations and all_edges in the docstring

@codecov
Copy link

codecov bot commented Sep 24, 2025

Codecov Report

❌ Patch coverage is 84.21053% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.80%. Comparing base (e4ca469) to head (ddc1dba).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
c/tskit/tables.c 75.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3283      +/-   ##
==========================================
+ Coverage   89.79%   89.80%   +0.01%     
==========================================
  Files          29       29              
  Lines       31010    31026      +16     
  Branches     5674     5679       +5     
==========================================
+ Hits        27845    27863      +18     
+ Misses       1778     1777       -1     
+ Partials     1387     1386       -1     
Flag Coverage Δ
c-tests 86.85% <75.00%> (-0.02%) ⬇️
lwt-tests 80.38% <ø> (ø)
python-c-tests 87.05% <100.00%> (+0.08%) ⬆️
python-tests 98.84% <ø> (+0.01%) ⬆️
python-tests-no-jit 33.60% <ø> (ø)
python-tests-numpy1 50.18% <ø> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
python/_tskitmodule.c 87.05% <100.00%> (+0.08%) ⬆️
python/tskit/tables.py 99.77% <ø> (ø)
python/tskit/trees.py 98.88% <ø> (+0.03%) ⬆️
c/tskit/tables.c 83.08% <75.00%> (-0.02%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@petrelharp
Copy link
Contributor Author

Whoops - now this should be ready to write tests for.

@hyanwong
Copy link
Member

hyanwong commented Oct 4, 2025

I made a PR to your PR, @petrelharp - the only thing I can see that might be a problem is that we might want to include empty sites when unioning one TS with another. If so, we could either:

  1. Include empty sites when all_mutations=True
  2. Have an additional parameter: e.g. all_sites
  3. Have a single parameter all which does all_mutations, all_edges and all_sites (is there a call for separating edges and sites?)

I guess (2) gives the most flexibility for the user?

@petrelharp
Copy link
Contributor Author

Let's keep the convo here, and lmk if you don't have permissions to push here.

That all looks good; I think we just need some additional checks for correctness of the union (a lot of the checks are just that the number of things is right).

I guess I don't think we should have a separate all_sites argument, we should just have all_mutations also include all sites. Otherwise the behavior is complicated.

It's tempting to have a disjoint argument that does all_mutations=True, all_edges=True, and check_shared_overlap=False. I'm not sure, since it's adding another argument, but I think I like it since it points users towards the use case. (And if they use this with two that have shared overlap they'll get duplicated edges and hence an error, which is good.)

@petrelharp
Copy link
Contributor Author

There; I switched the behavior to add all sites for reals if all_mutations is True.

@hyanwong
Copy link
Member

hyanwong commented Oct 4, 2025

Great! I'll add more checks on actual correctness. Thanks @petrelharp !

@hyanwong
Copy link
Member

hyanwong commented Oct 9, 2025

we just need some additional checks for correctness of the union (a lot of the checks are just that the number of things is right).

I changed to assert that entire tables were equal rather than checking lengths, and added some tests of unioning empty tables to things.

One thing that has been picked up here is that unioning a table into an empty set of tables seems not to preserve the individual parents. Is this expected behaviour @petrelharp? At the moment it is causing my new tests to fail (see below for explanation: we aren't adding unreferenced individuals from other)


see https://github.com/tskit-dev/tskit/issues/1243 for mention of individual tests.

@hyanwong
Copy link
Member

hyanwong commented Oct 9, 2025

Also possibly todo: should we warn in the python function if the mutation/edge/population/site/node/individual table metadata schemas differ between the two tree sequences?

@hyanwong
Copy link
Member

Aha - I figured out what was going wrong in the tests. I was comparing the wrong "individuals". When we call union(all_edges=True, all_mutations=True), we don't include individuals that are unreferenced by any node. I.e. we aren't just smashing all the tables together, but doing something more sophisticated. I don't know what the preferred behaviour is in this case?

@hyanwong
Copy link
Member

It looks like the uncovered part in the C code is for when add_row returns an error. Is there a way to get coverage for this?
Screenshot 2025-10-10 at 19 21 58

@petrelharp
Copy link
Contributor Author

It looks like the uncovered part in the C code is for when add_row returns an error. Is there a way to get coverage for this?

No, that's not something we can (reasonably) cover. It probably just errors if it's out of memory.

@petrelharp
Copy link
Contributor Author

Aha - I figured out what was going wrong in the tests. I was comparing the wrong "individuals". When we call union(all_edges=True, all_mutations=True), we don't include individuals that are unreferenced by any node. I.e. we aren't just smashing all the tables together, but doing something more sophisticated. I don't know what the preferred behaviour is in this case?

Sorry, I'm not following? This doesn't ever claim to add nodes or individuals?

@petrelharp
Copy link
Contributor Author

Tangentially, I think the docstring for concatenate could be improved:

        Concatenate a set of tree sequences to the right of this one, by repeatedly
        calling :meth:`~TreeSequence.union` with an (optional)
        node mapping for each of the ``others``. If any node mapping is ``None``
        only map the sample nodes between the input tree sequence and this one,
        based on the numerical order of sample node IDs.

I don't think mentioning union here particularly helps. The key thing, that isn't spelled out, is that all coordinates in other are shifted by the current sequence_length (and then union'ed).


        :param Union[list, None] node_mappings: An list of node mappings for each
            input tree sequence in ``args``. Each should either be an array of
            integers of the same length as the number of nodes in the equivalent
            input tree sequence (see :meth:`~TreeSequence.union` for details), or
            ``None``. If ``None``, only sample nodes are mapped to each other.

I'm not sure about "the equivalent input tree sequence". Why "equivalent"? Shouldn't this just be "in the 'other' tree sequnce"?

@petrelharp
Copy link
Contributor Author

I think the tests are pretty good. Brainstorming corner cases:

  • As you say, what if metadata schema differ? That is covered by check_shared_equality. So I guess it might be nice to check, but maybe is not our job here.
  • What if the same site is in both but with different ancestral state or metadata. Currently it'll just use the one in self and ignore the other. I guess I vote to leave this with a "user beware" note.
  • What if there are mutations in both on the same node and at the same time. Then what will happen is the mutation in other comes after the mutation in self (since compute_mutation_parents uses order to resolve such things). Again, "user beware"?

So I guess I think this is maybe ready to go, with possibly some of the caveats above mentioned in the docstring. Are there other corner cases?

@benjeffery
Copy link
Member

Is this at a point where I should review?

@hyanwong
Copy link
Member

hyanwong commented Oct 15, 2025

Tangentially, I think the docstring for concatenate could be improved:

I agree! I'm happy to do that.

I don't think mentioning union here particularly helps. The key thing, that isn't spelled out, is that all coordinates in other are shifted by the current sequence_length (and then union'ed).

Yes, that's right. I mentioned union so that people could look up the docs for how that works, in case the documentation for concatenate was sparse (e.g. it allows us to say "see :meth:~TreeSequence.union for details". But I'm happy to remove it.

I'm not sure about "the equivalent input tree sequence". Why "equivalent"? Shouldn't this just be "in the 'other' tree sequnce"?

Yes, you are right. I mean "equivalent" in the list (as we are talking about a list of tree sequences and a list of node mappings). Maybe "respective" would be clearer?

@hyanwong
Copy link
Member

Aha - I figured out what was going wrong in the tests. I was comparing the wrong "individuals"....

Sorry, I'm not following? This doesn't ever claim to add nodes or individuals?

When you specify "new nodes" in the union (i.e. the node_mapping is -1), then those nodes in other get added to the new tree sequence, and any individuals associated with those nodes in other also get added.

However, if other contains additional individuals that are not associated with any nodes, these are not added. Perhaps this is what we want? Or perhaps we want a union that does its best to add everything (bar the samples and the individuals associated with those) to the returned tree sequence.

It probably just needs documenting?

@hyanwong
Copy link
Member

Is this at a point where I should review?

If you have time @benjeffery , I think that could be helpful, yes. Meanwhile I'll update the docstrings.

@hyanwong
Copy link
Member

Shall I squash all these commits down?

@petrelharp
Copy link
Contributor Author

We should squash at some point, let's wait for @benjeffery 's review.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a quick pass - haven't gone into algorithmic detail yet.

self,
other,
node_mapping,
all_edges=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placing these at the start of kwargs has broken compatibility with old code that might have been using them positionally as we had no *.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good point. Easy enough to put a * in now, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm moving them to the end, but if it's not too rude then we should add a *.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, add a star, but just before your new kwargs, to not break existing positional usage.

@petrelharp
Copy link
Contributor Author

Okay - perhaps someone (@hyanwong?) should have a careful read of the docstrings and then this is good? The failing coverage is because the diff is not many lines and there's a "we can't cover this" line in that diff, so the percent covered on the diff is small.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few doc nits.

@petrelharp
Copy link
Contributor Author

Thanks, @benjeffery! Shall we merge this, @hyanwong? (Assuming the tests pass.)

@petrelharp petrelharp marked this pull request as ready for review October 22, 2025 20:11
Add some python tests
And fix concatenate()
all_mutations implies really all_sites
@hyanwong
Copy link
Member

Yes, please do. Sorry about the slow reply: both deep in teaching and also down with influenza, but what we have now looks good to me. I can always tweak the docstrings later if we figure out better ways of saying things. So... merge away, I reckon.

@hyanwong hyanwong added this pull request to the merge queue Oct 22, 2025
Merged via the queue into tskit-dev:main with commit 52a1bcc Oct 22, 2025
17 of 18 checks passed
@hyanwong hyanwong mentioned this pull request Nov 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants