Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions python/CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,11 @@
[0.5.3] - 2022-XX-XX
--------------------

**Fixes**

- ``ts.subset()`` produces valid tree sequences even if nodes are shuffled
out of time order (:user:`hyanwong`, :pr:`2479`, :issue:`2473`)

**Changes**

- Single statistics computed with ``TreeSequence.general_stat`` are now
Expand Down
21 changes: 21 additions & 0 deletions python/tests/test_highlevel.py
Original file line number Diff line number Diff line change
Expand Up @@ -1949,6 +1949,27 @@ def test_simplify_migrations_fails(self):
with pytest.raises(_tskit.LibraryError):
ts.simplify()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps these should go here? And could call verify_subset there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I could do, but I thought test_tables.py was specifically for testing TableCollection functions. The new functionality only applies to the ts.subset function (should it? Shouldn't the TableCollection function mirror the TreeSequence one exactly?). That was my rationale for dumping the tests in test_highlevel.py.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, good point. Um, I guess I think we should call sort in TableCollection.subset, instead of just here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that would probably be my preference.

There may be other functions which have a table and a tree sequence version, but where extra processing is required to make a valid tree sequence. It could be worth establishing a precedence for functions like this, if there is a use-case for not doing the extra processing in the tables version. We could have a parameter make_valid_ts that any such table function has, which is always set to True if called via the ts version.

It may not be that this isn't required for this particular function, so we could punt any decision down the line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, and I don't think we have a use case for doing subset followed by other processing steps that don't require sorted tables before sorting (if we did, then it could be useful for efficiency to keep it out). So, let's not add the extra parameter.

def test_subset_reverse_all_nodes(self):
ts = tskit.Tree.generate_comb(5).tree_sequence
assert np.all(ts.samples() == np.arange(ts.num_samples))
flipped_ids = np.flip(np.arange(ts.num_nodes))
new_ts = ts.subset(flipped_ids)
assert set(new_ts.samples()) == set(flipped_ids[np.arange(ts.num_samples)])
r1 = ts.first().rank()
r2 = new_ts.first().rank()
assert r1.shape == r2.shape
assert r1.label != r2.label

def test_subset_reverse_internal_nodes(self):
ts = tskit.Tree.generate_balanced(5).tree_sequence
internal_nodes = np.ones(ts.num_nodes, dtype=bool)
internal_nodes[ts.samples()] = False
node_ids = np.arange(ts.num_nodes)
node_ids[internal_nodes] = np.flip(node_ids[internal_nodes])
new_ts = ts.subset(node_ids)
assert np.any(new_ts.nodes_time != ts.nodes_time)
assert new_ts.first().rank() == ts.first().rank()

def test_deprecated_apis(self):
ts = msprime.simulate(10, random_seed=1)
assert ts.get_ll_tree_sequence() == ts.ll_tree_sequence
Expand Down
13 changes: 7 additions & 6 deletions python/tskit/trees.py
Original file line number Diff line number Diff line change
Expand Up @@ -6694,12 +6694,12 @@ def subset(
of the retained nodes. Note that this does *not* retain
the ancestry of these nodes - for that, see :meth:`.simplify`.

This has the side effect of reordering the nodes, individuals, and
populations in the tree sequence: the nodes in the new tree sequence
will be in the order provided in ``nodes``, and both individuals and
populations will be ordered by the earliest retained node that refers
to them. (However, ``reorder_populations`` may be set to False
to keep the population table unchanged.)
This has the side effect that it may change the order of the nodes,
individuals, populations, and migrations in the tree sequence: the nodes
in the new tree sequence will be in the order provided in ``nodes``, and
both individuals and populations will be ordered by the earliest retained
node that refers to them. (However, ``reorder_populations`` may be set to
False to keep the population table unchanged.)

By default, the method removes all individuals and populations not
referenced by any nodes, and all sites not referenced by any mutations.
Expand Down Expand Up @@ -6734,6 +6734,7 @@ def subset(
reorder_populations=reorder_populations,
remove_unreferenced=remove_unreferenced,
)
tables.sort()
return tables.tree_sequence()

def union(
Expand Down