Save flags and metadata when splitting disjoint nodes #382

hyanwong · 2024-05-30T00:07:14Z

And add split_nodes to preprocess_ts. Fixes #373

hyanwong · 2024-05-30T00:16:02Z

@nspope: I also added the following to the changelog. Do my short summaries read OK?

- Variational gamma uses a rescaling approach which helps considerably if e.g.
  population sizes vary over time

- Variational gamma does not use mutational area of branches, but average path
  length, which reduces bias in tree sequences containing polytomies

**Breaking changes**

- Variational gamma uses an improper (flat) prior, and therefore
  no longer needs `population_size` to be specified.

hyanwong · 2024-05-30T08:16:40Z

Also note the API for adding metadata information on split nodes: `split_disjoint_nodes(ts, metadata_key="original_node_id")`. If `metadata_key` is a non-empty string, the original node ID is added to the metadata of the split nodes under this key. By default the key is `None`, which is treated as `""`, meaning "don't try to add any metadata". This means that, by default, the routine works (and adds no extra metadata) for any metadata schema.

I'm unsure whether this is a bit hacky, but it seems good to me to always be able to run the routine by default.
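To make the default concrete, here is a minimal sketch of the key handling described above. It works on plain dicts rather than tskit tables, and `tag_split_node` is a hypothetical helper invented for illustration, not a function in tsdate.

```python
def tag_split_node(metadata, original_id, metadata_key=None):
    """Return the metadata for a node created by splitting `original_id`.

    Hypothetical helper (not part of tsdate): `metadata` is assumed to be
    a plain dict already decoded from whatever schema the table uses.
    """
    # None and "" are both falsy, so both mean "don't add any metadata".
    if not metadata_key:
        return dict(metadata)
    tagged = dict(metadata)  # copy, so the caller's dict is not mutated
    tagged[metadata_key] = original_id
    return tagged


# Default: metadata passes through untouched, whatever it contains.
untouched = tag_split_node({"name": "n5"}, original_id=5)

# With a key: the original node ID is recorded under that key.
tagged = tag_split_node({"name": "n5"}, original_id=5,
                        metadata_key="original_node_id")
```

Treating `None` and `""` identically keeps the "off" state a single falsy check, which is what lets the routine run by default whatever the schema.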

nspope · 2024-05-30T15:26:13Z

> Do my short summaries read OK?

LGTM! Let me know when you're ready for a review.

tsdate/util.py

hyanwong · 2024-05-31T11:27:58Z

Definitely ready for a review, thanks!

Re your comment about `tskit.unpack_strings`: that's a good point, and no, I haven't tried this.

hyanwong · 2024-06-03T14:36:03Z

After a quick chat with Jerome, we think that (a) this routine is mostly tsinfer-specific, so (b) we should probably always try to set the metadata, and simply warn if this is not possible. The same could probably apply to the other tsinfer-set metadata fields, such as `ancestor_data_id`.
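A rough sketch of the "always try, warn on failure" behaviour, assuming the node metadata is raw JSON bytes. The helper name `encode_with_original_id` and the warning text are made up for illustration, and the real routine would go through the tskit metadata schema rather than calling `json` directly.

```python
import json
import warnings


def encode_with_original_id(metadata, original_id, key="original_node_id"):
    """Try to add `original_id` to JSON-encoded metadata bytes.

    Hypothetical helper (not tsdate code). On failure, emit a warning and
    return the input bytes unchanged, so the caller can always proceed.
    """
    try:
        record = json.loads(metadata) if metadata else {}
        if not isinstance(record, dict):
            raise TypeError("metadata is not a JSON object")
        record[key] = original_id
        return json.dumps(record).encode()
    except (ValueError, TypeError) as err:
        # JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        warnings.warn(f"Could not set '{key}' in node metadata: {err}")
        return metadata


ok = encode_with_original_id(b'{"a": 1}', 7)          # key is added
empty = encode_with_original_id(b"", 3)               # empty metadata is fine
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bad = encode_with_original_id(b"not json", 7)     # warns, returns input
```

Returning the input unchanged on failure means a tree sequence with an incompatible schema still gets split, just without the extra key.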

hyanwong · 2024-06-03T16:39:35Z

OK, I think this is ready to go. Does it look OK to you @nspope ?

nspope

LGTM, just a couple of comments re: tests. Also, just out of curiosity, I'm wondering what the timing of this routine on the 40K GEL data is, versus the old version (which didn't do anything with metadata). If there's a sizeable gap, let's open an issue about moving some of the internals into numba.

nspope · 2024-06-03T16:39:45Z

tests/test_util.py

+
+
+class TestSplitDisjointNodes:
+    def test_nosplit(self):


It would also be good to test that `split_disjoint_nodes` leaves a tree sequence unchanged if the routine has already been run on it.

Also, it looks like I stuck the `TestNodeSplitting` class into test_functions.py; let's move it over to this file (I think it includes something like the test I suggest above).
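The idempotence check suggested above can be sketched on a toy model. This is not tsdate's implementation: `split_disjoint` below is a stand-in that assumes "splitting" means giving each contiguous run of genome covered by a node its own node ID, and the dict-of-intervals input is invented for illustration.

```python
def split_disjoint(node_spans):
    """Toy stand-in for node splitting (not tsdate's routine).

    `node_spans` maps node ID -> list of (left, right) intervals where the
    node is present. Each contiguous run becomes its own node: the first
    run keeps the original ID, later runs get fresh IDs.
    """
    out = {}
    next_id = max(node_spans, default=-1) + 1
    for node, spans in sorted(node_spans.items()):
        runs = []
        for left, right in sorted(spans):
            if runs and left <= runs[-1][1]:
                # Overlapping or abutting: extend the current run.
                runs[-1] = (runs[-1][0], max(runs[-1][1], right))
            else:
                runs.append((left, right))
        for i, run in enumerate(runs):
            if i == 0:
                out[node] = [run]
            else:
                out[next_id] = [run]
                next_id += 1
    return out


class TestIdempotent:
    def test_second_call_is_noop(self):
        spans = {0: [(0, 10)], 1: [(0, 3), (3, 5), (8, 10)]}
        once = split_disjoint(spans)
        twice = split_disjoint(once)
        assert len(once) == 3  # node 1 was split into two nodes
        assert twice == once   # a second pass changes nothing
```

For the real routine, the analogous test would presumably compare the tables of the once-split and twice-split tree sequences, ignoring provenance.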

Great, done. I'll get Ben to test the speed, then merge.

hyanwong · 2024-06-04T13:21:51Z

I didn't test on the 40K ts, but I did test on chr2 of the unified genealogy (7524 samples), which isn't too bad a test, as `split_disjoint_nodes` still creates half a million new nodes. For reference, it takes about 75 seconds to simplify this ts (i.e. to remove unary nodes) before splitting.

The old routine took 15 secs to run. The new one, having set a permissive_json schema, takes 25 secs. I think that's fine, so I'll merge.

At the moment, tsinfer doesn't set a node metadata schema, so the metadata is not saved anyway (and so it only takes 15 secs even with the new code). When we fix tsinfer to set a proper schema, it will take a little longer on large tree sequences, but still a fraction of the time required for simplification.

hyanwong force-pushed the split-disjoint branch from 4e51a89 to 9fbf276 Compare May 30, 2024 00:25

nspope reviewed May 30, 2024

hyanwong force-pushed the split-disjoint branch from 9fbf276 to fa0fbbd Compare June 3, 2024 14:15

Save flags and metadata when splitting disjoint nodes

a8c7aac

And add split_nodes to preprocess_ts. Fixes tskit-dev#373

hyanwong force-pushed the split-disjoint branch from fa0fbbd to a8c7aac Compare June 3, 2024 15:36

nspope approved these changes Jun 3, 2024

Move old disjoint tests

13d0269

And address other comments

hyanwong merged commit a18f33c into tskit-dev:main Jun 4, 2024
3 checks passed
