Save flags and metadata when splitting disjoint nodes #382

Merged (2 commits), Jun 4, 2024

hyanwong
Member

And add split_nodes to preprocess_ts. Fixes #373

@hyanwong
Member Author

@nspope : I also added the following to the changelog. Do my short summaries read OK?

- Variational gamma uses a rescaling approach which helps considerably if e.g.
  population sizes vary over time

- Variational gamma does not use mutational area of branches, but average path
  length, which reduces bias in tree sequences containing polytomies

**Breaking changes**

- Variational gamma uses an improper (flat) prior, and therefore
  no longer needs `population_size` specifying.

@hyanwong
Member Author

Also note the API for adding metadata information on split nodes: `split_disjoint_nodes(ts, metadata_key="original_node_id")`. If `metadata_key` is a non-empty string, the original node ID is added to the split nodes' metadata under this key. By default the key is `None`, which is treated as `""`, meaning "don't try to add any metadata". This means that, by default, the routine will work (and not include extra metadata) for any metadata schema.

I'm unsure whether this is a bit hacky, but it seems good to me that the routine can always be run with its default arguments.
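
To make the proposed `metadata_key` behaviour concrete, here is a minimal pure-Python sketch. It is not tsdate's implementation: it works on plain `(left, right, parent, child)` edge tuples rather than a tskit tree sequence, treats all nodes alike (no special handling of samples), stores metadata as plain dicts, and assumes that only the newly created nodes record the original ID. The helpers `merge_intervals` and `region_id` are hypothetical names.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching [left, right) intervals."""
    merged = []
    for left, right in sorted(intervals):
        if merged and left <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    return merged


def split_disjoint_nodes(edges, num_nodes, metadata_key=None):
    """Toy stand-in for the real routine (assumption: plain tuples and dicts)."""
    # Gather every interval over which each node appears as parent or child.
    spans = defaultdict(list)
    for left, right, parent, child in edges:
        spans[parent].append((left, right))
        spans[child].append((left, right))

    metadata = [{} for _ in range(num_nodes)]
    regions = {}  # node -> list of (left, right, node ID used for that region)
    for node in range(num_nodes):
        regions[node] = []
        for i, (left, right) in enumerate(merge_intervals(spans[node])):
            if i == 0:
                node_id = node  # first contiguous region keeps the original ID
            else:
                node_id = len(metadata)  # later regions become new nodes
                # None and "" both mean "don't try to add any metadata".
                metadata.append({metadata_key: node} if metadata_key else {})
            regions[node].append((left, right, node_id))

    def region_id(node, pos):
        # Find which region of `node` contains the position `pos`.
        return next(nid for lo, hi, nid in regions[node] if lo <= pos < hi)

    new_edges = [
        (left, right, region_id(parent, left), region_id(child, left))
        for left, right, parent, child in edges
    ]
    return new_edges, metadata


# Node 2 is present on [0, 10) and [20, 30) but absent in between.
edges = [(0, 10, 2, 0), (10, 20, 3, 0), (20, 30, 2, 0), (0, 30, 3, 1)]
new_edges, metadata = split_disjoint_nodes(edges, 4, metadata_key="original_node_id")
```

In this toy example node 2 is split in two: the second region becomes a new node 4 whose metadata points back to node 2. Calling the function without `metadata_key` performs the same split but leaves all metadata empty.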

@nspope
Contributor

nspope commented May 30, 2024

> do my short summaries read OK?

LGTM! Let me know when you're ready for a review.

Review thread on tsdate/util.py (outdated, resolved)

@hyanwong
Member Author

Definitely ready for a review, thanks!

Re your comment about tskit.unpack_strings, that's a good point, and I haven't tried this, no.

@hyanwong
Member Author

hyanwong commented Jun 3, 2024

After a quick chat with Jerome, we think that (a) this routine is mostly tsinfer-specific so (b) we should probably always try to set the metadata, and simply warn if this is not possible. This could probably apply to the other tsinfer-set metadata fields like ancestor_data_id too.
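
The "always try, and warn on failure" pattern could look roughly like the sketch below. This is only an illustration under stated assumptions: real node metadata goes through a tskit metadata schema, whereas here raw JSON-encoded bytes stand in for it, and `try_set_metadata` is a hypothetical helper, not a tsdate or tskit function.

```python
import json
import warnings


def try_set_metadata(existing, key, value):
    """Try to add `key` to JSON-encoded metadata bytes.

    On failure, warn and return the metadata unchanged (hypothetical helper).
    """
    try:
        record = json.loads(existing) if existing else {}
        if not isinstance(record, dict):
            raise TypeError("metadata is not a JSON object")
        record[key] = value
        return json.dumps(record).encode()
    except (ValueError, TypeError) as err:
        # Covers undecodable bytes, invalid JSON and non-object metadata.
        warnings.warn(f"Could not set {key!r} in node metadata: {err}")
        return existing


ok = try_set_metadata(b"", "original_node_id", 7)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unchanged = try_set_metadata(b"not json", "original_node_id", 7)
```

The design choice is that the split itself never fails because of metadata: a node whose metadata cannot be updated is left as it was, and the user gets a warning instead of an exception.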

And add split_nodes to preprocess_ts. Fixes tskit-dev#373
@hyanwong
Member Author

hyanwong commented Jun 3, 2024

OK, I think this is ready to go. Does it look OK to you @nspope ?

Contributor

nspope left a comment

LGTM, just a couple of comments re: tests. Also, just out of curiosity, I'm wondering what the timing of this routine on the 40K GEL data is, versus the old version (that didn't do anything with metadata). If there's a sizeable gap, let's open an issue about moving some of the internals into numba.



class TestSplitDisjointNodes:
    def test_nosplit(self):
Contributor

would also be good to test that split_disjoint_nodes leaves a tree sequence unchanged if it's already had the routine called on it
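
A test of that property might be shaped like the sketch below. The `split_disjoint` function here is a toy stand-in (an assumption, not the tsdate routine): it maps each node ID to the intervals over which the node is present and gives every additional disjoint region a fresh ID. The point is the structure of the test, which runs the routine twice and compares the results.

```python
def split_disjoint(spans):
    """Toy stand-in: `spans` maps node ID -> list of (left, right) intervals."""
    out = {}
    next_id = max(spans, default=-1) + 1
    for node in sorted(spans):
        # Merge overlapping or touching intervals into contiguous regions.
        merged = []
        for left, right in sorted(spans[node]):
            if merged and left <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], right))
            else:
                merged.append((left, right))
        # The first region keeps the node ID; the others get new IDs.
        for i, interval in enumerate(merged):
            if i == 0:
                out[node] = [interval]
            else:
                out[next_id] = [interval]
                next_id += 1
    return out


def test_split_is_idempotent():
    spans = {0: [(0, 30)], 1: [(0, 10), (20, 30)]}
    once = split_disjoint(spans)
    twice = split_disjoint(once)
    assert once != spans  # the first pass really did split node 1
    assert twice == once  # a second pass changes nothing


test_split_is_idempotent()
```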

Contributor

nspope commented Jun 3, 2024

also it looks like I stuck "TestNodeSplitting" class into test_functions.py, let's move it over to this file (I think it includes something like the test I suggest above)

Member Author

Great, done. I'll get Ben to test the speed, then merge.

And address other comments
@hyanwong
Member Author

hyanwong commented Jun 4, 2024

I didn't test on the 40K ts, but I did on chr2 of the unified genealogy (7524 samples), which isn't too bad a test, as split_disjoint_nodes still creates half a million new nodes. For reference, it takes about 75 seconds to simplify this ts (i.e. to remove unary nodes) before splitting.

The old routine took 15 secs to run. The new one, having set a permissive_json schema, takes 25 secs. I think that's fine, so I'll merge.

At the moment, tsinfer doesn't set a node metadata schema, so the metadata is not saved anyway (and so it only takes 15 secs even with the new code). When we fix tsinfer to set a proper schema, it will take a little longer on large tree sequences, but still a fraction of the time required for simplification.
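
For a rough sense of scale, the timings quoted above work out as follows. The variable names are only illustrative, and the numbers are the approximate ones from this comment, not new measurements.

```python
# Timings quoted above, in seconds (chr2 of the unified genealogy).
simplify_secs = 75
old_split_secs = 15
new_split_secs = 25  # with a permissive_json schema set

overhead_secs = new_split_secs - old_split_secs
slowdown = new_split_secs / old_split_secs
fraction_of_simplify = new_split_secs / simplify_secs

print(f"metadata overhead: {overhead_secs} s ({slowdown:.2f}x the old routine)")
print(f"new routine takes {fraction_of_simplify:.0%} of the simplify time")
```

So saving the metadata adds about 10 seconds here, and the new routine still takes roughly a third of the time needed for simplification.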

hyanwong merged commit a18f33c into tskit-dev:main on Jun 4, 2024 (3 checks passed)
Successfully merging this pull request may close these issues.

Add split_disjoint_nodes to preprocess routine