Decouple the LightweightTableCollection from C module. #862

jeromekelleher · 2020-09-25T16:37:02Z

An initial attempt to pull out the LightweightTableCollection code in a way that's reusable among downstream Python projects.

Motivated by #655, but also it'd be nice to get rid of a few K-lines of boilerplate from msprime and tsinfer.

The idea is that Python projects (like msprime) that use the tskit C API can include this new header file in their own low-level C module. Since they already pull in tskit as a git submodule, taking this file from the tskit source should keep the code up to date, so those projects don't need to worry about compatibility issues.

I'm sure this will work for Python C modules like we're using in msprime/tsinfer, but we're kinda hoping that this will also work for Cython code, like @terhorst has been using.

terhorst · 2020-09-25T17:19:09Z

Thanks for the quick patch. I had one question which might be obvious, but I couldn't grasp it from your test code. It seems like there is no way to go from a LightweightTreeSequence to a _tskit.TreeSequence except through tskit.TableCollection. I end up stuffing all the tables by hand, essentially just replicating tskit.TableCollection().fromdict anyways:

https://github.com/terhorst/xsmc/blob/631b6cf34dc1187d687d00c9f457afad38c3c971/xsmc/xsmc.py#L49-L57

It seems like I might be missing something -- I don't really see what's the gain of going through LightweightTreeSequence; couldn't you just use TreeSequence.dump_tables()? Is the idea that the former is more like a protocol that interoperates between versions?

Asking because it's fairly easy to bundle the _tskit module inside another Python package, but somewhat non-trivial to bundle tskit.

codecov · 2020-09-25T17:30:50Z

Codecov Report

Merging #862 into main will increase coverage by 0.00%.
The diff coverage is 94.77%.

@@           Coverage Diff           @@
##             main     #862   +/-   ##
=======================================
  Coverage   93.52%   93.52%           
=======================================
  Files          24       25    +1     
  Lines       19927    19932    +5     
  Branches      789      789           
=======================================
+ Hits        18636    18641    +5     
  Misses       1259     1259           
  Partials       32       32

Impacted Files	Coverage Δ
python/_tskit_lwt_interface.h	`94.77% <94.77%> (ø)`
python/_tskitmodule.c	`90.60% <100.00%> (-0.72%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 786fd14...6a5eb93. Read the comment docs.

benjeffery

LGTM!
@jeromekelleher shall we merge now and do the docs/example builds in a follow-up?

jeromekelleher · 2020-09-28T09:30:04Z

That's a good question @terhorst. If you have your data in a LightweightTableCollection lwt, then the way to get a tskit TreeSequence is:

tables = tskit.TableCollection.fromdict(lwt.asdict())
ts = tables.tree_sequence()

However, if I understand your application you'll probably want to do this on the C side. So, you have a tskit.TreeSequence as input, and you want to turn that into a tsk_treeseq_t so you can do stuff with it in C. For that, on the Python side do

lwt = xsmc.LightweightTableCollection(ts.sequence_length)
lwt.fromdict(ts.tables.asdict())
# pass lwt to your class/function that's doing the C stuff.

Then, in C

tsk_treeseq_t ts;
int ret;
// lwt is a LightweightTableCollection instance
ret = tsk_treeseq_init(&ts, lwt->tables, TSK_BUILD_INDEXES);
// check ret != 0 for errors; otherwise you can now do stuff with &ts.

There's a bit of inefficiency here in that we need to recompute the indexes (I think the index data isn't included in the dictionary encoding at the moment), but it's pretty minimal unless the tree sequence is big.

Does this make sense/help?

jeromekelleher · 2020-09-28T09:36:26Z

@jeromekelleher shall we merge now and do the docs/example builds in a follow-up?

I don't mind - are you OK with the names of things? I didn't really put much thought into how that would affect things. But yeah, if you're happy with it as is, then let's merge and lodge an issue for follow-ups.

One thing I'd like to sort out is how we can reuse the tests in downstream repos without editing the code. I think the only change we make at the moment is the import X as c_module. Any ideas on how we could somehow parameterise this test, so we could soft-link the test module into the downstream repo's test suite?
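One way the parameterisation asked about here might look is a mixin that reads the low-level module from a class attribute instead of a hard-coded `import X as c_module`. This is only a sketch: `DictEncodingTestsMixin`, `FakeLightweightTableCollection`, and the placeholder dictionary are hypothetical names invented for illustration, and the fake class is a stand-in, not the real tskit implementation.

```python
import types
import unittest


class FakeLightweightTableCollection:
    """Stand-in for illustration only; NOT the real tskit class."""

    def __init__(self):
        self._data = {}

    def fromdict(self, d):
        self._data = dict(d)

    def asdict(self):
        return dict(self._data)


class DictEncodingTestsMixin:
    """Shared tests that use self.c_module rather than a fixed import."""

    c_module = None  # each downstream repo sets this in a subclass

    def test_round_trip(self):
        d = {"sequence_length": 1.0}  # placeholder payload, not a full encoding
        lwt = self.c_module.LightweightTableCollection()
        lwt.fromdict(d)
        self.assertEqual(lwt.asdict(), d)


# A downstream repo would assign its own low-level C module here instead.
fake_module = types.SimpleNamespace(
    LightweightTableCollection=FakeLightweightTableCollection
)


class TestDownstream(DictEncodingTestsMixin, unittest.TestCase):
    c_module = fake_module


def run_suite():
    """Run the downstream test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDownstream)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the mixin does not inherit from `unittest.TestCase`, it is not collected on its own; only the subclass that supplies `c_module` runs, so the shared file would not need editing downstream.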

benjeffery · 2020-09-28T12:47:02Z

Any ideas on how we could somehow parameterise this test, so we could soft-link the test module into the downstream repo's test suite?

@jeromekelleher I've just pushed a commit here showing how this might be done. Basically, you move the tests into tskit so they can be used wherever. 8b2aca0

jeromekelleher · 2020-09-28T12:57:15Z

python/tests/test_dict_encoding.py

-                tskit.TableCollection.fromdict(d)
-        # Check we've done something
-        self.assertTrue(seen_msprime)
+from tskit.dict_encoding_tests import *  # noqa: F403,F401


It's a good idea, but this will mean that we use the version of the tests from the installed version of tskit, not the version which corresponds to the C-lib version. These could get out of date - suppose we have msprime using version 0.99.X and everything is working fine. Then we release a tskit version which is based on 0.99.X + 1, which has added some extra fields to the data model that are irrelevant for msprime. Then, these tests will break. It would essentially force us to update the C API in all downstream packages every time we update the C API in Python tskit, which I don't think we want to do.

Probably simplest to mess around with this downstream and see what can be made to work. Maybe we merge the basic change first, and see how we go? It'll make things easier to test if we can point msprime at a hash in the main repo rather than pointing at a fork.

Yes, the tskit-using-application tests need to pull from the tests from the submodule of tskit, not the installed tskit. So it's either a weird looking import or a symlink I think?
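The "weird looking import" option could be sketched roughly as below: load the test module by file path from the submodule checkout, so the installed tskit is never consulted. The helper name `load_module_from_path` and the path in the usage comment are hypothetical; only standard-library `importlib` calls are used.

```python
import importlib.util
import sys
from pathlib import Path


def load_module_from_path(name, path):
    """Import a module from an explicit file path, bypassing site-packages."""
    spec = importlib.util.spec_from_file_location(name, Path(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so the module can be found by name later.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module


# Hypothetical downstream usage (the path is an assumption, not from the PR):
# shared = load_module_from_path(
#     "shared_dict_encoding_tests",
#     "path/to/tskit-submodule/python/tests/test_dict_encoding.py",
# )
```

A symlink achieves the same effect with no code at all, at the cost of being less portable on platforms where symlinks in a git checkout are awkward.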

OK, let's back out the last commit I did, then merge.

jeromekelleher mentioned this pull request Sep 25, 2020

Use LightweightTableCollection code from tskit tskit-dev/msprime#1215

Merged

benjeffery approved these changes Sep 28, 2020

jeromekelleher force-pushed the decouple-lwt branch from 8203950 to 3076a0c Compare September 28, 2020 09:33

benjeffery changed the base branch from master to main September 28, 2020 12:11

jeromekelleher commented Sep 28, 2020

benjeffery force-pushed the decouple-lwt branch from 8b2aca0 to 3076a0c Compare September 28, 2020 13:04

jeromekelleher mentioned this pull request Sep 28, 2020

Document LightweightTableCollection interface and usage #864

Closed

jeromekelleher added the AUTOMERGE-REQUESTED label Sep 28, 2020

Decouple the LightweightTableCollection from C module.

6a5eb93

AdminBot-tskit force-pushed the decouple-lwt branch from 3076a0c to 6a5eb93 Compare September 28, 2020 13:16

mergify bot merged commit 160f14b into tskit-dev:main Sep 28, 2020

mergify bot removed the AUTOMERGE-REQUESTED label Sep 28, 2020

This was referenced Sep 28, 2020

Individual parents #866

Closed

Add indexes to the dict encoding #870

Closed

jeromekelleher deleted the decouple-lwt branch September 29, 2020 11:21
