@jeromekelleher
Member

@jeromekelleher jeromekelleher commented Sep 25, 2020

An initial attempt to pull out the LightweightTableCollection code in a way that's reusable among downstream Python projects.

Motivated by #655, but also it'd be nice to get rid of a few K-lines of boilerplate from msprime and tsinfer.

The idea is that Python projects (like msprime) that use the tskit C API can include this new header file in their own low-level C module. Since they already include tskit as a git submodule, pulling in this file from the tskit source should keep the code up to date, so those projects don't need to worry about compatibility issues.

I'm sure this will work for Python C modules like we're using in msprime/tsinfer, but we're kinda hoping that this will also work for Cython code, like @terhorst has been using.

@terhorst
Contributor

Thanks for the quick patch. I had one question which might be obvious, but I couldn't grasp it from your test code. It seems like there is no way to go from a LightweightTreeSequence to a _tskit.TreeSequence except through tskit.TableCollection. I end up stuffing all the tables by hand, essentially just replicating tskit.TableCollection().fromdict anyways:

https://github.com/terhorst/xsmc/blob/631b6cf34dc1187d687d00c9f457afad38c3c971/xsmc/xsmc.py#L49-L57

It seems like I might be missing something -- I don't really see what's the gain of going through LightweightTreeSequence; couldn't you just use TreeSequence.dump_tables()? Is the idea that the former is more like a protocol that interoperates between versions?

Asking because it's fairly easy to bundle the _tskit module inside another Python package, but somewhat non-trivial to bundle tskit.

@codecov

codecov bot commented Sep 25, 2020

Codecov Report

Merging #862 into main will increase coverage by 0.00%.
The diff coverage is 94.77%.


@@           Coverage Diff           @@
##             main     #862   +/-   ##
=======================================
  Coverage   93.52%   93.52%           
=======================================
  Files          24       25    +1     
  Lines       19927    19932    +5     
  Branches      789      789           
=======================================
+ Hits        18636    18641    +5     
  Misses       1259     1259           
  Partials       32       32           
Impacted Files                   Coverage Δ
python/_tskit_lwt_interface.h    94.77% <94.77%> (ø)
python/_tskitmodule.c            90.60% <100.00%> (-0.72%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@benjeffery benjeffery left a comment

LGTM!
@jeromekelleher shall we merge now and do the docs/example builds in a follow-up?

@jeromekelleher
Member Author

jeromekelleher commented Sep 28, 2020

That's a good question @terhorst. If you have your data in a LightweightTableCollection lwt, then the way to get a tskit TreeSequence is:

tables = tskit.TableCollection.fromdict(lwt.asdict())
ts = tables.tree_sequence()
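
The handoff above only needs a plain dictionary to pass between two objects that otherwise know nothing about each other. A minimal sketch of that pattern follows. The classes `ToyLightweightTables` and `ToyTableCollection` and the two-column layout are hypothetical stand-ins invented for illustration; they are not part of tskit and do not reflect its real dictionary encoding.

```python
class ToyLightweightTables:
    """Hypothetical stand-in for a low-level table container (not tskit)."""

    def __init__(self, sequence_length):
        self.sequence_length = sequence_length
        self.columns = {}

    def fromdict(self, d):
        # Copy every column so the two objects never share state.
        self.sequence_length = d["sequence_length"]
        self.columns = {k: list(v) for k, v in d["columns"].items()}

    def asdict(self):
        return {
            "sequence_length": self.sequence_length,
            "columns": {k: list(v) for k, v in self.columns.items()},
        }


class ToyTableCollection:
    """Hypothetical stand-in for the high-level table class (not tskit)."""

    def __init__(self, sequence_length, columns):
        self.sequence_length = sequence_length
        self.columns = columns

    @classmethod
    def fromdict(cls, d):
        columns = {k: list(v) for k, v in d["columns"].items()}
        return cls(d["sequence_length"], columns)

    def asdict(self):
        return {
            "sequence_length": self.sequence_length,
            "columns": {k: list(v) for k, v in self.columns.items()},
        }


# Fill the low-level object, then rebuild the high-level one from its dict.
lwt = ToyLightweightTables(sequence_length=10.0)
lwt.fromdict(
    {"sequence_length": 10.0, "columns": {"left": [0.0, 5.0], "right": [5.0, 10.0]}}
)
tables = ToyTableCollection.fromdict(lwt.asdict())
round_trip_ok = tables.asdict() == lwt.asdict()
```

Because the only shared artefact is the dictionary, neither class needs to import the other, which is the property that lets a downstream extension module carry its own copy of the low-level class.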

However, if I understand your application, you'll probably want to do this on the C side. So, you have a tskit.TreeSequence as input, and you want to turn that into a tsk_treeseq_t so you can do stuff with it in C. For that, on the Python side do:

lwt = xsmc.LightweightTableCollection(ts.sequence_length)
lwt.fromdict(ts.tables.asdict())
# pass lwt to your class/function that's doing the C stuff.

Then, in C:

tsk_treeseq_t ts;
int ret;
// lwt is a LightweightTableCollection instance
ret = tsk_treeseq_init(&ts, lwt->tables, TSK_BUILD_INDEXES);
// a negative ret is a tskit error code; check it before using &ts
// now you can do stuff with &ts.

There's a bit of inefficiency here in that we need to recompute the indexes (I think the index data isn't included in the dictionary encoding at the moment), but it's pretty minimal unless the tree sequence is big.

Does this make sense/help?

@jeromekelleher
Member Author

@jeromekelleher shall we merge now and do the docs/example builds in a follow-up?

I don't mind - are you OK with the names of things? I didn't really put much thought into how that would affect things. But yeah, if you're happy with it as is, then let's merge and lodge an issue for follow-ups.

One thing I'd like to sort out is how we can reuse the tests in downstream repos without editing the code. I think the only change we make at the moment is the import X as c_module. Any ideas on how we could somehow parameterise this test, so we could soft-link the test module into the downstream repo's test suite?
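
One way to avoid editing the `import X as c_module` line might be to resolve the module name at run time. This is a minimal sketch, not what the PR does: the environment variable `LWT_TEST_MODULE`, the helper `load_c_module`, and the test class are all hypothetical, and the stdlib `json` module stands in for a real low-level extension so that the snippet runs anywhere.

```python
import importlib
import os
import unittest


def load_c_module(environ=os.environ, default="json"):
    # LWT_TEST_MODULE is a hypothetical variable name; "json" is only a
    # stand-in for a real low-level extension module.
    name = environ.get("LWT_TEST_MODULE", default)
    return importlib.import_module(name)


# Resolved once at import time, so the test bodies never name the module.
c_module = load_c_module()


class TestModuleUnderTest(unittest.TestCase):
    def test_module_resolved(self):
        expected = os.environ.get("LWT_TEST_MODULE", "json")
        self.assertEqual(c_module.__name__, expected)
```

With something like this, a downstream repo could soft-link the unmodified test file and select its own extension module by setting the variable before running its test suite.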

@benjeffery benjeffery changed the base branch from master to main September 28, 2020 12:11
@benjeffery
Member

Any ideas on how we could somehow parameterise this test, so we could soft-link the test module into the downstream repo's test suite?

@jeromekelleher I've just pushed a commit here showing how this might be done. Basically, you move the tests into tskit so they can be used wherever. 8b2aca0

tskit.TableCollection.fromdict(d)
# Check we've done something
self.assertTrue(seen_msprime)
from tskit.dict_encoding_tests import * # noqa: F403,F401
Member Author

It's a good idea, but this will mean that we use the version of the tests from the installed version of tskit, not the version that corresponds to the C library version. These could get out of sync. Suppose we have msprime using version 0.99.X and everything is working fine. Then we release a tskit version based on 0.99.X + 1, which has added some extra fields to the data model that are irrelevant for msprime. Then these tests will break. It would essentially force us to update the C API in all downstream packages every time we update the C API in Python tskit, which I don't think we want to do.

Member Author

Probably simplest to mess around with this downstream and see what can be made to work. Maybe we merge the basic change first, and see how we go? It'll make things easier to test if we can point msprime at a hash in the main repo rather than pointing at a fork.

Member

Yes, the tests in a tskit-using application need to pull the tests from the tskit submodule, not from the installed tskit. So it's either a weird-looking import or a symlink, I think?
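
The "weird-looking import" option could be done with `importlib.util`, which loads a module from an explicit file path and so bypasses whatever copy of tskit happens to be installed. This is a sketch under assumptions: the helper `load_module_from_path` and the module name `vendored_dict_encoding_tests` are made up, and the demo writes a throwaway file to a temporary directory in place of a real file inside a git submodule.

```python
import importlib.util
import pathlib
import sys
import tempfile


def load_module_from_path(name, path):
    """Import the file at `path` under the module name `name`."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    # Register before executing so imports inside the file can find it.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module


# Demo only: a temporary file stands in for a test module in the submodule.
with tempfile.TemporaryDirectory() as tmp:
    test_file = pathlib.Path(tmp) / "dict_encoding_tests.py"
    test_file.write_text("MARKER = 'loaded from explicit path'\n")
    mod = load_module_from_path("vendored_dict_encoding_tests", test_file)
```

A symlink achieves the same effect with no code at all, at the cost of being less portable on platforms where symlinks in git checkouts are awkward.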

Member

OK, let's back out the last commit I did, then merge.

@mergify mergify bot merged commit 160f14b into tskit-dev:main Sep 28, 2020
This was referenced Sep 28, 2020
@jeromekelleher jeromekelleher deleted the decouple-lwt branch September 29, 2020 11:21