Metadata schema (WIP) #491
Conversation
jeromekelleher
left a comment
I like it. We should allow for arbitrary unicode in there (which means treating as raw bytes), but that's an easy fix.
The only issue I can see is the thorny one of round-tripping through text. For this purpose it would be more convenient to have an extra table "metadata_schemas", which would look like
| name        | schema                            |
|-------------|-----------------------------------|
| individuals | `{"something": "something else"}` |
| nodes       | `{"something": "something else"}` |

etc.
It's not as easy to process from the C side of things because we'd have to parse the names and match them up with tables, etc, but that's not so bad. We'd also have to check for duplicates, which would definitely be tedious.
So, how important is round-tripping through text? And, can we imagine any other table-level annotations that we'd like to add in the future? (In which case, adding an extra table now would be the wrong thing to do). I don't know, I see benefits on both sides.
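The name-matching and duplicate-checking described above could be sketched roughly like this (function and variable names are invented for illustration, not tskit API):

```python
# Hypothetical sketch of loading a separate "metadata_schemas" table:
# each (name, schema) row is matched back to a known table by name,
# with the tedious-but-necessary duplicate check.
def assign_schemas(schema_rows, known_tables):
    assigned = {}
    for name, schema in schema_rows:
        if name not in known_tables:
            raise ValueError(f"no table named {name!r}")
        if name in assigned:
            raise ValueError(f"duplicate schema for table {name!r}")
        assigned[name] = schema
    return assigned

rows = [
    ("individuals", '{"something": "something else"}'),
    ("nodes", '{"something": "something else"}'),
]
schemas = assign_schemas(rows, {"individuals", "nodes", "edges"})
```

The same matching would be clumsier in C, as noted, but the logic itself is small.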
Another thing to consider: we also want to have top-level metadata, so we can have TreeSequence.metadata, which we should do now.
See load_text for the entry point in text parsing. I'm not sure I care all that much about text to be honest, but we should consider it anyway.
For research-quality data this seems good, but for messing around it seems a bit tedious to have to specify a schema if all you want to do is dump arbitrary metadata, e.g. for testing purposes. Will there be a sensible default for the schema? For this sort of thing, I would also quite like the encoding to be taken care of, so I could just pass a (serializable) Python dict as metadata and the JSON encoding would deal with the rest; ideally I'd then like to pull the metadata out without having to explicitly decode it. Also, just to second the idea of TreeSequence-level metadata. In particular, given a tree sequence, I'd like to know whether time is measured in generations, years, thousands of years, "gene frequencies" (as in tsinfer), or is just some arbitrary order.
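The transparent handling asked for above could look something like this (helper names are hypothetical, not tskit API): a plain serializable dict goes in, raw bytes are stored in the column, and the dict comes back out without the caller ever touching the `json` module.

```python
import json

# Hypothetical helpers: the table would call these internally so users
# only ever see dicts, never the encoded bytes.
def encode_metadata(obj):
    return json.dumps(obj).encode("utf-8")

def decode_metadata(raw):
    return json.loads(raw.decode("utf-8"))

stored = encode_metadata({"time_units": "generations"})
assert decode_metadata(stored) == {"time_units": "generations"}
```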
@jeromekelleher Yeah, I've not done this right for unicode - will fix. For round-tripping through text, would it be simple enough to have the schema be part of a table header in the dump? @hyanwong I mentioned in #57 that we'd think about making auto-generation of the schema an option. Something like a
OK, sounds good. I'll look forward to a proposal!
Ah yes, but this wouldn't include the top-level table collection metadata and schema. Then again, we don't include the top-level sequence_length attribute either, so I'm not sure we should worry about it. Text is very much "handy-to-have-for-tests-and-viz" for me, but people shouldn't be encouraged to use it for real things.
I've been reading over tskit-dev/msprime#925 and popsim-consortium/stdpopsim#438, thinking about how metadata is managed in a sequence of operations on a tree sequence by different tools. Say I want to add a new attribute to existing metadata - what happens if I specify an encoding that is different from the existing schema? As there is only one schema, I can't. Given that "Namespaces are one honking great idea -- let's do more of those!", is there an argument for namespacing the metadata? e.g.
I had imagined that well-behaving code that processes a tree sequence would read in the metadata, update the schema to add new fields, and then write it back out. If all the encoding/decoding details are taken care of transparently by tskit, then the encoding shouldn't matter. How the new and old metadata is merged is definitely problematic though, and this does put a lot of trust in code that manipulates tree sequences doing the right thing. In practice though, shouldn't tagging each field with the name of the program that produced it in the schema be sufficient? I'm not sure that we'll be doing that much mixing of metadatas, and things like stdpopsim, msprime and pyslim are outliers which we can make sure do sensible things. I like the idea of namespacing, but I would definitely resent all the extra
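The per-tool namespacing idea could be sketched like this (the layout and helper are hypothetical): each program writes under its own key, so tools can add fields without clobbering or re-encoding each other's metadata.

```python
# Hypothetical merge helper: new fields for a tool are merged under
# that tool's own namespace, leaving other tools' entries untouched.
def merge_namespaced(existing, tool, fields):
    merged = dict(existing)
    merged[tool] = {**merged.get(tool, {}), **fields}
    return merged

md = merge_namespaced({}, "msprime", {"mutation_rate": 1e-8})
md = merge_namespaced(md, "stdpopsim", {"model": "OutOfAfrica_3G09"})
```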
ps. I hadn't read the Zen of Python in a long time. Some real wisdom in there! |
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #491      +/-   ##
==========================================
- Coverage   86.84%   86.73%   -0.11%
==========================================
  Files          21       22       +1
  Lines       15996    16300     +304
  Branches     3109     3163      +54
==========================================
+ Hits        13891    14138     +247
- Misses       1055     1076      +21
- Partials     1050     1086      +36
```

Continue to review full report at Codecov.
Just discovered https://github.com/ijl/orjson - wow! Looks like this might be a good fit.
Yeah, that was the one rub. If it became a problem I may be able to monkey-patch something in.
Hmm, slightly worried about the orjson Rust dependency. Looks like there's good wheel support, but still. What would it actually buy us? Datetime and numpy array support looks nice, but not sure this is worth a complicated dependency...
Yep, def needs to earn its keep - one reason to bring it in might be performance, we'll be calling
On a similar note, some of this might come in handy for validation: https://www.peterbe.com/plog/jsonschema-validate-10x-faster-in-python
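The speedup in the linked post comes from compiling the schema once and validating many instances against the compiled form. A stdlib-only stand-in for that idea (this is not fastjsonschema's API, just an illustration of the technique) might look like:

```python
# "Compile once, validate many": turn the schema into a closure so
# per-row validation avoids re-walking the schema dict every time.
# Only the "required" keyword is handled here, for illustration.
def compile_required_check(schema):
    required = tuple(schema.get("required", ()))

    def check(obj):
        missing = [k for k in required if k not in obj]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        return True

    return check

check = compile_required_check({"required": ["one", "two"]})
```

This matters here because validation would run on every `add_row` call, so per-row overhead adds up.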
Ok, metadata is now validated and encoded on `add_row`:

```python
def test_metadata_schema(self):
    metadata_schema = {
        "encoding": "json",
        "schema": {
            "title": "Example Metadata",
            "type": "object",
            "properties": {"one": {"type": "string"}, "two": {"type": "number"}},
            "required": ["one", "two"],
            "additionalProperties": False,
        },
    }
    data = {"one": "val one", "two": 5}
    individuals = tskit.tables.IndividualTable(metadata_schema=metadata_schema)
    individuals.add_row(metadata=data)
    self.assertDictEqual(individuals[0].metadata, data)
```

If you try to add invalid metadata you get an error like this:

Next steps:
Currently if
jeromekelleher
left a comment
Minor comments.
python/tskit/tables.py (outdated):

```python
import warnings

import numpy as np
from jsonschema import validate
```
I usually prefer importing the module and calling functions on it, as then it's clear in the code which functions are defined locally and which ones come from outside. So, when I see jsonschema.validate, it's totally obvious, but if I read validate I'm not so sure.
Yep, I agree - editor put this one in and I haven't done a full-on diff check.
a0c2750
python/tskit/tables.py (outdated):

```python
if self.metadata_schema is None:
    return metadata
else:
    validate(metadata, self.metadata_schema["schema"])
```
It'd be nice to catch jsonschema's exception here and raise our own. We don't want jsonschema to become part of our API --- maybe we'll want to ditch it for something better later.
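The wrapping suggested here could be sketched as follows (class and function names are hypothetical, and the validator is injected so the sketch stays dependency-free; in tskit it would be jsonschema.validate and jsonschema's ValidationError):

```python
# Wrap whatever the third-party validator raises in a library-owned
# exception, so the dependency never leaks into the public API and
# can be swapped out later without breaking callers.
class MetadataValidationError(ValueError):
    """Raised when metadata does not conform to the table's schema."""

def validate_metadata(metadata, schema, validator):
    try:
        validator(metadata, schema)
    except Exception as e:  # jsonschema.ValidationError in practice
        raise MetadataValidationError(str(e)) from e
```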
See exceptions.ProvenanceValidationError, e.g.
Done in 38de515
Looks great, let's talk through the details tomorrow.
This might be an argument to not use JSON for big tree sequences where performance is an issue...
Bulk methods now take objects as metadata, e.g.:

```python
population_table.set_columns(metadata=[{'foo': 1, 'bar': 2}, {'foo': 3, 'bar': 4}])
```

Next step is for
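Under the hood, a bulk method like this has to flatten the per-row objects into the ragged-column layout tskit uses for metadata. A rough sketch of that packing (helper names invented, JSON assumed as the encoding):

```python
import json

# Encode each dict, concatenate the bytes, and record offsets so that
# row i's metadata is data[offsets[i]:offsets[i + 1]].
def pack_metadata(dicts):
    encoded = [json.dumps(d).encode("utf-8") for d in dicts]
    offsets = [0]
    for b in encoded:
        offsets.append(offsets[-1] + len(b))
    return b"".join(encoded), offsets

def unpack_metadata(data, offsets):
    return [
        json.loads(data[offsets[i]:offsets[i + 1]])
        for i in range(len(offsets) - 1)
    ]

rows = [{"foo": 1, "bar": 2}, {"foo": 3, "bar": 4}]
data, offsets = pack_metadata(rows)
```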
@jeromekelleher and I had a chat about this PR, we decided to:
We'll then defer other encoders and perf work to subsequent PRs.
So, will it be possible to say "X or nothing at all" in the schema, then? Currently we have mixes of metadata and no metadata when e.g. we add more mutations to a SLiM simulation with msprime. Adding SLiM metadata to all the msprime mutations would not be a bad thing, but we'll have to add the capability to msprime.mutate() to be able to do that, if it will fail tree sequence validation.
The schema can specify, per attribute, whether that attribute is required or not. Whether attributes not in the schema are allowed is also configurable. I'll be adding some docs and examples for this as it's a very flexible system.
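For the "X or nothing at all" case above, a fully permissive schema might look like this (a hypothetical sketch following the `{"encoding": ..., "schema": ...}` shape used in the test earlier, not a schema from the PR):

```python
# No property is required and extras are allowed, so rows with empty
# or partial metadata (e.g. msprime mutations added to a SLiM
# simulation) would still validate.
permissive_schema = {
    "encoding": "json",
    "schema": {
        "type": "object",
        "properties": {"slim_id": {"type": "number"}},
        "required": [],
        "additionalProperties": True,
    },
}
```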
Hey @grahamgower, would you mind taking a look over this proposal please? It'd be good to get your thoughts (especially since you're in the trenches with this stuff on stdpopsim!)
@grahamgower this gist explains the situation and proposal: https://gist.github.com/benjeffery/e7c68bab9839259dbab9d888d2458fa3
#57
I've got far enough with this that I'd appreciate some feedback - on how I've modified the C code particularly. I've concentrated on the "consumer" end of things. I've only worked on the `IndividualTable` as a proof-of-concept. The added test illustrates the functionality. This can be extended to other encodings such as `struct` and `msgpack`. The added row is not yet validated against the schema, but that's the plan. Could also support `individuals.add_row(metadata=data)` by encoding too.