Metadata - high-level python #543
Codecov Report
@@            Coverage Diff             @@
##           master     #543      +/-   ##
==========================================
+ Coverage   87.31%   87.35%   +0.04%
==========================================
  Files          21       22       +1
  Lines       16510    16682     +172
  Branches     3243     3260      +17
==========================================
+ Hits        14415    14573     +158
  Misses       1030     1030
- Partials     1065     1079      +14
@jeromekelleher just realised I forgot to ping you on this one as it is ready.
jeromekelleher
left a comment
Looks good, thanks @benjeffery. I haven't gone through this line-by-line, but have some high-level questions. I think we should break up the concept of schemas and codecs, as it's confusing to me that we have a class called JSONMetadataSchema --- is the schema JSON, or the encoded data JSON?
The documentation is a bit muddled I think. What we need is a top-level section somewhere in the Python API page that discusses metadata in detail. How it works, from top to bottom. Registering schemas etc is part of the data model, but decoding etc isn't. So, we need a section in the data model that discusses the required format of the metadata schemas (which I guess is akin to the Provenance schemas, so should be close to that and follow a similar pattern). Writing these high-level definitional sections first will help, because we can then refer back to them in the other sections where we're currently hitting metadata.
I guess a tutorial section on how to build in metadata schemas would be helpful too.
docs/python-api.rst
only raw :class:`bytes` values are accepted: no character encoding or
decoding is done on the data. Consider the following example::
Consider the following example where a table has no
``metadata_schema`` such that arbitrary bytes can be stored. (This is not recommended
Maybe "such that arbitrary bytes can be stored and no automatic encoding or decoding of objects is performed
(see :ref:sec_metadata_XXX for details)"
jeromekelleher
left a comment
Yes, this is really nice, thanks. Might be worth looking at the architecture of numcodecs for a little more inspiration on how the codec architecture could be structured.
python/tskit/metadata.py
    def __init__(self, codec: str, schema: dict) -> None:
        self._schema = schema
        self._codec = codec
Shouldn't the codec be read in as part of the schema? How else are clients to know how to decode the data when they read it in?
Confused what you mean here. If you have a metadata_schema string from the C API you should be calling metadata.parse_metadata_schema. This is for making a MetadataSchema when you are setting it in client table creation code.
I'm thinking about the structure of the schema itself, so we have
{
    "type": "object",
    "properties": {
        "codec": "json",  # A required property in our schema?
        "accession_number": {"type": "number"},
        "collection_date": {
            "name": "Collection date",
            "description": "Date of sample collection in ISO format",
            "type": "string",
            "pattern": "^([1-9][0-9]{3})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])?$"
        },
    },
    "required": ["accession_number"],
    "additionalProperties": False,
}
So then when client code reads the schema (in whatever language) it knows how to interpret the bytes that are in the metadata fields.
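As a rough illustration of the point above, here is a minimal sketch of how a reader could pick a decoder from the schema alone. It assumes the codec name is stored as a top-level `"codec"` key of the schema; `CODECS`, `parse_schema`, `encode_row` and `decode_row` are hypothetical names for this sketch, not the tskit API.

```python
import json

# Hypothetical codec registry: maps the schema's "codec" value to a pair of
# (encode, decode) functions. Only a JSON codec is sketched here.
CODECS = {
    "json": (
        lambda obj: json.dumps(obj, sort_keys=True).encode("utf-8"),
        lambda raw: json.loads(raw.decode("utf-8")),
    ),
}


def parse_schema(schema_str):
    """Parse a stored schema string; an empty string means 'no schema'."""
    if schema_str == "":
        return None
    schema = json.loads(schema_str)
    if schema.get("codec") not in CODECS:
        raise ValueError(f"Unrecognised codec: {schema.get('codec')!r}")
    return schema


def encode_row(schema, obj):
    """Encode one metadata value; without a schema, bytes pass through."""
    if schema is None:
        return obj
    encode, _ = CODECS[schema["codec"]]
    return encode(obj)


def decode_row(schema, raw):
    """Decode one metadata value; without a schema, return the raw bytes."""
    if schema is None:
        return raw
    _, decode = CODECS[schema["codec"]]
    return decode(raw)
```

Because the codec name travels with the schema string, a client in any language only needs the schema to know how to interpret the bytes in the metadata column.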
Ahhhh, I see where we are missing each other. A codec applies on a per-table basis, not a per-property basis.
The per-property basis is for typing when using the struct codec.
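To make the distinction concrete, here is a hedged sketch of a struct-style codec: one codec for the whole table, with per-property type information used only to build the binary layout. The `"struct_format"` key and the function names are assumptions made for illustration; they are not taken from this PR or from the tskit metaschema.

```python
import struct

# Hypothetical schema: "struct_format" is an illustrative key holding a
# Python struct format character for each property.
schema = {
    "codec": "struct",
    "type": "object",
    "properties": {
        "accession_number": {"type": "number", "struct_format": "i"},
        "weight": {"type": "number", "struct_format": "d"},
    },
}


def row_format(schema):
    """Build a single struct format string for a row from per-property types."""
    # Sort property names so the encoder and decoder agree on field order.
    names = sorted(schema["properties"])
    # "<" selects little-endian, standard sizes, no padding.
    fmt = "<" + "".join(schema["properties"][n]["struct_format"] for n in names)
    return names, fmt


def struct_encode_row(schema, row):
    """Pack a dict of values into bytes using the schema's layout."""
    names, fmt = row_format(schema)
    return struct.pack(fmt, *(row[n] for n in names))


def struct_decode_row(schema, raw):
    """Unpack bytes back into a dict keyed by property name."""
    names, fmt = row_format(schema)
    return dict(zip(names, struct.unpack(fmt, raw)))
```

The codec is chosen once per table from the schema; the per-property entries only say how each field is laid out in the packed bytes.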
@jeromekelleher I've had another pass at metadata.py. As discussed there is only one schema now, which is used across all codecs. Also cleaned up how the
The Codec class structure looks great, nice and simple and very clear. Looks like the
Sorry - should have said, yes just working out how to extend the schema from
Yes, the more I think about this the more important it is. People would end up doing crazy things if we had a top level dictionary that isn't validated as part of the metaschema (or, conversely, we'd have to validate ourselves which seems perverse).
@jeromekelleher codec is now part of the schema. Annoyingly, if I add codec as a required property in the metaschema, it becomes required at every level of the schema, because the metaschema is recursively defined.
I've done a big docs refactor. I've also managed to get codec properly into the metaschema as a required top-level property.
OK, will check out and review tomorrow at the latest.
@jeromekelleher no rush, plenty else to get on with that is orthogonal
jeromekelleher
left a comment
Looks good, thanks @benjeffery! Just a few small changes needed to merge here. We do need to "flatten" the external API by importing the metadata operations in the tskit namespace, and propagating this through the docs.
Let's see how the struct codec looks, and then take a final pass through the API and docs.
Great! Let's get this merged!
@jeromekelleher squashed, rebased and CI green!
Merged, hooray!!!! This is a huge power-up, thanks @benjeffery!
Fixes #57
Continuation of #491, this is the high-level Python that validates, encodes and decodes metadata based on schemas. I've also experimented with adding type annotations; I'm liking them so far.
This gist explains the functionality: https://gist.github.com/benjeffery/e7c68bab9839259dbab9d888d2458fa3
Left to do: