Metadata - high-level python #543

benjeffery · 2020-04-18T00:34:08Z

Fixes #57
A continuation of #491, this is the high-level Python that validates, encodes and decodes metadata based on schemas. I've also experimented with adding type annotations; I'm liking them so far.

This gist explains the functionality: https://gist.github.com/benjeffery/e7c68bab9839259dbab9d888d2458fa3
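
As a rough illustration of the validate/encode/decode round trip described above, here is a minimal stand-alone sketch. It uses only the standard library. `SketchSchema` and its methods are hypothetical names chosen for illustration, not the API added in this PR, and validation is reduced to a required-keys check rather than full JSON Schema validation.

```python
import json


class SketchSchema:
    """Hypothetical stand-in for a schema-driven metadata handler."""

    def __init__(self, schema: dict) -> None:
        self._schema = schema

    def validate(self, row: dict) -> None:
        # Simplified validation: only the schema's "required" list is checked.
        missing = [k for k in self._schema.get("required", []) if k not in row]
        if missing:
            raise ValueError(f"missing required metadata keys: {missing}")

    def encode(self, row: dict) -> bytes:
        # Validate first, then serialise to the bytes stored in the table.
        self.validate(row)
        return json.dumps(row, sort_keys=True).encode("utf-8")

    def decode(self, raw: bytes) -> dict:
        return json.loads(raw.decode("utf-8"))


schema = SketchSchema({"type": "object", "required": ["accession_number"]})
encoded = schema.encode({"accession_number": 12})
decoded = schema.decode(encoded)
```

Encoding a row that lacks `accession_number` raises `ValueError` in this sketch, so invalid metadata never reaches the stored bytes.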

Left to do:

Complete tests
Method docs
Changelog

codecov · 2020-04-20T12:18:22Z

Codecov Report

Merging #543 into master will increase coverage by 0.04%.
The diff coverage is 97.44%.

@@            Coverage Diff             @@
##           master     #543      +/-   ##
==========================================
+ Coverage   87.31%   87.35%   +0.04%     
==========================================
  Files          21       22       +1     
  Lines       16510    16682     +172     
  Branches     3243     3260      +17     
==========================================
+ Hits        14415    14573     +158     
  Misses       1030     1030              
- Partials     1065     1079      +14

Flag	Coverage Δ
#c_tests	`88.59% <97.44%> (+0.10%)`	⬆️
#python_c_tests	`90.42% <97.44%> (+0.03%)`	⬆️
#python_tests	`99.04% <97.44%> (-0.17%)`	⬇️

Impacted Files	Coverage Δ
python/tskit/trees.py	`98.18% <92.64%> (-0.49%)`	⬇️
python/tskit/__init__.py	`100.00% <100.00%> (ø)`
python/tskit/exceptions.py	`100.00% <100.00%> (ø)`
python/tskit/metadata.py	`100.00% <100.00%> (ø)`
python/tskit/tables.py	`99.65% <100.00%> (+0.01%)`	⬆️
python/_tskitmodule.c	`83.81% <0.00%> (-0.17%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f2d7810...50f9df6.

benjeffery · 2020-04-21T18:08:15Z

@jeromekelleher just realised I forgot to ping you on this one as it is ready.

jeromekelleher

Looks good, thanks @benjeffery. I haven't gone through this line-by-line, but have some high-level questions. I think we should break up the concept of schemas and codecs, as it's confusing to me that we have a class called JSONMetadataSchema --- is the schema JSON, or the encoded data JSON?

The documentation is a bit muddled I think. What we need is a top-level section somewhere in the Python API page that discusses metadata in detail. How it works, from top to bottom. Registering schemas etc is part of the data model, but decoding etc isn't. So, we need a section in the data model that discusses the required format of the metadata schemas (which I guess is akin to the Provenance schemas, so should be close to that and follow a similar pattern). Writing these high-level definitional sections first will help, because we can then refer back to them in the other sections where we're currently hitting metadata.

I guess a tutorial section on how to build in metadata schemas would be helpful too.

docs/data-model.rst

jeromekelleher · 2020-04-22T09:33:45Z

docs/python-api.rst

-only raw :class:`bytes` values are accepted: no character encoding or
-decoding is done on the data. Consider the following example::
+Consider the following example where a table has no
+``metadata_schema`` such that arbitrary bytes can be stored. (This is not recommended


Maybe "such that arbitrary bytes can be stored and no automatic encoding or decoding of objects is performed
(see :ref:sec_metadata_XXX for details)"
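
To make the distinction in the suggested wording concrete, here is a small sketch of the two behaviours: with no schema, bytes pass through untouched; with a schema, objects are encoded and decoded automatically. The function names and the use of `None` to mean "no schema" are assumptions for illustration, not the interface in this PR.

```python
import json
from typing import Any, Optional


def encode_metadata(value: Any, schema: Optional[dict]) -> bytes:
    if schema is None:
        # No schema: only raw bytes are accepted and they are stored as-is.
        if not isinstance(value, bytes):
            raise TypeError("without a schema, metadata must be raw bytes")
        return value
    # With a schema: objects are serialised automatically (JSON assumed here).
    return json.dumps(value).encode("utf-8")


def decode_metadata(raw: bytes, schema: Optional[dict]) -> Any:
    if schema is None:
        # No schema: the caller gets the stored bytes back unchanged.
        return raw
    return json.loads(raw.decode("utf-8"))


raw_passthrough = decode_metadata(encode_metadata(b"\x00\x01", None), None)
object_roundtrip = decode_metadata(
    encode_metadata({"name": "sample"}, {"type": "object"}), {"type": "object"}
)
```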

docs/python-api.rst

python/tskit/metadata.py

jeromekelleher

Yes, this is really nice, thanks. Might be worth looking at the architecture of numcodecs for a little more inspiration on how the codec architecture could be structured.
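
For reference, a registry-based codec layout, loosely in the spirit of the registry idea in numcodecs, might look like the sketch below. All names here (`Codec`, `register_codec`, `get_codec`, `JSONCodec`) are hypothetical and the code is stdlib-only; it is not taken from numcodecs or from this PR.

```python
import json
from abc import ABC, abstractmethod

# Maps a codec name to the class implementing it.
codec_registry = {}


def register_codec(name):
    """Class decorator that records a codec class under the given name."""

    def decorator(cls):
        codec_registry[name] = cls
        return cls

    return decorator


class Codec(ABC):
    @abstractmethod
    def encode(self, obj) -> bytes:
        ...

    @abstractmethod
    def decode(self, raw: bytes):
        ...


@register_codec("json")
class JSONCodec(Codec):
    def encode(self, obj) -> bytes:
        return json.dumps(obj).encode("utf-8")

    def decode(self, raw: bytes):
        return json.loads(raw.decode("utf-8"))


def get_codec(name: str) -> Codec:
    # Look up the class by name and return a fresh instance.
    try:
        return codec_registry[name]()
    except KeyError:
        raise ValueError(f"unknown codec: {name!r}") from None
```

Keeping the registry separate from the schema means new codecs can be added without touching schema validation.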

jeromekelleher · 2020-04-23T10:05:52Z

python/tskit/metadata.py

+    def __init__(self, codec: str, schema: dict) -> None:
+        self._schema = schema
+        self._codec = codec


Shouldn't the codec be read in as part of the schema? How else are clients to know how to decode the data when they read it in?

Confused what you mean here. If you have a metadata_schema string from the C API you should be calling metadata.parse_metadata_schema. This is for making a MetadataSchema when you are setting it in client table creation code.

I'm thinking about the structure of the schema itself, so we have

    {
        "type": "object",
        "properties": {
            "codec": "json",  # A required property in our schema?
            "accession_number": {"type": "number"},
            "collection_date": {
                "name": "Collection date",
                "description": "Date of sample collection in ISO format",
                "type": "string",
                "pattern": "^([1-9][0-9]{3})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])?$"
            },
        },
        "required": ["accession_number"],
        "additionalProperties": False,
    }

So then when client code reads the schema (in whatever language) it knows how to interpret the bytes that are in the metadata fields.

Ahhhh, I see where we are missing each other. A codec applies on a per-table basis, not a per-property basis.

The per-property basis is for typing when using the struct codec.
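
The per-table versus per-property distinction can be sketched as follows: one top-level `codec` key covers the whole table, while each property carries its own binary type for a struct-style codec. The `format_char` key, the fixed little-endian layout, and the sorted property order are assumptions made for this sketch, not a format defined in this PR.

```python
import struct

schema = {
    "codec": "struct",  # one codec for the whole table
    "type": "object",
    "properties": {
        # Hypothetical per-property key giving the binary type of each field.
        "accession_number": {"type": "integer", "format_char": "i"},
        "quality": {"type": "number", "format_char": "d"},
    },
}


def make_struct(schema):
    # Sort the property names so the byte layout is deterministic.
    names = sorted(schema["properties"])
    fmt = "<" + "".join(schema["properties"][n]["format_char"] for n in names)
    return names, struct.Struct(fmt)


def encode_row(schema, row):
    names, packer = make_struct(schema)
    return packer.pack(*(row[n] for n in names))


def decode_row(schema, raw):
    names, packer = make_struct(schema)
    return dict(zip(names, packer.unpack(raw)))


raw = encode_row(schema, {"accession_number": 7, "quality": 0.5})
row = decode_row(schema, raw)
```

A reader in any language only needs the schema to know both which codec to use and how each field is laid out.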

benjeffery · 2020-04-29T09:11:51Z

@jeromekelleher I've had another pass at metadata.py. As discussed there is only one schema now, which is used across all codecs. Also cleaned up how the "" noop schema and codec are handled. Not checked coverage or changed docs yet, wanted feedback on this change first.

jeromekelleher · 2020-04-29T09:42:56Z

The Codec class structure looks great, nice and simple and very clear.

Looks like the codec attribute is still external to the actual schema - is this something you're planning on changing, or is it better this way?

benjeffery · 2020-04-29T11:24:43Z

Sorry - should have said, yes just working out how to extend the schema from jsonschema to include codec. I think it is simpler to have it be part of the schema.

jeromekelleher · 2020-04-29T13:15:44Z

> Sorry - should have said, yes just working out how to extend the schema from jsonschema to include codec. I think it is simpler to have it be part of the schema.

Yes, the more I think about this the more important it is. People would end up doing crazy things if we had a top level dictionary that isn't validated as part of the metaschema (or, conversely, we'd have to validate ourselves which seems perverse).

benjeffery · 2020-04-30T16:32:42Z

@jeromekelleher codec is now part of the schema. Annoyingly, if I add codec as a required property in the metaschema it becomes required at all levels of the schema, as the metaschema is recursively defined.
I think all that's needed now is a second pass over the docs.

benjeffery · 2020-05-05T14:45:26Z

I've done a big docs refactor. I've also managed to get codec properly into the metaschema as a required top-level property.
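
For anyone following along, the recursion problem arises because a metaschema refers to itself with `{"$ref": "#"}`, so anything required at the root is required in every nested subschema. One possible way around it is sketched below on a toy metaschema: point the self-references at a copy stored under `definitions`, then add the requirement to the root only. This is an illustration under those assumptions and is not confirmed to be the approach taken in this PR; `toy_metaschema` and `redirect_root_refs` are made-up names.

```python
import copy

# A toy recursive metaschema: nested property schemas refer back to the root.
toy_metaschema = {
    "type": "object",
    "properties": {
        "type": {"type": "string"},
        "properties": {"type": "object", "additionalProperties": {"$ref": "#"}},
    },
}


def redirect_root_refs(node):
    """Return a copy of node with every {"$ref": "#"} pointed at a definition."""
    if isinstance(node, dict):
        if node.get("$ref") == "#":
            return {"$ref": "#/definitions/root"}
        return {k: redirect_root_refs(v) for k, v in node.items()}
    if isinstance(node, list):
        return [redirect_root_refs(v) for v in node]
    return node


top_level = redirect_root_refs(copy.deepcopy(toy_metaschema))
# Nested schemas now validate against this copy, which never requires codec.
top_level["definitions"] = {"root": copy.deepcopy(top_level)}
# Only the root gains the codec property and the requirement.
top_level["properties"]["codec"] = {"type": "string"}
top_level["required"] = ["codec"]
```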

jeromekelleher · 2020-05-05T17:03:42Z

OK, will check out and review tomorrow at the latest.

benjeffery · 2020-05-05T17:43:40Z

@jeromekelleher no rush, plenty else to get on with that is orthogonal

jeromekelleher

Looks good, thanks @benjeffery! Just a few small changes needed to merge here. We do need to "flatten" the external API by importing the metadata operations in the tskit namespace, and propagating this through the docs.

Let's see how the struct codec looks, and then take a final pass through the API and docs.

docs/metadata.rst

docs/tutorial.rst

python/tskit/metadata.py

appveyor.yml

jeromekelleher · 2020-05-07T07:46:54Z

Great! Let's get this merged!

benjeffery · 2020-05-07T09:55:21Z

@jeromekelleher squashed, rebased and CI green!

jeromekelleher · 2020-05-07T11:23:31Z

Merged, hooray!!!! This is a huge power-up, thanks @benjeffery!

benjeffery changed the title ~~Metadata python~~ Metadata - high-level python Apr 18, 2020

benjeffery mentioned this pull request Apr 18, 2020

Metadata schema (WIP) #491

Closed

benjeffery force-pushed the metadata_python branch from 4c38215 to 9a326cb Compare April 20, 2020 11:30

benjeffery marked this pull request as ready for review April 21, 2020 13:47

benjeffery force-pushed the metadata_python branch 2 times, most recently from b762ffc to d5818ec Compare April 21, 2020 14:34

jeromekelleher reviewed Apr 22, 2020

jeromekelleher reviewed Apr 23, 2020

benjeffery force-pushed the metadata_python branch from 711bf1f to 71c67a7 Compare April 29, 2020 09:12

benjeffery force-pushed the metadata_python branch 3 times, most recently from 4bbe071 to 7eeb169 Compare May 5, 2020 14:39

benjeffery requested a review from jeromekelleher May 5, 2020 14:44

benjeffery force-pushed the metadata_python branch 2 times, most recently from f7d35d7 to f4ba2ca Compare May 6, 2020 00:31

jeromekelleher reviewed May 6, 2020

benjeffery force-pushed the metadata_python branch from 512289b to 9ec44e6 Compare May 7, 2020 00:14

Schema driven metadata validation, encode and decode

50f9df6

benjeffery force-pushed the metadata_python branch from 26f194d to 50f9df6 Compare May 7, 2020 09:21

jeromekelleher merged commit 88f043c into tskit-dev:master May 7, 2020

benjeffery deleted the metadata_python branch May 7, 2020 11:26

hyanwong mentioned this pull request Jun 1, 2020

Output tree sequence should include metadata schemas tskit-dev/tsinfer#272

Closed
