Skip to content

Conversation

@benjeffery
Copy link
Member

@benjeffery benjeffery commented Apr 18, 2020

Fixes #57
Continuation of #491, this is the high-level python that validates. encodes and decodes metadata based on schemas. I've also experimented with adding type annotations, I'm liking them so far.

This gist explains the functionality: https://gist.github.com/benjeffery/e7c68bab9839259dbab9d888d2458fa3

Left to do:

  • Complete tests
  • Method docs
  • Changelog

@benjeffery benjeffery changed the title Metadata python Metadata - high-level python Apr 18, 2020
@benjeffery benjeffery mentioned this pull request Apr 18, 2020
@codecov
Copy link

codecov bot commented Apr 20, 2020

Codecov Report

Merging #543 into master will increase coverage by 0.04%.
The diff coverage is 97.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #543      +/-   ##
==========================================
+ Coverage   87.31%   87.35%   +0.04%     
==========================================
  Files          21       22       +1     
  Lines       16510    16682     +172     
  Branches     3243     3260      +17     
==========================================
+ Hits        14415    14573     +158     
  Misses       1030     1030              
- Partials     1065     1079      +14     
Flag Coverage Δ
#c_tests 88.59% <97.44%> (+0.10%) ⬆️
#python_c_tests 90.42% <97.44%> (+0.03%) ⬆️
#python_tests 99.04% <97.44%> (-0.17%) ⬇️
Impacted Files Coverage Δ
python/tskit/trees.py 98.18% <92.64%> (-0.49%) ⬇️
python/tskit/__init__.py 100.00% <100.00%> (ø)
python/tskit/exceptions.py 100.00% <100.00%> (ø)
python/tskit/metadata.py 100.00% <100.00%> (ø)
python/tskit/tables.py 99.65% <100.00%> (+0.01%) ⬆️
python/_tskitmodule.c 83.81% <0.00%> (-0.17%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f2d7810...50f9df6. Read the comment docs.

@benjeffery benjeffery marked this pull request as ready for review April 21, 2020 13:47
@benjeffery benjeffery force-pushed the metadata_python branch 2 times, most recently from b762ffc to d5818ec Compare April 21, 2020 14:34
@benjeffery
Copy link
Member Author

@jeromekelleher just realised I forgot to ping you on this one as it is ready.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @benjeffery. I haven't gone through this line-by-line, but have some high-level questions. I think we should break up the concept of schemas and codecs, as it's confusing to me that we have a class call JSONMetadataSchema --- is the schema JSON, or the encoded data JSON?

The documentation is a bit muddled I think. What we need is a top-level section somewhere in the Python API page that discusses metadata in detail. How it works, from top to bottom. Registering schemas etc is part of the data model, but decoding etc isn't. So, we need a section in the data model that discusses the required format of the metadata schemas (which I guess is akin to the Provenance schemas, so should be close to that and follow a similar pattern). Writing these high-level definitional sections first will help, because we can then refer back to them in the other sections where we're currently hitting metadata.

I guess a tutorial section on how to build in metadata schemas would be helpful too.

only raw :class:`bytes` values are accepted: no character encoding or
decoding is done on the data. Consider the following example::
Consider the following example where a table has no
``metadata_schema`` such that arbitrary bytes can be stored. (This is not recommended
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "such that arbitrary bytes can be stored and no automatic encoding or decoding of objects is performed
(see :ref:sec_metadata_XXX for details)"

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is really nice, thanks. Might be worth looking at the architecture of numcodecs for a little more inspiration on how the codec architecture could be structured.

Comment on lines 106 to 108
def __init__(self, codec: str, schema: dict) -> None:
self._schema = schema
self._codec = codec
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the codec be read in as part of the schema? How else are clients to know how to decode the data when they read it in?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused what you mean here. If you have a metadata_schema string from the C API you should be calling metadata.parse_metadata_schema. This is for making a MetadataSchema when you are setting it in client table creation code.

Copy link
Member

@jeromekelleher jeromekelleher Apr 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about the structure of the schema itself, so we have

 {
            "type": "object",
            "properties": {
                "codec": "json", # A required property in our schema?
                "accession_number": {"type": "number"}, 
                "collection_date": {
                    "name": "Collection date",
                    "description": "Date of sample collection in ISO format",
                    "type": "string", 
                    "pattern": "^([1-9][0-9]{3})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])?$"
                },
            },
            "required": ["accession_number"],
            "additionalProperties": False,
        }

So then when client code reads the schema (in whatever language) it knows how to interpret the bytes that are in the metadata fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhhh, I see where we are missing each other. A codec applies on a per-table basis, not a per-property basis.

Copy link
Member Author

@benjeffery benjeffery Apr 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The per-property basis is for typing when using the struct codec.

@benjeffery
Copy link
Member Author

@jeromekelleher I've had another pass at metadata.py. As discussed there is only one schema now, which is used across all codecs. Also cleaned up how the "" noop schema and codec are handled. Not checked coverage or changed docs yet, wanted feedback on this change first.

@jeromekelleher
Copy link
Member

The Codec class structure looks great, nice and simple and very clear.

Looks like the codec attribute is still external to the actual schema - is this something you're planning on changing, or is it better this way?

@benjeffery
Copy link
Member Author

Sorry - should have said, yes just working out how to extend the schema from jsonschema to include codec. I think it is simpler to have it be part of the schema.

@jeromekelleher
Copy link
Member

Sorry - should have said, yes just working out how to extend the schema from jsonschema to include codec. I think it is simpler to have it be part of the schema.

Yes, the more I think about this the more important it is. People would end up doing crazy things if we had a top level dictionary that isn't validated as part of the metaschema (or, conversely, we'd have to validate ourselves which seems perverse).

@benjeffery
Copy link
Member Author

benjeffery commented Apr 30, 2020

@jeromekelleher codec is now part of the schema. Annoyingly if I add codec as a required property in the metaschema it becomes required at all level of the schema as the meta schema is recursively defined.
Just giving the docs a second pass needed now I think.

@benjeffery benjeffery force-pushed the metadata_python branch 3 times, most recently from 4bbe071 to 7eeb169 Compare May 5, 2020 14:39
@benjeffery benjeffery requested a review from jeromekelleher May 5, 2020 14:44
@benjeffery
Copy link
Member Author

I've done a big docs refactor. I've also managed to get codec properly into the metaschema as a required top-level property.

@jeromekelleher
Copy link
Member

OK, will checkout and review tomorrow latest.

@benjeffery
Copy link
Member Author

@jeromekelleher no rush, plenty else to get on with that is orthogonal

@benjeffery benjeffery force-pushed the metadata_python branch 2 times, most recently from f7d35d7 to f4ba2ca Compare May 6, 2020 00:31
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @benjeffery! Just a few small changes needed to merge here. We do need to "flatten" the external API by importing the metadata operations in the tskit namespace, and propagating this through the docs.

Let's see how the struct codec looks, and then take a final pass through the API and docs.

@jeromekelleher
Copy link
Member

Great! Let's get this merged!

@benjeffery
Copy link
Member Author

@jeromekelleher squashed, rebased and CI green!

@jeromekelleher jeromekelleher merged commit 88f043c into tskit-dev:master May 7, 2020
@jeromekelleher
Copy link
Member

Merged, hooray!!!! This is a huge power-up, thanks @benjeffery!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Metadata schemas and automatic decoding

2 participants