
Conversation

@benjeffery (Member) commented Mar 19, 2020

#57

I've got far enough with this that I'd appreciate some feedback, particularly on how I've modified the C code. I've concentrated on the "consumer" end of things, and have only worked on the IndividualTable as a proof of concept. The added test illustrates the functionality:

    def test_metadata_schema(self):
        metadata_schema = {
            "encoding": "json",
            "schema": {
                "title": "Example Metadata",
                "type": "object",
                "properties": {"one": {"type": "string"}, "two": {"type": "number"}},
                "required": ["one", "two"],
            },
        }
        data = {"one": "val one", "two": 5}
        individuals = tskit.tables.IndividualTable(metadata_schema=metadata_schema)
        individuals.add_row(metadata=json.dumps(data).encode())
        self.assertDictEqual(individuals[0].metadata, data)

This can be extended to other encodings such as struct and msgpack. The added row is not yet validated against the schema, but that's the plan. We could also support individuals.add_row(metadata=data) by encoding on the way in.
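For illustration, a sketch of what a struct encoding might look like - the binaryFormat key and the packing step here are assumptions for the example, not an agreed design:

    import struct

    # Hypothetical schema where each property carries a struct format code.
    metadata_schema = {
        "encoding": "struct",
        "schema": {
            "type": "object",
            "properties": {
                "one": {"type": "string", "binaryFormat": "8s"},
                "two": {"type": "number", "binaryFormat": "d"},
            },
        },
    }
    # Pack the row in property order; "8s" pads the string to 8 bytes.
    data = {"one": "val one", "two": 5}
    encoded = struct.pack("8sd", data["one"].encode(), float(data["two"]))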

@jeromekelleher (Member) left a comment

I like it. We should allow for arbitrary unicode in there (which means treating as raw bytes), but that's an easy fix.

The only issue I can see is the thorny one of round-tripping through text. For this purpose it would be more convenient to have an extra table "metadata_schemas", which would look like

name          schema
individuals   {"something": "something else"}
nodes         {"something": "something else"}

etc.

It's not as easy to process from the C side of things because we'd have to parse the names and match them up with tables, etc., but that's not so bad. We'd also have to check for duplicates, which would definitely be tedious.

So, how important is round-tripping through text? And, can we imagine any other table-level annotations that we'd like to add in the future? (In which case, adding an extra table now would be the wrong thing to do). I don't know, I see benefits on both sides.

Another thing to consider: we also want to have top-level metadata, so we can have TreeSequence.metadata, which we should do now.

@jeromekelleher (Member)

See load_text for the entry point for text parsing.

I'm not sure I care all that much about text to be honest, but we should consider it anyway.

@hyanwong (Member)

For research-quality data this seems good, but for messing around it seems a bit tedious to have to specify a schema if all you want to do is dump arbitrary metadata, e.g. for testing purposes. Will there be a sensible default for the schema? For this sort of thing I would also quite like the encoding to be taken care of, so that I could just pass a (serializable) Python dict as metadata and the JSON encoding would deal with the rest:

individuals.add_row(metadata={"random":"anything"})

but then ideally I'd like to pull out the metadata and not have to explicitly json.loads() it, and I can't see how you could distinguish between metadata that has been passed in as a Python object and metadata that has been explicitly JSON-ified by the user beforehand.

Also, just to second the idea of TreeSequence level metadata. In particular, given a tree sequence, I'd like to know whether time is measured in generations, years, thousands of years, "gene frequencies" (as in tsinfer), or is just some arbitrary order.

@benjeffery (Member, Author)

@jeromekelleher Yeah, I've not done this right for unicode - will fix. For round-tripping through text, wouldn't it be quite simple to have a table header be part of the dump?

@hyanwong I mentioned in #57 that we would think about making auto-generation of the schema an option - something like metadata_schema = generate_schema_from_example('struct', data). I also think we should consider having add_row do the validating and then the encoding when a schema is present. I want to get the basic schema storage rock-solid first, then we can iterate on the high-level API.
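As a rough sketch, such a helper could infer property types from an example dict for the json case (the function doesn't exist yet; everything here is illustrative):

    def generate_schema_from_example(encoding, data):
        # Map Python types to JSON-schema type names; exact-type lookup
        # keeps bools from being classed as numbers.
        type_map = {str: "string", bool: "boolean", int: "number", float: "number"}
        properties = {k: {"type": type_map[type(v)]} for k, v in data.items()}
        return {
            "encoding": encoding,
            "schema": {
                "type": "object",
                "properties": properties,
                "required": sorted(properties),
            },
        }

    # e.g. generate_schema_from_example("json", {"one": "val one", "two": 5})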

@jeromekelleher (Member)

> @jeromekelleher For round-tripping through text, wouldn't it be quite simple to have a table header be part of the dump?

OK, sounds good. I'll look forward to a proposal!

@jeromekelleher (Member)

> @jeromekelleher For round-tripping through text, wouldn't it be quite simple to have a table header be part of the dump?

Ah yes, but this wouldn't include the top-level table collection metadata and schema. Then again, we don't include the top-level sequence_length attribute either, so I'm not sure we should worry about it. Text is very much "handy-to-have-for-tests-and-viz" for me, but people shouldn't be encouraged to use it for real things.

@benjeffery (Member, Author)

I've been reading over tskit-dev/msprime#925 and popsim-consortium/stdpopsim#438 thinking about how metadata is managed in a sequence of operations on a tree sequence by different tools.

Say I want to add a new attribute to existing metadata - what happens if I specify an encoding that is different from the existing schema's? As there is only one schema, I can't. Given that "Namespaces are one honking great idea -- let's do more of those!", is there an argument for namespacing the metadata? E.g. individuals[0].metadata.msprime.attribute, where each namespace can have its own encoding and schema?
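To make that concrete, a sketch of what a namespaced layout might look like if each tool wrote under its own top-level key (purely illustrative):

    # Hypothetical namespaced metadata: one key per tool, each of which
    # could carry its own schema (and, in principle, its own encoding).
    metadata = {
        "msprime": {"attribute": 1.0},
        "tsinfer": {"name": "sample_0"},
    }
    value = metadata["msprime"]["attribute"]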

@jeromekelleher (Member)

> Say I want to add a new attribute to existing metadata - what happens if I specify an encoding that is different from the existing schema's? As there is only one schema, I can't. Given that "Namespaces are one honking great idea -- let's do more of those!", is there an argument for namespacing the metadata? E.g. individuals[0].metadata.msprime.attribute, where each namespace can have its own encoding and schema?

I had imagined that well-behaved code that processes a tree sequence would read in the metadata, update the schema to add new fields, and then write it back out. If all the encoding/decoding details are taken care of transparently by tskit, then the encoding shouldn't matter. How the new and old metadata are merged is definitely problematic though, and this does put a lot of trust in code that manipulates tree sequences doing the right thing.
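Something like the following, where the table.metadata_schema accessor is an assumption for the sketch rather than final API:

    # Read the existing schema, add our tool's field, write it back,
    # then re-encode each row's metadata with the new field filled in.
    schema = table.metadata_schema["schema"]
    schema["properties"]["mytool_coeff"] = {"type": "number"}
    table.metadata_schema = {"encoding": "json", "schema": schema}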

In practice though, shouldn't tagging each field in the schema with the name of the program that produced it be sufficient? I'm not sure that we'll be doing that much mixing of metadata, and things like stdpopsim, msprime and pyslim are outliers which we can make sure do sensible things.

I like the idea of namespacing, but I would definitely resent all the extra tsinfers I'd need to type when doing population.metadata.tsinfer.name or individual.metadata.tsinfer.id. That to me is very much the nominal case, which we should make as fluent as possible.

@jeromekelleher (Member)

ps. I hadn't read the Zen of Python in a long time. Some real wisdom in there!

codecov bot commented Apr 2, 2020

Codecov Report

Merging #491 into master will decrease coverage by 0.10%.
The diff coverage is 83.28%.


@@            Coverage Diff             @@
##           master     #491      +/-   ##
==========================================
- Coverage   86.84%   86.73%   -0.11%     
==========================================
  Files          21       22       +1     
  Lines       15996    16300     +304     
  Branches     3109     3163      +54     
==========================================
+ Hits        13891    14138     +247     
- Misses       1055     1076      +21     
- Partials     1050     1086      +36     
Flag              Coverage           Δ
#c_tests          87.78% <92.72%>    (+0.02%) ⬆️
#python_c_tests   90.17% <83.28%>    (-0.38%) ⬇️
#python_tests     98.78% <92.72%>    (-0.40%) ⬇️
Impacted Files              Coverage             Δ
python/_tskitmodule.c       83.81% <73.78%>      (-0.41%) ⬇️
python/tskit/metadata.py    89.23% <89.23%>      (ø)
python/tskit/trees.py       98.09% <92.06%>      (-0.51%) ⬇️
python/tskit/exceptions.py  100.00% <100.00%>    (ø)
python/tskit/tables.py      99.81% <100.00%>     (+0.01%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update be8c0f3...290edf4.

@benjeffery (Member, Author)

Just discovered https://github.com/ijl/orjson - wow! Looks like this might be a good fit.

@hyanwong (Member) commented Apr 2, 2020

orjson has strict JSON conformance in not supporting NaN/Infinity/-Infinity - but I guess these might be OK if serialised in numpy? And it may not matter in metadata anyway.

@benjeffery (Member, Author)

Yeah, that was the one rub. If it became a problem I may be able to monkey-patch something in.

@jeromekelleher (Member)

Hmm, slightly worried about the orjson Rust dependency. Looks like there's good wheel support, but still. What would it actually buy us? Datetime and numpy array support looks nice, but not sure this is worth a complicated dependency...

@benjeffery (Member, Author)

Yep, it definitely needs to earn its keep - one reason to bring it in might be performance, as we'll be calling dumps/loads on every row. But let's bench it and see.

@benjeffery (Member, Author)

On a similar note, some of this might come in handy for validation: https://www.peterbe.com/plog/jsonschema-validate-10x-faster-in-python
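Roughly, the trick from that post (assuming the fastjsonschema package it describes): compile the schema once per table and call the compiled validator per row, since compilation is the expensive step:

    import fastjsonschema

    # The compiled callable raises fastjsonschema.JsonSchemaException on
    # invalid input, so per-row validation stays cheap.
    validate_row = fastjsonschema.compile(
        {
            "type": "object",
            "properties": {"one": {"type": "string"}},
            "required": ["one"],
        }
    )
    validate_row({"one": "val one"})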

@benjeffery (Member, Author)

Ok, metadata is now validated and encoded on add_row, and decoded (not validated) on __getitem__, such that this now works for all tables:

    def test_metadata_schema(self):
        metadata_schema = {
            "encoding": "json",
            "schema": {
                "title": "Example Metadata",
                "type": "object",
                "properties": {"one": {"type": "string"}, "two": {"type": "number"}},
                "required": ["one", "two"],
                "additionalProperties": False,
            },
        }
        data = {"one": "val one", "two": 5}
        individuals = tskit.tables.IndividualTable(metadata_schema=metadata_schema)
        individuals.add_row(metadata=data)
        self.assertDictEqual(individuals[0].metadata, data)

If you try to add invalid metadata you get an error like this:

jsonschema.exceptions.ValidationError: Additional properties are not allowed ("I really shouldn't be here" was unexpected)

Failed validating 'additionalProperties' in schema:
    {'additionalProperties': False,
     'properties': {'one': {'type': 'string'}, 'two': {'type': 'number'}},
     'required': ['one', 'two'],
     'title': 'Example Metadata',
     'type': 'object'}

On instance:
    {"I really shouldn't be here": 6, 'one': 'val one', 'two': 5}

Next steps:

  • Consider validation and encoding for bulk methods, either by providing methods to encode and validate an array of metadata, or by doing so inside append_columns and set_columns when those methods are passed non-bytes metadata columns (see the sketch after this list).

  • The code is currently written for simplicity, so performance is probably awful.
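For the first bullet, a sketch of what bulk encoding could look like on top of the existing packed-column API (using tskit.pack_bytes; illustrative, not the implementation):

    import json
    import tskit

    rows = [{"foo": 1, "bar": 2}, {"foo": 3, "bar": 4}]
    # Encode each row, then pack into the flat byte array + offsets
    # that set_columns already expects.
    encoded = [json.dumps(row).encode() for row in rows]
    metadata, metadata_offset = tskit.pack_bytes(encoded)
    tables = tskit.TableCollection(sequence_length=1)
    tables.populations.set_columns(metadata=metadata, metadata_offset=metadata_offset)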

@benjeffery (Member, Author)

Currently, if metadata_schema is None then nothing is validated/decoded/encoded. This makes the code backwards compatible, but I wonder if it makes it too easy to do the wrong thing. Maybe an explicit metadata_schema = {"encoding": "binary"} would be better.
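i.e. something like this sketch, where the binary schema is an explicit passthrough (this is the proposal, not implemented behaviour):

    # With an explicit binary encoding, raw bytes pass straight through
    # with no validation or decoding - but the user has to say so.
    individuals = tskit.tables.IndividualTable(
        metadata_schema={"encoding": "binary"}
    )
    individuals.add_row(metadata=b"opaque bytes")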

@jeromekelleher (Member) left a comment

Minor comments.

import warnings

import numpy as np
from jsonschema import validate
Member

I prefer importing the module and calling functions on it usually, as then it's clear in the code which functions are defined locally and which ones come from outside. So, when I see jsonschema.validate it's totally obvious, but if I read validate I'm not so sure.

Member Author

Yep, I agree - the editor put this one in and I haven't done a full diff check.
a0c2750

if self.metadata_schema is None:
    return metadata
else:
    validate(metadata, self.metadata_schema["schema"])
Member

Be nice to catch jsonschema's exception here and raise our own. We don't want jsonschema to become part of our API --- maybe we'll want to ditch it for something better later.

Member

See exceptions.ProvenanceValidationError, e.g.
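Something along these lines, where MetadataValidationError is a hypothetical name following the ProvenanceValidationError pattern:

    import jsonschema

    class MetadataValidationError(ValueError):
        # Hypothetical exception type, so jsonschema stays out of our API.
        pass

    def validate_metadata(metadata, schema):
        try:
            jsonschema.validate(metadata, schema)
        except jsonschema.exceptions.ValidationError as e:
            raise MetadataValidationError(str(e)) from e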

Member Author

Done in 38de515

@jeromekelleher (Member)

Looks great, let's talk through the details tomorrow.

@petrelharp (Contributor)

> one reason to bring it in might be performance, as we'll be calling dumps/loads on every row

This might be an argument to not use JSON for big tree sequences where performance is an issue...

@benjeffery (Member, Author)

Bulk methods now take objects as metadata, e.g.:

population_table.set_columns(metadata=[{'foo':1, 'bar':2}, {'foo':3, 'bar':4}])

Next step is for table.metadata to return decoded metadata.

@benjeffery (Member, Author) commented Apr 3, 2020

@jeromekelleher and I had a chat about this PR; we decided to:

  • Make validation fail if metadata is not passed to add_row and the table has a schema under which {} would fail.
  • Remove metadata validation/decoding/encoding from the bulk methods on the table API.
  • Create a tskit.metadata.MetadataSchema class to manage validation and decoding/encoding (see the sketch below).
  • Add decode to ts.node et al.

We'll then defer other encoders and performance work to subsequent PRs.
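A minimal sketch of what that MetadataSchema class might look like for the json encoding - the method names are assumptions, not the final API:

    import json
    import jsonschema

    class MetadataSchema:
        def __init__(self, schema):
            self.schema = schema

        def validate_and_encode_row(self, row):
            # Validate against the JSON schema, then serialise to bytes.
            jsonschema.validate(row, self.schema["schema"])
            return json.dumps(row).encode()

        def decode_row(self, raw):
            # Decoding deliberately skips validation, mirroring __getitem__.
            return json.loads(raw.decode())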

@petrelharp (Contributor)

> Make validation fail if metadata is not passed to add_row and the table has a schema under which {} would fail.

So, will it be possible to say "X or nothing at all" in the schema, then? Currently we have mixes of metadata and no metadata when, e.g., we add more mutations to a SLiM simulation with msprime. Adding SLiM metadata to all the msprime mutations would not be a bad thing, but we'll have to add the capability to msprime.mutate() to do that, if it would otherwise fail tree sequence validation.

@benjeffery (Member, Author)

The schema can specify, per attribute, whether that attribute is required or not. Whether attributes not in the schema are allowed is also configurable. I'll be adding some docs and examples for this, as it's a very flexible system.
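For example, a sketch of a schema that both empty metadata and a full row would satisfy (the selection_coeff field is invented for the example):

    # Every field is optional here, so {} validates, as does a row
    # with the field present; unknown attributes are rejected.
    metadata_schema = {
        "encoding": "json",
        "schema": {
            "type": "object",
            "properties": {"selection_coeff": {"type": "number"}},
            "required": [],
            "additionalProperties": False,
        },
    }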

@jeromekelleher (Member)

Hey @grahamgower, would you mind taking a look over this proposal please? It'd be good to get your thoughts (especially since you're in the trenches with this stuff on stdpopsim!)

@benjeffery (Member, Author)

@grahamgower this gist explains the situation and proposal: https://gist.github.com/benjeffery/e7c68bab9839259dbab9d888d2458fa3

@benjeffery (Member, Author)

This PR is superseded by #542 and #543

@benjeffery closed this Apr 18, 2020
