
Conversation

@benjeffery (Member) commented Mar 19, 2020

#57

I've got far enough with this that I'd appreciate some feedback, particularly on how I've modified the C code. I've concentrated on the "consumer" end of things, and have only worked on the IndividualTable as a proof of concept. The added test illustrates the functionality:

    def test_metadata_schema(self):
        metadata_schema = {
            "encoding": "json",
            "schema": {
                "title": "Example Metadata",
                "type": "object",
                "properties": {"one": {"type": "string"}, "two": {"type": "number"}},
                "required": ["one", "two"],
            },
        }
        data = {"one": "val one", "two": 5}
        individuals = tskit.tables.IndividualTable(metadata_schema=metadata_schema)
        individuals.add_row(metadata=json.dumps(data).encode())
        self.assertDictEqual(individuals[0].metadata, data)

This can be extended to other encodings such as struct and msgpack. The added row is not yet validated against the schema, but that's the plan. We could also support individuals.add_row(metadata=data) by encoding on the way in.
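For illustration, a sketch of what a struct encoding might look like - the binaryFormat key and the packing step here are assumptions for the example, not an agreed design:

    import struct

    # Hypothetical schema where each property carries a struct format code.
    metadata_schema = {
        "encoding": "struct",
        "schema": {
            "type": "object",
            "properties": {
                "one": {"type": "string", "binaryFormat": "8s"},
                "two": {"type": "number", "binaryFormat": "d"},
            },
        },
    }
    # Pack the row in property order; "8s" pads the string to 8 bytes.
    data = {"one": "val one", "two": 5}
    encoded = struct.pack("8sd", data["one"].encode(), float(data["two"]))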

@jeromekelleher (Member) left a comment

I like it. We should allow for arbitrary unicode in there (which means treating as raw bytes), but that's an easy fix.

The only issue I can see is the thorny one of round-tripping through text. For this purpose it would be more convenient to have an extra table "metadata_schemas", which would look like

name          schema
individuals   {"something": "something else"}
nodes         {"something": "something else"}

etc.

It's not as easy to process from the C side of things because we'd have to parse the names and match them up with tables, etc., but that's not so bad. We'd also have to check for duplicates, which would definitely be tedious.

So, how important is round-tripping through text? And, can we imagine any other table-level annotations that we'd like to add in the future? (In which case, adding an extra table now would be the wrong thing to do). I don't know, I see benefits on both sides.

Another thing to consider: we also want to have top-level metadata, so we can have TreeSequence.metadata, which we should do now.

@jeromekelleher (Member)

See load_text for the entry point for text parsing.

I'm not sure I care all that much about text to be honest, but we should consider it anyway.

@hyanwong (Member)

For research-quality data this seems good, but for messing around it seems a bit tedious to have to specify a schema if all you want to do is dump arbitrary metadata, e.g. for testing purposes. Will there be a sensible default for the schema? For this sort of thing I would also quite like the encoding to be taken care of, so that I could just pass a (serializable) Python dict as metadata and the JSON encoding would deal with the rest:

individuals.add_row(metadata={"random":"anything"})

but then ideally I'd like to pull out the metadata and not have to explicitly json.loads() it, and I can't see how you could distinguish between metadata that has been passed in as a Python object and metadata that has been explicitly JSON-ified by the user beforehand.

Also, just to second the idea of TreeSequence level metadata. In particular, given a tree sequence, I'd like to know whether time is measured in generations, years, thousands of years, "gene frequencies" (as in tsinfer), or is just some arbitrary order.

@benjeffery (Member, Author)

@jeromekelleher Yeah, I've not done this right for unicode - will fix. For round-tripping through text, wouldn't it be quite simple to have a table header be part of the dump?

@hyanwong I mentioned in #57 that we would think about making auto-generation of the schema an option - something like metadata_schema = generate_schema_from_example('struct', data). I also think we should consider having add_row do the validating and then the encoding when a schema is present. I want to get the basic schema storage rock-solid first, then we can iterate on the high-level API.
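As a rough sketch, such a helper could infer property types from an example dict for the json case (the function doesn't exist yet; everything here is illustrative):

    def generate_schema_from_example(encoding, data):
        # Map Python types to JSON-schema type names; exact-type lookup
        # keeps bools from being classed as numbers.
        type_map = {str: "string", bool: "boolean", int: "number", float: "number"}
        properties = {k: {"type": type_map[type(v)]} for k, v in data.items()}
        return {
            "encoding": encoding,
            "schema": {
                "type": "object",
                "properties": properties,
                "required": sorted(properties),
            },
        }

    # e.g. generate_schema_from_example("json", {"one": "val one", "two": 5})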

@jeromekelleher (Member)

> @jeromekelleher For round-tripping through text, wouldn't it be quite simple to have a table header be part of the dump?

OK, sounds good. I'll look forward to a proposal!

@jeromekelleher (Member)

> @jeromekelleher For round-tripping through text, wouldn't it be quite simple to have a table header be part of the dump?

Ah yes, but this wouldn't include the top-level table collection metadata and schema. Then again, we don't include the top-level sequence_length attribute either, so I'm not sure we should worry about it. Text is very much "handy-to-have-for-tests-and-viz" for me, but people shouldn't be encouraged to use it for real things.

@benjeffery (Member, Author)

I've been reading over tskit-dev/msprime#925 and popsim-consortium/stdpopsim#438 thinking about how metadata is managed in a sequence of operations on a tree sequence by different tools.

Say I want to add a new attribute to existing metadata - what happens if I specify an encoding that is different from the existing schema's? As there is only one schema, I can't. Given that "Namespaces are one honking great idea -- let's do more of those!", is there an argument for namespacing the metadata? E.g. individuals[0].metadata.msprime.attribute, where each namespace can have its own encoding and schema?
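To make that concrete, a sketch of what a namespaced layout might look like if each tool wrote under its own top-level key (purely illustrative):

    # Hypothetical namespaced metadata: one key per tool, each of which
    # could carry its own schema (and, in principle, its own encoding).
    metadata = {
        "msprime": {"attribute": 1.0},
        "tsinfer": {"name": "sample_0"},
    }
    value = metadata["msprime"]["attribute"]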

@jeromekelleher (Member)

> Say I want to add a new attribute to existing metadata - what happens if I specify an encoding that is different from the existing schema's? As there is only one schema, I can't. Given that "Namespaces are one honking great idea -- let's do more of those!", is there an argument for namespacing the metadata? E.g. individuals[0].metadata.msprime.attribute, where each namespace can have its own encoding and schema?

I had imagined that well-behaved code that processes a tree sequence would read in the metadata, update the schema to add new fields, and then write it back out. If all the encoding/decoding details are taken care of transparently by tskit, then the encoding shouldn't matter. How the new and old metadata are merged is definitely problematic though, and this does put a lot of trust in code that manipulates tree sequences doing the right thing.
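Something like the following, where the table.metadata_schema accessor is an assumption for the sketch rather than final API:

    # Read the existing schema, add our tool's field, write it back,
    # then re-encode each row's metadata with the new field filled in.
    schema = table.metadata_schema["schema"]
    schema["properties"]["mytool_coeff"] = {"type": "number"}
    table.metadata_schema = {"encoding": "json", "schema": schema}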

In practice though, shouldn't tagging each field in the schema with the name of the program that produced it be sufficient? I'm not sure that we'll be doing that much mixing of metadata, and things like stdpopsim, msprime and pyslim are outliers which we can make sure do sensible things.

I like the idea of namespacing, but I would definitely resent all the extra tsinfers I'd need to type when doing population.metadata.tsinfer.name or individual.metadata.tsinfer.id. That to me is very much the nominal case, which we should make as fluent as possible.

@jeromekelleher (Member)

ps. I hadn't read the Zen of Python in a long time. Some real wisdom in there!

codecov bot commented Apr 2, 2020

Codecov Report

Merging #491 into master will decrease coverage by 0.10%.
The diff coverage is 83.28%.


@@            Coverage Diff             @@
##           master     #491      +/-   ##
==========================================
- Coverage   86.84%   86.73%   -0.11%     
==========================================
  Files          21       22       +1     
  Lines       15996    16300     +304     
  Branches     3109     3163      +54     
==========================================
+ Hits        13891    14138     +247     
- Misses       1055     1076      +21     
- Partials     1050     1086      +36     
Flag              Coverage           Δ
#c_tests          87.78% <92.72%>    (+0.02%) ⬆️
#python_c_tests   90.17% <83.28%>    (-0.38%) ⬇️
#python_tests     98.78% <92.72%>    (-0.40%) ⬇️
Impacted Files              Coverage             Δ
python/_tskitmodule.c       83.81% <73.78%>      (-0.41%) ⬇️
python/tskit/metadata.py    89.23% <89.23%>      (ø)
python/tskit/trees.py       98.09% <92.06%>      (-0.51%) ⬇️
python/tskit/exceptions.py  100.00% <100.00%>    (ø)
python/tskit/tables.py      99.81% <100.00%>     (+0.01%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update be8c0f3...290edf4.

@benjeffery (Member, Author)

Just discovered https://github.com/ijl/orjson - wow! Looks like this might be a good fit.

@hyanwong (Member) commented Apr 2, 2020

orjson has strict JSON conformance in not supporting NaN/Infinity/-Infinity - but I guess these might be OK if serialised in numpy? And it may not matter in metadata anyway.

@benjeffery (Member, Author)

Yeah, that was the one rub. If it became a problem I may be able to monkey-patch something in.

@jeromekelleher (Member)

Hmm, slightly worried about the orjson Rust dependency. Looks like there's good wheel support, but still. What would it actually buy us? Datetime and numpy array support looks nice, but not sure this is worth a complicated dependency...

@benjeffery (Member, Author)

Yep, it definitely needs to earn its keep - one reason to bring it in might be performance, as we'll be calling dumps/loads on every row. But let's bench it and see.

@benjeffery (Member, Author)

On a similar note, some of this might come in handy for validation: https://www.peterbe.com/plog/jsonschema-validate-10x-faster-in-python
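Roughly, the trick from that post (assuming the fastjsonschema package it describes): compile the schema once per table and call the compiled validator per row, since compilation is the expensive step:

    import fastjsonschema

    # The compiled callable raises fastjsonschema.JsonSchemaException on
    # invalid input, so per-row validation stays cheap.
    validate_row = fastjsonschema.compile(
        {
            "type": "object",
            "properties": {"one": {"type": "string"}},
            "required": ["one"],
        }
    )
    validate_row({"one": "val one"})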

@benjeffery (Member, Author)

Ok, metadata is now validated and encoded on add_row, and decoded (not validated) on __getitem__, such that this now works for all tables:

    def test_metadata_schema(self):
        metadata_schema = {
            "encoding": "json",
            "schema": {
                "title": "Example Metadata",
                "type": "object",
                "properties": {"one": {"type": "string"}, "two": {"type": "number"}},
                "required": ["one", "two"],
                "additionalProperties": False,
            },
        }
        data = {"one": "val one", "two": 5}
        individuals = tskit.tables.IndividualTable(metadata_schema=metadata_schema)
        individuals.add_row(metadata=data)
        self.assertDictEqual(individuals[0].metadata, data)

If you try to add invalid metadata you get an error like this:

jsonschema.exceptions.ValidationError: Additional properties are not allowed ("I really shouldn't be here" was unexpected)

Failed validating 'additionalProperties' in schema:
    {'additionalProperties': False,
     'properties': {'one': {'type': 'string'}, 'two': {'type': 'number'}},
     'required': ['one', 'two'],
     'title': 'Example Metadata',
     'type': 'object'}

On instance:
    {"I really shouldn't be here": 6, 'one': 'val one', 'two': 5}

Next steps:

  • Consider validation and encoding for bulk methods, either by providing methods to encode and validate an array of metadata, or by doing so inside append_columns and set_columns when those methods are passed non-bytes metadata columns (see the sketch after this list).

  • The code is currently written for simplicity, so performance is probably awful.
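For the first bullet, a sketch of what bulk encoding could look like on top of the existing packed-column API (using tskit.pack_bytes; illustrative, not the implementation):

    import json
    import tskit

    rows = [{"foo": 1, "bar": 2}, {"foo": 3, "bar": 4}]
    # Encode each row, then pack into the flat byte array + offsets
    # that set_columns already expects.
    encoded = [json.dumps(row).encode() for row in rows]
    metadata, metadata_offset = tskit.pack_bytes(encoded)
    tables = tskit.TableCollection(sequence_length=1)
    tables.populations.set_columns(metadata=metadata, metadata_offset=metadata_offset)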

@benjeffery (Member, Author)

Currently, if metadata_schema is None then nothing is validated/decoded/encoded. This makes the code backwards compatible, but I wonder if it makes it too easy to do the wrong thing. Maybe an explicit metadata_schema = {"encoding": "binary"} would be better.
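i.e. something like this sketch, where the binary schema is an explicit passthrough (this is the proposal, not implemented behaviour):

    # With an explicit binary encoding, raw bytes pass straight through
    # with no validation or decoding - but the user has to say so.
    individuals = tskit.tables.IndividualTable(
        metadata_schema={"encoding": "binary"}
    )
    individuals.add_row(metadata=b"opaque bytes")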

@jeromekelleher (Member) left a comment

Minor comments.

import warnings

import numpy as np
from jsonschema import validate
Member

I prefer importing the module and calling functions on it usually, as then it's clear in the code which functions are defined locally and which ones come from outside. So, when I see jsonschema.validate it's totally obvious, but if I read validate I'm not so sure.

Member Author

Yep, I agree - the editor put this one in and I haven't done a full diff check.
a0c2750

if self.metadata_schema is None:
    return metadata
else:
    validate(metadata, self.metadata_schema["schema"])
Member

Be nice to catch jsonschema's exception here and raise our own. We don't want jsonschema to become part of our API --- maybe we'll want to ditch it for something better later.

Member

See exceptions.ProvenanceValidationError, e.g.
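Something along these lines, where MetadataValidationError is a hypothetical name following the ProvenanceValidationError pattern:

    import jsonschema

    class MetadataValidationError(ValueError):
        # Hypothetical exception type, so jsonschema stays out of our API.
        pass

    def validate_metadata(metadata, schema):
        try:
            jsonschema.validate(metadata, schema)
        except jsonschema.exceptions.ValidationError as e:
            raise MetadataValidationError(str(e)) from e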

Member Author

Done in 38de515

@jeromekelleher (Member)

Looks great, let's talk through the details tomorrow.

@petrelharp (Contributor)

> one reason to bring it in might be performance, as we'll be calling dumps/loads on every row

This might be an argument to not use JSON for big tree sequences where performance is an issue...

@benjeffery (Member, Author)

Bulk methods now take objects as metadata, e.g.:

population_table.set_columns(metadata=[{'foo':1, 'bar':2}, {'foo':3, 'bar':4}])

Next step is for table.metadata to return decoded metadata.

@benjeffery (Member, Author) commented Apr 3, 2020

@jeromekelleher and I had a chat about this PR; we decided to:

  • Make validation fail if metadata is not passed to add_row and the table has a schema under which {} would fail.
  • Remove metadata validation/decoding/encoding from the bulk methods on the table API.
  • Create a tskit.metadata.MetadataSchema class to manage validation and decoding/encoding (see the sketch below).
  • Add decode to ts.node et al.

We'll then defer other encoders and performance work to subsequent PRs.
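A minimal sketch of what that MetadataSchema class might look like for the json encoding - the method names are assumptions, not the final API:

    import json
    import jsonschema

    class MetadataSchema:
        def __init__(self, schema):
            self.schema = schema

        def validate_and_encode_row(self, row):
            # Validate against the JSON schema, then serialise to bytes.
            jsonschema.validate(row, self.schema["schema"])
            return json.dumps(row).encode()

        def decode_row(self, raw):
            # Decoding deliberately skips validation, mirroring __getitem__.
            return json.loads(raw.decode())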

@petrelharp (Contributor)

> Make validation fail if metadata is not passed to add_row and the table has a schema under which {} would fail.

So, will it be possible to say "X or nothing at all" in the schema, then? Currently we have mixes of metadata and no metadata when, e.g., we add more mutations to a SLiM simulation with msprime. Adding SLiM metadata to all the msprime mutations would not be a bad thing, but we'll have to add the capability to msprime.mutate() to do that, if it would otherwise fail tree sequence validation.

@benjeffery (Member, Author)

The schema can specify, per attribute, whether that attribute is required or not. Whether attributes not in the schema are allowed is also configurable. I'll be adding some docs and examples for this, as it's a very flexible system.
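For example, a sketch of a schema that both empty metadata and a full row would satisfy (the selection_coeff field is invented for the example):

    # Every field is optional here, so {} validates, as does a row
    # with the field present; unknown attributes are rejected.
    metadata_schema = {
        "encoding": "json",
        "schema": {
            "type": "object",
            "properties": {"selection_coeff": {"type": "number"}},
            "required": [],
            "additionalProperties": False,
        },
    }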

@jeromekelleher (Member)

Hey @grahamgower, would you mind taking a look over this proposal please? It'd be good to get your thoughts (especially since you're in the trenches with this stuff on stdpopsim!)

@benjeffery (Member, Author)

@grahamgower this gist explains the situation and proposal: https://gist.github.com/benjeffery/e7c68bab9839259dbab9d888d2458fa3

@benjeffery (Member, Author)

This PR is superseded by #542 and #543

@benjeffery closed this Apr 18, 2020
