
Set node metadata schema #203

Closed
hyanwong opened this issue Mar 4, 2022 · 11 comments
Comments

@hyanwong
Member

hyanwong commented Mar 4, 2022

At the moment we don't set a metadata schema for nodes returned by tsdate: we simply dump a binary JSON string in there, e.g.

>>> dated_ts.node(node_id)
Node(id=37, flags=0, time=33273.69140479153, population=-1, individual=-1, metadata=b'{"mn": 33273.69140479153, "vr": 12870629.731027713}')

We should probably attempt to set the "mn" and "vr" keys assuming the schema is valid, and simply omit them (perhaps with a warning) if it is not. Perhaps we should also store this as {"tsdate_time":{"mn":XXX,"vr":YYY}} so that it's clear what the metadata refers to?

If no schema exists, and the node metadata is entirely empty, we can probably set the nodes table schema to tskit.MetadataSchema.permissive_json()
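As a sketch of that fallback, with the schema/metadata state modelled as plain Python values so the logic is self-contained (the real code would call tskit.MetadataSchema.permissive_json() and set it on the nodes table directly; choose_node_schema is a hypothetical helper, not part of tsdate or tskit):

```python
import json
from typing import List, Optional


def choose_node_schema(schema_json: str, metadata_rows: List[bytes]) -> Optional[str]:
    """Hypothetical helper: decide whether it is safe to switch the nodes
    table to a permissive JSON schema.  `schema_json` is the current schema
    as a JSON string ("" means no schema, i.e. raw binary metadata) and
    `metadata_rows` holds the raw metadata bytes of each node.
    Returns the new schema string, or None to leave the table untouched."""
    if schema_json == "" and all(md == b"" for md in metadata_rows):
        # No schema and no existing metadata: safe to use permissive JSON
        # (the dict below is what permissive_json() amounts to)
        return json.dumps({"codec": "json"})
    return None
```

The point of the two-part guard is that a missing schema alone isn't enough: if any node carries raw binary metadata we can't reinterpret it as JSON, so we leave it alone.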

I wonder if @benjeffery, the metadata king, has any thoughts on the best thing to do here.

@hyanwong hyanwong changed the title Set node metadata in json format Set node metadata schema Mar 4, 2022
@hyanwong
Member Author

hyanwong commented Mar 4, 2022

I just remembered tskit-dev/tsinfer#416, in which the metadata schema is set for the node table for tree sequences produced by tsinfer.

I noted there that if, for efficiency, we have a tsinfer-specific struct schema for node metadata, we should probably check for it in tsdate and amend the schema to add the necessary fields that tsdate wants to set.

One (very minor) issue with this is that tsdate does not currently require tsinfer to be installed. It seems overkill to require tsinfer just to check whether a schema matches tsinfer.node_metadata_schema, in order to replace it with a superset of that schema, but perhaps that's what we'll have to do?

(edit) - perhaps what we can do is hard-code tsinfer_node_metadata_schema as it is currently defined, and then check in the test suite that it matches tsinfer.node_metadata_schema. That way we shouldn't need users to have tsinfer installed (it would only be needed for testing)
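A sketch of that hard-coding idea. The schema body below is a placeholder, not tsinfer's real field definitions; tsinfer.node_metadata_schema is the attribute named above, and the .schema comparison assumes it is a tskit.MetadataSchema:

```python
import importlib.util

# Hypothetical hard-coded copy of tsinfer's node metadata schema.  The real
# field definitions are elided here; only the codec is assumed.
TSINFER_NODE_METADATA_SCHEMA = {
    "codec": "struct",
    "type": "object",
    "properties": {},  # placeholder, not tsinfer's actual fields
}


def check_matches_tsinfer():
    """Return True/False when tsinfer is installed, None when it is not.
    The test suite (where tsinfer *is* installed) would assert this is True,
    so the hard-coded copy can never silently drift out of date."""
    if importlib.util.find_spec("tsinfer") is None:
        return None  # tsinfer not installed: runtime code skips the check
    import tsinfer

    return TSINFER_NODE_METADATA_SCHEMA == tsinfer.node_metadata_schema.schema
```

This keeps tsinfer out of tsdate's install requirements while still failing CI if the two schemas diverge.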

@jeromekelleher
Member

Just to note here that we're supposed to be working with things that aren't tsinfer too, so we need to be very careful about assumptions we make about metadata.

@benjeffery
Member

Thanks for flagging this up @hyanwong - can I clarify that the aim here is to add tsdate-specific attributes to the possibly-preexisting metadata? If so, this would seem to be a great use case to consider as we add the higher-level metadata API.
As I won't get to that till this round of C work in tskit is done, maybe we should write some strawman metadata-updating code here as a first stab at what the metadata API will eventually do, then replace it when that is ready.
Basically something like "If there is an updateable schema, either struct or JSON, then add in the new properties; if not, assign the minimal schema for the tsdate properties".

@hyanwong
Member Author

hyanwong commented Mar 7, 2022

@benjeffery: can I clarify that the aim here is to add tsdate specific attributes to the possibly-preexisting metadata?

Yes, exactly.

@jeromekelleher: Just to note here that we're supposed to be working with things that arent' tsinfer too, so we need to be very careful about assumptions we make about metadata.

Yes, absolutely. I have been thinking about this, and I think it's all compatible. The idea would be that we ask forgiveness rather than permission: if it is a struct type, we somehow extend the schema (see below), if it is json type we attempt to add the extra fields. If we fail in either, we issue a warning but go ahead with the dating anyway (since the metadata additions are an optional nicety).
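For the JSON case, the forgiveness-not-permission step might look something like the sketch below. The helper name try_extend_json_schema and the key-clash check are illustrative, not tsdate's actual code; it operates on plain schema dicts:

```python
import copy
import warnings
from typing import Optional


def try_extend_json_schema(schema: dict, tsdate_props: dict) -> Optional[dict]:
    """Hypothetical sketch of the 'ask forgiveness' step for a JSON-codec
    schema: return a new schema with the tsdate properties merged in, or
    None (with a warning) if the schema can't accommodate them.  Dating
    proceeds either way; the metadata is an optional nicety."""
    if schema.get("codec") != "json":
        warnings.warn("Node metadata schema is not JSON; tsdate times not stored")
        return None
    new_schema = copy.deepcopy(schema)  # never mutate the caller's schema
    props = new_schema.setdefault("properties", {})
    if any(key in props for key in tsdate_props):
        warnings.warn("Existing keys clash with tsdate metadata; not stored")
        return None
    props.update(tsdate_props)
    return new_schema
```

A None return is the signal to warn and carry on with dating, matching the behaviour described above.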

@benjeffery: Basically something like "If there is an updateable schema, either struct or json then add in the new properties, if not assign the minimal schema for the tsdate properties".

For a struct type, I presume we could either (a) check if it matches against an existing type, and if so, create a derived struct type with extra fields, dump the existing data and save it back as the new type, or (b) take the existing schema and simply add some new fields onto the end. I don't know if (b) is a bit hairy, though?

@benjeffery
Member

I don't know if (b) is a bit hairy, though?

The order of fields in struct is determined by the schema, either alphabetically (by default) or specified by the index key. So if there is a performance reason it is possible just to append bytes without it being hairy.
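The ordering rule described here can be sketched in a few lines. This is a plain-Python illustration of the rule (by the "index" key where given, otherwise alphabetical), not tskit's own implementation, and it simplifies the mixed case where only some fields carry an index:

```python
def struct_field_order(schema: dict) -> list:
    """Illustrative sketch: the byte order of struct-codec fields is set by
    each property's "index" key when present, falling back to alphabetical
    order by property name (fields without an index are treated as index 0
    here, a simplification of tskit's actual rule)."""
    props = schema.get("properties", {})
    return sorted(props, key=lambda name: (props[name].get("index", 0), name))
```

Under this rule, giving newly appended fields a high "index" value keeps their bytes at the end of the encoded struct, which is what makes appending bytes to existing rows possible without re-encoding.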

@hyanwong
Member Author

hyanwong commented Mar 7, 2022

The order of fields in struct is determined by the schema, either alphabetically (by default) or specified by the index key. So if there is a performance reason it is possible just to append bytes without it being hairy.

Thanks @benjeffery. I guess it might be safer to save the metadata out into Python dicts, then put them back in again. Something like this (which I presume will work for both JSON and struct metadata):

import copy

if "tsdate" not in tables.nodes.metadata_schema.schema.get("properties", {}):
    # add the required tsdate properties to the node metadata schema
    # (deepcopy, not .copy(): a shallow copy would share, and hence mutate,
    # the nested "properties" dict of the original schema)
    tsdate_schema = copy.deepcopy(tables.nodes.metadata_schema.schema)
    tsdate_schema.setdefault("properties", {})["tsdate"] = {
        "type": "object",
        "default": {"mean": float("NaN"), "variance": float("NaN")},
        "properties": {
            "mean": {
                "description":
                    "The mean time of this node, calculated from the tsdate posterior "
                    "probabilities. This may not be the same as the node time, as it is "
                    "not constrained by parent-child order.",
                "type": "number",
                "binaryFormat": "d",
            },
            "variance": {
                "description":
                    "The variance in times of this node, calculated from the tsdate "
                    "posterior probabilities",
                "type": "number",
                "binaryFormat": "d",
            },
        },
    }

    meta = [r.metadata for r in tables.nodes]

    # Clear all metadata before changing the schema - is there a function to do this?
    tables.nodes.packset_metadata([b''] * tables.nodes.num_rows)
    tables.nodes.metadata_schema = tskit.MetadataSchema(tsdate_schema)
    # Put the node metadata back
    for i, (r, md) in enumerate(zip(tables.nodes, meta)):
        tables.nodes[i] = r.replace(metadata=md)

@hyanwong
Member Author

hyanwong commented Mar 7, 2022

NB: if we are going with the idea of using struct metadata, as in tskit-dev/tsinfer#416 (comment), then we can do as above and add the tsdate metadata in a property called tsdate without taking up space on each node, which is nice and clean, I think. We can also fully spell out "mean" and "variance".

@benjeffery
Member

Code above looks like the slow-but-correct way to do this, and as we will be replacing this by the eventual high-level API I think that is fine.

@hyanwong
Member Author

Just revisiting this before re-dating the unified genealogy. I think that if the metadata schema on nodes is JSON or struct, we should use the code above to add means and variances. If it is binary, and if there is any metadata content on nodes, we shouldn't stomp on the metadata at all: we can simply warn the user that metadata on means and variances isn't being stored. If it is binary but there is no metadata, we can store struct data, I guess?

For the (re-dated) unified genealogy, I think the only node metadata is from tsdate, so we can remove all the existing node metadata before re-dating, and therefore default to struct.

@hyanwong
Member Author

Just a quick comment re #303 - if we have struct data and we are adding fields, we probably need to specify default values for each new property, otherwise we won't be able to write non-null data back into the new metadata column.

This is done in #303, which also takes the decision to overwrite any existing data in the mean and variance (mn, vr) fields.

@hyanwong
Member Author

hyanwong commented Jun 5, 2024

Set to JSON everywhere, where possible. Closing

@hyanwong hyanwong closed this as completed Jun 5, 2024