Add tsinfer specific schema for node metadata. #416

jeromekelleher · 2020-12-10T16:58:58Z

We can add a Struct schema to the NodeTable which will let us efficiently store information about the inference process. Currently, this should contain "ancestor_data_id" (default = -1) and "sample_data_id" (default=-1). I made a start on doing this, but it turned out to be quite messy, mainly due to the lack of default value handling in tskit metadata.

Once these two issues are resolved we can have another look at this:

tskit-dev/tskit#1073
tskit-dev/tskit#1084

It's not clear whether the struct schema or JSON would end up being most efficient here, but we should start with struct anyway just to see how it works out.

This is purely an interface for tsinfer to encode information about the nodes - tsinfer users will not have the option of setting or updating the schema.

Follow on from #392.

hyanwong · 2022-03-04T19:23:36Z

I'm revisiting this as (a) those two tskit issues have now been fixed and (b) I'm thinking about how tsdate adds metadata to nodes (specifically, the mean estimated time for a node and the variance associated with that time). At the moment this is simply dumped into the node metadata, which probably overwrites any existing values.

The need for tsdate to add to the struct makes me think that perhaps the permissive JSON schema would be best for node metadata in tsinfer?

Alternatively, since tsdate is primarily going to be used by tsinfer, we could have some logic in tsdate that checks whether the schema matches tsinfer.node_schema and, if so, swaps it for a schema which also includes space for the mean time and variance values. This would be more fiddly, but much more efficient in terms of space, I think.
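The check-and-swap idea above could look roughly like the following. This is only a sketch on plain dictionaries: `TSINFER_NODE_SCHEMA` stands in for whatever tsinfer would expose as `tsinfer.node_schema`, and the `mean_time` and `variance` field names, their `-1.0` sentinel defaults, and the function name are all hypothetical.

```python
import copy

# Hypothetical stand-in for the schema tsinfer would publish.
TSINFER_NODE_SCHEMA = {
    "codec": "struct",
    "type": ["object", "null"],
    "properties": {
        "ancestor_data_id": {"type": "integer", "binaryFormat": "i", "default": -1},
        "sample_data_id": {"type": "integer", "binaryFormat": "i", "default": -1},
    },
    "additionalProperties": False,
}

# Hypothetical extra fields tsdate would want (names and defaults are assumptions).
TSDATE_EXTRA_PROPERTIES = {
    "mean_time": {"type": "number", "binaryFormat": "d", "default": -1.0},
    "variance": {"type": "number", "binaryFormat": "d", "default": -1.0},
}


def extended_schema_for_tsdate(current_schema):
    """Return a widened copy of the schema if it is tsinfer's, else None."""
    if current_schema != TSINFER_NODE_SCHEMA:
        return None  # Unrecognised schema: the caller decides what to do.
    widened = copy.deepcopy(TSINFER_NODE_SCHEMA)
    widened["properties"].update(copy.deepcopy(TSDATE_EXTRA_PROPERTIES))
    return widened


widened = extended_schema_for_tsdate(copy.deepcopy(TSINFER_NODE_SCHEMA))
print(sorted(widened["properties"]))
```

Because the struct codec stores fixed binary rows, any rows already written under the old schema would have to be re-encoded under the widened one; the sketch only covers the schema comparison itself.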

hyanwong · 2022-03-04T19:40:06Z

P.S. Just to note that sample data files don't contain metadata that can be set on nodes (only on individuals), so tsinfer is free to use whatever schema it wants for node metadata, without stomping on any user data.

hyanwong · 2022-03-04T22:18:56Z

I think the schema should look like this:

metadata.MetadataSchema(
    {
        "codec": "struct",
        "type": ["object", "null"],
        "properties": {
            "ancestor_data_id": {
                "description": "",
                "type": "integer",
                "binaryFormat": "i",
                "default": -1,
            },
            "sample_data_id": {
                "description": "Date of sample collection in ISO format",
                "type": "integer",
                "binaryFormat": "i",
                "default": -1,
            },
        },
        "required": ["ancestor_data_id", "sample_data_id"],
        "additionalProperties": False,
    }
)

jeromekelleher · 2022-03-07T11:04:24Z

I'm hesitant to use a struct schema to be honest, wouldn't JSON be a lot simpler?

Not sure why we'd set defaults as well as marking them as required?

Looks like your descriptions are off?

hyanwong · 2022-03-07T11:07:44Z

> I'm hesitant to use a struct schema to be honest, wouldn't JSON be a lot simpler?

You said in the initial text of this issue "we should start with struct anyway just to see how it works out". Note that for big inferences, it might be quite tedious storing the JSON for each node (27 million ancestors in the unified genealogy, right?)

> Not sure why we'd set defaults as well as marking them as required?

I thought they had to be required for a struct, but would take the default if not given? I probably misunderstood the meaning of the two or the interaction between them though.

> Looks like your descriptions are off?

Oops, yes. Thanks for catching this.

jeromekelleher · 2022-03-07T11:16:55Z

Yes, that's true. 27M is quite a few, so the difference would be a few hundred megabytes. All right, I think that's a good justification for going with struct, let's try it out.
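A rough back-of-envelope comparison of the two encodings, using only the Python standard library rather than tskit itself. The example ID value is made up, and the packing here (two little-endian 32-bit integers, compact JSON with the full key names) is an assumption about what each codec would store per node, so treat the result as an order-of-magnitude estimate only.

```python
import json
import struct

NUM_NODES = 27_000_000  # Node count quoted in the thread.

# Illustrative record; the ancestor ID value is an arbitrary example.
record = {"ancestor_data_id": 12_345_678, "sample_data_id": -1}

# Struct-style encoding: two 32-bit signed integers, no padding.
struct_bytes = len(
    struct.pack("<ii", record["ancestor_data_id"], record["sample_data_id"])
)

# JSON encoding: compact separators, key names repeated in every row.
json_bytes = len(json.dumps(record, separators=(",", ":")).encode("utf-8"))

# Estimated saving across all nodes, in megabytes (uncompressed).
saving_mb = (json_bytes - struct_bytes) * NUM_NODES / 1e6
print(struct_bytes, json_bytes, round(saving_mb))
```

The exact gap depends on the key names, the magnitude of the IDs, and whether the file is compressed afterwards.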

hyanwong · 2022-03-07T11:19:53Z

Given that this is not a metadata field controlled by the user, I agree that we should try struct first. tskit-dev/tskit#2129 might be a useful thing, which could be used if the user really wanted to change the type to JSON for later modification.

hyanwong · 2022-03-07T17:12:12Z

If we are using a schema, then we can also save this stuff under a "tsinfer" key, without taking up lots of space, which I think is a lot cleaner. So I propose using a schema like this:

schema = {
    "codec": "struct",
    "type": ["object", "null"],
    "properties": {
        "tsinfer": {
            "description": "Information about node identity from the tsinfer inference process",
            "type": "object",
            "default": {"ancestor_data_id": -1, "sample_data_id": -1},
            "properties": {
                "ancestor_data_id": {
                    "description":
                        "The corresponding ancestor ID "
                        "in the ancestors file created by the inference process, "
                        "or -1 if not applicable",
                    "type": "number",
                    "binaryFormat": "i",
                    "default": -1,
                },
                "sample_data_id": {
                    "description": 
                        "The corresponding sample ID "
                        "in the sample data file used for inference, "
                        "or -1 if not applicable",
                    "type": "number",
                    "binaryFormat": "i",
                    "default": -1,
                },
            },
        },
    },
    "additionalProperties": False,
}
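To illustrate why nesting under a "tsinfer" key costs nothing with a fixed binary layout, here is a small sketch that packs a nested row using only the standard `struct` module. It is not tskit's struct codec: the helper names are hypothetical, the schema is a trimmed copy of the one above, and ordering leaf fields by sorted key name is an assumption of this sketch.

```python
import json
import struct

# Trimmed copy of the proposed schema (descriptions omitted).
schema = {
    "codec": "struct",
    "type": ["object", "null"],
    "properties": {
        "tsinfer": {
            "type": "object",
            "properties": {
                "ancestor_data_id": {"type": "number", "binaryFormat": "i", "default": -1},
                "sample_data_id": {"type": "number", "binaryFormat": "i", "default": -1},
            },
        },
    },
}

_MISSING = object()


def leaves(properties, path=()):
    """Yield (path, format code, default) for each leaf, in sorted key order."""
    for name in sorted(properties):
        spec = properties[name]
        if spec.get("type") == "object":
            yield from leaves(spec["properties"], path + (name,))
        else:
            yield path + (name,), spec["binaryFormat"], spec.get("default")


def lookup(row, path):
    """Follow a key path into a nested dict, returning _MISSING if absent."""
    node = row
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def pack_row(schema, row):
    """Pack leaf values into fixed-width bytes, filling in schema defaults."""
    fmt, values = "<", []
    for path, code, default in leaves(schema["properties"]):
        value = lookup(row, path)
        values.append(default if value is _MISSING else value)
        fmt += code
    return struct.pack(fmt, *values)


row = {"tsinfer": {"ancestor_data_id": 7, "sample_data_id": -1}}
packed = pack_row(schema, row)
as_json = json.dumps(row, separators=(",", ":")).encode("utf-8")
print(len(packed), len(as_json))
```

The packed row holds only the two integers, so the outer key name never reaches the binary data, whereas the JSON form repeats it in every row.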

jeromekelleher added this to the Release 0.2.1 milestone Dec 10, 2020

hyanwong mentioned this issue Mar 4, 2022

Set node metadata schema tskit-dev/tsdate#203

Closed

hyanwong linked a pull request Mar 5, 2022 that will close this issue

Use struct for node metadata #642

Open

jeromekelleher modified the milestones: Release 0.4.0, Release 0.5.0 May 15, 2023

hyanwong mentioned this issue Jul 26, 2023

Set metadata schemas tskit-dev/tsdate#303

Closed

hyanwong linked a pull request Jun 7, 2024 that will close this issue

Set node metadata schema #931

Open
