Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tsinfer specific schema for node metadata. #416

Open
jeromekelleher opened this issue Dec 10, 2020 · 8 comments · May be fixed by #642 or #931
Open

Add tsinfer specific schema for node metadata. #416

jeromekelleher opened this issue Dec 10, 2020 · 8 comments · May be fixed by #642 or #931
Milestone

Comments

@jeromekelleher
Copy link
Member

We can add a Struct schema to the NodeTable which will let us efficiently store information about the inference process. Currently, this should contain "ancestor_data_id" (default = -1) and "sample_data_id" (default=-1). I made a start on doing this, but it turned out to be quite messy, mainly due to the lack of default value handling in tskit metadata.

Once these two issues are resolved we have another look at this:

tskit-dev/tskit#1073
tskit-dev/tskit#1084

It's not clear whether the struct schema of JSON would end up being most efficient here, but we should start with struct anyway just to see how it works out.

This is purely an interface for tsinfer to encode information about the nodes - tsinfer users will not have the option of setting or updating the schema.

Follow on from #392.

@jeromekelleher jeromekelleher added this to the Release 0.2.1 milestone Dec 10, 2020
@hyanwong
Copy link
Member

hyanwong commented Mar 4, 2022

I'm revisiting this as (a) those two tskit issues have now been fixed and (b) I'm thinking about how tsdate adds metadata to nodes (specifically, the mean estimated time for a node and the variance associated with that time). At the moment this is simply dumped into the node metadata, which probably overwrites any existing values.

The need to add to the struct by tsdate makes me thing that perhaps the permissive JSON schema would be best for node metadata in tsinfer?

Alternatively, since tsdate is primarily going to be used by tsinfer, we could have some logic in tsdate that checks whether the schema matches tsinfer.node_schema and if so, swapps it for a schema which also includes space for the mean time and variance values. This would be more fiddly, but much more efficient in terms of space, I think.

@hyanwong
Copy link
Member

hyanwong commented Mar 4, 2022

P.s. just to note that sample data files don't contain metadata that can be set on nodes (only on individuals), so tsinfer is free to use whatever schema it wants for node metadata, without stomping on any user data.

@hyanwong
Copy link
Member

hyanwong commented Mar 4, 2022

I think the schema should look like this:

metadata.MetadataSchema(
    {
        "codec": "struct",
        "type": ["object", "null"],
        "properties": {
            "ancestor_data_id": {
                "description": "",
                "type": "integer",
                "binaryFormat": "i",
                "default": -1,
            },
            "sample_data_id": {
                "description": "Date of sample collection in ISO format",
                "type": "integer",
                "binaryFormat": "i",
                "default": -1,
            },
        },
        "required": ["ancestor_data_id", "sample_data_id"],
        "additionalProperties": False,
    }
)

@hyanwong hyanwong linked a pull request Mar 5, 2022 that will close this issue
@jeromekelleher
Copy link
Member Author

I'm hesitant to use a struct schema to be honest, wouldn't JSON be a a lot simpler?

Not sure why we'd set defaults as well as marking them as required?

Looks like your descriptions are off?

@hyanwong
Copy link
Member

hyanwong commented Mar 7, 2022

I'm hesitant to use a struct schema to be honest, wouldn't JSON be a a lot simpler?

You said in the initial text of this issue "we should start with struct anyway just to see how it works out". Note that for big inferences, it might be quite tedious storing the JSON for each node (27 million ancestors in the unified genealogy, right?)

Not sure why we'd set defaults as well as marking them as required?

I thought they had to be required for a struct, but would take the default if not given? I probably misunderstood the meaning of the two or the interaction between them though.

Looks like your descriptions are off?

Oops, yes. Thanks for catching this.

@jeromekelleher
Copy link
Member Author

Yes, that's true. 27M is quite a few, so the difference would be a few hundred megabytes. All right, I think that's a good justification for going with struct, let's try it out.

@hyanwong
Copy link
Member

hyanwong commented Mar 7, 2022

Given that this is not a metadata field controlled by the user, I agree that we should try struct first. tskit-dev/tskit#2129 might be a useful thing, which could be used if the user really wanted to change the type to JSON for later modification.

@hyanwong
Copy link
Member

hyanwong commented Mar 7, 2022

If we are using a schema, then we can also save this stuff under a "tsinfer" key, without taking up lots of space, which I think is a lot cleaner. So I propose using a schema like this:

schema = {
    "codec": "struct",
    "type": ["object", "null"],
    "properties": {
        "tsinfer": {
            "description": "Information about node identity from the tsinfer inference process",
            "type": "object",
            "default": {"ancestor_data_id": -1, "sample_data_id": -1},
            "properties": {
                "ancestor_data_id": {
                    "description":
                        "The corresponding ancestor ID "
                        "in the ancestors file created by the inference process, "
                        "or -1 if not applicable",
                    "type": "number",
                    "binaryFormat": "i",
                    "default": -1,
                },
                "sample_data_id": {
                    "description": 
                        "The corresponding sample ID "
                        "in the sample data file used for inference, "
                        "or -1 if not applicable",
                    "type": "number",
                    "binaryFormat": "i",
                    "default": -1,
                },
            },
        },
    },
    "additionalProperties": False,
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants