Add tsinfer specific schema for node metadata. #416
Comments
I'm revisiting this as (a) those two tskit issues have now been fixed and (b) I'm thinking about how tsdate adds metadata to nodes (specifically, the mean estimated time for a node and the variance associated with that time). At the moment this is simply dumped into the node metadata, which probably overwrites any existing values. The need for tsdate to add to the struct makes me think that perhaps the permissive JSON schema would be best for node metadata in tsinfer? Alternatively, since tsdate is primarily going to be used by tsinfer, we could have some logic in |
P.s. just to note that sample data files don't contain metadata that can be set on nodes (only on individuals), so |
I think the schema should look like this:
|
I'm hesitant to use a struct schema to be honest, wouldn't JSON be a lot simpler? Not sure why we'd set defaults as well as marking them as required? Looks like your descriptions are off?
You said in the initial text of this issue "we should start with struct anyway just to see how it works out". Note that for big inferences, it might be quite tedious storing the JSON for each node (27 million ancestors in the unified genealogy, right?)
I thought they had to be required for a struct, but would take the default if not given? I probably misunderstood the meaning of the two or the interaction between them though.
Oops, yes. Thanks for catching this.
Yes, that's true. 27M is quite a few, so the difference would be a few hundred megabytes. All right, I think that's a good justification for going with struct, let's try it out.
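To show how such a size estimate is formed, here is a rough sketch using only the Python standard library. The key names are the two IDs named in this issue; the sample values and the packed layout of two 32-bit integers are assumptions, and this is not tskit's own codec. The total depends heavily on the key names, the values stored, and how many nodes carry metadata, so treat the result as an order-of-magnitude guide only.

```python
import json
import struct

NUM_NODES = 27_000_000  # node count quoted in this thread

# Hypothetical per-node record; the values are made up for illustration.
record = {"ancestor_data_id": 123456, "sample_data_id": -1}

# JSON repeats the key names and stores the numbers as text in every row.
json_bytes = len(json.dumps(record, separators=(",", ":")).encode("utf-8"))

# Assumed packed layout: two little-endian 32-bit signed integers.
struct_bytes = struct.calcsize("<ii")

# Per-node difference scaled up to the whole node table, in megabytes.
saving_mb = (json_bytes - struct_bytes) * NUM_NODES / 1e6
print(f"JSON: {json_bytes} B/node, struct: {struct_bytes} B/node, "
      f"difference: {saving_mb:.0f} MB over {NUM_NODES:,} nodes")
```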
Given that this is not a metadata field controlled by the user, I agree that we should try struct first. tskit-dev/tskit#2129 might be a useful thing, which could be used if the user really wanted to change the type to JSON for later modification.
If we are using a schema, then we can also save this stuff under a "tsinfer" key, without taking up lots of space, which I think is a lot cleaner. So I propose using a schema like this:
|
We can add a Struct schema to the NodeTable which will let us efficiently store information about the inference process. Currently, this should contain "ancestor_data_id" (default = -1) and "sample_data_id" (default=-1). I made a start on doing this, but it turned out to be quite messy, mainly due to the lack of default value handling in tskit metadata.
Once these two issues are resolved we can have another look at this:
tskit-dev/tskit#1073
tskit-dev/tskit#1084
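To make the default handling described above concrete, here is a minimal sketch of the behaviour we want: a missing field falls back to -1 when a node's metadata is packed. It uses the standard library `struct` module as a stand-in rather than tskit's metadata codec. The field names and defaults come from this issue; the `<ii` layout and the helper names `encode_node_metadata` and `decode_node_metadata` are hypothetical.

```python
import struct

# Field names and defaults are from the issue text; the layout is assumed.
FIELDS = (("ancestor_data_id", -1), ("sample_data_id", -1))
LAYOUT = struct.Struct("<ii")  # two little-endian 32-bit signed integers


def encode_node_metadata(metadata):
    """Pack a metadata dict, substituting the default for any missing field."""
    unknown = set(metadata) - {name for name, _ in FIELDS}
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    values = (metadata.get(name, default) for name, default in FIELDS)
    return LAYOUT.pack(*values)


def decode_node_metadata(raw):
    """Unpack bytes back into a dict keyed by field name."""
    names = (name for name, _ in FIELDS)
    return dict(zip(names, LAYOUT.unpack(raw)))


# A sample node knows its sample_data_id but has no ancestor_data_id.
encoded = encode_node_metadata({"sample_data_id": 7})
decoded = decode_node_metadata(encoded)
```

Every row has the same fixed width whether or not a field was supplied, which is what makes the defaults necessary in the first place.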
It's not clear whether the struct schema or JSON would end up being most efficient here, but we should start with struct anyway just to see how it works out.
This is purely an interface for tsinfer to encode information about the nodes - tsinfer users will not have the option of setting or updating the schema.
Follow on from #392.