Standardised metadata keys #495
Comments
We currently have |
I think it's reasonable to have |
One consideration is that metadata is going to be harder to access from the C API. |
The units of sequence length could also be nice (although I hope everyone just uses bp). |
What do you think about keeping this sort of thing as a tree sequence attribute vs pre-defined standard metadata @petrelharp ? |
Feels like metadata to me, I think. |
Agree that this is metadata. We need a sensible term for the default length of 1 in msprime. |
OK, sounds like agreement then. I guess we can stick with the operational definition of "if tskit uses it, it's data; otherwise metadata". Things can be promoted later if we do end up using them. |
Another "standard metadata" field we should include is whether the sequence comes from a circular chromosome or a linear one. In fact, it might be useful to know if it is meant to represent the whole chromosome or merely a portion of a larger one (I guess this is vital for circular chromosomes, where the distinction between a part of a circle and the whole circle is rather important). If we want to include mtDNA in stored human tree sequences, I guess this is not just an issue relevant to bacteria/viruses. |
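To make the circular/linear suggestion concrete, here is a rough sketch of how the two facts (topology, and whole chromosome vs. portion) might be recorded as plain metadata. The key names `circular` and `whole_chromosome` and the helper class are hypothetical, made up for illustration; nothing here is an existing tskit field.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class SequenceTopology:
    """Hypothetical record of the two facts discussed above."""

    circular: bool          # True for e.g. mtDNA or a bacterial chromosome
    whole_chromosome: bool  # False if only a portion of a larger sequence is stored

    def wraps_around(self) -> bool:
        # Coordinates only wrap from the end back to the start when the
        # stored sequence is the complete circle, not a segment of one.
        return self.circular and self.whole_chromosome


# A complete mitochondrial genome versus a fragment of one.
full_mtdna = SequenceTopology(circular=True, whole_chromosome=True)
mtdna_fragment = SequenceTopology(circular=True, whole_chromosome=False)

# The record serialises to an ordinary JSON object.
print(json.dumps(asdict(full_mtdna)))
print(full_mtdna.wraps_around(), mtdna_fragment.wraps_around())
```

Keeping the two flags separate reflects the point that a part of a circle behaves like a linear segment for coordinate purposes.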
It sounds like we might want 2 metadata schemas at the ts level: one for standardised things like |
Going through old issues. I see that there is also a proposal to add the reference sequence to a ts, as a specific new area in the file store. I'm wondering if the reference sequence is more like top-level metadata? I think probably not, as the suggestion is that there would be some tskit utilities that would use it (and we decided that metadata is only stuff that is unused by tskit), but it does have a metadata-ish feel to me, which is why I mention it. |
I think it's too big for metadata @hyanwong, and it would need to be visible to the C library if we're going to do anything useful with it, so it can't be metadata. |
Note that having a |
Oh, interesting! |
If the |
Now that |
In #375 we discussed the idea of having `time_units` stored somewhere in the tree sequence. This is useful info that users should have access to, while still being something that tskit does not need to use. One idea is that we should document that this is a standard thing to expect in a tree sequence's metadata, and that it has a number of accepted values ("generations", "years", "megayears"). Since tskit doesn't directly use this information, I guess it's correct that it's regarded as metadata.
Since this is a standardised thing that we expect for every tree sequence, however, I guess we could also argue that this should be a top-level tree sequence attribute, rather than optional metadata.
Are there other things like this that we should have standardised keys for? Or, if they are standardised, should we add them as tree sequence attributes?
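As a rough sketch of what a documented, standardised `time_units` key could look like in practice, the snippet below describes it with a JSON-Schema-style dictionary and checks a metadata dictionary against the accepted values listed above. The schema layout and the `check_time_units` helper are assumptions for illustration only, not part of tskit.

```python
import json

# Accepted values taken from the suggestion above.
ACCEPTED_TIME_UNITS = ("generations", "years", "megayears")

# Illustrative JSON-Schema-style description of the standardised key.
TOP_LEVEL_SCHEMA = {
    "type": "object",
    "properties": {
        "time_units": {"type": "string", "enum": list(ACCEPTED_TIME_UNITS)},
    },
    "required": ["time_units"],
}


def check_time_units(metadata: dict) -> str:
    """Return the time_units value, raising ValueError if missing or unknown."""
    if "time_units" not in metadata:
        raise ValueError("metadata has no 'time_units' key")
    value = metadata["time_units"]
    if value not in ACCEPTED_TIME_UNITS:
        raise ValueError(f"unrecognised time_units: {value!r}")
    return value


# Round-trip through JSON, as a JSON-encoded metadata field would be.
encoded = json.dumps({"time_units": "generations"})
decoded = json.loads(encoded)
print(check_time_units(decoded))
```

If the key were instead promoted to a top-level attribute, the same set of accepted values could be enforced where the attribute is set rather than in a metadata check.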