Standardised metadata keys #495

Closed
jeromekelleher opened this issue Mar 24, 2020 · 17 comments
Labels
enhancement New feature or request Python API Issue is about the Python API
Comments

@jeromekelleher
Member

In #375 we discussed the idea of having time_units stored somewhere in the tree sequence. This is useful info that users should have access to, while still being something that tskit does not need to use. One idea is that we should document that this is a standard thing to expect in a tree sequence's metadata, and that it has a number of accepted values ("generations", "years", "megayears").

Since tskit doesn't directly use this information, I guess it's correct that it's regarded as metadata.

Since this is a standardised thing that we expect for every tree sequence, however, I guess we could also argue that this should be a top-level tree sequence attribute, rather than optional metadata.

Are there other things like this that we should have standardised keys for? Or, if they are standardised, should we add them as tree sequence attributes?
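To make the standardised-key idea above concrete, here is a minimal sketch, assuming the top-level metadata is a JSON-like dict. The schema layout and the helper `check_time_units` are hypothetical illustrations and not part of tskit; only the three accepted values come from the proposal above.

```python
# Hypothetical sketch, not tskit API. Only the accepted values come from the thread.
ACCEPTED_TIME_UNITS = ("generations", "years", "megayears")

# JSON-Schema-style description of the standard key (the layout is an assumption).
STANDARD_SCHEMA = {
    "type": "object",
    "properties": {
        "time_units": {"type": "string", "enum": list(ACCEPTED_TIME_UNITS)},
    },
}


def check_time_units(metadata):
    """Return the time_units value, or None if the key is absent.

    Raises ValueError if the key is present but is not an accepted value.
    """
    value = metadata.get("time_units")
    if value is None:
        return None
    if value not in ACCEPTED_TIME_UNITS:
        raise ValueError(f"unrecognised time_units: {value!r}")
    return value
```

Treating an absent key as `None` rather than an error reflects the "optional metadata" reading; a top-level attribute would instead always have some value.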

@benjeffery
Member

We currently have sequence_length as a top-level attribute on tsk_table_collection_t which is similar for the other dimension. Does it make sense to have one as units and the other as total length?

@jeromekelleher
Member Author

I think it's reasonable to have time_units as an attribute. I'm imagining that we could have a few others later on, like assembly to describe the sequence coordinate space. The question is really whether this is tree sequence data or metadata...

@benjeffery
Member

benjeffery commented Mar 24, 2020

> The question is really whether this is tree sequence data or metadata...

One consideration is that metadata is going to be harder to access from the C API.

@petrelharp
Contributor

The units of sequence length could also be nice (although I hope everyone just uses bp).

@jeromekelleher
Member Author

What do you think about keeping this sort of thing as a tree sequence attribute vs pre-defined standard metadata @petrelharp ?

@petrelharp
Contributor

Feels like metadata to me, I think.

@molpopgen
Member

Agree that this is metadata. We need a sensible term for the default length of 1 in msprime.

@jeromekelleher
Member Author

> Agree that this is metadata. We need a sensible term for the default length of 1 in msprime.

OK, sounds like agreement then. I guess we can stick with the operational definition of "if tskit uses it, it's data; otherwise metadata". Things can be promoted later if we do end up using them.

@hyanwong
Member

Another "standard metadata" field we should include is whether the sequence comes from a circular chromosome or a linear one. In fact, it might be useful to know if it is meant to represent the whole chromosome or merely a portion of a larger one (I guess this is vital for circular chromosomes, where the distinction between a part of a circle and the whole circle is rather important). If we want to include mtDNA in stored human tree sequences, I guess this is not just an issue relevant to bacteria/viruses.

@hyanwong
Member

It sounds like we might want 2 metadata schemas at the ts level: one for standardised things like time_units and one for user-content. Alternatively we could require users to derive any new schema from a standard base, but this seems a little more involved for the user?
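The "derive from a standard base" alternative could look roughly like the sketch below. This assumes schemas are plain JSON-Schema-style dicts; `BASE_SCHEMA`, `derive_schema` and the `sampling_site` key are all hypothetical names used for illustration, not existing tskit API.

```python
import copy

# Hypothetical standard base schema holding the standardised keys from this thread.
BASE_SCHEMA = {
    "codec": "json",
    "type": "object",
    "properties": {
        "time_units": {"type": "string"},
    },
}


def derive_schema(user_properties):
    """Return a copy of BASE_SCHEMA extended with user-defined properties.

    User keys are not allowed to shadow the standardised ones.
    """
    clash = set(user_properties) & set(BASE_SCHEMA["properties"])
    if clash:
        raise ValueError(f"keys reserved by the base schema: {sorted(clash)}")
    schema = copy.deepcopy(BASE_SCHEMA)  # leave the shared base untouched
    schema["properties"].update(user_properties)
    return schema


# Example: a user adds one key of their own ("sampling_site" is made up).
user_schema = derive_schema({"sampling_site": {"type": "string"}})
```

This keeps a single schema per tree sequence, at the cost of every user having to go through a helper like this one; the two-schema option avoids that step but means two places to look for metadata.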

@hyanwong
Member

Going through old issues. I see that there is also a proposal to add the reference sequence to a ts, as a specific new area in the file store. I'm wondering if the reference sequence is more like top-level metadata? I think probably not, as the suggestion is that there would be some tskit utilities that would use it (and we decided that metadata is only stuff that is unused by tskit), but it does have a metadata-ish feel to me, which is why I mention it.

@jeromekelleher
Member Author

I think it's too big for metadata @hyanwong, and it would need to be visible to the C library if we're going to do anything useful with it, so it can't be metadata.

@hyanwong
Member

Note that having a time_units value would enable useful warnings or errors to be emitted when people wrongly try to run tskit branch length stats on plain tsinfer-ed tree sequences (that have frequency-based values for node times). I think we should therefore give this issue a little more of a priority boost, if possible.
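A rough sketch of the kind of check this would enable, assuming the units are available as a plain string. The function `check_branch_stat_allowed`, its `strict` flag and the `"frequency"` label in the usage note are hypothetical; this is not how tskit implements anything, just an illustration of the warning-or-error idea.

```python
import warnings

# Units under which branch lengths are assumed to be meaningful; these are
# the accepted values proposed at the top of this thread.
CALIBRATED_UNITS = {"generations", "years", "megayears"}


def check_branch_stat_allowed(time_units, strict=False):
    """Warn (or raise, if strict) when branch-length statistics are requested
    on node times that are not in calibrated units.

    Returns True if the units are calibrated, False otherwise.
    """
    if time_units in CALIBRATED_UNITS:
        return True
    message = (
        f"time_units={time_units!r}: node times may not be calibrated, "
        "so branch-length statistics may be meaningless"
    )
    if strict:
        raise ValueError(message)
    warnings.warn(message, UserWarning)
    return False
```

With something like this, a stats routine could call `check_branch_stat_allowed("frequency")` before computing a branch-mode statistic and either warn or refuse to continue.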

@petrelharp
Contributor

Oh, interesting!

@hyanwong
Member

If the tskit stats framework uses time_units (even simply for validation) then perhaps that would no longer make this attribute officially "metadata"?!

@jeromekelleher
Member Author

Excellent points @hyanwong - opened #1644 to discuss.

@hyanwong
Member

Now that time_units is implemented, I think this can probably be closed in favour of #2649, which picks up the other attributes/metadata that have been discussed here.

Metadata v2 automation moved this from Ideas to Done Nov 28, 2022

5 participants