Best practices for metadata conventions #280

Closed
alimanfoo opened this issue Jul 23, 2018 · 13 comments

@alimanfoo
Member

Some communities and applications need to define conventions regarding how array and/or group attributes (.zattrs) are used in order to achieve interoperability. It would be good to capture some best practices regarding how this is done, so that there is some consistency in how conventions are documented and implemented, and conventions can be discovered easily so that different groups don't reinvent the wheel.

This issue is intended for discussion and ideas regarding best practices for defining metadata conventions. Some questions to be addressed (not exhaustive):

  • How should conventions be documented?
  • How/where should conventions be published?
  • How do conventions avoid naming/syntax clashes when more than one convention is used at the same time?
  • Are there any best practices regarding how JSON syntax is used?
@alimanfoo
Member Author

Regarding how conventions are published and maintained, here is one idea. We create a new GitHub repo zarr-developers/zarr-conventions. This is a pure Sphinx documentation repo with a corresponding RTFD site at zarr-conventions.readthedocs.io. Any group or individual can contribute a convention as a new document via a PR. Once the initial PR is accepted, the contributor is given push access to the repo so they can maintain their own convention document.

@alimanfoo
Member Author

Regarding JSON syntax conventions, a few suggestions have been made, collecting them here.

@shoyer suggested a group-level conventions attribute that should be a mapping from convention names to versions, e.g., {'conventions': {'netZDF': 1.0, 'CF': 1.6}}.

@clbarnes mentioned that N5 keeps all of the application-specific optional fields in a dict within the attributes JSON with a name like "_netZDF".
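
To make the two suggestions concrete, here is a minimal sketch of how each layout might look in a group's `.zattrs`. It uses plain dictionaries and the standard `json` module rather than the zarr API. The name `netZDF` and the version numbers come from the examples above; the `dimensions` field and the helper `declared_conventions` are hypothetical and only serve as illustration.

```python
import json

# Layout 1 (suggested by @shoyer): a group-level "conventions" attribute
# mapping convention names to versions.
attrs_with_mapping = {
    "conventions": {"netZDF": 1.0, "CF": 1.6},
}

# Layout 2 (as described for N5): application-specific fields live in a
# dict under a convention-specific key, so they cannot clash with fields
# belonging to other conventions.
attrs_with_namespace = {
    "_netZDF": {"dimensions": ["time", "lat", "lon"]},  # hypothetical field
}


def declared_conventions(attrs):
    """Return the {name: version} mapping from layout 1, or {} if absent."""
    conventions = attrs.get("conventions", {})
    if not isinstance(conventions, dict):
        raise TypeError("'conventions' should be a mapping of name to version")
    return conventions


# Both layouts are plain JSON, so they round-trip through .zattrs unchanged.
assert json.loads(json.dumps(attrs_with_mapping)) == attrs_with_mapping
assert json.loads(json.dumps(attrs_with_namespace)) == attrs_with_namespace
```

One design point worth weighing: numeric versions lose information once parsed (1.10 and 1.1 are the same float), so string versions such as "1.10" may be the safer choice.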

Also noting that JSON-LD may be applicable, although I'm not necessarily advocating the use of JSON-LD as it may be too complicated and/or not fit our requirements.

@jakirkham
Member

Also probably worth looking at Apache Avro's Spec for Schemas, which is also JSON-based.

@jakirkham
Member

Guessing @martindurant has spent a fair bit of time thinking about things like this and may have some thoughts.

@clbarnes
Contributor

clbarnes commented Jul 23, 2018

@alimanfoo I like the idea of a conventions repo; they could even each go into separate repos under this org, with a canonical list of the available conventions (with links to them) in the base zarr README to keep them discoverable (although it would make discovering Convention B from Convention A more difficult). This structure would make it easier to have conventions which fork off each other in a transparent way, and easier to distinguish the histories/activities of particular specs.

On the one hand, it would stop separate conventions cluttering each other's issues and notifications, but on the other hand, maybe it's best to force all convention-maintainers to communicate through the same channels so they can resynchronise as much as possible.

I think it'll be important to try to enforce some structure on at least the landing page of each convention doc so that they're easily comparable, with fields for who maintains it, which zarr version it targets, some applications and so on.

@shoyer
Contributor

shoyer commented Jul 23, 2018 via email

@alimanfoo
Member Author

alimanfoo commented Jul 23, 2018 via email

@martindurant
Member

I would -1 on using Avro as inspiration; its schema description seems unnecessarily complex. They do have schema evolution as a concept, which exists in a number of record-based data encapsulation formats.

I just saw the mention of JSON-schema; this is also complex, but more along the lines of "thorough" than convoluted. I'm not sure it makes a good fit.

The simpler case that people have been mentioning here, of a set of conventions that the data-set adheres to (with versions, I suppose), and all-optional fields, makes sense to me.

One thing in this conversation: shouldn't a metadata convention for a data-set apply at the upper-most level only? In general, I wouldn't think that the constituent parts will meet the convention spec, and it would seem unnecessary in any case to repeat the information.

@alimanfoo
Member Author

Re JSON schema and JSON LD, I thought I'd mention them in case there's anything useful, but in general I'm totally in favour of keeping it simple. I love the tweet that went around recently, "always try to go full stupid"; if zarr has any design principles then this is chief among them.

Also maybe worth noting that, in general, a "metadata convention" could include any/all of the following:

  • Definition of some metadata vocabulary (i.e., attribute names and allowed values for those attributes)
  • Rules for how those properties should be used (e.g., attributes 'foo' and 'bar' should be present on the root group of a hierarchy, attributes 'spam' and 'eggs' should be present on all arrays, etc.)
  • Special processing/inference rules (e.g., values of 'dimensions' attribute should be considered to be inherited by all sub-groups unless explicitly overridden)
  • Conventions regarding naming of groups/arrays (e.g., if there is a dimension 'foo' and an array named 'foo' then assume the array is the coordinate variable for the dimension with the same name)

E.g., the CF conventions provide examples of all of the above.
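
As an illustration of the inheritance rule in the third bullet, the sketch below resolves an attribute by walking from a node up towards the root. It models the hierarchy as a plain dict from path to attributes instead of using the zarr API; the function name `resolve_inherited` and the example paths are hypothetical.

```python
def resolve_inherited(attrs_by_path, path, key):
    """Look up `key` on `path`, falling back to each ancestor group in turn.

    `attrs_by_path` maps a node path such as "a/b" to its attribute dict;
    the root group has the path "".
    """
    parts = [p for p in path.split("/") if p]
    # Try the node itself first, then its parent, and so on up to the root.
    for depth in range(len(parts), -1, -1):
        candidate = "/".join(parts[:depth])
        attrs = attrs_by_path.get(candidate, {})
        if key in attrs:
            return attrs[key]
    raise KeyError(f"{key!r} not set on {path!r} or any ancestor")


hierarchy = {
    "": {"dimensions": ["time", "lat", "lon"]},   # root group
    "model": {},                                   # inherits from root
    "model/station": {"dimensions": ["time"]},     # explicit override
}

print(resolve_inherited(hierarchy, "model", "dimensions"))
print(resolve_inherited(hierarchy, "model/station", "dimensions"))
```

A rule like this means a reader cannot interpret a node's attributes in isolation, which is part of the complexity cost a convention takes on when it defines inference rules.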

Regarding metadata vocabularies, it would be nice if there was a way to mix-and-match more than one vocabulary without fear of name clashes. I know this pushes towards complexity, but seems like an important requirement to enable reuse of vocabularies.

Regarding whether conventions apply at the uppermost level, I imagine this is true in some cases (e.g., CF conventions), but might not be true in others (e.g., where a convention is really just a definition of some useful metadata vocabulary which you can use wherever you need it).

@shoyer also makes a good point, there are other types of convention too, e.g., regarding how data from a particular domain is laid out and stored in zarr, which might not involve any metadata conventions at all (e.g., in our genomics work we have such a convention). What should people do if they have data/usage conventions?

@shoyer
Contributor

shoyer commented Oct 18, 2018

To unambiguously identify conventions, I would suggest including room for both a version number and a URL, in addition to the name.
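
One possible shape for such an entry, sketched with a dataclass. The class name `ConventionRef`, the helper `to_attrs`, the list-valued `conventions` key, and the example.org URL are all assumptions for illustration, not an agreed format.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConventionRef:
    """Identifies a convention by name, version, and defining document."""
    name: str
    version: str  # kept as a string so "1.10" stays distinct from "1.1"
    url: str


def to_attrs(refs):
    """Build a JSON-serialisable attributes dict listing the conventions."""
    return {"conventions": [asdict(ref) for ref in refs]}


refs = [
    # Placeholder URL; a real entry would point at the convention document.
    ConventionRef(name="netZDF", version="1.0", url="https://example.org/netZDF"),
]
attrs = to_attrs(refs)
```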

@bendichter

I am using the consolidated metadata feature in a context where the JSON files will be uploaded to a server. I need a way to validate these files, and I think JSON Schema is ideal for this. I don't have any experience with Avro, but I do have experience with JSON Schema and I think it would work well for this task. JSON-LD would allow you to link easily to other schemas, but I don't think that's a feature that is needed right now. Is a JSON schema available for these files? If not, I may be able to provide one.
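
For reference, a rough sketch of the kind of structural check a schema for these files would need to express. This is a hand-written stand-in using only the standard library, not JSON Schema itself. It assumes the zarr v2 consolidated layout (a top-level `zarr_consolidated_format` of 1 and a `metadata` object whose keys end in `.zgroup`, `.zarray`, or `.zattrs`), and the function name `check_consolidated` is hypothetical.

```python
import json

ALLOWED_NAMES = {".zarray", ".zgroup", ".zattrs"}


def check_consolidated(text):
    """Return a list of problems found in a consolidated metadata document."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(doc, dict):
        return ["top level should be a JSON object"]

    problems = []
    if doc.get("zarr_consolidated_format") != 1:
        problems.append("zarr_consolidated_format should be 1")

    metadata = doc.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("metadata should be an object")
        return problems

    for key, value in metadata.items():
        # Keys look like ".zgroup" or "group/array/.zarray".
        name = key.rsplit("/", 1)[-1]
        if name not in ALLOWED_NAMES:
            problems.append(f"unexpected key: {key}")
        elif not isinstance(value, dict):
            problems.append(f"{key} should be an object")
    return problems


good = json.dumps({
    "zarr_consolidated_format": 1,
    "metadata": {".zgroup": {"zarr_format": 2}, ".zattrs": {}},
})
```

A real JSON Schema would express the same constraints declaratively and could also describe the contents of each `.zarray` entry, which this sketch does not inspect.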

@jhamman
Member

jhamman commented Feb 3, 2024

Closing in favor of ZEP0004: https://zarr.dev/zeps/draft/ZEP0004.html

@jhamman jhamman closed this as completed Feb 3, 2024

7 participants