Best practices for metadata conventions #280
Comments
Regarding how conventions are published and maintained, here is one idea. We create a new GitHub repo, zarr-developers/zarr-conventions. This is a pure Sphinx documentation repo with a corresponding RTFD site at zarr-conventions.readthedocs.io. Any group or individual can contribute a convention as a new document via a PR. Once the initial PR is accepted, the contributor is given push access to the repo so they can maintain their own convention document.
Regarding JSON syntax conventions, a few suggestions have been made; collecting them here. @shoyer suggested a group-level @clbarnes mentioned that N5 keeps all of the application-specific optional fields in a dict within the attributes JSON with a name like "_netZDF". Also noting that JSON-LD may be applicable, although I'm not necessarily advocating the use of JSON-LD, as it may be too complicated and/or not fit our requirements.
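As a rough illustration of the N5-style approach, the sketch below keeps application-specific fields in a single dict under one key of the attributes JSON. The key name `_myconvention` and the fields stored inside it are hypothetical, chosen only for the example.

```python
import json

# Hypothetical namespace key; not defined by zarr or N5.
NAMESPACE = "_myconvention"


def set_namespaced(attrs, namespace, fields):
    """Return a copy of attrs with fields stored under a single namespace key."""
    updated = dict(attrs)
    merged = dict(updated.get(namespace, {}))
    merged.update(fields)
    updated[namespace] = merged
    return updated


def get_namespaced(attrs, namespace):
    """Return the dict stored under namespace, or an empty dict if absent."""
    return dict(attrs.get(namespace, {}))


# Attributes as they might appear in a .zattrs file (illustrative values).
zattrs = {"description": "surface temperature"}
zattrs = set_namespaced(zattrs, NAMESPACE, {"dimensions": ["time", "lat", "lon"]})

# The result is still plain JSON, so it round-trips through json.dumps/loads.
round_tripped = json.loads(json.dumps(zattrs))
```

Keeping everything under one key means a reader that does not know the convention can ignore it wholesale, and top-level attribute names stay free for the user.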
Also probably worth looking at Apache Avro's Spec for Schemas, which is also JSON-based.
Guessing @martindurant has spent a fair bit of time thinking about things like this and may have some thoughts.
@alimanfoo I like the idea of a conventions repo; they could even each go into separate repos under this org, with a canonical list of the conventions available (with links to them) in the base zarr README to keep them discoverable (although it would make discovering Convention B from Convention A more difficult). This structure would make it easier to have conventions which fork off each other in a transparent way, and easier to distinguish the histories and activity of particular specs. On the one hand, it would stop separate conventions cluttering each other's issues and notifications, but on the other hand, maybe it's best to force all convention maintainers to communicate through the same channels so they can resynchronise as much as possible. I think it'll be important to try to enforce some structure on at least the landing page of each convention doc so that they're easily comparable, with fields for who maintains it, which zarr version it targets, some applications and so on.
I'd be happy with a conventions repo, but at least for the conventions I'm
interested in here (dimension names and netCDF) a full repository would be
overkill. These would be short and likely almost entirely static documents.
There is also somewhat of a logical distinction between conventions that add
small metadata layers (like these that I have proposed) and those that
effectively define custom file formats (e.g., a full spec for how to
represent a particular type of dataset in Zarr). Small metadata conventions
are definitely a better fit for being defined alongside Zarr.
Just looking at Avro now, looks like Avro schemas are defined in JSON but
specify a binary encoding for data. For zarr metadata conventions we'd want
some way to specify/constrain what JSON data appear in .zattrs. Maybe JSON
schema? http://json-schema.org/
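To make the idea concrete, here is a minimal sketch of constraining what appears in .zattrs. The schema dict uses a few JSON Schema keywords (`type`, `required`, `properties`), but the checker is a hand-written stand-in that understands only those keywords; it is not a JSON Schema implementation, and the attribute names are made up for illustration.

```python
# Hypothetical schema for a .zattrs document, written with a small subset of
# JSON Schema keywords. The attribute names are illustrative only.
ZATTRS_SCHEMA = {
    "type": "object",
    "required": ["dimensions"],
    "properties": {
        "dimensions": {"type": "array"},
        "units": {"type": "string"},
    },
}

# Mapping from JSON Schema type names to Python types (subset).
_TYPES = {"object": dict, "array": list, "string": str}


def check(instance, schema, path="$"):
    """Return a list of error strings; an empty list means the instance passed."""
    errors = []
    expected = schema.get("type")
    if expected is not None and not isinstance(instance, _TYPES[expected]):
        return [f"{path}: expected {expected}"]
    if isinstance(instance, dict):
        # Report missing required keys first, then recurse into known properties.
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"{path}: missing required key '{name}'")
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(check(instance[name], subschema, f"{path}.{name}"))
    return errors


good = {"dimensions": ["time", "lat"], "units": "K"}
bad = {"units": 5}
```

Keys not listed in the schema are allowed through, which matches the "all-optional fields" style of convention; a real validator would offer many more constraints.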
I would -1 on using Avro as inspiration; its schema description seems unnecessarily complex. They do have schema evolution as a concept, which exists in a number of record-based data encapsulation formats. I just saw the mention of JSON-schema; this is also complex, but more along the lines of "thorough" than convoluted. I'm not sure it makes a good fit. The simpler case that people have been mentioning here, of a set of conventions that the data-set adheres to (with versions, I suppose), and all-optional fields, makes sense to me. One thing in this conversation: shouldn't a metadata convention for a data-set apply at the upper-most level only? In general, I wouldn't think that the constituent parts will meet the convention spec, and it would seem unnecessary in any case to repeat the information.
Re JSON schema and JSON LD I thought I'd mention them in case there's anything useful, but in general I'm totally in favour of keeping it simple. I love the tweet that went around recently, "always try to go full stupid", if zarr has any design principles then this is chief among them. Also maybe worth noting that, in general, a "metadata convention" could include any/all of the following:
E.g., the CF conventions provide examples of all of the above. Regarding metadata vocabularies, it would be nice if there was a way to mix-and-match more than one vocabulary without fear of name clashes. I know this pushes towards complexity, but it seems like an important requirement to enable reuse of vocabularies. Regarding whether conventions apply at the uppermost level, I imagine this is true in some cases (e.g., CF conventions), but might not be true in others (e.g., where a convention is really just a definition of some useful metadata vocabulary which you can use wherever you need it). @shoyer also makes a good point: there are other types of convention too, e.g., regarding how data from a particular domain is laid out and stored in zarr, which might not involve any metadata conventions at all (e.g., in our genomics work we have such a convention). What should people do if they have data/usage conventions?
To unambiguously identify conventions, I would suggest including room for both a version number and a URL, in addition to the name.
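One possible shape for such an identifier, sketched below, is a small record with a name, a version and a URL stored in the attributes. The `conventions` key, the record layout and the example URL are assumptions made for the sketch, not an agreed format.

```python
from typing import NamedTuple


class ConventionId(NamedTuple):
    name: str
    version: str
    url: str


# Hypothetical attribute key under which convention records are listed.
CONVENTIONS_KEY = "conventions"


def declare(attrs, convention):
    """Return a copy of attrs that also declares the given convention."""
    updated = dict(attrs)
    records = list(updated.get(CONVENTIONS_KEY, []))
    record = dict(convention._asdict())
    if record not in records:
        records.append(record)
    updated[CONVENTIONS_KEY] = records
    return updated


def declared(attrs):
    """Parse the declared conventions back into ConventionId tuples."""
    return [ConventionId(**record) for record in attrs.get(CONVENTIONS_KEY, [])]


# Illustrative values only; example.org is a placeholder URL.
cf = ConventionId("CF", "1.7", "https://example.org/conventions/cf/1.7")
attrs = declare({}, cf)
attrs = declare(attrs, cf)  # declaring twice does not duplicate the record
```

Storing a list of records rather than a single one leaves room for a dataset that follows more than one convention.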
I am using the consolidated metadata feature in a context where the JSON files will be uploaded to a server. I need a way to validate these files, and I think JSON Schema is ideal for this. I don't have any experience with Avro, but I do have experience with JSON Schema and I think it would work well for this task. JSON-LD would allow you to link easily to other schemas, but I don't think that's a feature that is needed right now. Is a JSON schema available for these files? If not, I may be able to provide it.
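In the absence of a published schema, a structural check can be written by hand. The sketch below assumes the zarr v2 consolidated metadata layout (a `zarr_consolidated_format` field plus a `metadata` dict keyed by paths ending in `.zgroup`, `.zarray` or `.zattrs`); the function name and the error messages are made up for the example, and it checks far less than a full schema would.

```python
import json

# Names of the per-node documents that zarr v2 stores in consolidated metadata.
KNOWN_SUFFIXES = (".zgroup", ".zarray", ".zattrs")


def validate_consolidated(text):
    """Return a list of problems found in a consolidated metadata document."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    problems = []
    if doc.get("zarr_consolidated_format") != 1:
        problems.append("zarr_consolidated_format must be 1")
    metadata = doc.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("metadata must be an object")
        return problems
    for key, value in metadata.items():
        # The last path component identifies the kind of document.
        name = key.rsplit("/", 1)[-1]
        if name not in KNOWN_SUFFIXES:
            problems.append(f"unexpected key: {key}")
        elif not isinstance(value, dict):
            problems.append(f"{key} must be an object")
    return problems


example = json.dumps({
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "temperature/.zattrs": {"units": "K"},
    },
})
```

A server could run a check like this on upload and reject files with a non-empty problem list before looking at any convention-specific attributes.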
Closing in favor of ZEP0004: https://zarr.dev/zeps/draft/ZEP0004.html
Some communities and applications need to define conventions regarding how array and/or group attributes (.zattrs) are used in order to achieve interoperability. It would be good to capture some best practices regarding how this is done, so that there is some consistency in how conventions are documented and implemented, and conventions can be discovered easily so that different groups don't reinvent the wheel.
This issue is intended for discussion and ideas regarding best practices for defining metadata conventions. Some questions to be addressed (not exhaustive):