DOC: zarr spec v3: adds optional dimensions and the "netZDF" format #276

Closed

Conversation

@shoyer (Contributor) commented Jul 15, 2018

xref #167

For ease of review, this is currently written by modifying docs/spec/v2.rst, but this would of course actually be submitted as a separate v3 file.

This does not yet include any changes to the zarr reference implementation, which would need to grow at least the following (a rough sketch follows the list):

  • Array.dimensions
  • Group.dimensions
  • Group.resize for simultaneously resizing multiple arrays in a group which share the same dimension (conflicting dimension sizes are forbidden by the spec)
  • Group.netzdf for indicating whether a group satisfies the netzdf spec or not.
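
To make these concrete, here is a rough Python sketch of how `Group.resize` might keep shared dimensions consistent, using plain dicts in place of real zarr objects (everything below is a hypothetical illustration, not the reference implementation):

```python
# Hypothetical sketch: plain dicts stand in for zarr metadata.

def group_resize(group_dims, arrays, **new_sizes):
    """Resize every array sharing the named dimensions (Group.resize sketch)."""
    group_dims.update(new_sizes)  # Group.dimensions: name -> size
    for arr in arrays:
        # Array.dimensions: one name (or None) per axis.
        arr["shape"] = tuple(
            new_sizes.get(name, size)
            for name, size in zip(arr["dimensions"], arr["shape"])
        )

arrays = [
    {"dimensions": ("time", "lat"), "shape": (100, 180)},
    {"dimensions": ("time", None), "shape": (100, 3)},
]
group = {"time": 100, "lat": 180}
group_resize(group, arrays, time=120)
print(arrays[0]["shape"], arrays[1]["shape"])  # (120, 180) (120, 3)
```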

Note: I do like "netzdf" but I'm open to less punny alternatives :).

@shoyer mentioned this pull request Jul 15, 2018
@jhamman (Member) commented Jul 16, 2018

Thanks @shoyer for writing this up. I had been using ZCDF as the acronym for this feature set in zarr but also don't have strong feelings about the name at this point. FWIW, @alimanfoo, @rabernat, @mrocklin, and I have had a few offline exchanges on the subject (see https://github.com/jhamman/zcdf_spec for a Zarr subspec that describes what xarray has currently implemented). Without speaking for anyone else, I think there is growing excitement about the concept of a Zarr+NetCDF data model.

@alimanfoo (Member)

Great to see this. I like the design, it's simple and intuitive. Couple of questions...

  1. Do you need a way to handle coordinate variables, i.e., being able to express the fact that an array contains the coordinates for some dimension?

  2. These features could be implemented within .zattrs without requiring any changes to the spec. I'm open to considering a spec change, and there may be other reasons for wanting to update the spec in the not-too-distant future (xref Complex dtype error #244). But a spec change will mean some disruption for existing users and some minor complexities about supporting migration etc. Ultimately I'll be happy to follow the consensus but was just wondering what the rationale was for including these changes within the core metadata and not .zattrs.

@mrocklin (Contributor)

These features could be implemented within .zattrs without requiring any changes to the spec

I find myself agreeing with this. I think that ideally Zarr would remain low-level and that we would provide extra conventions/subspecs on top of it.

My understanding is that one reason for HDF's current inertia is that it had a bunch of special features glommed onto it by various user communities. If we can separate these out that might be nice for long term maintenance.

@shoyer (Contributor, Author) commented Jul 16, 2018

Do you need a way to handle coordinate variables, i.e., being able to express the fact that an array contains the coordinates for some dimension?

No, for more sophisticated metadata needs we can simply use a subset of CF Conventions. These are pretty standard for applications that handle netCDF files, like xarray.

Ultimately I'll be happy to follow the consensus but was just wondering what the rationale was for including these changes within the core metadata and not .zattrs.

This is a good question. Mostly it comes down to having the specs all in one place, so it's obvious where to find this convention for everyone implementing the zarr spec. Dimensions are broadly useful enough for self-described data that I think people in many fields would find them useful. In particular, I would hate to see separate communities develop their own specs for dimensions, just because they didn't think to search for zarr netcdf.

I also think there are probably use cases for defining named dimensions on some but not all arrays and/or axes. This wouldn't make sense as part of the "netzdf" spec which xarray would require.

Finally, incorporating dimensions directly in the data model follows precedent from the netCDF format itself, which is actually pretty simple. I agree that we don't want to make zarr as complex as HDF5 (which is part of why these aren't full-fledged dimension scales), but adding a couple of optional metadata keys is only a small step in that direction.

@jakirkham (Member)

Have not thought about this too deeply yet. So this is just a very rough idea that we can discard if it doesn't fit, but what if we added ZCDF as a library within the org that built off Zarr? This would address some of the discoverability, and feature creep concerns raised thus far. It would also eliminate the need for things like checks as to whether the NetCDF spec is implemented by specific objects.

@mrocklin (Contributor) commented Jul 16, 2018 via email

@alimanfoo (Member)

This is a good question. Mostly it comes down to having the specs all in one place, so it's obvious where to find this convention for everyone implementing the zarr spec. Dimensions are broadly useful enough for self-described data that I think people in many fields would find them useful. In particular, I would hate to see separate communities develop their own specs for dimensions, just because they didn't think to search for zarr netcdf.

FWIW we could add this as a "NetZDF spec" (or whatever name) alongside the existing storage specs in the specs section of the Zarr docs, should be pretty visible (in fact might be more visible as it would get its own heading in the toc tree).

I would be keen to minimise disruption for existing users and implementers if possible. A spec version change would imply some inconvenience, even if relatively small, as existing data would need to be migrated.

@mrocklin (Contributor) commented Jul 16, 2018 via email

@shoyer (Contributor, Author) commented Jul 16, 2018

My understanding is that this proposal is entirely compatible (both backwards and forwards) with existing data

Indeed. I considered naming this "v2.1" based on semantic versioning until I saw that the zarr spec only uses integer versions.

The only backwards incompatibility it introduces is the addition of new optional metadata fields. I would hope that any existing software would simply ignore these, rather than assume that no fields could ever be introduced in the future.

@shoyer (Contributor, Author) commented Jul 16, 2018

Have not thought about this too deeply yet. So this is just a very rough idea that we can discard if it doesn't fit, but what if we added ZCDF as a library within the org that built off Zarr? This would address some of the discoverability, and feature creep concerns raised thus far. It would also eliminate the need for things like checks as to whether the NetCDF spec is implemented by specific objects.

Yes, this makes some amount of sense.

The main downside of incorporating these changes into zarr proper is that for netCDF compatibility we really want the guarantee of consistent dimension sizes between arrays. This would require a small amount of refactoring and additional complexity to achieve within the Zarr library.

docs/spec/v2.rst Outdated
(any non-``null`` value), MUST also be defined on an ancestor group. Dimension
sizes can be overwritten in descendant groups, but the size of each named
dimension on an array MUST match the size of that dimension on the most direct
ancestor group on which it is defined.
@shoyer (Contributor, Author) commented:

I think I'm going to change this, to make group dimensions and consistency entirely optional:

If dimensions are set in a group, their sizes on all contained arrays
are REQUIRED to be consistent. Dimension sizes can be overwritten
in descendant groups, but the size of each named dimension (any
non-`null` value) on an array MUST match the size of that dimension
on the most direct ancestor group on which it is defined.
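
A minimal sketch of that rule, assuming a hypothetical helper that is handed the dimension sizes from the most direct ancestor group that defines them:

```python
def check_consistency(ancestor_dims, array_dims, array_shape):
    # ancestor_dims: sizes from the nearest ancestor group defining each
    # dimension, e.g. {'time': 100}; array_dims: one name or None per axis.
    for name, size in zip(array_dims, array_shape):
        if name is not None and name in ancestor_dims and ancestor_dims[name] != size:
            raise ValueError(
                f"size {size} of dimension {name!r} conflicts with "
                f"group-level size {ancestor_dims[name]}"
            )

check_consistency({"time": 100}, ["time", None], (100, 3))  # passes
# check_consistency({"time": 100}, ["time", None], (99, 3))  -> ValueError
```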

@jakirkham (Member)

From a neuroscience data perspective, this gets pretty complicated pretty fast if one wants to be general. Please see NWB as an example. Personally, I wouldn't want Zarr to take any of this on. It would be better handled in a library on top of Zarr. Note that NWB is currently built on top of HDF5, but it would be reasonable to consider an NWB spec on top of Zarr.

Can't speak to the Earth Sciences or what people in this field want out of Zarr. If dimensionality is the one thing desired, maybe this is ok. If there are 5 or 6 more things needed in the pipe, maybe having a library built on top of Zarr would be better. Would be good if some people could answer these sorts of questions.

@jakirkham (Member)

Sorry for the multiple posts. GitHub is having some sort of issue that is affecting me.

@jakirkham (Member)

Some miscellaneous thoughts about dimensionality in our field since Matt asked.

Naming dimensions has certainly come up before. Here is one example and another. Also some discussion about axes names in this comment and below. Scientists definitely like having this sort of feature, as it helps them keep track of what something means and is useful if the order ever needs to change for an operation. So this sort of use case benefits from the proposal.

The other thing that typically comes to mind when discussing dimensions, which I don't think has come up thus far, is units. It's pretty useful to know something is in ms, mV, or other relevant units. Libraries like quantities or pint are useful for tracking units and combining them sensibly. This could be an addition to the proposal or perhaps something to add to a downstream data format library.

For tracking time, in some cases we have timestamps. This supplants the need for dimensions or units and often parallels other information (e.g. snapshots of other data at a particular time). This could use existing things like structured arrays.

However, when applying some basic machine learning, dimensions pretty quickly become a mess, especially if various different kinds of dimensions get mixed together. For example, PCA is a pretty common technique to perform in a variety of cases to find the biggest contributors to some distribution. The units of this sort of thing are frequently strange and difficult to think about. This case probably either needs a different proposal or users have to work with existing metadata information to make this work for their use case.

@mrocklin (Contributor) commented Jul 16, 2018 via email

@DennisHeimbigner
From the point of view of the Unidata netcdf group, named dimensions (shared dimensions in netcdf parlance) are essential for managing coordinate variables. So the netcdf extension to Zarr (or possibly TileDB) will include only named dimensions, and anonymous dimensions will probably be suppressed. We went around about this with the HDF5 group long ago. One of the sticking points was multi-dimensional coordinate variables.

@alimanfoo (Member)

Speaking as a user in the genomics domain, I certainly would find this feature useful, it is common to have multiple arrays sharing dimensions. I don’t have a broad experience in other domains but expect this feature to be generally very useful. So I am very supportive and would like to give this as much prominence as possible.

My reasons for leaning towards use of .zattrs are not meant in any way to diminish the importance or broad applicability of this feature; they are based purely on technical considerations, basically on what is easiest to implement and provides the least disruption for existing users and implementers.

My understanding is that this proposal is entirely compatible (both backwards and forwards) with existing data

Yes in theory, although unfortunately it’s not quite that simple in practice. I’ll try to unpack some details about versioning and change management in Zarr. Btw I’m not suggesting this is ideal or the best solution; thinking ahead about possible changes and managing compatibility is quite hard.

This proposal adds a new feature (dimensions) to the Zarr storage spec. This feature is optional in two senses. First it is optional in that it specifies elements that do not need to be present in the array or group metadata. Second it is optional for the implementation, i.e., an implementation can ignore these elements if present in the metadata and still be conformant with the storage spec.

When I wrote the v2 storage spec and was thinking about mechanisms for managing change, for better or worse, I did not allow any mechanisms for adding optional features to the spec. There is no concept of minor spec versions, only major versions (single version number). The only way the spec can change is via a major version change, which implies a break in compatibility. If the current implementation finds anything other than “2” as the spec version number in array metadata, it raises an exception. The spec does not define any concept of optional features or leave open the possibility of introducing them (other than via a major version change).

If I had been farsighted, I might have seen this coming, and I might have defined a notion of optional features, which could be introduced via a minor version increment to the spec, and implementations could include some flexibility in matching the format minor version number when deciding if they can read some data or not. To be fair I did give this some thought, although I couldn’t have articulated it very well at the time. In the end I decided on a simple versioning policy, I think partly because it was simple to articulate and implement, and also because I thought that the user attributes (.zattrs) always provided a means for optional functionality to be layered on. Also the separation between .zattrs and core metadata (.zarray, .zgroup) is nice in a way because it makes it very clear where the line is between optional and required features. I.e., to be conformant, a minimal implementation has to understand everything in .zarray, and can ignore everything in .zattrs.

So given all this, there are at least three options for how to introduce this feature. In the below, by “old code” I mean the current version of the zarr package (which does not implement this feature), by “old data” I mean data created using old code, by “new code” I mean the next version of the zarr package (which does implement this feature), and by “new data” I mean data created using new code.

Option 1: Use .zattrs, write this as a separate spec. Full compatibility, old code will be able to read new data, and new code will be able to read old data.

Option 2: Use .zarray/.zgroup, incorporate into the storage spec, major version bump (v3). Old code will not be able to read new data. New code can read old data if data is migrated (which just requires replacing the version number in metadata files) or if new code is allowed to read both v2 and v3.

Option 3: Use .zarray/.zgroup, incorporate into the storage spec but leave spec version unchanged (v2). Full compatibility, old code will be able to read new data, and new code will be able to read old data. However, this is potentially confusing because the spec has changed but the spec version number hasn’t.

Hence I lean towards option 1 because it has maximum compatibility and seems simplest/least disruptive. But very happy to discuss. And I’m sure there are other options too that I haven’t thought of.

@rabernat (Contributor)

The other thing that typically comes to mind when discussing dimensions, which I don't think has come up thus far is units.

For tracking time in some cases we have timestamps.

@jakirkham - both of these issues arise in geoscience use cases. We handle them by providing metadata that follows CF conventions and then using xarray to decode the metadata into appropriate types (like `numpy.datetime64`). This works today with zarr + xarray and doesn't require any changes to the spec.
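
As a minimal sketch of what that looks like in practice (store path illustrative), xarray decodes CF attributes such as time units into `numpy.datetime64` on open:

```python
import xarray as xr

# decode_cf defaults to True, so CF metadata like
# "units": "days since 2000-01-01" becomes datetime64 values.
ds = xr.open_zarr("example.zarr")
print(ds["time"].dtype)  # datetime64[ns] after CF decoding
```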

@shoyer (Contributor, Author) commented Jul 16, 2018

From the point of view of the Unidata netcdf group, named dimensions (shared dimensions in netcdf parlance) are essential for managing coordinate variables. So the netcdf extension to Zarr (or possibly TileDB) will include only named dimensions, and anonymous dimensions will probably be suppressed. We went around about this with the HDF5 group long ago. One of the sticking points was multi-dimensional coordinate variables.

Yes, this is why I want named dimensions.

I don't think we need explicit support for multi-dimensional coordinate variables in Zarr. NetCDF doesn't have explicit support for coordinates at all, and we get along just fine using CF conventions.

HDF5's dimension scales include coordinate values as well as dimension names, but in my opinion this is unnecessarily complex. Simple conventions, like treating a variable with the same name as a dimension as supplying that dimension's coordinate values, are sufficient.

@rabernat (Contributor)

It might be useful for the discussion if I explain what xarray currently does to add dimension support to zarr stores. This might help clarify some of the tradeoffs between option 1 (just use .zattrs) vs. options 2/3.

When xarray creates a zarr store from an xarray dataset, it always creates a group. On each array in the group, it creates an attribute called _ARRAY_DIMENSIONS. The contents of the attribute are a list whose length is the same as the array ndim. The items correspond to the dimension name of each axis.

When the group is loaded, xarray checks for the presence of this key in the attributes of each array. If it is missing it raises an error--xarray can't read arbitrary zarr stores, only those that match its de-facto spec. If it finds the _ARRAY_DIMENSIONS, it uses it to populate the variable dimensions. (Xarray's internal consistency checks would raise an error if there were a conflict in sizes or if the dimension coordinate variables were not present in the group.) Xarray also has to hide this attribute from the user so that it can't be directly read or modified at the xarray level. This last step is a price we pay for the fact that the dimensions property is part of the "user space metadata" (.zattrs) rather than the core zarr metadata (.zarray).
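
To illustrate, the attribute xarray writes can be inspected directly with zarr (store path and variable name are hypothetical):

```python
import zarr

group = zarr.open_group("dataset.zarr", mode="r")
arr = group["temperature"]
# One dimension name per axis, as written by xarray.
print(arr.attrs["_ARRAY_DIMENSIONS"])  # e.g. ['time', 'lat', 'lon']
assert len(arr.attrs["_ARRAY_DIMENSIONS"]) == arr.ndim
```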

@DennisHeimbigner

NetCDF doesn't have explicit support for coordinates at all

I do not believe this is completely correct; there is no syntactic support, but if you look at the netcdf 3 and 4 specifications, it is part of the netcdf semantics.

@DennisHeimbigner

WRT things like units, you need to be very careful about embedding domain-specific semantics into the data model. Our experience is that this is best left to metadata conventions.

@DennisHeimbigner

Remember that the same dimension may be used in multiple variables, so it is probably not a good idea to attach dimension information (other than the name) to a variable.

@mrocklin (Contributor)

Just wanted to briefly chime in that I'm very happy to see NetCDF folks active in this discussion.

@DennisHeimbigner

BTW, one common example of multidimensional coordinate variables is when defining a trajectory data set.

@shoyer (Contributor, Author) commented Jul 16, 2018

@alimanfoo I suspect this will not be the last change we will want in the zarr spec (e.g., to support complex numbers), so it might make sense to "bite the bullet" now with a major version number increase, and at the same time establish a clear policy on forwards compatibility for Zarr. I am confident that Zarr will only become more popular in the future!

I would suggest looking at the forwards compatibility policies from HDF5 and protocol buffers for inspiration:

  • HDF5 files write the minimum possible version number that supplies all needed features, to ensure that old clients can read files written with newer versions of the HDF5 library.
  • Protocol buffers are a domain-specific language for writing custom file formats with automatically generated interfaces. They are widely used at Google and elsewhere (e.g., Apache Arrow uses a protocol buffer successor called flatbuffers). Hard experience has taught us that the right way to handle forward compatibility concerns is to ensure that protocol buffer implementations ignore but preserve unknown fields. Protocol buffers are designed to evolve by adding new fields, but changing the meaning of existing fields is strongly discouraged (this would correspond to a major version bump).

Going forward, I would suggest the following forward and backwards compatibility policies, which we can add to the spec:

  • Backwards compatibility: As much as practical, new versions of the Zarr library should support reading files generated with old versions of the spec (e.g., the zarr library version 3 should still support reading version 2 stores).
  • Forwards compatibility: New versions of the Zarr library should write the minimum possible version number.
  • Minor version numbers: The Zarr spec should specify versions with a string (e.g., "2.1") instead of a number. Minor version numbers indicate forwards compatible changes (e.g., the use of new optional features, such as dimension names). Older versions of the Zarr library should support reading newer files and simply ignore/preserve unknown fields. Issuing a warning would be appropriate.

Doing a little more searching, it appears that such a convention is actually widely used. E.g., see "Versioning Conventions" for ASDF and this page on "Designing File Formats" (it calls what I've described "major/minor" versioning).
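
A small sketch of how such major/minor checking might behave (function names are hypothetical, not part of any spec):

```python
import warnings

def parse_version(spec_version: str):
    """Split a "major.minor" version string into comparable integers."""
    major, _, minor = spec_version.partition(".")
    return int(major), int(minor or 0)

def check_readable(data_version: str, supported_version: str):
    data_major, data_minor = parse_version(data_version)
    sup_major, sup_minor = parse_version(supported_version)
    if data_major != sup_major:
        # Major bump: incompatible change, refuse to read.
        raise ValueError(f"cannot read spec v{data_version}")
    if data_minor > sup_minor:
        # Minor bump: only new optional features; ignore them with a warning.
        warnings.warn(
            f"data uses spec v{data_version}, newer than supported "
            f"v{supported_version}; ignoring unknown optional fields"
        )

check_readable("2.0", "2.1")  # older data, fine
check_readable("2.1", "2.0")  # newer minor version: warns, still readable
# check_readable("3.0", "2.1")  -> ValueError
```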

@shoyer (Contributor, Author) commented Jul 16, 2018

@DennisHeimbigner

NetCDF doesn't have explicit support for coordinates at all

I do not believe this is completely correct; there is no syntactic support, but if you look at the netcdf 3 and 4 specifications, it is part of the netcdf semantics.

In the netCDF spec I find "coordinates" only mentioned for netCDF4, specifically for the _Netcdf4Coordinates. That said, I don't really understand why this exists: netCDF's public APIs (e.g., nc_def_var) don't reference coordinates at all.

I see that internally, netCDF4 maintains its own notion of "dimension scales" that support more than 1 dimension (beyond what HDF5 supports), which it appears to use for variables if their first dimension matches the name of the variable:
https://github.com/Unidata/netcdf-c/blob/7196dfd6064d778a9973797200d8e64c999d63c5/libsrc4/nc4var.c#L594-L598

Note that this definition of a multi-dimensional coordinate does not even match the typical interpretation of "coordinates" by software that reads netCDF files. Per CF Conventions, "coordinates" are defined merely by being referenced by a "coordinates" attribute on another variable, without any requirements on their name matching a dimension.

I'm getting a little off track here, but I think the conclusions we can draw from the netCDF4 experience for Zarr are:

  1. Dimension scales as implemented in HDF5 aren't even a particularly good fit for the netCDF data model, given the lengths to which netCDF4 had to go to adapt its data model to HDF5.
  2. We don't need to expose an explicit notion of coordinates (as understood by CF Conventions and xarray) for a netCDF-like API. This can be handled by downstream conventions.

@DennisHeimbigner

I stand corrected. One discussion of coordinate variables is here, as a convention:
https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_set_components.html#coordinate_variables
That reference says that it is a convention, but the source code does take them into account.

You are correct. We have long recognized that using dimension scales in the netcdf-4 code was probably a mistake, and it contorts the code in a number of places.

The multi-dimensional issue is complex because it is most often used with what amounts to point data (like trajectories), and others have noted that indexed array storage is not very efficient at handling point data: relational systems work much better. So the multi-dim coordinate variable may be a red herring wrt this discussion.

@ambrosejcarr

Thanks @mrocklin for ccing me onto this thread. By way of introduction, I'm contributing to a package to process image-based transcriptomics (biological microscopy) data, which is easiest to think about as a bunch of two dimensional tiles that sample the z-dimension of the tissue being imaged for multiple color channels and time points. The tiles can be stuck together as a 5-d tensor.

We're defining a simple file format for this data which right now is a JSON schema that specifies how to piece the tensor together from a collection of TIFF files stored on a disk or file system, and stores relevant metadata about each tile. This looks a lot like zarr (and Z5), and we're excited by the prospect of contributing to an existing ecosystem instead of rolling our own.

One concern we had was that the zarr format could be harder for someone to pick up and use outside the zarr/python ecosystem (we expect to have many R users, for example). The addition of column names is a really nice step towards greater expressivity and self-description, so we like this change quite a lot.

Off-topic, we have some more general questions about the zarr format. Is there someone who would be a good contact?

tagging @dganguli @freeman-lab @ttung

@alimanfoo (Member)

Thanks @ambrosejcarr for getting in touch and sharing your use case, very interesting.

Regarding usage from R, I've raised a separate issue (#279), would be great to explore ways of making that possible.

If you have general questions please feel free to ask via the issue tracker, no problem if an issue is a bag of different questions.

@shoyer (Contributor, Author) commented Jul 20, 2018

My current thought for revising this proposal is that, at the very least, the "netZDF" spec (or "Zarr-netCDF format", to avoid new jargon) should be separated into another doc.

However, I still think that optional dimension names as described here (on arrays and groups) could have a place in the Zarr format itself -- assuming we figure out backwards/forwards compatibility concerns.

As for additional conventions themselves (referenced in my draft netZDF spec), I'm thinking that it could make sense to define a group-level conventions attribute (either private or public) that should be a mapping from convention names to versions, e.g., {'conventions': {'netZDF': 1.0, 'CF': 1.6}}. That's definitely a better choice than mere boolean values, and is nicely extensible by other domain specific conventions without needing to hard wire "netZDF" into the Zarr spec.
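
Such an attribute could be written with zarr today; a minimal sketch (store path illustrative, assuming the mapping lives in ordinary user attributes):

```python
import zarr

group = zarr.open_group("dataset.zarr", mode="a")
group.attrs["conventions"] = {"netZDF": 1.0, "CF": 1.6}

# A reader checks only for the conventions it understands:
conventions = group.attrs.get("conventions", {})
if "netZDF" in conventions:
    print("netZDF version:", conventions["netZDF"])
```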

@shoyer (Contributor, Author) commented Jul 20, 2018

The addition of column names is a really nice step towards greater expressivity and self-description, so we like this change quite a lot.

@ambrosejcarr could you clarify what you mean by "column names" here? These changes currently only refer to dimension names. In principle, column names could either be represented by different arrays (with different names) or "column" dimension with an additional array providing coordinate labels. In netCDF files, the typical convention is to use a variable with the same name as its sole dimension: the values of that 1D array provide labels for each point.
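
For instance, under that convention a 1D variable named time on a dimension named time supplies the labels for every array sharing that dimension (a sketch with illustrative data):

```python
import numpy as np
import xarray as xr

# "time" is both a dimension name and a 1D variable on that dimension,
# so its values label every array that uses the "time" dimension.
ds = xr.Dataset(
    {"temperature": (("time", "station"), np.zeros((3, 2)))},
    coords={"time": ("time", np.array([10, 20, 30]))},
)
print(ds["temperature"].dims)  # ('time', 'station')
print(ds["time"].values)       # [10 20 30]
```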

@ambrosejcarr commented Jul 21, 2018

@shoyer Sorry, I was imprecise in my language, dimension names are what I meant to refer to. If I'm interpreting dimension names properly, they are a vector of values equal in length to the number of dimensions in the array. For our data, that would be channel, imaging_round, z, y, and x.

@clbarnes (Contributor) commented Jul 23, 2018

I can't speak for @constantinpape , who is really the architect of z5py while I've just been tinkering round the edges, but I think it's unlikely that a new zarr spec which created a significant divergence from N5 would make it into z5py any time soon - at v1.0, it only supports a subset of the fairly simple (and flexible) N5 spec. @axtimwalde would need to confirm but I would guess that N5 is likely to stay minimal, with individual applications defining their own schema for handling more complicated attributes.

This shouldn't discourage you from making these changes, of course - dimension labelling is certainly something which would be helpful to us in the neuroimaging field, where the tool set uncomfortably straddles the Fortran-ordered Java/ImageJ and C-ordered Python realms, and enforcing dimension parity between raw volumes, labels and so on might be nice too.

The combination of these factors makes me personally lean towards

Option 1: Use .zattrs, write this as a separate spec. Full compatibility, old code will be able to read new data, and new code will be able to read old data.

where there is a well-defined way (and probably library) to represent a netCDF-like schema in zarr but it purely uses zarr as a backend rather than being built into the format itself. Naming collisions with user-defined attributes should obviously be avoided; in N5 this is done simply by keeping all of the application-specific optional fields in a dict within the attributes JSON with a name like "_netZDF".

P.S. In general I'm also in favour of supporting minor version changes; serialising it as a string is fine but it's quite nice to have at least the option of returning a __version_info__-like tuple for comparison purposes.
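
Both points can be sketched briefly; the "_netZDF" key name comes from the comment above, and the rest is illustrative:

```python
# All spec-specific fields live under one namespaced key in .zattrs,
# so they cannot collide with user-defined attributes:
attrs = {
    "my_user_key": 42,
    "_netZDF": {"version": "1.0", "dimensions": ["time", "lat", "lon"]},
}

def version_info(version: str):
    # A __version_info__-like tuple for easy comparison.
    return tuple(int(part) for part in version.split("."))

assert version_info(attrs["_netZDF"]["version"]) == (1, 0)
assert version_info("2.1") > version_info("2.0")
```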

@constantinpape

Regarding z5py: I am close to a 1.0 release (I only need some time, and to get my hands on a Mac and a Windows machine to make sure of cross-platform compatibility).
This version will support subsets of the zarr v2.0 spec (I think the only thing missing is F-order) and the N5 specification.

As @clbarnes says, it is unlikely that we will support a zarr format that diverges too far from the N5 specification any time soon (or at least I won't have time to implement this any time soon).
From my point of view, this would of course mean that forward compatibility would be desirable.

@alimanfoo (Member)

Thanks @clbarnes for the comments, much appreciated.

Thanks also for raising the point about naming collisions, I was just thinking about that. It would be good to figure out some best practices for how metadata conventions are defined and used. I'm more than happy for that conversation to continue here, but I've also surfaced that as a separate issue to give it some prominence: #280.

The following keys MAY be present:

dimensions
A list of string or ``null`` values providing optional names for each ofthe
Review comment (Member):

s/ofthe/of the/

@rabernat (Contributor)

Do we want to make a decision about what to do here?

Maybe @WardF and the Unidata team might want to weigh in with their thoughts about the future direction of netCDF / zarr integration?

@WardF commented Oct 18, 2018

Absolutely. Reviewing now for comment :)

@shoyer (Contributor, Author) commented Oct 18, 2018

It sounds like the consensus was leaning towards making this a separate "Zarr Dimensions Spec", which we could feature in the Zarr docs.

I'd be happy with that, with my main concern being that we should have an unambiguous way to avoid name collisions between spec attributes and arbitrary user-defined attributes.

@WardF commented Oct 18, 2018

I would agree with making this a separate spec as part of the Zarr documentation.

We are working towards adding a new format in the core C library; the underlying file storage adheres to the Zarr spec (stored either in object storage or on local disk), with a corresponding data model represented by the intersection of the NetCDF enhanced data model and the Zarr data model. Zarr already supports the "big" features that are primarily used in the enhanced file format: compression and chunking. (As an aside, I was in a meeting yesterday and the topic of alternatives to the enhanced file format in an HPC setting was raised. I have no idea if Zarr is a potential solution for HPC in a posix environment, yet, but if it is, there seems to be a lot of interest.)

In regards to Zarr/NetCDF integration in the future: I think it would be great to have a defined spec or convention that would be adhered to by those writing netCDF model data using Zarr, as well as those writing model data using the core C library/NetCDF API. Our thinking has been that by adhering to the functionality provided by Zarr, it would be a matter of documenting this new format so that anybody writing the model data via Zarr would have the information they need to ensure compatibility.

From a technical standpoint, we need to suss out the architecture of the netCDF-C dispatch layer to map into whatever API we end up using or, as appears more and more likely, implement ourselves. While we'd love to use z5, we need to stick to C-compatible implementations for the time being.

So in a nutshell, we are very interested in the potential integration of Zarr/NetCDF, and will participate as best we can on the Zarr side of things in addition to focusing on the NetCDF library development. I'll also try to broadcast our plans a little more clearly; I shared them at a Pangeo workshop last month(?), and at a couple other meetings since then. I'll see about writing a blog post or finding some other signal-boosting platform on our side.

Thanks to everybody who tagged me, and then emailed me, and then emailed me again to make sure I stayed focused on weighing in on this, it shouldn't be necessary now that the 2018 workshop is over :).

@jakirkham (Member)

Thanks for the update @WardF.

A pure C library that implements the Zarr spec would be a huge win. If this were open sourced with a sufficiently permissive license, I think the community would take to that very quickly; adding various language bindings on top of it. Some discussion along those lines came up in issue ( https://github.com/zarr-developers/zarr/issues/285 ).

FWIW most of the people I know (myself included) use Zarr/N5 for large image data processing in an HPC setting. So this is not only a reasonable path for others using HPC, but one that is being actively exercised.

Sticking with a pure C base implementation makes perfect sense. The z5 route was mainly interesting from the standpoint of getting something simple up and running quickly and then evolving based on community feedback. Starting with a pure C implementation is definitely the more principled way of doing this and has all the benefits that a pure C implementation comes with. (On a side note: The libxnd container library seems very useful for this effort.)

Glad to hear we will be seeing you around more. Looking forward to reading the blog post. Thanks again for the update. :)

@WardF commented Oct 19, 2018

With the caveat that I haven’t spoken with @DennisHeimbigner about it, I wonder if having that stand-alone Zarr C library would make the most sense, as opposed to baked-in to the rest of the C library. It feels like a good idea insofar as it would be more likely to enjoy broader adoption.

@clbarnes (Contributor)

I wonder if having that stand-alone Zarr C library would make the most sense, as opposed to baked-in to the rest of the C library

+1_000_000

Especially if the specification strategy is going to be a base zarr spec with different projects defining their own attribute schemas on top of it, a base C implementation that everyone can use, from whatever language they choose to build the reference implementation of their schema in, would be of immense value compared to every schema having to work from the ground up.

So long as the base library could be distributed in such a way that this modular approach is easy, that is! I haven't worked with C myself but I understand the packaging and dependency management ecosystems are non-trivial.

I have worked a little with rust, where such concerns are very much trivial. It also seems that generating C bindings at build time is not difficult, and much of the logic could probably be lifted straight from https://github.com/aschampion/rust-n5 .

@rabernat (Contributor) commented Oct 19, 2018

It's great to see so much support for the C library. I am all in on this.

Who would be a good candidate to actually develop such a library? This is important enough to enough [funded] projects that we are not necessarily talking about a volunteer effort.

@jhamman (Member) commented Oct 19, 2018

Also +1 on a stand-alone Zarr C library. I'll advocate that such a library should live in the zarr-developers org and that UNIDATA devs should take an active role in helping build/maintain it. I suspect we'll find other interested groups that can help with the development/maintenance burden. I recognize that is somewhat orthogonal to the business as usual development strategy but I think the long term benefits to UNIDATA and other Zarr communities will pay off.

@WardF commented Oct 19, 2018

The ownership and maintenance of any project certainly merits discussion. It would be fantastic if there were an externally-owned project that we could contribute to; there's no inherent reason it should be a Unidata product. There would be questions about how much responsibility we can take for non-Unidata products in the long term, but in the immediate future it is work that needs to be done regardless of who owns the product. It is certainly something we would be interested in collaborating on. Any thoughts, @DennisHeimbigner ?

@WardF commented Oct 19, 2018

Some stream-of-consciousness thoughts about a stand-alone library, and what we need from external libraries adopted for use in netCDF.

  1. The licensing is a big thing; BSD 3-Clause is what we're standardizing on here for many of our products (stay tuned for a related announcement with the next netCDF release).
  2. Can be built in a cross-platform manner, for at least the big 3 (Linux, OSX, Windows+Visual Studio). This means cmake-based build configuration to support the latter case, although not necessarily exclusively; autoconf-based tools still work well enough on Linux & OSX.
  3. Longevity concerns; to be honest, if this is work we participate in (and we are interested in collaborating), and the owning organization makes changes either to the licensing or goes away, Unidata will need to be in a position to take ownership of the library in the absence of a good alternative. This doesn't seem likely of course, but it is on our minds given recent developments with other libraries we depend on.

No point to these other than they are considerations from the netCDF team, moving forward with any solution.

@mrocklin (Contributor)

From @jhamman

I'll advocate that such a library should live in the zarr-developers org and that UNIDATA devs should take an active role in helping build/maintain it

From @WardF

The ownership and maintenance of any project certainly merits discussion. It would be fantastic if there were an externally-owned project that we could contribute to; there's no inherent reason it should be a Unidata product. There would be questions about how much responsibility we can take for non-Unidata products in the long term, but in the immediate future it is work that needs to be done regardless of who owns the product

From @WardF

Longevity concerns; to be honest, if this is work we participate in (and we are interested in collaborating), and the owning organization makes changes either to the licensing or goes away, Unidata will need to be in a position to take ownership of the library in the absence of a good alternative. This doesn't seem likely of course, but it is on our minds given recent developments with other libraries we depend on.

From my perspective, if this project is likely to be mostly used by the earth science community then I have no concern with Unidata owning the code. They've proven to be effective stewards in the past and have a longer time horizon than typical OSS groups. My only concern here would be that people outside of Unidata would need an easy way to also maintain some level of control and activity. This is hard.

However, if this project is likely to be used by groups outside of the earth science community, like @jakirkham 's imaging groups or @ambrosejcarr 's genomics/bio groups, then I would suggest that the project be legally owned by a stewardship organization like NumFOCUS, but provide permissive commit/repo-ownership rights to at least one representative of each of the scientific domains so that no one gets blocked by the inaction of others should they choose to take a different path in the future.

Regarding @WardF 's other points, I don't anticipate permissive OSS licenses or cross-platform build objectives to be contentious issues in this crowd.

@jakirkham (Member)

This discussion is great! Thanks all for sharing your thoughts.

Am thinking we should probably segue into a different thread to discuss the pure C implementation of Zarr. Have raised issue ( https://github.com/zarr-developers/zarr/issues/317 ) for this purpose. Sorry for not doing this sooner. Look forward to discussing with you over there.

@jakirkham (Member)

Should have added, in regards to @mrocklin's point about organizational structure: we have been discussing this broadly w.r.t. Zarr in issue ( https://github.com/zarr-developers/zarr/issues/291 ), and more detailed discussions on specific points have been broken out from there.

@ambrosejcarr commented Dec 27, 2018

However, if this project is likely to be used by groups outside of the earth science community like @jakirkham 's imaging groups or @ambrosejcarr 's genomics/bio groups

Checking back in here -- I'm prototyping a zarr-based library for our imaging project now, and we've already swapped our sequencing-based output data to zarr. It seems very likely that we will use zarr in production for at least one aspect of our project. I'll chime in on #291 and will likely have a few features to suggest over the next few months. 👍

@jhamman (Member) commented Dec 7, 2023

I suggest we close this as it is now quite stale. The v3 spec conversation, along with ZEP 4 and NCZarr, has surpassed this design. Feel free to reopen if there is more to do here.

@jhamman closed this Dec 7, 2023