Pure C implementation #9

Open
jakirkham opened this issue Oct 22, 2018 · 62 comments

Comments

@jakirkham
Member

jakirkham commented Oct 22, 2018

Raising this based on feedback in a number of different issues (xref'd below). The suggestion is to implement a pure C Zarr library. There are questions that range from where the code should live all the way down to details about the code (licensing, build system, binary package availability, etc.). Others include scope, dependencies, and so on. I'm raising it here to provide a forum focused on discussing these.

xref: https://github.com/zarr-developers/zarr/issues/285
xref: zarr-developers/zarr-python#276
xref: constantinpape/z5#68

Update (see #9 (comment)): as of 4.8.0 the NetCDF-C library does have support for Zarr: https://twitter.com/zarr_dev/status/1379875020059635713

@rabernat

It will be great to have a pure C Zarr library. I really hope to see this happen.

There is a conceptual design issue that I have not sorted out. Here is how the Zarr spec defines a storage system:

A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string and a value is an arbitrary sequence of bytes, and the supported operations are read (get the sequence of bytes associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair).

This is fairly language agnostic. But in practice, the idea of a "key/value interface" is closely coupled to Python's MutableMapping interface, which is how all Zarr stores are implemented in Python today. I'm curious how people imagine implementing Zarr in languages such as C that do not have this sort of abstraction.

@jakirkham
Member Author

jakirkham commented Oct 22, 2018

Thanks @rabernat. Great point.

There are a number of ways we could imagine doing this. One might be to use a plugin architecture. This allows users to define their own particular mapping, with some way to initialize, register, and destroy each plugin. Each plugin can then provide its own functions for standard operations on its storage type (opening/closing a store, getting/setting/deleting a key, etc.).

To aggregate things a bit, we can imagine a generic struct for storage instances of any plugin, defined to hold some generic values (e.g. keys, plugin ID, etc.) and some additional untyped (void*) space for implementation-specific details. Dispatching a particular operation through a specific plugin could use the plugin ID to look up the plugin and perform the operation. Alternatively, the struct could contain function pointers to the specific plugin's API functions directly.
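For concreteness, here is a minimal sketch of what such a struct and its dispatch might look like (all names are hypothetical, not an actual Zarr C API):

#include <stddef.h>
#include <stdint.h>

typedef struct zarr_store zarr_store_t;

/* Dispatch table supplied by each storage plugin. */
typedef struct zarr_store_ops {
    int (*open)(zarr_store_t *store, const char *uri);
    int (*get)(zarr_store_t *store, const char *key, void **value, size_t *length);
    int (*set)(zarr_store_t *store, const char *key, const void *value, size_t length);
    int (*del)(zarr_store_t *store, const char *key);
    int (*close)(zarr_store_t *store);
} zarr_store_ops_t;

/* Generic storage instance: generic values plus untyped plugin state. */
struct zarr_store {
    uint32_t plugin_id;          /* identifies the registered plugin */
    const zarr_store_ops_t *ops; /* function pointers for this store type */
    void *impl;                  /* implementation-specific details */
};

/* The core library dispatches through the table, so it never needs to know
   which backend (filesystem, S3, in-memory, ...) it is talking to. */
static inline int zarr_store_get(zarr_store_t *s, const char *key,
                                 void **value, size_t *length)
{
    return s->ops->get(s, key, value, length);
}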

Though there are other options. Would be interesting to hear what others think about how best to implement this.

Edit: Here's a great article that goes into more detail about plugin architectures. It also discusses C++, though the basic functionality can be emulated in plain C.

@martindurant
Member

May be worth researching what tileDB does, which apparently can talk to various cloud back-ends also.

@aparamon

Thanks for extracting this issue!

One thing not to be overlooked is memory management. It is important that plugins never have to free memory allocated inside the Zarr lib, and vice versa. While malloc(), realloc(), and free() usually operate on the same process-global heap, some systems (Windows) make transferring pointer ownership across DLL boundaries totally unreliable.

The HDF5 solution is simple but flawed: plugins making use of H5allocate_memory(), H5resize_memory(), and H5free_memory() have to be linked against the HDF5 library; but the actual name/location of the run-time HDF5 lib, specifically on the most problematic systems (Windows), is never reliably known in advance (basically, for the same reasons that there is no system-wide glibc). So pre-built HDF5 compression plugins are of little use, as they are only applicable to a specific HDF5 instance and not shareable between e.g. HDFView, the HDF5 tools, client applications, etc.

Please, do not require Zarr plugins to link to Zarr C lib!

What the correct design is has yet to be discovered, but one could, e.g., pass a pointer to free() as part of the plugin API struct.
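For illustration, a minimal sketch (hypothetical names, not an existing API) of a codec plugin struct that receives the host's allocator on every call, so that no pointer ownership ever crosses the DLL boundary with a mismatched heap:

#include <stddef.h>

/* Allocator callbacks provided by the host library. */
typedef struct zarr_allocator {
    void *(*malloc_fn)(size_t size);
    void *(*realloc_fn)(void *ptr, size_t size);
    void  (*free_fn)(void *ptr);
} zarr_allocator_t;

typedef struct zarr_codec_plugin {
    const char *name;
    /* The host passes its allocator on every call; the plugin must use it
       for any buffer it hands back to the host. */
    int (*encode)(const zarr_allocator_t *alloc,
                  const void *in, size_t in_len, void **out, size_t *out_len);
    int (*decode)(const zarr_allocator_t *alloc,
                  const void *in, size_t in_len, void **out, size_t *out_len);
} zarr_codec_plugin_t;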


Another HDF5 C API problem that was at some point considered serious, but proved a non-issue in practice: C struct layout is in fact dependent on the compiler/compiler options.
In reality, the default struct ABI is compatible across all popular compilers (gcc, Clang, MSVC), and even more exotic beasts like Embarcadero Delphi (where structs are called records).
Still, it wouldn't hurt if the Zarr struct layout/ABI were completely specified wherever structs are used publicly in the API.

@rabernat

The folks at quantstack have been working very hard at writing numerical array stuff in C++, xtensor for example. They might have some good ideas about how to approach this. They might even want to work on it! ;)

Tagging @SylvainCorlay and @maartenbreddels to see if they want to weigh in.

@SylvainCorlay

SylvainCorlay commented Oct 23, 2018

Hi @rabernat thanks for the heads up!

Looking at zarr, an xtensor-based implementation of the spec seems like a very natural thing to do. xtensor expressions can be backed by any data structure or in-memory representations (or even filesystem operations or database calls). We would love to engage with you on that.

It turns out that there already exists a project involving xtensor and zarr by @constantinpape called z5py. It is not from the xtensor core team but Constantin has been engaging with us on gitter and in GitHub issues and PRs.

It could be interesting to discuss this in greater depth!

We are holding a public developer meeting for xtensor and related projects this Wednesday (5pm Paris time, 11am EST, 8am PST). Would you like to join us?

cc @wolfv @JohanMabille

@constantinpape

Hey,

just to chime in here: as @SylvainCorlay mentioned, z5 / z5py (https://github.com/constantinpape/z5) uses xtensor as its multiarray (and for the Python bindings).
z5 implements (most of) the zarr spec in C++ and z5py provides Python bindings to it.

We also have an issue on C bindings (see constantinpape/z5#68) with some
details discussed already.

If you are considering basing the zarr C bindings on this:

  • I would not mind handing over ownership to some other organization (zarr_developers or something else) if this is a concern (as long as I still have contributor rights).
  • I don't have much time right now to contribute a lot to implementing the C bindings, but I would happily assist anyone who tackles this.

@SylvainCorlay

@constantinpape @jakirkham is this something you would like to discuss at the xtensor meeting?

@constantinpape

@SylvainCorlay Unfortunately, I cannot make it today.
I am organising a course next week and have a lot to do setting things up.
Will have more time again in November and can spend a bit more time on these things then.

@SylvainCorlay

@constantinpape we should arrange a Skype meeting. I would love to dive into the internals of z5!

@constantinpape

Sounds great!
I have a pretty packed schedule for the rest of this week and next week, but I might be able to
squeeze something in.
I will contact you on gitter later and we can discuss details then.

@rabernat

Forgive me for asking an ignorant question...

For our purposes here, is C++ the same as C? Z5 already implements zarr in C++. Is an additional C implementation still necessary?

@aparamon

aparamon commented Oct 30, 2018

It's OK as long as the public interface is declared with extern "C" specifiers: such a library would be universally usable from Lisp, Pascal, Go, etc. An external C++ API is not universally usable.
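As a rough sketch of what that looks like in practice, a C-callable header over a C++ implementation might be declared like this (the zarrc_* names are hypothetical placeholders, not an existing z5 or Zarr API):

#ifndef ZARR_C_API_H
#define ZARR_C_API_H

#include <stddef.h>

#ifdef __cplusplus
extern "C" {
#endif

typedef struct zarrc_array zarrc_array;  /* opaque handle for C callers */

/* Open an array at the given store path; returns NULL on failure. */
zarrc_array *zarrc_open(const char *path);

/* Read one chunk into a caller-provided buffer; returns 0 on success. */
int zarrc_read_chunk(zarrc_array *arr,
                     const size_t *chunk_index, size_t ndim,
                     void *buffer, size_t buffer_len);

void zarrc_close(zarrc_array *arr);

#ifdef __cplusplus
}
#endif

#endif /* ZARR_C_API_H */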

@dopplershift

Another point: a pure C implementation only needs a C compiler, not a C++ compiler. This might be a consideration for simplifying the build environment for library clients. I'm thinking specifically about netcdf-c (ping @WardF).

@jakirkham
Member Author

jakirkham commented Oct 30, 2018

Thanks all for joining the conversation. Sorry for being slow to reply (was out sick most of last week).

Part of the reason I raised this issue is feedback we got from PR ( zarr-developers/zarr-python#276 ) specifically this comment (though there may have been other comments that I have forgotten), which suggested that we need to supply a pure C implementation of the spec (not using C++). Without knowing the details about the motivation, I can't comment on that (hopefully others can fill this in).

Though there are certainly a few reasons I can think of that one would want a pure C implementation: an easily accessible API/ABI, working on embedded systems, interfacing with other libraries, portability, easy-to-build language bindings, organizational requirements, etc.

Now there is certainly a lot of value in having a C++ implementation (and a very effective one at that), which I think @constantinpape and others have demonstrated. The C++ ecosystem is very rich and diverse, making it a great place to explore a spec like Zarr.

Certainly it is possible to wrap a C++ library and use extern "C", which we were discussing in issue (constantinpape/z5#68). Though it sounds like that doesn't meet the needs of all of our community members. If that's the case, going with C should address this. Not to mention it would be easy for various new and existing implementations to build off of the C implementation, either to quickly bootstrap themselves or to lighten their existing maintenance burden. So overall I see this as a win for the community.

@WardF

WardF commented Oct 30, 2018

@dopplershift and others are correct, for this to be something usable for netCDF, a pure C library is required. I suspect there are other projects out there which would benefit as well. The C++ interface is fantastic but wouldn't work for our needs.

@jakirkham
Member Author

Do you have any thoughts on the technical design of this implementation, @WardF? I was thinking about using plugins to handle different MutableMapping-style implementations at the C level. The same could probably be used for handling codecs for compression/decompression. Does that sound reasonable, or do you have different ideas on the direction we should go?

@WardF

WardF commented Nov 1, 2018

No, but after discussion with @DennisHeimbigner I think we need to sketch out what is needed from the netCDF point of view, infer what would be needed from a broader point of view, and see what the intersection is. I'll review the plugin guide that you linked to, thanks! I hadn't had a chance to review it yet. The focus of discussion on our end internally has been around the intersection of the data model and an I/O API; I'll let Dennis make his own points here, but he has pointed out several things I hadn't previously considered.

To answer your question, however, plugins do seem like a reasonable approach. In terms of using codecs for compression/decompression, I infer it would be something similar to how HDF5 uses compression/decompression plugins for compressed I/O? That would make sense in broad terms; there are considerations with regard to the netCDF community and what 'core' compression schemes to support, but from a technical standpoint it seems like a reasonable, path-of-lesser-resistance approach.

I will follow up with any additional thoughts on the technical design as they occur to me :).

@DennisHeimbigner

With respect to compression plugins: I would push for using the actual HDF5 filter plugin mechanism. That way we automatically gain access to a wide variety of compressors currently supporting HDF5. The HDF5 compression API is pretty generic: a (de-)compressor assumes that it is given a buffer of data and returns a buffer containing the (de-)compressed data.
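For reference, the filter callback declared in HDF5's H5Zpublic.h has roughly this shape; the filter receives a buffer, may reallocate it, and returns the number of valid output bytes (0 on failure):

#include <stddef.h>

typedef size_t (*H5Z_func_t)(unsigned int flags,              /* compress vs. decompress (H5Z_FLAG_REVERSE) */
                             size_t cd_nelmts,                /* number of client data values */
                             const unsigned int cd_values[],  /* client data (filter parameters) */
                             size_t nbytes,                   /* number of valid bytes in *buf */
                             size_t *buf_size,                /* in/out: allocated size of *buf */
                             void **buf);                     /* in/out: the data buffer */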

@DennisHeimbigner

A question. I hear the term "IO API", but I cannot figure out what that term means.
Can someone elaborate?

@aparamon

aparamon commented Nov 2, 2018

With respect to compression plugins: I would push for using the actual HDF5 filter plugin mechanism. That way we automatically gain access to a wide variety of compressors currently supporting HDF5. The HDF5 compression API is pretty generic: a (de-)compressor assumes that it is given a buffer of data and returns a buffer containing the (de-)compressed data.

It is an appealing possibility, but please also be aware of the HDF5 compression API drawbacks.

@DennisHeimbigner

I fail to see this as much of a problem from the netcdf-c point of view. If we build without hdf5, then we need to build in replacements for H5allocate_memory(), H5resize_memory(), and H5free_memory(), which seems easy enough.

Also, I do not understand this comment: "Please, do not require Zarr plugins to link to Zarr C lib". Since this is a question of dynamic libraries, I do not see the difficulties.

@aparamon

aparamon commented Nov 2, 2018

@DennisHeimbigner Ah, it means you are lucky enough not to be burdened with the peculiarities/ugliness of the "Windows way" :-)

Let's take HDF5 as the example. On GNU/Linux, filter plugins are easy enough: you build them against /usr/lib/libhdf5.so, drop the binary into HDF5_PLUGIN_PATH, and all the software (h5cat, HDFView, your applications) immediately starts to understand your compressed data. The plugin is universal: one binary fits all apps.

On my Windows laptop, I now have:

  • C:\Program Files (x86)\HDF_Group\HDF5\1.8.13\bin\hdf5.dll (32-bit, 1.8)
  • C:\Program Files (x86)\HDF_Group\HDF5\1.8.20\bin\hdf5.dll (32-bit, 1.8)
  • C:\Program Files\HDF_Group\HDF5\1.8.20\bin\hdf5.dll (64-bit, 1.8)
  • C:\Program Files\HDF_Group\HDFView\3.0.0\lib\hdf5_java.dll (64-bit, ???)
  • C:\Program Files\LLNL\VisIt 2.13.1\hdf5.dll (64-bit, ???)
  • C:\ProgramData\Anaconda3\Lib\site-packages\h5py\hdf5.dll (64-bit, 1.10)
  • C:\ACD 2017\hdf5.dll (32-bit, 1.8)
  • C:\ACD 2017\hdf5-1.10.dll (32-bit, 1.10)
    ... and the variety is only bounded by software authors' imagination.

To which DLL shall I link my plugin? Any answer will be sub-optimal, as it will only reliably teach that one program to understand your compressed data (and for C:\ACD 2017 it's even worse: note the two HDF5 DLL flavors actually used from the same process!).

This problem would not be present if plugin DLLs didn't link to the HDF5 DLL.

@WardF

WardF commented Nov 2, 2018

@aparamon Thank you for the illustrative example; this is something we will have to keep in mind, given the cross-platform nature of netCDF.

@jakirkham
Member Author

What if we had an environment variable that acted as a plugin search path? I think all the major platforms have some C API for loading libraries at runtime.
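A minimal sketch of that idea on POSIX systems, assuming a hypothetical ZARR_PLUGIN_PATH variable and plugin entry point (on Windows the equivalents would be LoadLibrary()/GetProcAddress()):

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef int (*zarr_plugin_register_fn)(void);

/* Load one plugin from the directory named in the environment variable and
   call its registration entry point. A real implementation would split the
   variable on ':' and scan each directory for shared libraries. */
static int load_plugin(const char *dir, const char *name)
{
    char path[4096];
    snprintf(path, sizeof path, "%s/%s", dir, name);

    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "could not load %s: %s\n", path, dlerror());
        return -1;
    }

    zarr_plugin_register_fn reg =
        (zarr_plugin_register_fn)dlsym(handle, "zarr_plugin_register");
    if (!reg) {
        dlclose(handle);
        return -1;
    }
    return reg();
}

int main(void)
{
    const char *search = getenv("ZARR_PLUGIN_PATH");  /* hypothetical variable */
    if (search)
        load_plugin(search, "libzarr_s3_store.so");   /* hypothetical plugin */
    return 0;
}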

@DennisHeimbigner

I see. You are correct. From the point of view of zarr embedded in netcdf-c, we only need to compile the filter against the netcdf-c library.

@DennisHeimbigner

I think HDF5 has such a variable, named something like HDF5_PLUGIN_PATH?
What I wish were the case is that the HDF5 filter struct, like the one below, had field(s) for memory allocation callbacks.

const H5Z_class2_t H5Z_BZIP2[1] = {{
    H5Z_CLASS_T_VERS,                          /* H5Z_class_t version */
    (H5Z_filter_t)H5Z_FILTER_BZIP2,            /* Filter id number */
    1,                                         /* encoder_present flag (set to true) */
    1,                                         /* decoder_present flag (set to true) */
    "bzip2",                                   /* Filter name for debugging */
    (H5Z_can_apply_func_t)H5Z_bzip2_can_apply, /* The "can apply" callback */
    NULL,                                      /* The "set local" callback */
    (H5Z_func_t)H5Z_filter_bzip2,              /* The actual filter function */
}};

@aparamon

aparamon commented Nov 3, 2018

@DennisHeimbigner You are correct, HDF5 has HDF5_PLUGIN_PATH (link), and additionally H5PLprepend, H5PLappend, H5PLinsert, etc. in later versions. That part works well.

Having memory allocation callbacks in H5Z_class_t doesn't seem to help, because multiple allocations/deallocations may be required during decompression of a single chunk (filter pipelines). The next filter must be able to realloc()/free() memory malloc()ed by previous filters.
Instead, the library could provide its "universal" allocation callbacks into every H5Z_func_t call. Hopefully the re-allocations are rare, so performance will not suffer much from the inability to inline malloc()/realloc()/free().

Please note that the optional (and rarely used) H5Z_can_apply_func_t and H5Z_set_local_func_t procedures suffer from the same principal architectural drawback: they require back-linking to the proper library in order to make use of dcpl_id, type_id, space_id. On Windows, that just doesn't work reliably.

The mutual-reference architecture used in HDF5 seems rather inelegant, but on *NIX systems it almost always works fine, due to the single instance of libhdf5.so typically present on the system. On Windows, that is never the case. A more elegant architecture would be to link to plugins only from the library and not vice versa: the library should directly tell the plugin all required information, without the need for additional call-backs.

The question is whether proper Windows support is worth the pain. As one data point, for my company it would still be desirable, but much less so than, say, 3-5 years ago.

@DennisHeimbigner

You are correct, I had it backwards. The filter needs to be given a table of memory allocation/free functions to use.

As for Windows support: for us (netCDF) this is almost essential, since we have a fair number of Windows users and I would be very reluctant to cut them out.

@DennisHeimbigner

It appears to be the case that many (most) contributed HDF5 filters are on GitHub in source form. So if we are forced to rebuild them, and we assume we can have some kind of wrapper for them if we need it, then the question might be: what is the minimal set of source changes we need to make to a filter's source to solve the memory allocation problem? Also, is there a wrapper that would help this process?

@DennisHeimbigner

I am currently working on the insertion of C-Language Zarr support into Unidata's netcdf-c library. The goal is to provide an extended Zarr that will more closely support the existing netcdf-4 (aka netcdf enhanced) data model.

We also intend to support two forms of interoperability with existing Zarr data and implementations.

  1. An existing Zarr implementation can read an NCZarr dataset.
  2. NCZarr can provide read-only access to existing standard Zarr datasets.

I have created a draft document that is an early description of the extensions of Zarr to what I am calling NCZarr. Ideally, this could eventually become some kind of official standard Zarr extension. It also describes how we propose to represent existing Zarr datasets (our second goal).

The document is currently kept here:
https://github.com/Unidata/netcdf-c/blob/cloud.tmp/docs/nczextend.md
It is currently in Doxygen markdown format, so there may be minor display glitches depending on your viewer.

@jakirkham
Member Author

Thanks for the update, @DennisHeimbigner. Will mull over this a bit. Have a few questions to start us off.

Could you please share briefly in what ways the NCZarr spec differs (a very rough overview/executive summary here is fine as people can go read the spec for details)? Do these changes overlap with what other groups are looking for (e.g. https://github.com/zarr-developers/zarr/issues/333 and NeurodataWithoutBorders/pynwb#230 or others)? Were there specific pain points you encountered and/or areas in which you were hoping Zarr could grow? It may even make sense to break these out into a series of GitHub issues that we can discuss independently. Though feel free to add them here first if that is easiest.

Also would be good to hear a bit more about how you are handling things like different key-value stores and compression algorithms. Are users free to bring their own and if so how? Will there be any of either that are preincluded?

@DennisHeimbigner

Could you please share briefly in what ways the NCZarr spec differs (a very rough overview/executive summary here is fine as people can go read the spec for details)?

I guess I had hoped this document would serve as that diff document :-)
I have additional documents giving additional characterizations of NCZarr, but they are not ready for prime time yet.

Do these changes overlap with what other groups are looking for (e.g. #333 and NeurodataWithoutBorders/pynwb#230 or others)?

Frankly I do not know in detail. My current goal is to get as close to the netcdf-4 data model as I can while maintaining a large degree of interoperability with the existing Zarr v2 spec. Analysis and comparison with the other proposed extensions is important, but probably should be a separate document.

Were there specific pain points you encountered and/or areas in which you were hoping Zarr could grow?

There are two such "pain points":

  1. The Zarr spec does not conform to the "write narrowly, read broadly" heuristic, in that it says that any annotations not specified in the Zarr spec are prohibited. It preferably should say that unrecognized keys/objects/etc. should be ignored.
  2. From the Unidata point of view, the inability to represent variable-length items is a significant problem. My discussion of the handling of variable-length strings shows, I think, the difficulties.

It may even make sense to break these out into a series of GitHub issues that we can discuss independently. Though feel free to add them here first if that is easiest.

I considered starting a new issue, but do not want to pollute the issue space
too much.

Also would be good to hear a bit more about how you are handling things like different key-value stores and compression algorithms. Are users free to bring their own and if so how? Will there be any of either that are preincluded?

I have separate internal architecture documents where I am describing how I propose to deal with those issues. But roughly, we emulate the existing Zarr implementation in providing an internal API that separates the key-value store from the core NCZarr code. I am currently basing it loosely on the Python MutableMapping API.

In one of these issues, we discussed the Filter problem. My current thinking is to provide an architecture similar to that provided by HDF5. There are known problems with this, so I expect that we will need to provide some extended form of the HDF5 approach. However, I have as a goal the ability to use existing HDF5 filters without change (this may not be possible).

@alimanfoo
Member

alimanfoo commented Jan 12, 2019 via email

@alimanfoo
Member

alimanfoo commented Jan 12, 2019 via email

@DennisHeimbigner

I must disagree. I am going from the spec. I consider the tutorial irrelevant.

these keys would just get ignored by a standard zarr implementation

Again, I must disagree. As I read the spec, this is currently specifically disallowed.

I have considered putting some of my extensions as attributes, so that is still an open question.
are not attributes in the netcdf-4 sense. T

@dopplershift

@alimanfoo Is there room to codify some of those items within the spec? Or is there something being misunderstood here?

@DennisHeimbigner

Well we are all feeling our way on this because we are in a complex design space,
so nothing is set in stone.

@jakirkham
Member Author

Sorry @DennisHeimbigner I must have missed it. What was the disagreement?

@rabernat

This NCZarr spec seems like a great development! Surely lots of people will have ideas and opinions (pinging @shoyer and @jhamman who have previously weighed in on this, e.g. DOC: zarr spec v3: adds optional dimensions and the "netZDF" format #276)

Regarding shared dimensions, we have already done something with this in an ad-hoc way in xarray. We chose to add an attribute called _ARRAY_DIMENSIONS to zarr arrays, which just lists the dimensions as a list (e.g. ['time', 'y', 'x']). This was all we needed to get zarr to work basically like netcdf (as far as xarray is concerned).
https://github.com/pydata/xarray/blob/master/xarray/backends/zarr.py#L14
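For illustration, the .zattrs of such an array contains something like the following (the values here are just an example):

{
    "_ARRAY_DIMENSIONS": ["time", "y", "x"]
}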

I know this isn't part of any spec. We just did it so we could get xarray to work with zarr. It may well turn out that this data can't be read properly by NCZarr (the price we pay for moving forward without a spec), but I thought I would at least mention this in hopes of some backwards compatibility.

@aparamon

Thank you @DennisHeimbigner for bringing it up!

Do I understand correctly that, in order to locate a type definition, the client is expected to walk up the directory/group structure, parsing .ztypedefs (if present) there, until the required type definition is found? If so, the design seems flexible and reasonable, although it is not clear to what extent it is compatible with Zarr logical paths (see N.B.).

From reading the Zarr spec it seems that, unlike unknown attributes, unknown files (e.g. .ztypedefs) in any directory are legal. Could the Zarr devs please confirm?

@alimanfoo
Member

alimanfoo commented Jan 12, 2019

I must disagree. I am going from the spec. I consider the tutorial irrelevant.

I'm guessing this is regarding storage of variable length strings (and other objects). The zarr Python implementation supports an object ('O') data type, but going back to the spec I see this is not mentioned anywhere. Apologies, this is an omission. The spec should state that an object ('O') data type is supported, and that this should be used for all variable length data types.

When encoding an array with an object data type, there are various options for how objects are encoded. If objects are strings then my recommendation would be to use the VLenUTF8 encoding, defined in numcodecs.

these keys would just get ignored by a standard zarr implementation

Again, I must disagree. As I read the spec, this is currently specifically disallowed.

I think the spec is reasonably clear about the fact that, within the .zarray metadata object, only certain keys are allowed. However, within the .zattrs metadata object you can use any key you like. And within the store, you can store other data under other keys like .zdims or whatever. However, I would still encourage using .zattrs for any extension metadata. This is what the xarray to_zarr() function does.

Sorry for brevity, happy to expand on any of this if helpful.

@alimanfoo
Member

I think the spec is reasonably clear about the fact that, within the .zarray metadata object, only certain keys are allowed. However, within the .zattrs metadata object you can use any key you like. And within the store, you can store other data under other keys like .zdims or whatever. However, I would still encourage using .zattrs for any extension metadata. This is what the xarray to_zarr() function does.

So just to be clear, the word "key" here is being used in two different ways. There are keys within the .zarray metadata objects. And there are keys that are used to store and retrieve data from the store.

@alimanfoo
Member

From reading Zarr spec it seems that unlike unknown attributes, unknown files (e.g. .ztypedefs) at any dir are legal. Could Zarr devs please confirm?

The zarr spec constrains what keys you are allowed to use within .zarray and .zgroup metadata objects. But you can use any key you like within .zattrs metadata objects.

And you can store other objects using store keys like ".ztypedefs", i.e., ".zarray" and ".zgroup" are reserved store keys, and you don't want to clash with chunk keys, but otherwise you can store data under any key you like. Although I would still recommend using .zattrs for metadata wherever possible.
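For example, a group "foo" containing an array "bar" might occupy store keys like the following sketch, where only the metadata objects and chunk keys are reserved, and an extension key such as ".ztypedefs" does not clash with anything:

foo/.zgroup
foo/.zattrs
foo/bar/.zarray
foo/bar/.zattrs
foo/bar/0.0
foo/bar/0.1
foo/bar/.ztypedefs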

Hth.

@shoyer

shoyer commented Jan 12, 2019

I have also been confused about how zarr supports variable length strings. I know the Python library can do it, but how such data is stored is not at all clear from reading the spec alone.

@shoyer

shoyer commented Jan 12, 2019

Similarly to what I wrote in #276, I would prefer for both dimensions and named type definitions to be standardized for zarr, as optional metadata fields. NetCDF imposes some high level consistency requirements, but at least dimension names (without consistency requirements) are universal enough that they could be a first class part of zarr's data model. Potentially these specs could be layered, e.g.,

  1. Base zarr spec
  2. Standard zarr extensions:
    a. dimensions spec
    b. named types spec
  3. NetCDF spec (1+2)

The reason why I'm advocating for layering these specs is that I think there's significant value in standardizing additional optional metadata fields. I think it would be much harder to convince non-geoscience users to use a full netCDF spec, but named dimensions alone would make their data much more self-describing.

We don't necessarily need to store these fields in .zarray, but I do like keeping them out of .zattrs to avoid name conflicts. We could also officially reserve some names in .zattrs for spec-specific metadata, e.g., all names starting with ".". The convention might be that names starting with "." are "hidden" attributes and should match the name of the specification, e.g., we'd use .netcdf for netcdf-specific metadata.
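Under that convention, a .zattrs might look something like this sketch of the proposal (nothing here is standardized):

{
    "units": "kelvin",
    ".netcdf": {
        "dimensions": ["time", "y", "x"]
    }
}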

  1. The Zarr spec does not conform to the "write narrowly, read broadly" heuristic, in that it says that any annotations not specified in the Zarr spec are prohibited. It preferably should say that unrecognized keys/objects/etc. should be ignored.

I agree that the Zarr spec should be updated in this way. In practice, it's hard for me to imagine a sensible implementation not adhering to this spec, but I still think it's good not to preclude the possibility of future (backwards compatible) extensions.

  2. From the Unidata point of view, the inability to represent variable-length items is a significant problem. My discussion of the handling of variable-length strings shows, I think, the difficulties.

Agreed that this is important. Given that we'll need to increment the Zarr spec to include discussion of the O dtype, this may be a good time to address other technically backwards incompatible changes (such as explicitly stating that unrecognized keys should be ignored).

@alimanfoo
Member

@alimanfoo Is there room to codify some of those items within the spec? Or is there something being misunderstood here?

Both I think. Yes definitely room to improve the spec in terms of clarifying anything implicit or not entirely clear.

@alimanfoo
Member

Agreed that this is important. Given that we'll need to increment the Zarr spec to include discussion of the O dtype, this may be a good time to address other technically backwards incompatible changes (such as explicitly stating that unrecognized keys should be ignored).

FWIW I'd like to edit the spec to explain the 'O' dtype, without incrementing the version number. It was implicit in the reference to numpy's dtypes, but it could be made explicit. I think it's just a clarification, not a backwards incompatible change.

@shoyer

shoyer commented Jan 12, 2019

FWIW I'd like to edit the spec to explain the 'O' dtype, without incrementing the version number. It was implicit in the reference to numpy's dtypes, but it could be made explicit. I think it's just a clarification, not a backwards incompatible change.

The NumPy O dtype stores references to Python objects. I don't think the meaning of a "Python object" is at all obvious for a serialization format.

As one example of how this is an incompatible deviation: every dtype explicitly listed in Zarr's spec has a fixed size, so an implementation of Zarr might reasonably assume that array elements always have a fixed size.

@DennisHeimbigner

If I understand this correctly, then it is being proposed to include Python-specific objects in the Zarr spec. Do I understand correctly?

@DennisHeimbigner

The question was raised if we could store NCzarr specific metadata as attributes rather than in e.g. .zdims.

The answer is yes because that is exactly how it is done with the existing netcdf-4 over HDF5 implementation. One reason for this is because we had no way to access a lower level representation that can co-exist with the HDF5 storage format.

That is not true for the Zarr spec. That spec pretty much exposes the underlying S3-like key-value-pair (aka KVP) storage format. So I exploited that exposure to store the NCZarr metadata at the KVP level.

As an aside, the Zarr spec should probably be divided into two parts: (1) a specification of the Zarr data model and (2) a specification of the underlying KVP model and the mapping of Zarr to that model. Instead, the current spec combines both 1 and 2 into a single document. It is not explicit about the KVP model, but it should be.

In any case, ideally, NCZarr should extend Zarr, not the underlying KVP. This would suggest that using Zarr-level attributes is better than using KVP-level objects like .zdims.

If one uses Zarr-level attributes to store netcdf-4 metadata (like our existing HDF5 implementation), then the cost is that if some other pure-Zarr reader looks at the NCZarr dataset, it will see a bunch of attributes that mean nothing to it. You can see this for netcdf-4 over HDF5 by doing an h5dump command on a netcdf-4 file. There are a number of HDF5 objects that are netcdf-4-specific and are just irritating noise to any client reading the file as an HDF5 file.

The saving grace is that it turns out to be uncommon for HDF5 code to be reading a netcdf-4 file. I am not sure whether that is going to be the case with Zarr. One of the goals for NCZarr is to be able to have existing Zarr readers access NCZarr datasets, so I expect that NCZarr datasets will be routinely read by pure-zarr software. So I do not know whether this extra metadata will be more than a minor issue.

@jakirkham
Member Author

It sounds like there are a few different needs that are diverging from the question of building a pure C implementation of Zarr. I would suggest we open some new issues for these specific needs. Does this sound reasonable?

@DennisHeimbigner

OK, I am opening an NCZarr-specific issue to continue the discussion:
https://github.com/zarr-developers/zarr/issues/388

@alimanfoo
Member

alimanfoo commented Jan 14, 2019 via email

@alimanfoo
Member

alimanfoo commented Jan 14, 2019 via email

@alimanfoo
Member

I've created an issue on the new zarr spec repo to follow up discussion of the object data type: zarr-developers/zarr-specs#6

@alimanfoo alimanfoo transferred this issue from zarr-developers/zarr-python Jul 3, 2019
@steven-varga

Great discussion! I am wondering if the interest in a C API is still alive? Also, has anyone considered making a zarr C API compatible with the existing HDF5 C API? Curious to hear your opinions on the pros and cons.
best wishes
steven
h5cpp.org

@joshmoore
Member

joshmoore commented May 19, 2021

Since much of this discussion has now migrated to zarr-developers/zarr-specs#41, "NCZarr - Netcdf Support for Zarr", which is concerned with the finer details of supporting the full NetCDF format, I'd like to point out on this issue that as of 4.8.0 the NetCDF-C library does have support for Zarr: https://twitter.com/zarr_dev/status/1379875020059635713

If others have C libraries they would like to list, please feel free to add them.
