v2: interpretation of files outside of the spec #112

joshmoore · 2021-03-29T11:23:23Z

The Zarr specification currently only defines 4 leaf nodes: .zarray, .zgroup, .zattrs, and chunks.

One OME-Zarr client (bioformats2raw) currently places an XML file within the Zarr hierarchy. This can be accessed at the Store level via:

text = store["METADATA.xml"].decode() # manually decode ascii

but not from the given group. Questions:

Was there previously any discussion about possible Zarr-internal interpretations of files?
Does anyone else have similar use cases?
If this is generally valuable, would it break anyone else's use cases to define such files as "1-dimensional non-chunked arrays"?
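
To make the question concrete, here is a minimal sketch of how a client might separate spec-defined keys from extra files at the store level. It assumes a plain `dict` standing in for a Zarr v2 store (a mutable mapping from keys to bytes) and the default `.` chunk-key separator. `is_spec_key` and `extra_keys` are hypothetical helpers, not part of zarr-python, and the check is a name-based heuristic only: it does not verify that a chunk-like key actually sits under an array.

```python
import re

# Metadata keys defined by the Zarr v2 spec.
SPEC_METADATA = {".zarray", ".zgroup", ".zattrs"}
# Chunk keys with the default "." separator, e.g. "0", "0.0", "12.3.4".
CHUNK_KEY = re.compile(r"^\d+(\.\d+)*$")


def is_spec_key(key: str) -> bool:
    """Return True if the last path segment is a v2 metadata or chunk key."""
    name = key.rsplit("/", 1)[-1]
    return name in SPEC_METADATA or CHUNK_KEY.match(name) is not None


def extra_keys(store) -> list:
    """List keys in the store that the v2 spec does not describe."""
    return sorted(k for k in store if not is_spec_key(k))


# A plain dict stands in for a store; values are raw bytes.
store = {
    ".zgroup": b'{"zarr_format": 2}',
    "METADATA.xml": b"<OME/>",
    "0/.zarray": b"{}",  # placeholder metadata, not a valid .zarray
    "0/0.0": b"\x00\x00",
}

print(extra_keys(store))  # ['METADATA.xml']
text = store["METADATA.xml"].decode()  # store-level access to the extra file
```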

cc: @manzt


joshmoore · 2021-04-08T12:31:42Z

Summary of yesterday's community call discussion on this topic:

1. Options for the v2 spec:
   a. Close the loophole by disallowing other files ("MUST NOT place additional files..." in v2.rst).
   b. No change (i.e. unspecified).
   c. Specify (in v2.rst) how such files should be interpreted (e.g. as a 1-d array without metadata).
   d. Specify ... that such files should be ignored.
2. Options for the v3 spec:
   a. Disallow.
   b. No change (i.e. unspecified).
   c. Write an extension for files.
   d. Specify (in this repo) the behavior in the core, e.g. that such files should be interpreted as 1-d arrays.
   e. Specify (in this repo) that the files should be ignored.
3. In the case of 1a-b and 2a-b, an additional option would be to encourage (SHOULD) the use of a more official method for storing files:
   a. serialized as a string in .zattrs
   b. as a full .zarray

The feeling was that this use case may be too fringe for backporting to v2. There was some seemingly satisfied mumbling around 2c. (an extension). Other opinions?
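
The "interpret as a 1-d array" and "ignore" options can be contrasted in a short sketch. It assumes a `dict` standing in for a store. `interpret_as_1d` and `ignore_unknown` are hypothetical names, and the synthesized metadata is only one possible reading of "a 1-d array without metadata", not language from the spec.

```python
def interpret_as_1d(store, key):
    """Treat an extra file as a single-chunk, uncompressed 1-d array of bytes."""
    data = store[key]
    n = len(data)
    # Fields mirror those of a v2 .zarray document; the values are synthesized.
    meta = {
        "zarr_format": 2,
        "shape": [n],
        "chunks": [n],
        "dtype": "|u1",
        "compressor": None,
        "filters": None,
        "fill_value": 0,
        "order": "C",
    }
    return meta, data


def ignore_unknown(store, known=(".zarray", ".zgroup", ".zattrs")):
    """Keep only the metadata keys a reader recognizes; skip everything else."""
    return {k: v for k, v in store.items() if k.rsplit("/", 1)[-1] in known}


store = {".zgroup": b'{"zarr_format": 2}', "METADATA.xml": b"<OME/>"}

meta, data = interpret_as_1d(store, "METADATA.xml")
print(meta["shape"])                  # [6]
print(sorted(ignore_unknown(store)))  # ['.zgroup']
```

Under the first reading the XML file becomes visible as an array of six bytes; under the second it stays in the store but a conforming reader never surfaces it.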

cc: @alimanfoo

p.s.: one thing I had not realized is that with the separate metadata/data hierarchies in v3, a file can also have metadata attached to it (mimetype, etc.) Thanks, @jakirkham.

rabernat · 2021-04-08T13:28:07Z

Just noting that the newly announced "nczarr" spec also uses additional files in the zarr hierarchy, so its compliance with the Zarr v2 spec is ambiguous. cc @DennisHeimbigner

DennisHeimbigner · 2021-04-08T19:24:47Z

I recall discussing this on the zarr telecon some months ago when I first joined.
At that time, I asked that the spec be changed to say that unrecognized objects
be ignored. I do not recall any resolution.

joshmoore · 2021-04-09T08:50:19Z

Thanks for the reminder, @DennisHeimbigner. I'll add that as an additional option.

joshmoore · 2021-04-23T14:16:49Z

Managed to discuss this issue with @alimanfoo today. The original intent in the spec was to be hacker friendly and not prohibit storing other keys as long as they don't negatively influence the interpretation of existing objects.

On the question of round-tripping to HDF5, we would need to suggest a specification/protocol on the HDF5 side to make this feasible.

Looking forward to comments, but as this starts to crystallize, I'll open a PR to suggest v2.rst language (if no one beats me to it).

cc: @manzt @JacksonMaxfield @chris-allan

DennisHeimbigner · 2021-04-23T23:29:52Z

> On the question of round-tripping to HDF5, we would need to suggest a specification/protocol on the HDF5 side to make this feasible.

Can someone clarify what is meant by the HDF5 round-tripping? Is there
an issue describing this?

joshmoore · 2021-04-24T08:55:49Z

@DennisHeimbigner: No, mea culpa. If you convert a Zarr file to HDF5, zarrays, zgroups and attributes all have a clear mapping. As far as I can tell, these additional files do not. One could certainly map them to a dataset on the HDF5 side, but then when mapping back to Zarr, without additional metadata, they will become zarrays rather than additional keys.

DennisHeimbigner · 2021-04-24T17:21:16Z

My speculation is that translating HDF5 will run into the same extra metadata issues I have been addressing in NCZarr/netcdf-c. As far as I can tell, the two options are to add extra metadata as new objects or to add new keys to existing objects. Or a combination.

If the HDF5 round-trip issue reaches critical mass, then perhaps we should consider a conversation on what approach to adding extra metadata is best with respect to other implementations of Zarr.

joshmoore · 2021-04-26T06:28:33Z

Correct me if I'm wrong, @DennisHeimbigner, but it sounds like the issue you're pointing out is that HDF5 is a superset of Zarr, right? 👍 for having a conversation about those mappings so that the round-tripping in the other direction isn't lossy. As far as I know though, this is the only case of Zarr allowing something that's not possible to describe in HDF5.

shoyer · 2021-05-06T03:26:40Z

.zmetadata for consolidated metadata (supported by Zarr-Python) is another good example of such extra files: zarr-developers/zarr-python#720

joshmoore · 2021-05-06T14:31:46Z

@shoyer, yup, but that's certainly one I could see adding to v2.rst to explain the format in case other implementations want to add support.
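
For readers unfamiliar with `.zmetadata`, the sketch below builds a document in the layout zarr-python writes for consolidated metadata: a `zarr_consolidated_format` version plus a `metadata` mapping from each metadata key to its parsed JSON. The `dict` store and the helper name `consolidate` are assumptions for illustration; zarr-python's own entry point is `zarr.consolidate_metadata`.

```python
import json

METADATA_NAMES = (".zarray", ".zgroup", ".zattrs")


def consolidate(store):
    """Collect all metadata documents in the store into one JSON blob."""
    metadata = {
        key: json.loads(value)
        for key, value in store.items()
        if key.rsplit("/", 1)[-1] in METADATA_NAMES
    }
    doc = {"zarr_consolidated_format": 1, "metadata": metadata}
    return json.dumps(doc, indent=4, sort_keys=True).encode("ascii")


store = {
    ".zgroup": b'{"zarr_format": 2}',
    "image/.zattrs": b'{"units": "nm"}',
    "image/0.0": b"\x00",  # chunk data is not consolidated
}
# The result is itself an extra key from the point of view of the v2 spec.
store[".zmetadata"] = consolidate(store)

print(sorted(json.loads(store[".zmetadata"])["metadata"]))
# ['.zgroup', 'image/.zattrs']
```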

adair-kovac · 2021-07-21T21:01:59Z

I just came across this thread and wanted to chime in that we've been taking advantage of the fact that it's fine to store other file types in a zarr group. One use case is adding root groups on an ad-hoc basis: it's nice that if you have a bunch of zarr groups that were written separately to a shared directory, all you have to do to combine them into one store is to zarr.open the parent directory.

E.g. in one case we had some data on S3 like this:

root -> subdir1 -> hdf5
     -> subdir2 (.zgroup) -> various zarr groups

We converted the hdf5 data in subdir1 to zarr, left the hdf5 for backwards compatibility, and put a .zgroup file into the root directory. So now the whole S3 bucket is a zarr store without affecting users of other data types that were already there.

root (.zgroup) -> subdir1 (.zgroup) -> hdf5
                                    -> converted data (.zgroup) -> zarr arrays
               -> subdir2 (.zgroup) -> various zarr groups
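
A rough sketch of that layout on a local filesystem, using only the standard library: a directory becomes a v2 group once it contains a `.zgroup` file, and any non-Zarr files next to it are left alone. The file name `legacy.h5`, the directory name `converted`, and the helpers `make_group` and `child_groups` are placeholders for illustration, not the names used in the actual bucket.

```python
import json
import tempfile
from pathlib import Path


def make_group(path: Path) -> None:
    """Mark a directory as a Zarr v2 group by writing a .zgroup file."""
    path.mkdir(parents=True, exist_ok=True)
    (path / ".zgroup").write_text(json.dumps({"zarr_format": 2}))


def child_groups(path: Path) -> list:
    """Names of immediate subdirectories that contain a .zgroup file."""
    return sorted(p.name for p in path.iterdir() if (p / ".zgroup").is_file())


root = Path(tempfile.mkdtemp())

# Pre-existing layout: a non-Zarr file next to an independently written group.
(root / "subdir1").mkdir()
(root / "subdir1" / "legacy.h5").write_bytes(b"")  # placeholder, not real HDF5
make_group(root / "subdir2")

# Combine: mark the parents as groups and add a group for the converted data.
make_group(root)
make_group(root / "subdir1")
make_group(root / "subdir1" / "converted")

print(child_groups(root))              # ['subdir1', 'subdir2']
print(child_groups(root / "subdir1"))  # ['converted']
```

The HDF5 placeholder is still on disk afterwards; it is simply not reported as a group member.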

joshmoore · 2021-08-03T09:18:31Z

Thanks for the use case, @adair-kovac!

joshmoore mentioned this issue Mar 29, 2021

OME-XML equivalent data ome/ngff#27

Open

joshmoore mentioned this issue May 6, 2021

NCZarr - Netcdf Support for Zarr #41

Open

joshmoore mentioned this issue May 14, 2021

Storing arbitrary objects in a zarr directory zarr-developers/zarr-python#750

Open

joshmoore mentioned this issue Feb 25, 2022

OME Metadata Support ome/ngff#104

Open

2 tasks

joshmoore mentioned this issue Apr 5, 2022

read_zarr_dataset errors on non-zarr content napari/napari#4358

Open

jstriebel added the v2 label Nov 16, 2022
