v2: interpretation of files outside of the spec #112

Open
joshmoore opened this issue Mar 29, 2021 · 13 comments

@joshmoore
Member

see: ome/ngff#27

The Zarr specification currently defines only four leaf nodes: .zarray, .zgroup, .zattrs, and chunks.

One OME-Zarr client (bioformats2raw) currently places an XML file within the Zarr hierarchy. This can be accessed at the Store level via:

text = store["METADATA.xml"].decode() # manually decode ascii

but not from the given group. Questions:

  • Was there previously any discussion about possible Zarr-internal interpretations of files?
  • Does anyone else have similar use cases?
  • If this is generally valuable, would it break anyone else's use cases to define such files as "1-dimensional non-chunked arrays"?
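
As a rough illustration of the store-level access above: a Zarr v2 store behaves like a mapping from key strings to bytes, so a plain `dict` stands in for one here and no Zarr library is involved. The XML payload is a placeholder, and `as_1d_array` is a hypothetical helper showing one possible reading of "1-dimensional non-chunked array", not an existing API.

```python
# A plain dict stands in for a Zarr v2 store (key -> bytes); contents are illustrative.
store = {
    ".zgroup": b'{"zarr_format": 2}',
    ".zattrs": b"{}",
    "METADATA.xml": b"<OME></OME>",  # placeholder payload, not real OME-XML
}

# Store-level access works for any key, including keys the spec does not define.
text = store["METADATA.xml"].decode("ascii")


def as_1d_array(raw: bytes) -> list[int]:
    """Hypothetical: read a file as a 1-d, non-chunked array of uint8 values."""
    return list(raw)


values = as_1d_array(store["METADATA.xml"])
print(text, len(values))
```

Under this reading the array has no metadata of its own: its shape is simply the byte length of the file.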

cc: @manzt

@joshmoore
Member Author

joshmoore commented Apr 8, 2021

Summary of yesterday's community call discussion on this topic:

  1. Options for the v2 spec:
    a. Close the loophole by disallowing other files ("MUST NOT place additional files..." in v2.rst)
    b. No change (i.e. unspecified)
    c. Specify (in v2.rst) how such files should be interpreted (e.g. as a 1-d array without metadata)
    d. Specify ... that such files should be ignored.
  2. Options for the v3 spec:
    a. Disallow
    b. No change (i.e. unspecified)
    c. Write an extension for files
    d. Specify (in this repo) the behavior in the core, e.g. that such files should be interpreted as 1-d arrays
    e. Specify (in this repo) that the files should be ignored.
  3. In the case of 1a-b and 2a-b, an additional option would be to encourage (SHOULD) the use of a more official method for storing files:
    a. serialized as a string in .zattrs
    b. as a full .zarray

The feeling was that this use case may be too fringe for backporting to v2. There was some seemingly satisfied mumbling around 2c. (an extension). Other opinions?
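
For comparison, options 3a and 3b might look roughly like the sketch below. A `dict` again stands in for a v2 store. The attribute name `metadata_xml` and the array name `METADATA` are assumptions made for the example; nothing in the thread fixes them.

```python
import json

xml_text = "<OME></OME>"  # placeholder payload, not real OME-XML
raw = xml_text.encode("ascii")

# Option 3a: keep the content as a string attribute of the group.
# The attribute name "metadata_xml" is an assumption for this sketch.
store_3a = {
    ".zgroup": json.dumps({"zarr_format": 2}).encode("ascii"),
    ".zattrs": json.dumps({"metadata_xml": xml_text}).encode("ascii"),
}
recovered_3a = json.loads(store_3a[".zattrs"])["metadata_xml"]

# Option 3b: store the bytes as a full 1-d uint8 array with a single,
# uncompressed chunk. The array name "METADATA" is an assumption.
zarray = {
    "zarr_format": 2,
    "shape": [len(raw)],
    "chunks": [len(raw)],
    "dtype": "|u1",
    "compressor": None,
    "fill_value": 0,
    "order": "C",
    "filters": None,
}
store_3b = {
    ".zgroup": json.dumps({"zarr_format": 2}).encode("ascii"),
    "METADATA/.zarray": json.dumps(zarray).encode("ascii"),
    "METADATA/0": raw,  # the only chunk; uncompressed, so it is the file itself
}
recovered_3b = store_3b["METADATA/0"].decode("ascii")
```

Both variants stay within the four leaf node types, at the cost of no longer having a file that other tools can open directly by name.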

cc: @alimanfoo


p.s.: one thing I had not realized is that with the separate metadata/data hierarchies in v3, a file can also have metadata attached to it (mimetype, etc.). Thanks, @jakirkham.

@rabernat
Contributor

rabernat commented Apr 8, 2021

Just noting that the newly announced "nczarr" spec also uses additional files in the zarr hierarchy, so its compliance with the Zarr v2 spec is ambiguous. cc @DennisHeimbigner

@DennisHeimbigner

I recall discussing this on the zarr telecon some months ago when I first joined.
At that time, I asked that the spec be changed to say that unrecognized objects
be ignored. I do not recall any resolution.
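
A minimal sketch of what "ignore unrecognized objects" could mean for a reader, assuming a v2 store modeled as a `dict` and the default `.` dimension separator for chunk keys. The function names are hypothetical. The check is purely name-based, so a stray file that happens to be named like a chunk key would still be treated as recognized.

```python
import re

# v2 chunk keys are dot-separated integers such as "0" or "1.0.3"
# (assumes the default "." dimension separator).
CHUNK_KEY = re.compile(r"^\d+(\.\d+)*$")
METADATA_NAMES = {".zarray", ".zgroup", ".zattrs"}


def is_recognized(key: str) -> bool:
    """Name-based check only; does not verify that a chunk sits inside an array."""
    name = key.rsplit("/", 1)[-1]
    return name in METADATA_NAMES or CHUNK_KEY.match(name) is not None


def recognized_keys(store) -> list[str]:
    """Return the keys a reader would interpret, silently skipping the rest."""
    return sorted(k for k in store if is_recognized(k))


example_store = {
    ".zgroup": b"",
    "METADATA.xml": b"",
    "a/.zarray": b"",
    "a/0.0": b"",
    "a/notes.txt": b"",
}
kept = recognized_keys(example_store)
ignored = sorted(set(example_store) - set(kept))
```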

@joshmoore
Member Author

Thanks for the reminder, @DennisHeimbigner. I'll add that as an additional option.

@joshmoore
Member Author

Managed to discuss this issue with @alimanfoo today. The original intent in the spec was to be hacker friendly and not prohibit storing other keys as long as they don't negatively influence the interpretation of existing objects.

On the question of round-tripping to HDF5, we would need to suggest a specification/protocol on the HDF5 side to make this feasible.

Looking forward to comments, but as this starts to crystallize, I'll open a PR to suggest v2.rst language (if no one beats me to it).

cc: @manzt @JacksonMaxfield @chris-allan

@DennisHeimbigner

> On the question of round-tripping to HDF5, we would need to suggest a specification/protocol on the HDF5 side to make this feasible.

Can someone clarify what is meant by the HDF5 round-tripping? Is there
an issue describing this?

@joshmoore
Member Author

@DennisHeimbigner: No, mea culpa. If you convert a Zarr file to HDF5, zarrays, zgroups and attributes all have a clear mapping. As far as I can tell, these additional files do not. One could certainly map them to a dataset on the HDF5 side, but then when mapping back to Zarr, without additional metadata, they will become zarrays rather than additional keys.
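
The round-trip argument can be reduced to a toy model. The sketch below uses plain strings for node kinds and involves no real Zarr or HDF5 library; the label `extra_file` and both mapping tables are assumptions that simply restate the paragraph above.

```python
# Toy model only: node kinds as strings, no real HDF5 or Zarr library involved.
ZARR_TO_HDF5 = {
    "zgroup": "group",
    "zarray": "dataset",
    "zattrs": "attributes",
    "extra_file": "dataset",  # assumed: a dataset is the only plausible target
}
HDF5_TO_ZARR = {
    "group": "zgroup",
    "dataset": "zarray",  # without extra metadata, every dataset becomes a zarray
    "attributes": "zattrs",
}


def round_trip(kind: str) -> str:
    """Map a Zarr node kind to HDF5 and back again."""
    return HDF5_TO_ZARR[ZARR_TO_HDF5[kind]]


# Kinds that do not come back as themselves are lost in the round trip.
lossy = {kind for kind in ZARR_TO_HDF5 if round_trip(kind) != kind}
```

Making the trip lossless would need some marker on the HDF5 side that distinguishes "dataset that was a raw file" from an ordinary dataset, which is the specification/protocol mentioned earlier in the thread.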

@DennisHeimbigner

My speculation is that translating HDF5 will run into the same extra metadata issues I have been addressing in NCZarr/netcdf-c. As far as I can tell, the two options are to add extra metadata as new objects or to add new keys to existing objects. Or a combination.

If the HDF5 round-trip issue reaches critical mass, then perhaps we should consider a conversation on what approach to adding extra metadata is best with respect to other implementations of Zarr.

@joshmoore
Member Author

Correct me if I'm wrong, @DennisHeimbigner, but it sounds like the issue you're pointing out is that HDF5 is a superset of Zarr, right? 👍 for having a conversation about those mappings so that the round-tripping in the other direction isn't lossy. As far as I know, though, this is the only case of Zarr allowing something that's not possible to describe in HDF5.

@shoyer

shoyer commented May 6, 2021

.zmetadata for consolidated metadata (supported by Zarr-Python) is another good example of such extra files: zarr-developers/zarr-python#720
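
For readers unfamiliar with consolidated metadata, here is a rough sketch of how such a `.zmetadata` key could be built by hand. It uses a `dict` as the store and only the standard library. The JSON layout (`zarr_consolidated_format` plus a `metadata` mapping) reflects my understanding of what Zarr-Python writes and should be checked against that implementation; `consolidate` is a hypothetical helper, not the library function.

```python
import json

METADATA_NAMES = (".zarray", ".zgroup", ".zattrs")


def consolidate(store) -> dict:
    """Collect every .zarray/.zgroup/.zattrs document into one JSON object."""
    metadata = {
        key: json.loads(value)
        for key, value in store.items()
        if key.rsplit("/", 1)[-1] in METADATA_NAMES
    }
    # Layout assumed from Zarr-Python's consolidated metadata; verify before relying on it.
    return {"zarr_consolidated_format": 1, "metadata": metadata}


store = {
    ".zgroup": b'{"zarr_format": 2}',
    "a/.zgroup": b'{"zarr_format": 2}',
    "a/.zattrs": b'{"units": "m"}',
    "a/notes.txt": b"not metadata",  # an extra file; left out of the summary
}

# The consolidated document is itself one more key outside the four leaf types.
store[".zmetadata"] = json.dumps(consolidate(store)).encode("ascii")
```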

@joshmoore
Member Author

@shoyer, yup, but that's certainly one I could see adding to v2.rst to explain the format in case other implementations want to add support.

@adair-kovac

I just came across this thread and wanted to chime in that we've been taking advantage of the fact that it's fine to store other file types in a zarr group. One use case is adding root groups on an ad-hoc basis: if you have a bunch of zarr groups that were written separately to a shared directory, all you have to do to combine them into one store is to zarr.open the parent directory.

E.g. in one case we had some data on S3 like this:

root -> subdir1 -> hdf5
     -> subdir2 (.zgroup) -> various zarr groups

We converted the hdf5 data in subdir1 to zarr, left the hdf5 for backwards compatibility, and put a .zgroup file into the root directory. So now the whole S3 bucket is a zarr store without affecting users of other data types that were already there.

root (.zgroup) -> subdir1 (.zgroup) -> hdf5
                                    -> converted data (.zgroup) -> zarr arrays
               -> subdir2 (.zgroup) -> various zarr groups                    
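
The step described above, dropping a `.zgroup` file into existing directories, can be sketched with the standard library alone. This version works on a temporary local directory rather than S3; the file name `legacy.h5` and the helper `make_group` are made up for the example.

```python
import json
import tempfile
from pathlib import Path


def make_group(directory: Path) -> Path:
    """Mark an existing directory as a Zarr v2 group by adding a .zgroup file."""
    marker = directory / ".zgroup"
    marker.write_text(json.dumps({"zarr_format": 2}))
    return marker


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Starting layout: subdir1 holds HDF5 data, subdir2 is already a zarr group.
    (root / "subdir1").mkdir()
    (root / "subdir1" / "legacy.h5").write_bytes(b"")  # stand-in for real HDF5 data
    (root / "subdir2").mkdir()
    make_group(root / "subdir2")

    # Turn subdir1 and the root into groups; existing files are left untouched.
    make_group(root / "subdir1")
    make_group(root)

    groups = sorted(
        p.parent.relative_to(root).as_posix() for p in root.rglob(".zgroup")
    )
    legacy_kept = (root / "subdir1" / "legacy.h5").exists()
```

After this, opening the parent directory with a Zarr library should expose `subdir1` and `subdir2` as child groups, while the HDF5 file remains readable by its existing users.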

@joshmoore
Member Author

Thanks for the use case, @adair-kovac!
