v2 spec: add optional dimension_separator (see #707) #715

joshmoore · 2021-04-08T08:23:08Z

Various implementations allow for defining the separator between the
dimension indexes when writing chunks:

n5-zarr defines a dimensionSeparator parameter;
zarr-python's NestedDirectoryStore does so by default
and FSStore provides a key_separator parameter;
tensorstore has a key_encoding parameter; and
jzarr is looking to add the same functionality.

When writing an array, it is straight-forward to set this separator
and have arrays properly configured. Consumers of such arrays,
however, must either know a priori if their arrays use a
non-default separator or must loop through all possible chunks keys
searching for the right one.

By defining adding an optional metadata key to the .zarray, we:

preserve the efficient configuration of arrays
while keeping the v2 spec backwards compatible.

The primary downsides are that this will be the first optional metadata
value in the v2 spec and therefore we don't have a strong understanding of how that will play out, and datasets which were previously written with non-default separators will need updating in order to enable the detection though that is no worse than the current situation.

cc: @alimanfoo @chris-allan @SabineEmbacher @jbms @axtimwalde @manzt @gzuidhof @haileyajohnson @DennisHeimbigner @zarr-developers/core-devs

(If I've missed any Zarr v2 implementors, please add them)

Various implementations allow for defining the separator between the dimension indexes when writing chunks: * n5-zarr defines a `dimensionSeparator` parameter; * zarr-python's NestedDirectoryStore does so by default * and FSStore provides a `key_separator` parameter; * tensorstore has a `key_encoding` parameter; and * jzarr is looking to add the same functionality. When writing an array, it is straight-forward to set this separator and have arrays properly configured. Consumers of such arrays, however, must either know *a priori* if their arrays use a non-default separator or must loop through all possible chunks keys searching for the right one. By defining adding an optional metadata key to the .zarray, we: * preserve the efficient configuration of arrays * while keeping the v2 spec backwards compatible. The primary downsides are that this will be the first optional metadata value in the v2 spec and therefore we don't have a strong understanding of how that will play out, and datasets which were previously written with non-default separators will need updating in order to enable the detection though that is no worse than the current situation.

martindurant · 2021-04-08T13:35:26Z

So presumably you would be able to still pass the separator argument OR look for the metadata key. The first would take priority and would remain the recommendation for the current and older versions of zarr and for datasets that were written by them.

joshmoore · 2021-04-08T13:38:28Z

@martindurant: exactly. Wasn't sure if that was too much for this paragraph, but I can look for a spot to go into more details. I imagine if someone passes a separator that doesn't match the expected one we could warn in mode="r". For mode="a|w", I'd personally favor an exception but an alternative would be to update the metadata, effectively hiding the existing chunks.

martindurant · 2021-04-08T13:43:21Z

It would be reasonable to supply a helper function somewhere to convert an existing array from one style of metadata to another.

jbms · 2021-04-08T16:24:11Z

I'm happy with this and can add support to tensorstore and neuroglancer. Is this the confirmed metadata format or are we waiting on additional discussion? I'd rather not add this only to then have to also support something else in the future.

martindurant · 2021-04-08T16:30:24Z

I think we are hoping for more +1 from the other maintainers.
I think I'd be OK with merging if there is no opposition within, say, a week.

DennisHeimbigner · 2021-04-08T18:55:44Z

The change in the spec says this:

The most common non-default value is "/", which leads to nested keys of the form "0/0" that in most implementations produces a directory structure.

I think the last part "that in most implementations produces a directory structure" should be removed since it is inserting implementation claims into what should be an implementation free document.

martindurant · 2021-04-08T19:04:13Z

"that in most implementations produces a directory structure" should be removed

Would different wording help, like "in some implementation is a special character used to mark directories" or "the directory delimiter in posix-like storage", or "commonly used as a directory delimiter" ?

DennisHeimbigner · 2021-04-08T19:09:02Z

Why is it needed at all?

shoyer · 2021-04-08T20:17:40Z

I like including an example of a non-default value because provides a succinct motivation for why this feature exists. The Zarr spec already has a number of such examples for explaining other features.

DennisHeimbigner · 2021-04-08T20:31:17Z

Reasonable. Then ignore this suggested change. Thanks.

joshmoore · 2021-04-13T16:45:55Z

Note: 5 days have passed. https://github.com/zarr-developers/governance/blob/master/GOVERNANCE.md states

Changes to the specification require a dedicated issue on our issue tracker and even more time than an API change. The appropriate length of time is currently under discussion.

Considering this is an optional addition, I'd propose that we leave this call for objections open until the end of the week and then start moving forward? (Any objections to that plan today please. 😉 )

DennisHeimbigner · 2021-04-14T01:59:33Z

Can the separator be a UTF-8 character => multi-byte?

axtimwalde · 2021-04-14T02:08:06Z

I think that's a valid question. I vote for UTF-8 arbitrary length string. ASCII encouraged.

jbms · 2021-04-14T02:18:17Z

For simplicity it might be best to limit it to one of a small set of characters for now, e.g. {"." and "/"}, and possibly expand that later if there is a need for it.

On the implementation side it is not terribly difficult to just concatenate whatever arbitrary string is specified, but there are a few ways things can go wrong:

If the string starts or ends with a digit, then it is possible that a given key could correspond to more than one chunk.
If the string is length 0 obviously that leads to ambiguity.
Currently the allowed characters are [0-9./], which is pretty widely supported by various underlying storage systems. If you allow arbitrary unicode or even arbitrary ascii you can quickly run into characters not supported by the storage system (e.g. local filesystem), or which may have special behavior when opened --- e.g. on MS Windows there are a bunch of reserved names that cause problems when opened, and they aply to every directory, not just the root.
Some filesystems do various forms of normalization to unicode characters.

While in general the user should be free to make a bad choice and suffer the consequences, in some cases software that processes zarr may be intended to handle an untrusted zarr volume and then care must be taken to handle these issues.

Unless there is currently a use case for a separator other than {"." and "/"} it seems like it would be reasonable to restrict it to those options for now.

DennisHeimbigner · 2021-04-14T02:20:54Z

Also by putting it into .zarray, it implies that the separator can differ for each
array. I would think it better if the dimension separator was constrained
to be the same for all arrays in the dataset.

jbms · 2021-04-14T02:33:38Z

Putting the separator into .zarray seems helpful since currently .zarray stands on its own. If you have to look in parent directory .zattr files that makes it more complicated.

As I imagine implementing this, and generally in what seems like the most natural implementation, there wouldn't be any problem with different separators within the same hierarchy. However I see that having different separators within the same hierarchy would not work with the existing implementation of NestedDirectoryStore. However I would suppose that once this feature is implemented NestedDirectoryStore would no longer be needed except for API compatibility.

axtimwalde · 2021-04-14T02:35:16Z

I like that it can be different for each dataset, it's simpler to parse and may be useful.

DennisHeimbigner · 2021-04-14T02:52:51Z

I would suggest for version 3 that you introduce a single "superblock" object, ".zarr" say,
to hold metadata that applies to the whole dataset. This could also hold
extension information and the dimension separator could be put there as well.
p.s. I use the term "dataset" to refer to the whole collection of arrays and metadata.
Is there a preferred term for that collection?

DennisHeimbigner · 2021-04-14T02:58:45Z

This is somewhat off-topic, but it sounds like the real unit here is the .zarray+.zattr+chunks.
There really is no notion of a "dataset" as a collection of arrays and groups have no
semantics at all; they are purely for namespaces.
Under that model, then indeed, everything has to be attached to .zarray, including dimension separators and extensions.

axtimwalde · 2021-04-14T03:46:29Z

I used the term dataset for 'array' as in HDF5 and N5, sorry for the confusion. I like the term 'container' for the entirety of groups and arrays that belong to the same root. That would be the equivalent of an HDF5 file.

SabineEmbacher · 2021-04-14T05:31:19Z

Also by putting it into .zarray, it implies that the separator can differ for each
array. I would think it better if the dimension separator was constrained
to be the same for all arrays in the dataset.

The separator should be individually adjustable for each array. A zarr data set can thus also be compiled manually from already existing data.
For example, if already existing data from different processing years are to be merged into a time series data set. Or if subsets of already existing different satellite sensors are merged to one data set, because these extracts from different sensors are needed for certain climate models or other calculations.

chris-allan · 2021-04-14T09:37:08Z

Having some experience dealing with arbitrary unicode both from an implementation and security perspective I would largely parrot @jbms. Many of us are using, or have the desire to use, Zarr in a trusted environments and in particular on large shared filesystems. Allowing arbitrary unicode dimension separators sets up a whole series of cross-language programming safety, path normalization, and subpath containment scenarios that must be dealt with to be safe in such an environment. Having a defined set of allowed ASCII separator characters is both easier for implementers and safer.

…developers#715) * All top-level storage classes now take an optional `dimension_separator` parameter which defaults to `None`, but can also be `.` or `/`. * A ValueError is raised at normalization time if this is not the case. * `None`s are normalized to the default of `.` in all except the NestedDirectoryStore case. * The value is stored as `self._dimension_separator` on participating classes so that array creation can lookup the value. * This value deprecates the `key_separator` value from FSStore. * Wrapper classes like LRUCacheStore and ConsolidatedMetadataStore *do not* follow this pattern and instead rely on the value in the underlying store.

jbms · 2021-04-15T21:30:15Z

Currently tensorstore does fail if .zarray contains extra fields. However I will remove that check at the same time that I implement this feature.

DennisHeimbigner · 2021-04-15T23:41:41Z

That restriction should probably be removed from the spec and replaced
with one that says that unrecognized keys should be ignored.

This implements the spec change described here: zarr-developers/zarr-python#715 This also adds a query string parameter by the same name. Fixes #241.

joshmoore · 2021-04-16T10:56:55Z

Thanks for the suggestion, @DennisHeimbigner. I've pushed:

-Other keys MUST NOT be present within the metadata object.
+Other keys SHOULD NOT be present within the metadata object and SHOULD be
+ignored by implementations.

* feat(zarr): support dimension_separator in .zarray file This implements the spec change described here: zarr-developers/zarr-python#715 This also adds a query string parameter by the same name. Fixes google#241. * feat(multiscale mesh): add additional debugging checks * fix(multiscale mesh): turn off multiscale mesh fragment debugging by default This was mistakenly left on by default. Co-authored-by: Jeremy Maitin-Shepard <jbms@google.com>

joshmoore · 2021-04-19T20:16:05Z

The zarr-python implementation (#716) is finally green. I'd propose we try to get these both merged this week, unless anyone else who is working on an implementation needs more time.

joshmoore · 2021-04-21T18:51:32Z

Thanks for the review, @Carreau. Considering that there are now several implementations based on this (nczarr, tensorstore, and zarr.js minimally) I'm going to merge. Adding comment to #716 and then the big question will be when and as what to release. (My assumption is 2.8.0)

joshmoore · 2021-04-23T14:22:24Z

Some additional transcribed notes from @alimanfoo:

We discussed whether or not this should actually lead to a version number bump. However v3 is already defined and there's no process for a version in between.
The primary concern would be that existing filesets would become corrupted by older implementations that do not understand the new array key.
However, since that's already the case today without the metadata key, this should be considered a "fix rather than a [breaking] change"

…#716) * Implement `dimension_separator` for Python storage classes (See #715) * All top-level storage classes now take an optional `dimension_separator` parameter which defaults to `None`, but can also be `.` or `/`. * A ValueError is raised at normalization time if this is not the case. * `None`s are normalized to the default of `.` in all except the NestedDirectoryStore case. * The value is stored as `self._dimension_separator` on participating classes so that array creation can lookup the value. * This value deprecates the `key_separator` value from FSStore. * Wrapper classes like LRUCacheStore and ConsolidatedMetadataStore *do not* follow this pattern and instead rely on the value in the underlying store. * Only store `dimension_separator` if not None All hexdigest tests were failing due to updated array metadata. In the case of NestedDirectoryStore and N5Store, this is necessary. If the dimension_separator key is excluded from the .zarray JSON when None, then most standard tests continue to pass. * Fix doctests with optional key * Add separator to missed LDBMStore * Fix linting issue * De-deprecate key_separator as public, non-null API * Add test for normalize_dim_sep to appease codecov * More tests for codecov * Remove key from n5 array metadata * Fix minor typo * Cleanup DIGESTS in test_core.py * Fix cut-n-paste error in test_utils.py * And hopefully on last codecov fix * Apply review changes * Add 2.8.0 release notes

* feat: add state saving/sharing ui allowing users to send their state to a server that accepts a json encoded version of the neuroglancer state. the "state server" is expected to return a url that will respond that same json string * feat(zarr): support dimension_separator in .zarray file This implements the spec change described here: zarr-developers/zarr-python#715 This also adds a query string parameter by the same name. Fixes google#241. * feat(multiscale mesh): add additional debugging checks * fix(multiscale mesh): turn off multiscale mesh fragment debugging by default This was mistakenly left on by default. * simplified state sharing code. replaced dialog with a state server config file and a select element if there is more than 1 server * moved state share code out of viewer into separate file, renamed state key for the selected state server * feat(ui): add support for configurable side panels Previously, only two side panels were supported, the layer side panel and the selection details side panel. Only a static configuration was supported, a single column on the right, with the layer side panel on top and the selection details below, and the vertical height divided equally. This commit adds support for an arbitrary number of side panels, to the left/right/top/bottom of the main data panels. Panels can be moved via drag and drop, and the size can also be adjusted. The side panels for multiple layers can now be shown at once, and individual tabs of side panels can also be dragged to separate panels. * feat: add layer list panel and support for "archived" layers Archived layers are shown only in the new layer list panel; they are not visible and are not shown in the layer bar. They may be useful in the case where there are a large number of layers that would otherwise clutter up the layer bar. This commit also improves the visual feedback during layer and drop operations. * feat(python): add ConfigState options for additional panel buttons * fix(python): fix ManagedLayer.visible property * fix(annotation): only bind properties actually referenced by the shader Annotation rendering maps properties to shader vertex attributes, and WebGL implementations limit the number of attributes to ~16. Furthermore, previously all properties were bound even if not referenced. With this commit, only properties actually referenced by the shader are bound. This allows annotation layers to support data sources with a large number of properties, provided that only a small number are actually referenced by the shader at once. * fix(annotation): fix rendering tab to better handle a large number of properties * fix(annotation): prevent collapse of annotation property list * fix(perspective_panel): eliminate separate transparentPickEnabled behavior Segmentation layers have a "pick" boolean option that previously controlled two things: 1. When off, double clicking does not select/deselect segments in the layer. 2. When on, picking of transparent meshes behaves exactly the same as if the mesh were opaque --- i.e. it is not possible to pick objects behind/inside the mesh. When off, picking of transparent meshes has lower precendence than opaque objects behind. In practice, the picking behavior for transparent objects with "picking on", which made it impossible to pick opaque objects behind, is not very useful. With this change, the second behavior change no longer applies: transparent objects now always have lower picking precendence than opaque objects, making it possible to pick opaque objects behind transparent objects. * fix(ui/side_panel): only allow dragging from title bar This allows slider controls in the body of side panels to work correctly. Also disables auto-select on focus behavior for the name input field of the layer side panels. The previous auto-select on focus behavior meant that attempting to start a drag from the text area (which takes up most of the title bar and is the most natural place to drag from) worked, but showed the text itself being dragged rather than the entire panel. Co-authored-by: Chris Jordan <jordanchriss@gmail.com> Co-authored-by: Jeremy Maitin-Shepard <jbms@google.com> Co-authored-by: Hannah Gooden <hannah.gooden@jhuapl.edu>

Historically zarr has named chunk files using dots by default, e.g. "1.2.3", but has also supported slashes, e.g. "1/2/3". However, the choice was not indicated in the .zarray metadata file, and instead needed to be specified out-of-band. In TensorStore, that was accomplished using the "key_encoding" spec member. Zarr recently added a dimension_separator field to the metadata format: zarr-developers/zarr-python#715 This commit adds support for that field. Note that previously TensorStore did not accept .zarray files that contain unknown members, meaning that reading arrays with the dimension_separator field would fail. The existing "key_encoding" spec member is still supported, but deprecated in favor of specifying "dimension_separator" as part of the "metadata" field. When creating a new array, TensorStore will always write the "dimension_separator" field, in order to allow the array to be re-opened without specifying the "dimension_separator". Additionally, this change also makes TensorStore accept (and preserve) extra .zarray members, as the zarr spec has now also changed to require that. PiperOrigin-RevId: 384998911 Change-Id: I485141b7786084d57c28005e3076f70462cd5b9e

This implements the spec change described here: zarr-developers/zarr-python#715 This also adds a query string parameter by the same name. Fixes #241.

joshmoore mentioned this pull request Apr 8, 2021

Hierarchical chunk storage ome/ngff#29

Closed

manzt mentioned this pull request Apr 8, 2021

Add dimension separator gzuidhof/zarr.js#89

Merged

chris-allan added a commit to chris-allan/jzarr that referenced this pull request Apr 9, 2021

Use upstream agreed metadata (See zarr-developers/zarr-python#715)

44bf50e

chris-allan added a commit to chris-allan/jzarr that referenced this pull request Apr 9, 2021

Use upstream agreed metadata (See zarr-developers/zarr-python#715)

781e4f7

joshmoore pushed a commit to joshmoore/jzarr that referenced this pull request Apr 9, 2021

Use upstream agreed metadata (See zarr-developers/zarr-python#715)

5877f3f

chris-allan added a commit to glencoesoftware/jzarr that referenced this pull request Apr 12, 2021

Use upstream agreed metadata (See zarr-developers/zarr-python#715)

0be6d3a

manzt mentioned this pull request Apr 12, 2021

Use Proxy store for nested chunks hms-dbmi/vizarr#85

Merged

This was referenced Apr 15, 2021

Implement dimension_separator for Python storage classes (See #715) #716

Merged

Include nested/flat tests zarr-developers/zarr_implementations#26

Open

Remove MUST NOT restriction for other keys

049e4af

Merge tag 'v2.7.1' into dim_sep

7f039f3

davidbrochart mentioned this pull request Apr 20, 2021

Support zarr v2 dimension separator xtensor-stack/xtensor-zarr#40

Merged

joshmoore mentioned this pull request Apr 20, 2021

Modify nested detection loop joshmoore/jzarr#1

Merged

Carreau approved these changes Apr 21, 2021

View reviewed changes

joshmoore merged commit 4d4d833 into zarr-developers:master Apr 21, 2021

joshmoore deleted the dim_sep branch April 21, 2021 18:51

This was referenced May 5, 2021

jzarr read/write (close #36) zarr-developers/zarr_implementations#37

Merged

Add axes field to multiscale metadata ome/ngff#46

Merged

This was referenced May 13, 2021

Sharding array chunks across hashed sub-directories zarr-developers/zarr-specs#115

Open

da.from_zarr ignores Zarr's dimension_separator dask/dask#7673

Open

constantinpape mentioned this pull request Jun 10, 2021

WIP (support for ome.zar) mobie/mobie-viewer-fiji#297

Closed

joshmoore mentioned this pull request Jun 14, 2021

Confusion about the dimension_separator keyword #769

Closed

joshmoore mentioned this pull request Aug 3, 2021

NCZarr - Netcdf Support for Zarr zarr-developers/zarr-specs#41

Open

joshmoore mentioned this pull request Feb 9, 2022

Blog suggestions zarr-developers/blog#5

Open

19 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v2 spec: add optional dimension_separator (see #707) #715

v2 spec: add optional dimension_separator (see #707) #715

joshmoore commented Apr 8, 2021

martindurant commented Apr 8, 2021

joshmoore commented Apr 8, 2021

martindurant commented Apr 8, 2021

jbms commented Apr 8, 2021

martindurant commented Apr 8, 2021

DennisHeimbigner commented Apr 8, 2021

martindurant commented Apr 8, 2021

DennisHeimbigner commented Apr 8, 2021

shoyer commented Apr 8, 2021

DennisHeimbigner commented Apr 8, 2021

joshmoore commented Apr 13, 2021

DennisHeimbigner commented Apr 14, 2021

axtimwalde commented Apr 14, 2021

jbms commented Apr 14, 2021

DennisHeimbigner commented Apr 14, 2021

jbms commented Apr 14, 2021

axtimwalde commented Apr 14, 2021

DennisHeimbigner commented Apr 14, 2021

DennisHeimbigner commented Apr 14, 2021

axtimwalde commented Apr 14, 2021

SabineEmbacher commented Apr 14, 2021

chris-allan commented Apr 14, 2021

jbms commented Apr 15, 2021

DennisHeimbigner commented Apr 15, 2021

joshmoore commented Apr 16, 2021

joshmoore commented Apr 19, 2021

joshmoore commented Apr 21, 2021

joshmoore commented Apr 23, 2021

v2 spec: add optional dimension_separator (see #707) #715

v2 spec: add optional dimension_separator (see #707) #715

Conversation

joshmoore commented Apr 8, 2021

martindurant commented Apr 8, 2021

joshmoore commented Apr 8, 2021

martindurant commented Apr 8, 2021

jbms commented Apr 8, 2021

martindurant commented Apr 8, 2021

DennisHeimbigner commented Apr 8, 2021

martindurant commented Apr 8, 2021

DennisHeimbigner commented Apr 8, 2021

shoyer commented Apr 8, 2021

DennisHeimbigner commented Apr 8, 2021

joshmoore commented Apr 13, 2021

DennisHeimbigner commented Apr 14, 2021

axtimwalde commented Apr 14, 2021

jbms commented Apr 14, 2021

DennisHeimbigner commented Apr 14, 2021

jbms commented Apr 14, 2021

axtimwalde commented Apr 14, 2021

DennisHeimbigner commented Apr 14, 2021

DennisHeimbigner commented Apr 14, 2021

axtimwalde commented Apr 14, 2021

SabineEmbacher commented Apr 14, 2021

chris-allan commented Apr 14, 2021

jbms commented Apr 15, 2021

DennisHeimbigner commented Apr 15, 2021

joshmoore commented Apr 16, 2021

joshmoore commented Apr 19, 2021

joshmoore commented Apr 21, 2021

joshmoore commented Apr 23, 2021