Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v2 spec: add optional dimension_separator (see #707) #715

Merged
merged 4 commits into from
Apr 21, 2021

Conversation

joshmoore
Copy link
Member

Various implementations allow for defining the separator between the
dimension indexes when writing chunks:

  • n5-zarr defines a dimensionSeparator parameter;
  • zarr-python's NestedDirectoryStore does so by default
  • and FSStore provides a key_separator parameter;
  • tensorstore has a key_encoding parameter; and
  • jzarr is looking to add the same functionality.

When writing an array, it is straight-forward to set this separator
and have arrays properly configured. Consumers of such arrays,
however, must either know a priori if their arrays use a
non-default separator or must loop through all possible chunks keys
searching for the right one.

By defining adding an optional metadata key to the .zarray, we:

  • preserve the efficient configuration of arrays
  • while keeping the v2 spec backwards compatible.

The primary downsides are that this will be the first optional metadata
value in the v2 spec and therefore we don't have a strong understanding of how that will play out, and datasets which were previously written with non-default separators will need updating in order to enable the detection though that is no worse than the current situation.

cc: @alimanfoo @chris-allan @SabineEmbacher @jbms @axtimwalde @manzt @gzuidhof @haileyajohnson @DennisHeimbigner @zarr-developers/core-devs

(If I've missed any Zarr v2 implementors, please add them)

Various implementations allow for defining the separator between the
dimension indexes when writing chunks:

 * n5-zarr defines a `dimensionSeparator` parameter;
 * zarr-python's NestedDirectoryStore does so by default
 * and FSStore provides a `key_separator` parameter;
 * tensorstore has a `key_encoding` parameter; and
 * jzarr is looking to add the same functionality.

When writing an array, it is straight-forward to set this separator
and have arrays properly configured. Consumers of such arrays,
however, must either know *a priori* if their arrays use a
non-default separator or must loop through all possible chunks keys
searching for the right one.

By defining adding an optional metadata key to the .zarray, we:

 * preserve the efficient configuration of arrays
 * while keeping the v2 spec backwards compatible.

The primary downsides are that this will be the first optional metadata
value in the v2 spec and therefore we don't have a strong understanding
of how that will play out, and datasets which were previously written
with non-default separators will need updating in order to enable the
detection though that is no worse than the current situation.
@martindurant
Copy link
Member

So presumably you would be able to still pass the separator argument OR look for the metadata key. The first would take priority and would remain the recommendation for the current and older versions of zarr and for datasets that were written by them.

@joshmoore
Copy link
Member Author

@martindurant: exactly. Wasn't sure if that was too much for this paragraph, but I can look for a spot to go into more details. I imagine if someone passes a separator that doesn't match the expected one we could warn in mode="r". For mode="a|w", I'd personally favor an exception but an alternative would be to update the metadata, effectively hiding the existing chunks.

@martindurant
Copy link
Member

It would be reasonable to supply a helper function somewhere to convert an existing array from one style of metadata to another.

@jbms
Copy link

jbms commented Apr 8, 2021

I'm happy with this and can add support to tensorstore and neuroglancer. Is this the confirmed metadata format or are we waiting on additional discussion? I'd rather not add this only to then have to also support something else in the future.

@martindurant
Copy link
Member

I think we are hoping for more +1 from the other maintainers.
I think I'd be OK with merging if there is no opposition within, say, a week.

@DennisHeimbigner
Copy link

The change in the spec says this:

The most common non-default value is "/", which leads to nested keys of the form "0/0" that in most implementations produces a directory structure.

I think the last part "that in most implementations produces a directory structure" should be removed since it is inserting implementation claims into what should be an implementation free document.

@martindurant
Copy link
Member

"that in most implementations produces a directory structure" should be removed

Would different wording help, like "in some implementation is a special character used to mark directories" or "the directory delimiter in posix-like storage", or "commonly used as a directory delimiter" ?

@DennisHeimbigner
Copy link

Why is it needed at all?

@shoyer
Copy link
Contributor

shoyer commented Apr 8, 2021

I like including an example of a non-default value because provides a succinct motivation for why this feature exists. The Zarr spec already has a number of such examples for explaining other features.

@DennisHeimbigner
Copy link

Reasonable. Then ignore this suggested change. Thanks.

chris-allan added a commit to chris-allan/jzarr that referenced this pull request Apr 9, 2021
chris-allan added a commit to chris-allan/jzarr that referenced this pull request Apr 9, 2021
joshmoore pushed a commit to joshmoore/jzarr that referenced this pull request Apr 9, 2021
chris-allan added a commit to glencoesoftware/jzarr that referenced this pull request Apr 12, 2021
@joshmoore
Copy link
Member Author

Note: 5 days have passed. https://github.com/zarr-developers/governance/blob/master/GOVERNANCE.md states

Changes to the specification require a dedicated issue on our issue tracker and even more time than an API change. The appropriate length of time is currently under discussion.

Considering this is an optional addition, I'd propose that we leave this call for objections open until the end of the week and then start moving forward? (Any objections to that plan today please. 😉 )

@DennisHeimbigner
Copy link

Can the separator be a UTF-8 character => multi-byte?

@axtimwalde
Copy link
Contributor

I think that's a valid question. I vote for UTF-8 arbitrary length string. ASCII encouraged.

@jbms
Copy link

jbms commented Apr 14, 2021

For simplicity it might be best to limit it to one of a small set of characters for now, e.g. {"." and "/"}, and possibly expand that later if there is a need for it.

On the implementation side it is not terribly difficult to just concatenate whatever arbitrary string is specified, but there are a few ways things can go wrong:

  • If the string starts or ends with a digit, then it is possible that a given key could correspond to more than one chunk.
  • If the string is length 0 obviously that leads to ambiguity.
  • Currently the allowed characters are [0-9./], which is pretty widely supported by various underlying storage systems. If you allow arbitrary unicode or even arbitrary ascii you can quickly run into characters not supported by the storage system (e.g. local filesystem), or which may have special behavior when opened --- e.g. on MS Windows there are a bunch of reserved names that cause problems when opened, and they aply to every directory, not just the root.
  • Some filesystems do various forms of normalization to unicode characters.

While in general the user should be free to make a bad choice and suffer the consequences, in some cases software that processes zarr may be intended to handle an untrusted zarr volume and then care must be taken to handle these issues.

Unless there is currently a use case for a separator other than {"." and "/"} it seems like it would be reasonable to restrict it to those options for now.

@DennisHeimbigner
Copy link

Also by putting it into .zarray, it implies that the separator can differ for each
array. I would think it better if the dimension separator was constrained
to be the same for all arrays in the dataset.

@jbms
Copy link

jbms commented Apr 14, 2021

Putting the separator into .zarray seems helpful since currently .zarray stands on its own. If you have to look in parent directory .zattr files that makes it more complicated.

As I imagine implementing this, and generally in what seems like the most natural implementation, there wouldn't be any problem with different separators within the same hierarchy. However I see that having different separators within the same hierarchy would not work with the existing implementation of NestedDirectoryStore. However I would suppose that once this feature is implemented NestedDirectoryStore would no longer be needed except for API compatibility.

@axtimwalde
Copy link
Contributor

I like that it can be different for each dataset, it's simpler to parse and may be useful.

@DennisHeimbigner
Copy link

I would suggest for version 3 that you introduce a single "superblock" object, ".zarr" say,
to hold metadata that applies to the whole dataset. This could also hold
extension information and the dimension separator could be put there as well.
p.s. I use the term "dataset" to refer to the whole collection of arrays and metadata.
Is there a preferred term for that collection?

@DennisHeimbigner
Copy link

This is somewhat off-topic, but it sounds like the real unit here is the .zarray+.zattr+chunks.
There really is no notion of a "dataset" as a collection of arrays and groups have no
semantics at all; they are purely for namespaces.
Under that model, then indeed, everything has to be attached to .zarray, including dimension separators and extensions.

@axtimwalde
Copy link
Contributor

I used the term dataset for 'array' as in HDF5 and N5, sorry for the confusion. I like the term 'container' for the entirety of groups and arrays that belong to the same root. That would be the equivalent of an HDF5 file.

@SabineEmbacher
Copy link

Also by putting it into .zarray, it implies that the separator can differ for each
array. I would think it better if the dimension separator was constrained
to be the same for all arrays in the dataset.

The separator should be individually adjustable for each array. A zarr data set can thus also be compiled manually from already existing data.
For example, if already existing data from different processing years are to be merged into a time series data set. Or if subsets of already existing different satellite sensors are merged to one data set, because these extracts from different sensors are needed for certain climate models or other calculations.

@chris-allan
Copy link

Having some experience dealing with arbitrary unicode both from an implementation and security perspective I would largely parrot @jbms. Many of us are using, or have the desire to use, Zarr in a trusted environments and in particular on large shared filesystems. Allowing arbitrary unicode dimension separators sets up a whole series of cross-language programming safety, path normalization, and subpath containment scenarios that must be dealt with to be safe in such an environment. Having a defined set of allowed ASCII separator characters is both easier for implementers and safer.

joshmoore added a commit to joshmoore/zarr-python that referenced this pull request Apr 15, 2021
…developers#715)

* All top-level storage classes now take an optional `dimension_separator`
  parameter which defaults to `None`, but can also be `.` or `/`.

* A ValueError is raised at normalization time if this is not the case.

* `None`s are normalized to the default of `.` in all except the
  NestedDirectoryStore case.

* The value is stored as `self._dimension_separator` on participating classes
  so that array creation can lookup the value.

* This value deprecates the `key_separator` value from FSStore.

* Wrapper classes like LRUCacheStore and ConsolidatedMetadataStore *do not*
  follow this pattern and instead rely on the value in the underlying store.
@jbms
Copy link

jbms commented Apr 15, 2021

Currently tensorstore does fail if .zarray contains extra fields. However I will remove that check at the same time that I implement this feature.

@DennisHeimbigner
Copy link

That restriction should probably be removed from the spec and replaced
with one that says that unrecognized keys should be ignored.

jbms added a commit to google/neuroglancer that referenced this pull request Apr 16, 2021
This implements the spec change described here:
zarr-developers/zarr-python#715

This also adds a query string parameter by the same name.

Fixes #241.
@joshmoore
Copy link
Member Author

Thanks for the suggestion, @DennisHeimbigner. I've pushed:

-Other keys MUST NOT be present within the metadata object.
+Other keys SHOULD NOT be present within the metadata object and SHOULD be
+ignored by implementations.

sandyhider pushed a commit to jhuapl-boss/neuroglancer that referenced this pull request Apr 16, 2021
* feat(zarr): support dimension_separator in .zarray file

This implements the spec change described here:
zarr-developers/zarr-python#715

This also adds a query string parameter by the same name.

Fixes google#241.

* feat(multiscale mesh): add additional debugging checks

* fix(multiscale mesh): turn off multiscale mesh fragment debugging by default

This was mistakenly left on by default.

Co-authored-by: Jeremy Maitin-Shepard <jbms@google.com>
@joshmoore
Copy link
Member Author

The zarr-python implementation (#716) is finally green. I'd propose we try to get these both merged this week, unless anyone else who is working on an implementation needs more time.

@joshmoore
Copy link
Member Author

Thanks for the review, @Carreau. Considering that there are now several implementations based on this (nczarr, tensorstore, and zarr.js minimally) I'm going to merge. Adding comment to #716 and then the big question will be when and as what to release. (My assumption is 2.8.0)

@joshmoore joshmoore merged commit 4d4d833 into zarr-developers:master Apr 21, 2021
@joshmoore joshmoore deleted the dim_sep branch April 21, 2021 18:51
@joshmoore
Copy link
Member Author

Some additional transcribed notes from @alimanfoo:

  • We discussed whether or not this should actually lead to a version number bump. However v3 is already defined and there's no process for a version in between.
  • The primary concern would be that existing filesets would become corrupted by older implementations that do not understand the new array key.
  • However, since that's already the case today without the metadata key, this should be considered a "fix rather than a [breaking] change"

joshmoore added a commit that referenced this pull request Apr 24, 2021
…#716)

* Implement `dimension_separator` for Python storage classes (See #715)

* All top-level storage classes now take an optional `dimension_separator`
  parameter which defaults to `None`, but can also be `.` or `/`.

* A ValueError is raised at normalization time if this is not the case.

* `None`s are normalized to the default of `.` in all except the
  NestedDirectoryStore case.

* The value is stored as `self._dimension_separator` on participating classes
  so that array creation can lookup the value.

* This value deprecates the `key_separator` value from FSStore.

* Wrapper classes like LRUCacheStore and ConsolidatedMetadataStore *do not*
  follow this pattern and instead rely on the value in the underlying store.

* Only store `dimension_separator` if not None

All hexdigest tests were failing due to updated array metadata.
In the case of NestedDirectoryStore and N5Store, this is necessary.
If the dimension_separator key is excluded from the .zarray JSON
when None, then most standard tests continue to pass.

* Fix doctests with optional key

* Add separator to missed LDBMStore

* Fix linting issue

* De-deprecate key_separator as public, non-null API

* Add test for normalize_dim_sep to appease codecov

* More tests for codecov

* Remove key from n5 array metadata

* Fix minor typo

* Cleanup DIGESTS in test_core.py

* Fix cut-n-paste error in test_utils.py

* And hopefully on last codecov fix

* Apply review changes

* Add 2.8.0 release notes
hannah-martinez added a commit to jhuapl-boss/neuroglancer that referenced this pull request Apr 30, 2021
* feat: add state saving/sharing ui allowing users to send their state to a server that accepts a json encoded version of the neuroglancer state. the "state server" is expected to return a url that will respond that same json string

* feat(zarr): support dimension_separator in .zarray file

This implements the spec change described here:
zarr-developers/zarr-python#715

This also adds a query string parameter by the same name.

Fixes google#241.

* feat(multiscale mesh): add additional debugging checks

* fix(multiscale mesh): turn off multiscale mesh fragment debugging by default

This was mistakenly left on by default.

* simplified state sharing code. replaced dialog with a state server config file and a select element if there is more than 1 server

* moved state share code out of viewer into separate file, renamed state key for the selected state server

* feat(ui): add support for configurable side panels

Previously, only two side panels were supported, the layer side panel
and the selection details side panel.  Only a static configuration was
supported, a single column on the right, with the layer side panel on
top and the selection details below, and the vertical height divided
equally.

This commit adds support for an arbitrary number of side panels, to
the left/right/top/bottom of the main data panels.  Panels can be
moved via drag and drop, and the size can also be adjusted.  The side
panels for multiple layers can now be shown at once, and individual
tabs of side panels can also be dragged to separate panels.

* feat: add layer list panel and support for "archived" layers

Archived layers are shown only in the new layer list panel; they are
not visible and are not shown in the layer bar.  They may be useful in
the case where there are a large number of layers that would otherwise
clutter up the layer bar.

This commit also improves the visual feedback during layer and drop
operations.

* feat(python): add ConfigState options for additional panel buttons

* fix(python): fix ManagedLayer.visible property

* fix(annotation): only bind properties actually referenced by the shader

Annotation rendering maps properties to shader vertex attributes, and
WebGL implementations limit the number of attributes to ~16.
Furthermore, previously all properties were bound even if not
referenced.  With this commit, only properties actually referenced by
the shader are bound.  This allows annotation layers to support data
sources with a large number of properties, provided that only a small
number are actually referenced by the shader at once.

* fix(annotation): fix rendering tab to better handle a large number of properties

* fix(annotation): prevent collapse of annotation property list

* fix(perspective_panel): eliminate separate transparentPickEnabled behavior

Segmentation layers have a "pick" boolean option that previously
controlled two things:

1. When off, double clicking does not select/deselect segments in the
layer.

2. When on, picking of transparent meshes behaves exactly the same as
if the mesh were opaque --- i.e. it is not possible to pick objects
behind/inside the mesh.  When off, picking of transparent meshes has
lower precendence than opaque objects behind.

In practice, the picking behavior for transparent objects with
"picking on", which made it impossible to pick opaque objects behind,
is not very useful.

With this change, the second behavior change no longer applies:
transparent objects now always have lower picking precendence than
opaque objects, making it possible to pick opaque objects behind
transparent objects.

* fix(ui/side_panel): only allow dragging from title bar

This allows slider controls in the body of side panels to work correctly.

Also disables auto-select on focus behavior for the name input field
of the layer side panels.  The previous auto-select on focus behavior
meant that attempting to start a drag from the text area (which takes
up most of the title bar and is the most natural place to drag from)
worked, but showed the text itself being dragged rather than the
entire panel.

Co-authored-by: Chris Jordan <jordanchriss@gmail.com>
Co-authored-by: Jeremy Maitin-Shepard <jbms@google.com>
Co-authored-by: Hannah Gooden <hannah.gooden@jhuapl.edu>
jbms added a commit to google/tensorstore that referenced this pull request Jul 15, 2021
Historically zarr has named chunk files using dots by default,
e.g. "1.2.3", but has also supported slashes, e.g. "1/2/3".  However,
the choice was not indicated in the .zarray metadata file, and instead
needed to be specified out-of-band.  In TensorStore, that was
accomplished using the "key_encoding" spec member.

Zarr recently added a dimension_separator field to the metadata
format:

zarr-developers/zarr-python#715

This commit adds support for that field.  Note that previously
TensorStore did not accept .zarray files that contain unknown
members, meaning that reading arrays with the dimension_separator
field would fail.

The existing "key_encoding" spec member is still supported, but
deprecated in favor of specifying "dimension_separator" as part of the
"metadata" field.  When creating a new array, TensorStore will always
write the "dimension_separator" field, in order to allow the array to
be re-opened without specifying the "dimension_separator".

Additionally, this change also makes TensorStore accept (and preserve)
extra .zarray members, as the zarr spec has now also changed to
require that.

PiperOrigin-RevId: 384998911
Change-Id: I485141b7786084d57c28005e3076f70462cd5b9e
@joshmoore joshmoore mentioned this pull request Feb 9, 2022
19 tasks
madwilliam pushed a commit to ActiveBrainAtlas2/neuroglancer that referenced this pull request May 13, 2022
This implements the spec change described here:
zarr-developers/zarr-python#715

This also adds a query string parameter by the same name.

Fixes #241.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants