Drop dimension_names from v3 #219

Closed
normanrz opened this issue Mar 6, 2023 · 6 comments
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec

Comments

@normanrz
Contributor

normanrz commented Mar 6, 2023

I was going through the v3 spec and noticed the dimension_names attribute. I was wondering if that might better be placed in the upcoming metadata convention ZEP? There are already community-specific conventions for assigning names (and other metadata) to dimensions, e.g. OME-Zarr or xarray zarr dimension encoding.

@jbms
Contributor

jbms commented Mar 8, 2023

There was already quite a bit of discussion regarding whether to:

  1. add this as a core metadata attribute;
  2. add this as a user metadata attribute;
  3. just leave it up to external tools / specifications to define something, like xarray and ome-zarr.

See previous discussion:
#73
#144
#162

While there wasn't a strong reason to favor 1 over 2, there was strong support for choosing either 1 or 2 rather than 3. Dimension names are broadly applicable across many different zarr implementations and use cases, and some zarr implementations (such as Neuroglancer and TensorStore) make direct use of them. A single way to specify dimension names promotes better common-denominator interoperability between e.g. xarray, netcdf, and OME-zarr.

My hope is that for zarr v3, xarray, netcdf and OME-zarr all make use of this dimension_names metadata field for specifying array dimension names rather than introducing a new user-defined attribute.

I think the general argument in favor of (1) rather than (2) is that dimension names are more universally applicable than other metadata like units.
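As a minimal sketch of what a shared core field would let any reader do, the following checks a dimension_names list against the array shape. It assumes the array metadata is available as a dict with a "shape" key and an optional "dimension_names" key whose entries are strings or null; the function name check_dimension_names is hypothetical and not part of any zarr library.

```python
from typing import Any, Optional


def check_dimension_names(metadata: dict[str, Any]) -> list[Optional[str]]:
    """Return one name (or None) per dimension, validating the field if present.

    Assumption: `metadata` has a "shape" list and an optional
    "dimension_names" list with one string-or-null entry per dimension.
    """
    shape = metadata["shape"]
    names = metadata.get("dimension_names")
    if names is None:
        # Field absent: every dimension is unnamed.
        return [None] * len(shape)
    if len(names) != len(shape):
        raise ValueError(
            f"dimension_names has {len(names)} entries, shape has {len(shape)}"
        )
    for name in names:
        if name is not None and not isinstance(name, str):
            raise TypeError(f"dimension name must be a string or null, got {name!r}")
    return list(names)
```

A reader built this way needs no knowledge of xarray or OME-Zarr conventions to recover the names.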

@normanrz
Contributor Author

normanrz commented Mar 8, 2023

Thanks for explaining the previous discussion. I agree that dimension names are broadly applicable. However, a flat list of strings may not be enough to specify all the metadata that a community needs, which will create parallel metadata. For example, in OME-Zarr we already store metadata for axes (=dimensions) with additional fields such as type (e.g. time, space, channel) and unit.
To solve this, dimension names could become user metadata (as you mentioned) or be made extensible, e.g., "dimensions": [{"name": "x", "attributes": {"type": "space", "unit": "nanometer"}}].

@jbms
Contributor

jbms commented Mar 8, 2023

Agreed that there are many other per-dimension attributes that may be useful.

For example, in addition to the ones you mentioned, I'm planning to propose a zarr extension for non-zero origins, which would need a way to specify a lower bound and grid offset for each dimension.

I'm also planning to propose a "resizable" attribute (#212).

In general, there is the choice of whether to represent per-dimension attributes using a "row"-based organization similar to your example above, e.g.:

{"dimensions": [
  {"name": "x", "attributes": {"unit": "meter", "type": "space"}},
  {"name": "y", "attributes": {"unit": "meter", "type": "space"}},
  {"name":"c"},
  {}
],
...
}

or to use an equivalent columnar representation:

{"dimension_names": ["x", "y", "c", null],
  "attributes": {
    "dimension_units": ["meter", "meter", null, null],
    "dimension_type": ["space", "space", null, null]
  },
  ...
}

For the row-based organization we are also effectively adding the concept of per-dimension user-defined attributes.
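To make the equivalence of the two layouts concrete, here is a rough Python sketch that converts between them. The function names rows_to_columns and columns_to_rows are hypothetical, and the sketch keeps attribute keys unchanged (e.g. "unit" rather than "dimension_units"). It also assumes that a null column entry means "attribute absent", so an explicit null attribute value would not survive a round trip.

```python
from typing import Any, Optional


def rows_to_columns(dimensions: list[dict[str, Any]]) -> dict[str, Any]:
    """Convert a row-based "dimensions" list into a columnar layout."""
    names = [dim.get("name") for dim in dimensions]
    # Collect every attribute key seen on any dimension, in first-seen order.
    keys: list[str] = []
    for dim in dimensions:
        for key in dim.get("attributes", {}):
            if key not in keys:
                keys.append(key)
    # One column per key; dimensions lacking the key get None.
    attributes = {
        key: [dim.get("attributes", {}).get(key) for dim in dimensions]
        for key in keys
    }
    return {"dimension_names": names, "attributes": attributes}


def columns_to_rows(
    dimension_names: list[Optional[str]], attributes: dict[str, list[Any]]
) -> list[dict[str, Any]]:
    """Convert the columnar layout back into one record per dimension."""
    rows: list[dict[str, Any]] = []
    for i, name in enumerate(dimension_names):
        row: dict[str, Any] = {}
        if name is not None:
            row["name"] = name
        # Skip None entries so absent attributes stay absent.
        attrs = {k: col[i] for k, col in attributes.items() if col[i] is not None}
        if attrs:
            row["attributes"] = attrs
        rows.append(row)
    return rows


rows = [
    {"name": "x", "attributes": {"unit": "meter", "type": "space"}},
    {"name": "y", "attributes": {"unit": "meter", "type": "space"}},
    {"name": "c"},
    {},
]
columns = rows_to_columns(rows)
round_trip = columns_to_rows(columns["dimension_names"], columns["attributes"])
```

With the example rows from this comment, the round trip reproduces the original list, which illustrates that the choice between layouts is about ergonomics rather than expressiveness.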

As I see it, we can basically accomplish the same thing with either representation, but there are pros and cons to each approach:

  • The existing per-dimension zarr metadata, namely shape and chunk_shape, uses a columnar representation. For shape we could easily switch to row representation. For chunk_shape that would not be very natural unless we want to make the grid type a per-dimension attribute (which actually might make sense).
  • If we want to add a must_understand = False per-dimension attribute to the core metadata, it would become rather verbose:
{"dimensions": [{"name": "x", "coordinate_array": {"must_understand": false, "path": "xxxxx"}, ...}, ...],
 ...}

However, it is not clear whether adding an optional core metadata per-dimension attribute is particularly useful given that it could instead be added as a user attribute.

  • Some attributes, like the inner chunk grid for sharding, really don't work well with a row representation.
  • If there is an explicit concept of per-dimension user-defined attributes, then a zarr implementation that supports "virtual views" for operations like transpose can properly map these attributes even without knowledge of any particular attribute. However, in some cases this might only partially transform per-dimension metadata. For example, some attributes might relate to multiple dimensions, e.g. a transformation matrix, and could not be easily represented as per-dimension (scalar) metadata. If these virtual views only partially transform the metadata, that may be more confusing than not transforming it at all.
  • The row representation is more human readable in some cases, especially if the same per-dimension attributes are not present for all dimensions. On the other hand, for some attributes like shape it may be less human readable.
  • The row representation may be problematic for some future extensions: for example, a way to specify certain integer coordinate transforms, such as transposing the dimension order, reversing dimensions, and adding/removing singleton dimensions. With such an extension, some per-dimension attributes might relate to the "input" space while other per-dimension attributes might relate to the "output" space. In my mind this is the biggest problem with the row representation.
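The "virtual view" point above can be sketched as follows: a transpose that reorders every per-dimension column without knowing what any column means. The function name transpose_columns is hypothetical, and the permutation convention (output dimension i takes input dimension perm[i]) is an assumption borrowed from numpy.transpose.

```python
from typing import Any, Sequence


def transpose_columns(
    columns: dict[str, Sequence[Any]], perm: Sequence[int]
) -> dict[str, list[Any]]:
    """Reorder each per-dimension column so that output i = input perm[i]."""
    rank = len(perm)
    if sorted(perm) != list(range(rank)):
        raise ValueError(f"{list(perm)} is not a permutation of 0..{rank - 1}")
    out: dict[str, list[Any]] = {}
    for key, column in columns.items():
        if len(column) != rank:
            # Not one scalar per dimension (e.g. a transformation matrix):
            # a generic remap like this one cannot handle it.
            raise ValueError(f"{key!r} is not per-dimension metadata")
        out[key] = [column[i] for i in perm]
    return out


view = transpose_columns(
    {
        "dimension_names": ["x", "y", "c"],
        "shape": [100, 200, 3],
        "dimension_units": ["meter", "meter", None],
    },
    perm=(2, 0, 1),
)
```

The ValueError branch marks the partial-transform concern raised in the list: metadata that spans several dimensions has no generic remapping, so an implementation must either reject it or leave it untransformed.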

@jstriebel added the core-protocol-v3.0 label on Mar 13, 2023
@jstriebel
Member

> To solve this, dimension names could become user metadata (as mentioned by you) or made extensible, e.g., "dimensions": [{"name": "x", "attributes": {"type": "space", "unit": "nanometer"}}].

This is something that could still be added later, as an alternative to dimension_names, if needed. Also, this could have the following form (either as a top-level field, or in the user attributes):

"dimensions": {
  "x": {                                  // "x" refers to a dimension name
    "type": "space",
    "unit": "nanometer"
  }
}

This would then be in addition to dimension_names, decoupling the order of dimensions and their properties. This might even work nicely for dimensions with the same name, which often have the same attributes (e.g. when being nodes in a graph).
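A small sketch of how a reader might combine the two fields, assuming the name-keyed "dimensions" mapping shown above existed alongside dimension_names. The function name properties_in_order is hypothetical, and treating unnamed or unlisted dimensions as having no properties is an assumption of this sketch.

```python
from typing import Any, Optional


def properties_in_order(
    dimension_names: list[Optional[str]],
    dimensions: dict[str, dict[str, Any]],
) -> list[dict[str, Any]]:
    """Look up per-dimension properties by name, in array dimension order."""
    result: list[dict[str, Any]] = []
    for name in dimension_names:
        if name is None:
            # Unnamed dimensions cannot be looked up by name.
            result.append({})
        else:
            # Copy so callers cannot mutate the shared entry by accident.
            result.append(dict(dimensions.get(name, {})))
    return result


props = properties_in_order(
    ["x", "y", "x", None],
    {"x": {"type": "space", "unit": "nanometer"}},
)
```

Both dimensions named "x" resolve to the same properties, which is the shared-attributes case mentioned above; the cost is that two dimensions with the same name cannot carry different properties.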

Simply having dimension names without further properties was requested by multiple people in different threads, which I think justifies adding it to v3 atm. I don't think that this is the case for any more complex fields. Since this can be changed and added later as well, I'm strongly in favor of keeping the spec as-is for now.

@normanrz
Contributor Author

> I don't think that this is the case for any more complex fields.

Well, OME-NGFF has that. But, I don't want this to block ZEP1, either.

@jstriebel
Member

@normanrz Can we close this issue?

@normanrz closed this as completed on May 5, 2023