Lack of resilience towards missing _ARRAY_DIMENSIONS xarray's special zarr attribute #280

Closed
eschalkargans opened this issue Nov 23, 2023 · 2 comments
Labels
IO Representation of particular file formats as trees

Comments

@eschalkargans

Hello,

Bug Description

I am currently experimenting with datatree (xarray-datatree==0.0.13) to open a Zarr folder.

I assumed that datatree would be able to open any Zarr store. However, it currently seems able to open only Zarr stores that were generated with xarray. When the _ARRAY_DIMENSIONS attribute is missing from the metadata in the .zmetadata file at the root of the store, datatree fails to load it and raises KeyError: '_ARRAY_DIMENSIONS'.

Reproduce the Bug

The following gist contains a small Python script that reproduces the issue:

https://gist.github.com/eschalkargans/6c8708370ad6b7b58eebe95aa95084ab

Here is the sequence:

  • I create a dummy DataTree containing a single (label, z) dimensional DataArray named my_xda.
  • I persist the DataTree to Zarr.
  • I successfully read the DataTree back.
  • Then, I alter the Zarr store by successively removing _ARRAY_DIMENSIONS from the .zattrs of every variable (z, label, my_xda), and try to reopen the store after each removal. It succeeds in all cases. ✔️
  • However, the last alteration, removing the _ARRAY_DIMENSIONS key-value pair from one of the variables in the .zmetadata file at the root of the store, results in an exception on read. The error message is explicit: KeyError: '_ARRAY_DIMENSIONS'
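
For readers who do not want to open the gist, here is a minimal standard-library sketch of that last alteration. It is not the gist's script: the helper name drop_array_dimensions is made up for illustration, and the dictionary only assumes the usual layout of a Zarr v2 consolidated .zmetadata document (a "metadata" mapping keyed by "<variable>/.zattrs"), trimmed to the attributes that matter here.

```python
import copy

# Assumed (trimmed) shape of a Zarr v2 consolidated .zmetadata document.
# Only the attribute entries relevant to this issue are shown.
ZMETADATA = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "label/.zattrs": {"_ARRAY_DIMENSIONS": ["label"]},
        "z/.zattrs": {"_ARRAY_DIMENSIONS": ["z"]},
        "my_xda/.zattrs": {"_ARRAY_DIMENSIONS": ["label", "z"]},
    },
}


def drop_array_dimensions(zmetadata, variable):
    """Return a copy of the metadata without `_ARRAY_DIMENSIONS` for `variable`.

    Hypothetical helper, not part of zarr, xarray or datatree.
    """
    altered = copy.deepcopy(zmetadata)
    altered["metadata"][f"{variable}/.zattrs"].pop("_ARRAY_DIMENSIONS", None)
    return altered


altered = drop_array_dimensions(ZMETADATA, "my_xda")
print(altered["metadata"]["my_xda/.zattrs"])  # {}
```

On a real store, the same change would be made by loading the .zmetadata file with json, applying the helper, and writing the result back; reopening the store afterwards is what raises the KeyError in my case.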

Discussion

Quoting the xarray documentation: "Because of these choices, Xarray cannot read arbitrary array data, but only Zarr data with valid _ARRAY_DIMENSIONS or NCZarr attributes on each array (NCZarr dimension names are defined in the .zarray file)."

More information about _ARRAY_DIMENSIONS: Zarr Encoding Specification

The documentation explicitly states that Xarray cannot read arbitrary array data. So, this issue is more of a feature request than a bug report: it is currently expected that such files are not readable.
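
To see in advance whether a store falls into the unreadable category, one can scan its consolidated metadata for arrays without the attribute. The following is only a sketch under assumptions: arrays_missing_dimensions is a hypothetical helper, the input is assumed to follow the Zarr v2 consolidated layout ("<path>/.zarray" and "<path>/.zattrs" keys under "metadata"), and NCZarr-style dimension names are deliberately ignored.

```python
def arrays_missing_dimensions(zmetadata):
    """List array paths whose attributes lack `_ARRAY_DIMENSIONS`.

    Hypothetical helper; it does not look at NCZarr metadata.
    """
    metadata = zmetadata["metadata"]
    missing = []
    for key in metadata:
        if not key.endswith(".zarray"):
            continue  # groups and attribute entries are not arrays
        path = key[: -len(".zarray")].rstrip("/")
        attrs_key = f"{path}/.zattrs" if path else ".zattrs"
        if "_ARRAY_DIMENSIONS" not in metadata.get(attrs_key, {}):
            missing.append(path)
    return sorted(missing)


# Illustrative metadata, trimmed to the keys the check needs.
example = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "z/.zarray": {"shape": [3]},
        "z/.zattrs": {"_ARRAY_DIMENSIONS": ["z"]},
        "my_xda/.zarray": {"shape": [2, 3]},
        "my_xda/.zattrs": {},
    },
}
print(arrays_missing_dimensions(example))  # ['my_xda']
```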

However, developers may find themselves at one point or another with plain Zarr files that are incompatible with the current xarray implementation. So, I think there should be a way to open these Zarr files with no dimension-names. Perhaps users could supply the dimension names themselves. One option is to pass only the missing attributes, with the .zmetadata that was read being merged with the user-provided _array_dimensions, e.g.:

open_datatree(zarr_path, engine="zarr", _array_dimensions={
    "z": "z"
})

Another option is to pass a full mapping from each variable's path in the Zarr hierarchy to its list of dimension names:

open_datatree(zarr_path, engine="zarr", _array_dimensions={
    "z": "z", "label": "label", "my_xda": ["label", "z"]
})

Or are you waiting for a future update of the Zarr specification that would fully incorporate named dimensions? In that case, what strategy would you recommend for datatree users to fix their Zarr stores? Updating the .zmetadata directly?
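
To make the "update the .zmetadata directly" option concrete, here is a rough sketch of the merge I have in mind, written against the metadata as a plain dictionary. Everything in it is an assumption for illustration: fill_missing_dimensions is a hypothetical helper, not an existing xarray or datatree function, and the _array_dimensions keyword shown above is only a proposal. The helper fills in names only where the attribute is missing and accepts either a single name or a list, mirroring the two call styles above.

```python
import copy


def fill_missing_dimensions(zmetadata, array_dimensions):
    """Return a copy of the metadata with `_ARRAY_DIMENSIONS` added where absent.

    Hypothetical helper; `array_dimensions` maps an array path to a
    dimension name or a list of dimension names.
    """
    patched = copy.deepcopy(zmetadata)
    metadata = patched["metadata"]
    for path, dims in array_dimensions.items():
        if f"{path}/.zarray" not in metadata:
            raise KeyError(f"no array named {path!r} in the consolidated metadata")
        # Accept either a single name ("z") or a list of names (["label", "z"]).
        names = [dims] if isinstance(dims, str) else list(dims)
        attrs = metadata.setdefault(f"{path}/.zattrs", {})
        # Only fill the gap; names already stored in the metadata win.
        attrs.setdefault("_ARRAY_DIMENSIONS", names)
    return patched


# Illustrative metadata: "z" already has names, "my_xda" does not.
store = {
    "zarr_consolidated_format": 1,
    "metadata": {
        "z/.zarray": {"shape": [3]},
        "z/.zattrs": {"_ARRAY_DIMENSIONS": ["z"]},
        "my_xda/.zarray": {"shape": [2, 3]},
        "my_xda/.zattrs": {},
    },
}
patched = fill_missing_dimensions(store, {"z": "depth", "my_xda": ["label", "z"]})
print(patched["metadata"]["z/.zattrs"])       # {'_ARRAY_DIMENSIONS': ['z']}
print(patched["metadata"]["my_xda/.zattrs"])  # {'_ARRAY_DIMENSIONS': ['label', 'z']}
```

Writing the patched dictionary back to .zmetadata with json would be the manual workaround; the per-variable .zattrs files would presumably need the same update to keep the store consistent.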


Thanks!

@TomNicholas TomNicholas added the IO Representation of particular file formats as trees label Nov 24, 2023
@TomNicholas
Collaborator

This is definitely an xarray-level issue, not a datatree-specific issue. All datatree does is open each group of a zarr store using xarray.open_dataset and put them in a tree.

However, developers may find themselves at one point or another with plain Zarr files that are incompatible with the current xarray implementation. So, I think there should be a way to open these Zarr files with no dimension-names.

I have some thoughts about this but I think you should re-raise it on the xarray issue tracker instead!

@eschalkargans
Author

Closing because this is tracked by pydata/xarray#8749.
