Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure KeyError raised for zarr datasets missing dim names #10025

Merged
merged 14 commits into from
Mar 7, 2025

Conversation

oliverwm1
Copy link
Contributor

@oliverwm1 oliverwm1 commented Feb 5, 2025

Add check to ensure that a KeyError is raised for zarr datasets that were not created by xarray.

Copy link

welcome bot commented Feb 5, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@oliverwm1
Copy link
Contributor Author

Locally I am getting a failure I don't understand for default_zarr_format=2 case, related to the NCZarr code path.

/Users/oliverwm/repos/xarray/xarray/backends/zarr.py:1495: in open_zarr
    ds = open_dataset(
/Users/oliverwm/repos/xarray/xarray/backends/api.py:685: in open_dataset
    backend_ds = backend.open_dataset(
/Users/oliverwm/repos/xarray/xarray/backends/zarr.py:1585: in open_dataset
    ds = store_entrypoint.open_dataset(
/Users/oliverwm/repos/xarray/xarray/backends/store.py:44: in open_dataset
    vars, attrs = filename_or_obj.load()
/Users/oliverwm/repos/xarray/xarray/backends/common.py:312: in load
    (_decode_variable_name(k), v) for k, v in self.get_variables().items()
/Users/oliverwm/repos/xarray/xarray/backends/zarr.py:861: in get_variables
    return FrozenDict((k, self.open_store_variable(k)) for k in self.array_keys())
/Users/oliverwm/repos/xarray/xarray/core/utils.py:457: in FrozenDict
    return Frozen(dict(*args, **kwargs))
/Users/oliverwm/repos/xarray/xarray/backends/zarr.py:861: in <genexpr>
    return FrozenDict((k, self.open_store_variable(k)) for k in self.array_keys())
/Users/oliverwm/repos/xarray/xarray/backends/zarr.py:819: in open_store_variable
    dimensions, attributes = _get_zarr_dims_and_attrs(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

zarr_obj = <Array file:///private/var/folders/kn/7wksmxbj6zdgztrgmk6y12w00000gp/T/pytest-of-oliverwm/pytest-221/test_raises_key_error_on_inval0/tmp.zarr/bar shape=(3, 5) dtype=float32>
dimension_key = '_ARRAY_DIMENSIONS', try_nczarr = True

    def _get_zarr_dims_and_attrs(zarr_obj, dimension_key, try_nczarr):
        # Zarr V3 explicitly stores the dimension names in the metadata
        try:
            # if this exists, we are looking at a Zarr V3 array
            # convert None to empty tuple
            dimensions = zarr_obj.metadata.dimension_names or ()
        except AttributeError:
            # continue to old code path
            pass
        else:
            attributes = dict(zarr_obj.attrs)
            if len(zarr_obj.shape) != len(dimensions):
                raise KeyError(
                    "Zarr object is missing the `dimension_names` metadata which is "
                    "required for xarray to determine variable dimensions."
                )
            return dimensions, attributes
    
        # Zarr arrays do not have dimensions. To get around this problem, we add
        # an attribute that specifies the dimension. We have to hide this attribute
        # when we send the attributes to the user.
        # zarr_obj can be either a zarr group or zarr array
        try:
            # Xarray-Zarr
            dimensions = zarr_obj.attrs[dimension_key]
        except KeyError as e:
            if not try_nczarr:
                raise KeyError(
                    f"Zarr object is missing the attribute `{dimension_key}`, which is "
                    "required for xarray to determine variable dimensions."
                ) from e
    
            # NCZarr defines dimensions through metadata in .zarray
            zarray_path = os.path.join(zarr_obj.path, ".zarray")
>           zarray = json.loads(zarr_obj.store[zarray_path])
E           TypeError: 'LocalStore' object is not subscriptable

/Users/oliverwm/repos/xarray/xarray/backends/zarr.py:401: TypeError
==================================================================== short test summary info =====================================================================
FAILED test_backends.py::test_raises_key_error_on_invalid_zarr_store[2] - TypeError: 'LocalStore' object is not subscriptable
========================================================== 1 failed, 1 passed, 2277 deselected in 0.57s ==========================================================

@jhamman

@TomNicholas TomNicholas added the topic-zarr Related to zarr storage library label Feb 5, 2025
@oliverwm1
Copy link
Contributor Author

Test still failing for case zarr-python>=3 and default_zarr_format=2. With latest change, test fails because LocalStore.get is an async function in zarr-python 3:

/home/runner/work/xarray/xarray/xarray/tests/test_backends.py:6143: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/runner/work/xarray/xarray/xarray/backends/zarr.py:1495: in open_zarr
    ds = open_dataset(
/home/runner/work/xarray/xarray/xarray/backends/api.py:685: in open_dataset
    backend_ds = backend.open_dataset(
/home/runner/work/xarray/xarray/xarray/backends/zarr.py:1585: in open_dataset
    ds = store_entrypoint.open_dataset(
/home/runner/work/xarray/xarray/xarray/backends/store.py:44: in open_dataset
    vars, attrs = filename_or_obj.load()
/home/runner/work/xarray/xarray/xarray/backends/common.py:312: in load
    (_decode_variable_name(k), v) for k, v in self.get_variables().items()
/home/runner/work/xarray/xarray/xarray/backends/zarr.py:861: in get_variables
    return FrozenDict((k, self.open_store_variable(k)) for k in self.array_keys())
/home/runner/work/xarray/xarray/xarray/core/utils.py:457: in FrozenDict
    return Frozen(dict(*args, **kwargs))
/home/runner/work/xarray/xarray/xarray/backends/zarr.py:861: in <genexpr>
    return FrozenDict((k, self.open_store_variable(k)) for k in self.array_keys())
/home/runner/work/xarray/xarray/xarray/backends/zarr.py:819: in open_store_variable
    dimensions, attributes = _get_zarr_dims_and_attrs(
/home/runner/work/xarray/xarray/xarray/backends/zarr.py:401: in _get_zarr_dims_and_attrs
    zarray = json.loads(zarr_obj.store.get(zarray_path))

E               TypeError: the JSON object must be str, bytes or bytearray, not coroutine

/home/runner/micromamba/envs/xarray-tests/lib/python3.12/json/__init__.py:339: TypeError

Not sure best solution—should we check zarr version here and handle async function appropriately? Is there a way to get this info that has same API in zarr-python 2 and 3? Or can we just not test with this combination of zarr-python version and default zarr format?

@jhamman
@malmans2 tagging you since it is nczarr-related code raising this issue.

@oliverwm1 oliverwm1 changed the title Ensure KeyError is raised for zarr v3 datasets missing dim names Ensure KeyError raised for zarr datasets missing dim names Feb 18, 2025
@oliverwm1 oliverwm1 marked this pull request as ready for review February 18, 2025 04:24
zarray_str = asyncio.run(zarr_obj.store.get(zarray_path)).to_bytes()
else:
zarray_str = zarr_obj.store.get(zarray_path)
zarray = json.loads(zarray_str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you find it useful if zarr let you do this with a single function? because we could totally add that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not needed to access the 'zarray' info before, so that wouldn't make a big difference to me. But it seems nczarr does use it and I was just trying to get this existing code to work with zarr-python 3.

if _zarr_v3():
import asyncio

zarray_str = asyncio.run(zarr_obj.store.get(zarray_path)).to_bytes()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's no check previously guaranteeing that this is zarr_format=2, an additional check for zarray_paths ending with zarr.json will likely be needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the first try/except block in this function is intended to ensure we have zarr_format=2 here:

https://github.com/oliverwm1/xarray/blob/c1a1c70310f256a1125b08f089e5688075f89d78/xarray/backends/zarr.py#L368-L375

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'll merge once we fix our test failures.

@dcherian dcherian merged commit 70d2f1d into pydata:main Mar 7, 2025
31 checks passed
Copy link

welcome bot commented Mar 7, 2025

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ValueError raised when opening a 'plain' zarr with xarray+zarr 3
5 participants