Description
Zarr version
3.0.0a8
Numcodecs version
0.13.0
Python Version
3.11
Operating System
Mac
Installation
pip in virtual environment
Description
I am reading a kerchunk reference filesystem as a zarr v2 store with zarr-python. The entire reference file is attached, but an example .zarray is as follows:
{
  "shape": [
    1
  ],
  "fill_value": null,
  "zarr_format": 2,
  "order": "C",
  "filters": [
    {
      "id": "zlib",
      "level": 2
    }
  ],
  "dimension_separator": ".",
  "compressor": null,
  "chunks": [
    1
  ],
  "dtype": "<i4"
}
Notably, filters contains zlib, which is a BytesBytesCodec in zarr-python v3+. The issue arises when we look at the codec pipeline created for V2 arrays:
zarr-python/src/zarr/core/array.py
Lines 105 to 113 in 9bce890
def create_codec_pipeline(metadata: ArrayV2Metadata | ArrayV3Metadata) -> CodecPipeline:
    if isinstance(metadata, ArrayV3Metadata):
        return get_pipeline_class().from_codecs(metadata.codecs)
    elif isinstance(metadata, ArrayV2Metadata):
        return get_pipeline_class().from_codecs(
            [V2Filters(metadata.filters), V2Compressor(metadata.compressor)]
        )
    else:
        raise TypeError
This defines the pipeline as two codecs, filters and compressor. The problem here is that V2Filters is defined as an ArrayArrayCodec and V2Compressor is defined as an ArrayBytesCodec. Because of this, all codecs defined in the metadata as filters are expected to be ArrayArrayCodecs and applied once the buffer is translated to an array. Further, the compressor can only define one codec, which is an ArrayBytesCodec, leaving no place to define a BytesBytesCodec in v2 metadata.
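To make the bytes-to-bytes nature of zlib concrete, here is a minimal sketch using only the Python standard library. It is a stand-in for the numcodecs zlib codec, not zarr-python code, and the value 42 is just an illustrative assumption for the single `<i4` element in the chunk.

```python
import struct
import zlib

# Illustrative value for the single "<i4" element (an assumption, not from the real data).
value = 42

# Array -> bytes: serialize the element as a little-endian 32-bit integer.
raw = struct.pack("<i", value)

# Bytes -> bytes: this is the step the "zlib" entry in "filters" describes (level 2).
stored = zlib.compress(raw, 2)

# Reading reverses the steps: decompress the bytes first, then interpret them as "<i4".
decompressed = zlib.decompress(stored)
(decoded_value,) = struct.unpack("<i", decompressed)
```

Both the input and the output of the zlib step are plain bytes, so it fits neither an array-to-array slot nor an array-to-bytes slot.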
With the .zarray above, the pipeline crashes because the zlib codec outputs bytes rather than the array that is expected.
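The mismatch can be illustrated with a toy model. This is only a sketch under stated assumptions: `run_array_array_stage` is a hypothetical helper, not part of zarr-python, and the real pipeline's control flow and error message will differ.

```python
import zlib
from array import array


def run_array_array_stage(codec, chunk: array) -> array:
    """Toy array-to-array stage (hypothetical): rejects non-array codec output."""
    out = codec(chunk)
    if not isinstance(out, array):
        raise TypeError(f"expected an array from the filter, got {type(out).__name__}")
    return out


# A one-element integer chunk, loosely mirroring shape [1] in the .zarray above.
chunk = array("i", [7])

try:
    # zlib placed in the filters slot: it returns bytes, so the stage rejects it.
    run_array_array_stage(lambda c: zlib.compress(c, 2), chunk)
    outcome = "ok"
except TypeError as exc:
    outcome = str(exc)
```

In this toy model the stage fails as soon as the zlib step hands back bytes, which is the same kind of type mismatch described above.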
Steps to reproduce
pip install git+https://github.com/TomAugspurger/zarr-python@xarray-compat git+https://github.com/TomAugspurger/xarray/@fix/zarr-v3
import json
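import fsspec.implementations.reference  # import the submodule explicitly
import xarray as xr
import zarr
# json.load expects an open file object, not a path
with open('test_dict.json') as f:
    test_dict = json.load(f)
# remote_options is not defined in this snippet; supply the options needed to reach the referenced data
fs = fsspec.implementations.reference.ReferenceFileSystem(fo=test_dict, remote_options=remote_options)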
store = zarr.storage.RemoteStore(fs, mode="r")
# This will crash in the codec pipeline
ds = xr.open_dataset(store, engine="zarr", zarr_format=2, backend_kwargs=dict(consolidated=False))
Additional output
No response