[v3] V2 Codec pipeline is not consistent with legacy usage of filters #2325

@mpiannucci

Zarr version

3.0.0a8

Numcodecs version

0.13.0

Python Version

3.11

Operating System

Mac

Installation

pip in virtual environment

Description

I am reading a kerchunk reference filesystem as a zarr v2 store with zarr python. The entire reference file is attached, but an example .zarray is as follows:

    {
        "shape": [
            1
        ],
        "fill_value": null,
        "zarr_format": 2,
        "order": "C",
        "filters": [
            {
                "id": "zlib",
                "level": 2
            }
        ],
        "dimension_separator": ".",
        "compressor": null,
        "chunks": [
            1
        ],
        "dtype": "<i4"
    }

Notably, the filters list contains zlib, which is a BytesBytesCodec in zarr-python v3+. The problem shows up in the codec pipeline that is created for V2 arrays:

    def create_codec_pipeline(metadata: ArrayV2Metadata | ArrayV3Metadata) -> CodecPipeline:
        if isinstance(metadata, ArrayV3Metadata):
            return get_pipeline_class().from_codecs(metadata.codecs)
        elif isinstance(metadata, ArrayV2Metadata):
            return get_pipeline_class().from_codecs(
                [V2Filters(metadata.filters), V2Compressor(metadata.compressor)]
            )
        else:
            raise TypeError

This defines the pipeline as two codecs: filters and compressor. The problem is that V2Filters is defined as an ArrayArrayCodec and V2Compressor is defined as an ArrayBytesCodec. Because of this, every codec listed under filters in the metadata is expected to be an ArrayArrayCodec and is applied once the buffer has been translated to an array. Further, the compressor slot can hold only one codec, and it must be an ArrayBytesCodec, which leaves no place to define a BytesBytesCodec in v2 metadata.
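To make the mismatch concrete, here is a small toy model of the codec ordering rule (any number of array-to-array codecs, then exactly one array-to-bytes codec, then any number of bytes-to-bytes codecs). The names `Kind` and `is_valid_order` are hypothetical and exist only for this sketch; they are not part of zarr-python.

```python
from enum import Enum


class Kind(Enum):
    # Hypothetical labels for the three codec categories discussed above.
    ARRAY_ARRAY = "array -> array"
    ARRAY_BYTES = "array -> bytes"
    BYTES_BYTES = "bytes -> bytes"


def is_valid_order(kinds):
    """Check the ordering: array->array*, one array->bytes, bytes->bytes*."""
    i = 0
    # Skip the leading run of array -> array codecs.
    while i < len(kinds) and kinds[i] is Kind.ARRAY_ARRAY:
        i += 1
    # Exactly one array -> bytes codec must come next.
    if i == len(kinds) or kinds[i] is not Kind.ARRAY_BYTES:
        return False
    # Everything after it must be bytes -> bytes.
    return all(k is Kind.BYTES_BYTES for k in kinds[i + 1:])


# What the V2 pipeline declares: V2Filters, then V2Compressor.
declared_slots = [Kind.ARRAY_ARRAY, Kind.ARRAY_BYTES]

# What the .zarray above actually puts there: zlib sits in the filters slot.
actual_kinds = [Kind.BYTES_BYTES, Kind.ARRAY_BYTES]

declared_ok = is_valid_order(declared_slots)
actual_ok = is_valid_order(actual_kinds)
```

In this model the declared slots pass the check, while the chain implied by the metadata does not, because a bytes-to-bytes codec appears before the array-to-bytes step.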

With the current .zarray above, the pipeline crashes because the zlib codec outputs bytes and not an array as is expected.
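As a rough illustration of what reading such a chunk involves, the sketch below uses only the standard library. It assumes the zlib filter behaves like stdlib `zlib` applied to the chunk's raw bytes, and the value `42` is made up for the example; it is not taken from the attached reference file.

```python
import struct
import zlib

# Encode one "<i4" value the way the .zarray above describes:
# no compressor, and a single zlib filter at level 2.
value = 42  # made-up sample value
raw = struct.pack("<i", value)        # 4 little-endian bytes
stored_chunk = zlib.compress(raw, 2)  # bytes as they would sit in the store

# Reading has to undo the zlib step on bytes first...
decoded_bytes = zlib.decompress(stored_chunk)
# ...and only then interpret the result using the array dtype.
(decoded_value,) = struct.unpack("<i", decoded_bytes)

# The stored payload is longer than the 4 bytes of a single int32,
# so it cannot be viewed as the array directly.
stored_len = len(stored_chunk)
```

The point of the sketch is the ordering: the zlib step operates on bytes, so it cannot run in a slot that expects an array as input on decode.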

test_dict.json

cc @jhamman @martindurant

Steps to reproduce

    pip install git+https://github.com/TomAugspurger/zarr-python@xarray-compat git+https://github.com/TomAugspurger/xarray/@fix/zarr-v3

    import json

    import xarray as xr
    import zarr
    from fsspec.implementations.reference import ReferenceFileSystem

    # json.load expects an open file object, not a path
    with open("test_dict.json") as f:
        test_dict = json.load(f)

    # remote_options is not defined in this report; it must hold the storage
    # options for the data that the references point to
    fs = ReferenceFileSystem(fo=test_dict, remote_options=remote_options)
    store = zarr.storage.RemoteStore(fs, mode="r")

    # This will crash in the codec pipeline
    ds = xr.open_dataset(store, engine="zarr", zarr_format=2, backend_kwargs=dict(consolidated=False))

Additional output

No response

Labels: bug
