@d-v-b
Contributor

@d-v-b d-v-b commented Oct 24, 2025

Ensures that the blosc codec cannot be created with invalid parameters. Fixes #3427.

In main, the default typesize and shuffle values in BloscCodec.__init__ are None. As None is not a valid value for these parameters, BloscCodec().to_dict() fails with an exception:

>>> from zarr.codecs import BloscCodec                                           
>>> BloscCodec().to_dict()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/d-v-b/dev/zarr-python/src/zarr/codecs/blosc.py", line 127, in to_dict
    raise ValueError("`typesize` needs to be set for serialization.")
ValueError: `typesize` needs to be set for serialization.

That's problematic, so in this PR the default values for BloscCodec.__init__ are valid. You can still pass typesize=None, but this emits a deprecation warning, and the None value is replaced by the real default.
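The deprecation path described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: SketchBloscConfig and the two DEFAULT_* constants are hypothetical names, and the default values are assumptions made for the example.

```python
import warnings

# Assumed defaults for this sketch only; the PR description does not state them.
DEFAULT_TYPESIZE = 1
DEFAULT_SHUFFLE = "bitshuffle"


class SketchBloscConfig:
    """Hypothetical stand-in for BloscCodec, not the zarr-python class."""

    def __init__(self, typesize=DEFAULT_TYPESIZE, shuffle=DEFAULT_SHUFFLE):
        if typesize is None:
            # None is still accepted, but it warns and falls back to the default.
            warnings.warn(
                "typesize=None is deprecated; using the default instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            typesize = DEFAULT_TYPESIZE
        if shuffle is None:
            warnings.warn(
                "shuffle=None is deprecated; using the default instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            shuffle = DEFAULT_SHUFFLE
        self.typesize = typesize
        self.shuffle = shuffle

    def to_dict(self):
        # Both fields are always set here, so serialization never sees None.
        return {
            "name": "blosc",
            "configuration": {"typesize": self.typesize, "shuffle": self.shuffle},
        }
```

With this shape, SketchBloscConfig().to_dict() succeeds without arguments, and SketchBloscConfig(typesize=None) emits a DeprecationWarning before substituting the default.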

But I needed to preserve the behavior that motivated making typesize and shuffle nullable in the first place: the need to tune typesize and shuffle based on the data type of the array, in a routine called evolve_from_array_spec. We might ultimately want to avoid this tuning (see footnote 1), but we should deal with that in a different PR. Preserving the old behavior required adding a new attribute, tunable_attrs, to BloscCodec; it tracks the names of the attributes that evolve_from_array_spec is allowed to tune. Setting tunable_attrs to the empty set, set(), lets someone use the default typesize and shuffle parameters without having them tuned by evolve_from_array_spec. Happy to explain more, but the tl;dr is that we needed this new tunable_attrs parameter to keep the old behavior.
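A rough sketch of how a set of tunable attribute names can gate the tuning step. TunableCodec and its evolve method are hypothetical names for illustration; the real BloscCodec takes an ArraySpec rather than a bare item size, and how this PR decides which attributes are tunable by default is an assumption here. The shuffle rule mirrors the one in the earlier evolve_from_array_spec implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TunableCodec:
    """Hypothetical stand-in for BloscCodec, not the zarr-python class."""

    typesize: int = 1  # assumed default
    shuffle: str = "bitshuffle"  # assumed default
    # Names of the attributes that evolve() may overwrite (assumed default).
    tunable_attrs: frozenset[str] = frozenset({"typesize", "shuffle"})

    def evolve(self, item_size: int) -> TunableCodec:
        updates: dict[str, object] = {}
        if "typesize" in self.tunable_attrs:
            updates["typesize"] = item_size
        if "shuffle" in self.tunable_attrs:
            updates["shuffle"] = "bitshuffle" if item_size == 1 else "shuffle"
        # replace() returns a modified copy; the original codec is untouched.
        return replace(self, **updates)


# Defaults are tuned to an 8-byte data type.
tuned = TunableCodec().evolve(item_size=8)

# An empty set pins the defaults: nothing is tuned.
pinned = TunableCodec(tunable_attrs=frozenset()).evolve(item_size=8)
```

Here tuned ends up with typesize 8 and shuffle "shuffle", while pinned keeps typesize 1 and shuffle "bitshuffle".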

Footnotes

  1. BloscCodec is a bytes -> bytes codec, so there's no guarantee that input is partitioned into array-data-type-sized words. The byte stream it consumes is based on the behavior of the codecs that come before it, which might do lots of different things to the byte stream.
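To make the footnote concrete, here is a pure-Python sketch of what a byte shuffle does with a given typesize. It is a conceptual illustration only, not Blosc's actual implementation, and byte_shuffle is a hypothetical helper. The point is that the shuffle only groups bytes usefully when typesize matches the word size actually present in the byte stream.

```python
import struct


def byte_shuffle(buf: bytes, typesize: int) -> bytes:
    """Group byte 0 of every typesize-wide element, then byte 1, and so on."""
    n_elements = len(buf) // typesize
    end = n_elements * typesize
    shuffled = b"".join(buf[i:end:typesize] for i in range(typesize))
    # Trailing bytes that do not fill a whole element are left as they are.
    return shuffled + buf[end:]


# Four little-endian 64-bit integers: 0, 1, 2, 3.
raw = struct.pack("<4q", 0, 1, 2, 3)

# typesize=8 matches the data: the four low bytes end up adjacent,
# followed by a long run of zeros.
matched = byte_shuffle(raw, 8)

# typesize=1 is a no-op: every "element" is a single byte.
unchanged = byte_shuffle(raw, 1)
```

If an earlier codec in the pipeline rewrites the bytes so they are no longer laid out as 8-byte words, a typesize of 8 derived from the array data type no longer describes the stream that the blosc codec receives.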

@d-v-b d-v-b requested a review from normanrz October 24, 2025 10:11
@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Oct 24, 2025
@codecov

codecov bot commented Oct 24, 2025

Codecov Report

❌ Patch coverage is 37.73585% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.74%. Comparing base (fc8e8ad) to head (f2f41d8).
⚠️ Report is 1 commits behind head on main.

Files with missing lines    Patch %   Lines
src/zarr/codecs/blosc.py    41.66%    28 Missing ⚠️
src/zarr/core/common.py     0.00%     5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3545      +/-   ##
==========================================
- Coverage   61.84%   61.74%   -0.10%     
==========================================
  Files          85       85              
  Lines       10145    10179      +34     
==========================================
+ Hits         6274     6285      +11     
- Misses       3871     3894      +23     
Files with missing lines    Coverage Δ
src/zarr/core/common.py     47.20% <0.00%> (-1.97%) ⬇️
src/zarr/codecs/blosc.py    41.40% <41.66%> (-1.02%) ⬇️

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Oct 24, 2025
@d-v-b d-v-b merged commit fe42655 into zarr-developers:main Oct 24, 2025
31 checks passed
@d-v-b d-v-b deleted the chore/fix-blosc-defaults branch October 24, 2025 15:32
@ilan-gold
Contributor

ilan-gold commented Oct 24, 2025

Did I do something incorrect here to get this error?

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
from __future__ import annotations

import numpy as np
import zarr
from zarr.codecs import BloscCodec

compressor = BloscCodec(cname="zstd", clevel=3, shuffle="bitshuffle")
data = np.arange(10)
zarr.create_array("foo.zarr", data=data, compressors=(compressor, ), overwrite=True)
f = zarr.open("foo.zarr")
assert f.compressors[0].to_dict() == compressor.to_dict(), (
    f"read back: {f.compressors[0].to_dict()} and actual {compressor.to_dict()}"
)

Here is the assertion error (with the two configs):

Traceback (most recent call last):
  File "/Users/ilangold/Projects/Theis/anndata/shards_type.py", line 19, in <module>
    assert f.compressors[0].to_dict() == compressor.to_dict(), (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: read back: {'name': 'blosc', 'configuration': {'typesize': 8, 'cname': 'zstd', 'clevel': 3, 'shuffle': 'shuffle', 'blocksize': 0}} and actual {'name': 'blosc', 'configuration': {'typesize': 1, 'cname': 'zstd', 'clevel': 3, 'shuffle': 'bitshuffle', 'blocksize': 0}}

Apologies if this is just noise, maybe I missed something here

@d-v-b
Contributor Author

d-v-b commented Oct 24, 2025

I think this is expected -- the evolve_from_array_spec method on BloscCodec is used internally at array metadata creation time to create a copy of the original BloscCodec with the typesize and shuffle parameters tuned to the array's data type. So in your case, the information that the data has a 64-bit data type leads to a new BloscCodec with typesize set to 8. I'm not sure this behavior is right, but it's what we have been doing since 3.0.

@ilan-gold
Contributor

ilan-gold commented Oct 24, 2025

Good point - but what of shuffle (which is shuffle in the read-back-in codec but bitshuffle in the setting one)?

@d-v-b
Contributor Author

d-v-b commented Oct 24, 2025

Good point - but what of shuffle (which is shuffle in the read-back-in codec but bitshuffle in the setting one)?

that's also consistent with the old implementation:

    def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
        item_size = 1
        if isinstance(array_spec.dtype, HasItemSize):
            item_size = array_spec.dtype.item_size
        new_codec = self
        if new_codec.typesize is None:
            new_codec = replace(new_codec, typesize=item_size)
        if new_codec.shuffle is None:
            new_codec = replace(
                new_codec,
                shuffle=(BloscShuffle.bitshuffle if item_size == 1 else BloscShuffle.shuffle),
            )

        return new_codec
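For anyone who wants to poke at the rule in the method above without installing zarr, here is a standalone restatement. tune_old is a hypothetical helper that takes the item size directly instead of an ArraySpec, and uses plain strings in place of the BloscShuffle enum.

```python
from typing import Optional, Tuple


def tune_old(
    typesize: Optional[int], shuffle: Optional[str], item_size: int
) -> Tuple[int, str]:
    """Fill in typesize and shuffle from item_size, but only where they are None."""
    if typesize is None:
        typesize = item_size
    if shuffle is None:
        shuffle = "bitshuffle" if item_size == 1 else "shuffle"
    return typesize, shuffle


# Both values unset, 8-byte items: typesize follows the item size.
both_unset_wide = tune_old(None, None, item_size=8)

# Both values unset, 1-byte items: bitshuffle is chosen.
both_unset_narrow = tune_old(None, None, item_size=1)

# Values that are not None pass through unchanged.
both_set = tune_old(4, "noshuffle", item_size=8)
```

The three calls return (8, "shuffle"), (1, "bitshuffle") and (4, "noshuffle") respectively.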

Development

Successfully merging this pull request may close these issues.

BloscCodec is broken

3 participants