- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 366
fix invalid blosc defaults #3545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Codecov Report❌ Patch coverage is  
 Additional details and impacted files@@            Coverage Diff             @@
##             main    #3545      +/-   ##
==========================================
- Coverage   61.84%   61.74%   -0.10%     
==========================================
  Files          85       85              
  Lines       10145    10179      +34     
==========================================
+ Hits         6274     6285      +11     
- Misses       3871     3894      +23     
 🚀 New features to boost your workflow:
 | 
| Did I do something incorrect here to get this error? # /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
from __future__ import annotations
import numpy as np
import zarr
from zarr.codecs import BloscCodec
compressor = BloscCodec(cname="zstd", clevel=3, shuffle="bitshuffle")
data = np.arange(10)
zarr.create_array("foo.zarr", data=data, compressors=(compressor, ), overwrite=True)
f = zarr.open("foo.zarr")
assert f.compressors[0].to_dict() == compressor.to_dict(), (
    f"read back: {f.compressors[0].to_dict()} and actual {compressor.to_dict()}"
)Here is the assertion (with the two configs): Traceback (most recent call last):
  File "/Users/ilangold/Projects/Theis/anndata/shards_type.py", line 19, in <module>
    assert f.compressors[0].to_dict() == compressor.to_dict(), (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: read back: {'name': 'blosc', 'configuration': {'typesize': 8, 'cname': 'zstd', 'clevel': 3, 'shuffle': 'shuffle', 'blocksize': 0}} and actual {'name': 'blosc', 'configuration': {'typesize': 1, 'cname': 'zstd', 'clevel': 3, 'shuffle': 'bitshuffle', 'blocksize': 0}}Apologies if this is just noise, maybe I missed something here | 
| I think this is expected -- the evolve_from_array_spec method on the  | 
| Good point - but what of  | 
| 
 that's also consistent with the old implementation:     def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
        item_size = 1
        if isinstance(array_spec.dtype, HasItemSize):
            item_size = array_spec.dtype.item_size
        new_codec = self
        if new_codec.typesize is None:
            new_codec = replace(new_codec, typesize=item_size)
        if new_codec.shuffle is None:
            new_codec = replace(
                new_codec,
                shuffle=(BloscShuffle.bitshuffle if item_size == 1 else BloscShuffle.shuffle),
            )
        return new_codec | 
Ensures that the blosc codec cannot be created with invalid parameters. fixes #3427.
In
main, the defaulttypesizeandshufflevalues inBloscCodec.__init__areNone. AsNoneis not a valid value for these parameters,BloscCodec().to_dict()fails with an exception:That's problematic, so in this PR the default values for
BloscCodec.__init__are valid. You can still passtypesize=None, but this emits a deprecation warning, and theNonevalue is replaced by the real default.But I needed to preserve the behavior that motivated making
typesizeandshufflenullable in the first place, which is the need to tunetypesizeandshufflebased on the data type of the array in a routine calledevolve_from_array_spec. We might ultimately want to avoid this tuning1, but we should deal with that in a different PR. Preserving the old behavior required adding a new attributetunable_attrsto theBloscCodecwhich tracks the names of attributes that can be tuned byevolve_from_array_spec. Settingtunable_attrsto the empty set{}allows someone to use default thetypesizeandshuffleparameters without having them tuned byevolve_from_array_spec. Happy to explain more but the tl;dr is that we needed this newtunable_attrsparameter to keep the old behavior.Footnotes
BloscCodec is a
bytes -> bytescodec, so there's no guarantee that input is partitioned into array-data-type-sized words. The byte stream it consumes is based on the behavior of the codecs that come before it, which might do lots of different things to the byte stream. ↩