use compressor, filters, post_compressor for Array v3 create #1944

Draft · wants to merge 10 commits into base: v3

Conversation

@d-v-b (Contributor) commented Jun 2, 2024

In terms of abstraction levels, this pushes the codecs kwarg below the array creation API. Instead, we use the kwarg "filters" to denote ArrayArray codecs, "compressor" to denote the ArrayBytes codec, and "post_compressors" to denote the BytesBytesCodecs. This makes the top-level array creation API more explicit AND more similar to v2. Implementation of ideas expressed in #1943.
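
To make the shape of this concrete, here is a rough sketch of how the three kwargs could map onto a codec pipeline. The codec classes and the `create` signature below are minimal stand-ins for illustration, not the actual zarr-python implementation:

```python
from dataclasses import dataclass


class ArrayArrayCodec: ...
class ArrayBytesCodec: ...
class BytesBytesCodec: ...


@dataclass
class CodecPipeline:
    filters: tuple[ArrayArrayCodec, ...] = ()           # array -> array transforms
    compressor: ArrayBytesCodec | None = None           # array -> bytes encoding
    post_compressors: tuple[BytesBytesCodec, ...] = ()  # bytes -> bytes transforms


def create(shape, *, filters=(), compressor=None, post_compressors=()):
    """Hypothetical create() that takes the three explicit kwargs instead of `codecs`."""
    return CodecPipeline(tuple(filters), compressor, tuple(post_compressors))
```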

@rabernat (Contributor) commented Jun 2, 2024

👍 I like this as an API.

@d-v-b requested review from normanrz and jhamman and removed the request for normanrz on June 4, 2024
@normanrz (Contributor) commented Jun 4, 2024

Not sure I am a fan. Codecs are a main difference between the v2 and v3 specs. I think the v3 codec pipeline is superior to the v2 compressors+filters. I would like to see a v3-first interface instead of shoehorning the new pipeline into the legacy interface.

I think this will be more confusing than helpful. All typical compressors for v2 arrays (e.g. blosc, zstd, gzip) are actually BytesBytesCodecs. Now, that would need to go in compressor for v2 arrays and post_compressor for v3 arrays? Essentially, the only compressors for v3 would be bytes and sharding_indexed. That just doesn't make sense.

I don't mind adding compressor and filters as optional kwargs to the v3 create interface. If set, they could be used to build a codec pipeline: filters + [BytesCodec(), compressor]. But, I would like to keep the codecs kwarg (which should take precedence over compressor+filters) and promote that as the recommended way of setting codecs for v3.
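
For illustration, a minimal sketch of that fallback, assuming a stand-in `BytesCodec` class; `resolve_codecs` is a hypothetical helper, not zarr-python's API:

```python
class BytesCodec: ...  # stand-in for the default array -> bytes codec


def resolve_codecs(codecs=None, filters=None, compressor=None):
    # an explicit codec list takes precedence over compressor + filters
    if codecs is not None:
        return list(codecs)
    pipeline = list(filters or [])     # ArrayArray codecs first
    pipeline.append(BytesCodec())      # then the array -> bytes conversion
    if compressor is not None:
        pipeline.append(compressor)    # BytesBytes compressor last
    return pipeline
```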

@d-v-b (Contributor, Author) commented Jun 4, 2024

I think this will be more confusing than helpful. All typical compressors for v2 arrays (e.g. blosc, zstd, gzip) are actually BytesBytesCodecs. Now, that would need to go in compressor for v2 arrays and post_compressor for v3 arrays?

This is a good point -- I had forgotten that the old compressors are not actually ArrayBytesCodecs, and I do think this fact would be a source of confusion in the proposed API. But I also think users will find the codecs list confusing. It has structure, but we do not expose that structure in the user-facing API; I think users will find it frustrating to have to intuit structure that we could make explicit.

Essentially, the only compressors for v3 would be bytes and sharding_indexed. That just doesn't make sense.

Given that there are only 2 options for the ArrayBytesCodec, and one of them is sharding, then I wonder if it would make sense to have a top-level sharding keyword argument that defaults to None (resulting in the BytesCodec getting used). I'm not sure exactly how users would customize the BytesCodec here, but this would at least guide users to sharding more easily than the opaque codecs list.

Are there any other ideas for what a "v3-first" API would look like here? I think codecs: tuple[Codec, ...] will not be popular.
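
For example, the sharding kwarg idea might look roughly like this. This is purely a sketch: the `sharding` parameter and the `ShardingCodec` constructor arguments are assumptions, not the real API:

```python
class BytesCodec: ...  # stand-in for the real codec classes


class ShardingCodec:
    def __init__(self, chunk_shape, codecs):
        self.chunk_shape = chunk_shape  # shape of the inner chunks within each shard
        self.codecs = codecs            # codecs applied to each inner chunk


def array_bytes_codec(sharding=None, inner_codecs=None):
    # sharding=None -> plain BytesCodec; otherwise wrap the inner codecs in a sharding codec
    if sharding is None:
        return BytesCodec()
    return ShardingCodec(chunk_shape=sharding, codecs=list(inner_codecs or [BytesCodec()]))
```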

@normanrz (Contributor) commented Jun 4, 2024

But I also think users will find the codecs list confusing. It has structure, but we do not expose that structure in the user-facing API; I think users will find it frustrating to have to intuit structure that we could make explicit.

It is closest to the spec, though.

Given that there are only 2 options for the ArrayBytesCodec, and one of them is sharding, then I wonder if it would make sense to have a top-level sharding keyword argument that defaults to None (resulting in the BytesCodec getting used).

Adding sharding as a kwarg sounds interesting.

Are there any other ideas for what a "v3-first" API would look like here? I think codecs: tuple[Codec, ...] will not be popular.

I think it could work well with some UX tweaks such as automatic BytesCodec insertion and single-value support. That would automatically turn codecs=BloscCodec() into codecs=(BytesCodec(), BloscCodec()), codecs=TransposeCodec() into codecs=(TransposeCodec(), BytesCodec()), and codecs=(TransposeCodec(), ZstdCodec()) into codecs=(TransposeCodec(), BytesCodec(), ZstdCodec()).
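
A sketch of that normalization. The class names below are stand-ins mirroring the codec hierarchy, and `normalize_codecs` is a hypothetical helper, not the actual implementation:

```python
class ArrayArrayCodec: ...
class ArrayBytesCodec: ...
class BytesBytesCodec: ...
class TransposeCodec(ArrayArrayCodec): ...
class BytesCodec(ArrayBytesCodec): ...
class ZstdCodec(BytesBytesCodec): ...


def normalize_codecs(codecs):
    # single-value support: accept one codec or a sequence of codecs
    codecs = list(codecs) if isinstance(codecs, (tuple, list)) else [codecs]
    if not any(isinstance(c, ArrayBytesCodec) for c in codecs):
        # insert BytesCodec after the ArrayArray codecs, before any BytesBytes codecs
        split = next(
            (i for i, c in enumerate(codecs) if isinstance(c, BytesBytesCodec)),
            len(codecs),
        )
        codecs.insert(split, BytesCodec())
    return tuple(codecs)


# normalize_codecs(ZstdCodec())                     -> (BytesCodec(), ZstdCodec())
# normalize_codecs(TransposeCodec())                -> (TransposeCodec(), BytesCodec())
# normalize_codecs((TransposeCodec(), ZstdCodec())) -> (TransposeCodec(), BytesCodec(), ZstdCodec())
```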

@rabernat (Contributor) commented Jun 5, 2024

It is closest to the spec, though.

I do not think that we need to constrain the Python API so closely to the spec. We should think about what would be most clear and convenient for our users. Specs are invisible implementation details to 99% of users. They are necessary for interoperability but not something users need to be exposed to directly. Do you think about the HTTP spec when you submit a comment on GitHub? 😆

I definitely think we need an API compatibility layer with the V2 syntax ("compressor", "codecs").

@jhamman (Member) commented Jun 5, 2024

Might we consider experimenting with the top level API (e.g. #1884) rather than the Array class constructors? I've been thinking of separate signatures (a la mypy overloads) for v2 and v3 arrays. I suspect we may find a reasonable path there but if not, we could always provide a different API that abstracts over the two sets of spec-specific keywords.
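
For example, the overload idea might look roughly like this; the parameter lists are illustrative only, not a concrete proposal:

```python
from typing import Any, Literal, overload


class Array: ...  # stand-in for zarr's Array class


@overload
def create(*, zarr_format: Literal[2], compressor: Any = ..., filters: Any = ...) -> Array: ...
@overload
def create(*, zarr_format: Literal[3] = 3, codecs: Any = ...) -> Array: ...
def create(*, zarr_format: int = 3, **kwargs: Any) -> Array:
    # dispatch on zarr_format and validate the spec-specific keywords here
    return Array()
```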

@rabernat (Contributor) commented Jun 5, 2024

That sounds very reasonable to me. Maybe the class constructors can adhere more strictly to the spec and internal structure, while the top-level API provides backwards compatibility and syntactic sugar.

The main downside is that these two APIs violate the Zen of Python: "There should be one-- and preferably only one --obvious way to do it."

@d-v-b (Contributor, Author) commented Jun 5, 2024

I still believe that there is a model that can express v2 and v3, see the following table:

| v2 | v3 |
| --- | --- |
| `filters: tuple[ArrayArrayCodec]` | `filters: tuple[ArrayArrayCodec]` |
| N/A | `pre_compressor: ArrayBytesCodec` |
| `compressor: BytesBytesCodec` | `compressor: tuple[BytesBytesCodec]` |

  • as @normanrz pointed out, "compressor" actually maps better to BytesBytesCodec, so we go with that, and remove the "post_compressor" kwarg.
  • V2 compressor will be internally normalized to a tuple with a single element, so that it plays nice with the v3 compression machinery.
  • We introduce "pre_compressor", which we can rename if we want. "pre_compressor" must be empty for v2 arrays. Otherwise, this API handles both v2 and v3.

Thoughts? I will update this PR along these lines. I really want this API to be good. If it's painful or opaque for users, they will make mistakes or fail to use features in the library.
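
A sketch of how that unified model could be normalized internally. The class names are stand-ins and `build_codecs` is a hypothetical helper, shown only to make the table concrete:

```python
class ArrayArrayCodec: ...
class ArrayBytesCodec: ...
class BytesBytesCodec: ...
class BytesCodec(ArrayBytesCodec): ...


def build_codecs(zarr_format, filters=(), pre_compressor=None, compressor=()):
    # a v2-style single compressor is normalized to a one-element tuple so both
    # formats flow through the same code path
    if not isinstance(compressor, (tuple, list)):
        compressor = () if compressor is None else (compressor,)
    if zarr_format == 2:
        if pre_compressor is not None:
            raise ValueError("pre_compressor must be empty for v2 arrays")
        return (*filters, *compressor)
    # v3: default the array -> bytes step to BytesCodec when no pre_compressor is given
    pre = pre_compressor if pre_compressor is not None else BytesCodec()
    return (*filters, pre, *compressor)
```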

@d-v-b (Contributor, Author) commented Jun 5, 2024

I am also thinking about how we can make the sharding conceptualization simple. One idea would be to express unsharded arrays as simply a special case of sharded arrays.

@normanrz (Contributor) commented Jun 5, 2024

Thoughts? I will update this PR along these lines. I really want this API to be good. If it's painful or opaque for users, they will make mistakes or fail to use features in the library.

pre_compressor could just be array_bytes_codec. I doubt most users will care about it, if there is a dedicated way to set sharding.

I am also thinking about how we can make the sharding conceptualization simple. One idea would be to express unsharded arrays as simply a special case of sharded arrays.

What would that look like? filters and compressor could be used for the internal codecs. How would things like index_location or index_codecs (or chunk_layout in the future) be controlled? While a bit of an edge case, there is also the possibility of nested sharding.

I don't think the API needs to be the codecs tuple we have right now, but it shouldn't be more confusing and less expressive.

@rabernat (Contributor) commented Jun 5, 2024

One idea would be to express unsharded arrays as simply a special case of sharded arrays.

I think that's a great idea.

When it's time to fetch data to satisfy a user query, we have a data structure kind of like this:

```python
from dataclasses import dataclass


@dataclass
class ChunkReference:
    # "StorePath" and "ChunkKey" stand for whatever types the implementation
    # uses for store paths and chunk keys
    store_path: "StorePath"
    range: tuple[int, int] | None  # optional byte range within the path to fetch


# mapping from chunk key to the reference needed to fetch that chunk
ChunkRequest = dict["ChunkKey", ChunkReference]
```

We can produce this data structure after scanning the shard index. It's also the same sort of information that is generated by kerchunk-style virtual Zarr datasets. For non-sharded data, the range would be unknown.

Once we have this data structure, we can make two potential optimizations:

  • Coalescing: combining multiple small requests for the same StorePath into a single request (a rough sketch follows below)
  • Splitting: taking large requests (e.g. > 8 MB) and splitting them into smaller requests (to improve throughput)
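
For concreteness, one rough sketch of the coalescing step over such a ChunkRequest mapping; the gap threshold and the `coalesce` helper are made up for illustration:

```python
from collections import defaultdict

GAP_TOLERANCE = 1024 * 1024  # merge ranges separated by less than ~1 MB (arbitrary choice)


def coalesce(chunk_request) -> dict:
    """Group byte ranges by store path and merge ranges that sit close together."""
    by_path = defaultdict(list)
    for ref in chunk_request.values():
        if ref.range is not None:          # non-sharded chunks have no sub-range to merge
            by_path[ref.store_path].append(ref.range)
    merged = {}
    for path, ranges in by_path.items():
        ranges.sort()
        out = [ranges[0]]
        for start, stop in ranges[1:]:
            if start - out[-1][1] <= GAP_TOLERANCE:
                # extend the previous merged range to cover this one
                out[-1] = (out[-1][0], max(out[-1][1], stop))
            else:
                out.append((start, stop))
        merged[path] = out                 # one (or a few) larger requests per store path
    return merged
```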

Is this at all compatible with how the sharding codec is currently implemented?

@normanrz (Contributor) commented Jun 5, 2024

Is this at all compatible with how the sharding codec is currently implemented?

Currently, the codec issues separate partial get requests, but it could be turned into batched fetching.

@LDeakin commented Jun 6, 2024

I think it is a good idea to simplify the API with something like filters/compressor, but compressor should not strictly map to BytesBytesCodec. Possible future compressors like zfp and pcodec make sense as ArrayBytesCodecs, as they compress arrays/elements, rather than bytes directly.

@normanrz (Contributor) commented Jun 6, 2024

I think we could use compressor: tuple[ArrayBytesCodec | BytesBytesCodec, ...] | ArrayBytesCodec | BytesBytesCodec | None with runtime validation that there is max 1 ArrayBytesCodec. If no ArrayBytesCodec is supplied, we can auto-add a BytesCodec.
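
A sketch of that validation and normalization, with stand-in classes; `normalize_compressor` is hypothetical, not the actual zarr-python implementation:

```python
class ArrayBytesCodec: ...
class BytesBytesCodec: ...
class BytesCodec(ArrayBytesCodec): ...


def normalize_compressor(compressor=None):
    # accept None, a single codec, or a tuple/list of codecs
    if compressor is None:
        codecs = []
    elif isinstance(compressor, (tuple, list)):
        codecs = list(compressor)
    else:
        codecs = [compressor]
    n_array_bytes = sum(isinstance(c, ArrayBytesCodec) for c in codecs)
    if n_array_bytes > 1:
        raise ValueError("at most one ArrayBytesCodec is allowed in `compressor`")
    if n_array_bytes == 0:
        codecs.insert(0, BytesCodec())  # default array -> bytes encoding
    return tuple(codecs)
```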

@d-v-b marked this pull request as draft on June 14, 2024
@jhamman added the "V3" label (Related to compatibility with V3 spec) on Jul 1, 2024