Adds index_codecs to the sharding codec #253

normanrz · 2023-07-19T18:37:16Z

This PR adds the index_codecs configuration to the sharding codec. It also adds the crc32c checksum bytes-to-bytes codec.

Index codecs have been proposed by @jbms: #152 (comment)
There is a prototype implementation in zarrita: https://github.com/scalableminds/zarrita/blob/codec-pipeline/zarrita/sharding.py#L507-L518

normanrz · 2023-07-19T18:45:57Z

@jbms @jstriebel Please take a look. If we decide to go with this, it would be good to merge this before we send ZEP2 to the ZIC.

docs/v3/codecs/sharding-indexed/v1.0.rst

jbms · 2023-07-19T18:47:44Z

Thanks! Looks good to me other than the comment I raised.

mkitti · 2023-07-22T00:11:02Z

I'm a bit confused how c.compute_encoded_size will work. To use this, I have to compute the byte size of the decoded representation? Is there some function to compute the size of the decoded representation?

Rather I think the input should just be n, the number of inner chunks. Perhaps there should also be a c.compute_decoded_size as well which also takes n as an input.

jbms · 2023-07-22T04:53:37Z

> I'm a bit confused how c.compute_encoded_size will work. To use this, I have to compute the byte size of the decoded representation? Is there some function to compute the size of the decoded representation?

I think this could indeed be clarified, since we have "array -> array", "array -> bytes", and "bytes -> bytes" codecs to consider. For "array -> array" and "array -> bytes", the input is an array and therefore the relevant size would be the shape of the array, not its size in bytes. For "bytes -> bytes" codecs the input size would be in bytes.

> Rather I think the input should just be n, the number of inner chunks. Perhaps there should also be a c.compute_decoded_size as well which also takes n as an input.

normanrz · 2023-07-22T04:53:52Z

> I'm a bit confused how c.compute_encoded_size will work. To use this, I have to compute the byte size of the decoded representation? Is there some function to compute the size of the decoded representation?

The size of the decoded representation is product(shape) * dtype.itemsize. The shape would be the chunk shape or index shape depending on the context. c.compute_encoded_size can be implemented for all codecs that produce non-variable-length outputs.

Please note that these are just implementation suggestions. Some implementations may choose to do this differently.
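As a rough illustration of the size computation described above, the following sketch applies product(shape) * dtype.itemsize to a shard index and then chains two fixed-size codecs. The helper names are hypothetical, and the index layout (one pair of uint64 values per inner chunk) and the 4-byte crc32c suffix are taken as assumptions from the sharding and crc32c codec documents rather than from this thread.

```python
from math import prod


def decoded_size(shape, itemsize):
    """Byte size of the decoded representation: product(shape) * itemsize."""
    return prod(shape) * itemsize


def bytes_codec_encoded_size(input_size):
    # Assumption: a plain byte serialization does not change the size.
    return input_size


def crc32c_encoded_size(input_size):
    # Assumption: the crc32c codec appends a 4-byte checksum.
    return input_size + 4


# Shard index for 4 x 4 inner chunks, assuming one (offset, nbytes)
# pair of uint64 values (8 bytes each) per inner chunk.
chunks_per_shard = (4, 4)
index_shape = chunks_per_shard + (2,)
index_decoded = decoded_size(index_shape, itemsize=8)
index_encoded = crc32c_encoded_size(bytes_codec_encoded_size(index_decoded))
print(index_decoded, index_encoded)  # 256 260
```

Because every codec in the chain has a fixed output size, the encoded index size is known before reading any data, which is what allows a reader to locate the index inside a shard.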

normanrz · 2023-07-22T04:55:05Z

> > I'm a bit confused how c.compute_encoded_size will work. To use this, I have to compute the byte size of the decoded representation? Is there some function to compute the size of the decoded representation?
>
> I think this could indeed be clarified, since we have "array -> array", "array -> bytes", and "bytes -> bytes" codecs to consider. For "array -> array" and "array -> bytes", the input is an array and therefore the relevant size would be the shape of the array, not its size in bytes. For "bytes -> bytes" codecs the input size would be in bytes.

That makes sense. I'll clarify that.

normanrz · 2023-07-22T05:11:06Z

> For "array -> array" and "array -> bytes", the input is an array and therefore the relevant size would be the shape of the array, not its size in bytes.

Actually, this can be complicated for array -> array codecs that change the dtype. For these cases, c.compute_encoded_size would need to track the dtype.

jbms · 2023-07-22T05:26:16Z

Yes, the "array -> array" codec should perhaps also indicate the output dtype.

Possibly it is easier to just leave these implementation details unspecified.
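One possible way to handle the dtype issue raised above is to have array -> array codecs return a small description of their output (shape plus item size) instead of a single number. The sketch below is only an illustration: ArraySpec, compute_encoded_spec, CastCodec, and pipeline_encoded_size are hypothetical names, and only c.compute_encoded_size comes from the discussion.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class ArraySpec:
    shape: tuple
    itemsize: int  # bytes per element; stands in for the dtype


class TransposeCodec:
    """array -> array: permutes the shape, keeps the dtype."""

    def __init__(self, order):
        self.order = order

    def compute_encoded_spec(self, spec):
        return ArraySpec(tuple(spec.shape[i] for i in self.order), spec.itemsize)


class CastCodec:
    """Hypothetical array -> array codec that changes the dtype."""

    def __init__(self, out_itemsize):
        self.out_itemsize = out_itemsize

    def compute_encoded_spec(self, spec):
        return ArraySpec(spec.shape, self.out_itemsize)


class BytesCodec:
    """array -> bytes: the input is an array, the output size is in bytes."""

    def compute_encoded_size(self, spec):
        return prod(spec.shape) * spec.itemsize


class Crc32cCodec:
    """bytes -> bytes: input and output sizes are both in bytes."""

    def compute_encoded_size(self, nbytes):
        return nbytes + 4  # assumption: 4-byte checksum suffix


def pipeline_encoded_size(spec, array_codecs, array_bytes_codec, bytes_codecs):
    # Track shape and item size through the array -> array codecs first.
    for codec in array_codecs:
        spec = codec.compute_encoded_spec(spec)
    nbytes = array_bytes_codec.compute_encoded_size(spec)
    for codec in bytes_codecs:
        nbytes = codec.compute_encoded_size(nbytes)
    return nbytes


size = pipeline_encoded_size(
    ArraySpec((16, 32), itemsize=8),
    [TransposeCodec((1, 0)), CastCodec(out_itemsize=4)],
    BytesCodec(),
    [Crc32cCodec()],
)
print(size)  # 16 * 32 * 4 + 4 = 2052
```

As noted in the thread, these are implementation details that the specification may simply leave open.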

jstriebel

LGTM, the generalization of the shard index representation is great 🎉

docs/v3/codecs/crc32c/v1.0.rst

Co-authored-by: Jonathan Striebel <jstriebel@users.noreply.github.com>

normanrz · 2023-07-25T10:25:22Z

Thanks for the feedback! I added some more text to clarify the c.compute_encoded_size signatures.

mkitti · 2023-07-25T18:24:14Z

The documentation looks improved.

What is the current method to identify if a codec is one of array -> array, array -> bytes, or bytes -> bytes?

jbms · 2023-07-25T18:32:44Z

> The documentation looks improved.
>
> What is the current method to identify if a codec is one of array -> array, array -> bytes, or bytes -> bytes?

You can't tell just from looking at the JSON, but any supported codec must be known to the implementation, and it is up to the implementation exactly how this is managed.

In tensorstore there is a class for each codec, which inherits from one of ZarrArrayToArrayCodecSpec, ZarrArrayToBytesCodecSpec, or ZarrBytesToBytesCodecSpec. These in turn all inherit from ZarrCodecSpec, which has a virtual method ZarrCodecKind kind(), where ZarrCodecKind is defined as enum ZarrCodecKind { kArrayToArray, kArrayToBytes, kBytesToBytes }.

To parse the JSON codec list, it first parses it as a generic list of ZarrCodecSpec and then sorts the entries into a separate list of array -> array codecs, the array -> bytes codec, and a list of bytes -> bytes codecs.
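A Python analogue of this approach might look like the sketch below. It is not tensorstore's code: CodecKind, CODEC_KINDS, and split_codecs are hypothetical names, the registry entries are illustrative, and the ordering check (array -> array codecs first, then exactly one array -> bytes codec, then bytes -> bytes codecs) is an assumption about how an implementation would validate the list.

```python
from enum import Enum, auto


class CodecKind(Enum):
    ARRAY_TO_ARRAY = auto()
    ARRAY_TO_BYTES = auto()
    BYTES_TO_BYTES = auto()


# Illustrative registry: the implementation, not the JSON, knows each kind.
CODEC_KINDS = {
    "transpose": CodecKind.ARRAY_TO_ARRAY,
    "bytes": CodecKind.ARRAY_TO_BYTES,
    "sharding_indexed": CodecKind.ARRAY_TO_BYTES,
    "gzip": CodecKind.BYTES_TO_BYTES,
    "crc32c": CodecKind.BYTES_TO_BYTES,
}


def split_codecs(codec_list):
    """Split a parsed JSON codec list into the three groups, checking order."""
    array_array, array_bytes, bytes_bytes = [], [], []
    for codec in codec_list:
        name = codec["name"]
        if name not in CODEC_KINDS:
            raise ValueError(f"unknown codec: {name!r}")
        kind = CODEC_KINDS[name]
        if kind is CodecKind.ARRAY_TO_ARRAY:
            if array_bytes:
                raise ValueError("array -> array codec after array -> bytes codec")
            array_array.append(codec)
        elif kind is CodecKind.ARRAY_TO_BYTES:
            if array_bytes:
                raise ValueError("more than one array -> bytes codec")
            array_bytes.append(codec)
        else:
            if not array_bytes:
                raise ValueError("bytes -> bytes codec before array -> bytes codec")
            bytes_bytes.append(codec)
    if not array_bytes:
        raise ValueError("missing array -> bytes codec")
    return array_array, array_bytes[0], bytes_bytes


# Example list in the style of an index_codecs entry (names are assumptions).
index_codecs = [
    {"name": "bytes", "configuration": {"endian": "little"}},
    {"name": "crc32c"},
]
aa, ab, bb = split_codecs(index_codecs)
print([c["name"] for c in aa], ab["name"], [c["name"] for c in bb])
```

Keeping the kind in a registry or class hierarchy means the JSON stays a flat list, at the cost of every codec having to be known to the implementation up front.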

docs/v3/core/v3.0.rst

normanrz · 2023-07-26T18:57:04Z

Please review and merge if approved. Thanks!

normanrz added 2 commits July 19, 2023 11:31

adds index_codecs to the sharding spec

065cb0c

tweaks

25ef408

jbms reviewed Jul 19, 2023

docs/v3/codecs/sharding-indexed/v1.0.rst

applied pr feedback

b8e3995

jstriebel previously approved these changes Jul 24, 2023

docs/v3/codecs/crc32c/v1.0.rst

normanrz dismissed jstriebel’s stale review via 72165cc July 25, 2023 09:31

normanrz and others added 2 commits July 25, 2023 11:31

Update docs/v3/codecs/crc32c/v1.0.rst

72165cc

Co-authored-by: Jonathan Striebel <jstriebel@users.noreply.github.com>

clarified c.compute_encoded_size signature

f1cca01

normanrz requested review from jbms and jstriebel July 25, 2023 10:25

jstriebel previously approved these changes Jul 25, 2023

jbms reviewed Jul 25, 2023

docs/v3/core/v3.0.rst

Update v3.0.rst

5924fc1

normanrz dismissed jstriebel’s stale review via 5924fc1 July 26, 2023 00:09

normanrz requested review from jbms and jstriebel July 26, 2023 18:55

jbms approved these changes Jul 26, 2023

jbms merged commit 36b4832 into zarr-developers:main Jul 26, 2023
1 check passed

normanrz mentioned this pull request Jul 28, 2023

Support index_codecs in zarr sharding scalableminds/webknossos#7241

Closed
