Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds index_codecs to the sharding codec #253

Merged
merged 6 commits into from Jul 26, 2023

Conversation

normanrz
Copy link
Contributor

This PR adds the index_codecs configuration to the sharding codec. It also adds the crc32c checksum bytes-to-bytes codec.

Index codecs have been proposed by @jbms: #152 (comment)
There is a prototype implementation in zarrita: https://github.com/scalableminds/zarrita/blob/codec-pipeline/zarrita/sharding.py#L507-L518

@normanrz
Copy link
Contributor Author

@jbms @jstriebel Please take a look. If we decide to go with this, it would be good to merge this before we send ZEP2 to the ZIC.

@jbms
Copy link
Contributor

jbms commented Jul 19, 2023

Thanks! Looks good to me other than the comment I raised.

@mkitti
Copy link

mkitti commented Jul 22, 2023

I'm a bit confused how c.compute_encoded_size will work. To use this, I have to compute the byte size of the decoded representation? Is there some function to compute the size of the decoded representation?

Rather I think the input should just be n, the number of inner chunks. Perhaps there should also be a c.compute_decoded_size as well which also takes n as an input.

@jbms
Copy link
Contributor

jbms commented Jul 22, 2023

I'm a bit confused how c.compute_encoded_size will work. To use this, I have to compute the byte size of the decoded representation? Is there some function to compute the size of the decoded representation?

I think this could indeed be clarified, since we have "array -> array", "array -> bytes", and "bytes -> bytes" codecs to consider. For "array -> array" and "array -> bytes", the input is an array and therefore the relevant size would be the shape of the array, not its size in bytes. For "bytes -> bytes" codecs the input size would be in bytes.

Rather I think the input should just be n, the number of inner chunks. Perhaps there should also be a c.compute_decoded_size as well which also takes n as an input.

@normanrz
Copy link
Contributor Author

I'm a bit confused how c.compute_encoded_size will work. To use this, I have to compute the byte size of the decoded representation? Is there some function to compute the size of the decoded representation?

The size of the decoded representation is product(shape) * dtype.itemsize. The shape would be the chunk shape or index shape depending on the context. c.compute_encoded_size can be implemented for all codecs that produce non-variable-length outputs.

Please note, that these are just implementation suggestions. Some implementations may choose to do this differently.

@normanrz
Copy link
Contributor Author

I'm a bit confused how c.compute_encoded_size will work. To use this, I have to compute the byte size of the decoded representation? Is there some function to compute the size of the decoded representation?

I think this could indeed be clarified, since we have "array -> array", "array -> bytes", and "bytes -> bytes" codecs to consider. For "array -> array" and "array -> bytes", the input is an array and therefore the relevant size would be the shape of the array, not its size in bytes. For "bytes -> bytes" codecs the input size would be in bytes.

That makes sense. I'll clarify that.

@normanrz
Copy link
Contributor Author

For "array -> array" and "array -> bytes", the input is an array and therefore the relevant size would be the shape of the array, not its size in bytes.

Actually, this can be complicated for array -> array codecs that change the dtype. For these case, c.compute_encoded_size would need to track the dtype.

@jbms
Copy link
Contributor

jbms commented Jul 22, 2023

Yes, the "array -> array" codec should perhaps also indicate the output dtype.

Possibly it is easier to just leave these implementation details unspecified.

jstriebel
jstriebel previously approved these changes Jul 24, 2023
Copy link
Member

@jstriebel jstriebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the generalization of the shard index representation is great 🎉

docs/v3/codecs/crc32c/v1.0.rst Outdated Show resolved Hide resolved
normanrz and others added 2 commits July 25, 2023 11:31
Co-authored-by: Jonathan Striebel <jstriebel@users.noreply.github.com>
@normanrz
Copy link
Contributor Author

Thanks for the feedback! I added some more text to clarify the c.compute_encoded_size signatures.

@normanrz normanrz requested review from jbms and jstriebel July 25, 2023 10:25
@mkitti
Copy link

mkitti commented Jul 25, 2023

The documentation looks improved.

What is the current method to identify if a codec is one of array -> array, array -> bytes, or bytes -> bytes?

@jbms
Copy link
Contributor

jbms commented Jul 25, 2023

The documentation looks improved.

What is the current method to identify if a codec is one of array -> array, array -> bytes, or bytes -> bytes?

You can't tell just from looking at the json, but any supported codec must be known to the implementation, and it is up to the implementation exactly how this is managed.

In tensorstore there is a class for each codec, which inherits from one of ZarrArrayToArrayCodecSpec, ZarrArrayToBytesCodecSpec, or ZarrBytesToBytesCodecSpec; these in turn all inherit from ZarrCodecSpec, which has a virtual method ZarrCodecKind kind() where ZarrCodecKind is enum ZarrCodecKind { kArrayToArray, kArrayToBytes, kBytesToBytes };.

To parse the json codec list, it first parses as a generic list of ZarrCodecSpec and then sorts them into a separate list of array -> array codecs, the array -> bytes codec, and a list of bytes -> bytes codecs.

jstriebel
jstriebel previously approved these changes Jul 25, 2023
docs/v3/core/v3.0.rst Outdated Show resolved Hide resolved
@normanrz
Copy link
Contributor Author

Please review and merge if approved. Thanks!

@jbms jbms merged commit 36b4832 into zarr-developers:main Jul 26, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants