Rename endian codec to bytes #263
@jbms @jstriebel please take a look. |
👍 I'm all for this change. But I'd also like us to understand what is happening here at a procedural level. How is it possible to change the core spec now that it has been accepted? At what point is a new ZEP required? I agree we should make room for minor changes like this. However, we should do so under a clear framework. How are we tracking different versions of the spec if the spec continues to evolve? Is this a step towards, e.g. v3.0.1? I have advocated for a more incremental approach to the spec development, as practiced, for example, by STAC. But our current process doesn't seem to allow for that. We should change that. |
The way I understand the ZEP process (and how we discussed it in the ZEP meetings) is that ZEP1 has been provisionally accepted, but not yet finalized. That means there is some room to make minor changes to the spec that are motivated by feedback from the implementors. Once a number of implementations have been shown to implement a ZEP, it becomes finalized. I guess with ZEP1 we are still in this limbo phase. We wanted to collect all minor changes (there have been some other minor changes after acceptance) and present them to the ZIC and ZSC once more before finalizing it. Not sure if that requires another round of voting. |
That makes sense to me. STAC released several 1.0.0 "beta" and "rc" releases of the spec before stamping a final 1.0.0. (https://github.com/radiantearth/stac-spec/releases) Maybe it would be helpful for us to do the same? Otherwise, I worry that we will repeat the current situation where there are many different, formally indistinguishable versions of v3 in the wild. But we can continue that discussion elsewhere. Sorry for hijacking this thread. |
👍 for having a clear idea of how this will be communicated to those who have already started implementing since there won't be, e.g., a clear vote point where we get their affirmation. |
@zarr-developers/implementation-council This is a (minor) breaking change, so approval from existing implementations is needed. |
Yes from me (zarr.js/zarrita.js) |
While I think |
To make this a non-breaking change, could the |
Do you mean a name like bytes_corder? The idea is that you can combine this with the transpose codec to obtain any order. |
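As a rough illustration of combining the two codecs, here is a minimal standard-library sketch that transposes a small uint16 array and then serializes it in C order with a chosen endianness. The `encode` function and the permutation convention (output axis i is input axis `order[i]`) are assumptions made for this example, not text from the spec.

```python
import struct
from itertools import product

def encode(flat, shape, order, endian):
    """Transpose a C-order array of uint16 values, then serialize it.

    flat   : values in C (row-major) order for `shape`
    order  : permutation; output axis i is input axis order[i] (assumed convention)
    endian : "little" or "big"
    """
    # C-order strides of the input, counted in elements.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    out_shape = [shape[a] for a in order]
    out_strides = [strides[a] for a in order]
    # Walk the transposed array in C order and look up each input element.
    values = [
        flat[sum(i * s for i, s in zip(idx, out_strides))]
        for idx in product(*(range(n) for n in out_shape))
    ]
    prefix = "<" if endian == "little" else ">"
    return struct.pack(f"{prefix}{len(values)}H", *values)

# The 2x3 array [[1, 2, 3], [4, 5, 6]]
data = [1, 2, 3, 4, 5, 6]
c_order = encode(data, [2, 3], [0, 1], "little")  # identity transpose: C order
f_order = encode(data, [2, 3], [1, 0], "little")  # reversed axes: F order of the input
```

With the identity permutation the bytes come out in C order; reversing the axes first gives the layout that would otherwise be called F order.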
It could be kept as an alias easily enough, but are you aware of an actual existing use where this change will be problematic? At this point, tensorstore, zarrita, and zarrita.js have already made this change. The only other implementation of which I'm aware, zarr3-rs, has not, but I don't know if it is in production use yet. |
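For reference, keeping the old name as an alias could look roughly like the sketch below. `CODEC_ALIASES` and `normalize_codec` are hypothetical names for this illustration and are not taken from tensorstore, zarrita, or any other implementation.

```python
# Hypothetical helper, not part of any Zarr implementation's public API.
CODEC_ALIASES = {"endian": "bytes"}  # legacy name -> current name

def normalize_codec(codec: dict) -> dict:
    """Return a copy of a codec metadata entry with legacy names rewritten."""
    name = codec["name"]
    return {**codec, "name": CODEC_ALIASES.get(name, name)}

legacy = {"name": "endian", "configuration": {"endian": "little"}}
current = normalize_codec(legacy)
```

A reader that applies this on load would accept both spellings, while a writer would only ever emit the new name.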
I'm not sure I would say "any order", but yes I understand that this is meant to be used with My primary concern here is that Looking at this from the ZEP2 perspective, the example is currently as follows.
"index_codecs": [
  {
    "name": "endian",
    "configuration": {
      "endian": "little"
    }
  },
  { "name": "crc32c" }
]
I agree that this is confusing. It is unclear that Now, with this PR, we have the following.
"index_codecs": [
  {
    "name": "bytes",
    "configuration": {
      "endian": "little"
    }
  },
  { "name": "crc32c" }
]
While this is better, it is still not quite obvious to someone who has not read the specification thoroughly what this codec operation has performed. If you were a Python user and knew that If I'm allowed to bikeshed, I've settled on the name
I'm guessing that those who have made this change also still support the "endian" codec? Returning to Ryan's process question, pre-implementation makes this seem like it has been decided already and perhaps already hard to change. edit: I removed the name |
I switched over to “bytes” only in zarrita.js to align with zarrita. I don’t have much opinion here, but happy to implement/support what is decided. |
I think flatten_to_bytes is not unreasonable, but also is essentially a description of what any array to bytes codec must do, and therefore may not be particularly helpful to identify this specific codec. |
It is not clear to me that every array to bytes transformation must be reversible or that lossy compression is prohibited at this stage. One could define an array to bytes transformation that retains no information at all or perhaps only the shape of the array. Decoding would just produce an array of zeros. While I agree that |
I don't see much added value from
|
If we wanted to be explicit about the order of the shard index, I suppose one could always write the following, right?
Alternatively,
|
Yes, that's right, though of course the other PR about how Note, for example, that if there is a transpose codec specified prior to the sharding codec, the index, the chunk_shape, and the data chunks are then relative to the transposed order. Here is a complicated example to illustrate some possible interactions:
"chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [100, 200, 300]}},
"codecs": [
  {"name": "transpose", "configuration": {"order": [0, 2, 1]}},
  {"name": "sharding_indexed", "configuration": {
    "chunk_shape": [5, 10, 20],
    "index_codecs": [
      {"name": "transpose", "configuration": {"order": [3, 2, 1, 0]}},
      {"name": "bytes", "configuration": {"endian": "little"}},
      {"name": "crc32c"}
    ],
    "codecs": [
      {"name": "transpose", "configuration": {"order": [1, 0, 2]}},
      {"name": "bytes", "configuration": {"endian": "little"}},
      {"name": "gzip", "configuration": {"level": 5}}
    ]
  }}
]
Here the top-level chunk shape is From the perspective of the sharding codec, each data codec has a shape of From the user perspective looking at the top-level array as a whole, the inner-most chunking (granularity at which reads are supported) is |
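The shape bookkeeping for the metadata above can be worked through with a short sketch. It assumes that a transpose with a given `order` makes output dimension i equal to input dimension `order[i]`, and that the shard index has shape chunks-per-shard plus a trailing dimension of 2 (offset and length). The helper names are made up for this illustration. All three permutations in the example are their own inverses, so the results do not depend on which of the two usual permutation conventions is assumed.

```python
def permute(shape, order):
    """Assumed convention: output dimension i is input dimension order[i]."""
    return [shape[a] for a in order]

def inverse(order):
    """Inverse permutation, used to map transposed shapes back to user coordinates."""
    inv = [0] * len(order)
    for i, a in enumerate(order):
        inv[a] = i
    return inv

outer_order = [0, 2, 1]
chunk_shape = [100, 200, 300]   # top-level chunk_grid chunk_shape
inner_shape = [5, 10, 20]       # sharding_indexed chunk_shape

# Shape seen by the sharding codec, after the outer transpose.
shard_shape = permute(chunk_shape, outer_order)                 # [100, 300, 200]
chunks_per_shard = [s // c for s, c in zip(shard_shape, inner_shape)]  # [20, 30, 10]

# Index array: one (offset, nbytes) pair per inner chunk.
index_shape = chunks_per_shard + [2]                            # [20, 30, 10, 2]
stored_index_shape = permute(index_shape, [3, 2, 1, 0])         # [2, 10, 30, 20]

# Inner chunk shape expressed in the user's (untransposed) coordinates.
user_inner_shape = permute(inner_shape, inverse(outer_order))   # [5, 20, 10]
```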
May I suggest that the configuration.endian key also allow "native" in addition to "big" and "little". |
Shouldn't the configuration for endian also specify the dtype so that it is self-contained? |
Note that this is specifying the stored (i.e. "on-disk") format --- for that purpose, "native" is problematic because the native endianness of the machine reading it may differ from the native endianness of the machine writing (though of course in practice it will almost always be little endian). When creating an array, though, a zarr implementation may provide a way to choose native endianness automatically. E.g. in tensorstore, you can leave "endian" unspecified and it will choose the native endianness (but when actually stored, the "endian" value will always be specified). As far as storing the dtype, I can see how that may be advantageous though a similar argument would apply to storing the shape of the chunk as well in the |
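To make the distinction concrete, here is a minimal sketch of resolving the endianness at array-creation time so that the stored metadata is always explicit. `resolve_endian` is a hypothetical helper, and accepting the string "native" as a creation-time option is an assumption of this example. It is not tensorstore's API.

```python
import json
import sys

def resolve_endian(requested=None):
    """Pick the endianness to record in metadata (hypothetical helper).

    None or "native" means: use the byte order of the machine creating the array.
    The value written to storage is always "little" or "big".
    """
    if requested in (None, "native"):
        return sys.byteorder  # "little" or "big"
    if requested not in ("little", "big"):
        raise ValueError(f"unsupported endian value: {requested!r}")
    return requested

codec = {"name": "bytes", "configuration": {"endian": resolve_endian()}}
stored = json.dumps(codec)  # what would be written into the array metadata
```

A reader on a machine with a different native byte order still sees an unambiguous value, because "native" never reaches the stored document.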
Fair enough. |
@zarr-developers/implementation-council I'd like to proceed with merging this, but please post any objections. @clbarnes I think yours may be the only existing implementation that has not already made this change. |
Ah yes, this slipped past me. The name change isn't a concern; making the order optional opens up some new codec error types conditional on dtype, which is a bit of a pain (damn Rust making you think about these things early...) but shouldn't be too bad, just makes validation more complicated. Who's writing the jsonschema? 😬 |
Do multi-byte raw codecs have an endianness, i.e. do they require the config here? |
Following this PR zarr-developers/zarr-specs#263 Also allows the endianness to be undefined for single-byte dtypes.
No, the "r*" types do not have an endianness. |
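The dtype-conditional validation discussed above might look something like the following sketch. The function names and the handling of data type names are assumptions for illustration only: "endian" is treated as required for multi-byte numeric types, optional for single-byte types, and not required for "r*" types.

```python
import re

def itemsize(dtype: str) -> int:
    """Size in bytes for a data type name of the form int8, uint16, float32, r24, ..."""
    if dtype == "bool":
        return 1
    m = re.fullmatch(r"(?:int|uint|float|complex|r)(\d+)", dtype)
    if m is None:
        raise ValueError(f"unrecognized data type: {dtype!r}")
    return int(m.group(1)) // 8

def check_bytes_codec(dtype: str, configuration: dict) -> None:
    """Raise ValueError if the bytes codec configuration is invalid for dtype."""
    endian = configuration.get("endian")
    if endian is not None and endian not in ("little", "big"):
        raise ValueError(f"invalid endian value: {endian!r}")
    is_raw = re.fullmatch(r"r\d+", dtype) is not None
    if endian is None and not is_raw and itemsize(dtype) > 1:
        raise ValueError(f"'endian' is required for multi-byte data type {dtype!r}")

check_bytes_codec("uint8", {})                     # single byte: endian may be omitted
check_bytes_codec("r16", {})                       # raw bits: no endianness
check_bytes_codec("uint16", {"endian": "little"})  # multi-byte: endian given
```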
Does that get complicated where they're used as a fallback? Out of scope for this PR, I suppose. Endian codec rename PR ready to go, anyway clbarnes/zarr3-rs#18 |
Yes it is a complication, and indeed it is for reasons like this that we removed the concept of fallback data types.
|
Now that this has been implemented by all existing zarr v3 implementations, I think it is time to merge this. |
@normanrz I think this should be merged --- can you resolve the conflicts so that we can merge it? |
Done! |
Renames the endian codec to bytes based on this discussion: scalableminds/zarrita#8 (comment)