Rename endian codec to bytes #263

normanrz · 2023-08-21T08:28:15Z

Renames the endian to bytes based on this discussion: scalableminds/zarrita#8 (comment)

normanrz · 2023-08-21T08:30:01Z

@jbms @jstriebel please take a look.
Can we set up a redirect from the endian page to the new bytes page?

rabernat · 2023-08-21T13:20:00Z

👍 I'm all for this change.

But I'd also like us to understand what is happening here at a procedural level. How is it possible to change the core spec now that it has been accepted? At what point is a new ZEP required?

I agree we should make room for minor changes like this. However, we should do so under a clear framework. How are we tracking different versions of the spec if the spec continues to evolve? Is this a step towards, e.g. v3.0.1?

I have advocated for a more incremental approach to the spec development, as practiced, for example, by STAC. But our current process doesn't seem to allow for that. We should change that.

normanrz · 2023-08-21T13:46:52Z

But I'd also like us to understand what is happening here at a procedural level. How is it possible to change the core spec now that it has been accepted? At what point is a new ZEP required?

The way I understand the ZEP process (and how we discussed it in the ZEP meetings) is that ZEP1 has been provisionally accepted, but not yet finalized. That means there is some room to make some minor changes to the spec that are motivated by feedback from the implementors. Once a number of implementations have shown to implement a ZEP it becomes finalized.

I guess with ZEP1 we are still in this limbo phase. We wanted to collect all minor changes (there have been some other minor changes after acceptance) and present them to the ZIC and ZSC once more before finalizing it. Not sure, if that requires another round of voting.

rabernat · 2023-08-21T13:49:19Z

That means there is some room to make some minor changes to the spec that are motivated by feedback from the implementors

That makes sense to me.

STAC released several 1.0.0 "beta" and "rc" releases of the spec before stamping a final 1.0.0. (https://github.com/radiantearth/stac-spec/releases)

Maybe it would be helpful for us to do the same? Otherwise, I worry that we will repeat the current situation where there are many different, formally indistinguishable versions of v3 in the wild.

But we can continue that discussion elsewhere. Sorry for hijacking this thread.

joshmoore · 2023-08-24T16:31:05Z

👍 for having a clear idea of how this will be communicated to those who have already started implementing since there won't be, e.g., a clear vote point where we get their affirmation.

jbms · 2023-08-24T16:39:56Z

@zarr-developers/implementation-council This is a (minor) breaking change, so approval from existing implementations is needed.

manzt · 2023-08-25T12:23:35Z

Yes from me (zarr.js/zarrita.js)

mkitti · 2023-09-20T20:38:21Z

While I think bytes is a better than endian, I wonder if it could be made more obvious that this also uses C-order when converting an array to bytes.

mkitti · 2023-09-20T21:06:19Z

To make this a non-breaking change, could the endian codec be deprecated, use is discouraged pending removal, rather than removed?

jbms · 2023-09-20T21:06:22Z

Do you mean a name like bytes_corder? The idea is that you can combine this with the transpose codec to obtain any order.

jbms · 2023-09-20T21:14:06Z

To make this a non-breaking change, could the endian codec be deprecated, use is discouraged pending removal, rather than removed?

It could be kept as an alias easily enough, but are you aware of an actual existing use where this change will be problematic?

At this point, tensorstore, zarrita, and zarrita.js have already made this change. The only other implementation of which I'm aware, zarr3-rs, has not, but I don't know if it is in production use yet.

mkitti · 2023-09-20T22:17:56Z

I'm not sure I would say "any order", but yes I understand that this is meant to be used with transpose to obtain F-order. In the past, "Morton-order" was also discussed, and I'm pretty sure that transpose does not allow for that.

My primary concern here is that bytes still seems to be less than obvious as to what this codec does.

Looking at this from the ZEP2 perspective, the example is currently as follows.

        "index_codecs": [
          {
            "name": "endian",
            "configuration": {
              "endian": "little",
            }
          },
          { "name": "crc32c"}
        ]

I agree that this is confusing. It is unclear that endian codec is flattening the array and serializing to bytes.

Now, with this PR, we have the following.

        "index_codecs": [
          {
            "name": "bytes",
            "configuration": {
              "endian": "little",
            }
          },
          { "name": "crc32c"}
        ]

While this is better, it is still not quite obvious to someone who has not read the specification thoroughly what this codec operation has performed. If you were a Python user and knew that bytes(some_numpy_array) return a bytes object, I think there is a good chance one could guess the codec operation. Otherwise, I'm really not sure if this operation could be easily inferred.

If I'm allowed to bikeshed, I've settled on the name flatten_to_bytes after discussing it with a few colleagues. This name captures that some kind of ordering operation is occuring as well as the encoding in bytes.

At this point, tensorstore, zarrita, and zarrita.js have already made this change. The only other implementation of which I'm aware, zarr3-rs, has not, but I don't know if it is in production use yet.

I'm guessing that those who have made this change also still support the "endian" codec? Returning to Ryan's process question, pre-implementation makes this seem like it has been decided already and perhaps already hard to change.

edit: I removed the name ravel_to_bytes as a suggested name for the codec.

manzt · 2023-09-20T23:29:32Z

I'm guessing that those who have made this change also still support the "endian" codec?

I switched over to “bytes” only in zarrita.js to align with zarrita. I don’t have much opinion here, but happy to implement/support what is decided.

jbms · 2023-09-21T00:30:09Z

I think flatten_to_bytes is not unreasonable, but also is essentially a description of what any array to bytes codec must do, and therefore may not be particularly helpful to identify this specific codec.

mkitti · 2023-09-21T06:21:08Z

It is not clear to me that every array to bytes transformation must be reversible or that lossy compression is prohibited at this stage. One could define an array to bytes transformation that retains no information at all or perhaps only the shape of the array. Decoding would just produce an array of zeros.

While I agree that flatten_to_bytes is a trivial loseless transformation, I'm unclear if that truly describes the entire array to bytes codec space.

normanrz · 2023-09-21T07:33:07Z

I don't see much added value from flatten_to_bytes over just bytes. I think bytes captures pretty well what the codec does. If it it not clear from the name, the specification contains all the details.
Considering that some implementations have already adopted it (I want to added Webknossos to Jeremy's list), if we were to change the name now, I think the specification should mandate implementations to handle the alias. Changing from endian to bytes already has caused some confusion.

In the past, "Morton-order" was also discussed, and I'm pretty sure that transpose does not allow for that.

transpose doesn't allow for that, but there could be a morton codec that flattens the array in morton order and leaves the byte serialization to bytes.

mkitti · 2023-09-21T21:28:02Z

If we wanted to be explicit about the order of the shard index, I suppose one could always write the following, right?

        "index_codecs": [
          {
            "name": "transpose",
            "configuration": {
              "order": "C",
             }
          },
          {
            "name": "bytes",
            "configuration": {
              "endian": "little",
            }
          },
          { "name": "crc32c"}
        ]

Alternatively,

        "index_codecs": [
          {
            "name": "transpose",
            "configuration": {
              "order": "F",
             }
          },
          {
            "name": "bytes",
            "configuration": {
              "endian": "little",
            }
          },
          { "name": "crc32c"}
        ]

jbms · 2023-09-21T21:44:04Z

Yes, that's right, though of course the other PR about how order is specified impacts that.

Note, for example, that if there is a transpose codec specified prior to the sharding codec, the index, the chunk_shape, and the data chunks are then relative to the transposed order.

Here is a complicated example to illustrate some possible interactions:

"chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [100, 200, 300]}},
"codecs": [
  {"name": "transpose", "configuration": {"order": [0, 2, 1]}},
  {"name": "sharding_indexed", "configuration": {
    "chunk_shape": [5, 10, 20],
    "index_codecs": [
      {"name": "transpose", "configuration": {"order": [3,2,1,0]}},
      {"name": "bytes", "configuration": {"endian": "little"}},
      {"name": "crc32c"}
    ],
    "codecs": [
      {"name": "transpose", "configuration": {"order": [1, 0, 2]}},
      {"name": "bytes", "configuration": {"endian": "little"}},
      {"name": "gzip", "configuration": {"level": 5}}
    ]
  }
]

Here the top-level chunk shape is [100, 200, 300], but due to the outer transpose, the sharding codec sees an outer chunk shape of [100, 300, 200]. Given the sub-chunk shape of [5, 10, 20], the sub-chunk grid has a shape of [20, 30, 10]. Therefore, the index has a shape of [20, 30, 10, 2]. Due to the transpose specified in index_codecs, the index is effectively encoded with a shape of [2, 10, 30, 20] (C order).

From the perspective of the sharding codec, each data codec has a shape of [5, 10, 20] but due to the transpose, will be encoded as a shape of [10, 5, 20] (C order).

From the user perspective looking at the top-level array as a whole, the inner-most chunking (granularity at which reads are supported) is [5, 20, 10].

DennisHeimbigner · 2023-10-26T18:31:08Z

May I suggest that the configuration.endian key also allow "native" in addition to "big" and "little".

DennisHeimbigner · 2023-10-26T18:32:33Z

Shouldn't the configuration for endian also specify the dtype so that it is self-contained?

jbms · 2023-10-26T18:38:05Z

Note that this is specifying the stored (i.e. "on-disk") format --- for that purpose, "native" is problematic because the native endianness of the machine reading it may differ from the native endianness of the machine writing (though of course in practice it will almost always be little endian). When creating an array, though, a zarr implementation may provide a way to choose native endianness automatically. E.g. in tensorstore, you can leave "endian" unspecified and it will choose the native endianness (but when actually stored, the "endian" value will always be specified).

As far as storing the dtype, I can see how that may be advantageous though a similar argument would apply to storing the shape of the chunk as well in the bytes codec configuration, and it is not clear that we want to duplicate all of that.

DennisHeimbigner · 2023-10-26T18:57:19Z

Fair enough.

jbms · 2023-10-31T17:19:52Z

@zarr-developers/implementation-council I'd like to proceed with merging this, but please post any objections. @clbarnes I think yours may be the only existing implementation that has not already made this change.

clbarnes · 2023-10-31T17:47:08Z

Ah yes, this slipped past me. The name change isn't a concern; making the order optional opens up some new codec errors types conditional on dtype, which is a bit of a pain (damn rust making you think about these things early...) but shouldn't be too bad, just makes validation more complicated. Who's writing the jsonschema? 😬

clbarnes · 2023-10-31T17:57:20Z

Do multi-byte raw codecs have an endianness, i.e. do they require the config here?

Following this PR zarr-developers/zarr-specs#263 Also allows the endianness to be undefined for single-byte dtypes.

jbms · 2023-10-31T18:06:58Z

Do multi-byte raw codecs have an endianness, i.e. do they require the config here?

No, the "r*" types do not have an endianness.

clbarnes · 2023-10-31T18:15:56Z

Does that get complicated where they're used as a fallback? Out of scope for this PR, I suppose.

Endian codec rename PR ready to go, anyway clbarnes/zarr3-rs#18

jbms · 2023-10-31T18:27:05Z

Does that get complicated where they're used as a fallback? Out of scope for this PR, I suppose.

Yes it is a complication, and indeed it is for reasons like this that we removed the concept of fallback data types.

Endian codec rename PR ready to go, anyway clbarnes/zarr3-rs#18

jbms · 2023-11-15T20:29:28Z

Now that this has been implemented by all existing zarr v3 implementations, I think it is time to merge this.

jbms · 2023-12-14T17:43:13Z

@normanrz I think this should be merged --- can you resolve the conflicts so that we can merge it?

normanrz · 2023-12-17T15:42:09Z

can you resolve the conflicts so that we can merge it?

Done!

docs/v3/codecs/bytes/v1.0.rst

normanrz added 2 commits August 21, 2023 10:27

rename endian codec to bytes

267043d

pr number

14b44fc

rabernat previously approved these changes Aug 21, 2023

View reviewed changes

manzt mentioned this pull request Aug 25, 2023

feat: Rename endian codec to bytes manzt/zarrita.js#99

Merged

jbms mentioned this pull request Sep 17, 2023

Draft ZEP 0007: Strings zarr-developers/zeps#47

Open

mkitti mentioned this pull request Sep 20, 2023

ZEP0001 status clarification, not "Final"? #265

Closed

jbms mentioned this pull request Oct 26, 2023

Require array -> bytes codec be specified in metadata #249

Merged

joshmoore mentioned this pull request Oct 31, 2023

ZEP0002 Review #254

Closed

jstriebel previously approved these changes Oct 31, 2023

View reviewed changes

clbarnes added a commit to clbarnes/zarr3-rs that referenced this pull request Oct 31, 2023

Rename "endian" codec to "bytes"

f25fb81

Following this PR zarr-developers/zarr-specs#263 Also allows the endianness to be undefined for single-byte dtypes.

clbarnes mentioned this pull request Oct 31, 2023

Rename endian codec clbarnes/zarr3-rs#18

Merged

MSanKeys963 mentioned this pull request Oct 31, 2023

Clearly mention the implementation phase or grace period for ZEP proposals zarr-developers/zeps#52

Open

Merge remote-tracking branch 'origin/main' into endian-to-bytes

08eabab

normanrz dismissed stale reviews from jstriebel and rabernat via 08eabab December 17, 2023 15:41

jbms previously approved these changes Dec 17, 2023

View reviewed changes

docs/v3/codecs/bytes/v1.0.rst Outdated Show resolved Hide resolved

clarification on endian attribute

0794d91

normanrz dismissed jbms’s stale review via 0794d91 January 8, 2024 08:47

jbms approved these changes Jan 8, 2024

View reviewed changes

jbms merged commit b9ae65b into zarr-developers:main Jan 11, 2024
1 check passed

MSanKeys963 mentioned this pull request Jan 23, 2024

Some reflections about how to improve the ZEP process zarr-developers/zeps#55

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rename endian codec to bytes #263

Rename endian codec to bytes #263

normanrz commented Aug 21, 2023

normanrz commented Aug 21, 2023

rabernat commented Aug 21, 2023

normanrz commented Aug 21, 2023

rabernat commented Aug 21, 2023

joshmoore commented Aug 24, 2023

jbms commented Aug 24, 2023

manzt commented Aug 25, 2023 •

edited

mkitti commented Sep 20, 2023

mkitti commented Sep 20, 2023

jbms commented Sep 20, 2023

jbms commented Sep 20, 2023

mkitti commented Sep 20, 2023 •

edited

manzt commented Sep 20, 2023

jbms commented Sep 21, 2023

mkitti commented Sep 21, 2023 •

edited

normanrz commented Sep 21, 2023

mkitti commented Sep 21, 2023 •

edited

jbms commented Sep 21, 2023 •

edited

DennisHeimbigner commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 31, 2023

clbarnes commented Oct 31, 2023 •

edited

clbarnes commented Oct 31, 2023

jbms commented Oct 31, 2023

clbarnes commented Oct 31, 2023

jbms commented Oct 31, 2023

jbms commented Nov 15, 2023

jbms commented Dec 14, 2023

normanrz commented Dec 17, 2023

Rename endian codec to bytes #263

Rename endian codec to bytes #263

Conversation

normanrz commented Aug 21, 2023

normanrz commented Aug 21, 2023

rabernat commented Aug 21, 2023

normanrz commented Aug 21, 2023

rabernat commented Aug 21, 2023

joshmoore commented Aug 24, 2023

jbms commented Aug 24, 2023

manzt commented Aug 25, 2023 • edited

mkitti commented Sep 20, 2023

mkitti commented Sep 20, 2023

jbms commented Sep 20, 2023

jbms commented Sep 20, 2023

mkitti commented Sep 20, 2023 • edited

manzt commented Sep 20, 2023

jbms commented Sep 21, 2023

mkitti commented Sep 21, 2023 • edited

normanrz commented Sep 21, 2023

mkitti commented Sep 21, 2023 • edited

jbms commented Sep 21, 2023 • edited

DennisHeimbigner commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 31, 2023

clbarnes commented Oct 31, 2023 • edited

clbarnes commented Oct 31, 2023

jbms commented Oct 31, 2023

clbarnes commented Oct 31, 2023

jbms commented Oct 31, 2023

jbms commented Nov 15, 2023

jbms commented Dec 14, 2023

normanrz commented Dec 17, 2023

manzt commented Aug 25, 2023 •

edited

mkitti commented Sep 20, 2023 •

edited

mkitti commented Sep 21, 2023 •

edited

mkitti commented Sep 21, 2023 •

edited

jbms commented Sep 21, 2023 •

edited

clbarnes commented Oct 31, 2023 •

edited