Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename endian codec to bytes #263

Merged
merged 4 commits into from Jan 11, 2024

Conversation

normanrz
Copy link
Contributor

Renames the endian to bytes based on this discussion: scalableminds/zarrita#8 (comment)

@normanrz
Copy link
Contributor Author

@jbms @jstriebel please take a look.
Can we set up a redirect from the endian page to the new bytes page?

@rabernat
Copy link
Contributor

👍 I'm all for this change.

But I'd also like us to understand what is happening here at a procedural level. How is it possible to change the core spec now that it has been accepted? At what point is a new ZEP required?

I agree we should make room for minor changes like this. However, we should do so under a clear framework. How are we tracking different versions of the spec if the spec continues to evolve? Is this a step towards, e.g. v3.0.1?

I have advocated for a more incremental approach to the spec development, as practiced, for example, by STAC. But our current process doesn't seem to allow for that. We should change that.

@normanrz
Copy link
Contributor Author

But I'd also like us to understand what is happening here at a procedural level. How is it possible to change the core spec now that it has been accepted? At what point is a new ZEP required?

The way I understand the ZEP process (and how we discussed it in the ZEP meetings) is that ZEP1 has been provisionally accepted, but not yet finalized. That means there is some room to make some minor changes to the spec that are motivated by feedback from the implementors. Once a number of implementations have shown to implement a ZEP it becomes finalized.

I guess with ZEP1 we are still in this limbo phase. We wanted to collect all minor changes (there have been some other minor changes after acceptance) and present them to the ZIC and ZSC once more before finalizing it. Not sure, if that requires another round of voting.

@rabernat
Copy link
Contributor

That means there is some room to make some minor changes to the spec that are motivated by feedback from the implementors

That makes sense to me.

STAC released several 1.0.0 "beta" and "rc" releases of the spec before stamping a final 1.0.0. (https://github.com/radiantearth/stac-spec/releases)

Maybe it would be helpful for us to do the same? Otherwise, I worry that we will repeat the current situation where there are many different, formally indistinguishable versions of v3 in the wild.

But we can continue that discussion elsewhere. Sorry for hijacking this thread.

rabernat
rabernat previously approved these changes Aug 21, 2023
@joshmoore
Copy link
Member

👍 for having a clear idea of how this will be communicated to those who have already started implementing since there won't be, e.g., a clear vote point where we get their affirmation.

@jbms
Copy link
Contributor

jbms commented Aug 24, 2023

@zarr-developers/implementation-council This is a (minor) breaking change, so approval from existing implementations is needed.

@manzt
Copy link
Member

manzt commented Aug 25, 2023

Yes from me (zarr.js/zarrita.js)

@mkitti
Copy link

mkitti commented Sep 20, 2023

While I think bytes is a better than endian, I wonder if it could be made more obvious that this also uses C-order when converting an array to bytes.

@mkitti
Copy link

mkitti commented Sep 20, 2023

To make this a non-breaking change, could the endian codec be deprecated, use is discouraged pending removal, rather than removed?

@jbms
Copy link
Contributor

jbms commented Sep 20, 2023

Do you mean a name like bytes_corder? The idea is that you can combine this with the transpose codec to obtain any order.

@jbms
Copy link
Contributor

jbms commented Sep 20, 2023

To make this a non-breaking change, could the endian codec be deprecated, use is discouraged pending removal, rather than removed?

It could be kept as an alias easily enough, but are you aware of an actual existing use where this change will be problematic?

At this point, tensorstore, zarrita, and zarrita.js have already made this change. The only other implementation of which I'm aware, zarr3-rs, has not, but I don't know if it is in production use yet.

@mkitti
Copy link

mkitti commented Sep 20, 2023

I'm not sure I would say "any order", but yes I understand that this is meant to be used with transpose to obtain F-order. In the past, "Morton-order" was also discussed, and I'm pretty sure that transpose does not allow for that.

My primary concern here is that bytes still seems to be less than obvious as to what this codec does.

Looking at this from the ZEP2 perspective, the example is currently as follows.

        "index_codecs": [
          {
            "name": "endian",
            "configuration": {
              "endian": "little",
            }
          },
          { "name": "crc32c"}
        ]

I agree that this is confusing. It is unclear that endian codec is flattening the array and serializing to bytes.

Now, with this PR, we have the following.

        "index_codecs": [
          {
            "name": "bytes",
            "configuration": {
              "endian": "little",
            }
          },
          { "name": "crc32c"}
        ]

While this is better, it is still not quite obvious to someone who has not read the specification thoroughly what this codec operation has performed. If you were a Python user and knew that bytes(some_numpy_array) return a bytes object, I think there is a good chance one could guess the codec operation. Otherwise, I'm really not sure if this operation could be easily inferred.

If I'm allowed to bikeshed, I've settled on the name flatten_to_bytes after discussing it with a few colleagues. This name captures that some kind of ordering operation is occuring as well as the encoding in bytes.

At this point, tensorstore, zarrita, and zarrita.js have already made this change. The only other implementation of which I'm aware, zarr3-rs, has not, but I don't know if it is in production use yet.

I'm guessing that those who have made this change also still support the "endian" codec? Returning to Ryan's process question, pre-implementation makes this seem like it has been decided already and perhaps already hard to change.

edit: I removed the name ravel_to_bytes as a suggested name for the codec.

@manzt
Copy link
Member

manzt commented Sep 20, 2023

I'm guessing that those who have made this change also still support the "endian" codec?

I switched over to “bytes” only in zarrita.js to align with zarrita. I don’t have much opinion here, but happy to implement/support what is decided.

@jbms
Copy link
Contributor

jbms commented Sep 21, 2023

I think flatten_to_bytes is not unreasonable, but also is essentially a description of what any array to bytes codec must do, and therefore may not be particularly helpful to identify this specific codec.

@mkitti
Copy link

mkitti commented Sep 21, 2023

It is not clear to me that every array to bytes transformation must be reversible or that lossy compression is prohibited at this stage. One could define an array to bytes transformation that retains no information at all or perhaps only the shape of the array. Decoding would just produce an array of zeros.

While I agree that flatten_to_bytes is a trivial loseless transformation, I'm unclear if that truly describes the entire array to bytes codec space.

@normanrz
Copy link
Contributor Author

I don't see much added value from flatten_to_bytes over just bytes. I think bytes captures pretty well what the codec does. If it it not clear from the name, the specification contains all the details.
Considering that some implementations have already adopted it (I want to added Webknossos to Jeremy's list), if we were to change the name now, I think the specification should mandate implementations to handle the alias. Changing from endian to bytes already has caused some confusion.

In the past, "Morton-order" was also discussed, and I'm pretty sure that transpose does not allow for that.

transpose doesn't allow for that, but there could be a morton codec that flattens the array in morton order and leaves the byte serialization to bytes.

@mkitti
Copy link

mkitti commented Sep 21, 2023

If we wanted to be explicit about the order of the shard index, I suppose one could always write the following, right?

        "index_codecs": [
          {
            "name": "transpose",
            "configuration": {
              "order": "C",
             }
          },
          {
            "name": "bytes",
            "configuration": {
              "endian": "little",
            }
          },
          { "name": "crc32c"}
        ]

Alternatively,

        "index_codecs": [
          {
            "name": "transpose",
            "configuration": {
              "order": "F",
             }
          },
          {
            "name": "bytes",
            "configuration": {
              "endian": "little",
            }
          },
          { "name": "crc32c"}
        ]

@jbms
Copy link
Contributor

jbms commented Sep 21, 2023

Yes, that's right, though of course the other PR about how order is specified impacts that.

Note, for example, that if there is a transpose codec specified prior to the sharding codec, the index, the chunk_shape, and the data chunks are then relative to the transposed order.

Here is a complicated example to illustrate some possible interactions:

"chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [100, 200, 300]}},
"codecs": [
  {"name": "transpose", "configuration": {"order": [0, 2, 1]}},
  {"name": "sharding_indexed", "configuration": {
    "chunk_shape": [5, 10, 20],
    "index_codecs": [
      {"name": "transpose", "configuration": {"order": [3,2,1,0]}},
      {"name": "bytes", "configuration": {"endian": "little"}},
      {"name": "crc32c"}
    ],
    "codecs": [
      {"name": "transpose", "configuration": {"order": [1, 0, 2]}},
      {"name": "bytes", "configuration": {"endian": "little"}},
      {"name": "gzip", "configuration": {"level": 5}}
    ]
  }
]

Here the top-level chunk shape is [100, 200, 300], but due to the outer transpose, the sharding codec sees an outer chunk shape of [100, 300, 200]. Given the sub-chunk shape of [5, 10, 20], the sub-chunk grid has a shape of [20, 30, 10]. Therefore, the index has a shape of [20, 30, 10, 2]. Due to the transpose specified in index_codecs, the index is effectively encoded with a shape of [2, 10, 30, 20] (C order).

From the perspective of the sharding codec, each data codec has a shape of [5, 10, 20] but due to the transpose, will be encoded as a shape of [10, 5, 20] (C order).

From the user perspective looking at the top-level array as a whole, the inner-most chunking (granularity at which reads are supported) is [5, 20, 10].

@DennisHeimbigner
Copy link

May I suggest that the configuration.endian key also allow "native" in addition to "big" and "little".

@DennisHeimbigner
Copy link

Shouldn't the configuration for endian also specify the dtype so that it is self-contained?

@jbms
Copy link
Contributor

jbms commented Oct 26, 2023

Note that this is specifying the stored (i.e. "on-disk") format --- for that purpose, "native" is problematic because the native endianness of the machine reading it may differ from the native endianness of the machine writing (though of course in practice it will almost always be little endian). When creating an array, though, a zarr implementation may provide a way to choose native endianness automatically. E.g. in tensorstore, you can leave "endian" unspecified and it will choose the native endianness (but when actually stored, the "endian" value will always be specified).

As far as storing the dtype, I can see how that may be advantageous though a similar argument would apply to storing the shape of the chunk as well in the bytes codec configuration, and it is not clear that we want to duplicate all of that.

@DennisHeimbigner
Copy link

Fair enough.

@joshmoore joshmoore mentioned this pull request Oct 31, 2023
jstriebel
jstriebel previously approved these changes Oct 31, 2023
@jbms
Copy link
Contributor

jbms commented Oct 31, 2023

@zarr-developers/implementation-council I'd like to proceed with merging this, but please post any objections. @clbarnes I think yours may be the only existing implementation that has not already made this change.

@clbarnes
Copy link
Contributor

clbarnes commented Oct 31, 2023

Ah yes, this slipped past me. The name change isn't a concern; making the order optional opens up some new codec errors types conditional on dtype, which is a bit of a pain (damn rust making you think about these things early...) but shouldn't be too bad, just makes validation more complicated. Who's writing the jsonschema? 😬

@clbarnes
Copy link
Contributor

Do multi-byte raw codecs have an endianness, i.e. do they require the config here?

clbarnes added a commit to clbarnes/zarr3-rs that referenced this pull request Oct 31, 2023
Following this PR zarr-developers/zarr-specs#263

Also allows the endianness to be undefined for single-byte dtypes.
@jbms
Copy link
Contributor

jbms commented Oct 31, 2023

Do multi-byte raw codecs have an endianness, i.e. do they require the config here?

No, the "r*" types do not have an endianness.

@clbarnes
Copy link
Contributor

Does that get complicated where they're used as a fallback? Out of scope for this PR, I suppose.

Endian codec rename PR ready to go, anyway clbarnes/zarr3-rs#18

@jbms
Copy link
Contributor

jbms commented Oct 31, 2023

Does that get complicated where they're used as a fallback? Out of scope for this PR, I suppose.

Yes it is a complication, and indeed it is for reasons like this that we removed the concept of fallback data types.

Endian codec rename PR ready to go, anyway clbarnes/zarr3-rs#18

@jbms
Copy link
Contributor

jbms commented Nov 15, 2023

Now that this has been implemented by all existing zarr v3 implementations, I think it is time to merge this.

@jbms
Copy link
Contributor

jbms commented Dec 14, 2023

@normanrz I think this should be merged --- can you resolve the conflicts so that we can merge it?

@normanrz normanrz dismissed stale reviews from jstriebel and rabernat via 08eabab December 17, 2023 15:41
@normanrz
Copy link
Contributor Author

can you resolve the conflicts so that we can merge it?

Done!

jbms
jbms previously approved these changes Dec 17, 2023
docs/v3/codecs/bytes/v1.0.rst Outdated Show resolved Hide resolved
@jbms jbms merged commit b9ae65b into zarr-developers:main Jan 11, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants