
v2: clarify that unicode uses utf-32 encoding #121

Open
constantinpape opened this issue Oct 1, 2021 · 8 comments
@constantinpape

The data type encoding section does not specify which unicode encoding is used. It appears to be UTF-32 (inherited from numpy's unicode datatypes). Unfortunately this does not seem to be clearly documented in the numpy dtype description either, but inspection of the serialized data shows that it is UTF-32 encoded:

```python
import numpy
print(numpy.dtype("U1").itemsize)  # prints 4
```

(I have also validated this by decoding zarr unicode chunks manually.)
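
For illustration, a minimal sketch of that manual check (the example string and dtype here are made up, not taken from an actual chunk): the raw buffer of a numpy "U" array decodes as NUL-padded little-endian UTF-32.

```python
import numpy as np

# "<U4" reserves 4 code points x 4 bytes per element, NUL-padded at the end.
arr = np.array(["ab"], dtype="<U4")
raw = arr.tobytes()
assert len(raw) == 4 * 4
assert raw.decode("utf-32-le") == "ab\x00\x00"  # ">U4" would decode as utf-32-be
```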

This information is important for supporting zarr unicode data in other languages, so it should be stated more explicitly in the spec.

@manzt
Member

manzt commented Jan 18, 2022

Ran into this today... glad to find this issue!

@DennisHeimbigner

Is this a spec issue or a Python implementation issue? I would assume the spec specifies UTF-8.

@constantinpape
Author

@DennisHeimbigner the spec only states the following:

> Simple data types are encoded within the array metadata as a string, following the [NumPy array protocol type string (typestr) format](https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface). The format consists of 3 parts:
> ...
> "U": unicode (fixed-length sequence of Py_UNICODE)
> ...

and numpy uses UTF-32 encoding, see the code snippet above. So by transitivity the spec currently prescribes UTF-32, but only quite implicitly.
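
To make the implication concrete, here is a sketch of what this means for a v2 array with a unicode dtype (the metadata values below are illustrative, not taken from this thread):

```python
# Illustrative .zarray metadata for a hypothetical v2 array (values assumed):
zarray = {
    "zarr_format": 2,
    "shape": [2],
    "chunks": [2],
    "dtype": "<U4",      # little-endian unicode, 4 code points per element
    "compressor": None,
    "fill_value": None,
    "order": "C",
    "filters": None,
}
# Under the numpy interpretation, each element occupies 4 * 4 = 16 bytes of
# NUL-padded UTF-32-LE in a chunk; ">U4" would correspondingly be UTF-32-BE.
```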

@DennisHeimbigner

I see. Should this be changed to explicitly be UTF-8? Or, perhaps to include non-UTF-8 encodings, should a string be defined as a sequence of 8-bit bytes?

@constantinpape
Author

constantinpape commented Feb 10, 2022

> I see. Should this be changed to explicitly be UTF-8? Or, perhaps to include non-UTF-8 encodings, should a string be defined as a sequence of 8-bit bytes?

I don't think it can be changed to UTF-8 in zarr spec v2; to be compatible with numpy it has to be UTF-32 (and numpy is explicitly used as the reference in the zarr v2 spec). So for v2 I would just state explicitly that it is UTF-32.

I am not so up-to-date on the current plans for v3, but it might be a good idea to decouple the data type encoding from numpy there; changing to UTF-8 by default and/or allowing the encoding to be specified also sounds like a good idea.

@DennisHeimbigner

I wonder how many non-Python zarr implementations adhere to using UTF-32?

@jbms
Contributor

jbms commented Feb 22, 2022

Not sure how many other implementations even support it. A fixed-length sequence of UTF-32-encoded code points seems unlikely to be particularly useful as a data type.

@joshmoore
Member

https://github.com/zarr-developers/zarr-specs/pull/135/files#diff-6b08b9e843756eb493a5d6ad9817cb5aea38e09d80d1b84ddac2c5f3e37a246dR69 is the likely location for addressing this in v3. On the v2 front, @constantinpape, I assume our best next step is to get a simple test in zarr_implementations? If basically all implementations vary, it will be tricky to specify anything in the v2 spec.
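
For reference, such a test could be roughly as simple as the sketch below (assuming zarr-python v2 with no compressor so the chunk bytes can be inspected directly; the path and values are illustrative, and this is not an actual zarr_implementations test):

```python
import zarr

# Write a small unicode array without compression (illustrative path/values).
z = zarr.open("enc_test.zarr", mode="w", shape=(2,), chunks=(2,),
              dtype="<U4", compressor=None)
z[:] = ["abc", "de"]

# The single chunk is stored under key "0". If the encoding is UTF-32, the raw
# chunk bytes decode as NUL-padded UTF-32-LE.
with open("enc_test.zarr/0", "rb") as f:
    raw = f.read()
assert raw.decode("utf-32-le") == "abc\x00de\x00\x00"
```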
