
v2: clarify that unicode uses utf-32 encoding #121

Open
constantinpape opened this issue Oct 1, 2021 · 8 comments
@constantinpape

The data type encoding section does not specify which unicode encoding is used. It appears to be UTF-32 (inherited from numpy's unicode datatypes). Unfortunately this does not seem to be clearly documented in the numpy dtype description either, but inspection of the serialized data shows that it is UTF-32 encoded:

```python
import numpy
print(numpy.dtype("U1").itemsize)  # prints 4
```

(I have also validated this by decoding zarr unicode chunks manually.)
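
For illustration, a minimal sketch of that manual check (the example string and dtype here are made up, not taken from an actual chunk): the raw buffer of a numpy "U" array decodes as NUL-padded little-endian UTF-32.

```python
import numpy as np

# "<U4" reserves 4 code points x 4 bytes per element, NUL-padded at the end.
arr = np.array(["ab"], dtype="<U4")
raw = arr.tobytes()
assert len(raw) == 4 * 4
assert raw.decode("utf-32-le") == "ab\x00\x00"  # ">U4" would decode as utf-32-be
```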

This information is important for supporting zarr unicode data in other languages, so it should be stated more explicitly in the spec.

@manzt
Member

manzt commented Jan 18, 2022

Ran into this today... glad to find this issue!

@DennisHeimbigner

Is this a spec issue or a Python implementation issue? I would assume the spec specifies UTF-8.

@constantinpape
Author

@DennisHeimbigner the spec only states the following:

> Simple data types are encoded within the array metadata as a string, following the [NumPy array protocol type string (typestr) format](https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface). The format consists of 3 parts:
> ...
> "U": unicode (fixed-length sequence of Py_UNICODE)
> ...

and numpy uses UTF-32 encoding, see the code snippet above. So by transitivity the spec currently prescribes UTF-32, but only quite implicitly.
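
To make the implication concrete, here is a sketch of what this means for a v2 array with a unicode dtype (the metadata values below are illustrative, not taken from this thread):

```python
# Illustrative .zarray metadata for a hypothetical v2 array (values assumed):
zarray = {
    "zarr_format": 2,
    "shape": [2],
    "chunks": [2],
    "dtype": "<U4",      # little-endian unicode, 4 code points per element
    "compressor": None,
    "fill_value": None,
    "order": "C",
    "filters": None,
}
# Under the numpy interpretation, each element occupies 4 * 4 = 16 bytes of
# NUL-padded UTF-32-LE in a chunk; ">U4" would correspondingly be UTF-32-BE.
```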

@DennisHeimbigner

I see. Should this be changed to explicitly be UTF-8? Or, perhaps to include non-UTF-8 encodings, should a string be defined as a sequence of 8-bit bytes?

@constantinpape
Author

constantinpape commented Feb 10, 2022

> I see. Should this be changed to explicitly be UTF-8? Or, perhaps to include non-UTF-8 encodings, should a string be defined as a sequence of 8-bit bytes?

I don't think it can be changed to UTF-8 in zarr spec v2; to be compatible with numpy it has to be UTF-32 (and numpy is explicitly used as the reference in the zarr v2 spec). So for v2 I would just state explicitly that it is UTF-32.

I am not so up-to-date on the current plans for v3, but it might be a good idea to decouple the data type encoding from numpy there; changing to UTF-8 by default and/or allowing the encoding to be specified also sounds like a good idea.

@DennisHeimbigner

I wonder how many non-Python zarr implementations adhere to using UTF-32?

@jbms
Contributor

jbms commented Feb 22, 2022

Not sure how many other implementations even support it. A fixed-length sequence of UTF-32-encoded code points seems unlikely to be particularly useful as a data type.

@joshmoore
Member

https://github.com/zarr-developers/zarr-specs/pull/135/files#diff-6b08b9e843756eb493a5d6ad9817cb5aea38e09d80d1b84ddac2c5f3e37a246dR69 is the likely location for addressing this in v3. On the v2 front, @constantinpape, I assume our best next step is to get a simple test in zarr_implementations? If basically all implementations vary, it will be tricky to specify anything in the v2 spec.
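
For reference, such a test could be roughly as simple as the sketch below (assuming zarr-python v2 with no compressor so the chunk bytes can be inspected directly; the path and values are illustrative, and this is not an actual zarr_implementations test):

```python
import zarr

# Write a small unicode array without compression (illustrative path/values).
z = zarr.open("enc_test.zarr", mode="w", shape=(2,), chunks=(2,),
              dtype="<U4", compressor=None)
z[:] = ["abc", "de"]

# The single chunk is stored under key "0". If the encoding is UTF-32, the raw
# chunk bytes decode as NUL-padded UTF-32-LE.
with open("enc_test.zarr/0", "rb") as f:
    raw = f.read()
assert raw.decode("utf-32-le") == "abc\x00de\x00\x00"
```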
