v2: clarify that unicode uses utf-32 encoding #121
Comments
Ran into this today... glad to find this issue!
Is this a spec issue or a Python implementation issue? I would assume the spec specifies UTF-8.
@DennisHeimbigner the spec only states the following:
and numpy uses UTF-32 encoding, see the code snippet above. By transitivity the spec currently uses UTF-32 (but quite implicitly).
I see. Should this be changed to be explicitly utf-8 or perhaps to include
I don't think it can be changed to UTF-8 in zarr spec v2; to be compatible with numpy it has to be UTF-32 (and numpy is explicitly used as the reference in the zarr-v2 spec). So for v2 I would just explicitly state that it's UTF-32. I am not so up-to-date on the current plans for v3, but it might be a good idea to decouple the data type encoding from numpy there. Changing to UTF-8 by default and/or allowing the encoding to be specified also sounds like a good idea.
I wonder how many non-Python zarr implementations adhere to using UTF-32?
Not sure how many other implementations even support it? A fixed-length sequence of UTF-32-encoded code points seems unlikely to be particularly useful as a data type.
https://github.com/zarr-developers/zarr-specs/pull/135/files#diff-6b08b9e843756eb493a5d6ad9817cb5aea38e09d80d1b84ddac2c5f3e37a246dR69 is the likely location for addressing this in v3. On the v2 front, @constantinpape, I assume our best next step is to get a simple test in zarr_implementations? If basically all implementations vary, it will be tricky to specify anything in the v2 spec.
The data type encoding section of the spec does not specify the unicode encoding. It appears to use UTF-32 (inherited from numpy unicode datatypes). Unfortunately this does not seem to be clearly documented in the numpy dtype description either, but inspection of the serialized data shows that it's UTF-32 encoded:
(and I have also validated by decoding zarr unicode chunks manually).
For supporting zarr unicode data in other languages this information is important, so it should be stated more explicitly in the spec.
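To illustrate what an implementation without numpy would have to do, here is a minimal sketch using only the Python standard library. It assumes the chunk bytes are already decompressed, hold items of a fixed-length unicode dtype such as `<U4` (little-endian, 4 code points per item), and are padded with NUL code points as numpy does. The function names `encode_fixed_utf32` and `decode_fixed_utf32` are hypothetical helpers for this example, not part of zarr or numpy.

```python
def _codec(byteorder):
    # "<" and ">" follow the byte-order characters used in zarr v2 dtype strings.
    if byteorder == "<":
        return "utf-32-le"
    if byteorder == ">":
        return "utf-32-be"
    raise ValueError(f"unsupported byte order: {byteorder!r}")


def encode_fixed_utf32(values, length, byteorder="<"):
    """Lay out strings as fixed-length UTF-32 items (assumed layout of a U<length> dtype)."""
    codec = _codec(byteorder)
    out = bytearray()
    for value in values:
        if len(value) > length:
            raise ValueError(f"{value!r} has more than {length} code points")
        # Pad with NUL code points so every item occupies 4 * length bytes.
        out += value.ljust(length, "\x00").encode(codec)
    return bytes(out)


def decode_fixed_utf32(buf, length, byteorder="<"):
    """Split a raw buffer into fixed-length UTF-32 items and strip the NUL padding."""
    codec = _codec(byteorder)
    item_size = 4 * length  # each UTF-32 code unit is 4 bytes
    if len(buf) % item_size != 0:
        raise ValueError("buffer size is not a multiple of the item size")
    return [
        buf[start:start + item_size].decode(codec).rstrip("\x00")
        for start in range(0, len(buf), item_size)
    ]


raw = encode_fixed_utf32(["ab", "zarr", "é"], length=4)
decoded = decode_fixed_utf32(raw, length=4)
```

Note that stripping trailing NULs means a string that genuinely ends in `"\x00"` cannot be round-tripped, which is a consequence of the fixed-length layout assumed here.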