Use Arrow C data interface format strings? #61

alimanfoo · 2020-04-28T07:51:35Z

Reading the Arrow C data interface, I'm wondering if we should consider following any of the approach described there for the zarr v3.0 core protocol spec.

In particular, the format strings for core data types may be easier to handle than the currently used numpy-style format strings, although unfortunately they have no concept of endianness.
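To make the comparison concrete, here is a minimal sketch of how numpy-style type strings could map onto Arrow C data interface format strings. The single-character Arrow codes are taken from the Arrow C data interface document; the helper name `numpy_to_arrow` and the choice to return the byte order separately are assumptions made for illustration only, not part of either spec.

```python
import re

# Arrow C data interface format strings for fixed-size numeric types,
# keyed by (numpy kind character, item size in bytes).
ARROW_FORMATS = {
    ("i", 1): "c", ("i", 2): "s", ("i", 4): "i", ("i", 8): "l",
    ("u", 1): "C", ("u", 2): "S", ("u", 4): "I", ("u", 8): "L",
    ("f", 2): "e", ("f", 4): "f", ("f", 8): "g",
}

# numpy-style type string: optional byte-order character, kind, item size.
_TYPESTR = re.compile(r"^([<>|=]?)([iuf])(\d+)$")


def numpy_to_arrow(typestr):
    """Hypothetical helper: return (arrow_format, byteorder) for a type string.

    The Arrow format string has no slot for byte order, so it is returned
    separately and would have to be stored elsewhere in the metadata.
    """
    match = _TYPESTR.match(typestr)
    if match is None:
        raise ValueError(f"unsupported type string: {typestr!r}")
    byteorder = match.group(1) or "="
    key = (match.group(2), int(match.group(3)))
    if key not in ARROW_FORMATS:
        raise ValueError(f"no Arrow format for: {typestr!r}")
    return ARROW_FORMATS[key], byteorder


# "<i4" and ">i4" collapse to the same Arrow format string "i".
print(numpy_to_arrow("<i4"), numpy_to_arrow(">i4"))
```

Both byte orders map to the same Arrow string, which is the gap noted above: the byte order would need its own metadata field.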

joshmoore · 2020-04-28T09:58:51Z

cc: @DennisHeimbigner

Carreau · 2020-05-05T18:35:15Z

> I'm wondering if we should consider following any of the approach described there

I don't see any mention of endianness in the Arrow document, though it seems like we have both in the current spec. Do you know how common it is to have different endianness within the same Zarr hierarchy?

alimanfoo · 2020-05-06T12:39:43Z

Endianness is important given that data may be produced on one system and consumed on another. I don't know of any cases where endianness differs between arrays in the same hierarchy; that would probably be rare. But it would be nice to include this information within the array metadata.
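As a small illustration of why the byte order needs to be recorded somewhere, the sketch below writes a 32-bit unsigned integer in little-endian order and then reads the same bytes back under both assumptions. It uses only the standard `struct` module; the variable names are illustrative.

```python
import struct

value = 1

# Producer writes a 32-bit unsigned integer in little-endian byte order.
little = struct.pack("<I", value)

# A consumer that knows the byte order recovers the value.
correct = struct.unpack("<I", little)[0]

# A consumer that assumes big-endian silently gets a different number.
wrong = struct.unpack(">I", little)[0]

print(correct, wrong)  # 1 16777216
```

No error is raised in the second case, so without the byte order in the metadata a reader has no way to detect the mismatch.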

DennisHeimbigner · 2020-05-06T15:37:28Z

As you know, both HDF5 and netcdf support setting endianness on a per-array basis. Practically speaking, I have never seen a netcdf file in which the endianness differed across arrays. Since I tend to favor less complexity, I think that specifying a single endianness for the whole file would be the way to go.

DennisHeimbigner · 2020-05-06T15:44:06Z

WRT format strings: the binary and large binary distinction seems odd to me. I assume it is trying to provide information about how to store the binary string. I would have said that the distinction is arbitrary and should be left to the implementation to decide. I also note that fixed-width binary is presumably the same as the HDF5 opaque type. We have found that this type has very little use as such, and users tend to use arrays of uint8 for this instead.

DennisHeimbigner · 2020-05-06T15:45:40Z

WRT the time types, that has been an issue with netcdf for a while. Currently, times are stored as integers or strings, and an attribute is used to specify their semantics.
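For readers unfamiliar with that pattern, here is a rough sketch of decoding integer time values with the help of an attribute. It assumes a `"<unit> since <date>"` string in the style commonly used with netcdf files; the function name `decode_times`, the supported units, the date-only epoch format, and the choice of UTC are all simplifying assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed set of supported units and their length in seconds.
_SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def decode_times(values, units):
    """Hypothetical helper: decode integer offsets via a '<unit> since <date>' string."""
    unit, sep, epoch_text = units.partition(" since ")
    if not sep or unit not in _SECONDS_PER_UNIT:
        raise ValueError(f"unsupported units attribute: {units!r}")
    # Assumption: the epoch is a plain YYYY-MM-DD date, interpreted as UTC.
    epoch = datetime.strptime(epoch_text.strip(), "%Y-%m-%d").replace(tzinfo=timezone.utc)
    step = _SECONDS_PER_UNIT[unit]
    return [epoch + timedelta(seconds=v * step) for v in values]


decoded = decode_times([0, 1, 366], "days since 2000-01-01")
print([d.date().isoformat() for d in decoded])
```

The stored array stays a plain integer type; the meaning lives entirely in the attribute, which is why a dedicated time data type was not strictly required.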

jstriebel · 2022-11-16T17:16:33Z

Just as a note: this was also discussed in issue #131, and the data type names of the v3 spec were updated in PR #155 to be uint32, float64, etc. Date- and time-related data types are not part of v3 core at the moment but are added as an extension.

jakirkham · 2022-11-18T09:06:32Z

Also related to issue zarr-developers/numcodecs#227.

jstriebel added the core-protocol-v3.0 label (Issue relates to the core protocol version 3.0 spec) on Nov 16, 2022

jstriebel added the protocol-extension (Protocol extension related issue) and data-type labels on Nov 16, 2022

jstriebel removed the core-protocol-v3.0 label (Issue relates to the core protocol version 3.0 spec) on Nov 23, 2022
