Use Arrow C data interface format strings? #61

Open
alimanfoo opened this issue Apr 28, 2020 · 8 comments
Labels
data-type protocol-extension Protocol extension related issue

Comments

@alimanfoo
Member

Reading the Arrow C data interface, I'm wondering if we should consider following any of the approaches described there for the Zarr v3.0 core protocol spec.

In particular, the format strings for core data types may be easier to handle than the NumPy-style format strings currently in use. Unfortunately, though, the Arrow format strings have no concept of endianness.
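
To make the comparison concrete, here is a minimal sketch that translates NumPy-style type strings (such as `<i4`) into the single-character format strings listed in the Arrow C data interface for fixed-size integers and floats. The function name `numpy_to_arrow_format` and the lookup table are hypothetical, written only for illustration; they are not part of Zarr, NumPy, or Arrow. Because the Arrow string carries no byte order, the sketch returns the byte-order character separately.

```python
# Arrow C data interface format characters for fixed-size integers and floats,
# keyed by the (kind, itemsize) pair found in a NumPy-style type string.
ARROW_FORMATS = {
    ("i", 1): "c", ("u", 1): "C",
    ("i", 2): "s", ("u", 2): "S",
    ("i", 4): "i", ("u", 4): "I",
    ("i", 8): "l", ("u", 8): "L",
    ("f", 2): "e", ("f", 4): "f", ("f", 8): "g",
}


def numpy_to_arrow_format(typestr):
    """Hypothetical helper: split '<i4' into (arrow_format, byteorder)."""
    byteorder, kind, size = typestr[0], typestr[1], int(typestr[2:])
    if byteorder not in ("<", ">", "|"):
        raise ValueError(f"unknown byte order in {typestr!r}")
    try:
        arrow_format = ARROW_FORMATS[(kind, size)]
    except KeyError:
        raise ValueError(f"no Arrow primitive format for {typestr!r}") from None
    # The Arrow format string cannot express byte order, so it is
    # returned alongside and would have to be stored separately.
    return arrow_format, byteorder


little_int32 = numpy_to_arrow_format("<i4")
big_float64 = numpy_to_arrow_format(">f8")
```

Both `<i4` and `>i4` map to the same Arrow string `i`, which is exactly the information loss noted above.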

@joshmoore
Member

cc: @DennisHeimbigner

@Carreau
Contributor

Carreau commented May 5, 2020

I'm wondering if we should consider following any of the approaches described there

I don't see any mention of endianness in the Arrow document, though it seems like we have both in the current spec. Do you know how common it is to have different endianness in the same Zarr?

@alimanfoo
Member Author

Endianness is important given that data may be produced on one system and consumed on another. I don't know of any cases where endianness differs between arrays in the same hierarchy; that would probably be rare. But it would be nice to include this information within the array metadata.
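
As a small illustration of why the byte order needs to be recorded somewhere, the sketch below uses Python's standard `struct` module to write a 32-bit integer in little-endian order and then read the same bytes back with the wrong byte order. The variable names are only for illustration.

```python
import struct

value = 1

# The same 32-bit integer serialized in both byte orders.
little = struct.pack("<i", value)  # b'\x01\x00\x00\x00'
big = struct.pack(">i", value)     # b'\x00\x00\x00\x01'

# A consumer that knows the producer's byte order recovers the value.
correct = struct.unpack("<i", little)[0]

# A consumer that assumes big-endian silently gets a different number.
misread = struct.unpack(">i", little)[0]  # 0x01000000
```

Nothing in the raw bytes signals the mistake, which is why the metadata has to state the byte order explicitly.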

@DennisHeimbigner

DennisHeimbigner commented May 6, 2020

As you know, both HDF5 and netcdf support setting endianness on a per-array
basis. Practically speaking, I have never seen a netcdf file in which the endianness
differed across arrays. Since I tend to favor less complexity, I would
think that specifying an endianness for the whole file only would be the way to go.

@DennisHeimbigner

WRT format strings: the distinction between binary and large binary seems
odd to me. I assume it is trying to provide information about how
to store the binary string. I would have said that the distinction is
arbitrary and should be left to the implementation to decide.
I also note that fixed-width binary is presumably the same as the
HDF5 opaque type. We have found that this type has very little use
as such, and users tend to use arrays of uint8 instead.
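
The point about fixed-width binary versus uint8 arrays can be sketched in a few lines: a column of fixed-width binary values (Arrow format string `w:4` for a width of 4 bytes) occupies the same contiguous bytes as a two-dimensional uint8 array of shape (number of items, width). The sample values and variable names below are made up for illustration.

```python
width = 4  # would correspond to the Arrow format string "w:4"
items = [b"\x00\x01\x02\x03", b"\xff\xfe\xfd\xfc"]

# Fixed-width binary storage: the values are laid out back to back.
buffer = b"".join(items)

# The same buffer viewed as rows of uint8, i.e. shape (len(items), width).
rows = [list(buffer[i:i + width]) for i in range(0, len(buffer), width)]

# Converting a row back to bytes recovers the original opaque value.
roundtrip = [bytes(row) for row in rows]
```

The only thing the dedicated type adds is the statement that each row should be treated as one opaque value rather than as four separate numbers.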

@DennisHeimbigner

WRT the time types: that has been an issue with netcdf for a while.
Currently, times are stored as integers or strings, and an attribute is
used to specify their semantics.
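
A minimal sketch of that integers-plus-attribute pattern follows. It assumes the attribute is called `units` and holds a string of the form `<unit> since <ISO date>`, in the style of the CF conventions commonly used with netcdf; the comment above does not name the attribute, so that is an assumption. The helper `decode_times` is hypothetical, handles only a handful of units, and treats the reference date as UTC.

```python
from datetime import datetime, timedelta, timezone

# Seconds per supported unit (deliberately incomplete).
_UNIT_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def decode_times(values, units):
    """Hypothetical helper: turn integer offsets into datetimes.

    `units` is assumed to look like "days since 2000-01-01".
    """
    unit, _, epoch = units.partition(" since ")
    origin = datetime.fromisoformat(epoch.strip()).replace(tzinfo=timezone.utc)
    step = _UNIT_SECONDS[unit.strip()]
    return [origin + timedelta(seconds=v * step) for v in values]


# The array stores plain integers; the attribute gives them meaning.
attrs = {"units": "days since 2000-01-01"}
decoded = decode_times([0, 1, 366], attrs["units"])
```

Without the attribute, the stored values `0, 1, 366` are just integers, which is the difficulty with treating time as a separate data type.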

@jstriebel jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Nov 16, 2022
@jstriebel
Member

Just as a note: this was also discussed in issue #131, and the data type names of the v3 spec were updated in PR #155 to be uint32, float64, etc. Date- and time-related data types are not part of the v3 core at the moment but are added as an extension.

@jstriebel jstriebel added protocol-extension Protocol extension related issue data-type labels Nov 16, 2022
@jakirkham
Member

Also related to issue zarr-developers/numcodecs#227.

@jstriebel jstriebel removed the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Nov 23, 2022
6 participants