
Add Apache Arrow codec #227

Open · vdwees opened this issue Apr 9, 2020 · 4 comments

vdwees commented Apr 9, 2020

It would be great to add an Apache Arrow codec for efficient data loading.

For data stored on a filesystem, an Apache Parquet codec might be added as well.

Also mentioned here:
zarr-developers/zarr-python#515
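Roughly what I have in mind is wrapping each chunk in Arrow's IPC stream format. A minimal sketch (nothing like this exists in numcodecs today; the `ArrowIPC` name and codec id are made up, and it assumes chunks arrive as simple 1-D numeric ndarrays):

```python
import pyarrow as pa
from numcodecs.abc import Codec
from numcodecs.compat import ensure_ndarray, ndarray_copy


class ArrowIPC(Codec):
    """Hypothetical codec that round-trips chunks through Arrow's IPC stream format."""

    codec_id = "arrow_ipc"  # made-up id, not registered anywhere upstream

    def encode(self, buf):
        arr = ensure_ndarray(buf).ravel()           # flatten the chunk
        table = pa.table({"data": pa.array(arr)})   # wrap it as a one-column table
        sink = pa.BufferOutputStream()
        with pa.ipc.new_stream(sink, table.schema) as writer:
            writer.write_table(table)
        return sink.getvalue().to_pybytes()         # raw IPC stream bytes

    def decode(self, buf, out=None):
        with pa.ipc.open_stream(pa.py_buffer(buf)) as reader:
            table = reader.read_all()
        dec = table.column("data").to_numpy()
        if out is not None:
            return ndarray_copy(dec, out)
        return dec
```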

@jakirkham (Member)

How do you imagine this working?

If people are using Parquet, do they actually need Zarr?

vdwees (Author) commented Apr 9, 2020

One of the selling features of Zarr for me is being able to load only the chunks I need from a remote server. Arrow is only an in-memory representation, so I guess it's conceivable that a chunk larger than is reasonable to hold in memory could be spooled to local disk as a Parquet file? 🤔

I’ll experiment a bit with Arrow, and if I can get the behavior I’m hoping for I’ll submit a PR.

@jakirkham (Member)

Before getting to a PR, it would be good to get a clearer idea of the usage pattern and how well it generalizes (though it sounds like we are still working on those questions 😉).

@alimanfoo (Member)

Hi @vdwees, just to second @jakirkham's comment, it would be helpful to clarify goals and usage patterns here.

IIUC Arrow provides a standard way to share memory buffers between processes. So, e.g., you could imagine loading data from one or more chunks of any Zarr array into a PyArrow array, rather than a numpy array as currently. That is completely independent of codecs; it's more about how to lay out memory buffers and expose them to applications.
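For illustration, a rough sketch of that idea (assuming a primitive dtype with no nulls so the buffer can be wrapped without copying; the array, chunk size and names are made up):

```python
import numpy as np
import pyarrow as pa
import zarr

z = zarr.array(np.arange(1_000_000, dtype="float64"), chunks=100_000)

# Load one chunk's worth of data into a numpy array as usual ...
chunk = z[:100_000]

# ... then expose the same memory to Arrow-aware code without copying.
buf = pa.py_buffer(chunk)                         # wraps the existing buffer
arrow_arr = pa.Array.from_buffers(
    pa.float64(), len(chunk), [None, buf]         # [validity bitmap, data]
)
print(arrow_arr[:3])                              # views the chunk's memory
```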

Parquet is a file format for columnar data, i.e., serialisation of multiple 1D arrays.
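For example, with pyarrow (just an illustration; the file name and columns are arbitrary):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Parquet serialises named 1-D columns to a file ...
table = pa.table({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
pq.write_table(table, "example.parquet")

# ... and individual columns can later be read back selectively.
subset = pq.read_table("example.parquet", columns=["y"])
```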

Codecs in Zarr are things like compressors, which transform arrays during serialisation or deserialisation. Some of the current codecs in the numcodecs package do borrow ideas from the Parquet format, but only for some very specific things, e.g., how to serialise strings.
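For example, the encode/decode round trip through an existing codec like Blosc looks like this (an illustrative sketch, not specific to Arrow):

```python
import numpy as np
from numcodecs import Blosc

codec = Blosc(cname="zstd", clevel=3, shuffle=Blosc.SHUFFLE)
arr = np.arange(100_000, dtype="int32")

encoded = codec.encode(arr)              # compressed bytes, as stored
decoded = codec.decode(encoded)          # decompressed bytes
roundtrip = np.frombuffer(decoded, dtype=arr.dtype)
assert np.array_equal(arr, roundtrip)
```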

Hth.
