
Add Apache Arrow codec #227

Open · vdwees opened this issue Apr 9, 2020 · 4 comments

vdwees commented Apr 9, 2020

It would be great to add an Apache Arrow codec for efficient data loading.

For data stored on a filesystem, an Apache Parquet codec might be added as well.

Also mentioned here:
zarr-developers/zarr-python#515
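Roughly what I have in mind is wrapping each chunk in Arrow's IPC stream format. A minimal sketch (nothing like this exists in numcodecs today; the `ArrowIPC` name and codec id are made up, and it assumes chunks arrive as simple 1-D numeric ndarrays):

```python
import pyarrow as pa
from numcodecs.abc import Codec
from numcodecs.compat import ensure_ndarray, ndarray_copy


class ArrowIPC(Codec):
    """Hypothetical codec that round-trips chunks through Arrow's IPC stream format."""

    codec_id = "arrow_ipc"  # made-up id, not registered anywhere upstream

    def encode(self, buf):
        arr = ensure_ndarray(buf).ravel()           # flatten the chunk
        table = pa.table({"data": pa.array(arr)})   # wrap it as a one-column table
        sink = pa.BufferOutputStream()
        with pa.ipc.new_stream(sink, table.schema) as writer:
            writer.write_table(table)
        return sink.getvalue().to_pybytes()         # raw IPC stream bytes

    def decode(self, buf, out=None):
        with pa.ipc.open_stream(pa.py_buffer(buf)) as reader:
            table = reader.read_all()
        dec = table.column("data").to_numpy()
        if out is not None:
            return ndarray_copy(dec, out)
        return dec
```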

@jakirkham (Member)

How do you imagine this working?

If people are using Parquet, do they actually need Zarr?

vdwees (Author) commented Apr 9, 2020

One of the selling features of Zarr for me is being able to load only the chunks I need from a remote server. Arrow is only an in-memory representation, so I guess it's conceivable that a chunk larger than is reasonable to hold in memory could be spooled to local disk as a Parquet file? 🤔

I’ll experiment a bit with Arrow, and if I can get the behavior I’m hoping for I’ll submit a PR.

@jakirkham (Member)

Before getting to a PR, it would be good to get a clearer idea of the usage pattern and how well it generalizes (though it sounds like we are still working on those questions 😉).

@alimanfoo (Member)

Hi @vdwees, just to second @jakirkham's comment, it would be helpful to clarify goals and usage patterns here.

IIUC Arrow provides a standard way to share memory buffers between processes. So, e.g., you could imagine loading data from one or more chunks of any Zarr array into a PyArrow array, rather than a numpy array as currently. That is completely independent of codecs; it's more about how to lay out memory buffers and expose them to applications.
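For illustration, a rough sketch of that idea (assuming a primitive dtype with no nulls so the buffer can be wrapped without copying; the array, chunk size and names are made up):

```python
import numpy as np
import pyarrow as pa
import zarr

z = zarr.array(np.arange(1_000_000, dtype="float64"), chunks=100_000)

# Load one chunk's worth of data into a numpy array as usual ...
chunk = z[:100_000]

# ... then expose the same memory to Arrow-aware code without copying.
buf = pa.py_buffer(chunk)                         # wraps the existing buffer
arrow_arr = pa.Array.from_buffers(
    pa.float64(), len(chunk), [None, buf]         # [validity bitmap, data]
)
print(arrow_arr[:3])                              # views the chunk's memory
```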

Parquet is a file format for columnar data, i.e., serialisation of multiple 1D arrays.
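For example, with pyarrow (just an illustration; the file name and columns are arbitrary):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Parquet serialises named 1-D columns to a file ...
table = pa.table({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
pq.write_table(table, "example.parquet")

# ... and individual columns can later be read back selectively.
subset = pq.read_table("example.parquet", columns=["y"])
```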

Codecs in Zarr are things like compressors, which transform arrays during serialisation or deserialisation. Some of the current codecs in the numcodecs package do borrow ideas from the Parquet format, but only for some very specific things, e.g., how to serialise strings.
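For example, the encode/decode round trip through an existing codec like Blosc looks like this (an illustrative sketch, not specific to Arrow):

```python
import numpy as np
from numcodecs import Blosc

codec = Blosc(cname="zstd", clevel=3, shuffle=Blosc.SHUFFLE)
arr = np.arange(100_000, dtype="int32")

encoded = codec.encode(arr)              # compressed bytes, as stored
decoded = codec.decode(encoded)          # decompressed bytes
roundtrip = np.frombuffer(decoded, dtype=arr.dtype)
assert np.array_equal(arr, roundtrip)
```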

Hth.
