
Dataframe codec #452

Closed
hadim opened this issue Jul 27, 2023 · 10 comments

Comments

@hadim

hadim commented Jul 27, 2023

I am looking for a way to store a dataframe (or many) into a Zarr. It seems this is currently not possible; see zarr-developers/community#31 for context.

I was wondering whether you think a dataframe codec based on Parquet could be useful to have in numcodecs. I am not sure whether it fits the scope of numcodecs, though.
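
To make the idea concrete, here is a rough sketch of what such a codec might look like on the numcodecs side. The class name and codec_id are made up, the codec does not exist in numcodecs, and routing a whole DataFrame through encode/decode with pandas (which needs pyarrow or fastparquet for the Parquet round-trip) is only one possible design:

import io

import numcodecs
import numcodecs.abc
import pandas as pd
from numcodecs.compat import ensure_bytes


class ParquetDataFrame(numcodecs.abc.Codec):
    """Hypothetical codec: serialize a pandas DataFrame as Parquet bytes."""

    codec_id = "parquet_dataframe"  # made-up identifier, not registered anywhere upstream

    def encode(self, buf):
        # buf is expected to be a pandas DataFrame
        out = io.BytesIO()
        buf.to_parquet(out)
        return out.getvalue()

    def decode(self, buf, out=None):
        # buf holds the raw Parquet bytes stored in the chunk
        return pd.read_parquet(io.BytesIO(ensure_bytes(buf)))


numcodecs.register_codec(ParquetDataFrame)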

@martindurant
Member

Zarr can store complex dtypes, the pure numpy equivalent of dataframes, but that is probably not what you are after.

Zarr can also store groups of one-dimensional columns, which together look like a dataframe.

However, the question is why you want to use zarr for this. Zarr is for n-dimensional arrays, and distinguishes itself from parquet by being able to index and chunk along each dimension, whereas parquet is one-dimensional. But your data is one-dimensional.

Yes, you could write a codec to store chunks of a dataframe in zarr chunks. However, parquet itself is a chunked format, so you don't need zarr to get there. If you want to use parquet already, why not use it by itself? What advantage do you want from zarr?
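
For illustration, a minimal sketch of the "group of one-dimensional columns" layout mentioned above, assuming the zarr-python 2.x API; the store path and dataframe are made up, and string/object columns would additionally need an object_codec:

import pandas as pd
import zarr

# Made-up example dataframe with numeric columns only
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

root = zarr.open_group("example.zarr", mode="w")
table = root.create_group("table")
for name in df.columns:
    # one 1-D zarr array per column
    table.create_dataset(name, data=df[name].to_numpy())

# Reassemble a dataframe from the one-dimensional column arrays
df_roundtrip = pd.DataFrame({name: table[name][:] for name in df.columns})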

@hadim
Author

hadim commented Jul 27, 2023

So the "naive" idea I would like to explore (whether it's here or in custom code) is to encode a dataframe in Parquet and store the bytes in a Zarr.

Now, on the reason why I would like to do that: I am storing a complex dataset, and zarr, beyond its n-dimensional array features, is also very convenient for storing literally anything (as long as you have a codec for it). My use case is to store a metadata dataframe (a dataframe is simply the most convenient and fastest option for my use case) that will come alongside the "true data", which is a set of n-dimensional arrays. By doing that, we only have to deal with a single dataset file instead of two (and it makes everything simpler).

Now, I totally understand this can be outside the scope of numcodecs and zarr, so all good and I can close.

@martindurant
Member

Is the metadata so small that using the pickle/JSON/msgpack codecs makes sense? Then, your dataframe can look like a one-dimensional object-type array.
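
A minimal sketch of that approach, assuming the zarr-python 2.x API and the numcodecs JSON object codec (MsgPack would work the same way); the store path, dataset name, and dataframe below are made up:

import numcodecs
import numpy as np
import pandas as pd
import zarr

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})  # made-up metadata
records = df.to_dict(orient="records")  # one JSON-serializable dict per row

root = zarr.open_group("example.zarr", mode="w")
z = root.create_dataset(
    "metadata_json",
    shape=(len(records),),
    dtype=object,
    object_codec=numcodecs.JSON(),
)
arr = np.empty(len(records), dtype=object)
arr[:] = records
z[:] = arr

# Reading back gives a one-dimensional object array of dicts,
# which can be turned back into a dataframe separately.
df_roundtrip = pd.DataFrame(list(root["metadata_json"][:]))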

@hadim
Author

hadim commented Jul 27, 2023

Yes, I will indeed explore JSON/msgpack and it might do the job. But some metadata might be large (likely never ultra-large), which is why I really liked the idea of a Parquet codec.

The other reason is more of a "conceptual" one. If going with JSON/msgpack, then I need logic in the code to load the data and then apply pd.read_{json|msgpack}, which is indeed only a one-liner. But the nice property of having a "dataframe" codec is that you don't even have to think about it: zarr would simply return you a dataframe directly (without having to care about the serialization method used).

@martindurant
Member

Oh, I finally see what you are after :)

No, I don't think that zarr itself will ever return anything other than arrays (it's in the name!). Those could be made into dataframes separately.

@hadim
Author

hadim commented Jul 27, 2023

OK, I understand, and it's all good to me.

Thanks!

hadim closed this as completed Jul 27, 2023
@hadim
Author

hadim commented Jul 27, 2023

For the record, I am cross-referencing that rather old PR zarr-developers/zarr-python#84 (in case anyone is looking into the history of this).

@martindurant
Member

As you can see, the idea was dropped due to the implementation of parquet IO for pandas with fastparquet. :)

@hadim
Author

hadim commented Jul 27, 2023

For the record (again!), saving Parquet bytes in zarr works very well and shows the same performance as working with a raw Parquet file. While not surprising, I needed to check it before moving forward.

import io

import pandas as pd

# Assumes `root` is an already-open zarr group and `metadata` is a pandas DataFrame.

# Save metadata as Parquet bytes in a one-element dataset
df_buffer = io.BytesIO()
metadata.to_parquet(df_buffer)
root.create_dataset("metadata_parquet", data=[df_buffer.getbuffer().tobytes()], dtype=bytes)

# Read the raw bytes back and parse them as Parquet
df_buffer = io.BytesIO()
df_buffer.write(root["metadata_parquet"][0])
metadata = pd.read_parquet(df_buffer)
assert isinstance(metadata, pd.DataFrame)

Sizes on disk are the same and loading times are also the same.

The only downside is that by doing df_buffer.write(root["metadata_parquet"][0]), you are reading the whole file, and so you lose the ability to "dynamically and conditionally" load only a subset of a Parquet file. I am sure it's possible to enable that, but it's not needed for my use case.

Hope it helps!
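
As a side note on that last point: once the raw bytes have been read out of zarr, column (or row-group) selection is still possible with pyarrow before building the dataframe, although the whole byte blob has already left the store at that point. A small sketch, assuming pyarrow is installed and using a made-up column name:

import pyarrow as pa
import pyarrow.parquet as pq

# Pull the whole Parquet blob out of zarr, then select columns in memory.
raw = bytes(root["metadata_parquet"][0])
table = pq.read_table(pa.BufferReader(raw), columns=["some_column"])  # hypothetical column name
metadata_subset = table.to_pandas()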

@joshmoore
Member

Sorry for chiming in late, but I find this personally interesting (i.e. for NGFF):

A question that comes up fairly regularly is whether or not one should start mixing Parquet files into Zarr files. This is at least a new avenue for consideration. @hadim, thanks for the investigation & (ongoing) record-keeping.
