
Adding to_dask/from_dask #198

Closed
jakirkham opened this issue Nov 24, 2017 · 14 comments

@jakirkham
Member

Was thinking the other day that it might be nice to have some convenience methods on Zarr's Array for converting it to a Dask Array and for storing a Dask Array to Zarr. May also make sense to have such methods on Zarr Groups (thinking of cases where the array has not been created yet). For the most part these are pretty straightforward to do outside of Zarr. That said, they would be convenient and maybe cut some boilerplate for end users.
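For context, a minimal sketch of the boilerplate such methods would wrap, using only the public dask.array API (the to_dask/from_dask names proposed above are hypothetical):

```python
import dask.array as da
import zarr

# Roughly what a to_dask method might do: wrap the Zarr array,
# reusing its native chunking for the Dask array.
z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f8")
d = da.from_array(z, chunks=z.chunks)

# Roughly what a from_dask method might do: create a matching
# Zarr array and store the Dask array into it.
out = zarr.zeros(d.shape, chunks=(100, 100), dtype=d.dtype)
da.store(d, out)
```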

@jakirkham
Member Author

Should add this is not urgent. Just something that might be interesting to think about later.

@alimanfoo
Member

alimanfoo commented Nov 24, 2017 via email

@jakirkham
Member Author

That sounds like a reasonable alternative. Thoughts @mrocklin?

@mrocklin
Contributor

What would the da.from_zarr or da.to_zarr functions take as inputs?

In the from_zarr case if it's just a Zarr array then this would presumably be identical to da.from_array, no?

@jakirkham
Member Author

Well it would be able to extract the dtype and chunks from the Zarr Array as well. This crosses over with issue (dask/dask#1983) a bit.

@mrocklin
Contributor

Yeah, I agree that this is the same as with HDF5 and other array objects. The .dtype, .chunks, .shape attributes effectively form a protocol. The issue here is that it's not always clear that dask.array should use the chunksizes in the storage format.
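A minimal sketch of that implicit protocol: any object exposing these attributes plus `__getitem__` can be handed to da.from_array (the `ChunkedSource` class here is purely illustrative, not a real Dask or Zarr interface):

```python
import numpy as np
import dask.array as da

class ChunkedSource:
    # Illustrative array-like exposing the attributes dask relies on.
    def __init__(self, data, chunks):
        self._data = np.asarray(data)
        self.shape = self._data.shape
        self.dtype = self._data.dtype
        self.chunks = chunks  # the storage format's native chunking

    def __getitem__(self, key):
        return self._data[key]

src = ChunkedSource(np.arange(10000).reshape(100, 100), chunks=(10, 10))

# A caller may honor the native chunking or override it:
d = da.from_array(src, chunks=src.chunks)
```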

@alimanfoo
Member

alimanfoo commented Nov 24, 2017 via email

@mrocklin
Contributor

One path: if we want the convenience of matching the dask.array chunks to the Zarr array's chunks, I would suggest we just bake this into da.from_array instead of creating a new from_zarr function. The counterargument is that the chunk size of the underlying storage format is often too small or too large for dask.array, so you do want the user to specify the chunk size explicitly.
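A sketch of the two options being weighed (the 10x coarsening factor below is arbitrary; the right factor depends on the workload):

```python
import dask.array as da
import zarr

z = zarr.zeros((10000, 10000), chunks=(100, 100), dtype="f8")

# Convenience path: inherit the storage chunking directly.
d_native = da.from_array(z, chunks=z.chunks)

# Explicit path: storage chunks are often too small for dask,
# so coarsen them when building the Dask array.
coarse = tuple(c * 10 for c in z.chunks)
d_coarse = da.from_array(z, chunks=coarse)
```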

@jakirkham
Member Author

I'm ok with that. IOW just go with the solution proposed in issue (dask/dask#1983)?

@jakirkham
Member Author

Separately, I would add that part of the reason I had been thinking about having to_dask in Zarr is that we could bypass some of the indexing logic when the chunks in the Dask Array are the same as those in the Zarr Array. Namely, we could pull directly from the underlying store and decompress the contents instead of worrying about how to handle slices that overlap multiple chunks. If we want to get even more clever for this case, we don't have to do the decompression up front at all: we could instead return an in-memory Zarr Array for each chunk, allowing it to be decompressed lazily when used.
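A rough sketch of the chunk-aligned fast path described above, building a Dask graph whose tasks read and decode raw chunk bytes directly (this assumes zarr 2.x conventions for chunk keys and `z.compressor`, ignores filters, fill values for unwritten chunks, and partial edge chunks, and is a sketch rather than a robust implementation):

```python
import numpy as np
import dask.array as da
import zarr

def _read_chunk(z, coords):
    # Pull the raw compressed bytes for one chunk straight from the
    # store and decode them, bypassing zarr's slicing machinery.
    key = ".".join(map(str, coords))
    raw = z.store[key]
    buf = z.compressor.decode(raw) if z.compressor else raw
    return np.frombuffer(buf, dtype=z.dtype).reshape(z.chunks)

z = zarr.zeros((200, 200), chunks=(100, 100), dtype="f8")
z[:] = 1.0  # materialize every chunk so raw bytes exist in the store

name = "zarr-chunks"
grid = [s // c for s, c in zip(z.shape, z.chunks)]
dsk = {
    (name, i, j): (_read_chunk, z, (i, j))
    for i in range(grid[0])
    for j in range(grid[1])
}
chunks = tuple((c,) * g for c, g in zip(z.chunks, grid))
d = da.Array(dsk, name, chunks, dtype=z.dtype)
```

The lazier variant mentioned above would have each task return a small in-memory Zarr Array wrapping the raw bytes instead of eagerly decoding them.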

@alimanfoo
Member

alimanfoo commented Nov 24, 2017 via email

@jakirkham
Member Author

The main benefit is that we delay decompression and keep the memory footprint small in Dask. Depending on the operations performed, it may not be necessary to decompress at all. The tail end of this comment provides one such use case.

@jakirkham
Member Author

Should add that discussion about adding to_zarr/from_zarr is in issue (dask/dask#3457), and an implementation is being pursued in PR (dask/dask#3460).

@jakirkham
Member Author

This is now moot, as Dask can convert to/from Zarr thanks to @martindurant's PR (dask/dask#3460).
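For reference, the API that landed (these are the real dask.array functions; check the Dask documentation for current signatures):

```python
import dask.array as da

d = da.ones((1000, 1000), chunks=(100, 100))

# Store a Dask array as a Zarr array on disk.
d.to_zarr("example.zarr")

# Read it back as a Dask array, inheriting the Zarr chunking.
d2 = da.from_zarr("example.zarr")
```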
