
Adding to_dask/from_dask #198

Closed
jakirkham opened this issue Nov 24, 2017 · 14 comments

@jakirkham
Member

Was thinking the other day that it might be nice to have some convenience methods on Zarr's Array for converting it to a Dask Array and for storing a Dask Array to Zarr. May also make sense to have such methods on Zarr Groups (thinking of cases where the array has not been created yet). For the most part these are pretty straightforward to do outside of Zarr. That said, they would be convenient and maybe cut some boilerplate for end users.
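For context, a minimal sketch of the boilerplate such methods would wrap, using only the public dask.array API (the to_dask/from_dask names proposed above are hypothetical):

```python
import dask.array as da
import zarr

# Roughly what a to_dask method might do: wrap the Zarr array,
# reusing its native chunking for the Dask array.
z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f8")
d = da.from_array(z, chunks=z.chunks)

# Roughly what a from_dask method might do: create a matching
# Zarr array and store the Dask array into it.
out = zarr.zeros(d.shape, chunks=(100, 100), dtype=d.dtype)
da.store(d, out)
```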

@jakirkham
Member Author

Should add this is not urgent. Just something that might be interesting to think about later.

@alimanfoo
Member

alimanfoo commented Nov 24, 2017 via email

@jakirkham
Member Author

That sounds like a reasonable alternative. Thoughts @mrocklin?

@mrocklin
Contributor

What would the da.from_zarr or da.to_zarr functions take as inputs?

In the from_zarr case if it's just a Zarr array then this would presumably be identical to da.from_array, no?

@jakirkham
Member Author

Well it would be able to extract the dtype and chunks from the Zarr Array as well. This crosses over with issue (dask/dask#1983) a bit.

@mrocklin
Contributor

Yeah, I agree that this is the same as with HDF5 and other array objects. The .dtype, .chunks, .shape attributes effectively form a protocol. The issue here is that it's not always clear that dask.array should use the chunksizes in the storage format.
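A minimal sketch of that implicit protocol: any object exposing these attributes plus `__getitem__` can be handed to da.from_array (the `ChunkedSource` class here is purely illustrative, not a real Dask or Zarr interface):

```python
import numpy as np
import dask.array as da

class ChunkedSource:
    # Illustrative array-like exposing the attributes dask relies on.
    def __init__(self, data, chunks):
        self._data = np.asarray(data)
        self.shape = self._data.shape
        self.dtype = self._data.dtype
        self.chunks = chunks  # the storage format's native chunking

    def __getitem__(self, key):
        return self._data[key]

src = ChunkedSource(np.arange(10000).reshape(100, 100), chunks=(10, 10))

# A caller may honor the native chunking or override it:
d = da.from_array(src, chunks=src.chunks)
```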

@alimanfoo
Member

alimanfoo commented Nov 24, 2017 via email

@mrocklin
Contributor

One path: if we want the convenience of matching the dask.array chunks to the Zarr array's chunks, I would suggest we just bake this into da.from_array instead of creating a new from_zarr function. The counterargument is that the chunk size of the underlying storage format is often too small or too large for dask.array, so you do want the user to specify the chunk size explicitly.
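A sketch of the two options being weighed (the 10x coarsening factor below is arbitrary; the right factor depends on the workload):

```python
import dask.array as da
import zarr

z = zarr.zeros((10000, 10000), chunks=(100, 100), dtype="f8")

# Convenience path: inherit the storage chunking directly.
d_native = da.from_array(z, chunks=z.chunks)

# Explicit path: storage chunks are often too small for dask,
# so coarsen them when building the Dask array.
coarse = tuple(c * 10 for c in z.chunks)
d_coarse = da.from_array(z, chunks=coarse)
```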

@jakirkham
Member Author

I'm ok with that. IOW just go with the solution proposed in issue (dask/dask#1983)?

@jakirkham
Member Author

Separately, I would add that part of the reason I had been thinking about having to_dask in Zarr is that we could bypass some of the indexing logic when the chunks in the Dask Array are the same as those in the Zarr Array. Namely, we could pull directly from the underlying store and decompress the contents instead of worrying about how to handle slices that overlap multiple chunks. If we want to get even more clever for this case, we don't have to do the decompression up front at all: we could instead return an in-memory Zarr Array for each chunk, allowing it to be decompressed lazily when used.
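A rough sketch of the chunk-aligned fast path described above, building a Dask graph whose tasks read and decode raw chunk bytes directly (this assumes zarr 2.x conventions for chunk keys and `z.compressor`, ignores filters, fill values for unwritten chunks, and partial edge chunks, and is a sketch rather than a robust implementation):

```python
import numpy as np
import dask.array as da
import zarr

def _read_chunk(z, coords):
    # Pull the raw compressed bytes for one chunk straight from the
    # store and decode them, bypassing zarr's slicing machinery.
    key = ".".join(map(str, coords))
    raw = z.store[key]
    buf = z.compressor.decode(raw) if z.compressor else raw
    return np.frombuffer(buf, dtype=z.dtype).reshape(z.chunks)

z = zarr.zeros((200, 200), chunks=(100, 100), dtype="f8")
z[:] = 1.0  # materialize every chunk so raw bytes exist in the store

name = "zarr-chunks"
grid = [s // c for s, c in zip(z.shape, z.chunks)]
dsk = {
    (name, i, j): (_read_chunk, z, (i, j))
    for i in range(grid[0])
    for j in range(grid[1])
}
chunks = tuple((c,) * g for c, g in zip(z.chunks, grid))
d = da.Array(dsk, name, chunks, dtype=z.dtype)
```

The lazier variant mentioned above would have each task return a small in-memory Zarr Array wrapping the raw bytes instead of eagerly decoding them.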

@alimanfoo
Member

alimanfoo commented Nov 24, 2017 via email

@jakirkham
Member Author

The main benefit is that we delay decompression and keep the memory footprint small in Dask. Depending on the operations performed, it may not be necessary to decompress at all. The tail end of this comment provides one such use case.

@jakirkham
Member Author

Should add that discussion about adding to_zarr/from_zarr is in issue (dask/dask#3457), and an implementation is being pursued in PR (dask/dask#3460).

@jakirkham
Member Author

This is now moot, as Dask can convert to/from Zarr thanks to @martindurant's PR (dask/dask#3460).
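For reference, the API that landed (these are the real dask.array functions; check the Dask documentation for current signatures):

```python
import dask.array as da

d = da.ones((1000, 1000), chunks=(100, 100))

# Store a Dask array as a Zarr array on disk.
d.to_zarr("example.zarr")

# Read it back as a Dask array, inheriting the Zarr chunking.
d2 = da.from_zarr("example.zarr")
```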
