Partial chunk reads #59

Open
jrbourbeau opened this issue Apr 6, 2020 · 7 comments
Labels
codec Codec spec related issue

Comments

@jrbourbeau
Member

The ability for zarr to support partial chunk reads has come up a couple of times (xref zarr-developers/zarr-python#40, zarr-developers/zarr-python#521). One benefit of supporting this would be improved performance for slicing operations that are poorly aligned with chunk boundaries. As @alimanfoo pointed out, some compressors also support partial decompression, which would allow extracting part of a compressed chunk (e.g. via the blosc_getitem function in Blosc).

One potential starting point would be to add a new method, e.g. decode_part, to the Codec interface. Compressors that don't support partial decompression could fall back to decompressing the entire chunk and then slicing it. We would also need a mechanism for mapping chunk indices to the parameters decode_part needs to extract part of a chunk.
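
To make that concrete, here is a minimal sketch of the idea, assuming a simplified Codec base class; decode_part and its signature are hypothetical and only illustrate the fallback behaviour:

```python
# Hypothetical sketch only: ``decode_part`` is not part of the current
# Codec API; the signature here is illustrative.
from abc import ABC, abstractmethod


class Codec(ABC):
    @abstractmethod
    def encode(self, buf):
        """Compress a full chunk."""

    @abstractmethod
    def decode(self, buf):
        """Decompress a full chunk."""

    def decode_part(self, buf, start, nitems):
        """Decode ``nitems`` items beginning at item ``start``.

        Fallback for codecs without native partial decompression:
        decode the whole chunk, then slice. Codecs such as Blosc
        could override this with a true partial decode.
        """
        return self.decode(buf)[start:start + nitems]
```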

With the current work on the v3.0 spec taking place, I wanted to open this issue to discuss whether partial chunk reads are something we'd like to support as a community.

@jrbourbeau jrbourbeau added core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec codec Codec spec related issue labels Apr 6, 2020
@jrbourbeau
Member Author

cc'ing @shoyer @vigji who have expressed an interest in this previously

@vigji

vigji commented Apr 7, 2020

Thank you for the mention! I would still be extremely interested in this feature. I was trying to move my datasets (4D volumetric microscopy imaging) to zarr but stopped mainly because of this problem. I constantly need to combine chunk sizes that make sense for our parallelized data-processing pipelines with the ability to load small parts of those chunks, e.g. for fast visualisation with a viewer that slices them and grabs only the frames to be displayed.
Unfortunately I am not at all familiar with compression algorithms or the internals of zarr, so I don't think I can be of much help in developing this :( but I would greatly appreciate the feature!

@alimanfoo
Member

Thanks @jrbourbeau for reviving this. Just a short technical note: there are at least three possible scenarios here when a compressor is involved.

  1. One scenario is when the whole chunk is retrieved from storage, but only the relevant parts of the chunk are decompressed. This saves some compute because less data are decompressed, but storage (I/O) costs are the same because the whole compressed chunk needs to be retrieved.

  2. Another scenario is when only part of the chunk is retrieved from storage, then that part is decompressed. This potentially saves both compute because of less data being decompressed, and I/O because less data are retrieved from storage.

  3. A third scenario is a storage layout that stores multiple compressed chunks together within the same file or cloud object. You collocate chunks when you know there is a high likelihood they will need to be retrieved together, at least for some use cases. If your storage layer then implements a getitems() method (as discussed here and here), it can potentially optimise retrieval, fetching multiple chunks within a single request. This is not strictly a partial chunk read, but it could address some of the same needs.

Scenario 1 could be achieved for some compression codecs and would require a change to the codec interface to allow leveraging mechanisms such as the blosc_getitem function.
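
For illustration, blosc_getitem is reachable from Python via ctypes; this is only a rough sketch (the library name and wrapper are assumptions, not necessarily how zarr would integrate it):

```python
# Rough sketch: calling C-Blosc's blosc_getitem via ctypes to decompress
# only ``nitems`` items of a compressed chunk (scenario 1). The library
# name is platform-dependent (e.g. "blosc.dll" on Windows).
import ctypes

libblosc = ctypes.CDLL("libblosc.so")


def blosc_getitem(compressed: bytes, start: int, nitems: int,
                  itemsize: int) -> bytes:
    # C signature: int blosc_getitem(const void* src, int start,
    #                                int nitems, void* dest)
    dest = ctypes.create_string_buffer(nitems * itemsize)
    nbytes = libblosc.blosc_getitem(compressed, start, nitems, dest)
    if nbytes < 0:
        raise RuntimeError("blosc_getitem failed with code %d" % nbytes)
    return dest.raw[:nbytes]
```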

It isn't obvious to me yet whether scenario 2 can be achieved at all; it is technically quite complex. If it is doable, it would require changes to both the codec interface and the storage interface.

Scenario 3 is doable and would require changes only to the storage interface.
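
As a sketch of the store-side hook scenario 3 relies on, a mapping-style store could expose a getitems() method that an optimised backend overrides to batch requests (illustrative only, not zarr's actual store interface):

```python
# Illustrative only: a mapping-style store with a ``getitems`` hook.
# A real implementation (e.g. over cloud storage) could translate one
# call into a single ranged request covering co-located chunks.
class NaiveStore(dict):
    def getitems(self, keys):
        # Naive fallback: one lookup per key. An optimised store would
        # batch these into fewer I/O requests.
        return {key: self[key] for key in keys if key in self}
```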

@JackKelly

A fourth scenario (which I'm very interested in) might be:

  4. Data is not compressed. When loading a slice of a chunk, only the requested data is loaded from disk.

Does Zarr already work like this for data in cloud storage buckets and for data from a local POSIX filesystem?
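
For a local file, the arithmetic such a partial read needs is straightforward for a contiguous slice of an uncompressed, C-order chunk; a sketch (file layout, names, and dtypes are assumptions):

```python
# Sketch: read only the requested leading-axis rows of an uncompressed,
# C-order chunk stored in a flat binary file. Paths/dtypes illustrative.
import numpy as np


def read_rows(path, chunk_shape, dtype, row_start, row_stop):
    itemsize = np.dtype(dtype).itemsize
    row_nbytes = int(np.prod(chunk_shape[1:])) * itemsize
    with open(path, "rb") as f:
        f.seek(row_start * row_nbytes)           # skip unwanted rows
        buf = f.read((row_stop - row_start) * row_nbytes)
    return np.frombuffer(buf, dtype=dtype).reshape(
        (row_stop - row_start,) + tuple(chunk_shape[1:]))
```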

@shoyer

shoyer commented Jul 19, 2021

As I understand it, Zarr currently does not support any form of partial chunk reads. But indeed, perhaps it should!

One promising way to implement this would be to wrap Caterva inside Zarr: zarr-developers/zarr-python#713

@rabernat
Contributor

Zarr does support partial chunk reads! It was implemented by @andrewfulton9 in zarr-developers/zarr-python#667 for data encoded with Blosc!
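
For anyone wanting to try it, usage looks roughly like the sketch below, assuming zarr-python >= 2.7; as of v2, the flag only takes effect with an FSStore and Blosc-compressed chunks:

```python
# Rough usage sketch, assuming zarr-python >= 2.7: partial decompression
# is opt-in via ``partial_decompress`` on ``zarr.Array`` and (in v2)
# requires an FSStore plus Blosc-compressed chunks.
import zarr
from numcodecs import Blosc

store = zarr.storage.FSStore("example.zarr")
z = zarr.open(store=store, mode="w", shape=(1000, 1000),
              chunks=(100, 100), compressor=Blosc(cname="zstd"))
z[:] = 42

# Reopen with partial decompression enabled.
za = zarr.Array(store, partial_decompress=True)
part = za[:10, :10]  # may decompress only the needed blocks of a chunk
```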

@jstriebel
Member

The v3 spec defines partial chunk reads and writes, but does not yet discuss interactions with codecs: https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#abstract-store-interface
It would be great to note specifically whether and how partial reads and writes are possible with codecs when writing those sections.
Please let me know if anything about this is missing from the current v3 core spec.
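
As an illustration of the kind of store those sections describe, a local-filesystem store honoring byte-range reads might look like this (a sketch; the exact method names and signatures in the spec may differ):

```python
# Illustrative local-filesystem store supporting byte-range reads, in
# the spirit of the v3 abstract store interface; not the spec's exact
# signature.
import os


class FileStore:
    def __init__(self, root):
        self.root = root

    def get(self, key, byte_range=None):
        path = os.path.join(self.root, key)
        with open(path, "rb") as f:
            if byte_range is None:
                return f.read()
            start, length = byte_range
            f.seek(start)
            return f.read(length)
```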

@jstriebel jstriebel removed the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Nov 16, 2022