API for direct block access #543

clbarnes · 2020-03-03T12:11:16Z

There are internal methods for retrieving individual blocks, but in some circumstances addressing data one block at a time is helpful for end users. A public API for this would save users from building their own pipeline of chunk size -> block indices -> slicing, only for zarr to then go from slicing -> block indices again internally.

I envision something like

from dataclasses import dataclass
from typing import Iterator, Tuple

import numpy as np

@dataclass
class ChunkWrapper:
    chunk_idx: Tuple[int, ...]
    chunk_slice: Tuple[slice, ...]  # or an offset-shape pair, or a start-stop pair
    data: np.ndarray

class Array:
    ...
    def get_chunk(self, chunk_idx: Tuple[int, ...]) -> ChunkWrapper:
        ...

    def set_chunk(self, chunk_idx: Tuple[int, ...], data: np.ndarray) -> None:
        # check data is the right shape, handling edge blocks
        ...

    def iter_chunk_idxs(self) -> Iterator[Tuple[int, ...]]:
        # yields chunk indices, which are then passed to get_chunk / set_chunk
        ...

Then e.g. a blockwise operation could be trivially implemented with

for idx in my_array.iter_chunk_idxs():
    chunk = my_array.get_chunk(idx)
    my_array.set_chunk(idx, chunk.data * 2)

Obviously in this particular case, you could use dask, but the principle is useful elsewhere. My use case is that I have an array of labels which I want to relate to point annotations: I want to get a chunk, see which point annotations exist inside it, and find the relationships, preferably without chunk-mangling boilerplate 😁

This would give tools that implement their own parallelism (dask being one example, but many others are imaginable) much easier access to the blocked nature of the underlying arrays.
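For reference, the chunk-mangling boilerplate this proposal would hide looks roughly like the sketch below. It is not zarr API: `iter_chunk_idxs` and `chunk_slice` here are hypothetical free functions that take the array shape and chunk shape explicitly, and they assume a regular chunk grid where only the trailing chunk along each axis may be truncated.

```python
import itertools
import math
from typing import Iterator, Tuple


def iter_chunk_idxs(
    shape: Tuple[int, ...], chunks: Tuple[int, ...]
) -> Iterator[Tuple[int, ...]]:
    """Yield every chunk index of a regular chunk grid, last axis varying fastest."""
    # Number of chunks along each axis, counting a partial edge chunk as one.
    grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    return itertools.product(*(range(n) for n in grid))


def chunk_slice(
    shape: Tuple[int, ...], chunks: Tuple[int, ...], chunk_idx: Tuple[int, ...]
) -> Tuple[slice, ...]:
    """Convert a chunk index into array slices, clipping edge chunks to the bounds."""
    return tuple(
        slice(i * c, min((i + 1) * c, s))
        for i, c, s in zip(chunk_idx, chunks, shape)
    )
```

With these, a blockwise read on an existing array would be something like `my_array[chunk_slice(my_array.shape, my_array.chunks, idx)]`, which still goes through the normal slicing path rather than fetching the block directly.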

alimanfoo · 2020-03-03T12:27:55Z

Hi @clbarnes, could get_chunk() return the numpy array with chunk data directly, no need for a wrapper class? E.g.:

for idx in my_array.iter_chunk_idxs():
    chunk = my_array.get_chunk(idx)
    my_array.set_chunk(idx, chunk * 2)

clbarnes · 2020-03-03T12:32:23Z

It could. That said, when working with blocks like this before, I've found myself shipping the same tuple of index and data around to different functions, so I thought it might be handy to have them in the same place. Not critical, though. You could get the convenience of both in one go by subclassing np.ndarray to add a chunk_idx member, although that introduces more maintenance overhead.
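A minimal sketch of that subclassing idea, following the standard NumPy subclassing pattern (`__new__` plus `__array_finalize__`). The class name `ChunkArray` is made up for illustration and is not part of zarr:

```python
import numpy as np


class ChunkArray(np.ndarray):
    """Hypothetical ndarray subclass that remembers which chunk it came from."""

    def __new__(cls, data, chunk_idx=None):
        # View the input as this subclass (no copy for ndarray input),
        # then attach the chunk index.
        obj = np.asarray(data).view(cls)
        obj.chunk_idx = chunk_idx
        return obj

    def __array_finalize__(self, obj):
        # Called whenever a new instance is created, including views and slices.
        # obj is None only during explicit construction, which __new__ handles.
        if obj is None:
            return
        self.chunk_idx = getattr(obj, "chunk_idx", None)
```

Here `ChunkArray(data, chunk_idx=(1, 2))[:1].chunk_idx` is still `(1, 2)`. Note that a slice keeps the index of its parent chunk even though it no longer covers the whole chunk, which is part of the maintenance overhead mentioned above.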

jakirkham mentioned this issue Mar 9, 2020: Utility to turn block-ID into equivalent slice #545 (Open)

tasansal mentioned this issue Mar 25, 2022: Proposal: Add Array.blocks using new BlockIndexer (Prototype Code Included) #991 (Closed)

tasansal mentioned this issue Jun 8, 2023: Support Block (Chunk) Indexing #1428 (Merged)
