API for direct block access #543

Open
clbarnes opened this issue Mar 3, 2020 · 2 comments

@clbarnes
Contributor

clbarnes commented Mar 3, 2020

There are internal methods for retrieving individual blocks, but in some circumstances it is helpful for end users to address data one block at a time. A public API for this would save the user from building their own pipeline of chunk size -> block indices -> slicing, only for zarr to then go from slicing -> block indices etc. again.

I envision something like

from dataclasses import dataclass
from typing import Iterator, Tuple

import numpy as np


@dataclass
class ChunkWrapper:
    chunk_idx: Tuple[int, ...]  # position of the chunk in the chunk grid
    chunk_slice: Tuple[slice, ...]  # or an offset-shape pair, or a start-stop pair
    data: np.ndarray


class Array:
    ...

    def get_chunk(self, chunk_idx: Tuple[int, ...]) -> ChunkWrapper:
        ...

    def set_chunk(self, chunk_idx: Tuple[int, ...], data: np.ndarray) -> None:
        # check data is the right shape, handling edge blocks
        ...

    def iter_chunk_idxs(self) -> Iterator[Tuple[int, ...]]:
        # yields chunk indices (not wrappers), to be passed to get_chunk
        ...

Then e.g. a blockwise operation could be trivially implemented with

for idx in my_array.iter_chunk_idxs():
    chunk = my_array.get_chunk(idx)
    my_array.set_chunk(idx, chunk.data * 2)

Obviously in this particular case, you could use dask, but the principle is useful elsewhere. My use case is that I have an array of labels which I want to relate to point annotations: I want to get a chunk, see which point annotations exist inside it, and find the relationships, preferably without chunk-mangling boilerplate 😁

This would give tools that implement their own parallelism (dask being one example, but many others are imaginable) much easier access to the blocked nature of the underlying arrays.
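
For context, the chunk size -> block indices -> slicing boilerplate mentioned above might look roughly like the sketch below. It is not part of zarr: the function names mirror the proposal but are hypothetical free functions, and it assumes a regular chunk grid in which only the trailing chunk along each axis may be smaller.

```python
import itertools
import math
from typing import Iterator, Tuple


def iter_chunk_idxs(
    shape: Tuple[int, ...], chunks: Tuple[int, ...]
) -> Iterator[Tuple[int, ...]]:
    """Yield every chunk index of a regular chunk grid, in C order."""
    # Number of chunks along each axis, rounding up for partial edge chunks.
    grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    return itertools.product(*(range(n) for n in grid))


def chunk_slices(
    shape: Tuple[int, ...], chunks: Tuple[int, ...], chunk_idx: Tuple[int, ...]
) -> Tuple[slice, ...]:
    """Map a chunk index to the array region it covers, clipping edge chunks."""
    return tuple(
        slice(i * c, min((i + 1) * c, s))
        for i, c, s in zip(chunk_idx, chunks, shape)
    )
```

Given an array `z` exposing `shape` and `chunks`, `z[chunk_slices(z.shape, z.chunks, idx)]` should then read exactly one chunk through ordinary indexing, which is the round trip a built-in `get_chunk` would avoid.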

@alimanfoo
Member

Hi @clbarnes, could get_chunk() return the numpy array with chunk data directly, no need for a wrapper class? E.g.:

for idx in my_array.iter_chunk_idxs():
    chunk = my_array.get_chunk(idx)
    my_array.set_chunk(idx, chunk * 2)

@clbarnes
Contributor Author

clbarnes commented Mar 3, 2020

It could, although when working with blocks like this before, I found myself passing the same tuple of index and data around to different functions, so I thought it might be handy to have them in the same place. Not critical, though. You could get the convenience of both in one go by subclassing np.ndarray to add a chunk_idx member, although that introduces more maintenance overhead.
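
A minimal sketch of the subclassing option, following the usual NumPy subclassing pattern (`__new__` plus `__array_finalize__`). The class name `ChunkArray` is made up for illustration and nothing here exists in zarr.

```python
import numpy as np


class ChunkArray(np.ndarray):
    """ndarray that remembers which chunk it came from (hypothetical class)."""

    def __new__(cls, data, chunk_idx):
        # View the input as this subclass, then attach the chunk index.
        obj = np.asarray(data).view(cls)
        obj.chunk_idx = tuple(chunk_idx)
        return obj

    def __array_finalize__(self, obj):
        # Called for views and for results of operations; carry the index along.
        if obj is None:
            return
        self.chunk_idx = getattr(obj, "chunk_idx", None)


chunk = ChunkArray(np.arange(6).reshape(2, 3), chunk_idx=(1, 0))
doubled = chunk * 2  # behaves like a plain array, but keeps chunk_idx
```

One caveat that hints at the maintenance overhead: a slice such as `chunk[0]` also inherits `chunk_idx`, even though it no longer covers the whole chunk.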
