Add API to Store interface to read directly into an output buffer #3429

@TomAugspurger

Description

Currently, when reading uncompressed data, we have the store read the data into a temporary buffer and then copy those bytes into the output buffer.

To get the best performance, we should consider adding an optional get_into API to Store. Instead of taking a prototype, this would take the actual output Buffer to read into. Stores must opt into this, for backwards compatibility, by overriding supports_get_into:

    async def get_into(
        self,
        key: str,
        out: Buffer,
        byte_range: ByteRequest | None = None,
    ) -> bool:
        raise NotImplementedError

    @property
    def supports_get_into(self) -> bool:
        """Does the store support get_into?"""
        return False
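
To make the opt-in concrete, here is a minimal sketch of how a local-file store and its caller might use this. Everything in it is an assumption for illustration: LocalStoreSketch, read_chunk, and the fallback get are hypothetical names, not zarr's actual classes, and I am assuming the bool return value means "the key was found". byte_range is left out for brevity, and a plain writable buffer stands in for Buffer.

```python
import asyncio
import os
import tempfile


class LocalStoreSketch:
    """Hypothetical local-file store; NOT zarr's actual LocalStore."""

    def __init__(self, root: str) -> None:
        self.root = root

    @property
    def supports_get_into(self) -> bool:
        # Opt in: this store can fill a caller-provided buffer directly.
        return True

    def _read_into(self, key: str, out) -> bool:
        # View the output as flat bytes so readinto can fill it in place.
        view = memoryview(out).cast("B")
        try:
            with open(os.path.join(self.root, key), "rb") as f:
                n = f.readinto(view)
        except FileNotFoundError:
            return False  # assumption: False means "key not found"
        if n != view.nbytes:
            raise ValueError(f"expected {view.nbytes} bytes, read {n}")
        return True

    async def get_into(self, key: str, out) -> bool:
        # Run the blocking file read in a worker thread.
        return await asyncio.to_thread(self._read_into, key, out)


async def read_chunk(store, key: str, out) -> bool:
    """Hypothetical caller: use get_into when offered, else get + copy."""
    if store.supports_get_into:
        return await store.get_into(key, out)
    data = await store.get(key)  # assumed to return bytes or None
    if data is None:
        return False
    memoryview(out).cast("B")[:] = data  # the extra copy we want to avoid
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "c0"), "wb") as f:
            f.write(b"abcdefgh")
        out = bytearray(8)
        found = asyncio.run(read_chunk(LocalStoreSketch(root), "c0", out))
        print(found, bytes(out))
```

Stores that leave supports_get_into as False keep working through the existing get-then-copy path, which is the backwards-compatibility point.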

For the special case where

  • the data is uncompressed, and
  • the chunk being read is a contiguous subset of the output ndarray

the bytes on disk can be interpreted directly as an ndarray (when combined with a shape and itemsize, and maybe endianness?), and we can avoid a memcpy. Some early testing indicates that this might be worth doing. Over in https://github.com/TomAugspurger/zarr-python/blob/tom/zero-copy-alt/simple.py, I see about 7.5x higher throughput for reading uncompressed data with read_into (compared to about 2.5x higher throughput for compressed data, where this get_into optimization isn't an option).
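
As a rough illustration of why the contiguous case needs no copy, the sketch below uses only the standard library: a bytearray stands in for the output ndarray's memory, io.BytesIO stands in for the file on disk, and the 4x3 shape and native-endian int format are made-up values, not anything taken from zarr.

```python
import io
import struct

ITEMSIZE = struct.calcsize("i")  # size of one native int
ROWS, COLS = 4, 3

# Pre-allocated, zero-filled memory for a 4x3 output "array".
out = bytearray(ROWS * COLS * ITEMSIZE)

# An uncompressed chunk holding rows 1 and 2, stored in native byte order.
chunk_bytes = struct.pack("6i", 1, 2, 3, 4, 5, 6)
source = io.BytesIO(chunk_bytes)  # stands in for the file on disk

# Rows 1..2 are one contiguous byte range of the C-ordered output, so the
# chunk can be read straight into that range with no temporary buffer.
row_bytes = COLS * ITEMSIZE
target = memoryview(out)[1 * row_bytes : 3 * row_bytes]
n_read = source.readinto(target)

# Reinterpret the same memory with a format and shape; nothing is copied.
grid = memoryview(out).cast("i", shape=[ROWS, COLS])
rows = grid.tolist()
```

If the stored byte order differed from the native one, the values would still need a byte swap after the read, which is why endianness has to be part of the check.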

Real-world gains will probably be lower, and remote file system libraries typically don't offer a way to read directly into a user-allocated output buffer like .readinto does.

Metadata


Labels: performance (Potential issues with Zarr performance: I/O, memory, etc.)
