Zarr datasource #63003

Open
alexandrplashchinsky wants to merge 78 commits into ray-project:master from alexandrplashchinsky:zarr-datasource

Conversation

@alexandrplashchinsky

Description

This PR introduces ray.data.read_zarrv2() and the backing ZarrV2Datasource for reading Zarr v2 stores with Ray Data.

This adds a dedicated public API and datasource implementation for Zarr v2 so users can read chunk metadata from consolidated Zarr v2 stores through the standard Ray Data read API surface.
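
For readers unfamiliar with the term: a consolidated Zarr v2 store keeps a copy of every array's metadata in a single `.zmetadata` JSON document at the store root, under a top-level `metadata` key. The sketch below lists the arrays described by such a document using only the standard library. It is an illustration, not the datasource's code: `list_arrays` and the sample `images` and `labels` arrays are hypothetical.

```python
import json


def list_arrays(zmetadata_text):
    """Map each array path to its shape, chunk shape, and dtype."""
    consolidated = json.loads(zmetadata_text)
    arrays = {}
    for key, meta in consolidated["metadata"].items():
        if key == ".zarray":
            path = ""  # array stored directly at the store root
        elif key.endswith("/.zarray"):
            path = key[: -len("/.zarray")]
        else:
            continue  # .zgroup and .zattrs entries describe no array
        arrays[path] = {
            "shape": tuple(meta["shape"]),
            "chunks": tuple(meta["chunks"]),
            "dtype": meta["dtype"],
        }
    return arrays


# Hypothetical consolidated metadata for a store holding two arrays.
EXAMPLE_ZMETADATA = json.dumps(
    {
        "zarr_consolidated_format": 1,
        "metadata": {
            ".zgroup": {"zarr_format": 2},
            "images/.zarray": {
                "shape": [200, 28, 28],
                "chunks": [100, 28, 28],
                "dtype": "|u1",
            },
            "labels/.zarray": {"shape": [200], "chunks": [100], "dtype": "<i8"},
        },
    }
)

arrays = list_arrays(EXAMPLE_ZMETADATA)
print(sorted(arrays))
```

Reading this one document avoids listing the store's directories, which matters most on object stores where each listing is a network round trip.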

Related issues

N/A

Additional information

This PR adds:

  • ray.data.read_zarrv2() as a new public Ray Data read API
  • ZarrV2Datasource as the datasource implementation used by the API
  • unit tests covering the datasource behavior and API wiring

Example usage:

import ray

ds = ray.data.read_zarrv2("/path/to/store")

@alexandrplashchinsky alexandrplashchinsky requested a review from a team as a code owner April 28, 2026 21:19
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for reading Zarr v2 stores in Ray Data by adding the ZarrV2Datasource and a public read_zarrv2 API. The implementation includes support for various storage backends (local, S3, Azure) and handles chunk metadata, slice bounds, and padding. Feedback focuses on reducing logic duplication in chunk calculation, simplifying path normalization, converting a utility method to a static method, and adhering to standard Python formatting for keyword arguments.
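
To make the "slice bounds and padding" part concrete: in Zarr v2, chunks on the trailing edge of an array are stored at the full chunk shape, so a reader has to trim each decoded chunk to the region that actually lies inside the array. The following is a minimal sketch of that arithmetic, assuming only an array `shape` and a `chunks` shape; `chunk_bounds` is a hypothetical helper, not a function from this PR.

```python
import itertools
import math


def chunk_bounds(shape, chunks):
    """Yield (chunk_index, bounds) for every chunk in the chunk grid.

    bounds holds one (start, stop) pair per axis, clipped to the array shape,
    so an edge chunk reports only its valid (unpadded) region.
    """
    grid = [math.ceil(dim / size) for dim, size in zip(shape, chunks)]
    for index in itertools.product(*(range(n) for n in grid)):
        bounds = tuple(
            (i * size, min((i + 1) * size, dim))
            for i, size, dim in zip(index, chunks, shape)
        )
        yield index, bounds


# A 5x4 array with 2x4 chunks has three chunks along axis 0; the last one
# holds a single valid row even though two rows are stored on disk.
for index, bounds in chunk_bounds((5, 4), (2, 4)):
    print(index, bounds)
```

Keeping this calculation in one place is one way to address the review's point about duplicated chunk logic.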

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated
Comment thread python/ray/data/read_api.py Outdated
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py
@ayushk7102 ayushk7102 added the "go" label (add ONLY when ready to merge, run all tests) Apr 28, 2026
@ray-gardener ray-gardener Bot added the "data" label (Ray Data-related issues) Apr 29, 2026
@alexandrplashchinsky alexandrplashchinsky removed the "go" label (add ONLY when ready to merge, run all tests) Apr 29, 2026
@alexandrplashchinsky alexandrplashchinsky self-assigned this Apr 29, 2026
@alexandrplashchinsky alexandrplashchinsky force-pushed the zarr-datasource branch 2 times, most recently from 901636d to 687c8fa Compare April 29, 2026 03:12
Comment thread python/ray/data/tests/datasource/test_zarrv2.py Outdated
@alexandrplashchinsky alexandrplashchinsky force-pushed the zarr-datasource branch 2 times, most recently from 34ab253 to a57fd8c Compare April 29, 2026 18:34
Comment thread python/ray/data/tests/datasource/test_zarrv2.py Outdated
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py
Alexandr Plashchinsky and others added 12 commits April 30, 2026 12:30
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
ArturNiederfahrenhorst and others added 2 commits May 20, 2026 19:55
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py
Comment thread python/ray/data/read_api.py Outdated
... chunk_size=100,
... )
>>> ds.count()
2
Contributor

I created this dataset and tested with it. So we don't need to skip here.

Comment thread python/ray/data/read_api.py Outdated
):
"""Creates a :class:`~ray.data.Dataset` from a Zarr v2 store.

Each output row is one slice along axis 0 of the store's selected arrays.
Contributor

This is a larger change from where the PR started.
The key idea: any given Zarr dataset contains a number of arrays. Some of them are typically metadata and do not represent rows of the dataset. Others do represent rows (for example, an MNIST dataset would have two arrays, one for images and one for labels, where one image and one label together form one row).
So when reading from a dataset that also contains metadata arrays, you should provide array_paths so that only arrays that align semantically on their 0th axis are read together.
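
The alignment requirement can be sketched as a small check over array shapes. This assumes the shapes have already been read from the store's metadata; `common_row_count` and the sample array names are hypothetical and only illustrate the rule, not how the datasource validates its input.

```python
def common_row_count(shapes, array_paths):
    """Return the shared axis-0 length of the selected arrays.

    Raises ValueError when the selection is empty or the arrays disagree,
    since such arrays cannot be zipped into rows.
    """
    lengths = {path: shapes[path][0] for path in array_paths}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"arrays do not align on axis 0: {lengths}")
    return next(iter(lengths.values()))


# Hypothetical store: two row-aligned arrays plus one metadata array.
shapes = {
    "images": (200, 28, 28),
    "labels": (200,),
    "class_names": (10,),
}

print(common_row_count(shapes, ["images", "labels"]))
```

Selecting `class_names` together with `images` would raise, which is the situation array_paths is meant to let users avoid.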

array_path = (
""
if dirpath.rstrip("/") == store_root
else dirpath.removeprefix(store_prefix)
Trailing slash not stripped in full-scan array path

Low Severity

In _load_metadata_full_scan, the dirpath from fs.walk is used directly with removeprefix without stripping trailing slashes. The code correctly uses dirpath.rstrip("/") for the root comparison (line 123) and for building zarray_path (line 126), but the array_path key computed via dirpath.removeprefix(store_prefix) could retain a trailing slash if the filesystem's walk yields one. This would produce keys like "sub/" that won't match the normalized (slash-stripped) paths used in filtering and user-facing array_paths.
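
One possible shape for the fix is to strip trailing slashes from both paths before comparing or removing the prefix. The sketch below is an assumption about how that could look, not the code in the PR: `array_path_for` is a hypothetical helper, and `str.removeprefix` requires Python 3.9 or later.

```python
def array_path_for(dirpath, store_root):
    """Return the store-relative array path for a directory yielded by a walk."""
    root = store_root.rstrip("/")
    normalized = dirpath.rstrip("/")
    if normalized == root:
        return ""  # the store root itself
    # Remove "<root>/" so the key never starts or ends with a slash.
    return normalized.removeprefix(root + "/")


print(repr(array_path_for("/data/store/sub/", "/data/store")))
print(repr(array_path_for("/data/store/", "/data/store/")))
```

With both sides normalized, a walk that yields "sub/" and one that yields "sub" produce the same key.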

Reviewed by Cursor Bugbot for commit 80024bb.

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated
Comment thread python/ray/data/__init__.py Outdated
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource_artur.py Outdated
Comment thread python/ray/data/_internal/datasource/zarrv2_datasource_artur.py Outdated
Alexandr Plashchinsky added 3 commits May 22, 2026 14:09
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
per_task_row_limit=per_task_row_limit,
)
)
batch = []
Mixed-rank array chunks in single batch breaks Arrow

Medium Severity

When materialize=True and the store has multiple arrays of different ranks (e.g., 2-D images and 1-D labels), get_read_tasks mixes chunks from all arrays into the same batch. A single batch's chunk column would then contain numpy arrays of different ranks. Ray Data converts DataFrames to Arrow internally, and Arrow's tensor extension requires uniform rank within a column — this will fail at conversion time when parallelism is low enough to group different arrays together. The alternate implementation in the PR correctly batches per-array to avoid this.
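
Batching per array, as the comment suggests, could look roughly like the sketch below. It assumes chunks are described by `(array_name, chunk_index)` pairs; `batch_per_array` and that descriptor layout are hypothetical and do not mirror the PR's `get_read_tasks`.

```python
from collections import defaultdict


def batch_per_array(chunk_descriptors, batch_size):
    """Split chunk descriptors into batches that never mix arrays."""
    by_array = defaultdict(list)
    for array_name, chunk_index in chunk_descriptors:
        by_array[array_name].append((array_name, chunk_index))

    batches = []
    for items in by_array.values():
        for start in range(0, len(items), batch_size):
            batches.append(items[start : start + batch_size])
    return batches


# Interleaved descriptors for a 3-D image array and a 1-D label array.
descriptors = [
    ("images", (0, 0, 0)),
    ("labels", (0,)),
    ("images", (1, 0, 0)),
    ("labels", (1,)),
]

for batch in batch_per_array(descriptors, batch_size=3):
    print(batch)
```

Because every batch draws from a single array, each resulting column holds chunks of one rank, even when the batch size is large enough to span arrays.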

Reviewed by Cursor Bugbot for commit c309fea.

Alexandr Plashchinsky added 2 commits May 22, 2026 22:01
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 5 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 1f0f1fa.

Comment thread python/ray/data/read_api.py Outdated
]
actual = sum(_deep_sizeof(v) for v in row_cells)

estimate = zarrv2_datasource._descriptor_row_size_bytes(array_name, meta)
Test references undefined module-level function _descriptor_row_size_bytes

Medium Severity

The test calls zarrv2_datasource._descriptor_row_size_bytes(array_name, meta) but this function is never defined anywhere in the codebase. The module defines related constants (_PYINT_BYTES, _PYSTR_BASE, _PYTUPLE_BASE, etc.) that are clearly intended for this function, but the implementation was never added. This test will always fail with AttributeError.
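
For context on what the test compares against, a recursive size measurement of a row of plain Python objects can be sketched with `sys.getsizeof`. This is not the PR's `_deep_sizeof`, whose body is not shown here; `deep_sizeof` below is a hypothetical stand-in, and the numbers it returns depend on the CPython version and platform.

```python
import sys


def deep_sizeof(obj, _seen=None):
    """Sum sys.getsizeof over an object and the containers nested inside it.

    Each distinct object is counted once, so shared objects (for example
    CPython's cached small ints) do not inflate the total.
    """
    seen = set() if _seen is None else _seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_sizeof(key, seen) + deep_sizeof(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_sizeof(item, seen)
    return size


# A hypothetical descriptor row: array name, chunk index, (start, stop) pairs.
row = ["images", (0, 0, 0), [(0, 100), (0, 28), (0, 28)]]
print(deep_sizeof(row), sys.getsizeof(row))
```

An estimator built from per-object constants would need to land close to this measured total for the test's comparison to be meaningful.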

Reviewed by Cursor Bugbot for commit 1f0f1fa.

# container + the int object itself.
_INT_IN_SEQ = _PYPTR_BYTES + _PYINT_BYTES
# Cost of one (int, int) tuple held inside a list.
_PAIR_IN_LIST = _PYPTR_BYTES + _PYTUPLE_BASE + 2 * _INT_IN_SEQ
Unused module-level constants from missing function implementation

Low Severity

The constants _PYINT_BYTES, _PYSTR_BASE, _PYTUPLE_BASE, _PYLIST_BASE, _PYPTR_BYTES, _INT_IN_SEQ, and _PAIR_IN_LIST are defined but never referenced anywhere in the module. They appear to be remnants of a _descriptor_row_size_bytes function that was never implemented, leaving dead code that adds confusion for future readers.
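
A quick way to confirm a finding like this locally is to scan the module with the standard `ast` module. The sketch below is a rough check under simplifying assumptions (plain `name = value` assignments only, no `__all__` or external imports considered); `unused_module_constants` is hypothetical, and the byte values in the sample source are placeholders rather than the PR's values.

```python
import ast


def unused_module_constants(source):
    """Return module-level names that are assigned but never read."""
    tree = ast.parse(source)

    assigned = []
    for node in tree.body:  # top-level statements only
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned.append(target.id)

    loaded = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }
    return [name for name in assigned if name not in loaded]


# Placeholder values; only the reference structure matters here.
SAMPLE = """
_PYPTR_BYTES = 8
_PYINT_BYTES = 28
_INT_IN_SEQ = _PYPTR_BYTES + _PYINT_BYTES
"""

print(unused_module_constants(SAMPLE))
```

Note the limitation this exposes: constants that are read only by other dead constants are not reported, so a chain of unused definitions has to be removed from the end backwards.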

Reviewed by Cursor Bugbot for commit 1f0f1fa.

fix
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Labels

data (Ray Data-related issues)

4 participants