Zarr datasource by alexandrplashchinsky · Pull Request #63003 · ray-project/ray

alexandrplashchinsky · 2026-04-28T21:19:05Z

Description

This PR introduces ray.data.read_zarrv2() and the backing ZarrV2Datasource for reading Zarr v2 stores with Ray Data.

This adds a dedicated public API and datasource implementation for Zarr v2 so users can read chunk metadata from consolidated Zarr v2 stores through the standard Ray Data read API surface.

Related issues

N/A

Additional information

This PR adds:

ray.data.read_zarrv2() as a new public Ray Data read API
ZarrV2Datasource as the datasource implementation used by the API
unit tests covering the datasource behavior and API wiring

Example usage:

import ray

ds = ray.data.read_zarrv2("/path/to/store")

gemini-code-assist

Code Review

This pull request introduces support for reading Zarr v2 stores in Ray Data by adding the ZarrV2Datasource and a public read_zarrv2 API. The implementation includes support for various storage backends (local, S3, Azure) and handles chunk metadata, slice bounds, and padding. Feedback focuses on reducing logic duplication in chunk calculation, simplifying path normalization, converting a utility method to a static method, and adhering to standard Python formatting for keyword arguments.

ArturNiederfahrenhorst · 2026-05-21T16:10:59Z

+        ...     chunk_size=100,
+        ... )
+        >>> ds.count()
+        2


I created this dataset and tested with it. So we don't need to skip here.

ArturNiederfahrenhorst · 2026-05-21T16:16:34Z

+):
+    """Creates a :class:`~ray.data.Dataset` from a Zarr v2 store.
+
+    Each output row is one slice along axis 0 of the store's selected arrays.


This is a larger change from where the PR started.
The key idea: any given Zarr dataset contains a number of arrays. Some of them are typically metadata and do not represent rows of the dataset. Others do represent rows. For example, an MNIST-style dataset would have two arrays, one for images and one for labels, and one image and one label belong together in one row.
So when reading from datasets that also contain metadata, you should provide array_paths so that only the arrays that align semantically on their 0th axis are read together.
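The row-alignment idea in this comment can be sketched without Ray or Zarr. This is only an illustration, assuming a store is modelled as a plain dict of nested lists whose outermost list is axis 0. The names `store` and `rows_along_axis0` are hypothetical and not part of this PR.

```python
# Hypothetical stand-in for a Zarr store: array path -> nested list,
# where the outermost list is axis 0.
store = {
    "images": [[[0, 0], [0, 0]] for _ in range(4)],  # 4 tiny 2x2 "images"
    "labels": [0, 1, 2, 3],                          # 4 labels, one per image
    "meta/class_names": ["a", "b", "c"],             # metadata, not row-aligned
}


def rows_along_axis0(store, array_paths):
    """Return one dict per index along axis 0 of the selected arrays."""
    selected = {path: store[path] for path in array_paths}
    lengths = {len(arr) for arr in selected.values()}
    if len(lengths) != 1:
        # Arrays that disagree on axis-0 length cannot share rows.
        raise ValueError(f"axis-0 lengths differ: {sorted(lengths)}")
    (num_rows,) = lengths
    return [
        {path: arr[i] for path, arr in selected.items()}
        for i in range(num_rows)
    ]


rows = rows_along_axis0(store, ["images", "labels"])
```

Selecting `"meta/class_names"` together with `"images"` raises `ValueError` in this sketch, which is the situation the `array_paths` argument is meant to let users avoid.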

cursor · 2026-05-21T16:23:34Z

+        array_path = (
+            ""
+            if dirpath.rstrip("/") == store_root
+            else dirpath.removeprefix(store_prefix)


Trailing slash not stripped in full-scan array path

Low Severity

In _load_metadata_full_scan, the dirpath from fs.walk is used directly with removeprefix without stripping trailing slashes. The code correctly uses dirpath.rstrip("/") for the root comparison (line 123) and for building zarray_path (line 126), but the array_path key computed via dirpath.removeprefix(store_prefix) could retain a trailing slash if the filesystem's walk yields one. This would produce keys like "sub/" that won't match the normalized (slash-stripped) paths used in filtering and user-facing array_paths.
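One way to address this is to strip trailing slashes before removing the prefix. The helper below is a minimal sketch, not the PR's code: `normalize_array_path` is a hypothetical name, and it assumes `store_prefix` is the store root followed by a single `/`. `str.removeprefix` requires Python 3.9 or later.

```python
def normalize_array_path(dirpath: str, store_root: str) -> str:
    """Return dirpath relative to store_root, with no leading or trailing slash."""
    root = store_root.rstrip("/")
    path = dirpath.rstrip("/")
    if path == root:
        return ""  # the store root itself
    # Assumed equivalent of store_prefix: the root plus one separator.
    return path.removeprefix(root + "/").strip("/")


key = normalize_array_path("/data/store/sub/", "/data/store")
```

With this normalization the key for a walked directory `"/data/store/sub/"` is `"sub"`, matching the slash-stripped form used for filtering.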

Reviewed by Cursor Bugbot for commit 80024bb.

cursor · 2026-05-22T21:49:42Z

+                            per_task_row_limit=per_task_row_limit,
+                        )
+                    )
+                    batch = []


Mixed-rank array chunks in single batch breaks Arrow

Medium Severity

When materialize=True and the store has multiple arrays of different ranks (e.g., 2-D images and 1-D labels), get_read_tasks mixes chunks from all arrays into the same batch. A single batch's chunk column would then contain numpy arrays of different ranks. Ray Data converts DataFrames to Arrow internally, and Arrow's tensor extension requires uniform rank within a column — this will fail at conversion time when parallelism is low enough to group different arrays together. The alternate implementation in the PR correctly batches per-array to avoid this.
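The per-array batching mentioned here can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `ChunkRef` and `batch_per_array` are hypothetical names, and a chunk is represented only by its array path and chunk index, so the rank is the length of the index.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical descriptor for one chunk of one array."""
    array_path: str
    chunk_index: Tuple[int, ...]  # rank == len(chunk_index)


def batch_per_array(chunks: List[ChunkRef], batch_size: int) -> List[List[ChunkRef]]:
    """Group chunks by array first, then cut each group into batches."""
    by_array: Dict[str, List[ChunkRef]] = defaultdict(list)
    for chunk in chunks:
        by_array[chunk.array_path].append(chunk)
    batches: List[List[ChunkRef]] = []
    for group in by_array.values():
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches


chunks = [
    ChunkRef("images", (0, 0)),
    ChunkRef("labels", (0,)),
    ChunkRef("images", (1, 0)),
    ChunkRef("labels", (1,)),
    ChunkRef("images", (2, 0)),
]
batches = batch_per_array(chunks, batch_size=2)
```

Because no batch spans two arrays, every batch holds chunks of a single rank, so a column built from one batch never mixes ranks.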

Reviewed by Cursor Bugbot for commit c309fea.

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 5 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 1f0f1fa.

cursor · 2026-05-23T05:09:11Z

+    ]
+    actual = sum(_deep_sizeof(v) for v in row_cells)
+
+    estimate = zarrv2_datasource._descriptor_row_size_bytes(array_name, meta)


Test references undefined module-level function _descriptor_row_size_bytes

Medium Severity

The test calls zarrv2_datasource._descriptor_row_size_bytes(array_name, meta) but this function is never defined anywhere in the codebase. The module defines related constants (_PYINT_BYTES, _PYSTR_BASE, _PYTUPLE_BASE, etc.) that are clearly intended for this function, but the implementation was never added. This test will always fail with AttributeError.
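For context, the test compares the estimate against a measured size computed by `_deep_sizeof`, whose definition is not shown in this excerpt. The following is a rough sketch of what such a helper commonly looks like, assuming it recursively sums `sys.getsizeof` over containers. The name `deep_sizeof` and its behavior are assumptions, not the PR's code.

```python
import sys


def deep_sizeof(obj) -> int:
    """Sum sys.getsizeof over an object and, recursively, its container contents."""
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k) + deep_sizeof(v) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item) for item in obj)
    return size


pair_size = deep_sizeof((1, 2))
```

This sketch counts shared objects once per reference, so it can overestimate when the same object appears several times in a row.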

Additional Locations (1)

python/ray/data/_internal/datasource/zarrv2_datasource.py#L41-L52

Reviewed by Cursor Bugbot for commit 1f0f1fa.

cursor · 2026-05-23T05:09:11Z

+# container + the int object itself.
+_INT_IN_SEQ = _PYPTR_BYTES + _PYINT_BYTES
+# Cost of one (int, int) tuple held inside a list.
+_PAIR_IN_LIST = _PYPTR_BYTES + _PYTUPLE_BASE + 2 * _INT_IN_SEQ


Unused module-level constants from missing function implementation

Low Severity

The constants _PYINT_BYTES, _PYSTR_BASE, _PYTUPLE_BASE, _PYLIST_BASE, _PYPTR_BYTES, _INT_IN_SEQ, and _PAIR_IN_LIST are defined but never referenced anywhere in the module. They appear to be remnants of a _descriptor_row_size_bytes function that was never implemented, leaving dead code that adds confusion for future readers.

Reviewed by Cursor Bugbot for commit 1f0f1fa.

alexandrplashchinsky requested a review from a team as a code owner April 28, 2026 21:19

gemini-code-assist Bot reviewed Apr 28, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

Comment thread python/ray/data/read_api.py Outdated

cursor Bot reviewed Apr 28, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

alexandrplashchinsky force-pushed the zarr-datasource branch from 7390b42 to 1da5b21 Compare April 28, 2026 21:30

cursor Bot reviewed Apr 28, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

cursor Bot reviewed Apr 28, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

cursor Bot reviewed Apr 28, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

ayushk7102 added the "go" label (add ONLY when ready to merge, run all tests) Apr 28, 2026

ray-gardener Bot added the "data" label (Ray Data-related issues) Apr 29, 2026

alexandrplashchinsky removed the "go" label (add ONLY when ready to merge, run all tests) Apr 29, 2026

alexandrplashchinsky self-assigned this Apr 29, 2026

alexandrplashchinsky force-pushed the zarr-datasource branch 2 times, most recently from 901636d to 687c8fa Compare April 29, 2026 03:12

cursor Bot reviewed Apr 29, 2026

Comment thread python/ray/data/tests/datasource/test_zarrv2.py Outdated

alexandrplashchinsky force-pushed the zarr-datasource branch 2 times, most recently from 34ab253 to a57fd8c Compare April 29, 2026 18:34

cursor Bot reviewed Apr 29, 2026

Comment thread python/ray/data/tests/datasource/test_zarrv2.py Outdated

cursor Bot reviewed Apr 29, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

Alexandr Plashchinsky and others added 12 commits April 30, 2026 12:30

zarr datasource

68b92fe

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

init py import

b447258

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

debug

b0a965c

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

added custom chunk shape

f469cb0

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

documentation for read API

c32510c

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

added padding to read task returns

867a919

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

added tuple conversion for chunk shapes

205228e

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

tests for zarr v2 datasource

9c87d40

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

fixed bugs

d597244

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

fixes

b730cf6

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

obj in check import

5679351

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

fixed obj path in check import

088e4f3

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

ArturNiederfahrenhorst and others added 2 commits May 20, 2026 19:55

add zarr to data test requirements

0a270f7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

drop _check_import stub; match sibling test pattern

a0b1b35

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor Bot reviewed May 20, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

clean up

f9e5c92

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

cursor Bot reviewed May 20, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

ArturNiederfahrenhorst mentioned this pull request May 20, 2026

[Don't merge] Changes to Zarr feature branch #63552

Draft

chunk geometry in metadata only read tasks

dae4970

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

cursor Bot reviewed May 21, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

ArturNiederfahrenhorst reviewed May 21, 2026

cursor Bot reviewed May 21, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

cursor Bot reviewed May 21, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

alexandrplashchinsky force-pushed the zarr-datasource branch from 4df7f5b to dae4970 Compare May 21, 2026 19:01

upstream changes

7b05698

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

cursor Bot reviewed May 22, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

arturs code

9212d23

cursor Bot reviewed May 22, 2026

Comment thread python/ray/data/__init__.py Outdated

put read chunk def inside read fn

79dc6dd

cursor Bot reviewed May 22, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource_artur.py Outdated

work around

05f0d81

cursor Bot reviewed May 22, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource_artur.py Outdated

Alexandr Plashchinsky added 3 commits May 22, 2026 14:09

potential fix

12b5609

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

fix

d9b70f7

revert arturs code

c309fea

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

cursor Bot reviewed May 22, 2026

Alexandr Plashchinsky added 2 commits May 22, 2026 22:01

chunk shapes dict, and always materialize

3434a8d

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

cleanup

1f0f1fa

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

cursor Bot reviewed May 23, 2026

fix

8598ceb

Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

4 participants