Zarr datasource#63003
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for reading Zarr v2 stores in Ray Data by adding the ZarrV2Datasource and a public read_zarrv2 API. The implementation includes support for various storage backends (local, S3, Azure) and handles chunk metadata, slice bounds, and padding. Feedback focuses on reducing logic duplication in chunk calculation, simplifying path normalization, converting a utility method to a static method, and adhering to standard Python formatting for keyword arguments.
7390b42 to
1da5b21
Compare
901636d to
687c8fa
Compare
34ab253 to
a57fd8c
Compare
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandr.plashchinsky-h765g66h9v> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
| ... chunk_size=100, | ||
| ... ) | ||
| >>> ds.count() | ||
| 2 |
There was a problem hiding this comment.
I created this dataset and tested with it. So we don't need to skip here.
| ): | ||
| """Creates a :class:`~ray.data.Dataset` from a Zarr v2 store. | ||
|
|
||
| Each output row is one slice along axis 0 of the store's selected arrays. |
There was a problem hiding this comment.
This is a larger change from where the PR started:
The key idea here is: In any given zarr dataset, you'll have a bunch of arrays. Some of these arrays will typically be metadata and as such don't represent rows of the dataset. Others do represent rows of the dataset (example: mist dataset would have two arrays, one for images, one for labels, one image and one label belong in one row).
So when reading data from datasets that have metadata etc., you should provide array_paths to read only the arrays together that align on their 0th axis semantically.
| array_path = ( | ||
| "" | ||
| if dirpath.rstrip("/") == store_root | ||
| else dirpath.removeprefix(store_prefix) |
There was a problem hiding this comment.
Trailing slash not stripped in full-scan array path
Low Severity
In _load_metadata_full_scan, the dirpath from fs.walk is used directly with removeprefix without stripping trailing slashes. The code correctly uses dirpath.rstrip("/") for the root comparison (line 123) and for building zarray_path (line 126), but the array_path key computed via dirpath.removeprefix(store_prefix) could retain a trailing slash if the filesystem's walk yields one. This would produce keys like "sub/" that won't match the normalized (slash-stripped) paths used in filtering and user-facing array_paths.
Reviewed by Cursor Bugbot for commit 80024bb. Configure here.
4df7f5b to
dae4970
Compare
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
| per_task_row_limit=per_task_row_limit, | ||
| ) | ||
| ) | ||
| batch = [] |
There was a problem hiding this comment.
Mixed-rank array chunks in single batch breaks Arrow
Medium Severity
When materialize=True and the store has multiple arrays of different ranks (e.g., 2-D images and 1-D labels), get_read_tasks mixes chunks from all arrays into the same batch. A single batch's chunk column would then contain numpy arrays of different ranks. Ray Data converts DataFrames to Arrow internally, and Arrow's tensor extension requires uniform rank within a column — this will fail at conversion time when parallelism is low enough to group different arrays together. The alternate implementation in the PR correctly batches per-array to avoid this.
Reviewed by Cursor Bugbot for commit c309fea. Configure here.
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
There are 5 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 1f0f1fa. Configure here.
| ] | ||
| actual = sum(_deep_sizeof(v) for v in row_cells) | ||
|
|
||
| estimate = zarrv2_datasource._descriptor_row_size_bytes(array_name, meta) |
There was a problem hiding this comment.
Test references undefined module-level function _descriptor_row_size_bytes
Medium Severity
The test calls zarrv2_datasource._descriptor_row_size_bytes(array_name, meta) but this function is never defined anywhere in the codebase. The module defines related constants (_PYINT_BYTES, _PYSTR_BASE, _PYTUPLE_BASE, etc.) that are clearly intended for this function, but the implementation was never added. This test will always fail with AttributeError.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 1f0f1fa. Configure here.
| # container + the int object itself. | ||
| _INT_IN_SEQ = _PYPTR_BYTES + _PYINT_BYTES | ||
| # Cost of one (int, int) tuple held inside a list. | ||
| _PAIR_IN_LIST = _PYPTR_BYTES + _PYTUPLE_BASE + 2 * _INT_IN_SEQ |
There was a problem hiding this comment.
Unused module-level constants from missing function implementation
Low Severity
The constants _PYINT_BYTES, _PYSTR_BASE, _PYTUPLE_BASE, _PYLIST_BASE, _PYPTR_BYTES, _INT_IN_SEQ, and _PAIR_IN_LIST are defined but never referenced anywhere in the module. They appear to be remnants of a _descriptor_row_size_bytes function that was never implemented, leaving dead code that adds confusion for future readers.
Reviewed by Cursor Bugbot for commit 1f0f1fa. Configure here.


Description
This PR introduces
ray.data.read_zarrv2()and the backingZarrV2Datasourcefor reading Zarr v2 stores with Ray Data.This adds a dedicated public API and datasource implementation for Zarr v2 so users can read chunk metadata from consolidated Zarr v2 stores through the standard Ray Data read API surface.
Related issues
N/A
Additional information
This PR adds:
ray.data.read_zarrv2()as a new public Ray Data read APIZarrV2Datasourceas the datasource implementation used by the APIExample usage: