[Data][2/n][Dsv2] Add DataSourceV2 core API, scanner/reader framework, and optimizer mixins #61615

Merged
goutamvenkat-anyscale merged 15 commits into ray-project:master from goutamvenkat-anyscale:datasource-v2/scanner-reader-api
Mar 21, 2026

Conversation

@goutamvenkat-anyscale
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale commented Mar 10, 2026

Description

  • DataSourceV2 (datasource_v2.py) -- Top-level entry point that ties together file indexing, schema inference, size estimation, and scanner creation. Datasources declare their category (file-based, database, data lake, etc.) to enable category-specific optimizations.
  • Scanner / FileScanner (scanners/) -- Immutable, clonable abstraction representing a configured read plan. Scanner is responsible for partitioning input into parallel work units (plan()) and creating Reader instances. FileScanner provides a default plan() that distributes files evenly across a target parallelism.
  • Reader / FileReader (readers/) -- Worker-side execution: receives an InputBucket (e.g., FileManifest) and yields Arrow tables. FileReader is wired for column pruning, filter pushdown, and limit support via PyArrow's Dataset API.
  • InMemorySizeEstimator / SamplingInMemorySizeEstimator (readers/in_memory_size_estimator.py) -- Estimates in-memory data sizes to inform partitioning. The sampling variant reads a single file to estimate the encoding ratio (on-disk vs. in-memory) and applies it to the rest, with a 1:1 fallback for multi-batch files.
  • Logical optimizer mixins (logical_optimizers.py) -- Declarative capability interfaces (SupportsFilterPushdown, SupportsColumnPruning, SupportsLimitPushdown, SupportsPartitionPruning) that scanners can implement to advertise which optimizations they support. The execution engine can introspect these to apply pushdowns automatically.
```
DataSourceV2
  ├── FileIndexer     (discovery: what files exist?)
  ├── Scanner         (planning: how to partition & what pushdowns?)
  │     └── Reader    (execution: read data from a partition)
  └── SizeEstimator   (cost model: how big is the data in memory?)
```
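As a rough illustration of how these pieces fit together, here is a minimal sketch of the abstractions described above. The method names and signatures (`plan`, `create_reader`, `read`, `prune_columns`) are assumptions for illustration, not the exact Ray Data API; `pa.Table` appears only as a string forward reference so the sketch runs without pyarrow installed.

```python
from abc import ABC, abstractmethod
from typing import Generic, Iterator, List, TypeVar

InputBucket = TypeVar("InputBucket")


class SupportsColumnPruning(ABC):
    """Capability mixin: a scanner implementing this advertises that
    the engine may push a column projection down into the read."""

    @abstractmethod
    def prune_columns(self, columns: List[str]) -> "Scanner":
        ...


class Reader(ABC, Generic[InputBucket]):
    """Worker-side execution: turn one input bucket into Arrow tables."""

    @abstractmethod
    def read(self, bucket: InputBucket) -> Iterator["pa.Table"]:
        ...


class Scanner(ABC, Generic[InputBucket]):
    """Immutable read plan: partitions input and creates readers."""

    @abstractmethod
    def plan(self, target_parallelism: int) -> List[InputBucket]:
        ...

    @abstractmethod
    def create_reader(self) -> Reader[InputBucket]:
        ...


class DataSourceV2(ABC, Generic[InputBucket]):
    """Top-level entry point tying together indexing, schema inference,
    size estimation, and scanner creation."""

    @abstractmethod
    def create_scanner(self) -> Scanner[InputBucket]:
        ...
```

A concrete datasource would subclass `DataSourceV2`, and the engine could introspect its scanner with `isinstance(scanner, SupportsColumnPruning)` before applying a pushdown.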

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner March 10, 2026 06:33
@goutamvenkat-anyscale goutamvenkat-anyscale added data Ray Data-related issues data:datasources labels Mar 10, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the core API for DataSourceV2, including the DataSourceV2, Scanner, and Reader abstractions, along with mixins for optimizer pushdowns. This is a foundational change for unifying data source access in Ray Data. The new API is well-structured and modular.

My review focuses on ensuring the new abstractions are sound and maintainable. I've identified a few issues related to code duplication and circular dependencies that should be resolved to improve the design. I also found a critical bug in SamplingInMemorySizeEstimator where it calls a non-existent method, and an incorrect type hint. Please see the detailed comments for suggestions.

@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the datasource-v2/scanner-reader-api branch from 7af78ef to 403bd79 Compare March 16, 2026 08:05
```
# reading the file. So, we only estimate the encoding ratio if we don't
# already have one.
self._encoding_ratio = self._estimate_encoding_ratio(path, file_size)
break
```

Loop iterates all files needlessly when ratio exists

Low Severity

When self._encoding_ratio is already set from a previous call, the for loop iterates through every file in the manifest doing nothing, because the break on line 59 is inside the if self._encoding_ratio is None block. The intent is to skip estimation entirely when the ratio is known, but the loop only terminates early when the condition is true. The guard or the break needs to be restructured so the loop is skipped when the ratio is already computed.
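A minimal sketch of the restructuring this comment suggests, using a stand-in class rather than the actual Ray estimator (the names `_encoding_ratio` and `_estimate_encoding_ratio` mirror the snippet above): hoisting the guard out of the loop means a repeat call returns immediately instead of walking the manifest as a no-op.

```python
class EncodingRatioEstimator:
    """Stand-in for the sampling estimator; not the Ray Data class."""

    def __init__(self):
        self._encoding_ratio = None
        self.estimate_calls = 0

    def _estimate_encoding_ratio(self, path, file_size):
        self.estimate_calls += 1
        return 2.0  # pretend in-memory size is 2x on-disk size

    def _ensure_encoding_ratio(self, manifest):
        # Guard hoisted out of the loop: when the ratio is already
        # known from a previous call, skip the manifest entirely
        # instead of iterating every file and doing nothing.
        if self._encoding_ratio is not None:
            return
        for path, file_size in manifest:
            self._encoding_ratio = self._estimate_encoding_ratio(path, file_size)
            break
```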


```
# reading the file. So, we only estimate the encoding ratio if we don't
# already have one.
self._encoding_ratio = self._estimate_encoding_ratio(path, file_size)
break
```

Unconditional break skips valid files for ratio estimation

Medium Severity

The break on line 59 is unconditional — it exits the loop after trying only the first file, even when _estimate_encoding_ratio returns None (e.g., file has zero size or produces no data). If the first file can't yield a valid ratio but subsequent files could, the estimator falls back to a 1:1 ratio unnecessarily. The break needs to be conditional on self._encoding_ratio is not None.
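The conditional break can be sketched as follows, again with a stand-in estimator rather than the real Ray code: the loop falls through to the next file when the current one cannot yield a ratio, and only a successful estimate stops it, with the 1:1 fallback applied when no file works.

```python
class SamplingEstimator:
    """Stand-in illustrating the conditional-break fix; not the
    actual Ray Data class."""

    def __init__(self):
        self._encoding_ratio = None

    def _estimate_encoding_ratio(self, path, file_size):
        # Zero-size files cannot yield a ratio, so report None.
        if file_size == 0:
            return None
        return 1.5

    def _sample_ratio(self, manifest):
        for path, file_size in manifest:
            ratio = self._estimate_encoding_ratio(path, file_size)
            if ratio is not None:
                self._encoding_ratio = ratio
                break  # stop only once a file yielded a usable ratio
        else:
            self._encoding_ratio = 1.0  # 1:1 fallback when no file works
```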


@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the datasource-v2/scanner-reader-api branch from e6d421d to c0e8101 Compare March 16, 2026 18:52
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Mar 17, 2026
@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the datasource-v2/scanner-reader-api branch from c0e8101 to 410ae0d Compare March 17, 2026 22:49


```
@DeveloperAPI
class DataSourceV2(ABC, Generic[InputBucket]):
```
Contributor


Where will it be used?

Contributor Author


It will be used in subsequent PRs, when I introduce it for Parquet.

```
...

class SamplingInMemorySizeEstimator(InMemorySizeEstimator):
```
Contributor


This is just a copy, no changes, right?

Contributor Author


The FileManifest construction and the read_files signature are different, but the rest is the same.

@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the datasource-v2/scanner-reader-api branch from 410ae0d to bb31ddb Compare March 18, 2026 21:00
@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the datasource-v2/scanner-reader-api branch 2 times, most recently from 37fa501 to 42dfa66 Compare March 19, 2026 20:34
…izer mixins

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the datasource-v2/scanner-reader-api branch from 42dfa66 to a67b865 Compare March 20, 2026 22:55
@goutamvenkat-anyscale goutamvenkat-anyscale enabled auto-merge (squash) March 20, 2026 22:56

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


```
Returns:
    Iterator[pa.Table]: Iterator of PyArrow Tables containing the read data.
"""
...
```

FileReader is concrete but read() returns None

Medium Severity

FileReader overrides Reader's @abstractmethod read() with a body of ..., which returns None. This makes FileReader a concrete, instantiable class whose read() silently violates the Iterator[pa.Table] return contract. SamplingInMemorySizeEstimator calls self._reader.read(manifest) and passes the result to next(), which will raise TypeError if the reader is a plain FileReader rather than a subclass with a real implementation. Marking FileReader as abstract or adding @abstractmethod to read() would prevent this.
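The suggested fix can be sketched like this (simplified signatures, not the exact Ray classes; `ParquetReader` is a hypothetical subclass for illustration): re-declaring `read()` with `@abstractmethod` keeps `FileReader` non-instantiable even though it can still add concrete file-handling helpers.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class Reader(ABC):
    @abstractmethod
    def read(self, bucket) -> Iterator["pa.Table"]:
        ...


class FileReader(Reader):
    # Re-declaring read() as abstract (instead of giving it a
    # concrete `...` body) keeps FileReader non-instantiable, so a
    # plain FileReader can never be handed to code that expects an
    # iterator of tables back.
    @abstractmethod
    def read(self, bucket) -> Iterator["pa.Table"]:
        ...


class ParquetReader(FileReader):
    """Hypothetical concrete subclass with a real implementation."""

    def read(self, bucket) -> Iterator["pa.Table"]:
        yield from bucket  # placeholder for actual file reads
```

With this change, `FileReader()` raises `TypeError` at construction time instead of silently returning `None` from `read()` at runtime.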


Signed-off-by: Goutam <goutam@anyscale.com>
@github-actions github-actions bot disabled auto-merge March 21, 2026 00:32
@goutamvenkat-anyscale goutamvenkat-anyscale enabled auto-merge (squash) March 21, 2026 00:32
@github-actions github-actions bot disabled auto-merge March 21, 2026 00:33
@goutamvenkat-anyscale goutamvenkat-anyscale merged commit ca430cf into ray-project:master Mar 21, 2026
5 checks passed
pedrojeronim0 pushed a commit to pedrojeronim0/ray that referenced this pull request Mar 23, 2026
…, and optimizer mixins (ray-project#61615)

## Description
- DataSourceV2 (datasource_v2.py) -- Top-level entry point that ties
together file indexing, schema inference, size estimation, and scanner
creation. Datasources declare their category (file-based, database, data
lake, etc.) to enable category-specific optimizations.
- Scanner / FileScanner (scanners/) -- Immutable, clonable abstraction
representing a configured read plan. Scanner is responsible for
partitioning input into parallel work units (plan()) and creating Reader
instances. FileScanner provides a default plan() that distributes files
evenly across a target parallelism.
- Reader / FileReader (readers/) -- Worker-side execution: receives an
InputBucket (e.g., FileManifest) and yields Arrow tables. FileReader is
wired for column pruning, filter pushdown, and limit support via
PyArrow's Dataset API.
- InMemorySizeEstimator / SamplingInMemorySizeEstimator
(readers/in_memory_size_estimator.py) -- Estimates in-memory data sizes
to inform partitioning. The sampling variant reads a single file to
estimate the encoding ratio (on-disk vs. in-memory) and applies it to
the rest, with a 1:1 fallback for multi-batch files.
- Logical optimizer mixins (logical_optimizers.py) -- Declarative
capability interfaces (SupportsFilterPushdown, SupportsColumnPruning,
SupportsLimitPushdown, SupportsPartitionPruning) that scanners can
implement to advertise which optimizations they support. The execution
engine can introspect these to apply pushdowns automatically.

```
DataSourceV2
  ├── FileIndexer     (discovery: what files exist?)
  ├── Scanner         (planning: how to partition & what pushdowns?)
  │     └── Reader    (execution: read data from a partition)
  └── SizeEstimator   (cost model: how big is the data in memory?)
```

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Pedro Jeronimo <pedro.jeronimo@tecnico.ulisboa.pt>
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the datasource-v2/scanner-reader-api branch March 23, 2026 22:56
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 25, 2026
…, and optimizer mixins (ray-project#61615)

## Description
- DataSourceV2 (datasource_v2.py) -- Top-level entry point that ties
together file indexing, schema inference, size estimation, and scanner
creation. Datasources declare their category (file-based, database, data
lake, etc.) to enable category-specific optimizations.
- Scanner / FileScanner (scanners/) -- Immutable, clonable abstraction
representing a configured read plan. Scanner is responsible for
partitioning input into parallel work units (plan()) and creating Reader
instances. FileScanner provides a default plan() that distributes files
evenly across a target parallelism.
- Reader / FileReader (readers/) -- Worker-side execution: receives an
InputBucket (e.g., FileManifest) and yields Arrow tables. FileReader is
wired for column pruning, filter pushdown, and limit support via
PyArrow's Dataset API.
- InMemorySizeEstimator / SamplingInMemorySizeEstimator
(readers/in_memory_size_estimator.py) -- Estimates in-memory data sizes
to inform partitioning. The sampling variant reads a single file to
estimate the encoding ratio (on-disk vs. in-memory) and applies it to
the rest, with a 1:1 fallback for multi-batch files.
- Logical optimizer mixins (logical_optimizers.py) -- Declarative
capability interfaces (SupportsFilterPushdown, SupportsColumnPruning,
SupportsLimitPushdown, SupportsPartitionPruning) that scanners can
implement to advertise which optimizations they support. The execution
engine can introspect these to apply pushdowns automatically.


```
DataSourceV2
  ├── FileIndexer     (discovery: what files exist?)
  ├── Scanner         (planning: how to partition & what pushdowns?)
  │     └── Reader    (execution: read data from a partition)
  └── SizeEstimator   (cost model: how big is the data in memory?)
```

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>

Labels

data:datasources data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

2 participants