
[Data][2/N] Move schema and meta_count to dataset#61349

Merged
bveeramani merged 14 commits into ray-project:master from kyuds:schema-meta-move
Mar 31, 2026

Conversation

@kyuds
Member

@kyuds kyuds commented Feb 26, 2026

Description

Progress for removing ExecutionPlan.

Moved schema and meta_count to the Dataset class. Getting the schema is now separated into two parts: _base_schema and schema. The motivation is that certain operations need the underlying schema class, while the public API returns the Ray-wrapped schema.
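For readers skimming the thread, the split between _base_schema and schema can be sketched roughly as below. This is a toy illustration and not the code from this PR: RawSchema, WrappedSchema, ToyDataset, the compute callable, and the executions counter are all hypothetical stand-ins, and the default value of fetch_if_missing on _base_schema is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RawSchema:
    """Hypothetical stand-in for the underlying schema class."""
    names: List[str]
    types: List[str]


class WrappedSchema:
    """Hypothetical stand-in for the wrapped schema the public API returns."""

    def __init__(self, base_schema: RawSchema) -> None:
        self.base_schema = base_schema

    @property
    def names(self) -> List[str]:
        return list(self.base_schema.names)


class ToyDataset:
    """Toy dataset; only the schema-related pieces are modeled."""

    def __init__(self, cached: Optional[RawSchema],
                 compute: Callable[[], RawSchema]) -> None:
        self._cached = cached
        self._compute = compute  # pretend this runs the pipeline on a few rows
        self.executions = 0      # counts how often we had to "execute"

    def _base_schema(self, fetch_if_missing: bool = False) -> Optional[RawSchema]:
        # Internal accessor: returns the raw schema for callers that need it.
        if self._cached is None and fetch_if_missing:
            self.executions += 1
            self._cached = self._compute()
        return self._cached

    def schema(self, fetch_if_missing: bool = True) -> Optional[WrappedSchema]:
        # Public accessor: wraps whatever the internal accessor returns.
        base = self._base_schema(fetch_if_missing=fetch_if_missing)
        return None if base is None else WrappedSchema(base)
```

In this sketch, internal callers such as repr helpers would use _base_schema(fetch_if_missing=False) to avoid triggering execution, while user code calls schema() and receives the wrapper.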

Currently there is a two-way binding between Dataset and ExecutionPlan due to repr operations. These will move to the Dataset in subsequent PRs.
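The two-way binding mentioned above might look roughly like the following. Again, this is a hypothetical sketch under stated assumptions: ToyPlan, ToyDataset, link_dataset, and describe are invented names, and only _base_schema and _meta_count correspond to method names that appear in this PR.

```python
from typing import List, Optional


class ToyPlan:
    """Hypothetical stand-in for ExecutionPlan; not Ray's real class."""

    def __init__(self) -> None:
        # Back-reference to the dataset: this is the second half of the binding.
        self._dataset: Optional["ToyDataset"] = None

    def link_dataset(self, ds: "ToyDataset") -> None:
        self._dataset = ds

    def describe(self) -> str:
        # The repr logic still lives on the plan, so it reaches back into
        # the dataset for the schema and the metadata row count.
        ds = self._dataset
        if ds is None:
            return "Dataset(unlinked)"
        schema = ds._base_schema(fetch_if_missing=False)
        count = ds._meta_count()
        schema_str = "Unknown schema" if schema is None else ", ".join(schema)
        count_str = "?" if count is None else str(count)
        return f"Dataset(num_rows={count_str}, schema={{{schema_str}}})"


class ToyDataset:
    """Toy dataset that owns a plan and is referenced back by it."""

    def __init__(self, plan: ToyPlan, schema: Optional[List[str]],
                 num_rows: Optional[int]) -> None:
        self._plan = plan          # dataset -> plan
        self._schema = schema
        self._num_rows = num_rows
        plan.link_dataset(self)    # plan -> dataset

    def _base_schema(self, fetch_if_missing: bool = False) -> Optional[List[str]]:
        return self._schema

    def _meta_count(self) -> Optional[int]:
        return self._num_rows

    def __repr__(self) -> str:
        return self._plan.describe()
```

Once the repr helpers live on the dataset itself, the back-reference and link_dataset in this sketch would no longer be needed.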

Related issues

#60358

Additional information

N/A

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@kyuds kyuds requested review from a team as code owners February 26, 2026 08:16
@kyuds kyuds added the go add ONLY when ready to merge, run all tests label Feb 26, 2026
@kyuds kyuds changed the title [Data][2/N] Move schema and meta_count to dataset [Data][2/N] Move schema and meta_count to dataset Feb 26, 2026
@kyuds kyuds requested a review from bveeramani February 26, 2026 08:18
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a good step towards refactoring ExecutionPlan by moving the schema and meta_count methods to the Dataset class. The changes are well-contained and correctly update all relevant call sites. The new implementations in Dataset preserve the original logic while making the code cleaner, for example by using self.limit(1) in the schema method. The modifications across plan.py, dataset.py, dataset_repr.py, and base_trainer.py are consistent with this goal. Overall, this is a solid refactoring that improves code organization.

Comment thread python/ray/data/_internal/plan.py Outdated
Comment thread python/ray/data/dataset.py Outdated
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Comment thread python/ray/data/_internal/plan.py Outdated
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Comment thread python/ray/data/_internal/plan.py Outdated
Comment thread python/ray/data/dataset.py Outdated
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@ray-gardener ray-gardener bot added the community-contribution Contributed by the community label Feb 26, 2026
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 13, 2026
@kyuds kyuds removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 17, 2026
Comment thread python/ray/data/_internal/plan.py Outdated
Comment on lines +210 to +212
      schema = ds._base_schema(fetch_if_missing=False)
      if count is None:
-         count = plan.meta_count()
+         count = ds._meta_count()
Member


This temporary bidirectional coupling between Dataset and ExecutionPlan doesn't seem ideal. Are we planning to move get_plan_as_string, initial_num_blocks, and input_files to Dataset soon?

Member Author


yes this will go away soon

Comment thread python/ray/data/dataset.py Outdated
Comment thread python/ray/data/dataset.py
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Comment thread python/ray/data/dataset.py Outdated
kyuds added 2 commits March 28, 2026 15:13
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Comment thread python/ray/data/_internal/plan.py Outdated
kyuds added 2 commits March 28, 2026 23:36
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@kyuds kyuds requested a review from bveeramani March 29, 2026 06:42

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread python/ray/data/_internal/plan.py Outdated
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@kyuds
Member Author

kyuds commented Mar 30, 2026

The Buildkite failure is a pip "dependency resolution is too deep" error. This PR doesn't change any setup.py or requirements.txt files, so I'll wait a bit, merge the latest master, and then retry CI.

@kyuds kyuds added the data Ray Data-related issues label Mar 30, 2026
@kyuds
Member Author

kyuds commented Mar 31, 2026

waiting on #62208 to be merged

@bveeramani bveeramani merged commit 6dadfdf into ray-project:master Mar 31, 2026
6 checks passed
@kyuds kyuds deleted the schema-meta-move branch March 31, 2026 23:07
mancfactor pushed a commit to mancfactor/ray that referenced this pull request Apr 2, 2026
)

## Description
Progress for removing `ExecutionPlan`.

Moved `schema` and `meta_count` to the `Dataset` class. Getting the
schema is now separated into two parts: `_base_schema` and `schema`.
The motivation is that certain operations need the underlying schema
class, while the public API returns the Ray-wrapped schema.

Currently there is a two-way binding between Dataset and ExecutionPlan
due to repr operations. These will move to the Dataset in subsequent
PRs.

## Related issues
ray-project#60358

## Additional information
N/A

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Frank Mancina <fmancina@haproxy.com>

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests


3 participants