[data] Improve str/repr of lazy Datasets #31417

Closed
Tracked by #31286
ericl opened this issue Jan 3, 2023 · 1 comment · Fixed by #31604
Labels
data (Ray Data-related issues), enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

ericl commented Jan 3, 2023

Description

Currently, a lazy Dataset's string repr shows something like this:

>>> ray.data.range(10).map(lambda x: x).filter(lambda y: y)
Dataset(num_blocks=10, num_rows=?, schema=Unknown schema)

This provides little useful information (the row count and schema are shown as unknown, and the pending operations are not shown at all) and can confuse users. We could improve it to something like this:

Dataset(num_blocks=10, <Pending execution: Map, Filter>)

Or even add a verbose form, like:

Dataset(num_blocks=10, num_rows=1000, schema=<class 'int'>)
  -> Map <function <lambda> at 0x7fb9919c8a60> (Pending execution)
  -> Filter <function <lambda> at 0x7fbbf9e45820> (Pending execution)
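
To make the two proposed forms concrete, here is a minimal sketch of a plan object that records pending stages and renders them in its repr. It is an illustration only, not Ray's implementation: `ToyDataset`, `_stages`, and `verbose_repr` are hypothetical names, and the row count and schema are left out because nothing has executed.

```python
from typing import Callable, Tuple


class ToyDataset:
    """Hypothetical stand-in for a lazy Dataset; not Ray's actual class."""

    def __init__(self, num_blocks: int,
                 stages: Tuple[Tuple[str, Callable], ...] = ()):
        self._num_blocks = num_blocks
        # Each pending stage is (name, fn); nothing runs until execution.
        self._stages = stages

    def map(self, fn: Callable) -> "ToyDataset":
        return ToyDataset(self._num_blocks, self._stages + (("Map", fn),))

    def filter(self, fn: Callable) -> "ToyDataset":
        return ToyDataset(self._num_blocks, self._stages + (("Filter", fn),))

    def __repr__(self) -> str:
        # Compact form: list only the names of the pending stages.
        if not self._stages:
            return f"Dataset(num_blocks={self._num_blocks})"
        names = ", ".join(name for name, _ in self._stages)
        return (f"Dataset(num_blocks={self._num_blocks}, "
                f"<Pending execution: {names}>)")

    def verbose_repr(self) -> str:
        # Verbose form: one line per pending stage, including the function.
        lines = [f"Dataset(num_blocks={self._num_blocks})"]
        for name, fn in self._stages:
            lines.append(f"  -> {name} {fn!r} (Pending execution)")
        return "\n".join(lines)


ds = ToyDataset(num_blocks=10).map(lambda x: x).filter(lambda y: y)
print(repr(ds))
print(ds.verbose_repr())
```

Keeping the compact form in `__repr__` and the verbose form behind a separate method is one possible design choice; the issue does not specify how the verbose form would be requested.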

We should implement this prior to making lazy execution the default.

Use case

No response

ericl added the enhancement, P1, and data labels Jan 3, 2023
c21 self-assigned this Jan 3, 2023

c21 commented Jan 3, 2023

Agreed on making str/repr clearer. Printing out the execution plan sounds good to me.

ericl pushed a commit that referenced this issue Jan 6, 2023
This PR enables lazy execution by default. See ray-project/enhancements#19 for motivation. The change includes:
* Change the `Dataset` constructor to `Dataset.__init__(lazy: bool = True)`. Also remove the `defer_execution` field, as it's no longer needed.
* `read_api.py:read_datasource()` returns a lazy `Dataset`, while still eagerly computing the first input block.
* Add `ds.fully_executed()` calls to the unit tests that require them, so that they keep passing.
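
Why some tests need an explicit `ds.fully_executed()` call can be illustrated with a toy model: under lazy execution, transformations are only recorded, so any side effect a test asserts on does not happen until execution is forced. The sketch below is an assumption-laden illustration, not Ray's code; `ToyDataset`, `_pending`, and `record` are hypothetical, and only the `lazy` flag and the `fully_executed()` name mirror the ones mentioned above.

```python
from typing import Callable, List


class ToyDataset:
    """Hypothetical model of lazy vs. eager execution; not Ray's Dataset."""

    def __init__(self, items: List[int], lazy: bool = True):
        self._items = list(items)
        self._lazy = lazy
        self._pending: List[Callable[[int], int]] = []

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        out = ToyDataset(self._items, lazy=self._lazy)
        out._pending = self._pending + [fn]
        if not out._lazy:
            # Eager mode: run the transformation immediately.
            out.fully_executed()
        return out

    def fully_executed(self) -> "ToyDataset":
        # Force every recorded transformation to run, in order.
        for fn in self._pending:
            self._items = [fn(x) for x in self._items]
        self._pending = []
        return self


calls: List[int] = []


def record(x: int) -> int:
    calls.append(x)  # side effect a test might assert on
    return x + 1


ds = ToyDataset([1, 2, 3]).map(record)
calls_before = list(calls)  # snapshot taken before forcing execution
ds.fully_executed()
calls_after = list(calls)   # snapshot taken after forcing execution
```

In this model, a test that inspects `calls` right after `map()` would pass under eager execution and fail under lazy execution unless it first forces execution.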

TODO:
- [x] Fix all unit tests
- [x] #31459
- [x] #31460 
- [ ] Remove the behavior to eagerly compute first block for read
- [ ] #31417
- [ ] Update documentation
AmeerHajAli pushed a commit that referenced this issue Jan 12, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this issue Jan 25, 2023
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>