[data] Improve str/repr of lazy Datasets #31417

ericl · 2023-01-03T22:43:02Z

Description

Currently, a lazy Dataset's string repr shows something like this:

>>> ray.data.range(10).map(lambda x: x).filter(lambda y: y)
Dataset(num_blocks=10, num_rows=?, schema=Unknown schema)

This provides little useful information and can confuse users, since the row count and schema are simply reported as unknown. We could improve this to something like:

Dataset(num_blocks=10, <Pending execution: Map, Filter>)

Or even add a verbose form, like:

Dataset(num_blocks=10, num_rows=1000, schema=<class 'int'>)
  -> Map <function <lambda> at 0x7fb9919c8a60> (Pending execution)
  -> Filter <function <lambda> at 0x7fbbf9e45820> (Pending execution)

We should implement this prior to making lazy execution the default.
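To make the two proposed formats concrete, here is a minimal sketch. `ToyLazyDataset`, its `verbose()` method, and the way pending stages are stored are all hypothetical names made up for illustration; this is not how Ray's `Dataset` is implemented, only one way the compact and verbose strings above could be produced.

```python
class ToyLazyDataset:
    """Hypothetical stand-in for a lazy dataset (NOT Ray's Dataset class)."""

    def __init__(self, num_blocks, num_rows, schema, pending=()):
        self.num_blocks = num_blocks
        self.num_rows = num_rows      # metadata of the already-known input
        self.schema = schema
        self.pending = tuple(pending)  # (stage name, user function) pairs

    def map(self, fn):
        return self._with_stage("Map", fn)

    def filter(self, fn):
        return self._with_stage("Filter", fn)

    def _with_stage(self, name, fn):
        # Record the stage instead of running it; nothing executes here.
        return ToyLazyDataset(
            self.num_blocks, self.num_rows, self.schema,
            self.pending + ((name, fn),),
        )

    def _base(self):
        return (f"Dataset(num_blocks={self.num_blocks}, "
                f"num_rows={self.num_rows}, schema={self.schema})")

    def __repr__(self):
        # Compact form: list the names of the pending stages.
        if not self.pending:
            return self._base()
        names = ", ".join(name for name, _ in self.pending)
        return f"Dataset(num_blocks={self.num_blocks}, <Pending execution: {names}>)"

    def verbose(self):
        # Verbose form: input metadata, then one line per pending stage.
        lines = [self._base()]
        for name, fn in self.pending:
            lines.append(f"  -> {name} {fn!r} (Pending execution)")
        return "\n".join(lines)


ds = ToyLazyDataset(num_blocks=10, num_rows=10, schema=int)
ds = ds.map(lambda x: x).filter(lambda y: y)
print(repr(ds))      # Dataset(num_blocks=10, <Pending execution: Map, Filter>)
print(ds.verbose())
```

The compact form keeps `repr` on a single line, while the verbose form trades brevity for showing which user function each pending stage would run.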

Use case

No response


c21 · 2023-01-03T23:42:00Z

Agreed, we should make str/repr clearer. Printing out the execution plan sounds good to me.

This PR is to enable lazy execution by default. See ray-project/enhancements#19 for motivation. The change includes:

* Change `Dataset` constructor: `Dataset.__init__(lazy: bool = True)`. Also remove the `defer_execution` field, as it's no longer needed.
* `read_api.py:read_datasource()` returns a lazy `Dataset` while still computing the first input block.
* Add `ds.fully_executed()` calls to the unit tests that require them, to make sure they pass.

TODO:

- [x] Fix all unit tests
- [x] #31459
- [x] #31460
- [ ] Remove the behavior to eagerly compute first block for read
- [ ] #31417
- [ ] Update documentation
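As a rough illustration of why tests need an explicit `fully_executed()` call once lazy execution is the default, here is a toy sketch. `ToyDataset` and its internals are hypothetical and do not reflect Ray's implementation; only the `lazy` flag and the `fully_executed()` name are borrowed from the description above.

```python
class ToyDataset:
    """Hypothetical toy class (NOT Ray's Dataset) showing lazy vs. eager."""

    def __init__(self, rows, lazy=True):
        self._rows = list(rows)
        self._pending = []   # functions recorded but not yet applied
        self._lazy = lazy

    def map(self, fn):
        out = ToyDataset(self._rows, lazy=self._lazy)
        out._pending = self._pending + [fn]
        if not out._lazy:
            # Eager mode: run the stage as soon as it is added.
            out.fully_executed()
        return out

    def fully_executed(self):
        # Force every pending stage to run, then clear the plan.
        for fn in self._pending:
            self._rows = [fn(row) for row in self._rows]
        self._pending = []
        return self

    def rows(self):
        return list(self._rows)


calls = []

def record_and_double(x):
    calls.append(x)   # side effect lets us observe when execution happens
    return x * 2

ds = ToyDataset(range(3)).map(record_and_double)
assert calls == []            # lazy: nothing has run yet
ds.fully_executed()
assert calls == [0, 1, 2]     # the stage ran only when forced
assert ds.rows() == [0, 2, 4]
```

A test that asserts on side effects or materialized results right after `map` would pass in eager mode but fail in lazy mode unless it forces execution first.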


ericl added the enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks), and data (Ray Data-related issues) labels Jan 3, 2023

c21 self-assigned this Jan 3, 2023

c21 mentioned this issue Jan 5, 2023

[Datasets] Enable lazy execution by default #31286

Merged

13 tasks

c21 mentioned this issue Jan 11, 2023

[Dataset] Improve str/repr of Dataset to include execution plan #31604

Merged

7 tasks

ericl closed this as completed in #31604 Jan 12, 2023
