[Dataset] Improve str/repr of `Dataset` to include execution plan #31604

c21 · 2023-01-11T20:21:23Z

Signed-off-by: Cheng Su <scnju13@gmail.com>

Why are these changes needed?

This is a follow-up to #31286. We want to improve Dataset.__repr__() to give users more useful information, given that lazy execution is the default behavior.

The change is to include the execution plan (stages as a tree) in Dataset.__repr__(). Currently only the stage name is printed for each stage. We will add more information per stage/operator in the future, which is orthogonal to this PR; this PR only prints the information we already have.

Example:

>>> import ray
>>> ds = ray.data.range(10)
>>> ds = ds.map_batches(lambda x:x)
>>> ds = ds.filter(lambda x: x > 0)
>>> ds = ds.random_shuffle()
>>> ds
RandomShuffle
+- Filter
   +- MapBatches
      +- Dataset(num_blocks=10, num_rows=10, schema=<class 'int'>)
>>> ds.fully_executed()
>>> ds
Dataset(num_blocks=10, num_rows=9, schema=<class 'int'>)

The code change includes:

- Introduce ExecutionPlan.get_plan_as_string() to get the string representation above for the plan.
- Refactor two private methods inside ExecutionPlan: _get_unified_blocks_schema() and _get_num_rows_from_blocks_metadata().
- Change Dataset.__repr__ to call ExecutionPlan.get_plan_as_string() directly.
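
For readers who want to see how a tree string like the one in the example could be produced, here is a minimal sketch. It is not the code in this PR: the function name `plan_as_string` and its parameters are hypothetical, and it only assumes that the plan is a linear chain of stage names applied on top of a source dataset.

```python
from typing import List


def plan_as_string(stage_names: List[str], source_repr: str) -> str:
    """Render a linear chain of stages, newest stage first.

    Hypothetical helper for illustration; not part of Ray's API.
    `stage_names` is in the order the stages were applied, and
    `source_repr` is the string for the input dataset.
    """
    # The most recently applied stage is printed at the top,
    # and the source dataset at the bottom.
    entries = list(reversed(stage_names)) + [source_repr]
    lines = []
    for depth, entry in enumerate(entries):
        if depth == 0:
            lines.append(entry)
        else:
            # Each level is indented three more spaces than its parent.
            indent = " " * (3 * (depth - 1))
            lines.append(f"{indent}+- {entry}")
    return "\n".join(lines)


print(
    plan_as_string(
        ["MapBatches", "Filter", "RandomShuffle"],
        "Dataset(num_blocks=10, num_rows=10, schema=<class 'int'>)",
    )
)
```

With the inputs above, the sketch prints the same four lines as the example in this description.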

Related issue number

Closes #31417

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

ericl

Could we also update the rst docs?

c21 · 2023-01-11T21:04:11Z

Could we also update the rst docs?

@ericl - yeah, I plan to do that in this same PR, if reviewers have no objections to the string representation.

stephanie-wang · 2023-01-11T21:04:25Z

Just a suggestion, but I think it would be nicer to keep the plan history even after fully_executed is called, more like a "cache" call.

c21 · 2023-01-11T21:08:04Z

Just a suggestion, but I think it would be nicer to keep the plan history even after fully_executed is called, more like a "cache" call.

@stephanie-wang - I thought about it before. The only thing I am worried about is that the plan gets super long after multiple calls, assuming users only care about the latest Dataset. WDYT? @ericl, @clarkzinzow and @jianoaix.

An alternative is to add Dataset.plan() or Dataset.explain(), which prints out the full plan history.

jianoaix · 2023-01-11T21:36:26Z

Just a suggestion, but I think it would be nicer to keep the plan history even after fully_executed is called, more like a "cache" call.

@stephanie-wang - I thought about it before. The only thing I am worried about is that the plan gets super long after multiple calls, assuming users only care about the latest Dataset. WDYT? @ericl, @clarkzinzow and @jianoaix.

An alternative is to add Dataset.plan() or Dataset.explain(), which prints out the full plan history.

It sounds good to me to have a separate API to display the plan. The repr is used quite often, and I think the plan is too much detail for a simple print(ds). It seems like not a bad idea to just leave the plan out of the repr as well.
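
To make the trade-off concrete, here is a rough sketch of what keeping the repr short while exposing the plan through a separate call could look like. Everything in it is hypothetical: `ToyDataset`, `with_stage()`, and the "Read" source label are made up for illustration, and `explain()` is only the method name proposed in this thread, not an existing Ray API.

```python
from typing import Tuple


class ToyDataset:
    """Hypothetical stand-in for a dataset; not Ray's Dataset class."""

    def __init__(self, num_rows: int, stages: Tuple[str, ...] = ()):
        self.num_rows = num_rows
        self.stages = stages

    def with_stage(self, name: str) -> "ToyDataset":
        # Return a new object with one more stage recorded in its history.
        return ToyDataset(self.num_rows, self.stages + (name,))

    def __repr__(self) -> str:
        # Short summary only, so print(ds) stays compact.
        return f"ToyDataset(num_rows={self.num_rows})"

    def explain(self) -> str:
        # Full stage history, shown only when explicitly requested.
        return " -> ".join(("Read",) + self.stages)


ds = ToyDataset(10).with_stage("MapBatches").with_stage("Filter")
print(ds)
print(ds.explain())
```

The point of the split is that the frequently used repr does not grow with the number of stages, while the history remains one call away.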

c21 · 2023-01-12T00:52:14Z

Could we also update the rst docs?

@ericl - actually, given that @jianoaix is making from_items() lazy in parallel, should we update all Ray Data rst docs in one pass after both PRs are merged? That would save us time, since we would only need one pass to run all the documentation code snippets.

ericl · 2023-01-12T05:52:55Z

Hmm, for the caching thing, I think we should hide the plan if the Dataset is fully independent of the previous stages. If it still has a hidden reference, we should show those previous stages. This might matter since the serialization behavior of the two cases could be different.

ericl · 2023-01-12T05:54:10Z

I'm going to just merge this, since I think it's a reasonable first step. We can discuss further refinements on a longer timescale.

Improve str/repr of Dataset

08b2cb5

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 requested review from ericl, scv119, clarkzinzow, jjyao and jianoaix as code owners January 11, 2023 20:21

c21 assigned ericl, clarkzinzow and jianoaix Jan 11, 2023

ericl approved these changes Jan 11, 2023

ericl added the @author-action-required label Jan 11, 2023

Fix docstring of Dataset

8f88204

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 changed the title ~~[Dataset] Improve str/repr of Dataset~~ [Dataset] Improve str/repr of Dataset to include execution plan Jan 12, 2023

ericl merged commit fb00672 into ray-project:master Jan 12, 2023

c21 deleted the repr branch January 12, 2023 07:01

AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023

[Dataset] Improve str/repr of Dataset to include execution plan (#31604)

82515c0

feefs mentioned this pull request Jan 26, 2023

Change Dataset's repr to use angled brackets #31947

Closed

7 tasks
