[Data] Enable read-only Datasets to be executed on new execution backend #41466

scottjlee · 2023-11-28T23:11:42Z

Why are these changes needed?

- Enable read-only datasets to run on the new execution backend. We achieve this by executing the InputDataBuffer in isolation to fetch ReadTasks and any known BlockMetadata from the input Datasource or Reader. This logic is in execute_read_only_to_legacy_lazy_block_list().
- By default, use the new execution backend for all Datasets unless otherwise configured by the user.
- ~~Copy input ops when creating a new LogicalOperator: [Data] Copy input LogicalOperators to avoid mutating their output dependencies #41468~~ No longer needed, see the linked PR for more details.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen

LGTM.
By the way, this PR reminds me of an issue I saw the other day. For read-only datasets, ds.count() should get the count from the metadata. However, ds.count() currently triggers execution of the entire dataset. I'm not sure whether this is still an issue after this PR. Could you verify that when you have a chance? (Not blocking this PR.)
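To make the expected behavior concrete, here is a minimal sketch of a metadata-first count. It is not Ray's implementation: `BlockMeta`, `count_rows`, and `execute_and_count` are hypothetical names, and the only assumption is that each block's metadata may or may not carry a row count.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BlockMeta:
    """Hypothetical stand-in for per-block metadata; num_rows may be unknown."""
    num_rows: Optional[int]


def count_rows(metas: List[BlockMeta], execute_and_count: Callable[[], int]) -> int:
    """Return the row count from metadata when possible, otherwise execute."""
    if metas and all(m.num_rows is not None for m in metas):
        # Every block reports its row count, so no read needs to run.
        return sum(m.num_rows for m in metas)
    # At least one row count is unknown: fall back to running the dataset.
    return execute_and_count()
```

With this shape, a call such as `count_rows([BlockMeta(10), BlockMeta(20)], run)` never invokes `run`, which is the property the comment above asks to verify.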

scottjlee · 2023-12-02T01:24:38Z

@raulchen I used the parquet_metadata_resolution release test to debug the case you mentioned and found that the previous approach causes spilling and, eventually, OOM. Instead of executing the plan containing only InputDataBuffer, I tried a new approach: directly using the Read logical op's _datasource_or_legacy_reader.get_read_tasks() to construct the output LazyBlockList (logic in get_legacy_lazy_block_list_read_only()). Any concerns with this approach overall?

A test run of parquet_metadata_resolution took 37.18 seconds, which is consistent with the current runtimes. I also checked the runtime and throughput of torch_batch_inference_1_gpu_10gb_raw; both are consistent with current results.
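The idea in this comment, building the lazy block list from read tasks rather than from an executed plan, can be sketched roughly as follows. Everything here is a hypothetical stand-in (`MetaSketch`, `ReadTaskSketch`, `DatasourceSketch`, `LazyBlockListSketch`); only the method name `get_read_tasks` is borrowed from the comment, and the real Ray classes carry much more state.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class MetaSketch:
    """Hypothetical stand-in for block metadata known before reading."""
    num_rows: Optional[int]


@dataclass
class ReadTaskSketch:
    """Hypothetical stand-in for a read task: a deferred read plus metadata."""
    read_fn: Callable[[], List[int]]
    metadata: MetaSketch


class DatasourceSketch:
    """Hypothetical datasource that splits range(n) into read tasks."""

    def __init__(self, n: int):
        self.n = n
        self.reads_executed = 0  # counts how many read functions actually ran

    def get_read_tasks(self, parallelism: int) -> List[ReadTaskSketch]:
        step = max(1, -(-self.n // parallelism))  # ceiling division
        tasks = []
        for start in range(0, self.n, step):
            end = min(start + step, self.n)

            # Bind start/end as defaults so each task reads its own slice.
            def read(start: int = start, end: int = end) -> List[int]:
                self.reads_executed += 1
                return list(range(start, end))

            tasks.append(ReadTaskSketch(read, MetaSketch(num_rows=end - start)))
        return tasks


class LazyBlockListSketch:
    """Holds read tasks and their metadata; reads run only on iteration."""

    def __init__(self, tasks: List[ReadTaskSketch]):
        self._tasks = tasks

    def metadata(self) -> List[MetaSketch]:
        return [t.metadata for t in self._tasks]

    def iter_blocks(self) -> Iterator[List[int]]:
        for t in self._tasks:
            yield t.read_fn()


source = DatasourceSketch(10)
lazy = LazyBlockListSketch(source.get_read_tasks(parallelism=4))
# Metadata is available without running any read function.
rows_known = sum(m.num_rows for m in lazy.metadata())
```

In this sketch, asking for metadata leaves `source.reads_executed` at zero; the reads only run once `iter_blocks()` is consumed. That separation is what lets metadata-only operations avoid materializing data.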

- After #41466, all Datasets are executed using the new streaming executor backend. There is an edge case that was not caught related to MaterializedDatasets, where executing the already materialized dataset caused extraneous (empty) metrics to be registered from these newly created datasets. This PR covers this edge case by skipping execution in the case where the Dataset's logical plan consists of only an `InputData` operator.
- Also enables Data dashboard tests to run for any data-related change, not just for dashboard related changes.

Signed-off-by: Scott Lee <sjl@anyscale.com>
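The skip described in this commit message (no execution when the logical plan is a lone `InputData` operator) might look roughly like the sketch below. `OpSketch`, `is_input_data_only`, and `get_blocks` are hypothetical names for illustration, not the functions added in the linked PR.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OpSketch:
    """Hypothetical stand-in for a logical operator in a plan DAG."""
    name: str
    inputs: List["OpSketch"] = field(default_factory=list)


def is_input_data_only(dag: OpSketch) -> bool:
    """True when the plan is a single InputData operator with no upstream ops."""
    return dag.name == "InputData" and not dag.inputs


def get_blocks(
    dag: OpSketch,
    cached_blocks: List[str],
    execute: Callable[[OpSketch], List[str]],
) -> List[str]:
    """Reuse already-materialized blocks instead of re-running the executor."""
    if is_input_data_only(dag):
        # Nothing to compute: skipping execution also avoids registering
        # empty metrics for a dataset that does no work.
        return cached_blocks
    return execute(dag)
```

Any plan with at least one operator downstream of `InputData` still goes through `execute`, so only the already-materialized case is short-circuited.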

…41634) #41466 enables Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in `test_client_data_get` test is now executed through the streaming executor, which is known to have many incompatibilities since Ray 2.7. So, we skip the test which checks compatibility between Ray Client and Ray Data, until we have a future Ray Client implementation which can better support Ray Data usage. Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee added 7 commits November 28, 2023 15:11

wip

54d193e

Signed-off-by: Scott Lee <sjl@anyscale.com>

execute to legacy lbl

7ce0460

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 1127-md-only

8f1ec42

Signed-off-by: Scott Lee <sjl@anyscale.com>

pass remote args

5191ca7

Signed-off-by: Scott Lee <sjl@anyscale.com>

use task metadata

5faabf4

Signed-off-by: Scott Lee <sjl@anyscale.com>

update input blocks for force read case

1382b71

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 1127-md-only

ac98aa0

Signed-off-by: Scott Lee <sjl@anyscale.com>

Zandew mentioned this pull request Nov 30, 2023

[data] update dataset.num_blocks for stage deprecation #41544

Merged

8 tasks

scottjlee added 2 commits November 30, 2023 15:23

manually apply block split logic for read-only

db780ac

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 1127-md-only

5d94f1b

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee changed the title ~~[Data] Get metadata from InputDataBuffer for read-only datasets~~ [Data] Enable read-only Datasets to be executed on new execution backend Nov 30, 2023

scottjlee added 2 commits November 30, 2023 15:28

add test from copying logical op

e394e6d

Signed-off-by: Scott Lee <sjl@anyscale.com>

remove logical op copy

9760147

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee mentioned this pull request Dec 1, 2023

[Data] Copy input LogicalOperators to avoid mutating their output dependencies #41468

Closed

8 tasks

scottjlee marked this pull request as ready for review December 1, 2023 00:59

scottjlee requested review from ericl, scv119, c21, amogkam, bveeramani, raulchen, stephanie-wang and Zandew as code owners December 1, 2023 00:59

scottjlee assigned raulchen and c21 Dec 1, 2023

raulchen approved these changes Dec 1, 2023

get readtask directly from datasource instead of executing inputdatab…

f2add6d

…uffer Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee requested a review from raulchen December 2, 2023 02:54

raulchen merged commit 19bedd1 into ray-project:master Dec 4, 2023
15 of 16 checks passed

scottjlee mentioned this pull request Dec 4, 2023

[Data] Skip execution for LogicalPlans with only InputData op #41597

Merged

8 tasks

scottjlee mentioned this pull request Dec 5, 2023

[Data] Skip test_client_compat.py::test_client_data_get unit test #41634

Merged

8 tasks
