[Data] Enable read-only Datasets to be executed on new execution backend #41466

Merged (12 commits) on Dec 4, 2023


scottjlee commented Nov 28, 2023

Why are these changes needed?

  • Enable read-only datasets to run on the new execution backend. We achieve this by executing the `InputDataBuffer` in isolation to fetch `ReadTask`s and any known `BlockMetadata` from the input Datasource or Reader. This logic is in `execute_read_only_to_legacy_lazy_block_list()`.
  • By default, use the new execution backend for all Datasets unless the user configures otherwise.
  • Copy input ops when creating a new `LogicalOperator` ("[Data] Copy input LogicalOperators to avoid mutating their output dependencies", #41468): no longer needed; see the linked PR for more details.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>
scottjlee changed the title from "[Data] Get metadata from InputDataBuffer for read-only datasets" to "[Data] Enable read-only Datasets to be executed on new execution backend" on Nov 30, 2023

raulchen left a comment

LGTM.
By the way, this PR reminds me of an issue I saw the other day. For read-only datasets, calling ds.count() should get the count from the metadata. However, ds.count() currently triggers execution of the entire dataset. I'm not sure whether this is still an issue after this PR. Could you verify that when you have a chance? (Not blocking this PR.)
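
The behavior requested above, answering a count from metadata and executing only as a fallback, can be sketched as follows. This is a minimal illustration, not Ray's implementation: `BlockMeta`, `count_rows`, and `execute_and_count` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BlockMeta:
    """Hypothetical stand-in for per-block metadata (not Ray's BlockMetadata)."""
    num_rows: Optional[int]  # None when the source cannot report a row count


def count_rows(metas: List[BlockMeta], execute_and_count: Callable[[], int]) -> int:
    """Return the row count from metadata when every block reports one.

    Falls back to `execute_and_count` (a full read) when no metadata is
    available or at least one block has an unknown row count.
    """
    if metas and all(m.num_rows is not None for m in metas):
        return sum(m.num_rows for m in metas)
    return execute_and_count()


# Usage: the fallback is only invoked when metadata is incomplete.
print(count_rows([BlockMeta(3), BlockMeta(4)], lambda: -1))     # 7
print(count_rows([BlockMeta(3), BlockMeta(None)], lambda: 99))  # 99
```

Whether a real datasource can supply row counts up front depends on the file format; Parquet footers typically can, while formats such as CSV generally cannot without reading the data.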

scottjlee commented Dec 2, 2023

@raulchen I used the parquet_metadata_resolution release test to debug the case you mentioned and found that the previous approach causes spilling and an eventual OOM. Instead of executing a plan containing only InputDataBuffer, I tried a new approach: directly using the Read logical op's _datasource_or_legacy_reader.get_read_tasks() to construct the output LazyBlockList (logic in get_legacy_lazy_block_list_read_only()). Any concerns with this approach overall?

The test run of parquet_metadata_resolution took 37.18 seconds, which is consistent with current runtimes. I also checked the runtime and throughput of torch_batch_inference_1_gpu_10gb_raw; both are consistent with current results.
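
The idea in the comment above, building a lazy block list directly from a datasource's read tasks so that nothing is read until blocks are requested, can be sketched in plain Python. All classes here (`ReadTaskSketch`, `DatasourceSketch`, `LazyBlockListSketch`) are hypothetical stand-ins written for illustration; only the method name `get_read_tasks` mirrors the one mentioned in the comment, and the real Ray classes differ.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class ReadTaskSketch:
    """Hypothetical read task: a deferred read plus metadata known up front."""
    read_fn: Callable[[], List[int]]  # produces the block's rows when called
    num_rows: Optional[int] = None    # None if the source cannot tell in advance


class DatasourceSketch:
    """Hypothetical datasource that splits range(n) into read tasks."""

    def __init__(self, n: int):
        self.n = n

    def get_read_tasks(self, parallelism: int) -> List[ReadTaskSketch]:
        step = max(1, -(-self.n // parallelism))  # ceiling division
        tasks = []
        for start in range(0, self.n, step):
            stop = min(start + step, self.n)
            tasks.append(ReadTaskSketch(
                # Bind start/stop as defaults so each task reads its own slice.
                read_fn=lambda s=start, e=stop: list(range(s, e)),
                num_rows=stop - start,
            ))
        return tasks


class LazyBlockListSketch:
    """Holds read tasks without running them; blocks are read only on demand."""

    def __init__(self, tasks: List[ReadTaskSketch]):
        self.tasks = tasks
        self.reads_executed = 0

    def known_num_rows(self) -> Optional[int]:
        counts = [t.num_rows for t in self.tasks]
        return None if any(c is None for c in counts) else sum(counts)

    def iter_blocks(self) -> Iterator[List[int]]:
        for task in self.tasks:
            self.reads_executed += 1
            yield task.read_fn()


lazy = LazyBlockListSketch(DatasourceSketch(10).get_read_tasks(parallelism=4))
print(lazy.known_num_rows(), lazy.reads_executed)  # 10 0
```

In this sketch, constructing the list costs only the metadata lookup, which is the property that avoids materializing the dataset just to inspect it.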

raulchen merged commit 19bedd1 into ray-project:master on Dec 4, 2023
15 of 16 checks passed
aslonnie pushed a commit that referenced this pull request Dec 5, 2023
- After #41466, all Datasets are executed using the new streaming executor backend. One edge case related to MaterializedDatasets was not caught: executing an already materialized dataset caused extraneous (empty) metrics to be registered from the newly created datasets. This PR covers the edge case by skipping execution when the Dataset's logical plan consists of only an `InputData` operator.
- Also enables Data dashboard tests to run for any data-related change, not just for dashboard-related changes.
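
The skip described in the first item above can be sketched as a check on the shape of the logical plan. This is a simplified illustration under stated assumptions: `Op`, `is_input_data_only`, and `maybe_execute` are hypothetical names, and the real check in Ray operates on its own logical plan classes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Op:
    """Hypothetical logical operator node; `inputs` are its upstream operators."""
    name: str
    inputs: List["Op"] = field(default_factory=list)


def is_input_data_only(dag: Op) -> bool:
    """True when the plan is a single InputData operator with no upstream ops."""
    return dag.name == "InputData" and not dag.inputs


def maybe_execute(dag: Op, blocks: list, run_plan: Callable[[Op], list]) -> list:
    """Return the existing blocks for an InputData-only plan; otherwise execute."""
    if is_input_data_only(dag):
        # Already materialized: no executor is started, so no new metrics appear.
        return blocks
    return run_plan(dag)


materialized = Op("InputData")
mapped = Op("MapBatches", inputs=[Op("InputData")])
print(is_input_data_only(materialized), is_input_data_only(mapped))  # True False
```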

Signed-off-by: Scott Lee <sjl@anyscale.com>
c21 pushed a commit that referenced this pull request Dec 6, 2023
…41634)

#41466 enables the Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in the `test_client_data_get` test now runs through the streaming executor, which has had many known incompatibilities with Ray Client since Ray 2.7. We therefore skip the test, which checks compatibility between Ray Client and Ray Data, until a future Ray Client implementation can better support Ray Data usage.

Signed-off-by: Scott Lee <sjl@anyscale.com>
architkulkarni pushed a commit that referenced this pull request Dec 7, 2023
…41634) (#41665)

#41466 enables the Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in the `test_client_data_get` test now runs through the streaming executor, which has had many known incompatibilities with Ray Client since Ray 2.7. We therefore skip the test, which checks compatibility between Ray Client and Ray Data, until a future Ray Client implementation can better support Ray Data usage.

Signed-off-by: Scott Lee <sjl@anyscale.com>