-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Enable read-only Datasets to be executed on new execution backend #41466
Conversation
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
InputDataBuffer
for read-only datasetsSigned-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
By the way, this PR just reminds me an issue I saw the other day. For read-only datasets, when we call ds.count()
, we should get the count from the metadata. However, currently ds.count()
will trigger the execution for the entire dataset. Not sure if this is still an issue after this PR. Could you verify that when you have a chance? (not blocking this PR)
…uffer Signed-off-by: Scott Lee <sjl@anyscale.com>
@raulchen I used Test run of |
- After #41466, all Datasets are executed using the new streaming executor backend. There is an edge case that was not caught related to MaterializedDatasets, where executing the already materialized dataset caused extraneous (empty) metrics to be registered from these newly created datasets. This PR covers this edge case by skipping execution in the case where the Dataset's logical plan consists of only an `InputData` operator. - Also enables Data dashboard tests to run for any data-related change, not just for dashboard related changes. Signed-off-by: Scott Lee <sjl@anyscale.com>
…41634) #41466 enables Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in `test_client_data_get` test is now executed through the streaming executor, which is known to have many incompatibilities since Ray 2.7. So, we skip the test which checks compatibility between Ray Client and Ray Data, until we have a future Ray Client implementation which can better support Ray Data usage. Signed-off-by: Scott Lee <sjl@anyscale.com>
…ay-project#41634) ray-project#41466 enables Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in `test_client_data_get` test is now executed through the streaming executor, which is known to have many incompatibilities since Ray 2.7. So, we skip the test which checks compatibility between Ray Client and Ray Data, until we have a future Ray Client implementation which can better support Ray Data usage. Signed-off-by: Scott Lee <sjl@anyscale.com>
…ay-project#41634) ray-project#41466 enables Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in `test_client_data_get` test is now executed through the streaming executor, which is known to have many incompatibilities since Ray 2.7. So, we skip the test which checks compatibility between Ray Client and Ray Data, until we have a future Ray Client implementation which can better support Ray Data usage. Signed-off-by: Scott Lee <sjl@anyscale.com>
…41634) (#41665) #41466 enables Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in `test_client_data_get` test is now executed through the streaming executor, which is known to have many incompatibilities since Ray 2.7. So, we skip the test which checks compatibility between Ray Client and Ray Data, until we have a future Ray Client implementation which can better support Ray Data usage. Signed-off-by: Scott Lee <sjl@anyscale.com>
…ay-project#41634) ray-project#41466 enables Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in `test_client_data_get` test is now executed through the streaming executor, which is known to have many incompatibilities since Ray 2.7. So, we skip the test which checks compatibility between Ray Client and Ray Data, until we have a future Ray Client implementation which can better support Ray Data usage. Signed-off-by: Scott Lee <sjl@anyscale.com>
…ay-project#41634) ray-project#41466 enables Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in `test_client_data_get` test is now executed through the streaming executor, which is known to have many incompatibilities since Ray 2.7. So, we skip the test which checks compatibility between Ray Client and Ray Data, until we have a future Ray Client implementation which can better support Ray Data usage. Signed-off-by: Scott Lee <sjl@anyscale.com>
…ay-project#41634) ray-project#41466 enables Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in `test_client_data_get` test is now executed through the streaming executor, which is known to have many incompatibilities since Ray 2.7. So, we skip the test which checks compatibility between Ray Client and Ray Data, until we have a future Ray Client implementation which can better support Ray Data usage. Signed-off-by: Scott Lee <sjl@anyscale.com>
…ay-project#41634) ray-project#41466 enables Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in `test_client_data_get` test is now executed through the streaming executor, which is known to have many incompatibilities since Ray 2.7. So, we skip the test which checks compatibility between Ray Client and Ray Data, until we have a future Ray Client implementation which can better support Ray Data usage. Signed-off-by: Scott Lee <sjl@anyscale.com>
Why are these changes needed?
InputDataBuffer
in isolation to fetchReadTasks
and any knownBlockMetadata
from the input Datasource or Reader. This logic is inexecute_read_only_to_legacy_lazy_block_list()
.Dataset
s unless otherwise configured by the user.Copy input ops when creating a newNo longer needed, see the linked PR for more details.LogicalOperator
: [Data] Copy inputLogicalOperator
s to avoid mutating their output dependencies #41468Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.