Add image datasets to ray train benchmark #51657
Closed
srinathk10 wants to merge 55 commits into master
srinathk10 commented on Mar 26, 2025
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Just a missing comma and equal sign Signed-off-by: Jonathan Dumaine <jonathan@dumstruck.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
It would fail on

```
FAILED python/ray/tests/test_threaded_actor.py::test_threaded_actor_basic - ValueError: When connecting to an existing cluster, num_cpus and num_gpus must not be provided.
```

so instead just create the cluster directly.

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…k up from zero (#51600)

## Why are these changes needed?

Previously we added a new deployment status `DEPLOY_FAILED`, which a deployment transitions into if the replicas repeatedly fail to start after a new config update / new deployment. After a certain number of retries, the controller stops retrying replicas for that deployment and it enters a terminal state.

However, that causes deployments whose replicas fail during autoscaling (from zero) to also stop retrying after the threshold. This PR fixes that.

Summary of failure scenarios and how we now handle them:

1. A deployment is first deployed / re-deployed. If the number of replica failures exceeds a threshold (`3 * target`), and not a single replica was able to start successfully, the deployment transitions to `DEPLOY_FAILED` and no more replicas are retried. This state is terminal, and the user must redeploy.
2. An autoscaling deployment is deployed with `initial_replicas = 0`. A request is sent and replicas fail to start. Similarly, if the number of replica failures exceeds a threshold (`3 * target`), and not a single replica was able to start successfully, no more replicas are retried. The deployment transitions to `UNHEALTHY`.
3. An autoscaling deployment is deployed and replicas start successfully. It later scales down and tries to scale back up again, but there are a lot of replica failures. Here the deployment could transition to `UNHEALTHY` if there are enough replica failures, but replicas will _continue_ to retry.

## Related issue number

closes #50710

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
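The three scenarios above can be summarized as a small decision function. This is only an illustrative sketch, not Ray Serve's controller code: `next_status`, its parameters, and the `"UNCHANGED"` placeholder are hypothetical names. Only the `3 * target` threshold and the status names come from the description, and the sketch assumes scenario 3 uses the same threshold, which the description does not state.

```python
from typing import Tuple


def next_status(
    failures: int,
    target: int,
    any_replica_started: bool,
    is_new_deploy: bool,
) -> Tuple[str, bool]:
    """Return (status, keep_retrying) for a deployment with failing replicas.

    Hypothetical helper for illustration; names are not from Ray Serve.
    """
    threshold = 3 * target
    if failures <= threshold:
        # Below the threshold nothing changes and replicas keep retrying.
        return ("UNCHANGED", True)
    if not any_replica_started:
        if is_new_deploy:
            # Scenario 1: first deploy / redeploy never produced a replica.
            return ("DEPLOY_FAILED", False)
        # Scenario 2: autoscaling up from zero never produced a replica.
        return ("UNHEALTHY", False)
    # Scenario 3: replicas ran before; mark unhealthy but keep retrying.
    # Assumption: the same threshold applies here.
    return ("UNHEALTHY", True)
```

For example, `next_status(7, 2, False, True)` yields `("DEPLOY_FAILED", False)`, while the same failure count after a successful earlier start yields `("UNHEALTHY", True)`.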
…e for duration="auto". (#51637) Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Why are these changes needed?

Update the list of possible deployment statuses in the docs.

## Related issue number

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
#51645) Flaky on windows Signed-off-by: akyang-anyscale <alexyang@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Enable a CI environment with writable cgroupv2. To use this environment for testing, use the `--privileged-container` flag in your test definitions. To let bazel read and write sys/fs/cgroup in your tests, use `--build-type cgroup`.

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Abrar Sheikh <abrar@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Why are these changes needed?

This PR removes a feature flag that controls whether the proxy should use cached replica queue length values for routing. The FF was [introduced](#42943) over a year ago as a way for users to quickly switch back to the previous implementation. It has been enabled by default for [over a year](#43169) now and works as expected, so let's remove it.

Consequently, this PR also removes `RAY_SERVE_ENABLE_STRICT_MAX_ONGOING_REQUESTS`, as it is always enabled if `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE` is enabled.

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Why are these changes needed?

The current Databricks integration in Ray Data requires providing the Databricks host URL without the "https://" prefix. However, this creates compatibility issues when using Ray Data alongside MLflow, as MLflow's Databricks integration (which uses the same DATABRICKS_HOST environment variable) expects the URL to include the "https://" prefix.

## Related issue number

Closes #49925

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Gil Leibovitz <gil.leibovitz@doubleverify.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Gil Leibovitz <gil.leibovitz@doubleverify.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
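One way to accept a `DATABRICKS_HOST` value with or without the "https://" prefix is to normalize it before building request URLs. The sketch below is an assumption about how such normalization could look, not the code in this change: `normalize_databricks_host`, `build_url`, and the example path are hypothetical.

```python
def normalize_databricks_host(host: str) -> str:
    """Return the bare host name, without a scheme or trailing slash.

    Hypothetical helper; not the function used by Ray Data.
    """
    host = host.strip()
    for scheme in ("https://", "http://"):
        if host.lower().startswith(scheme):
            # Drop the scheme so both input styles converge on one form.
            host = host[len(scheme):]
            break
    return host.rstrip("/")


def build_url(host: str, path: str) -> str:
    """Build an https URL from either style of host value."""
    return f"https://{normalize_databricks_host(host)}/{path.lstrip('/')}"
```

With this, `build_url("example.cloud.databricks.com", "api/x")` and `build_url("https://example.cloud.databricks.com/", "/api/x")` produce the same URL, so one environment variable can serve both Ray Data and MLflow.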
…51433)

## Why are these changes needed?

Update repartition on target_num_rows_per_block: When `target_num_rows_per_block` is set, it only repartitions Dataset blocks that are larger than `target_num_rows_per_block`.

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
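The behavior described above, where only blocks larger than `target_num_rows_per_block` are repartitioned, can be modeled on plain row counts. This is a sketch under assumptions: `split_oversized_blocks` is a hypothetical name, and cutting an oversized block into full-size chunks plus a remainder is a guess at the splitting policy, which the description does not specify.

```python
from typing import List


def split_oversized_blocks(block_sizes: List[int], target: int) -> List[int]:
    """Model repartitioning on row counts: split only blocks above `target`.

    Hypothetical illustration; not the Ray Data implementation.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    result: List[int] = []
    for rows in block_sizes:
        if rows <= target:
            # Blocks at or under the target pass through untouched.
            result.append(rows)
            continue
        full_chunks, remainder = divmod(rows, target)
        result.extend([target] * full_chunks)
        if remainder:
            result.append(remainder)
    return result
```

Calling `split_oversized_blocks([3, 10, 25], 10)` leaves the first two blocks alone and splits only the 25-row block.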
uv is much faster. Also adds tests to enforce dependency integrity.

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
so that "lint" tag can be used for just lints, and will not be abused to be used as "always" Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…51668) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Why are these changes needed?

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: jukejian <jukejian@bytedance.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…or missing keys (#44769)

As stated in #44768, the current implementation of `multiget` based on `np.searchsorted` does not check for missing keys. I added the required checks and updated the unit test for this case.

## Why are these changes needed?

## Related issue number

Closes #44768

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Wu Yufei <wuyufei.2000@bytedance.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
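The missing-key problem follows from how a sorted search works: it returns an insertion point, which for an absent key points at a neighboring entry or past the end. The sketch below uses the standard library's `bisect_left` as a stand-in for `np.searchsorted` and adds the check. The function name, the signature, and the choice to return `None` for a missing key are assumptions; the description does not say how the fix reports missing keys.

```python
from bisect import bisect_left
from typing import Any, List, Optional, Sequence


def multiget(
    sorted_keys: Sequence[int],
    values: Sequence[Any],
    lookup: Sequence[int],
) -> List[Optional[Any]]:
    """Look up several keys in a sorted key array, guarding missing keys.

    Hypothetical illustration; not the Ray Data implementation.
    """
    results: List[Optional[Any]] = []
    for key in lookup:
        i = bisect_left(sorted_keys, key)
        # Without this check, values[i] would silently return the value of
        # the next larger key, or raise IndexError past the end.
        if i == len(sorted_keys) or sorted_keys[i] != key:
            results.append(None)  # Assumption: missing keys map to None.
        else:
            results.append(values[i])
    return results
```

For keys `[1, 3, 5]`, looking up `4` lands on index 2 (the slot of `5`), so an unchecked lookup would return the wrong value rather than fail.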
## Why are these changes needed?

This PR adds async generator support to `flat_map`. The implementation is similar to how #46129 handled async callable classes for `map_batches()`. Changes include:

* Generalize the logic in `_generate_transform_fn_for_async_flat_map` so it can process both batches and rows
* Add test case for async `flat_map`

## Related issue number

#50329

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Drice1999 <chenxh267@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
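The idea behind an async `flat_map` can be shown with plain `asyncio`: each input row is passed to an async generator, and everything it yields is flattened into one output list. This is a standalone sketch, not the Ray Data code path. `duplicate`, `_drain`, and `flat_map_async` are hypothetical names, and it does not reproduce `_generate_transform_fn_for_async_flat_map`.

```python
import asyncio
from typing import Any, AsyncIterator, Callable, Dict, List

Row = Dict[str, Any]


async def duplicate(row: Row) -> AsyncIterator[Row]:
    """Example async generator: yield two output rows per input row."""
    for copy in range(2):
        await asyncio.sleep(0)  # Stand-in for real async work.
        yield {**row, "copy": copy}


async def _drain(fn: Callable[[Row], AsyncIterator[Row]], row: Row) -> List[Row]:
    """Collect every row the async generator yields for one input row."""
    return [out async for out in fn(row)]


def flat_map_async(rows: List[Row], fn: Callable[[Row], AsyncIterator[Row]]) -> List[Row]:
    """Run `fn` over all rows concurrently and flatten the outputs."""

    async def run() -> List[Row]:
        # gather() returns results in input order, so output order is stable.
        per_row = await asyncio.gather(*(_drain(fn, row) for row in rows))
        return [out for outputs in per_row for out in outputs]

    return asyncio.run(run())
```

Calling `flat_map_async([{"id": 0}, {"id": 1}], duplicate)` returns four rows, two per input, in input order.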
cgroup tests access (read and write) system directory, so they shouldn't be executed in parallel. For example, `bazel test //src/ray/common/cgroup/tests:all` should run tests one by one. --------- Signed-off-by: dentiny <dentinyhao@gmail.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…nals (#51582)

If a threaded actor receives exit signals twice as shown in the above screenshot, it will execute [task_receiver_->Stop()](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/core_worker.cc#L1224) twice. However, the second call to `task_receiver_->Stop()` will get stuck forever when executing the [releaser](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/transport/concurrency_group_manager.cc#L135).

### Reproduction

```sh
ray start --head --include-dashboard=True --num-cpus=1
# https://gist.github.com/kevin85421/7a42ac3693537c2148fa554065bb5223
python3 test.py
# Some actors are still ALIVE. If all actors are DEAD, increase the number of actors.
ray list actors
```

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
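A common way to make a shutdown path tolerate repeated exit signals is to guard it so that the release logic runs at most once. The description above does not say how the fix is implemented, so the following Python sketch only illustrates the general pattern. The `TaskReceiver` class here is a hypothetical stand-in and not the C++ core worker code.

```python
import threading


class TaskReceiver:
    """Hypothetical stand-in showing an idempotent stop()."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._stopped = False
        self.release_count = 0

    def stop(self) -> bool:
        """Stop once; return False if a previous call already stopped us."""
        with self._lock:
            if self._stopped:
                # Second exit signal: skip the release step entirely.
                return False
            self._stopped = True
        self._release()
        return True

    def _release(self) -> None:
        # Placeholder for releasing per-thread resources exactly once.
        self.release_count += 1
```

Calling `stop()` twice on the same instance runs the release step only once, so a second exit signal returns immediately instead of blocking.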
This PR fixes all variable shadowing for core worker. Variable shadowing is known to be error-prone; I personally just spent 20 minutes debugging an issue caused by it. Need to address all issues before #51669

Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Linkun Chen <github@lkchen.net> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
This reverts commit 65514ea.

## Why are these changes needed?

Logging the rejected requests is causing lower serve throughput. The regression was originally flagged from the microbenchmark test that runs in the nightly release tests.

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
)

## Why are these changes needed?

This PR removes a feature flag that controls whether replicas should immediately respawn after stopping. The FF was introduced and enabled by default [over a year ago](#43187), so let's remove it.

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
fix drift on requirements file Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
or windows builds will hard-fail. Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
a4141ea to f8cd325
## Why are these changes needed?

Add image datasets to the Ray Train benchmark.

## Related issue number