Add image datasets to ray train benchmark by srinathk10 · Pull Request #51657 · ray-project/ray

srinathk10 · 2025-03-24T23:50:17Z

Why are these changes needed?

Add image datasets to ray train benchmark

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

release/train_tests/benchmark/config.py

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Just a missing comma and equal sign Signed-off-by: Jonathan Dumaine <jonathan@dumstruck.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

It would fail on

```
FAILED python/ray/tests/test_threaded_actor.py::test_threaded_actor_basic - ValueError: When connecting to an existing cluster, num_cpus and num_gpus must not be provided.
```

so instead just creating the cluster directly.

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

…k up from zero (#51600)

## Why are these changes needed?

Previously we added a new deployment status `DEPLOY_FAILED`, which a deployment transitions into if the replicas repeatedly fail to start after a new config update / new deployment. After a certain number of retries, the controller stops retrying replicas for that deployment and it enters a terminal state.

However, that causes deployments whose replicas fail during autoscaling (from zero) to also stop retrying after the threshold. This PR fixes that.

Summary of failure scenarios and how we now handle them:

1. A deployment is first deployed / re-deployed. If the number of replica failures exceeds a threshold (`3 * target`), and not a single replica was able to start successfully, the deployment transitions to `DEPLOY_FAILED` and no more replicas are retried. This state is terminal, and the user must redeploy.
2. An autoscaling deployment is deployed with `initial_replicas = 0`. A request is sent and replicas fail to start. Similarly, if the number of replica failures exceeds a threshold (`3 * target`), and not a single replica was able to start successfully, no more replicas are retried. The deployment transitions to `UNHEALTHY`.
3. An autoscaling deployment is deployed and replicas start successfully. It later scales down and tries to scale back up again, but there are a lot of replica failures. Here the deployment could transition to `UNHEALTHY` if there are enough replica failures, but replicas will _continue_ to retry.

## Related issue number

closes #50710

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
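The three failure scenarios in this commit message can be read as a small decision table. The sketch below is only an illustration of that table, not the Serve controller's code: `decide`, `Decision`, and the `IN_PROGRESS` placeholder status are hypothetical names, and it assumes that scenario 3 uses the same `3 * target` threshold as the other two, which the message does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    IN_PROGRESS = "IN_PROGRESS"      # hypothetical placeholder: still below the threshold
    DEPLOY_FAILED = "DEPLOY_FAILED"  # named in the commit message
    UNHEALTHY = "UNHEALTHY"          # named in the commit message


@dataclass(frozen=True)
class Decision:
    status: Status
    keep_retrying: bool


def decide(failures: int, target: int, any_replica_started: bool,
           scaling_from_zero: bool) -> Decision:
    """Map replica-failure counts to a status, following the three scenarios."""
    threshold = 3 * target
    if failures <= threshold:
        return Decision(Status.IN_PROGRESS, keep_retrying=True)
    if any_replica_started:
        # Scenario 3: replicas started before, so keep retrying even if unhealthy.
        # Assumption: the same threshold applies here.
        return Decision(Status.UNHEALTHY, keep_retrying=True)
    if scaling_from_zero:
        # Scenario 2: autoscaling from zero and nothing ever started.
        return Decision(Status.UNHEALTHY, keep_retrying=False)
    # Scenario 1: first deploy or re-deploy and nothing ever started (terminal).
    return Decision(Status.DEPLOY_FAILED, keep_retrying=False)
```

For example, `decide(7, 2, any_replica_started=False, scaling_from_zero=False)` corresponds to scenario 1, because 7 failures exceed the threshold of 6 and no replica ever started.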

…e for duration="auto". (#51637) Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

## Why are these changes needed?

Update the list of possible deployment statuses in the docs.

## Related issue number

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

#51645) Flaky on windows Signed-off-by: akyang-anyscale <alexyang@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Enable a CI environment with writeable cgroupv2. To use this environment for testing, you need to use the `--privileged-container` flag in your test definitions. To let bazel read and write `sys/fs/cgroup` in your tests, you need to use the `--build-type cgroup` flag.

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Signed-off-by: Abrar Sheikh <abrar@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

## Why are these changes needed?

This PR removes a feature flag that controls whether the proxy should use cached replica queue length values for routing. The FF was [introduced](#42943) over a year ago as a way for users to quickly switch back to the previous implementation. It has been enabled by default for [over a year](#43169) now and works as expected, so let's remove it.

Consequently, this PR also removes `RAY_SERVE_ENABLE_STRICT_MAX_ONGOING_REQUESTS`, as it is always enabled if `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE` is enabled.

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

## Why are these changes needed?

The current Databricks integration in Ray Data requires providing the Databricks host URL without the "https://" prefix. However, this creates compatibility issues when using Ray Data alongside MLflow, as MLflow's Databricks integration (which uses the same DATABRICKS_HOST environment variable) expects the URL to include the "https://" prefix.

## Related issue number

Closes #49925

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: Gil Leibovitz <gil.leibovitz@doubleverify.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Gil Leibovitz <gil.leibovitz@doubleverify.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
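One way to tolerate both forms of `DATABRICKS_HOST` is to normalize the value before use. The following is a minimal sketch under that assumption, not the code from this change: `strip_https` and `with_https` are hypothetical helper names, and the host string in the usage note is made up.

```python
HTTPS_PREFIX = "https://"


def strip_https(host: str) -> str:
    """Return the bare host name, whether or not the prefix was supplied."""
    if host.startswith(HTTPS_PREFIX):
        return host[len(HTTPS_PREFIX):]
    return host


def with_https(host: str) -> str:
    """Return the host with exactly one "https://" prefix."""
    return HTTPS_PREFIX + strip_https(host)
```

With helpers like these, a value such as `https://example.cloud.databricks.com` and its bare form `example.cloud.databricks.com` both resolve to the same host, so the same environment variable can serve a consumer that wants the prefix and one that does not.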

…51433)

## Why are these changes needed?

Update repartition on target_num_rows_per_block: When `target_num_rows_per_block` is set, it only repartitions Dataset blocks that are larger than `target_num_rows_per_block`.

## Related issue number

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
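The behaviour described for `target_num_rows_per_block` can be pictured with plain Python lists standing in for blocks. This is a rough sketch, not Ray Data's implementation: `repartition_blocks` is a hypothetical name, and cutting oversized blocks into fixed-size chunks is an assumption about how the split is done.

```python
from typing import List, Sequence


def repartition_blocks(blocks: Sequence[list], target_num_rows_per_block: int) -> List[list]:
    """Split only the blocks that hold more rows than the target."""
    if target_num_rows_per_block <= 0:
        raise ValueError("target_num_rows_per_block must be positive")
    result: List[list] = []
    for block in blocks:
        if len(block) <= target_num_rows_per_block:
            # Small enough already: pass through untouched.
            result.append(block)
            continue
        # Oversized block: cut into chunks of at most the target size (assumed strategy).
        for start in range(0, len(block), target_num_rows_per_block):
            result.append(block[start:start + target_num_rows_per_block])
    return result
```

Note that small blocks are never merged in this sketch; only blocks above the target are touched, which matches the wording of the commit message.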

uv is much, much faster. Also adds tests to enforce dependency integrity. Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

so that "lint" tag can be used for just lints, and will not be abused to be used as "always" Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

…51668) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

## Why are these changes needed?

## Related issue number

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

Signed-off-by: jukejian <jukejian@bytedance.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

…or missing keys (#44769)

As stated in #44768, the current implementation of `multiget` based on `np.searchsorted` does not check for missing keys. I added the required checks and updated unit test for this case.

## Why are these changes needed?

## Related issue number

Closes #44768

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [x] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

Signed-off-by: Wu Yufei <wuyufei.2000@bytedance.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
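A sorted-search lookup only returns an insertion index, so a missing key silently maps to a neighbouring entry (or to an out-of-range index) unless the result is checked. The sketch below shows that check using the standard library's `bisect_left`, which behaves like a left-sided sorted search; it is a stand-in for illustration, not the `multiget` code from this change. The `multiget` signature here and the choice to return `None` for missing keys are assumptions.

```python
from bisect import bisect_left
from typing import Any, List, Optional, Sequence


def multiget(sorted_keys: Sequence[int], values: Sequence[Any],
             queries: Sequence[int]) -> List[Optional[Any]]:
    """Look up each query in a sorted key array, returning None for missing keys."""
    results: List[Optional[Any]] = []
    for query in queries:
        index = bisect_left(sorted_keys, query)
        # The insertion index is only a hit if it is in range AND the key matches.
        if index < len(sorted_keys) and sorted_keys[index] == query:
            results.append(values[index])
        else:
            results.append(None)
    return results
```

Without the two-part check, a query of `4` against keys `[1, 3, 5]` would land on index 2 and return the value stored for key `5`.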

## Why are these changes needed?

This PR adds async generator support to `flat_map`. The implementation is similar to how #46129 handled async callable classes for map_batches(), changes include:

* Generalize the logic in `_generate_transform_fn_for_async_flat_map` so it can process both batches and rows
* Add test case for async `flat_map`

## Related issue number

#50329

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [x] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: Drice1999 <chenxh267@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
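To make the idea of an async generator used as a flat-map function concrete, here is a small standalone sketch built only on `asyncio`. It does not use Ray and does not reproduce `_generate_transform_fn_for_async_flat_map`; `flat_map_async`, `_collect`, and `expand` are hypothetical names chosen for illustration.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterable, List


async def expand(row: dict) -> AsyncIterator[dict]:
    """Example user function: yield row["n"] output rows per input row."""
    for part in range(row["n"]):
        await asyncio.sleep(0)  # stands in for real async I/O
        yield {"id": row["id"], "part": part}


async def _collect(fn: Callable[[dict], AsyncIterator[dict]], row: dict) -> List[dict]:
    # Drain one async generator into a list.
    return [out async for out in fn(row)]


def flat_map_async(rows: Iterable[dict],
                   fn: Callable[[dict], AsyncIterator[dict]]) -> List[dict]:
    """Run fn over every row concurrently and flatten the outputs in input order."""
    async def run() -> List[dict]:
        per_row = await asyncio.gather(*(_collect(fn, row) for row in rows))
        return [out for outs in per_row for out in outs]

    return asyncio.run(run())
```

Because `asyncio.gather` returns results in the order its awaitables were passed, the flattened output keeps the input row order even though the generators run concurrently.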

cgroup tests access (read and write) system directory, so they shouldn't be executed in parallel. For example, `bazel test //src/ray/common/cgroup/tests:all` should run tests one by one. --------- Signed-off-by: dentiny <dentinyhao@gmail.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

…nals (#51582)

If a threaded actor receives exit signals twice as shown in the above screenshot, it will execute [task_receiver_->Stop()](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/core_worker.cc#L1224) twice. However, the second call to `task_receiver_->Stop()` will get stuck forever when executing the [releaser](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/transport/concurrency_group_manager.cc#L135).

### Reproduction

```sh
ray start --head --include-dashboard=True --num-cpus=1
# https://gist.github.com/kevin85421/7a42ac3693537c2148fa554065bb5223
python3 test.py
# Some actors are still ALIVE. If all actors are DEAD, increase the number of actors.
ray list actors
```

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
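The hazard described here is a shutdown routine that is not safe to call twice. The commit message does not show how the C++ code was changed, so the following is only a generic Python illustration of one common guard (an idempotent stop protected by a lock), with a hypothetical `TaskReceiver` class that is unrelated to Ray's real one.

```python
import threading


class TaskReceiver:
    """Toy receiver whose stop() performs its shutdown work at most once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._stopped = False
        self.shutdown_runs = 0  # how many times the real shutdown work executed

    def stop(self) -> bool:
        """Return True if this call did the shutdown, False if it was a repeat."""
        with self._lock:
            if self._stopped:
                return False  # second exit signal: nothing left to release
            self._stopped = True
        # Shutdown work goes here; it is reached by exactly one caller.
        self.shutdown_runs += 1
        return True
```

A second `stop()` call returns immediately instead of trying to release resources that are already gone.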

This PR fixes all variable shadowing for core worker. Variable shadowing is known to be error-prone; I personally just spent 20 minutes debugging an issue caused by it. Need to address all issues before #51669 Signed-off-by: dentiny <dentinyhao@gmail.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Signed-off-by: Linkun Chen <github@lkchen.net> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

This reverts commit 65514ea.

## Why are these changes needed?

Logging the rejected requests is causing lower serve throughput. The regression was originally flagged from the microbenchmark test that runs in the nightly release tests.

## Related issue number

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

) ## Why are these changes needed? This PR removes a feature flag that controls whether replicas should immediately respawn after stopping. The FF was introduced and enabled by default [over a year ago](#43187), so let's remove it. Signed-off-by: akyang-anyscale <alexyang@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

fix drift on requirements file Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

or windows builds will hard-fail. Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

srinathk10 commented Mar 26, 2025


srinathk10 marked this pull request as ready for review March 27, 2025 06:15

srinathk10 and others added 28 commits March 28, 2025 07:04

Fix Ray Train release test

57f1b4e

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Ray Train Release Test: Add Image Classification Jpeg

b7186a4

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Fix syntax errors in Ray Tune example pbt_ppo_example.ipynb (#51626)

b370458

Just a missing comma and equal sign Signed-off-by: Jonathan Dumaine <jonathan@dumstruck.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

[RLlib] Make min/max env steps per evaluation sample call configurabl…

e6ab31a

…e for duration="auto". (#51637) Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Skip multiplex metrics and proxy status code is error tests on windows (

01be839

#51645) Flaky on windows Signed-off-by: akyang-anyscale <alexyang@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

[Test][KubeRay] Add doctest for RayCluster Quickstart doc (#51249)

2ec9d9f

Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

refactor replica _handle_errors_and_metrics (#51644)

302c1e1

Signed-off-by: Abrar Sheikh <abrar@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

[ray.llm] Refactor model download utilities (#51604)

3d69a81

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Misc fixes

0576b5c

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Misc fixes

b8ad722

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Addressed review comments

56a84dc

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

[deps] Use UV to compile LLM dependencies (#51323)

a2e3fd9

uv is much, much faster. Also adds tests to enforce dependency integrity. Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

[ci] add an always tag for cond testing (#51662)

7d41499

so that "lint" tag can be used for just lints, and will not be abused to be used as "always" Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

[core] Correct the wording in the OnNodeDead logs to avoid confusion (#…

0c73275

…51668) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

[data.llm] support trust remote code (#51680)

a7e7848

Signed-off-by: Linkun Chen <github@lkchen.net> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

akyang-anyscale and others added 5 commits March 28, 2025 07:04

[data] add getdaft to compiled versions (#51723)

0787132

fix drift on requirements file Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

[core] Fix windows build with no cython -Wno-shadow (#51730)

cd164b3

or windows builds will hard-fail. Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

Misc fixes

f8cd325

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

srinathk10 force-pushed the srinathk10-train_benchmark_image_datasets branch from a4141ea to f8cd325 Compare March 28, 2025 07:04

srinathk10 requested review from a team, GeneDer, akshay-anyscale, edoakes, hongpeng-guo, jjyao, justinvyu, kevin85421, matthewdeng, pcmoritz, raulchen, richardliaw, simonsays1980, sven1977, woshiyyya and zcin as code owners March 28, 2025 07:04

srinathk10 closed this Mar 28, 2025

srinathk10 deleted the srinathk10-train_benchmark_image_datasets branch March 28, 2025 07:08

srinathk10 restored the srinathk10-train_benchmark_image_datasets branch March 28, 2025 07:08

hainesmichaelc added the community-backlog label May 22, 2025

srinathk10 wants to merge 55 commits into master from srinathk10-train_benchmark_image_datasets

20 participants
