Add image datasets to ray train benchmark#51657

Closed
srinathk10 wants to merge 55 commits into master from srinathk10-train_benchmark_image_datasets

Conversation

srinathk10 (Contributor) commented Mar 24, 2025

Why are these changes needed?

Add image datasets to ray train benchmark

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

srinathk10 marked this pull request as ready for review March 27, 2025 06:15
srinathk10 and others added 28 commits March 28, 2025 07:04
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Just a missing comma and equal sign

Signed-off-by: Jonathan Dumaine <jonathan@dumstruck.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
It would fail on
```
FAILED python/ray/tests/test_threaded_actor.py::test_threaded_actor_basic - ValueError: When connecting to an existing cluster, num_cpus and num_gpus must not be provided.
```
so instead just creating the cluster directly.

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…k up from zero (#51600)

## Why are these changes needed?

Previously we added a new deployment status `DEPLOY_FAILED`, which a
deployment transitions into if the replicas repeatedly fail to start
after a new config update / new deployment. After a certain number of
retries, the controller will stop retrying replicas for that deployment
and it enters a terminal state. However that causes deployments whose
replicas fail during autoscaling (from zero) to also stop retrying after
the threshold. This PR fixes that.

Summary of failure scenarios and how we now handle them:
1. A deployment is first deployed / re-deployed. If the number of
replica failures exceeds a threshold (`3 * target`), and not a single
replica was able to start successfully, the deployment transitions to
`DEPLOY_FAILED` and no more replicas are retried. This state is
terminal, and the user must redeploy.
2. An autoscaling deployment is deployed with `initial_replicas = 0`. A
request is sent and replicas fail to start. Similarly, if the number of
replica failures exceeds a threshold (`3 * target`), and not a single
replica was able to start successfully, no more replicas are retried.
The deployment transitions to `UNHEALTHY`.
3. An autoscaling deployment is deployed and replicas start
successfully. It later scales down and tries to scale back up again, but
there are a lot of replica failures. Here the deployment could
transition to `UNHEALTHY` if there are enough replica failures, but
replicas will _continue_ to retry.
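
The three scenarios above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the Serve controller's code: `decide`, `Decision`, the `trigger` argument, and the `ever_started` flag are hypothetical names, and reusing the `3 * target` threshold for scenario 3 is an assumption, since the description only says "enough replica failures".

```python
from typing import NamedTuple, Optional

# The description above gives the threshold as `3 * target`.
FAILURE_MULTIPLIER = 3


class Decision(NamedTuple):
    status: Optional[str]  # None means "leave the current status unchanged"
    keep_retrying: bool


def decide(failures: int, target: int, ever_started: bool, trigger: str) -> Decision:
    """Hypothetical helper. `trigger` is "deploy" (new or updated config)
    or "autoscale" (scaling up, including from zero)."""
    if failures <= FAILURE_MULTIPLIER * target:
        # Threshold not exceeded yet: keep trying, no status change.
        return Decision(None, True)
    if ever_started:
        # Scenario 3: replicas ran before, so mark unhealthy but keep retrying.
        # (Applying the same threshold here is an assumption.)
        return Decision("UNHEALTHY", True)
    if trigger == "deploy":
        # Scenario 1: terminal state, the user must redeploy.
        return Decision("DEPLOY_FAILED", False)
    # Scenario 2: scale-up from zero where no replica ever started.
    return Decision("UNHEALTHY", False)
```

For example, `decide(4, 1, False, "deploy")` yields `DEPLOY_FAILED` with retries stopped, while `decide(4, 1, True, "autoscale")` yields `UNHEALTHY` with retries continuing.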

## Related issue number

closes #50710

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…e for duration="auto". (#51637)

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Why are these changes needed?

Update the list of possible deployment statuses in the docs.

## Related issue number

<!-- For example: "Closes #1234" -->

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
#51645)

Flaky on Windows

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Enable CI environment with writeable cgroupv2 in CI.

To use this environment for testing, pass the
`--privileged-container` flag in your test definitions.
To let Bazel read from and write to `/sys/fs/cgroup` in your tests, also
pass `--build-type cgroup`.

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Abrar Sheikh <abrar@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Why are these changes needed?

This PR removes a feature flag that controls whether the proxy should
use cached replica queue length values for routing. The FF was
[introduced](#42943) over a year
ago as a way for users to quickly switch back to the previous
implementation. It has been enabled by default for [over a
year](#43169) now and works as
expected, so let's remove it. Consequently, this PR also removes
`RAY_SERVE_ENABLE_STRICT_MAX_ONGOING_REQUESTS`, as it is always enabled
if `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE` is enabled.

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

The current Databricks integration in Ray Data requires providing the
Databricks host URL without the "https://" prefix. However, this creates
compatibility issues when using Ray Data alongside MLflow, as MLflow's
Databricks integration (which uses the same DATABRICKS_HOST environment
variable) expects the URL to include the "https://" prefix.
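
The commit message does not show how the mismatch is resolved. One possible approach, sketched here as an assumption, is to accept the host in either form and strip the scheme before use. The helper name `normalize_databricks_host` and the example hostname are hypothetical.

```python
def normalize_databricks_host(host: str) -> str:
    """Return the bare hostname whether or not a scheme prefix is present.

    Hypothetical helper for illustration; not taken from Ray Data.
    """
    host = host.strip()
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    # Drop a trailing slash so both spellings compare equal.
    return host.rstrip("/")
```

With this, a `DATABRICKS_HOST` value written for MLflow (with `https://`) and one written without the prefix both reduce to the same hostname.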

## Related issue number

Closes #49925

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Gil Leibovitz <gil.leibovitz@doubleverify.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Gil Leibovitz <gil.leibovitz@doubleverify.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…51433)

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

Update repartition on target_num_rows_per_block:
When `target_num_rows_per_block` is set, it only repartitions Dataset
blocks that are larger than `target_num_rows_per_block`.
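
The behavior described above can be illustrated with plain Python lists standing in for blocks. This is a sketch under assumptions, not Ray Data's implementation: `split_oversized_blocks` is a hypothetical name, and cutting an oversized block into fixed-size chunks is an assumed splitting strategy.

```python
from typing import List


def split_oversized_blocks(
    blocks: List[List[int]], target_num_rows_per_block: int
) -> List[List[int]]:
    """Split only the blocks that exceed the target; pass the rest through."""
    if target_num_rows_per_block <= 0:
        raise ValueError("target_num_rows_per_block must be positive")
    out: List[List[int]] = []
    for block in blocks:
        if len(block) <= target_num_rows_per_block:
            # Blocks at or below the target are left untouched.
            out.append(block)
            continue
        # Oversized block: cut it into chunks of at most the target size.
        for start in range(0, len(block), target_num_rows_per_block):
            out.append(block[start:start + target_num_rows_per_block])
    return out
```

Note that small blocks are never merged in this sketch; only large ones are split.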

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
uv is much, much faster

also adds tests to enforce dependency integrity.

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
so that the "lint" tag can be used for just lints, and will not be
misused as "always"

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…51668)

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: jukejian <jukejian@bytedance.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…or missing keys (#44769)

As stated in #44768, the current implementation of `multiget` based on
`np.searchsorted` does not check for missing keys. I added the required
checks and updated the unit test for this case.
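
A sorted-search lookup returns an insertion index even when the key is absent, so the result must be verified. The sketch below uses the standard library's `bisect_left` in place of `np.searchsorted` (both return the leftmost insertion index by default). The function signature and the choice to return `None` for a missing key are assumptions for illustration, not the behavior of the actual fix.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def multiget(
    sorted_keys: Sequence[int], values: Sequence[str], lookups: Sequence[int]
) -> List[Optional[str]]:
    """Look up each key in a sorted key column, returning None when absent."""
    results: List[Optional[str]] = []
    for key in lookups:
        i = bisect_left(sorted_keys, key)
        # Without this check, a missing key would silently return a
        # neighbor's value or index past the end of the column.
        if i < len(sorted_keys) and sorted_keys[i] == key:
            results.append(values[i])
        else:
            results.append(None)
    return results
```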

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes #1234" -->
Closes #44768

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Wu Yufei <wuyufei.2000@bytedance.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Why are these changes needed?
This PR adds async generator support to `flat_map`.
The implementation is similar to how #46129 handled async callable
classes for `map_batches()`. Changes include:
* Generalize the logic in `_generate_transform_fn_for_async_flat_map` so
it can process both batches and rows
* Add test case for async `flat_map`
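
To show what "async generator support" means for a flat-map style operation, here is a minimal sketch using only `asyncio`. It is not Ray Data's implementation: the `flat_map` driver, `_collect`, and the `duplicate` generator are hypothetical names, and rows are plain integers.

```python
import asyncio
from typing import Any, AsyncIterator, Callable, Iterable, List


async def duplicate(row: int) -> AsyncIterator[int]:
    """Example async generator: emit each input row twice."""
    for _ in range(2):
        await asyncio.sleep(0)  # stand-in for real async work
        yield row


async def _collect(fn: Callable[[Any], AsyncIterator[Any]], row: Any) -> List[Any]:
    # Drain one async generator into a list.
    return [out async for out in fn(row)]


async def _flat_map_async(
    fn: Callable[[Any], AsyncIterator[Any]], rows: Iterable[Any]
) -> List[Any]:
    # Run all rows concurrently; gather preserves input order.
    per_row = await asyncio.gather(*(_collect(fn, row) for row in rows))
    return [out for outs in per_row for out in outs]


def flat_map(fn: Callable[[Any], AsyncIterator[Any]], rows: Iterable[Any]) -> List[Any]:
    """Synchronous entry point that flattens every row's outputs."""
    return asyncio.run(_flat_map_async(fn, rows))
```

Calling `flat_map(duplicate, [1, 2, 3])` flattens the per-row outputs into a single list in input order.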
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number
#50329
<!-- For example: "Closes #1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Drice1999 <chenxh267@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
cgroup tests read from and write to a system directory, so they shouldn't
be executed in parallel.
For example, `bazel test //src/ray/common/cgroup/tests:all` should run
the tests one by one.

---------

Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…nals (#51582)

If a threaded actor receives exit signals twice as shown in the above
screenshot, it will execute
[task_receiver_->Stop()](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/core_worker.cc#L1224)
twice. However, the second call to `task_receiver_->Stop()` will get
stuck forever when executing the
[releaser](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/transport/concurrency_group_manager.cc#L135).
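
A common way to guard against a repeated stop call is to make the stop operation idempotent. The sketch below illustrates that general pattern in Python; whether the PR uses this approach is not stated here, and the `TaskReceiver` class and its fields are hypothetical stand-ins for the C++ code linked above.

```python
import threading


class TaskReceiver:
    """Hypothetical stand-in for a component whose stop() must run once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._stopped = False
        self.stop_calls_executed = 0

    def stop(self) -> bool:
        """Run teardown on the first call only; later calls are no-ops."""
        with self._lock:
            if self._stopped:
                # A second exit signal lands here and returns immediately
                # instead of re-entering teardown.
                return False
            self._stopped = True
        self.stop_calls_executed += 1  # placeholder for the real teardown
        return True
```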

### Reproduction

```sh
ray start --head --include-dashboard=True --num-cpus=1
# https://gist.github.com/kevin85421/7a42ac3693537c2148fa554065bb5223
python3 test.py

# Some actors are still ALIVE. If all actors are DEAD, increase the number of actors.
ray list actors
```

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
This PR fixes all variable shadowing for core worker.

Variable shadowing is known to be error-prone --- I personally just
spent 20 minutes debugging an issue caused by it.
Need to address all issues before
#51669

Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Linkun Chen <github@lkchen.net>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
akyang-anyscale and others added 5 commits March 28, 2025 07:04
This reverts commit 65514ea.

## Why are these changes needed?

Logging the rejected requests is causing lower serve throughput. The
regression was originally flagged from the microbenchmark test that runs
in the nightly release tests.
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
)

## Why are these changes needed?

This PR removes a feature flag that controls whether replicas should
immediately respawn after stopping. The FF was introduced and enabled by
default [over a year
ago](#43187), so let's remove it.

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
fix drift on requirements file

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
or Windows builds will hard-fail.

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
srinathk10 force-pushed the srinathk10-train_benchmark_image_datasets branch from a4141ea to f8cd325 March 28, 2025 07:04
srinathk10 closed this Mar 28, 2025
srinathk10 deleted the srinathk10-train_benchmark_image_datasets branch March 28, 2025 07:08
srinathk10 restored the srinathk10-train_benchmark_image_datasets branch March 28, 2025 07:08