
Conversation


@Jay-ju Jay-ju commented Sep 12, 2024

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ruisearch42 and others added 30 commits August 27, 2024 08:16
Rename the vars to be more accurate and easier to follow.
…ecker (ray-project#47138)

- Add missing tune public API references 
- Turn on the API policy lint checker so that all public APIs need to be
documented going forward

Test:
- CI

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…oject#47373)

Created by release automation bot.

Update with commit 679989c

Signed-off-by: kevin <kevin@anyscale.com>
…ject#47369)

Add instructions for troubleshooting doc builds in the presence of
`_raylet.so`. The presence of `_raylet.so` causes Sphinx to fail,
since it tries to actually import the file instead of parsing the
docstrings from the source code.

Test:
- CI

Signed-off-by: can <can@anyscale.com>
This PR fixes part of the problem by creating the payload message once
and reusing it throughout the benchmark.

Ran the release test on this change
[build](https://buildkite.com/ray-project/release/builds/21663#01918fe1-853b-46f2-9699-c4045b182b8c)
now seeing the `grpc_10mb_p50_latency` now dropped to ~58ms from ~80ms
previously.
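The optimization above can be sketched in a few lines: build the payload once outside the loop and reuse it on every request, instead of paying the construction cost per request. Names here are illustrative, not the benchmark's actual code.

```python
def make_payload(size_bytes: int) -> bytes:
    # Stand-in for an expensive message construction (e.g. a 10 MB gRPC payload).
    return b"x" * size_bytes

# Before: payload rebuilt on every request, paying the construction cost each time.
def bench_rebuild(n_requests: int, size_bytes: int) -> list:
    return [make_payload(size_bytes) for _ in range(n_requests)]

# After: payload built once and reused, so each iteration only sends it.
def bench_reuse(n_requests: int, size_bytes: int) -> list:
    payload = make_payload(size_bytes)
    return [payload for _ in range(n_requests)]

reused = bench_reuse(3, 1024)
# All iterations share the same object; no per-request allocation.
assert all(p is reused[0] for p in reused)
```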

The rest of the issue stems from the existing gRPC server implementation,
which must wait for the entirety of the unary request before it can
continue its work on the replica. We will need to create a new HTTP/2
proxy and pass the request transparently between the replica and the
proxy to speed things up. Will follow up in the future on
ray-project#47370

Closes ray-project#47371

Signed-off-by: Gene Su <e870252314@gmail.com>
…tentional worker failures (ray-project#47368)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
## Why are these changes needed?

To reduce flakiness of the `multi_node_train_benchmark.py` release tests
(e.g. `read_images_train_1_gpu_5_cpu.aws`) caused by `AWS Error
ACCESS_DENIED`, we get the filesystem object using `boto3` and pass it to
`read_parquet()` instead. We suspect the root cause of the AWS error
stems from `pyarrow.fs`, but need to confirm. See below for two release
test runs without such errors.

## Related issue number
Closes ray-project#47337

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [x] Release tests
     - https://buildkite.com/ray-project/release/builds/21708
     - https://buildkite.com/ray-project/release/builds/21742
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Write node events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
Event write is called whenever a value in the node event data schema is modified. Typically this occurs in the callback after writing NodeTable to the GCS table
Some Ray ID types have these fields and some don't. Unify them with a shared
list of methods and a test. Note that ObjectID is not in the list.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Like actor_head.py, we now update DataSource.nodes on delta. It first
queries all node infos, then subscribes to node deltas. Each delta updates:

1. DataSource.nodes[node_id]
2. DataSource.agents[node_id]
3. a warning generated after
RAY_DASHBOARD_HEAD_NODE_REGISTRATION_TIMEOUT = 10s

Note on (2) agents: it's read from internal kv, and is not readily
available until the agent.py is spawned and writes its own port to
internal kv. So we make an async task for each node to poll this port
every 1s.

It turns out that the get-all-then-subscribe code has a TOCTOU problem, so
actor_head.py is also updated to first subscribe and then get all actors.
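The TOCTOU race can be illustrated with a toy event bus: an update that arrives between "get all" and "subscribe" is silently lost, while subscribing first captures it. All names below are illustrative, not the dashboard's actual code.

```python
class Bus:
    """Toy pub/sub store: a key-value state plus subscriber queues."""

    def __init__(self):
        self.state = {}
        self.subscribers = []

    def publish(self, key, value):
        self.state[key] = value
        for q in self.subscribers:
            q.append((key, value))

    def subscribe(self):
        q = []
        self.subscribers.append(q)
        return q

def get_all_then_subscribe(bus):
    snapshot = dict(bus.state)   # 1. get all
    bus.publish("a", 2)          # ...a racing update slips in here...
    q = bus.subscribe()          # 2. subscribe: queue is empty, update lost
    snapshot.update(dict(q))
    return snapshot

def subscribe_then_get_all(bus):
    q = bus.subscribe()          # 1. subscribe first...
    bus.publish("a", 2)          # ...so the racing update lands in the queue
    snapshot = dict(bus.state)   # 2. then get all
    snapshot.update(dict(q))     # replaying the queue is harmless here
    return snapshot

bus1 = Bus(); bus1.publish("a", 1)
bus2 = Bus(); bus2.publish("a", 1)
assert get_all_then_subscribe(bus1) == {"a": 1}   # stale: missed the update
assert subscribe_then_get_all(bus2) == {"a": 2}   # correct
```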

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…7188)

The `test_actor_retry` tests are failing/flaky on Windows. They pass locally. I have not been able to access the CI logs to see what is going wrong. In order to shrink the problem (is it an overall timeout? Is one of the tests failing?) we can start by splitting the tests into two files.

Toward solving ray-project#43845.
Redeploy in between each microbenchmark.

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
…ay-project#47393)

In each Ray Data scheduling step, Ray Data launches tasks until it can't launch any more. To ensure that the progress bar updates frequently, ray-project#33952 made it such that Ray Data can only launch 50 tasks per scheduling step. However, this caused performance issues for large workloads.

This PR fixes the issue by removing the limit on the number of tasks launched per step while still updating the progress bar frequently.
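The scheduling loop described above can be sketched as follows: launch until resources are exhausted rather than stopping at an artificial 50-task cap, refreshing the progress bar on each launch so updates stay frequent. The names here are hypothetical, not Ray Data's actual internals.

```python
def scheduling_step(pending: list, resource_slots: int, refresh_bar) -> list:
    """Launch as many pending tasks as resources allow (no fixed 50-task cap),
    refreshing the progress bar after every launch."""
    launched = []
    while pending and len(launched) < resource_slots:
        launched.append(pending.pop(0))
        refresh_bar(len(launched))  # bar still updates on every launch
    return launched

bar_updates = []
pending = list(range(120))
launched = scheduling_step(pending, resource_slots=200,
                           refresh_bar=bar_updates.append)
assert len(launched) == 120      # not stopped at 50
assert bar_updates[-1] == 120    # progress bar saw every launch
```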
Small PR to add missing quotation mark in the documentations. I skipped
the checks below since they seem unnecessary.

Signed-off-by: Rio McMahon <48159564+rmcsqrd@users.noreply.github.com>
…tally (ray-project#47372)

Signed-off-by: khluu <51931015+khluu@users.noreply.github.com>
Currently we have no linting on any part of the docs code. This PR runs
pre-commit on the cluster docs.

This PR fixes the following issues:

```
trim trailing whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md
Fixing doc/source/cluster/running-applications/job-submission/cli.rst
Fixing doc/source/cluster/configure-manage-dashboard.md
Fixing doc/source/cluster/kubernetes/user-guides/pod-security.md
Fixing doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md
Fixing doc/source/cluster/kubernetes/user-guides/helm-chart-rbac.md
Fixing doc/source/cluster/vms/references/ray-cluster-configuration.rst
Fixing doc/source/cluster/running-applications/job-submission/quickstart.rst
Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md
Fixing doc/source/cluster/kubernetes/getting-started/raycluster-quick-start.md
Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/ingress.md
Fixing doc/source/cluster/kubernetes/user-guides/kuberay-gcs-ft.md
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/pyspy.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/volcano.md
Fixing doc/source/cluster/running-applications/job-submission/sdk.rst
Fixing doc/source/cluster/running-applications/job-submission/ray-client.rst
Fixing doc/source/cluster/kubernetes/troubleshooting/troubleshooting.md
Fixing doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md
Fixing doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml
Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml
Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md
Fixing doc/source/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.md
Fixing doc/source/cluster/kubernetes/user-guides/gke-gcs-bucket.md
Fixing doc/source/cluster/kubernetes/user-guides/logging.md
Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md
Fixing doc/source/cluster/kubernetes/examples/rayjob-batch-inference-example.md
Fixing doc/source/cluster/metrics.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kubeflow.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kueue.md
Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.md
Fixing doc/source/cluster/faq.rst
Fixing doc/source/cluster/running-applications/job-submission/openapi.yml
Fixing doc/source/cluster/kubernetes/user-guides/configuring-autoscaling.md
Fixing doc/source/cluster/kubernetes/getting-started/rayservice-quick-start.md
Fixing doc/source/cluster/kubernetes/user-guides/static-ray-cluster-without-kuberay.md
Fixing doc/source/cluster/kubernetes/user-guides/config.md
Fixing doc/source/cluster/kubernetes/user-guides/pod-command.md

fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing doc/source/cluster/kubernetes/images/rbac-clusterrole.svg
Fixing doc/source/cluster/running-applications/job-submission/cli.rst
Fixing doc/source/cluster/vms/user-guides/community/slurm.rst
Fixing doc/source/cluster/kubernetes/benchmarks/memory-scalability-benchmark.md
Fixing doc/source/cluster/images/ray-job-diagram.svg
Fixing doc/source/cluster/kubernetes/user-guides/observability.md
Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml
Fixing doc/source/cluster/kubernetes/images/rbac-role-one-namespace.svg
Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md
Fixing doc/source/cluster/cli.rst
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml
Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md
Fixing doc/source/cluster/kubernetes/user-guides/logging.md
Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md
Fixing doc/source/cluster/kubernetes/images/rbac-role-multi-namespaces.svg
Fixing doc/source/cluster/kubernetes/images/kubeflow-architecture.svg
Fixing doc/source/cluster/faq.rst
Fixing doc/source/cluster/running-applications/job-submission/openapi.yml
Fixing doc/source/cluster/kubernetes/images/AutoscalerOperator.svg

check for added large files..............................................Passed
check python ast.........................................................Passed
check json...........................................(no files to check)Skipped
check toml...........................................(no files to check)Skipped
black....................................................................Passed
flake8...................................................................Passed
prettier.............................................(no files to check)Skipped
mypy.................................................(no files to check)Skipped
isort (python)...........................................................Passed
rst directives end with two colons.......................................Passed
rst ``inline code`` next to normal text..................................Passed
use logger.warning(......................................................Passed
check for not-real mock methods..........................................Passed
ShellCheck v0.9.0........................................................Passed
clang-format.........................................(no files to check)Skipped
Google Java Formatter................................(no files to check)Skipped
Check for Ray docstyle violations........................................Passed
Check for Ray import order violations....................................Passed
```

Signed-off-by: pdmurray <peynmurray@gmail.com>
In `ray_experimental_perf.py`, some DAGs call `experimental_compile` twice, so the first compiled node becomes a leaf node. For example,

```python
a = DAGActor.remote()
with InputNode() as inp:
    dag = a.echo.bind(inp)
...
compiled_dag_1 = dag.experimental_compile()
# driver -> a.echo -> compiled_dag_1 (output)
...
compiled_dag_1.teardown()
...
compiled_dag_2 = dag.experimental_compile(enable_asyncio=True)
# driver -> a.echo -> compiled_dag_1
#                  |
#                  |-> compiled_dag_2 (output)
```

In this case, `a.echo` has two downstream nodes, `compiled_dag_1` and `compiled_dag_2`. However, a DAG node being read by both output nodes and non-output nodes simultaneously is not supported.
## Why are these changes needed?

close ray-project#46962
## Related issue number

ray-project#46962
## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
…nector piece). Fix: "State-connector" would use `seq_len=20`. (ray-project#47401)
Uses a ThreadPoolExecutor to handle all proto->json conversions.
Semantically unchanged; mostly this just moves code into a function
and calls it in the executor.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
)

## Why are these changes needed?
When you specify `arrow_parquet_args`, those arguments should be passed
to PyArrow so that you can configure your writes. However, we weren't
actually passing `arrow_parquet_args` to PyArrow. This PR fixes that
bug.

## Related issue number
Fixes ray-project#47153

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Co-authored-by: Scott Lee <sjl@anyscale.com>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
ruisearch42 and others added 30 commits September 10, 2024 08:46
This PR supports multiple readers on multiple nodes. It also adds tests that the feature works with large gRPC payloads and buffer resizing.

Multiple readers on multiple nodes didn't work because the code only allowed registering one remote reader reference on one specific node. This PR fixes the issue by allowing remote reader references to be registered on multiple nodes.
…7533)

When a serve app is launched, serve will startup automatically. In
certain places like k8s, it can be difficult to preconfigure serve (e.g.
in the [ray-cluster helm
chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml)
there is no ability to set the default serve arguments).

This means you either need to be explicit when you start Serve, or, if
it starts up automatically, you may need to shut it down and restart it,
which is inconvenient.

Signed-off-by: Tim Paine <Timothy.Paine@CubistSystematic.com>
…ject#47592)

Style nits.

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Write Driver Job events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
Event write is called whenever a job table data value is modified. Typically this occurs before writing JobTableData to the GCS table
…#47492)

The GCS API GetAllJobInfo serves Dashboard APIs, even for a single job. This becomes slow when the number of jobs is high. This PR pushes the job filter down to GCS to reduce the Dashboard's workload.

This API is somewhat strange because the filter `job_or_submission_id` is actually either a Job ID or a job_submission_id. We don't have an index on the latter, and some jobs don't have one. So we still GetAll from Redis, then filter by both IDs before making further RPC calls.

---------

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
…servability doc (ray-project#47462)

Signed-off-by: Rueian <rueiancsie@gmail.com>
It was saying:

```
/home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/modules/job/utils.py:56: SyntaxWarning: invalid escape sequence '\/'
```

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
The ranks should be in the order of the actors.
_default_metadata_providers adds a layer of indirection.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Went through all the constants in the file and removed the ones that are no longer used.

Signed-off-by: Gene Su <e870252314@gmail.com>
Move TestWriteTaskExportEvents to a separate file and skip it on Windows. This is acceptable for the export API feature because we currently don't support it on Windows (tests for other resource events written from GCS are also skipped on Windows).
This test is failing in postmerge (CI test windows://:task_event_buffer_test is consistently_failing, ray-project#47523) on Windows due to `unknown file: error: C++ exception with description "remove_all: The process cannot access the file because it is being used by another process.: "event_123"" thrown in TearDown()` in the teardown step.
This is the same error raised for other tests that clean up created directories with remove_all() on Windows (e.g. //src/ray/util/tests:event_test). Those tests are also skipped on Windows.

Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Co-authored-by: Nikita Vemuri <nikitavemuri@anyscale.com>
## Why are these changes needed?
This changes the counts of various metrics for ray data to rates for
those same metrics. This allows for easier visualization.

There are a variety of different approaches we can take here to
calculate the rates:
* Window: we can compute the rate over a time window of a specified
size; its relevance depends on the function chosen.
* Function:
  * `rate`: calculates the average per-second rate of increase of a time
series over a specified range. Less volatile; produces more smoothing
depending on the window.
  * `irate`: calculates the per-second rate of increase based on the two
most recent data points. More volatile and reactive, with less smoothing
applied, and not as sensitive to the window size because only the last
two data points are used.
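The difference between the two functions can be shown with a small computation over a counter time series of `(timestamp_seconds, value)` samples. This is an illustrative re-implementation of the semantics, not Prometheus code.

```python
def rate(samples):
    # Average per-second increase across the whole window: smooth.
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

def irate(samples):
    # Per-second increase from the last two points only: reactive.
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter that rises slowly, then spikes at the end of the window.
samples = [(0, 0), (10, 10), (20, 20), (30, 90)]
assert rate(samples) == 3.0    # (90 - 0) / 30: smoothed over the window
assert irate(samples) == 7.0   # (90 - 20) / 10: reflects only the spike
```

This is why `rate[30s]` was picked above: it smooths transient spikes that `irate` would amplify, while a long `rate[5m]` window would lag behind real changes.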
 
### Examples:
**Yellow: `rate[30s]`** (Have selected this as this seems to be the best
balance between the various options)
Orange: `irate[30s]`
Blue: `rate[5m]`

<img width="2688" alt="image"
src="https://github.com/user-attachments/assets/a2424318-e06a-4b32-8d8c-83fc1dfbb540">
 
#### Sample code
```python
import ray
import time

context = ray.init(include_dashboard=True)
print(context.dashboard_url)

def f(x):
    time.sleep(0.1)
    return x

def g(x):
    time.sleep(0.1)
    return x

ray.data.range(10000).map(f).materialize()
```


## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
…project#46875)

3 main changes for TrainRunInfo:
- Rename `STARTED` status to `RUNNING` status.
- Updated the status detail of `ABORTED` status.
- Add the stack trace of the failed worker, together with the failed
worker rank.
  - truncate the stack trace to less than 50,000 chars.
  - Added a new field for the error: `TrainRunInfo.run_error`

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>
Signed-off-by: woshiyyya <1085966850@qq.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Re: ray-project#47229

The previous PR to set up a default Serve logger had an unexpected
consequence. Combined with Serve's stdout redirect feature (when
`RAY_SERVE_LOG_TO_STDERR=0` is set in the env), it set up the default
Serve logger and redirected all stdout/stderr into Serve's log files
instead of the console. This made the Anyscale platform unable to
determine that the ray start command was running successfully, so it
could not start the cluster. This PR fixes the behavior by configuring
Serve's default logger with only a stream handler and skipping the file
handler altogether.

Signed-off-by: Gene Su <e870252314@gmail.com>
…ect#47604)

If an operator outputs empty blocks, then Ray Data thinks that the
operator has 256 MiB of pending task outputs, even though it should be
0. For example:
```python
import pyarrow as pa
output = pa.Table.from_pydict({"data": [None] * 128})
assert output.nbytes == 0, output.nbytes
```

The bug occurs because we check whether `average_bytes_per_output`
is truthy rather than whether it's not `None`.

https://github.com/ray-project/ray/blob/1f83fb44580e392ba6d39a9e79bbdd8cd5b7d916/python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py#L369-L371
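The difference between the truthiness check and the `None` check can be sketched as follows. Names and the fallback value are illustrative, not the actual `op_runtime_metrics.py` code.

```python
DEFAULT_ESTIMATE = 256 * 1024 * 1024  # 256 MiB fallback for unknown output size

def pending_output_bytes_buggy(num_pending, avg_bytes_per_output):
    if avg_bytes_per_output:              # 0 is falsy -> wrong branch taken
        return num_pending * avg_bytes_per_output
    return num_pending and DEFAULT_ESTIMATE

def pending_output_bytes_fixed(num_pending, avg_bytes_per_output):
    if avg_bytes_per_output is not None:  # only None means "unknown"
        return num_pending * avg_bytes_per_output
    return num_pending and DEFAULT_ESTIMATE

# An operator producing empty blocks legitimately averages 0 bytes/output:
assert pending_output_bytes_buggy(4, 0) == DEFAULT_ESTIMATE  # overestimates
assert pending_output_bytes_fixed(4, 0) == 0                 # correct
assert pending_output_bytes_fixed(4, None) == DEFAULT_ESTIMATE
```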
---

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…rkers. (ray-project#47212)

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…tiple refs or futures to allow clients to retrieve them one at a time (ray-project#46908) (ray-project#47305)

## Why are these changes needed?
Currently, if `MultiOutputNode` is used to wrap a DAG's output, you get
back a single `CompiledDAGRef` or `CompiledDAGFuture`, depending on
whether `execute` or `execute_async` is invoked, that points to a list
of all of the outputs. To retrieve one of the outputs, you have to get
and deserialize all of them at the same time.

This PR separates the output of `execute` and `execute_async` to a list
of `CompiledDAGRef` or `CompiledDAGFuture` when the output is wrapped by
`MultiOutputNode`. This is particularly useful for vLLM tensor
parallelism. Since all shards return the same results, we only need to
fetch result from one of the workers.
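The shape of the change can be illustrated with a toy using standard-library futures (this is not the Ray compiled-graph API): returning one future per output lets the caller retrieve a single shard's result without deserializing the rest.

```python
from concurrent.futures import ThreadPoolExecutor

def run_shards(pool, shard_inputs):
    # One future per output (the post-PR shape), instead of a single
    # future pointing at a list of all outputs.
    return [pool.submit(lambda x: x * 2, s) for s in shard_inputs]

with ThreadPoolExecutor() as pool:
    futures = run_shards(pool, [1, 2, 3])
    # Fetch just one shard; e.g. with vLLM tensor parallelism all shards
    # return the same result, so one is enough.
    first = futures[0].result()

assert first == 2
assert len(futures) == 3
```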

Closes ray-project#46908.

---------

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Signed-off-by: Jeffrey Wang <jeffrey31415926@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
…um_blocks" argument (ray-project#47559)

## Why are these changes needed?

As described in issue ray-project#47507, from_huggingface() does not support
override_num_blocks for non-streaming HF Datasets, so we should throw an
exception. We also need to pass other arguments through from_huggingface()
when a streaming dataset is used.
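The intended validation can be sketched as below. This is an illustrative stand-in, not Ray's actual `from_huggingface` implementation; the function name and return shape are hypothetical.

```python
def from_huggingface_sketch(dataset, streaming: bool,
                            override_num_blocks=None, **read_args):
    """Reject override_num_blocks for non-streaming datasets and forward
    extra read arguments in the streaming path."""
    if not streaming and override_num_blocks is not None:
        raise ValueError(
            "override_num_blocks is only supported for streaming datasets"
        )
    mode = "streaming" if streaming else "materialized"
    return {"mode": mode, "read_args": read_args}

# Streaming path: extra arguments are passed through.
out = from_huggingface_sketch("ds", streaming=True, override_num_blocks=8,
                              concurrency=4)
assert out == {"mode": "streaming", "read_args": {"concurrency": 4}}

# Non-streaming path: overriding the block count raises.
try:
    from_huggingface_sketch("ds", streaming=False, override_num_blocks=8)
    raised = False
except ValueError:
    raised = True
assert raised
```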

## Related issue number
Close ray-project#47507

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

I did a manual test of the exception path. Let me know if I need to do more
tests.

---------

Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
## Why are these changes needed?

Detect replica death earlier on handles/routers. Currently, routers
process replica death only if the actor death error is thrown during active
probing or a system message.
1. Cover one more case: process replica death if the error is thrown _while_
a request was being processed on the replica.
2. Improved handling: if the error is detected on the system message,
meaning the router found out the replica is dead after assigning a request to
that replica, retry the request.


### Performance evaluation
(master results pulled from
https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42)

Latency:
| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
| http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
| http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89
|
| grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
| grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
| grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22
|
| handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
| handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 |
-2.67 |
| handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 |
-1.96 |

Throughput:
| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_avg_rps | 359.14 | 357.81 | -0.37 |
| http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
| grpc_avg_rps | 506.16 | 485.92 | -4.0 |
| grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
| handle_avg_rps | 604.52 | 641.66 | 6.14 |
| handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |


Results: everything except the gRPC results is within noise. The
gRPC results have always been relatively noisy (see below), so they
are actually also within the noise we've been seeing. There
is also no reason why latency for a request would increase only for gRPC
and not HTTP or handle given the changes in this PR, so IMO this is safe.
![Screenshot 2024-08-21 at 11 54
55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378)


## Related issue number

closes ray-project#47219

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>