forked from ray-project/ray
Add backpressure reason #1
Open — Jay-ju wants to merge 125 commits into volcengine:master from Jay-ju:add_backpressure_reason (base: master)
Conversation
Rename the variables to be more accurate and easier to follow.
…ecker (ray-project#47138)
- Add missing Tune public API references
- Turn on the API policy int checker so all public APIs need to be documented going forward

Test:
- CI

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…oject#47373) Created by release automation bot. Update with commit 679989c Signed-off-by: kevin <kevin@anyscale.com>
…ject#47369) Add instructions for troubleshooting doc builds in the presence of `_raylet.so`. The presence of `_raylet.so` causes Sphinx to fail, since it tries to import the file for real instead of parsing the docstrings from the source code. Test: - CI Signed-off-by: can <can@anyscale.com>
This PR fixes part of the problem by creating the payload message once and reusing it throughout the benchmark. Ran the release test on this change [build](https://buildkite.com/ray-project/release/builds/21663#01918fe1-853b-46f2-9699-c4045b182b8c); `grpc_10mb_p50_latency` dropped to ~58ms from ~80ms previously. The rest of the issue comes from the existing gRPC server implementation, which must wait for the entire unary request before it can continue its work on the replica. We will need to create a new HTTP/2 proxy and pass the request transparently between the replica and the proxy to speed things up. Will follow up in the future on ray-project#47370. Closes ray-project#47371 Signed-off-by: Gene Su <e870252314@gmail.com>
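The optimization above can be sketched as follows. This is a simplified stand-in, not the benchmark's actual code: `make_payload`, `send`, and the sizes are hypothetical names; the point is that the payload is built once, outside the request loop, instead of once per request.

```python
import time

def make_payload(num_bytes):
    # Hypothetical stand-in for building the (expensive) request payload.
    return b"x" * num_bytes

def p50_latency(send, num_requests=100, payload_size=10 * 1024 * 1024):
    payload = make_payload(payload_size)  # built once, reused for every request
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        send(payload)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return latencies[len(latencies) // 2]  # p50
```

With the payload hoisted out of the loop, each iteration measures only the send path, which is what dropped the reported p50.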
…tentional worker failures (ray-project#47368) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
## Why are these changes needed?
To reduce flakiness of the `multi_node_train_benchmark.py` release tests (e.g. `read_images_train_1_gpu_5_cpu.aws`) caused by `AWS Error ACCESS_DENIED`, we get the filesystem object using `boto3` and pass it to `read_parquet()` instead. We suspect the root cause of the AWS error stems from `pyarrow.fs`, but need to confirm. See below for two release test runs without such errors.

## Related issue number
Closes ray-project#47337

## Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [x] Release tests
    - https://buildkite.com/ray-project/release/builds/21708
    - https://buildkite.com/ray-project/release/builds/21742
  - [ ] This PR is not tested :(

---

Signed-off-by: Scott Lee <sjl@anyscale.com>
Write node events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false. Event write is called whenever a value in the node event data schema is modified. Typically this occurs in the callback after writing NodeTable to the GCS table
Some Ray IDs have those fields and some don't. Unify them with a list of methods and test. Note ObjectID is not in the list. Signed-off-by: Ruiyang Wang <rywang014@gmail.com> Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…d new example script. (ray-project#47362)
Like actor_head.py, we now update DataSource.nodes on delta. It first queries all node infos, then subscribes to node deltas. Each delta updates:
1. DataSource.nodes[node_id]
2. DataSource.agents[node_id]
3. a warning generated after RAY_DASHBOARD_HEAD_NODE_REGISTRATION_TIMEOUT = 10s

Note on (2) agents: the agent info is read from internal KV, and is not readily available until agent.py is spawned and writes its own port to internal KV. So we make an async task for each node to poll this port every 1s. It turns out the get-all-then-subscribe code has a TOCTOU problem, so actor_head.py was also updated to first subscribe and then get all actors. Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
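The subscribe-then-get-all ordering can be sketched like this. The class and method names are hypothetical, not the dashboard's actual code; the idea is that deltas arriving before the snapshot are buffered and replayed on top of it, so no update that races the initial fetch is lost:

```python
class ActorCache:
    """Subscribe *before* the initial snapshot to avoid the TOCTOU race."""

    def __init__(self):
        self.actors = {}
        self._buffered_deltas = []
        self._snapshot_applied = False

    def on_delta(self, actor_id, info):
        if not self._snapshot_applied:
            # Snapshot not in yet: hold the delta so it can't be overwritten.
            self._buffered_deltas.append((actor_id, info))
        else:
            self.actors[actor_id] = info

    def apply_snapshot(self, snapshot):
        self.actors.update(snapshot)
        # Replay buffered deltas: they are newer than the snapshot, so they win.
        for actor_id, info in self._buffered_deltas:
            self.actors[actor_id] = info
        self._buffered_deltas.clear()
        self._snapshot_applied = True
```

In the get-all-then-subscribe order, a delta arriving between the fetch and the subscription would simply be missed; this ordering makes the race harmless.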
…nd float16 inference) through new example script. (ray-project#47116)
…7188) The `test_actor_retry` tests are failing/flaky on Windows. They pass locally. I have not been able to access the CI logs to see what is going wrong. To shrink the problem (is it an overall timeout? Is one of the tests failing?), we can start by splitting the tests into two files. Toward solving ray-project#43845.
…new stack Offline RL. (ray-project#47359)
Redeploy in between each microbenchmark. --------- Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
…ay-project#47393) In each Ray Data scheduling step, Ray Data launches tasks until it can't launch any more. To ensure that the progress bar updates frequently, ray-project#33952 made it so that Ray Data could only launch 50 tasks per scheduling step. However, this caused performance issues for large workloads. This PR fixes the issue by removing the limit on the number of tasks launched while still updating the progress bar frequently.
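A minimal sketch of the changed scheduling loop, with hypothetical callback names (`resources_available`, `launch_task`, `refresh_progress_bar`) standing in for the real scheduler internals: tasks are launched until resources run out rather than stopping at an artificial per-step cap, and the progress bar is refreshed once per step.

```python
def run_scheduling_step(resources_available, launch_task, refresh_progress_bar):
    # Launch as many tasks as resources allow -- no fixed 50-task cap --
    # then refresh the progress bar once at the end of the step.
    launched = 0
    while resources_available():
        launch_task()
        launched += 1
    refresh_progress_bar(launched)
    return launched
```

Because steps themselves recur frequently, the bar still updates often even though the per-step launch count is unbounded.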
Small PR to add a missing quotation mark in the documentation. I skipped the checks below since they seem unnecessary. Signed-off-by: Rio McMahon <48159564+rmcsqrd@users.noreply.github.com>
…tally (ray-project#47372) Signed-off-by: khluu <51931015+khluu@users.noreply.github.com>
Currently we have no linting on any part of the docs code. This PR runs pre-commit on the cluster docs. This PR fixes the following issues:
```
trim trailing whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook
Fixing doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md
Fixing doc/source/cluster/running-applications/job-submission/cli.rst
Fixing doc/source/cluster/configure-manage-dashboard.md
Fixing doc/source/cluster/kubernetes/user-guides/pod-security.md
Fixing doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md
Fixing doc/source/cluster/kubernetes/user-guides/helm-chart-rbac.md
Fixing doc/source/cluster/vms/references/ray-cluster-configuration.rst
Fixing doc/source/cluster/running-applications/job-submission/quickstart.rst
Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md
Fixing doc/source/cluster/kubernetes/getting-started/raycluster-quick-start.md
Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/ingress.md
Fixing doc/source/cluster/kubernetes/user-guides/kuberay-gcs-ft.md
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/pyspy.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/volcano.md
Fixing doc/source/cluster/running-applications/job-submission/sdk.rst
Fixing doc/source/cluster/running-applications/job-submission/ray-client.rst
Fixing doc/source/cluster/kubernetes/troubleshooting/troubleshooting.md
Fixing doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md
Fixing doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml
Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml
Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md
Fixing doc/source/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.md
Fixing doc/source/cluster/kubernetes/user-guides/gke-gcs-bucket.md
Fixing doc/source/cluster/kubernetes/user-guides/logging.md
Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md
Fixing doc/source/cluster/kubernetes/examples/rayjob-batch-inference-example.md
Fixing doc/source/cluster/metrics.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kubeflow.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kueue.md
Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.md
Fixing doc/source/cluster/faq.rst
Fixing doc/source/cluster/running-applications/job-submission/openapi.yml
Fixing doc/source/cluster/kubernetes/user-guides/configuring-autoscaling.md
Fixing doc/source/cluster/kubernetes/getting-started/rayservice-quick-start.md
Fixing doc/source/cluster/kubernetes/user-guides/static-ray-cluster-without-kuberay.md
Fixing doc/source/cluster/kubernetes/user-guides/config.md
Fixing doc/source/cluster/kubernetes/user-guides/pod-command.md

fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook
Fixing doc/source/cluster/kubernetes/images/rbac-clusterrole.svg
Fixing doc/source/cluster/running-applications/job-submission/cli.rst
Fixing doc/source/cluster/vms/user-guides/community/slurm.rst
Fixing doc/source/cluster/kubernetes/benchmarks/memory-scalability-benchmark.md
Fixing doc/source/cluster/images/ray-job-diagram.svg
Fixing doc/source/cluster/kubernetes/user-guides/observability.md
Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml
Fixing doc/source/cluster/kubernetes/images/rbac-role-one-namespace.svg
Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md
Fixing doc/source/cluster/cli.rst
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml
Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md
Fixing doc/source/cluster/kubernetes/user-guides/logging.md
Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md
Fixing doc/source/cluster/kubernetes/images/rbac-role-multi-namespaces.svg
Fixing doc/source/cluster/kubernetes/images/kubeflow-architecture.svg
Fixing doc/source/cluster/faq.rst
Fixing doc/source/cluster/running-applications/job-submission/openapi.yml
Fixing doc/source/cluster/kubernetes/images/AutoscalerOperator.svg

check for added large files..............................................Passed
check python ast.........................................................Passed
check json...........................................(no files to check)Skipped
check toml...........................................(no files to check)Skipped
black....................................................................Passed
flake8...................................................................Passed
prettier.............................................(no files to check)Skipped
mypy.................................................(no files to check)Skipped
isort (python)...........................................................Passed
rst directives end with two colons.......................................Passed
rst ``inline code`` next to normal text..................................Passed
use logger.warning(......................................................Passed
check for not-real mock methods..........................................Passed
ShellCheck v0.9.0........................................................Passed
clang-format.........................................(no files to check)Skipped
Google Java Formatter................................(no files to check)Skipped
Check for Ray docstyle violations........................................Passed
Check for Ray import order violations....................................Passed
```
Signed-off-by: pdmurray <peynmurray@gmail.com>
In ray_experimental_perf.py, some DAGs call experimental_compile twice. Therefore, the first compiled node will become a leaf node. For example:

```python
a = DAGActor.remote()
with InputNode() as inp:
    dag = a.echo.bind(inp)
...
compiled_dag_1 = dag.experimental_compile()
# driver -> a.echo -> compiled_dag_1 (output)
...
compiled_dag.teardown()
...
compiled_dag_2 = dag.experimental_compile(enable_asyncio=True)
# driver -> a.echo -> compiled_dag_1
#               |
#               |-> compiled_dag_2 (output)
```

In this case, a.echo has two downstream nodes, compiled_dag_1 and compiled_dag_2. However, a DAG node being read by both output nodes and non-output nodes simultaneously is not supported.
## Why are these changes needed?
close ray-project#46962

## Related issue number
ray-project#46962

## Checks
- [√] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [√] I've run `scripts/format.sh` to lint the changes in this PR.
- [√] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [√] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
…ating to new API stack. (ray-project#47425)
…nector piece). Fix: "State-connector" would use `seq_len=20`. (ray-project#47401)
Uses a ThreadPoolExecutor to handle all proto->json conversions. Semantically unchanged; mostly this just indents the code into a function and calls it via the pool. Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
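The shape of that change can be sketched as below. `message_to_json` is a hypothetical stand-in for the actual proto-to-JSON conversion; the point is that the per-message work is factored into a function and fanned out via `concurrent.futures.ThreadPoolExecutor`, with `Executor.map` preserving input order.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def message_to_json(message):
    # Stand-in for the real proto->JSON conversion done per message.
    return json.dumps(message, sort_keys=True)

def convert_all(messages, max_workers=8):
    # Fan the per-message conversions out across a thread pool.
    # Executor.map yields results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(message_to_json, messages))
```

Because each conversion is independent, offloading them keeps the caller (e.g. an event loop) free while the pool does the serialization work.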
)
## Why are these changes needed?
When you specify `arrow_parquet_args`, those arguments should be passed to PyArrow so that you can configure your writes. However, we weren't actually passing `arrow_parquet_args` to PyArrow. This PR fixes that bug.

## Related issue number
Fixes ray-project#47153

## Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---

Signed-off-by: Scott Lee <sjl@anyscale.com>
Co-authored-by: Scott Lee <sjl@anyscale.com>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
This PR supports multiple readers across multiple nodes. It also adds tests that the feature works with large gRPC payloads and buffer resizing. Multiple readers on multiple nodes didn't work because the code only allowed registering one remote reader reference on one specific node. This PR fixes the issue by allowing remote reader references to be registered on multiple nodes.
…7533) When a Serve app is launched, Serve will start up automatically. In certain places like k8s, it can be difficult to preconfigure Serve (e.g. in the [ray-cluster helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml) there is no ability to set the default Serve arguments). This means you need to either be explicit when you start Serve, or, if it starts up automatically, you may need to shut it down and then restart it, which is inconvenient. Signed-off-by: Tim Paine <Timothy.Paine@CubistSystematic.com>
…ject#47592) Style nits. Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Write Driver Job events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false. Event write is called whenever a job table data value is modified. Typically this occurs before writing JobTableData to the GCS table
…#47492) The GCS API GetAllJobInfo serves Dashboard APIs, even for only one job. This becomes slow when the number of jobs is high. This PR pushes the job filter down to GCS to reduce the Dashboard workload. This API is somewhat strange because the filter `job_or_submission_id` is actually either a Job ID or a job_submission_id. We don't have an index on the latter, and some jobs don't have one, so we still GetAll from Redis, then filter by both IDs before doing more RPC calls. --------- Signed-off-by: Ruiyang Wang <rywang014@gmail.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
…servability doc (ray-project#47462) Signed-off-by: Rueian <rueiancsie@gmail.com>
…roject#47038) Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
It was saying: ``` /home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/modules/job/utils.py:56: SyntaxWarning: invalid escape sequence '\/' ``` Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
The ranks should be in the order of the actors.
Fix type hint for _build_execution_schedule()
_default_metadata_providers adds a layer of indirection. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Went through all the constants in the file and remove the ones that's no Signed-off-by: Gene Su <e870252314@gmail.com>
Move TestWriteTaskExportEvents to a separate file and skip it on Windows. This is OK for the export API feature because we currently aren't supporting Windows (tests for other resource events written from GCS are also skipped on Windows). This test is failing in postmerge (CI test windows://:task_event_buffer_test is consistently failing, ray-project#47523) on Windows due to `unknown file: error: C++ exception with description "remove_all: The process cannot access the file because it is being used by another process.: "event_123"" thrown in TearDown().` in the tear-down step. This is the same error raised for other tests that clean up created directories with remove_all() on Windows (e.g. //src/ray/util/tests:event_test). Those tests are also skipped on Windows. Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com> Co-authored-by: Nikita Vemuri <nikitavemuri@anyscale.com>
…g their own loss (algo independent). (ray-project#47581)
## Why are these changes needed?
This changes the counts of various metrics for Ray Data to rates for those same metrics, which allows for easier visualization. There are a variety of approaches we can take to calculate the rates:

* Window of rate: we can compute the rate over a time window of specified size; the relevance of this depends on the function chosen.
* Function:
  * `rate`: calculates the average per-second rate of increase of a time series over a specified range. Less volatile; produces more smoothing depending on the window.
  * `irate`: calculates the per-second rate of increase based on the two most recent data points. More volatile and reactive, with less smoothing applied; not as sensitive to the window size because only the last two data points are used.

### Examples:
**Yellow: `rate[30s]`** (selected as this seems to be the best balance between the various options)
Orange: `irate[30s]`
Blue: `rate[5m]`
<img width="2688" alt="image" src="https://github.com/user-attachments/assets/a2424318-e06a-4b32-8d8c-83fc1dfbb540">

#### Sample code
```python
import ray
import time

context = ray.init(include_dashboard=True)
print(context.dashboard_url)

def f(x):
    time.sleep(0.1)
    return x

def g(x):
    time.sleep(0.1)
    return x

ray.data.range(10000).map(f).materialize()
```

---

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
…project#46875) 3 main changes for TrainRunInfo:
- Rename `STARTED` status to `RUNNING` status.
- Updated the status detail of `ABORTED` status.
- Add the stack trace of the failed worker, together with the failed worker rank.
  - Truncate the stack trace to less than 50,000 chars.
- Added a new field for the error: `TrainRunInfo.run_error`

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>
Signed-off-by: woshiyyya <1085966850@qq.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Re: ray-project#47229. The previous PR to set up a default Serve logger had an unexpected consequence. Combined with Serve's stdout redirect feature (when `RAY_SERVE_LOG_TO_STDERR=0` is set in the env), it would set up the default Serve logger and redirect all stdout/stderr into Serve's log files instead of the console. On the Anyscale platform, this made it impossible to tell that the ray start command was running successfully, so the cluster could not start. This PR fixes the behavior by configuring Serve's default logger with only a stream handler and skipping the file handler altogether. Signed-off-by: Gene Su <e870252314@gmail.com>
…ect#47604) If an operator outputs empty blocks, then Ray Data thinks that the operator has 256 MiB of pending task outputs, even though it should be 0. For example: ```python import pyarrow as pa output = pa.Table.from_pydict({"data": [None] * 128}) assert output.nbytes == 0, output.nbytes ``` The reason for the bug is that we check whether `average_bytes_per_output` is truthy rather than whether it's not `None`. https://github.com/ray-project/ray/blob/1f83fb44580e392ba6d39a9e79bbdd8cd5b7d916/python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py#L369-L371 --- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
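The truthiness-vs-None distinction can be illustrated with a small sketch (hypothetical function and constant names, not the actual `op_runtime_metrics.py` code): an average of `0` bytes per output is a legitimate measurement and must not trigger the fallback estimate, which only applies when the average is genuinely unknown.

```python
DEFAULT_BYTES_PER_OUTPUT = 256 * 1024 ** 2  # 256 MiB fallback estimate

def estimated_pending_output_bytes(average_bytes_per_output, num_pending_outputs):
    # Buggy form: `if not average_bytes_per_output:` also fires when the
    # average is 0 (all-empty blocks), inflating the estimate to 256 MiB.
    # Fixed form: fall back only when the average is unknown (None).
    if average_bytes_per_output is None:
        average_bytes_per_output = DEFAULT_BYTES_PER_OUTPUT
    return average_bytes_per_output * num_pending_outputs
```

This is a common Python pitfall: `0`, `0.0`, `""`, and `None` are all falsy, so explicit `is None` checks are needed whenever zero is a valid value.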
…rkers. (ray-project#47212) Signed-off-by: Ruiyang Wang <rywang014@gmail.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…tiple refs or futures to allow clients to retrieve them one at a time (ray-project#46908) (ray-project#47305) ## Why are these changes needed? Currently, if `MultiOutputNode` is used to wrap a DAG's output, you get back a single `CompiledDAGRef` or `CompiledDAGFuture`, depending on whether `execute` or `execute_async` is invoked, that points to a list of all of the outputs. To retrieve one of the outputs, you have to get and deserialize all of them at the same time. This PR separates the output of `execute` and `execute_async` to a list of `CompiledDAGRef` or `CompiledDAGFuture` when the output is wrapped by `MultiOutputNode`. This is particularly useful for vLLM tensor parallelism. Since all shards return the same results, we only need to fetch result from one of the workers. Closes ray-project#46908. --------- Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com> Signed-off-by: Jeffrey Wang <jeffrey31415926@gmail.com> Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
…um_blocks" argument (ray-project#47559)

## Why are these changes needed?
As in issue ray-project#47507, from_huggingface() does not support override_num_blocks for non-streaming HF Datasets, so we should throw an exception; we also need to pass other arguments to from_huggingface() when a streaming dataset is used.

## Related issue number
Close ray-project#47507

## Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

I did a manual test on the exception part. Let me know if I need to do more tests.

---

Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
## Why are these changes needed?
Detect replica death earlier on handles/routers. Currently, routers process replica death if the actor death error is thrown during active probing or a system message.
1. Cover one more case: process replica death if the error is thrown _while_ the request was being processed on the replica.
2. Improved handling: if the error is detected on the system message, meaning the router found out the replica is dead after assigning a request to that replica, retry the request.

### Performance evaluation
(master results pulled from https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42)

Latency:
| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
| http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
| http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89 |
| grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
| grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
| grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22 |
| handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
| handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 | -2.67 |
| handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 | -1.96 |

Throughput:
| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_avg_rps | 359.14 | 357.81 | -0.37 |
| http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
| grpc_avg_rps | 506.16 | 485.92 | -4.0 |
| grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
| handle_avg_rps | 604.52 | 641.66 | 6.14 |
| handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |

Results: everything except the grpc results is within noise. As for the grpc results, they have always been relatively noisy (see below), so these results are also within the noise that we've been seeing.


There is also no reason why latency for a request would increase only for grpc and not http or handle with the changes in this PR, so IMO this is safe.


## Related issue number
closes ray-project#47219

---

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
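The retry-on-replica-death behavior described in point 2 above can be sketched as follows. This is a simplified illustration with hypothetical names (`ReplicaDiedError`, `assign_with_retry`, the `replicas` pool interface), not Serve's actual router code: when the error surfaces after a request was already assigned, the router marks the replica dead and reassigns rather than failing the request.

```python
class ReplicaDiedError(Exception):
    """Raised when a response reveals the assigned replica has died."""

def assign_with_retry(replicas, send, max_tries=3):
    # Try up to max_tries replicas; on replica death, mark it dead and
    # retry the request on another replica instead of surfacing the error.
    last_err = None
    for _ in range(max_tries):
        replica = replicas.choose()
        try:
            return send(replica)
        except ReplicaDiedError as err:
            replicas.mark_dead(replica)
            last_err = err
    raise last_err
```

The request only fails if every attempted replica turns out to be dead, which matches the improved handling the PR describes.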