
Conversation


@Jay-ju Jay-ju commented Sep 12, 2024

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ruisearch42 and others added 30 commits August 27, 2024 08:16
Rename the vars to be more accurate and easier to follow.
…ecker (ray-project#47138)

- Add missing tune public API references 
- Turn on the API policy lint checker so that all public APIs need to be
documented going forward

Test:
- CI

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…oject#47373)

Created by release automation bot.

Update with commit 679989c

Signed-off-by: kevin <kevin@anyscale.com>
…ject#47369)

Add instructions for troubleshooting doc builds in the presence of
`_raylet.so`. The presence of `_raylet.so` causes Sphinx to fail,
since it tries to actually import the file instead of parsing the
docstrings from the source code.

Test:
- CI

Signed-off-by: can <can@anyscale.com>
This PR fixes part of the problem by creating the payload message once
and reusing it throughout the benchmark.

Ran the release test on this change
[build](https://buildkite.com/ray-project/release/builds/21663#01918fe1-853b-46f2-9699-c4045b182b8c)
now seeing the `grpc_10mb_p50_latency` now dropped to ~58ms from ~80ms
previously.
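The optimization above can be sketched in a few lines: build the payload once outside the loop and reuse it on every request, instead of paying the construction cost per request. Names here are illustrative, not the benchmark's actual code.

```python
def make_payload(size_bytes: int) -> bytes:
    # Stand-in for an expensive message construction (e.g. a 10 MB gRPC payload).
    return b"x" * size_bytes

# Before: payload rebuilt on every request, paying the construction cost each time.
def bench_rebuild(n_requests: int, size_bytes: int) -> list:
    return [make_payload(size_bytes) for _ in range(n_requests)]

# After: payload built once and reused, so each iteration only sends it.
def bench_reuse(n_requests: int, size_bytes: int) -> list:
    payload = make_payload(size_bytes)
    return [payload for _ in range(n_requests)]

reused = bench_reuse(3, 1024)
# All iterations share the same object; no per-request allocation.
assert all(p is reused[0] for p in reused)
```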

The rest of the issue stems from the existing gRPC server implementation,
which must wait for the entirety of the unary request before it can
continue its work on the replica. We will need to create a new HTTP/2
proxy and pass the request transparently between the replica and the
proxy to speed things up. Will follow up in the future on
ray-project#47370

Closes ray-project#47371

Signed-off-by: Gene Su <e870252314@gmail.com>
…tentional worker failures (ray-project#47368)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
## Why are these changes needed?

To reduce flakiness of the `multi_node_train_benchmark.py` release tests
(e.g. `read_images_train_1_gpu_5_cpu.aws`) caused by `AWS Error
ACCESS_DENIED`, we get the filesystem object using `boto3` and pass it to
`read_parquet()` instead. We suspect the root cause of the AWS error
stems from `pyarrow.fs`, but need to confirm. See below for two release
test runs without such errors.

## Related issue number
Closes ray-project#47337

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [x] Release tests
     - https://buildkite.com/ray-project/release/builds/21708
     - https://buildkite.com/ray-project/release/builds/21742
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Write node events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
Event write is called whenever a value in the node event data schema is modified. Typically this occurs in the callback after writing NodeTable to the GCS table
Some Ray ID types have these fields and some don't. Unify them with a shared
list of methods and a test. Note that ObjectID is not in the list.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Like actor_head.py, we now update DataSource.nodes on delta. It first
queries all node infos, then subscribes to node deltas. Each delta updates:

1. DataSource.nodes[node_id]
2. DataSource.agents[node_id]
3. a warning generated after
RAY_DASHBOARD_HEAD_NODE_REGISTRATION_TIMEOUT = 10s

Note on (2) agents: it's read from internal kv, and is not readily
available until the agent.py is spawned and writes its own port to
internal kv. So we make an async task for each node to poll this port
every 1s.

It turns out that the get-all-then-subscribe code has a TOCTOU problem, so
actor_head.py is also updated to first subscribe and then get all actors.
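The TOCTOU race can be illustrated with a toy event bus: an update that arrives between "get all" and "subscribe" is silently lost, while subscribing first captures it. All names below are illustrative, not the dashboard's actual code.

```python
class Bus:
    """Toy pub/sub store: a key-value state plus subscriber queues."""

    def __init__(self):
        self.state = {}
        self.subscribers = []

    def publish(self, key, value):
        self.state[key] = value
        for q in self.subscribers:
            q.append((key, value))

    def subscribe(self):
        q = []
        self.subscribers.append(q)
        return q

def get_all_then_subscribe(bus):
    snapshot = dict(bus.state)   # 1. get all
    bus.publish("a", 2)          # ...a racing update slips in here...
    q = bus.subscribe()          # 2. subscribe: queue is empty, update lost
    snapshot.update(dict(q))
    return snapshot

def subscribe_then_get_all(bus):
    q = bus.subscribe()          # 1. subscribe first...
    bus.publish("a", 2)          # ...so the racing update lands in the queue
    snapshot = dict(bus.state)   # 2. then get all
    snapshot.update(dict(q))     # replaying the queue is harmless here
    return snapshot

bus1 = Bus(); bus1.publish("a", 1)
bus2 = Bus(); bus2.publish("a", 1)
assert get_all_then_subscribe(bus1) == {"a": 1}   # stale: missed the update
assert subscribe_then_get_all(bus2) == {"a": 2}   # correct
```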

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…7188)

The `test_actor_retry` tests are failing/flaky on Windows. They pass locally. I have not been able to access the CI logs to see what is going wrong. In order to shrink the problem (is it an overall timeout? Is one of the tests failing?) we can start by splitting the tests into two files.

Toward solving ray-project#43845.
Redeploy in between each microbenchmark.

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
…ay-project#47393)

In each Ray Data scheduling step, Ray Data launches tasks until it can't launch any more. To ensure that the progress bar updates frequently, ray-project#33952 made it such that Ray Data can only launch 50 tasks per scheduling step. However, this caused performance issues for large workloads.

This PR fixes the issue by removing the limit on the number of tasks launched per step while still updating the progress bar frequently.
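The scheduling loop described above can be sketched as follows: launch until resources are exhausted rather than stopping at an artificial 50-task cap, refreshing the progress bar on each launch so updates stay frequent. The names here are hypothetical, not Ray Data's actual internals.

```python
def scheduling_step(pending: list, resource_slots: int, refresh_bar) -> list:
    """Launch as many pending tasks as resources allow (no fixed 50-task cap),
    refreshing the progress bar after every launch."""
    launched = []
    while pending and len(launched) < resource_slots:
        launched.append(pending.pop(0))
        refresh_bar(len(launched))  # bar still updates on every launch
    return launched

bar_updates = []
pending = list(range(120))
launched = scheduling_step(pending, resource_slots=200,
                           refresh_bar=bar_updates.append)
assert len(launched) == 120      # not stopped at 50
assert bar_updates[-1] == 120    # progress bar saw every launch
```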
Small PR to add missing quotation mark in the documentations. I skipped
the checks below since they seem unnecessary.

Signed-off-by: Rio McMahon <48159564+rmcsqrd@users.noreply.github.com>
…tally (ray-project#47372)

Signed-off-by: khluu <51931015+khluu@users.noreply.github.com>
Currently we have no linting on any part of the docs code. This PR runs
pre-commit on the cluster docs.

This PR fixes the following issues:

```
trim trailing whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md
Fixing doc/source/cluster/running-applications/job-submission/cli.rst
Fixing doc/source/cluster/configure-manage-dashboard.md
Fixing doc/source/cluster/kubernetes/user-guides/pod-security.md
Fixing doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md
Fixing doc/source/cluster/kubernetes/user-guides/helm-chart-rbac.md
Fixing doc/source/cluster/vms/references/ray-cluster-configuration.rst
Fixing doc/source/cluster/running-applications/job-submission/quickstart.rst
Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md
Fixing doc/source/cluster/kubernetes/getting-started/raycluster-quick-start.md
Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/ingress.md
Fixing doc/source/cluster/kubernetes/user-guides/kuberay-gcs-ft.md
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/pyspy.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/volcano.md
Fixing doc/source/cluster/running-applications/job-submission/sdk.rst
Fixing doc/source/cluster/running-applications/job-submission/ray-client.rst
Fixing doc/source/cluster/kubernetes/troubleshooting/troubleshooting.md
Fixing doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md
Fixing doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml
Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml
Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md
Fixing doc/source/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.md
Fixing doc/source/cluster/kubernetes/user-guides/gke-gcs-bucket.md
Fixing doc/source/cluster/kubernetes/user-guides/logging.md
Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md
Fixing doc/source/cluster/kubernetes/examples/rayjob-batch-inference-example.md
Fixing doc/source/cluster/metrics.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kubeflow.md
Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kueue.md
Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.md
Fixing doc/source/cluster/faq.rst
Fixing doc/source/cluster/running-applications/job-submission/openapi.yml
Fixing doc/source/cluster/kubernetes/user-guides/configuring-autoscaling.md
Fixing doc/source/cluster/kubernetes/getting-started/rayservice-quick-start.md
Fixing doc/source/cluster/kubernetes/user-guides/static-ray-cluster-without-kuberay.md
Fixing doc/source/cluster/kubernetes/user-guides/config.md
Fixing doc/source/cluster/kubernetes/user-guides/pod-command.md

fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing doc/source/cluster/kubernetes/images/rbac-clusterrole.svg
Fixing doc/source/cluster/running-applications/job-submission/cli.rst
Fixing doc/source/cluster/vms/user-guides/community/slurm.rst
Fixing doc/source/cluster/kubernetes/benchmarks/memory-scalability-benchmark.md
Fixing doc/source/cluster/images/ray-job-diagram.svg
Fixing doc/source/cluster/kubernetes/user-guides/observability.md
Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml
Fixing doc/source/cluster/kubernetes/images/rbac-role-one-namespace.svg
Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md
Fixing doc/source/cluster/cli.rst
Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml
Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md
Fixing doc/source/cluster/kubernetes/user-guides/logging.md
Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md
Fixing doc/source/cluster/kubernetes/images/rbac-role-multi-namespaces.svg
Fixing doc/source/cluster/kubernetes/images/kubeflow-architecture.svg
Fixing doc/source/cluster/faq.rst
Fixing doc/source/cluster/running-applications/job-submission/openapi.yml
Fixing doc/source/cluster/kubernetes/images/AutoscalerOperator.svg

check for added large files..............................................Passed
check python ast.........................................................Passed
check json...........................................(no files to check)Skipped
check toml...........................................(no files to check)Skipped
black....................................................................Passed
flake8...................................................................Passed
prettier.............................................(no files to check)Skipped
mypy.................................................(no files to check)Skipped
isort (python)...........................................................Passed
rst directives end with two colons.......................................Passed
rst ``inline code`` next to normal text..................................Passed
use logger.warning(......................................................Passed
check for not-real mock methods..........................................Passed
ShellCheck v0.9.0........................................................Passed
clang-format.........................................(no files to check)Skipped
Google Java Formatter................................(no files to check)Skipped
Check for Ray docstyle violations........................................Passed
Check for Ray import order violations....................................Passed
```

Signed-off-by: pdmurray <peynmurray@gmail.com>
In `ray_experimental_perf.py`, some DAGs call `experimental_compile` twice, so the first compiled node becomes a leaf node. For example,

```python
a = DAGActor.remote()
with InputNode() as inp:
    dag = a.echo.bind(inp)
...
compiled_dag_1 = dag.experimental_compile()
# driver -> a.echo -> compiled_dag_1 (output)
...
compiled_dag_1.teardown()
...
compiled_dag_2 = dag.experimental_compile(enable_asyncio=True)
# driver -> a.echo -> compiled_dag_1
#                  |
#                  |-> compiled_dag_2 (output)
```

In this case, `a.echo` has two downstream nodes, `compiled_dag_1` and `compiled_dag_2`. However, a DAG node being read by both output nodes and non-output nodes simultaneously is not supported.
## Why are these changes needed?

close ray-project#46962
## Related issue number

ray-project#46962
## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
…nector piece). Fix: "State-connector" would use `seq_len=20`. (ray-project#47401)
Uses a ThreadPoolExecutor to handle all proto->json conversions.
Semantically unchanged; mostly this just moves code into a function
and calls it in the executor.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
)

## Why are these changes needed?
When you specify `arrow_parquet_args`, those arguments should be passed
to PyArrow so that you can configure your writes. However, we weren't
actually passing `arrow_parquet_args` to PyArrow. This PR fixes that
bug.

## Related issue number
Fixes ray-project#47153

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Co-authored-by: Scott Lee <sjl@anyscale.com>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
ruisearch42 and others added 30 commits September 10, 2024 08:46
This PR supports multiple readers on multiple nodes. It also adds tests that the feature works with large gRPC payloads and buffer resizing.

Multiple readers on multiple nodes didn't work because the code only allowed registering one remote reader reference on one specific node. This PR fixes the issue by allowing remote reader references to be registered on multiple nodes.
…7533)

When a serve app is launched, serve will startup automatically. In
certain places like k8s, it can be difficult to preconfigure serve (e.g.
in the [ray-cluster helm
chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml)
there is no ability to set the default serve arguments).

This means you either need to be explicit when you start Serve, or, if
it starts up automatically, you may need to shut it down and restart it,
which is inconvenient.

Signed-off-by: Tim Paine <Timothy.Paine@CubistSystematic.com>
…ject#47592)

Style nits.

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Write Driver Job events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
Event write is called whenever a job table data value is modified. Typically this occurs before writing JobTableData to the GCS table
…#47492)

The GCS API GetAllJobInfo serves Dashboard APIs, even for a single job. This becomes slow when the number of jobs is high. This PR pushes the job filter down to GCS to reduce the Dashboard's workload.

This API is somewhat strange because the filter `job_or_submission_id` is actually either a Job ID or a job_submission_id. We don't have an index on the latter, and some jobs don't have one. So we still GetAll from Redis, then filter by both IDs before making further RPC calls.

---------

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
…servability doc (ray-project#47462)

Signed-off-by: Rueian <rueiancsie@gmail.com>
It was saying:

```
/home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/modules/job/utils.py:56: SyntaxWarning: invalid escape sequence '\/'
```

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
The ranks should be in the order of the actors.
_default_metadata_providers adds a layer of indirection.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Went through all the constants in the file and removed the ones that are no longer used.

Signed-off-by: Gene Su <e870252314@gmail.com>
Move TestWriteTaskExportEvents to a separate file and skip it on Windows. This is acceptable for the export API feature because we currently don't support it on Windows (tests for other resource events written from GCS are also skipped on Windows).
This test is failing in postmerge (CI test windows://:task_event_buffer_test is consistently_failing, ray-project#47523) on Windows due to `unknown file: error: C++ exception with description "remove_all: The process cannot access the file because it is being used by another process.: "event_123"" thrown in TearDown()` in the teardown step.
This is the same error raised for other tests that clean up created directories with remove_all() on Windows (e.g. //src/ray/util/tests:event_test). Those tests are also skipped on Windows.

Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Co-authored-by: Nikita Vemuri <nikitavemuri@anyscale.com>
## Why are these changes needed?
This changes the counts of various metrics for ray data to rates for
those same metrics. This allows for easier visualization.

There are a variety of different approaches we can take here to
calculate the rates:
* Window: we can compute the rate over a time window of a specified
size; its relevance depends on the function chosen.
* Function:
  * `rate`: calculates the average per-second rate of increase of a time
series over a specified range. Less volatile; produces more smoothing
depending on the window.
  * `irate`: calculates the per-second rate of increase based on the two
most recent data points. More volatile and reactive, with less smoothing
applied, and not as sensitive to the window size because only the last
two data points are used.
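The difference between the two functions can be shown with a small computation over a counter time series of `(timestamp_seconds, value)` samples. This is an illustrative re-implementation of the semantics, not Prometheus code.

```python
def rate(samples):
    # Average per-second increase across the whole window: smooth.
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

def irate(samples):
    # Per-second increase from the last two points only: reactive.
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter that rises slowly, then spikes at the end of the window.
samples = [(0, 0), (10, 10), (20, 20), (30, 90)]
assert rate(samples) == 3.0    # (90 - 0) / 30: smoothed over the window
assert irate(samples) == 7.0   # (90 - 20) / 10: reflects only the spike
```

This is why `rate[30s]` was picked above: it smooths transient spikes that `irate` would amplify, while a long `rate[5m]` window would lag behind real changes.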
 
### Examples:
**Yellow: `rate[30s]`** (Have selected this as this seems to be the best
balance between the various options)
Orange: `irate[30s]`
Blue: `rate[5m]`

<img width="2688" alt="image"
src="https://github.com/user-attachments/assets/a2424318-e06a-4b32-8d8c-83fc1dfbb540">
 
#### Sample code
```python
import ray
import time

context = ray.init(include_dashboard=True)
print(context.dashboard_url)

def f(x):
    time.sleep(0.1)
    return x

def g(x):
    time.sleep(0.1)
    return x

ray.data.range(10000).map(f).materialize()
```


## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
…project#46875)

3 main changes for TrainRunInfo:
- Rename `STARTED` status to `RUNNING` status.
- Updated the status detail of `ABORTED` status.
- Add the stack trace of the failed worker, together with the failed
worker rank.
  - truncate the stack trace to less than 50,000 chars.
  - Added a new field for the error: `TrainRunInfo.run_error`

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>
Signed-off-by: woshiyyya <1085966850@qq.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Re: ray-project#47229

The previous PR to set up a default Serve logger had an unexpected
consequence. Combined with Serve's stdout redirect feature (when
`RAY_SERVE_LOG_TO_STDERR=0` is set in the env), it set up the default
Serve logger and redirected all stdout/stderr into Serve's log files
instead of the console. This made the Anyscale platform unable to
determine that the ray start command was running successfully, so it
could not start the cluster. This PR fixes the behavior by configuring
Serve's default logger with only a stream handler and skipping the file
handler altogether.

Signed-off-by: Gene Su <e870252314@gmail.com>
…ect#47604)

If an operator outputs empty blocks, then Ray Data thinks that the
operator has 256 MiB of pending task outputs, even though it should be
0. For example:
```python
import pyarrow as pa
output = pa.Table.from_pydict({"data": [None] * 128})
assert output.nbytes == 0, output.nbytes
```

The bug occurs because we check whether `average_bytes_per_output`
is truthy rather than whether it's not `None`.

https://github.com/ray-project/ray/blob/1f83fb44580e392ba6d39a9e79bbdd8cd5b7d916/python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py#L369-L371
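The difference between the truthiness check and the `None` check can be sketched as follows. Names and the fallback value are illustrative, not the actual `op_runtime_metrics.py` code.

```python
DEFAULT_ESTIMATE = 256 * 1024 * 1024  # 256 MiB fallback for unknown output size

def pending_output_bytes_buggy(num_pending, avg_bytes_per_output):
    if avg_bytes_per_output:              # 0 is falsy -> wrong branch taken
        return num_pending * avg_bytes_per_output
    return num_pending and DEFAULT_ESTIMATE

def pending_output_bytes_fixed(num_pending, avg_bytes_per_output):
    if avg_bytes_per_output is not None:  # only None means "unknown"
        return num_pending * avg_bytes_per_output
    return num_pending and DEFAULT_ESTIMATE

# An operator producing empty blocks legitimately averages 0 bytes/output:
assert pending_output_bytes_buggy(4, 0) == DEFAULT_ESTIMATE  # overestimates
assert pending_output_bytes_fixed(4, 0) == 0                 # correct
assert pending_output_bytes_fixed(4, None) == DEFAULT_ESTIMATE
```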
---

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…rkers. (ray-project#47212)

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…tiple refs or futures to allow clients to retrieve them one at a time (ray-project#46908) (ray-project#47305)

## Why are these changes needed?
Currently, if `MultiOutputNode` is used to wrap a DAG's output, you get
back a single `CompiledDAGRef` or `CompiledDAGFuture`, depending on
whether `execute` or `execute_async` is invoked, that points to a list
of all of the outputs. To retrieve one of the outputs, you have to get
and deserialize all of them at the same time.

This PR separates the output of `execute` and `execute_async` to a list
of `CompiledDAGRef` or `CompiledDAGFuture` when the output is wrapped by
`MultiOutputNode`. This is particularly useful for vLLM tensor
parallelism. Since all shards return the same results, we only need to
fetch result from one of the workers.
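The shape of the change can be illustrated with a toy using standard-library futures (this is not the Ray compiled-graph API): returning one future per output lets the caller retrieve a single shard's result without deserializing the rest.

```python
from concurrent.futures import ThreadPoolExecutor

def run_shards(pool, shard_inputs):
    # One future per output (the post-PR shape), instead of a single
    # future pointing at a list of all outputs.
    return [pool.submit(lambda x: x * 2, s) for s in shard_inputs]

with ThreadPoolExecutor() as pool:
    futures = run_shards(pool, [1, 2, 3])
    # Fetch just one shard; e.g. with vLLM tensor parallelism all shards
    # return the same result, so one is enough.
    first = futures[0].result()

assert first == 2
assert len(futures) == 3
```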

Closes ray-project#46908.

---------

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Signed-off-by: Jeffrey Wang <jeffrey31415926@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
…um_blocks" argument (ray-project#47559)

## Why are these changes needed?

As described in issue ray-project#47507, from_huggingface() does not support
override_num_blocks for non-streaming HF Datasets, so we should throw an
exception. We also need to pass other arguments through from_huggingface()
when a streaming dataset is used.
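The intended validation can be sketched as below. This is an illustrative stand-in, not Ray's actual `from_huggingface` implementation; the function name and return shape are hypothetical.

```python
def from_huggingface_sketch(dataset, streaming: bool,
                            override_num_blocks=None, **read_args):
    """Reject override_num_blocks for non-streaming datasets and forward
    extra read arguments in the streaming path."""
    if not streaming and override_num_blocks is not None:
        raise ValueError(
            "override_num_blocks is only supported for streaming datasets"
        )
    mode = "streaming" if streaming else "materialized"
    return {"mode": mode, "read_args": read_args}

# Streaming path: extra arguments are passed through.
out = from_huggingface_sketch("ds", streaming=True, override_num_blocks=8,
                              concurrency=4)
assert out == {"mode": "streaming", "read_args": {"concurrency": 4}}

# Non-streaming path: overriding the block count raises.
try:
    from_huggingface_sketch("ds", streaming=False, override_num_blocks=8)
    raised = False
except ValueError:
    raised = True
assert raised
```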

## Related issue number
Close ray-project#47507

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

I did a manual test of the exception path. Let me know if I need to do more
tests.

---------

Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
## Why are these changes needed?

Detect replica death earlier on handles/routers. Currently, routers
process replica death only if the actor death error is thrown during active
probing or a system message.
1. Cover one more case: process replica death if the error is thrown _while_
a request was being processed on the replica.
2. Improved handling: if the error is detected on the system message,
meaning the router found out the replica is dead after assigning a request to
that replica, retry the request.


### Performance evaluation
(master results pulled from
https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42)

Latency:
| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
| http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
| http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89
|
| grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
| grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
| grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22
|
| handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
| handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 |
-2.67 |
| handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 |
-1.96 |

Throughput:
| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_avg_rps | 359.14 | 357.81 | -0.37 |
| http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
| grpc_avg_rps | 506.16 | 485.92 | -4.0 |
| grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
| handle_avg_rps | 604.52 | 641.66 | 6.14 |
| handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |


Results: everything except the gRPC results is within noise. The
gRPC results have always been relatively noisy (see below), so they
are actually also within the noise we've been seeing. There
is also no reason why latency for a request would increase only for gRPC
and not HTTP or handle given the changes in this PR, so IMO this is safe.
![Screenshot 2024-08-21 at 11 54
55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378)


## Related issue number

closes ray-project#47219

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>