
[RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold #1293

Merged

Conversation

kevin85421
Member

@kevin85421 kevin85421 commented Aug 4, 2023

Why are these changes needed?

Before Ray 2.6.0, the status of a Serve deployment would be UPDATING while it was attempting to scale up additional replicas. The Serve deployment could remain in the UPDATING status for longer than serviceUnhealthySecondThreshold seconds if the cluster lacked sufficient resources to accommodate the new replicas. In such cases, KubeRay would mark the RayCluster as unhealthy and proceed to prepare a new RayCluster instead. We should not trigger the preparation of a new RayCluster while the deployment is simply in the process of scaling.

Definition of "unhealthy" without this PR

In the RayService, KubeRay can mark a RayCluster as unhealthy in two possible scenarios.

Case 1: The KubeRay operator cannot connect to the dashboard agent on the head Pod for more than the duration defined by the deploymentUnhealthySecondThreshold parameter.

  • deploymentUnhealthySecondThreshold: If it is not set, the default value is 60 seconds. In sample YAML files, we typically set the value to 300 seconds.

Case 2: The KubeRay operator will mark a RayCluster as unhealthy if the status of a Serve application is not RUNNING, or a Serve deployment is not HEALTHY, for a duration exceeding the serviceUnhealthySecondThreshold parameter.

  • serviceUnhealthySecondThreshold: If it is not set, the default value is 60 seconds. In sample YAML files, we typically set the value to 300 seconds.

After KubeRay marks a RayCluster as unhealthy, it initiates the creation of a new RayCluster. Once the new RayCluster is ready, KubeRay redirects network traffic to it, and subsequently deletes the old RayCluster.

Definition of "unhealthy" with this PR

In the RayService, KubeRay can mark a RayCluster as unhealthy in two possible scenarios.

Case 1: The KubeRay operator cannot connect to the dashboard agent on the head Pod for more than the duration defined by the deploymentUnhealthySecondThreshold parameter.

  • deploymentUnhealthySecondThreshold: Both the default value and the values in the sample YAML files are now 300 seconds.

Case 2: The KubeRay operator will mark a RayCluster as unhealthy if the status of a Serve application is DEPLOY_FAILED or UNHEALTHY for a duration exceeding the serviceUnhealthySecondThreshold parameter.

  • serviceUnhealthySecondThreshold: Both the default value and the values in the sample YAML files are now 900 seconds.

After KubeRay marks a RayCluster as unhealthy, it initiates the creation of a new RayCluster. Once the new RayCluster is ready, KubeRay redirects network traffic to it, and subsequently deletes the old RayCluster.
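
For illustration only, the Go sketch below shows how the two thresholds could be resolved against the new defaults described above. The field names mirror the RayService CRD, but the constants and the orDefault helper are hypothetical and are not KubeRay's actual defaulting code.

```go
// Sketch only: resolves the two thresholds against the new defaults (300 s / 900 s).
package main

import "fmt"

// Fragment of a RayService spec: pointers distinguish "not set" from an explicit zero.
type RayServiceSpec struct {
	DeploymentUnhealthySecondThreshold *int32
	ServiceUnhealthySecondThreshold    *int32
}

const (
	defaultDeploymentUnhealthySeconds int32 = 300 // previously 60
	defaultServiceUnhealthySeconds    int32 = 900 // previously 60
)

// orDefault returns the user-supplied threshold if present, otherwise the default.
func orDefault(v *int32, def int32) int32 {
	if v != nil {
		return *v
	}
	return def
}

func main() {
	spec := RayServiceSpec{} // neither threshold set in the manifest
	fmt.Println(orDefault(spec.DeploymentUnhealthySecondThreshold, defaultDeploymentUnhealthySeconds)) // 300
	fmt.Println(orDefault(spec.ServiceUnhealthySecondThreshold, defaultServiceUnhealthySeconds))       // 900
}
```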

To conclude, there are two main differences in this PR:

  • Increase the default values of deploymentUnhealthySecondThreshold and serviceUnhealthySecondThreshold to reduce the likelihood of triggering preparation of a new cluster.
  • Determine whether a RayCluster is unhealthy based only on Serve applications; Serve deployments are no longer taken into consideration. In addition, the cluster is not marked unhealthy when the Serve API reports UPDATING, so a RayCluster will not be considered unhealthy while a Serve application is scaling up more replicas (see the sketch below).
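
As a rough illustration of the second point, here is a minimal Go sketch of the new rule: only the Serve application statuses DEPLOY_FAILED and UNHEALTHY count toward the serviceUnhealthySecondThreshold timer, while UPDATING never does. The function name, signature, and timestamp bookkeeping are hypothetical and are not the operator's actual implementation.

```go
// Sketch only: an illustrative restatement of the new health rule, not the
// KubeRay operator's real code. The status strings correspond to Ray Serve
// application statuses.
package main

import (
	"fmt"
	"time"
)

// shouldMarkUnhealthy reports whether a RayCluster should be treated as
// unhealthy: the Serve application must have been DEPLOY_FAILED or UNHEALTHY
// for longer than serviceUnhealthySecondThreshold. UPDATING never counts, so
// scaling up replicas cannot trigger preparation of a new cluster.
func shouldMarkUnhealthy(appStatus string, badSince time.Time, threshold time.Duration) bool {
	if appStatus != "DEPLOY_FAILED" && appStatus != "UNHEALTHY" {
		return false
	}
	return time.Since(badSince) > threshold
}

func main() {
	threshold := 900 * time.Second
	fmt.Println(shouldMarkUnhealthy("UPDATING", time.Now().Add(-2*time.Hour), threshold))         // false: scaling is fine
	fmt.Println(shouldMarkUnhealthy("DEPLOY_FAILED", time.Now().Add(-30*time.Minute), threshold)) // true
}
```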

Related issue number

Closes #1277

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
RAY_IMAGE=rayproject/ray:2.5.0 OPERATOR_IMAGE=controller:latest pytest -vs tests/test_sample_rayservice_yamls.py --log-cli-level=INFO
[Screenshot of the manual test run: Screen Shot 2023-08-06 at 12 19 12 AM]

@kevin85421 kevin85421 changed the title WIP [RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold Aug 6, 2023
@kevin85421 kevin85421 marked this pull request as ready for review August 6, 2023 17:38
Contributor

@architkulkarni architkulkarni left a comment


Overall looks good to me, just a few minor questions. Will leave it to Serve folks to decide if the new defaults and new behavior make sense. They seem reasonable to me.

Should this behavior appear in the docs somewhere?

@architkulkarni
Contributor

Actually, there is one question which Kai-Hsun asked earlier that I'm curious about: should we restart the cluster if the app is stuck in DEPLOYING for too long? Or does Serve guarantee that after some time it transitions from DEPLOYING to DEPLOY_FAILED?

kevin85421 and others added 2 commits August 7, 2023 11:57
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Contributor

@edoakes edoakes left a comment


Looks good.

Is there relevant documentation that should be updated? If not, we should add some. This is an important behavior for folks to understand.

@kevin85421
Member Author

@architkulkarni @edoakes I updated the doc in 875c7f3.

@kevin85421 kevin85421 merged commit 3959509 into ray-project:master Aug 9, 2023
21 checks passed
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 15, 2023
[RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold (ray-project#1293)
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
[RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold (ray-project#1293)