
[RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold #1293

Merged

Conversation

kevin85421
Member

@kevin85421 kevin85421 commented Aug 4, 2023

Why are these changes needed?

Before Ray 2.6.0, the status of a Serve deployment would be UPDATING while it was attempting to scale up additional replicas. The Serve deployment could remain in the UPDATING status for longer than serviceUnhealthySecondThreshold seconds if the cluster lacked sufficient resources to accommodate the new replicas. In such cases, KubeRay would mark the RayCluster as unhealthy and proceed to prepare a new RayCluster instead. We should not trigger the preparation of a new RayCluster while the deployment is simply in the process of scaling.

Definition of "unhealthy" without this PR

In the RayService, KubeRay can mark a RayCluster as unhealthy in two possible scenarios.

Case 1: The KubeRay operator cannot connect to the dashboard agent on the head Pod for more than the duration defined by the deploymentUnhealthySecondThreshold parameter.

  • deploymentUnhealthySecondThreshold: If it is not set, the default value is 60 seconds. In sample YAML files, we typically set the value to 300 seconds.

Case 2: The KubeRay operator will mark a RayCluster as unhealthy if the status of a Serve application is not RUNNING, or a Serve deployment is not HEALTHY, for a duration exceeding the serviceUnhealthySecondThreshold parameter.

  • serviceUnhealthySecondThreshold: If it is not set, the default value is 60 seconds. In sample YAML files, we typically set the value to 300 seconds.

After KubeRay marks a RayCluster as unhealthy, it initiates the creation of a new RayCluster. Once the new RayCluster is ready, KubeRay redirects network traffic to it, and subsequently deletes the old RayCluster.

Definition of "unhealthy" with this PR

In the RayService, KubeRay can mark a RayCluster as unhealthy in two possible scenarios.

Case 1: The KubeRay operator cannot connect to the dashboard agent on the head Pod for more than the duration defined by the deploymentUnhealthySecondThreshold parameter.

  • deploymentUnhealthySecondThreshold: Both the default value and the values in the sample YAML files are now 300 seconds.

Case 2: The KubeRay operator will mark a RayCluster as unhealthy if the status of a Serve application is DEPLOY_FAILED or UNHEALTHY for a duration exceeding the serviceUnhealthySecondThreshold parameter.

  • serviceUnhealthySecondThreshold: Both the default value and the values in the sample YAML files are now 900 seconds.

After KubeRay marks a RayCluster as unhealthy, it initiates the creation of a new RayCluster. Once the new RayCluster is ready, KubeRay redirects network traffic to it, and subsequently deletes the old RayCluster.
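
For illustration only, the Go sketch below shows how the two thresholds could be resolved against the new defaults described above. The field names mirror the RayService CRD, but the constants and the orDefault helper are hypothetical and are not KubeRay's actual defaulting code.

```go
// Sketch only: resolves the two thresholds against the new defaults (300 s / 900 s).
package main

import "fmt"

// Fragment of a RayService spec: pointers distinguish "not set" from an explicit zero.
type RayServiceSpec struct {
	DeploymentUnhealthySecondThreshold *int32
	ServiceUnhealthySecondThreshold    *int32
}

const (
	defaultDeploymentUnhealthySeconds int32 = 300 // previously 60
	defaultServiceUnhealthySeconds    int32 = 900 // previously 60
)

// orDefault returns the user-supplied threshold if present, otherwise the default.
func orDefault(v *int32, def int32) int32 {
	if v != nil {
		return *v
	}
	return def
}

func main() {
	spec := RayServiceSpec{} // neither threshold set in the manifest
	fmt.Println(orDefault(spec.DeploymentUnhealthySecondThreshold, defaultDeploymentUnhealthySeconds)) // 300
	fmt.Println(orDefault(spec.ServiceUnhealthySecondThreshold, defaultServiceUnhealthySeconds))       // 900
}
```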

To conclude, there are two main differences in this PR:

  • Increase the default values of deploymentUnhealthySecondThreshold and serviceUnhealthySecondThreshold to reduce the likelihood of triggering preparation of a new cluster.
  • Determine whether a RayCluster is unhealthy based only on Serve applications; Serve deployments are no longer taken into consideration. In addition, the cluster is not marked unhealthy when the Serve API reports UPDATING, so a RayCluster will not be considered unhealthy while a Serve application is scaling up more replicas (see the sketch below).
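
As a rough illustration of the second point, here is a minimal Go sketch of the new rule: only the Serve application statuses DEPLOY_FAILED and UNHEALTHY count toward the serviceUnhealthySecondThreshold timer, while UPDATING never does. The function name, signature, and timestamp bookkeeping are hypothetical and are not the operator's actual implementation.

```go
// Sketch only: an illustrative restatement of the new health rule, not the
// KubeRay operator's real code. The status strings correspond to Ray Serve
// application statuses.
package main

import (
	"fmt"
	"time"
)

// shouldMarkUnhealthy reports whether a RayCluster should be treated as
// unhealthy: the Serve application must have been DEPLOY_FAILED or UNHEALTHY
// for longer than serviceUnhealthySecondThreshold. UPDATING never counts, so
// scaling up replicas cannot trigger preparation of a new cluster.
func shouldMarkUnhealthy(appStatus string, badSince time.Time, threshold time.Duration) bool {
	if appStatus != "DEPLOY_FAILED" && appStatus != "UNHEALTHY" {
		return false
	}
	return time.Since(badSince) > threshold
}

func main() {
	threshold := 900 * time.Second
	fmt.Println(shouldMarkUnhealthy("UPDATING", time.Now().Add(-2*time.Hour), threshold))         // false: scaling is fine
	fmt.Println(shouldMarkUnhealthy("DEPLOY_FAILED", time.Now().Add(-30*time.Minute), threshold)) // true
}
```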

Related issue number

Closes #1277

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
RAY_IMAGE=rayproject/ray:2.5.0 OPERATOR_IMAGE=controller:latest pytest -vs tests/test_sample_rayservice_yamls.py --log-cli-level=INFO
[Screenshot of the manual test run: Screen Shot 2023-08-06 at 12 19 12 AM]

@kevin85421 kevin85421 changed the title WIP [RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold Aug 6, 2023
@kevin85421 kevin85421 marked this pull request as ready for review August 6, 2023 17:38
Contributor

@architkulkarni architkulkarni left a comment


Overall looks good to me, just a few minor questions. Will leave it to Serve folks to decide if the new defaults and new behavior make sense. They seem reasonable to me.

Should this behavior appear in the docs somewhere?

@architkulkarni
Contributor

Actually, there is one question which Kai-Hsun asked earlier that I'm curious about: should we restart the cluster if the app is stuck in DEPLOYING for too long? Or does Serve guarantee that after some time it transitions from DEPLOYING to DEPLOY_FAILED?

kevin85421 and others added 2 commits August 7, 2023 11:57
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Contributor

@edoakes edoakes left a comment


Looks good.

Is there relevant documentation that should be updated? If not, we should add some. This is an important behavior for folks to understand.

@kevin85421
Member Author

@architkulkarni @edoakes I updated the doc in 875c7f3.

@kevin85421 kevin85421 merged commit 3959509 into ray-project:master Aug 9, 2023
21 checks passed
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 15, 2023
[RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold (ray-project#1293)
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
[RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold (ray-project#1293)