[RayService][Health-Check][2/n] Remove the hotfix to prevent unnecessary HTTP requests #1658

Merged
merged 1 commit into from
Nov 17, 2023

@kevin85421 kevin85421 commented Nov 17, 2023

Why are these changes needed?

This PR removes the hotfix introduced by #1581. As the description of #1581 (quoted below) explains, that change was made to respect RayService's health check mechanism. However, we have since decided to offload health check responsibilities to K8s and RayCluster, and the RayService controller no longer triggers preparation of a new RayCluster based on data plane status (Ray Serve status and Ray dashboard agent status). It is therefore safe to remove the hotfix and leave this responsibility to the Pod's liveness probes.

  • Question: When we kill the dashboard agent process on the head Pod, the head Pod becomes not ready, so the dashboard agent health check is skipped. As a result, the zero-downtime upgrade is not triggered after deploymentUnhealthySecondThreshold seconds. The Ray head fails and restarts only after the liveness probe fails 120 times consecutively (~600 seconds).
  • This PR skips the active RayCluster's head Pod status check. KubeRay still sends requests to the dashboard agent and triggers a zero-downtime upgrade after deploymentUnhealthySecondThreshold seconds.
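
For reference, the "~600 seconds" figure above follows from the usual Kubernetes rule that a container is restarted after failureThreshold consecutive liveness probe failures, with one probe every periodSeconds. A minimal sketch of that arithmetic is below. The 120-failure threshold comes from the text above; the 5-second period is an assumption inferred from the ~600 second figure, and the function name is hypothetical.

```python
def seconds_until_restart(failure_threshold: int, period_seconds: float) -> float:
    """Approximate time for consecutive liveness probe failures to trigger a restart.

    Ignores initialDelaySeconds and per-probe timeouts, so this is a rough
    estimate rather than an exact bound.
    """
    return failure_threshold * period_seconds


# 120 consecutive failures is taken from the PR description.
# The 5 s probe period is an assumption inferred from "~600 seconds".
restart_delay = seconds_until_restart(failure_threshold=120, period_seconds=5)
print(f"Head Pod restarts after roughly {restart_delay} seconds")
```

If the actual probe settings on the head Pod differ, substitute them here to estimate how long an unhealthy head can remain before the liveness probe restarts it.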

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 marked this pull request as ready for review November 17, 2023 21:31
@kevin85421 kevin85421 merged commit 4557a01 into ray-project:master Nov 17, 2023
24 checks passed