
[RayService][Health-Check][1/n] Offload the health check responsibilities to K8s and RayCluster #1656

Merged 1 commit into ray-project:master on Nov 17, 2023

Conversation

kevin85421 (Member) commented on Nov 17, 2023

Why are these changes needed?

RayService features a health check mechanism that monitors the status of the dashboard agent on the Ray head and the status of the Ray Serve applications. It triggers the preparation of a new RayCluster whenever the RayService controller considers the current RayCluster unhealthy.
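For context, this failover is visible from the Kubernetes side. The following is a hedged sketch (not part of this PR; the RayService name is a placeholder) of how to observe it:

# List the RayClusters in the namespace. Before this PR, a failed health check on the
# active cluster caused the controller to prepare a second RayCluster, so two clusters
# belonging to the same RayService would appear here during the switchover.
kubectl get rayclusters

# Inspect the RayService itself; its status reports the active RayCluster and, during a
# switchover, the pending one.
kubectl describe rayservice rayservice-sample   # placeholder RayService name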

Almost no users benefit from this mechanism, and many have complained about KubeRay creating a new RayCluster automatically. Most users do not have enough computing resources to run two RayClusters simultaneously. In addition, maintaining the RayService health check mechanism is quite challenging due to the numerous interconnected health checks in KubeRay, including: (1) the RayCluster controller, (2) K8s readiness/liveness probes, (3) Raylet, (4) the Ray Autoscaler, and (5) the RayService controller. That's why I decided to offload the health check responsibilities to K8s and the RayCluster controller.

Concretely, this PR avoids calling the function markRestart based on the data plane status.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
# Step 1: Create a RayService with (1) serviceUnhealthySecondThreshold: 90 and (2) deploymentUnhealthySecondThreshold: 30.
# https://gist.github.com/kevin85421/deba2e8c8e18455d6911cfe90a03e493
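# (Hedged sketch, not from the PR: I assume the gist's manifest sets these thresholds
# directly under the RayService spec, which is where the RayService CRD defines them.)
#
#   spec:
#     serviceUnhealthySecondThreshold: 90
#     deploymentUnhealthySecondThreshold: 30
#     ...  # serveConfig / rayClusterConfig omitted; see the gist for the full manifest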

# Step 2: Delete the dashboard agent process after the Ray Serve applications are ready
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- bash
# Inside the head Pod: find the dashboard agent's PID and kill it. The agent process
# typically runs dashboard/agent.py, so something like `pgrep -f dashboard/agent`
# should locate it (the exact command line may vary across Ray versions).
kill $DASHBOARD_AGENT_PID

# Step 3: Wait for 3 minutes; no new RayCluster preparation should be triggered.
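
# Verification sketch (not from the original PR description): confirm that the original
# RayCluster is still the only one, i.e. the controller did not prepare a replacement.
kubectl get rayclusters
kubectl get rayservice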
  • Note that the dashboard agent status should be monitored by K8s Pod probes. However, KubeRay currently only injects probes into Pods when GCS FT is enabled. Hence, I will open a follow-up PR to add probes to Ray Pods regardless of whether GCS FT is enabled (see the check below).
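
A quick way to check whether probes are currently injected (a hedged sketch; it assumes the Ray container is the first container listed in the head Pod spec):

# Print the head Pod's liveness/readiness probes; with GCS FT disabled, this currently
# prints nothing because KubeRay does not inject probes.
kubectl get pod $HEAD_POD -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}{.spec.containers[0].readinessProbe}{"\n"}'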

@kevin85421 kevin85421 changed the title [Feature][RayService] Do not trigger new RayCluster preparation based on data plane status [Feature][RayService] Offload the health check responsibilities to K8s and RayCluster Nov 17, 2023
@kevin85421 kevin85421 changed the title [Feature][RayService] Offload the health check responsibilities to K8s and RayCluster [Feature][RayService][1/n] Offload the health check responsibilities to K8s and RayCluster Nov 17, 2023
@kevin85421 kevin85421 marked this pull request as ready for review November 17, 2023 19:41
architkulkarni (Contributor) left a comment

Looks good to me!

@kevin85421 kevin85421 changed the title [Feature][RayService][1/n] Offload the health check responsibilities to K8s and RayCluster [RayService][Health-Check][1/n] Offload the health check responsibilities to K8s and RayCluster Nov 17, 2023
@kevin85421 kevin85421 merged commit 6c2281c into ray-project:master Nov 17, 2023
23 of 24 checks passed