
Conversation


Yicheng-Lu-llll (Collaborator) commented on Jan 5, 2024

Why are these changes needed?

Currently, to keep a RayService highly available, KubeRay checks the health of each serving Pod's HTTP proxy at every reconciliation. If a proxy is unhealthy, KubeRay changes the corresponding Pod's label so that the Pod is no longer selected by the serve service.
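For illustration, that label-flipping path looks roughly like the sketch below (Go, controller-runtime); the label key ray.io/serve and the helper name markServing are illustrative assumptions, not necessarily the exact identifiers in the codebase:

package sketch

import (
	"context"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markServing flips the label that the serve service's selector matches, so an
// unhealthy Pod stops receiving Serve traffic until a later reconciliation
// marks it healthy again.
func markServing(ctx context.Context, c client.Client, pod *corev1.Pod, healthy bool) error {
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	// Assumed label key; the serve service selects Pods where this is "true".
	pod.Labels["ray.io/serve"] = strconv.FormatBool(healthy)
	return c.Update(ctx, pod)
}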

This method has several shortcomings:

  • KubeRay may become overloaded by checking every Ray Pod at every reconciliation, especially when there are many Ray Pods.
  • If the KubeRay operator Pod fails, high availability is lost as well: an unhealthy worker Pod can no longer be removed from the serve service!

This PR moves the HTTP proxy health check into the readiness probe for all worker Pods, while the head Pod's HTTP proxy is still checked at every reconciliation. This has several benefits:

  • It reduces the load on KubeRay.
  • It decouples RayService's high availability from failures of the KubeRay operator.

Note:

  • There are two reasons for not moving the HTTP proxy health check into the head Pod's readiness probe:

    • If the readiness probe fails, the head Pod is removed from all services, including the head service, which is the main way worker Pods communicate with the head Pod.
    • The Serve app is not submitted until the head Pod is ready. If the readiness probe also checked the HTTP proxy's health, it would fail, since no Serve app is running before the head Pod becomes ready; but if the readiness probe fails, the head Pod never becomes ready and the Serve app is never submitted. This is a circular dependency.
  • After this PR, a worker Pod with no Serve replica will also not reach the 'READY' state, because it has no HTTP proxy. That status should be read as 'not ready for serving'. So don't be surprised if a worker Pod never becomes ready even though nothing is misconfigured; it may simply be that no Serve replica has been scheduled on that worker Pod.

Additionally, this PR refactors some of the related code.
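For reference, here is a hedged Go sketch of the worker readiness probe this change produces, reconstructed from the probe dump in the test section below; the function name and the plain int port parameter are illustrative, not the actual KubeRay code:

package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// workerReadinessProbe combines the raylet health check with the Serve HTTP
// proxy's /-/healthz endpoint, mirroring the probe shown in Test1/Test2 below.
func workerReadinessProbe(servePort int) *corev1.Probe {
	return &corev1.Probe{
		InitialDelaySeconds: 10,
		TimeoutSeconds:      1,
		PeriodSeconds:       5,
		SuccessThreshold:    1,
		// A single failure marks the worker NotReady, so the serve service
		// stops routing to it quickly.
		FailureThreshold: 1,
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{
					"bash", "-c",
					fmt.Sprintf("wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:%d/-/healthz | grep success", servePort),
				},
			},
		},
	}
}

The key design choice is the failure threshold of 1: a worker whose proxy stops answering is pulled out of the serve service after a single failed probe.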

Related issue number

None

Checks

  • Test1:
    • Check that the head and worker Pods' readiness and liveness probes are set correctly.
    • Check that the endpoints of the serve service change correctly when scaling.
    • Check that no requests are dropped during scaling.
# Step 1: Create a RayService with one head Pod and three worker Pods; each worker Pod has a Serve replica. There are no user-defined probes.
# Also, a Locust cluster is created to test the high availability of the RayService.
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/rayservice-config_v2.9_replicas-3.yaml

# Step 2: Check head Pod's readiness probe and liveness probe. 
kubectl describe $(kubectl get pods -o=name | grep head) | grep  "Readiness:\|Liveness:"
# Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
# Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=10s timeout=1s period=5s #success=1 #failure=10

# Step 3: Check worker Pod's readiness probe and liveness probe.
kubectl describe $(kubectl get pods -o=name | grep worker) | grep  "Readiness:\|Liveness:"
# Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
# Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8000/-/healthz | grep success] delay=10s timeout=1s period=5s #success=1 #failure=1

# Step 4: Check the endpoints of the serve service. There should be four endpoints (1 head + 3 workers).
kubectl describe service rayservice-sample-serve-svc | grep "Endpoints"
# Endpoints:         10.244.0.19:8000,10.244.0.20:8000,10.244.0.21:8000 + 1 more...

# Step 5: Scale the RayService down to 1 replica and recheck the endpoints of the serve service.
# There should be only two endpoints (1 head + 1 worker).
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/rayservice-config_v2.9_replicas-1.yaml
kubectl describe service rayservice-sample-serve-svc | grep "Endpoints"
# Endpoints:         10.244.0.19:8000,10.244.0.21:8000

# Step 6: During all the above steps, we can also run a Locust cluster to test the high availability of the RayService.
# You will find that no requests are dropped during the scale-down.
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/locust_cluster.yaml
kubectl exec -it $(kubectl get pods -o=name | grep locust) -- /bin/sh
cd serve_workloads/microbenchmarks
python locust_runner.py -f /home/ray/serve_workloads/microbenchmarks/qps_test_locustfile.py -u 100 -r 100 -p 0 --host http://rayservice-sample-serve-svc:8000
  • Test2: Check that Serve requests and the Serve health check still succeed when a custom serve port is used.
# Create a RayService with one head Pod and one worker Pod using serving port 9000; each worker Pod has a Serve replica.
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/ray-service.different-port.yaml

kubectl describe $(kubectl get pods -o=name | grep worker) | grep  "Readiness:\|Liveness:"
# Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
# Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:9000/-/healthz | grep success] delay=10s timeout=1s period=5s #success=1 #failure=1

kubectl describe svc rayservice-sample-serve-svc | grep "Endpoints"
# Endpoints:         10.244.0.6:9000,10.244.0.7:9000
  • Test3: Check a custom serve service.
# Create a RayService with one head Pod and one worker Pod; each worker Pod has a Serve replica.
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/ray-service.custom-serve-service.yaml

kubectl describe svc custom-ray-serve-service-name | grep "Endpoints"
# Endpoints:                10.244.0.18:8000,10.244.0.19:8000
  • Test4: The user defines their own probes.
    This case is already covered by a CI test; a hedged sketch of the expected behavior follows this list.
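The sketch below shows the behavior Test4 relies on, under the assumption that KubeRay only injects its default probes when the Ray container does not already define one; the helper name initHealthProbes is hypothetical:

package sketch

import corev1 "k8s.io/api/core/v1"

// initHealthProbes fills in default probes only when the user has not set any,
// so user-defined readiness/liveness probes are left untouched.
func initHealthProbes(rayContainer *corev1.Container, defaultReadiness, defaultLiveness *corev1.Probe) {
	if rayContainer.ReadinessProbe == nil {
		rayContainer.ReadinessProbe = defaultReadiness
	}
	if rayContainer.LivenessProbe == nil {
		rayContainer.LivenessProbe = defaultLiveness
	}
}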

Yicheng-Lu-llll force-pushed the serve-health-check-to-readiness branch 2 times, most recently from f084ada to 1621e6d on Jan 5, 2024, 23:47
Yicheng-Lu-llll force-pushed the serve-health-check-to-readiness branch from 1621e6d to 07f991c on Jan 7, 2024, 04:21
return appStatus == rayv1.ApplicationStatusEnum.UNHEALTHY || appStatus == rayv1.ApplicationStatusEnum.DEPLOY_FAILED
}

func (r *RayServiceReconciler) getHeadPod(ctx context.Context, instance *rayv1.RayCluster) (*corev1.Pod, error) {
A Member commented on the excerpt above:

In the KubeRay codebase, several places need to retrieve the head Pod. We may move this function to util.go, and then always use this function to retrieve the head Pod in this PR or a follow up PR.
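A rough sketch of what such a shared helper in util.go could look like; the label keys and the function signature are assumptions, not the final implementation:

package utils

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// GetHeadPod lists Pods carrying the cluster's head-node labels and expects
// exactly one match.
func GetHeadPod(ctx context.Context, c client.Client, namespace, clusterName string) (*corev1.Pod, error) {
	podList := corev1.PodList{}
	if err := c.List(ctx, &podList,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName, "ray.io/node-type": "head"},
	); err != nil {
		return nil, err
	}
	if len(podList.Items) != 1 {
		return nil, fmt.Errorf("expected 1 head Pod for RayCluster %s/%s, found %d", namespace, clusterName, len(podList.Items))
	}
	return &podList.Items[0], nil
}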

kevin85421 self-assigned this on Jan 7, 2024
Yicheng-Lu-llll changed the title from WIP to [RayService] Move HTTP Proxy's Health Check to Readiness Probe for Wokers on Jan 8, 2024
// See https://github.com/ray-project/kuberay/pull/1808 for reasons.
if enableServeService && rayNodeType == rayv1.WorkerNode {
rayContainer.ReadinessProbe.FailureThreshold = utils.ServeReadinessProbeFailureThreshold
rayServeProxyHealthCommand := fmt.Sprintf(utils.BaseWgetHealthCommand,
A Member commented on the excerpt above:
Note: I executed the command wget -T 2 -q -O- http://localhost:8000/-/healthz in a RayService's worker Pod and received "success".

kevin85421 (Member) left a comment:

Great!

kevin85421 (Member) commented:

@Yicheng-Lu-llll, the test-rayservice-sample-yamls-nightly-operator test has failed three times consecutively, which is very rare. Could you please take a look?

kevin85421 merged commit 96a2ce6 into ray-project:master on Jan 10, 2024
Yicheng-Lu-llll changed the title from [RayService] Move HTTP Proxy's Health Check to Readiness Probe for Wokers to [RayService] Move HTTP Proxy's Health Check to Readiness Probe for Workers on Jan 12, 2024