
Conversation


Yicheng-Lu-llll (Collaborator) commented on Jan 5, 2024

Why are these changes needed?

Currently, to keep a RayService highly available, KubeRay checks the health of each serving Pod's HTTP proxy at every reconciliation. If a proxy is unhealthy, KubeRay changes the corresponding Pod's label so that the Pod is no longer selected by the serve service.
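For illustration, that label-flipping path looks roughly like the sketch below (Go, controller-runtime); the label key ray.io/serve and the helper name markServing are illustrative assumptions, not necessarily the exact identifiers in the codebase:

package sketch

import (
	"context"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markServing flips the label that the serve service's selector matches, so an
// unhealthy Pod stops receiving Serve traffic until a later reconciliation
// marks it healthy again.
func markServing(ctx context.Context, c client.Client, pod *corev1.Pod, healthy bool) error {
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	// Assumed label key; the serve service selects Pods where this is "true".
	pod.Labels["ray.io/serve"] = strconv.FormatBool(healthy)
	return c.Update(ctx, pod)
}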

This method has several shortcomings:

  • KubeRay may become overloaded by checking every Ray Pod at every reconciliation, especially when there are many Ray Pods.
  • If the KubeRay operator Pod fails, high availability is lost as well: an unhealthy worker Pod can no longer be removed from the serve service!

This PR moves the HTTP proxy health check into the readiness probe for all worker Pods, while the head Pod's HTTP proxy is still checked at every reconciliation. This has several benefits:

  • It reduces the load on KubeRay.
  • It decouples RayService's high availability from failures of the KubeRay operator.

Note:

  • There are two reasons for not moving the HTTP proxy health check into the head Pod's readiness probe:

    • If the readiness probe fails, the head Pod is removed from all services, including the head service, which is the main way worker Pods communicate with the head Pod.
    • The Serve app is not submitted until the head Pod is ready. If the readiness probe also checked the HTTP proxy's health, it would fail, since no Serve app is running before the head Pod becomes ready; but if the readiness probe fails, the head Pod never becomes ready and the Serve app is never submitted. This is a circular dependency.
  • After this PR, a worker Pod with no Serve replica will also not reach the 'READY' state, because it has no HTTP proxy. That status should be read as 'not ready for serving'. So don't be surprised if a worker Pod never becomes ready even though nothing is misconfigured; it may simply be that no Serve replica has been scheduled on that worker Pod.

Additionally, this PR refactors some of the related code.
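For reference, here is a hedged Go sketch of the worker readiness probe this change produces, reconstructed from the probe dump in the test section below; the function name and the plain int port parameter are illustrative, not the actual KubeRay code:

package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// workerReadinessProbe combines the raylet health check with the Serve HTTP
// proxy's /-/healthz endpoint, mirroring the probe shown in Test1/Test2 below.
func workerReadinessProbe(servePort int) *corev1.Probe {
	return &corev1.Probe{
		InitialDelaySeconds: 10,
		TimeoutSeconds:      1,
		PeriodSeconds:       5,
		SuccessThreshold:    1,
		// A single failure marks the worker NotReady, so the serve service
		// stops routing to it quickly.
		FailureThreshold: 1,
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{
					"bash", "-c",
					fmt.Sprintf("wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:%d/-/healthz | grep success", servePort),
				},
			},
		},
	}
}

The key design choice is the failure threshold of 1: a worker whose proxy stops answering is pulled out of the serve service after a single failed probe.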

Related issue number

None

Checks

  • Test1:
    • Check that the head and worker Pods' readiness and liveness probes are set correctly.
    • Check that the endpoints of the serve service change correctly when scaling.
    • Check that no requests are dropped during scaling.
# Step 1: Create a RayService with one head Pod and three worker Pods; each worker Pod has a Serve replica. There are no user-defined probes.
# Also, a Locust cluster is created to test the high availability of the RayService.
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/rayservice-config_v2.9_replicas-3.yaml

# Step 2: Check head Pod's readiness probe and liveness probe. 
kubectl describe $(kubectl get pods -o=name | grep head) | grep  "Readiness:\|Liveness:"
# Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
# Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=10s timeout=1s period=5s #success=1 #failure=10

# Step 3: Check worker Pod's readiness probe and liveness probe.
kubectl describe $(kubectl get pods -o=name | grep worker) | grep  "Readiness:\|Liveness:"
# Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
# Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8000/-/healthz | grep success] delay=10s timeout=1s period=5s #success=1 #failure=1

# Step 4: Check the endpoints of the serve service. There should be four endpoints (1 head + 3 workers).
kubectl describe service rayservice-sample-serve-svc | grep "Endpoints"
# Endpoints:         10.244.0.19:8000,10.244.0.20:8000,10.244.0.21:8000 + 1 more...

# Step 5: Scale the RayService down to 1 replica and recheck the endpoints of the serve service.
# There should be only two endpoints (1 head + 1 worker).
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/rayservice-config_v2.9_replicas-1.yaml
kubectl describe service rayservice-sample-serve-svc | grep "Endpoints"
# Endpoints:         10.244.0.19:8000,10.244.0.21:8000

# Step 6: During all the above steps, we can also run a Locust cluster to test the high availability of the RayService.
# You will find that no requests are dropped during the scale-down.
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/locust_cluster.yaml
kubectl exec -it $(kubectl get pods -o=name | grep locust) -- /bin/sh
cd serve_workloads/microbenchmarks
python locust_runner.py -f /home/ray/serve_workloads/microbenchmarks/qps_test_locustfile.py -u 100 -r 100 -p 0 --host http://rayservice-sample-serve-svc:8000
  • Test2: Check that Serve requests and the Serve health check still succeed when a custom serve port is used.
# Create a RayService with one head Pod and one worker Pod using serving port 9000; each worker Pod has a Serve replica.
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/ray-service.different-port.yaml

kubectl describe $(kubectl get pods -o=name | grep worker) | grep  "Readiness:\|Liveness:"
# Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
# Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:9000/-/healthz | grep success] delay=10s timeout=1s period=5s #success=1 #failure=1

kubectl describe svc rayservice-sample-serve-svc | grep "Endpoints"
# Endpoints:         10.244.0.6:9000,10.244.0.7:9000
  • Test3: Check a custom serve service.
# Create a RayService with one head Pod and one worker Pod; each worker Pod has a Serve replica.
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/ray-service.custom-serve-service.yaml

kubectl describe svc custom-ray-serve-service-name | grep "Endpoints"
# Endpoints:                10.244.0.18:8000,10.244.0.19:8000
  • Test4: The user defines their own probes.
    This case is already covered by a CI test; a hedged sketch of the expected behavior follows this list.
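The sketch below shows the behavior Test4 relies on, under the assumption that KubeRay only injects its default probes when the Ray container does not already define one; the helper name initHealthProbes is hypothetical:

package sketch

import corev1 "k8s.io/api/core/v1"

// initHealthProbes fills in default probes only when the user has not set any,
// so user-defined readiness/liveness probes are left untouched.
func initHealthProbes(rayContainer *corev1.Container, defaultReadiness, defaultLiveness *corev1.Probe) {
	if rayContainer.ReadinessProbe == nil {
		rayContainer.ReadinessProbe = defaultReadiness
	}
	if rayContainer.LivenessProbe == nil {
		rayContainer.LivenessProbe = defaultLiveness
	}
}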

Yicheng-Lu-llll force-pushed the serve-health-check-to-readiness branch 2 times, most recently from f084ada to 1621e6d on Jan 5, 2024, 23:47
Yicheng-Lu-llll force-pushed the serve-health-check-to-readiness branch from 1621e6d to 07f991c on Jan 7, 2024, 04:21
return appStatus == rayv1.ApplicationStatusEnum.UNHEALTHY || appStatus == rayv1.ApplicationStatusEnum.DEPLOY_FAILED
}

func (r *RayServiceReconciler) getHeadPod(ctx context.Context, instance *rayv1.RayCluster) (*corev1.Pod, error) {
A Member commented on the excerpt above:

In the KubeRay codebase, several places need to retrieve the head Pod. We may move this function to util.go, and then always use this function to retrieve the head Pod in this PR or a follow up PR.
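A rough sketch of what such a shared helper in util.go could look like; the label keys and the function signature are assumptions, not the final implementation:

package utils

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// GetHeadPod lists Pods carrying the cluster's head-node labels and expects
// exactly one match.
func GetHeadPod(ctx context.Context, c client.Client, namespace, clusterName string) (*corev1.Pod, error) {
	podList := corev1.PodList{}
	if err := c.List(ctx, &podList,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName, "ray.io/node-type": "head"},
	); err != nil {
		return nil, err
	}
	if len(podList.Items) != 1 {
		return nil, fmt.Errorf("expected 1 head Pod for RayCluster %s/%s, found %d", namespace, clusterName, len(podList.Items))
	}
	return &podList.Items[0], nil
}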

kevin85421 self-assigned this on Jan 7, 2024
Yicheng-Lu-llll changed the title from WIP to [RayService] Move HTTP Proxy's Health Check to Readiness Probe for Wokers on Jan 8, 2024
// See https://github.com/ray-project/kuberay/pull/1808 for reasons.
if enableServeService && rayNodeType == rayv1.WorkerNode {
rayContainer.ReadinessProbe.FailureThreshold = utils.ServeReadinessProbeFailureThreshold
rayServeProxyHealthCommand := fmt.Sprintf(utils.BaseWgetHealthCommand,
A Member commented on the excerpt above:
Note: I executed the command wget -T 2 -q -O- http://localhost:8000/-/healthz in a RayService's worker Pod and received "success".

kevin85421 (Member) left a comment:

Great!

kevin85421 (Member) commented:

@Yicheng-Lu-llll, the test-rayservice-sample-yamls-nightly-operator test has failed three times consecutively, which is very rare. Could you please take a look?

kevin85421 merged commit 96a2ce6 into ray-project:master on Jan 10, 2024
Yicheng-Lu-llll changed the title from [RayService] Move HTTP Proxy's Health Check to Readiness Probe for Wokers to [RayService] Move HTTP Proxy's Health Check to Readiness Probe for Workers on Jan 12, 2024