[RayService] Move HTTP Proxy's Health Check to Readiness Probe for Workers #1808
Conversation
return appStatus == rayv1.ApplicationStatusEnum.UNHEALTHY || appStatus == rayv1.ApplicationStatusEnum.DEPLOY_FAILED
}

func (r *RayServiceReconciler) getHeadPod(ctx context.Context, instance *rayv1.RayCluster) (*corev1.Pod, error) {
In the KubeRay codebase, several places need to retrieve the head Pod. We may move this function to util.go and then always use it to retrieve the head Pod, either in this PR or a follow-up PR.
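For illustration, such a shared helper in util.go could look roughly like the sketch below. This is not code from this PR; the label keys and import paths are assumptions based on KubeRay conventions.

```go
// Sketch only: a shared helper like the one suggested above, not code from
// this PR. The label keys ("ray.io/cluster", "ray.io/node-type") and import
// paths are assumptions based on KubeRay conventions.
package utils

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// GetHeadPod returns the single head Pod of the given RayCluster by listing
// Pods that carry the cluster-name and head node-type labels.
func GetHeadPod(ctx context.Context, c client.Client, instance *rayv1.RayCluster) (*corev1.Pod, error) {
	podList := corev1.PodList{}
	if err := c.List(ctx, &podList,
		client.InNamespace(instance.Namespace),
		client.MatchingLabels{
			"ray.io/cluster":   instance.Name,
			"ray.io/node-type": string(rayv1.HeadNode),
		},
	); err != nil {
		return nil, err
	}
	if len(podList.Items) != 1 {
		return nil, fmt.Errorf("expected exactly 1 head Pod, found %d", len(podList.Items))
	}
	return &podList.Items[0], nil
}
```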
// See https://github.com/ray-project/kuberay/pull/1808 for reasons.
if enableServeService && rayNodeType == rayv1.WorkerNode {
    rayContainer.ReadinessProbe.FailureThreshold = utils.ServeReadinessProbeFailureThreshold
    rayServeProxyHealthCommand := fmt.Sprintf(utils.BaseWgetHealthCommand,
Note: I executed the command `wget -T 2 -q -O- http://localhost:8000/-/healthz` in a RayService's worker Pod and received "success".
Great!
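To make the resulting behavior concrete, a worker container's readiness probe ends up roughly like the sketch below, running the same wget health check mentioned above against the local Serve HTTP proxy on port 8000. This is not the exact code in this PR; the timing and threshold values are placeholders rather than the constants the PR actually uses.

```go
// Sketch only, not the exact code in this PR: the worker container's readiness
// probe becomes an exec probe running the wget health check against the local
// Serve HTTP proxy. Timing and threshold values below are placeholders.
package sketch

import corev1 "k8s.io/api/core/v1"

func serveProxyReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{
					"bash", "-c",
					"wget -T 2 -q -O- http://localhost:8000/-/healthz",
				},
			},
		},
		InitialDelaySeconds: 10, // placeholder
		PeriodSeconds:       5,  // placeholder
		FailureThreshold:    1,  // placeholder for utils.ServeReadinessProbeFailureThreshold
	}
}
```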
Why are these changes needed?
Currently, for high availability of RayService, KubeRay checks each serving Pod's HTTP proxy health at every reconciliation. If the proxy is unhealthy, KubeRay changes the corresponding Pod's label so that the Pod is no longer selected by the serve service (see the sketch below).
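For context, a minimal sketch of that pre-PR mechanism, assuming KubeRay's conventional `ray.io/serve` label and a controller-runtime client (neither is taken verbatim from this PR):

```go
// Sketch only of the pre-PR behavior described above, not the actual KubeRay
// code. The "ray.io/serve" label key is an assumption based on KubeRay
// conventions; the serve Service is assumed to select Pods with ray.io/serve=true.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markPodNotServing flips the serve label so the serve Service's selector
// stops matching the Pod and traffic is no longer routed to it.
func markPodNotServing(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	pod.Labels["ray.io/serve"] = "false"
	return c.Update(ctx, pod)
}
```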
This method has several shortcomings:
This PR mainly moves the HTTP proxy's health check to the readiness probe for all worker Pods, while still checking the head Pod's HTTP proxy health at every reconciliation. This has several benefits:
Note:
The reasons for not moving the HTTP proxy's health check to the head Pod's readiness probe are complex. Here are two of them:
After this PR, if a worker Pod has no Serve replica, it will also not be in the 'READY' state, since no HTTP proxy runs on it. This status should be interpreted as 'not ready for serving'. So don't be surprised if a worker Pod never becomes ready even though you have done nothing wrong; it may simply be that no Serve replica has been scheduled onto that worker Pod.
Additionally, this PR refactors some of the related code.
Related issue number
None
Checks
Tests for this change have already been added to CI.