-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RayService] Submit requests to the Dashboard after the head Pod is running and ready #1074
[RayService] Submit requests to the Dashboard after the head Pod is running and ready #1074
Conversation
cc @msumitjain would you mind giving some feedbacks? Thanks! |
@@ -903,7 +903,7 @@ func (r *RayClusterReconciler) updateStatus(instance *rayiov1alpha1.RayCluster) | |||
if !isValid { | |||
instance.Status.State = rayiov1alpha1.Unhealthy | |||
} else { | |||
if utils.CheckAllPodsRunnning(runtimePods) { | |||
if utils.CheckAllPodsRunning(runtimePods) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch
// Check if head pod is running and ready. If not, requeue the resource event to avoid | ||
// redundant custom resource status updates. | ||
// | ||
// TODO (kevin85421): Note that the Dashboard and GCS may take a few seconds to start up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is TODO here? Do you expect to make a fix in future to wait for Dashboard agent before UpdateDeployments
? if not remove this TODO
prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you expect to make a fix in the future to wait for the Dashboard agent before UpdateDeployments
Yes. The behavior is correct without the TODO. However, it may cause unnecessary network traffic.
return false, err | ||
} | ||
|
||
if len(podList.Items) != 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when do we expect this case? having multiple head nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in KubeRay, it is impossible to have more than one head Pod for a RayCluster. If there is more than one Pod that fulfills the matching labels, it may be caused by either a KubeRay bug or a user misconfiguration.
client.MatchingLabels{common.RayClusterLabelKey: instance.Name, common.RayNodeTypeLabelKey: string(rayv1alpha1.HeadNode)}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! This change looks good to me pending @gvspraveen's comments.
Thanks @kevin85421 for picking this up. I will test them tomorrow and let you know. |
Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
|
There are rayiov1alpha1, rayv1alpha1, and v1alpha1 in the codebase, and I would like to ensure consistency across them. This can be a good first issue for new contributors. |
…unning and ready (ray-project#1074) Submit requests to the Dashboard after the head Pod is running and ready
Why are these changes needed?
In #1065, the RayCluster initialization resulted in 47 status updates due to the controller sending HTTP requests to the Dashboard to create Serve Deployments before it was ready to handle requests. See this gist for more details.
updateState {"error": ...: connect: connection refused"}
=> Send requests to the Dashboard before it was ready to handle requests.r.Status().Update() isHealthy && !isReady
=> The Serve Deployments are created but not ready.In this PR, the RayService controller will submit the HTTP request to the Dashboard after the head Pod is running and ready. Note that the Dashboard and GCS may take a few seconds to start up after the head pod is running and ready. Hence, some requests to the Dashboard (e.g.
UpdateDeployments
) may fail. This is not a big issue sinceUpdateDeployments
is an idempotent operation.Compare with RayJob
Unlike RayJob, RayJob will submit a job to the RayCluster when the cluster is
ready
(by definition, "ready" means that all Pods are running, but the implementation may be a bit different.).kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 173 to 174 in 795db0d
The reasons are:
Hence, if RayService only updates Ray Serve Deployments when all Pods are running and ready, it may be risky.
Why do I update rayservice_controller_test.go?
Prior to this PR, the controller would submit an HTTP request to the Dashboard regardless of the status of the head Pod. As a result, the fake Dashboard client would create Ray Serve deployments with a
HEALTHY
status. However, in a real Kubernetes cluster, it is impossible for Ray Serve Deployments to be HEALTHY before the Dashboard is ready.r.getAndCheckServeStatus
->r.updateRayClusterInfo
-> TurnPendingRayCluster
intoActiveRayCluster
.kuberay/ray-operator/controllers/ray/rayservice_controller_test.go
Lines 553 to 559 in 795db0d
With this PR, the function
r.getAndCheckServeStatus
will not be executed if the head Pod is not running and ready. However,envtest
doesn't create a full K8s cluster. It only has the control plane. There's no container runtime or any other Kubernetes controllers. Pod's status will not be updated by Kubernetes built-in Pod controller, and thus the status will always be "pending". Hence, we need to manually update the status from "pending" to "running". The functionr.getAndCheckServeStatus
will be executed after the head Pod becomes running and ready, and then turnPendingRayCluster
intoActiveRayCluster
. See https://book.kubebuilder.io/reference/envtest.html for more details.Other updates
Typo:
CheckAllPodsRunnning
->CheckAllPodsRunning
rayservice_controller_test.go => rayiov1alpha1 -> v1alpha1
Related issue number
#1061
#1062
#1065
#983
Checks
I performed an experiment similar to #1062.