
[RayService] Submit requests to the Dashboard after the head Pod is running and ready #1074

Merged
merged 4 commits into ray-project:master on May 11, 2023

Conversation

@kevin85421 (Member) commented May 10, 2023

Why are these changes needed?

In #1065, RayCluster initialization resulted in 47 status updates because the controller sent HTTP requests to the Dashboard to create Serve deployments before the Dashboard was ready to handle them. See this gist for more details.

  • updateState {"error": ...: connect: connection refused"} => The controller sent requests to the Dashboard before it was ready to handle them.
  • r.Status().Update() isHealthy && !isReady => The Serve deployments were created but not yet ready.

In this PR, the RayService controller submits HTTP requests to the Dashboard only after the head Pod is running and ready. Note that the Dashboard and GCS may take a few seconds to start up after the head Pod becomes running and ready, so some requests to the Dashboard (e.g. UpdateDeployments) may still fail. This is not a big issue, since UpdateDeployments is an idempotent operation.
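
As an illustration, here is a minimal sketch of such a readiness gate; the helper and package names are my assumptions for this example, not the exact KubeRay code:

package utils

import (
	corev1 "k8s.io/api/core/v1"
)

// isPodRunningAndReady reports whether a Pod is in the Running phase and its
// PodReady condition is True. This is the condition this PR requires of the
// head Pod before any request is sent to the Dashboard. (Illustrative only.)
func isPodRunningAndReady(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodRunning {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}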

Compare with RayJob

Unlike RayService, RayJob submits a job to the RayCluster only when the cluster is ready (by definition, "ready" means that all Pods are running, although the implementation may differ slightly).

// Check the current status of ray cluster before submitting.
if rayClusterInstance.Status.State != rayv1alpha1.Ready {

The reasons are:

  • Typically, a RayService should be available 24/7. In addition, Ray Serve deployments may scale up or down based on QPS, which in turn causes the RayCluster to scale up or down via the Ray Autoscaler. Updating Ray Serve deployments only when all Pods are running and ready could add significant overhead.
  • In addition, there may be edge cases that cause some Pods to crash unexpectedly.

Hence, it would be risky for RayService to update Ray Serve deployments only when all Pods are running and ready.

Why do I update rayservice_controller_test.go?

Prior to this PR, the controller would submit an HTTP request to the Dashboard regardless of the status of the head Pod. As a result, the fake Dashboard client would create Ray Serve deployments with a HEALTHY status. However, in a real Kubernetes cluster, it is impossible for Ray Serve Deployments to be HEALTHY before the Dashboard is ready.

  • r.getAndCheckServeStatus -> r.updateRayClusterInfo -> Turn PendingRayCluster into ActiveRayCluster.

func prepareFakeRayDashboardClient() utils.FakeRayDashboardClient {
	client := utils.FakeRayDashboardClient{}
	client.SetServeStatus(generateServeStatus(metav1.Now(), "HEALTHY"))
	return client
}

With this PR, the function r.getAndCheckServeStatus will not be executed if the head Pod is not running and ready. However, envtest does not create a full Kubernetes cluster; it only runs the control plane, with no container runtime or other built-in controllers. A Pod's status is therefore never updated by the built-in Pod controller, and it stays "Pending" forever. Hence, the test needs to manually update the head Pod's status from "Pending" to "Running". Once the head Pod becomes running and ready, r.getAndCheckServeStatus is executed and turns PendingRayCluster into ActiveRayCluster. See https://book.kubebuilder.io/reference/envtest.html for more details.
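
As a rough sketch of that manual status update (illustrative helper and package names, assuming the usual envtest setup with a controller-runtime client):

package controllers_test

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markPodRunningAndReady manually flips a Pod's status to Running/Ready.
// envtest has no kubelet, so the test has to set this status itself.
func markPodRunningAndReady(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	pod.Status.Phase = corev1.PodRunning
	pod.Status.Conditions = []corev1.PodCondition{
		{Type: corev1.PodReady, Status: corev1.ConditionTrue},
	}
	return c.Status().Update(ctx, pod)
}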

Other updates

Related issue number

#1061
#1062
#1065
#983

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

I performed an experiment similar to #1062.

  • Status updates during RayCluster initialization: 47 -> 13
2023-05-10T06:44:04.862Z	INFO	controllers.RayService	r.Status().Update() Update RayService Status since reconcileRayCluster may mark RayCluster restart.
2023-05-10T06:44:06.908Z	INFO	controllers.RayService	r.Status().Update() updateState	{"error": "Service \"rayservice-sample-raycluster-c6d8t-dashboard-svc\" not found"}
2023-05-10T06:44:07.046Z	INFO	controllers.RayService	r.Status().Update() updateState	{"error": "Service \"rayservice-sample-raycluster-c6d8t-dashboard-svc\" not found"}
2023-05-10T06:45:27.955Z	INFO	controllers.RayService	r.Status().Update() updateState	{"error": "Put \"http://rayservice-sample-raycluster-c6d8t-dashboard-svc.default.svc.cluster.local:52365/api/serve/deployments/\": dial tcp 10.96.243.100:52365: connect: connection refused"}
2023-05-10T06:45:28.978Z	INFO	controllers.RayService	r.Status().Update() updateState	{"error": "Put \"http://rayservice-sample-raycluster-c6d8t-dashboard-svc.default.svc.cluster.local:52365/api/serve/deployments/\": dial tcp 10.96.243.100:52365: connect: connection refused"}
2023-05-10T06:45:31.002Z	INFO	controllers.RayService	r.Status().Update() updateState	{"error": "Put \"http://rayservice-sample-raycluster-c6d8t-dashboard-svc.default.svc.cluster.local:52365/api/serve/deployments/\": dial tcp 10.96.243.100:52365: connect: connection refused"}
2023-05-10T06:45:33.060Z	INFO	controllers.RayService	r.Status().Update() updateState	{"error": "Put \"http://rayservice-sample-raycluster-c6d8t-dashboard-svc.default.svc.cluster.local:52365/api/serve/deployments/\": dial tcp 10.96.243.100:52365: connect: connection refused"}
2023-05-10T06:45:42.919Z	INFO	controllers.RayService	r.Status().Update() isHealthy && !isReady
2023-05-10T06:45:42.963Z	INFO	controllers.RayService	r.Status().Update() isHealthy && !isReady
2023-05-10T06:45:44.958Z	INFO	controllers.RayService	r.Status().Update() isHealthy && !isReady
2023-05-10T06:45:46.995Z	INFO	controllers.RayService	r.Status().Update() isHealthy && !isReady
2023-05-10T06:45:49.060Z	INFO	controllers.RayService	r.Status().Update() isHealthy && !isReady
2023-05-10T06:45:51.102Z	INFO	controllers.RayService	r.Status().Update() isHealthy && !isReady
2023-05-10T06:45:53.261Z	INFO	controllers.RayService	r.Status().Update() Final status update for any CR modification.

@kevin85421 kevin85421 changed the title WIP [RayService] Submit requests to the Dashboard after the head Pod is running and ready May 10, 2023
@kevin85421 kevin85421 changed the title [RayService] Submit requests to the Dashboard after the head Pod is running and ready [WIP][RayService] Submit requests to the Dashboard after the head Pod is running and ready May 10, 2023
@kevin85421 kevin85421 changed the title [WIP][RayService] Submit requests to the Dashboard after the head Pod is running and ready [RayService] Submit requests to the Dashboard after the head Pod is running and ready May 10, 2023
@kevin85421 kevin85421 marked this pull request as ready for review May 10, 2023 07:12
@kevin85421 (Member Author):

cc @msumitjain would you mind giving some feedback? Thanks!

@@ -903,7 +903,7 @@ func (r *RayClusterReconciler) updateStatus(instance *rayiov1alpha1.RayCluster)
 	if !isValid {
 		instance.Status.State = rayiov1alpha1.Unhealthy
 	} else {
-		if utils.CheckAllPodsRunnning(runtimePods) {
+		if utils.CheckAllPodsRunning(runtimePods) {
Contributor:

nice catch

// Check if head pod is running and ready. If not, requeue the resource event to avoid
// redundant custom resource status updates.
//
// TODO (kevin85421): Note that the Dashboard and GCS may take a few seconds to start up
Contributor:

What is the TODO here? Do you expect to make a fix in the future to wait for the Dashboard agent before UpdateDeployments? If not, remove the TODO prefix.

Member Author:

Do you expect to make a fix in the future to wait for the Dashboard agent before UpdateDeployments

Yes. The behavior is correct without the TODO. However, it may cause unnecessary network traffic.
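
For context, one way the TODO might eventually be addressed is to probe the Dashboard before calling UpdateDeployments. A purely illustrative sketch (the GET probe and helper name are assumptions, not existing KubeRay APIs; only the endpoint path appears in the logs above):

package utils

import (
	"net/http"
	"time"
)

// isDashboardReachable probes the Dashboard with a cheap GET so the controller
// can skip UpdateDeployments calls that would predictably fail with
// "connection refused" while the Dashboard and GCS are still starting up.
func isDashboardReachable(baseURL string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(baseURL + "/api/serve/deployments/")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}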

return false, err
}

if len(podList.Items) != 1 {
Contributor:

When do we expect this case? Having multiple head nodes?

Member Author:

In KubeRay, a RayCluster should never have more than one head Pod. If more than one Pod fulfills the matching labels, it is likely caused by either a KubeRay bug or a user misconfiguration.

client.MatchingLabels{common.RayClusterLabelKey: instance.Name, common.RayNodeTypeLabelKey: string(rayv1alpha1.HeadNode)}

@shrekris-anyscale (Contributor) left a comment:

Nice work! This change looks good to me pending @gvspraveen's comments.

@msumitjain (Contributor):

cc @msumitjain would you mind giving some feedback? Thanks!

Thanks @kevin85421 for picking this up. I will test them tomorrow and let you know.

kevin85421 and others added 2 commits May 10, 2023 15:41
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
@architkulkarni architkulkarni self-assigned this May 10, 2023
@architkulkarni (Contributor) left a comment:

Looks great!

@architkulkarni (Contributor):

rayiov1alpha1 -> v1alpha1: what's the reason for this, out of curiosity?

@kevin85421 (Member Author):

rayiov1alpha1 -> v1alpha1: what's the reason for this, out of curiosity?

There are rayiov1alpha1, rayv1alpha1, and v1alpha1 in the codebase, and I would like to ensure consistency across them. This can be a good first issue for new contributors.

@kevin85421 kevin85421 merged commit b0649c4 into ray-project:master May 11, 2023
19 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
[RayService] Submit requests to the Dashboard after the head Pod is running and ready (ray-project#1074)

Submit requests to the Dashboard after the head Pod is running and ready