[RayService] Track whether Serve app is ready before switching clusters #730
Conversation
@@ -21,6 +21,30 @@ const (
 	FailedToUpdateService ServiceStatus = "FailedToUpdateService"
 )
 
+// These statuses should match Ray Serve's application statuses
+var ApplicationStatus = struct {
nit: The naming ApplicationStatus here is a little tricky, since it's the same as the ApplicationStatus field of RayServiceStatus. They are different structs, so it could be a bit confusing.
Good catch! I renamed ApplicationStatus to ApplicationStatusEnum and DeploymentStatus to DeploymentStatusEnum.
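For reference, a minimal sketch of the renamed enum-as-struct pattern in Go. The exact field set is an assumption based on the Serve application statuses named elsewhere in this PR (NOT_STARTED, DEPLOYING, etc.), not the authoritative list from the KubeRay API types:

```go
// Sketch only: a struct-of-strings "enum" mirroring Ray Serve's
// application statuses. Field names are assumptions based on the
// statuses mentioned in this PR.
var ApplicationStatusEnum = struct {
	NOT_STARTED   string
	DEPLOYING     string
	RUNNING       string
	DEPLOY_FAILED string
	UNHEALTHY     string
}{
	NOT_STARTED:   "NOT_STARTED",
	DEPLOYING:     "DEPLOYING",
	RUNNING:       "RUNNING",
	DEPLOY_FAILED: "DEPLOY_FAILED",
	UNHEALTHY:     "UNHEALTHY",
}
```

Grouping the constants in an anonymous struct gives callers a namespaced lookup like ApplicationStatusEnum.RUNNING, which avoids the name clash with the RayServiceStatus.ApplicationStatus field noted above.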
@@ -138,7 +138,7 @@ func (r *RayServiceReconciler) Reconcile(ctx context.Context, request ctrl.Reque
 		}
 	} else if activeRayClusterInstance != nil && pendingRayClusterInstance != nil {
 		if err = r.updateStatusForActiveCluster(ctx, rayServiceInstance, activeRayClusterInstance, logger); err != nil {
-			logger.Error(err, "The updating of the status for active ray cluster while we have pending cluster failed")
+			logger.Error(err, "Active Ray cluster's status update failed.")
"Update active ray cluster's status failed." may be better
I changed it to "Failed to update active Ray cluster's status."
Thanks @shrekris-anyscale for this contribution. The PR overall LGTM. By the way, do you have any diagram showing the state machine of the RayService, so we can update the KubeRay GitHub docs?
Thanks for reviewing the change @scarlet25151! I don't have a state machine diagram for RayService.
LGTM. Please rerun make manifests to sync up the API changes so the tests pass, and then we can merge it.
Please don't merge this yet. I believe there's an issue I still need to resolve.
lgtm
Thanks @brucez-anyscale for your help! @simon-mo @scarlet25151 there are some new changes. Please take one more look. If it looks good to you, it should be ready to merge.
Took another pass, great work!
…configuration framework (#759)

Refactors for integration tests:

* Test operator chart: This PR uses the kuberay-operator chart to install the KubeRay operator. Hence, the operator chart is tested.
* Refactor: class CONST and class KubernetesClusterManager should be singleton classes. However, the singleton design pattern is not encouraged, so we need to consider it thoroughly before we convert these two classes into singleton classes.
* Refactor: Replace os with subprocess. From Python's official documentation: "The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes."
* Skip test_kill_head due to "[Bug] Head pod is deleted rather than restarted when gcs_server on head pod is killed" #638 and "[Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed" #634.
* Refactor: Replace all existing k8s api clients with K8S_CLUSTER_MANAGER.
* Refactor and relieve flakiness of test_ray_serve_work: working_dir is out-of-date (see this comment for more details), but the tests sometimes pass due to an error in the original test logic. Solution: update working_dir in ray-service.yaml.template. To elaborate, the error in the test logic mentioned above is that it only checks the exit code rather than STDOUT. When Pods are READY and RUNNING, RayService still needs tens of seconds to be ready for serving requests. The time.sleep(60) call is a workaround and should be removed when "[RayService] Track whether Serve app is ready before switching clusters" #730 is merged.
* Remove the NodePort service in RayServiceTestCase. Use a curl Pod to communicate with Ray via the ClusterIP service directly. Originally, using a Docker container with network_mode='host' and a NodePort service seemed very weird to me.
* Refactor: remove the unused RayService templates ray-service-cluster-update.yaml.template and ray-service-serve-update.yaml.template. The original buggy test logic only checks the exit code rather than the STDOUT of the curl commands. Hence, the different templates are useless in RayServiceTestCase.
* Refactor: Because the APIServer is not tested by any test case, remove everything related to the APIServer docker image in the compatibility test.
[RayService] Track whether Serve app is ready before switching clusters (ray-project#730)

* Update error message
* Update comment
* Add ApplicationStatusType and DeploymentStatus
* Use string type for enums
* Use correct health check logic for app status
* Use enum instead of string
* Track whether Serve app is ready before switchover happens
* Improve Serve reconciliation logging statement
* Log isReady
* Move else clause
* Improve logging for ingress reconciliation
* Polish enableIngress log
* Rename ApplicationStatus and DeploymentStatus to ApplicationStatusEnum and DeploymentStatusEnum
* Revise log message
* Implement zero-downtime update
* Update RayServiceInstance's status when isReady but not isHealthy
* Update status directly
* Use cached statuses
* Delete dangling RayClusters after 60 seconds
* Add unit test for zero-downtime update
* Run make manifests
* Remove allServeDeploymentsHealthy
* Run make sync

Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Why are these changes needed?
The RayService operator checks whether a pending RayCluster's Serve application is "HEALTHY" before it sets it as the active cluster. There are a few issues with this approach:

1. "HEALTHY" is not a valid ServeApplicationStatus.
2. A Serve app can be healthy before it's ready to serve traffic (e.g. its status may be NOT_STARTED or still DEPLOYING). If the cluster switchover happens before the app is ready, incoming traffic is briefly dropped until the Serve app becomes ready.

This change updates getAndCheckServeStatus to track whether the Serve app isHealthy and isReady. With this change, the cluster switchover only happens when the Serve app is both healthy and ready. It only initiates a cluster restart if the application is unhealthy (not if it's unready).

This change also makes the RayService operator delete dangling RayClusters after a 60-second grace period, instead of immediately. That gives the ingress enough time to switch traffic to the new cluster without any downtime.
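A minimal sketch of the gating described above, assuming the ApplicationStatusEnum declaration discussed earlier in this thread; the actual getAndCheckServeStatus logic in the operator is more involved (it also inspects per-deployment statuses):

```go
// Illustrative only: derive isHealthy and isReady from a Serve
// application status string, then gate the switchover on both.
// Status names are assumptions based on this PR's discussion.
func appHealthAndReadiness(appStatus string) (isHealthy, isReady bool) {
	switch appStatus {
	case ApplicationStatusEnum.RUNNING:
		return true, true // healthy and ready: safe to switch clusters
	case ApplicationStatusEnum.NOT_STARTED, ApplicationStatusEnum.DEPLOYING:
		return true, false // healthy so far, but not yet serving traffic
	default:
		return false, false // unhealthy: triggers a cluster restart
	}
}
```

With this split, an unready-but-healthy app simply delays the switchover; only an unhealthy app triggers a restart, matching the behavior described above.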
Uptime Impact

This change's impact was measured with a Serve app that deploys a single deployment with two replicas. Each deployment replica sleeps for 15 seconds before becoming healthy. See this gist for the code and config.
The Serve app was tested on GKE with a Locust workload that reopened connections for each request. It created 100 users with a spawn rate of 50 users/second.
Without the change, there were distinct periods of downtime where all requests failed since the cluster switchover happened before the Serve app was ready.
With the change, there were no failures after two RayCluster config updates.

Follow-up Changes
Kubernetes provides liveness and readiness probes to control which pods can serve traffic. Ideally, this status-checking logic should be pushed into the probes, so that if HTTPProxies temporarily crash, their pods can temporarily stop serving traffic. However, the RayService operator relies on the liveness and readiness probes to check if the RayCluster is ready to accept Serve deployments. This should be refactored, so the RayService's probes can track Serve application-level behavior.
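For illustration, a readiness probe of the kind suggested here could be attached to Serve pods from the operator's Go code. Everything below is a hypothetical sketch: the package name, endpoint path, and port are placeholders, not Serve's actual health API.

```go
package controllers // hypothetical placement alongside the operator code

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Hypothetical readiness probe: if the HTTPProxy on this pod stops
// answering, Kubernetes stops routing Service traffic to the pod
// until it recovers. Path and port are illustrative placeholders.
func serveReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/-/ready",           // placeholder endpoint
				Port: intstr.FromInt(8000), // placeholder Serve HTTP port
			},
		},
		PeriodSeconds:    5,
		FailureThreshold: 3,
	}
}
```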
Related issue number

Addresses #667