[RayService] Track whether Serve app is ready before switching clusters #730

Merged

Conversation

shrekris-anyscale
Contributor

@shrekris-anyscale shrekris-anyscale commented Nov 15, 2022

Why are these changes needed?

The RayService operator checks whether a pending RayCluster's Serve application is "HEALTHY" before it sets it as the active cluster. There are a few issues with this approach:

  1. "HEALTHY" is not a valid Serve ApplicationStatus.
  2. A Serve application can be healthy but not ready (e.g. if it's NOT_STARTED or still DEPLOYING). If the cluster switchover happens before the app is ready, incoming traffic is briefly dropped until the Serve app becomes ready.

This change updates getAndCheckServeStatus to track whether the Serve app is healthy (isHealthy) and ready (isReady). With this change, the cluster switchover happens only when the Serve app is both healthy and ready, and a cluster restart is initiated only if the application is unhealthy (not merely unready).
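
As a rough illustration of the healthy/ready split (the status strings and helper name below are illustrative, not the PR's exact code), the mapping amounts to:

```go
package main

import "fmt"

// Illustrative status strings mirroring Ray Serve's application statuses;
// the actual constant and helper names in KubeRay may differ.
const (
	statusNotStarted = "NOT_STARTED"
	statusDeploying  = "DEPLOYING"
	statusRunning    = "RUNNING"
	statusUnhealthy  = "UNHEALTHY"
)

// appState maps a Serve application status to (isHealthy, isReady):
// the app is unhealthy only when Serve reports UNHEALTHY, and it is ready
// only once it reaches RUNNING. NOT_STARTED and DEPLOYING are healthy but
// not yet ready, so no switchover should happen in those states.
func appState(status string) (isHealthy, isReady bool) {
	isHealthy = status != statusUnhealthy
	isReady = status == statusRunning
	return isHealthy, isReady
}

func main() {
	for _, s := range []string{statusNotStarted, statusDeploying, statusRunning, statusUnhealthy} {
		healthy, ready := appState(s)
		fmt.Printf("%-11s healthy=%t ready=%t\n", s, healthy, ready)
	}
}
```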

This change also makes the RayService operator delete dangling RayClusters after a 60-second grace period instead of immediately. That gives the ingress enough time to switch traffic to the new cluster without any downtime.
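
Conceptually, the grace period is just a timestamp comparison during reconciliation. Here is a minimal sketch, assuming the operator records when a cluster became dangling (the constant and function names here are hypothetical, not the PR's actual fields):

```go
package main

import (
	"fmt"
	"time"
)

// Grace period before a dangling RayCluster is deleted, giving the ingress
// time to cut traffic over to the new cluster. (Hypothetical name.)
const rayClusterDeletionDelay = 60 * time.Second

// shouldDeleteDanglingCluster reports whether a cluster that is no longer
// active or pending has been dangling longer than the grace period.
func shouldDeleteDanglingCluster(markedDanglingAt, now time.Time) bool {
	return now.Sub(markedDanglingAt) >= rayClusterDeletionDelay
}

func main() {
	marked := time.Now().Add(-45 * time.Second)
	// Still inside the 60-second window, so the cluster is kept for now.
	fmt.Println(shouldDeleteDanglingCluster(marked, time.Now())) // false
}
```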

Uptime Impact

This change's impact was measured with a Serve app that deploys a single deployment with two replicas. Each deployment replica sleeps for 15 seconds before becoming healthy. See this gist for the code and config.

The Serve app was tested on GKE with a Locust workload that reopened connections for each request. It created 100 users with a spawn rate of 50 users/second.

Without the change, there were distinct periods of downtime where all requests failed since the cluster switchover happened before the Serve app was ready.

[Figure: "failures" (Screen Shot 2022-11-22 at 5:50:44 PM)]

With the change, there were no failures after two RayCluster config updates:

[Figure: "zero_downtime" (Screen Shot 2022-11-22 at 5:43:51 PM)]

Follow-up Changes

Kubernetes provides liveness and readiness probes to control which pods can serve traffic. Ideally, this status-checking logic should be pushed into those probes, so that if HTTPProxies temporarily crash, their pods temporarily stop serving traffic. However, the RayService operator currently relies on the liveness and readiness probes to check whether the RayCluster is ready to accept Serve deployments. This should be refactored so that the RayService's probes can track Serve application-level behavior.

Related issue number

Addresses #667

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
      • This change adds a unit test for zero-downtime deployments.

@@ -21,6 +21,30 @@ const (
FailedToUpdateService ServiceStatus = "FailedToUpdateService"
)

// These statuses should match Ray Serve's application statuses
var ApplicationStatus = struct {
Collaborator

nit: The naming ApplicationStatus here is a little tricky since it's the same as the RayServiceStatus.ApplicationStatus field, but they are different structs, so it could be a bit confusing.

Contributor Author

@shrekris-anyscale shrekris-anyscale Nov 18, 2022

Good catch! I renamed ApplicationStatus to ApplicationStatusEnum and DeploymentStatus to DeploymentStatusEnum.
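
For context, the enum-as-struct pattern after the rename looks roughly like this (a sketch; the exact field set in KubeRay may differ from what is shown here):

```go
package main

import "fmt"

// Sketch of the enum-as-struct pattern after the rename. The grouped
// constants are distinct from the RayServiceStatus.ApplicationStatus field.
var ApplicationStatusEnum = struct {
	NOT_STARTED string
	DEPLOYING   string
	RUNNING     string
	UNHEALTHY   string
}{
	NOT_STARTED: "NOT_STARTED",
	DEPLOYING:   "DEPLOYING",
	RUNNING:     "RUNNING",
	UNHEALTHY:   "UNHEALTHY",
}

func main() {
	// Callers compare against the grouped constants, e.g.:
	fmt.Println(ApplicationStatusEnum.RUNNING == "RUNNING") // true
}
```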

@@ -138,7 +138,7 @@ func (r *RayServiceReconciler) Reconcile(ctx context.Context, request ctrl.Reque
}
} else if activeRayClusterInstance != nil && pendingRayClusterInstance != nil {
if err = r.updateStatusForActiveCluster(ctx, rayServiceInstance, activeRayClusterInstance, logger); err != nil {
logger.Error(err, "The updating of the status for active ray cluster while we have pending cluster failed")
logger.Error(err, "Active Ray cluster's status update failed.")
Collaborator

"Update active ray cluster's status failed." may be better

Contributor Author

I changed it to "Failed to update active Ray cluster's status."

@scarlet25151 scarlet25151 added this to the v0.4.0 release milestone Nov 17, 2022
@scarlet25151
Collaborator

Thanks @shrekris-anyscale for this contribution. The PR overall LGTM. By the way, do you have a diagram showing the RayService state machine so we can update the KubeRay GitHub docs?

@shrekris-anyscale
Contributor Author

shrekris-anyscale commented Nov 18, 2022

Thanks for reviewing the change @scarlet25151! I don't have a state machine diagram for RayServices, but I think that would be a great addition to the docs. Would you mind filing a GitHub issue, so we can track it?

Collaborator

@scarlet25151 scarlet25151 left a comment

LGTM. Please rerun make manifests to sync up the API changes so the tests pass, and then we can merge it.

@shrekris-anyscale
Contributor Author

Please don't merge this yet. I believe there's an issue I still need to resolve.

Contributor

@brucez-anyscale brucez-anyscale left a comment

lgtm

@shrekris-anyscale
Contributor Author

Thanks @brucez-anyscale for your help! @simon-mo @scarlet25151 there are some new changes. Please take one more look. If it looks good to you, it should be ready to merge.

Collaborator

@simon-mo simon-mo left a comment

Took another pass, great work!

@shrekris-anyscale shrekris-anyscale merged commit 7940407 into ray-project:master Nov 28, 2022
DmitriGekhtman pushed a commit that referenced this pull request Dec 1, 2022
…configuration framework (#759)

Refactors for integration tests --

Test operator chart: This PR uses the kuberay-operator chart to install the KubeRay operator. Hence, the operator chart itself is tested.

Refactor: class CONST and class KubernetesClusterManager should be singleton classes. However, the singleton design pattern is not generally encouraged, so we need to consider it thoroughly before converting these two classes into singletons.

Refactor: Replace os with subprocess. The following paragraph is from Python's official documentation.

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.

Skip test_kill_head due to:

[Bug] Head pod is deleted rather than restarted when gcs_server on head pod is killed. #638
[Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed. #634
Refactor: Replace all existing k8s API clients with K8S_CLUSTER_MANAGER.

Refactor and reduce flakiness of test_ray_serve_work:

working_dir is out of date (see this comment for more details), but the tests sometimes pass because of a flaw in the original test logic. => Solution: Update working_dir in ray-service.yaml.template.
To elaborate, the flaw in the test logic mentioned above is that it only checks the exit code rather than STDOUT.
When Pods are READY and RUNNING, RayService still needs tens of seconds before it can serve requests. The time.sleep(60) call is a workaround and should be removed once [RayService] Track whether Serve app is ready before switching clusters #730 is merged.
Remove the NodePort service in RayServiceTestCase and use a curl Pod to communicate with Ray directly via the ClusterIP service; using a Docker container with network_mode='host' plus a NodePort service was awkward.

Refactor: Remove the unused RayService templates ray-service-cluster-update.yaml.template and ray-service-serve-update.yaml.template. The original buggy test logic only checks the exit code rather than the STDOUT of the curl commands, so the different templates serve no purpose in RayServiceTestCase.

Refactor: Because the APIServer is not exercised by any test case, remove everything related to the APIServer docker image from the compatibility test.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…rs (ray-project#730)

* Update error message
* Update comment
* Add ApplicationStatusType and DeploymentStatus
* Use string type for enums
* Use correct health check logic for app status
* Use enum instead of string
* Track whether Serve app is ready before switchover happens
* Improve Serve reconciliation logging statement
* Log isReady
* Move else clause
* Improve logging for ingress reconciliation
* Polish enableIngress log
* Rename ApplicationStatus and DeploymentStatus to ApplicationStatusEnum and DeploymentStatusEnum
* Revise log message
* Implement zero-downtime update
* Update RayServiceInstance's status when isReady but not isHealthy
* Update status directly
* Use cached statuses
* Delete dangling RayClusters after 60 seconds
* Add unit test for zero-downtime update
* Run make manifests
* Remove allServeDeploymentsHealthy
* Run make sync

Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…configuration framework (ray-project#759)
