[RayService] Track whether Serve app is ready before switching clusters #730

Merged

Conversation

shrekris-anyscale
Contributor

@shrekris-anyscale shrekris-anyscale commented Nov 15, 2022

Why are these changes needed?

The RayService operator checks whether a pending RayCluster's Serve application is "HEALTHY" before it sets it as the active cluster. There are a few issues with this approach:

  1. "HEALTHY" is not a valid Serve ApplicationStatus.
  2. A Serve application can be healthy but not ready (e.g. if it's NOT_STARTED or still DEPLOYING). If the cluster switchover happens before the app is ready, incoming traffic is briefly dropped until the Serve app becomes ready.

This change updates getAndCheckServeStatus to track whether the Serve app is healthy (isHealthy) and ready (isReady). With this change, the cluster switchover happens only when the Serve app is both healthy and ready, and a cluster restart is initiated only if the application is unhealthy (not merely unready).
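
As a rough illustration of the healthy/ready split (the status strings and helper name below are illustrative, not the PR's exact code), the mapping amounts to:

```go
package main

import "fmt"

// Illustrative status strings mirroring Ray Serve's application statuses;
// the actual constant and helper names in KubeRay may differ.
const (
	statusNotStarted = "NOT_STARTED"
	statusDeploying  = "DEPLOYING"
	statusRunning    = "RUNNING"
	statusUnhealthy  = "UNHEALTHY"
)

// appState maps a Serve application status to (isHealthy, isReady):
// the app is unhealthy only when Serve reports UNHEALTHY, and it is ready
// only once it reaches RUNNING. NOT_STARTED and DEPLOYING are healthy but
// not yet ready, so no switchover should happen in those states.
func appState(status string) (isHealthy, isReady bool) {
	isHealthy = status != statusUnhealthy
	isReady = status == statusRunning
	return isHealthy, isReady
}

func main() {
	for _, s := range []string{statusNotStarted, statusDeploying, statusRunning, statusUnhealthy} {
		healthy, ready := appState(s)
		fmt.Printf("%-11s healthy=%t ready=%t\n", s, healthy, ready)
	}
}
```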

This change also makes the RayService operator delete dangling RayClusters after a 60-second grace period instead of immediately. That gives the ingress enough time to switch traffic to the new cluster without any downtime.
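
Conceptually, the grace period is just a timestamp comparison during reconciliation. Here is a minimal sketch, assuming the operator records when a cluster became dangling (the constant and function names here are hypothetical, not the PR's actual fields):

```go
package main

import (
	"fmt"
	"time"
)

// Grace period before a dangling RayCluster is deleted, giving the ingress
// time to cut traffic over to the new cluster. (Hypothetical name.)
const rayClusterDeletionDelay = 60 * time.Second

// shouldDeleteDanglingCluster reports whether a cluster that is no longer
// active or pending has been dangling longer than the grace period.
func shouldDeleteDanglingCluster(markedDanglingAt, now time.Time) bool {
	return now.Sub(markedDanglingAt) >= rayClusterDeletionDelay
}

func main() {
	marked := time.Now().Add(-45 * time.Second)
	// Still inside the 60-second window, so the cluster is kept for now.
	fmt.Println(shouldDeleteDanglingCluster(marked, time.Now())) // false
}
```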

Uptime Impact

This change's impact was measured with a Serve app that deploys a single deployment with two replicas. Each deployment replica sleeps for 15 seconds before becoming healthy. See this gist for the code and config.

The Serve app was tested on GKE with a Locust workload that reopened connections for each request. It created 100 users with a spawn rate of 50 users/second.

Without the change, there were distinct periods of downtime where all requests failed since the cluster switchover happened before the Serve app was ready.

[Figure: "failures" (Screen Shot 2022-11-22 at 5:50:44 PM)]

With the change, there were no failures after two RayCluster config updates:

[Figure: "zero_downtime" (Screen Shot 2022-11-22 at 5:43:51 PM)]

Follow-up Changes

Kubernetes provides liveness and readiness probes to control which pods can serve traffic. Ideally, this status-checking logic should be pushed into those probes, so that if HTTPProxies temporarily crash, their pods temporarily stop serving traffic. However, the RayService operator currently relies on the liveness and readiness probes to check whether the RayCluster is ready to accept Serve deployments. This should be refactored so that the RayService's probes can track Serve application-level behavior.

Related issue number

Addresses #667

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
      • This change adds a unit test for zero-downtime deployments.

@@ -21,6 +21,30 @@ const (
FailedToUpdateService ServiceStatus = "FailedToUpdateService"
)

// These statuses should match Ray Serve's application statuses
var ApplicationStatus = struct {
Collaborator

nit: The naming ApplicationStatus here is a little tricky since it's the same as the RayServiceStatus.ApplicationStatus field, but they are different structs, so it could be a bit confusing.

Contributor Author

@shrekris-anyscale shrekris-anyscale Nov 18, 2022

Good catch! I renamed ApplicationStatus to ApplicationStatusEnum and DeploymentStatus to DeploymentStatusEnum.
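
For context, the enum-as-struct pattern after the rename looks roughly like this (a sketch; the exact field set in KubeRay may differ from what is shown here):

```go
package main

import "fmt"

// Sketch of the enum-as-struct pattern after the rename. The grouped
// constants are distinct from the RayServiceStatus.ApplicationStatus field.
var ApplicationStatusEnum = struct {
	NOT_STARTED string
	DEPLOYING   string
	RUNNING     string
	UNHEALTHY   string
}{
	NOT_STARTED: "NOT_STARTED",
	DEPLOYING:   "DEPLOYING",
	RUNNING:     "RUNNING",
	UNHEALTHY:   "UNHEALTHY",
}

func main() {
	// Callers compare against the grouped constants, e.g.:
	fmt.Println(ApplicationStatusEnum.RUNNING == "RUNNING") // true
}
```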

@@ -138,7 +138,7 @@ func (r *RayServiceReconciler) Reconcile(ctx context.Context, request ctrl.Reque
}
} else if activeRayClusterInstance != nil && pendingRayClusterInstance != nil {
if err = r.updateStatusForActiveCluster(ctx, rayServiceInstance, activeRayClusterInstance, logger); err != nil {
logger.Error(err, "The updating of the status for active ray cluster while we have pending cluster failed")
logger.Error(err, "Active Ray cluster's status update failed.")
Collaborator

"Update active ray cluster's status failed." may be better

Contributor Author

I changed it to "Failed to update active Ray cluster's status."

@scarlet25151 scarlet25151 added this to the v0.4.0 release milestone Nov 17, 2022
@scarlet25151
Collaborator

Thanks @shrekris-anyscale for this contribution. The PR overall LGTM. By the way, do you have a diagram showing the RayService state machine so we can update the KubeRay GitHub docs?

@shrekris-anyscale
Contributor Author

shrekris-anyscale commented Nov 18, 2022

Thanks for reviewing the change @scarlet25151! I don't have a state machine diagram for RayServices, but I think that would be a great addition to the docs. Would you mind filing a GitHub issue, so we can track it?

Collaborator

@scarlet25151 scarlet25151 left a comment

LGTM. Please rerun make manifests to sync up the API changes so the tests pass, and then we can merge it.

@shrekris-anyscale
Contributor Author

Please don't merge this yet. I believe there's an issue I still need to resolve.

Contributor

@brucez-anyscale brucez-anyscale left a comment

lgtm

@shrekris-anyscale
Contributor Author

Thanks @brucez-anyscale for your help! @simon-mo @scarlet25151 there are some new changes. Please take one more look. If it looks good to you, it should be ready to merge.

Collaborator

@simon-mo simon-mo left a comment

Took another pass, great work!

@shrekris-anyscale shrekris-anyscale merged commit 7940407 into ray-project:master Nov 28, 2022
DmitriGekhtman pushed a commit that referenced this pull request Dec 1, 2022
…configuration framework (#759)

Refactors for integration tests --

Test operator chart: This PR uses the kuberay-operator chart to install the KubeRay operator. Hence, the operator chart itself is tested.

Refactor: class CONST and class KubernetesClusterManager should be singleton classes. However, the singleton design pattern is not generally encouraged, so we need to consider it thoroughly before converting these two classes into singletons.

Refactor: Replace os with subprocess. The following paragraph is from Python's official documentation.

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.

Skip test_kill_head due to:

[Bug] Head pod is deleted rather than restarted when gcs_server on head pod is killed. #638
[Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed. #634
Refactor: Replace all existing k8s API clients with K8S_CLUSTER_MANAGER.

Refactor and reduce flakiness of test_ray_serve_work:

working_dir is out of date (see this comment for more details), but the tests sometimes pass because of a flaw in the original test logic. => Solution: Update working_dir in ray-service.yaml.template.
To elaborate, the flaw in the test logic mentioned above is that it only checks the exit code rather than STDOUT.
When Pods are READY and RUNNING, RayService still needs tens of seconds before it can serve requests. The time.sleep(60) call is a workaround and should be removed once [RayService] Track whether Serve app is ready before switching clusters #730 is merged.
Remove the NodePort service in RayServiceTestCase and use a curl Pod to communicate with Ray directly via the ClusterIP service; using a Docker container with network_mode='host' plus a NodePort service was awkward.

Refactor: Remove the unused RayService templates ray-service-cluster-update.yaml.template and ray-service-serve-update.yaml.template. The original buggy test logic only checks the exit code rather than the STDOUT of the curl commands, so the different templates serve no purpose in RayServiceTestCase.

Refactor: Because the APIServer is not exercised by any test case, remove everything related to the APIServer docker image from the compatibility test.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…rs (ray-project#730)

* Update error message
* Update comment
* Add ApplicationStatusType and DeploymentStatus
* Use string type for enums
* Use correct health check logic for app status
* Use enum instead of string
* Track whether Serve app is ready before switchover happens
* Improve Serve reconciliation logging statement
* Log isReady
* Move else clause
* Improve logging for ingress reconciliation
* Polish enableIngress log
* Rename ApplicationStatus and DeploymentStatus to ApplicationStatusEnum and DeploymentStatusEnum
* Revise log message
* Implement zero-downtime update
* Update RayServiceInstance's status when isReady but not isHealthy
* Update status directly
* Use cached statuses
* Delete dangling RayClusters after 60 seconds
* Add unit test for zero-downtime update
* Run make manifests
* Remove allServeDeploymentsHealthy
* Run make sync

Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…configuration framework (ray-project#759)
