
[Bug] RayService restarts repeatedly with Autoscaler #1037

Merged
merged 12 commits into ray-project:master on May 4, 2023

Conversation

@kevin85421 (Member) commented on Apr 20, 2023:

Why are these changes needed?

PR #655 avoids the creation of a new RayCluster when an active RayCluster triggers Ray autoscaling. However, it is possible for a pending RayCluster to trigger Ray autoscaling. See the section "Case 1 (bug): Trigger Autoscaler scale-up when the RayCluster is pending" in #1030 as an example.

What's the difference between #655 and this PR?

For #655,

  • When a RayService creates a RayCluster, it will add an annotation ray.io/cluster-hash with the hash value of the "whole" RayService.Spec.RayClusterSpec.

  • Active RayCluster: Compare the hash value of RayService.Spec.RayClusterSpec and the value of the annotation ray.io/cluster-hash from the active RayCluster. Create a new RayCluster if they are not equal.

    • Ray autoscaler updates only the RayCluster spec and not the RayService.Spec.RayClusterSpec. Therefore, the creation of a new RayCluster will not be triggered when an active one triggers Ray autoscaling.
  • Pending RayCluster: Compare pendingRayCluster.Spec and RayService.Spec.RayClusterSpec with the function CompareJsonStruct. Hence, when pendingRayCluster.Spec is updated by Ray autoscaling, the creation of a new pending RayCluster will be triggered.

For #1037 (this PR),

  • When a RayService creates a RayCluster, it will add an annotation ray.io/cluster-hash with the hash value of "some parts" of RayService.Spec.RayClusterSpec.
  • Active RayCluster: Compare the hash value of RayService.Spec.RayClusterSpec with the value of the annotation ray.io/cluster-hash on the active RayCluster.
  • Pending RayCluster: Compare the hash values of pendingRayCluster.Spec and RayService.Spec.RayClusterSpec.

The hash generation differs between #655 and #1037.

PR #655 generates a hash value based on the whole RayClusterSpec. On the other hand, in PR #1037, some fields are set to nil before generating the hash.
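Here is a minimal sketch of the idea (simplified stand-in types and illustrative field names, not the actual KubeRay code): the autoscaler-managed fields are cleared before the spec is serialized and hashed, so updates to those fields do not change the hash.

package main

import (
	"crypto/sha1"
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the real rayv1alpha1 types.
type WorkerGroupSpec struct {
	GroupName       string   `json:"groupName"`
	Replicas        *int32   `json:"replicas,omitempty"`
	WorkersToDelete []string `json:"workersToDelete,omitempty"`
	Image           string   `json:"image"`
}

type RayClusterSpec struct {
	RayVersion       string            `json:"rayVersion"`
	WorkerGroupSpecs []WorkerGroupSpec `json:"workerGroupSpecs"`
}

// hashWithoutAutoscalerFields clears Replicas and WorkersToDelete before
// hashing, so autoscaler-driven updates do not change the hash value stored
// in the ray.io/cluster-hash annotation.
func hashWithoutAutoscalerFields(spec RayClusterSpec) (string, error) {
	filtered := spec
	filtered.WorkerGroupSpecs = make([]WorkerGroupSpec, len(spec.WorkerGroupSpecs))
	copy(filtered.WorkerGroupSpecs, spec.WorkerGroupSpecs)
	for i := range filtered.WorkerGroupSpecs {
		filtered.WorkerGroupSpecs[i].Replicas = nil
		filtered.WorkerGroupSpecs[i].WorkersToDelete = nil
	}
	data, err := json.Marshal(filtered)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha1.Sum(data)), nil
}

func main() {
	one, three := int32(1), int32(3)
	before := RayClusterSpec{RayVersion: "2.4.0", WorkerGroupSpecs: []WorkerGroupSpec{
		{GroupName: "small-group", Replicas: &one, Image: "rayproject/ray:2.4.0"},
	}}
	after := before
	after.WorkerGroupSpecs = []WorkerGroupSpec{
		{GroupName: "small-group", Replicas: &three, Image: "rayproject/ray:2.4.0"},
	}
	h1, _ := hashWithoutAutoscalerFields(before)
	h2, _ := hashWithoutAutoscalerFields(after)
	fmt.Println(h1 == h2) // true: a Replicas-only change does not trigger a new cluster
}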

In #655, every update to RayService.Spec.RayClusterSpec triggers the preparation of a new cluster. However, updates to RayService.Spec.RayClusterSpec should be categorized into three categories:

  • Case 1. RayService should ignore the update and do nothing (e.g. WorkersToDelete).
  • Case 2. RayService should not create a new RayCluster, but should update the existing RayCluster (e.g. Replicas).
  • Case 3. RayService should trigger the creation of a new RayCluster (e.g. updating the RayCluster's Docker image).

This PR can identify whether an update falls under Case 3, but it cannot differentiate between Case 1 and Case 2. In this PR, if an update does not belong to Case 3, the update is ignored.

For example, the update of WorkersToDelete for RayService.Spec.RayClusterSpec will be ignored. See the change of the test "should update a rayservice object and switch to new Ray Cluster" in rayservice_controller_test.go for more details.

In the follow-up PRs, we need to provide a clear definition to differentiate between Case 1 and Case 2.

Related issue number

Closes #1030

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Reproduce bug with v0.5.0

helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0
# path: ray-operator/config/samples/
# New RayCluster will be restarted by RayService repeatedly.
kubectl apply -f ray-service.autoscaler.yaml

Test this PR

helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0 --set image.repository=controller,image.tag=latest
# path: ray-operator/config/samples/
# New RayCluster preparation will not be triggered.
kubectl apply -f ray-service.autoscaler.yaml

# Update workerGroupSpecs[0].replicas from 1 to 3. (New cluster preparation should not happen.)
kubectl apply -f ray-service.autoscaler.yaml

# Update rayClusterConfig.rayVersion from '2.4.0' to '2.100.0'. (New cluster preparation should be triggered.)
kubectl apply -f ray-service.autoscaler.yaml

@kevin85421 changed the title from [WIP] [Bug] RayService restarts repeatedly with Autoscaler to [Bug] RayService restarts repeatedly with Autoscaler on Apr 24, 2023
@kevin85421 marked this pull request as ready for review on April 25, 2023 01:07
@sihanwang41 (Contributor) left a comment:


nice catch for this issue.

@@ -926,3 +927,31 @@ func (r *RayServiceReconciler) labelHealthyServePods(ctx context.Context, rayClu

return nil
}

func (r *RayServiceReconciler) generateRayClusterJsonHash(rayClusterSpec rayv1alpha1.RayClusterSpec) (string, error) {
A contributor commented:

Not sure how difficult it is, but can we add a unit test for this function?

@kevin85421 (Member, Author) replied:

Thank you for the suggestion! I have added a new unit test suite in dc7dd3e.
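For reference, here is an illustrative sketch of the kind of property such a test could check (not the committed test suite; it reuses the simplified types and the hashWithoutAutoscalerFields helper sketched in the PR description, placed in a _test.go file that imports "testing"):

func TestHashIgnoresAutoscalerManagedFields(t *testing.T) {
	one, three := int32(1), int32(3)
	base := RayClusterSpec{
		RayVersion: "2.4.0",
		WorkerGroupSpecs: []WorkerGroupSpec{
			{GroupName: "small-group", Replicas: &one, Image: "rayproject/ray:2.4.0"},
		},
	}
	baseHash, err := hashWithoutAutoscalerFields(base)
	if err != nil {
		t.Fatal(err)
	}

	// An autoscaler-style update: only Replicas and WorkersToDelete change.
	scaled := base
	scaled.WorkerGroupSpecs = []WorkerGroupSpec{
		{GroupName: "small-group", Replicas: &three, WorkersToDelete: []string{"worker-pod-a"}, Image: "rayproject/ray:2.4.0"},
	}
	scaledHash, _ := hashWithoutAutoscalerFields(scaled)
	if baseHash != scaledHash {
		t.Errorf("hash changed for an autoscaler-only update")
	}

	// A Case 3 update: changing the Ray version must change the hash.
	upgraded := base
	upgraded.RayVersion = "2.100.0"
	upgradedHash, _ := hashWithoutAutoscalerFields(upgraded)
	if baseHash == upgradedHash {
		t.Errorf("hash did not change for a Ray version update")
	}
}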

@shrekris-anyscale (Contributor) left a comment:

Great work so far! Thanks for the detailed summary in the PR description. I left a few suggestions.

@architkulkarni (Contributor) left a comment:

Looks good to me pending Shreyas's comments.

In the bug, what's the field that gets updated that triggers the restart? (Is it Replicas or WorkersToDelete, or both?)

@shrekris-anyscale (Contributor) left a comment:

Nice job! Thanks for addressing all my comments!

@kevin85421 (Member, Author) replied:

> In the bug, what's the field that gets updated that triggers the restart? (Is it Replicas or WorkersToDelete, or both?)

To explain this, we need to know that RayService.Spec.RayClusterSpec, pendingRayCluster.Spec, and activeRayCluster.Spec are different.

[Without this PR]
If the pending RayCluster triggers autoscaling and updates Replicas (scale up/down) or WorkersToDelete (scale down), the function utils.CompareJsonStruct(pendingRayCluster.Spec, rayServiceInstance.Spec.RayClusterSpec) will return false and trigger a new RayCluster preparation to replace the original pending RayCluster.

In ray-service.autoscaler.yaml, we have 5 serve deployments which require 1.4 CPUs in total, but we only have 1 head (0 CPU) and 1 worker (1 CPU). Hence, the autoscaler will be triggered to update Replicas and KubeRay will create a new worker Pod. At this moment, utils.CompareJsonStruct(pendingRayCluster.Spec, rayServiceInstance.Spec.RayClusterSpec) will return false.

@kevin85421 merged commit 52af139 into ray-project:master on May 4, 2023
19 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
RayService restarts repeatedly with Autoscaler
architkulkarni added a commit that referenced this pull request Dec 26, 2023
…cluster (#1734)

Previously, when the RayCluster spec of a RayService was updated, one of two things would happen:

  • A new cluster would be rolled out via "zero-downtime-upgrade", or
  • In the case where only the Replicas and WorkersToDelete fields changed, nothing would happen. (This behavior was added by #1037 ([Bug] RayService restarts repeatedly with Autoscaler) to prevent the Autoscaler from inadvertently triggering rollouts when modifying these fields.)

This PR adds a third case: if WorkerGroupSpecs is modified in the following specific way and it doesn't fall into the case above, then the RayService controller will update the RayCluster instance in place without rolling out a new one.

Here is the specific way that triggers the third case:

The existing worker groups are not modified except for Replicas and WorkersToDelete, and one or more entries are added to WorkerGroupSpecs.

In general, the update happens in two places:

  • For the active RayCluster
  • For the pending RayCluster

Either of these two clusters may see an update to the spec, so we must handle both places.
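A rough sketch of the resulting three-way decision (illustrative only, not the actual controller code; it reuses the simplified types and the hashWithoutAutoscalerFields helper sketched earlier in this thread):

// decideClusterAction sketches the three-way choice: ignore autoscaler-only
// changes, update the RayCluster in place when worker groups are only
// appended, and otherwise roll out a new cluster.
func decideClusterAction(goal, live RayClusterSpec) (string, error) {
	goalHash, err := hashWithoutAutoscalerFields(goal)
	if err != nil {
		return "", err
	}
	liveHash, err := hashWithoutAutoscalerFields(live)
	if err != nil {
		return "", err
	}
	if goalHash == liveHash {
		// Only Replicas/WorkersToDelete differ (e.g. autoscaler activity).
		return "DoNothing", nil
	}
	if len(goal.WorkerGroupSpecs) > len(live.WorkerGroupSpecs) {
		// Check whether the goal spec matches the live spec once the newly
		// appended worker groups are ignored.
		truncated := goal
		truncated.WorkerGroupSpecs = goal.WorkerGroupSpecs[:len(live.WorkerGroupSpecs)]
		truncatedHash, err := hashWithoutAutoscalerFields(truncated)
		if err != nil {
			return "", err
		}
		if truncatedHash == liveHash {
			return "UpdateInPlace", nil
		}
	}
	// Anything else triggers a zero-downtime upgrade to a new RayCluster.
	return "RolloutNewCluster", nil
}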

In a followup, we may add the following optimization: If an existing worker group is modified and one or more entries to WorkerGroupSpecs is added, we should reject the spec. This will require using an admission webhook or storing the previous spec somehow. (If we just store the hash as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)

Other followup issues for this PR:

  • [RayService] Refactor to unify cluster decision for active and pending RayClusters #1761
  • [RayService] [CI] Some tests for pending/active clusters may spuriously pass because head pod is not manually set to ready #1768
  • [RayService] [Enhancement] Avoid unnecessary pod deletion when updating RayCluster #1769


---------

Signed-off-by: Archit Kulkarni <archit@anyscale.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>