
[RayService] Allow updating WorkerGroupSpecs without rolling out new cluster #1734

Merged
merged 25 commits into master from worker-group on Dec 26, 2023

Conversation

architkulkarni (Contributor) commented Dec 11, 2023

Why are these changes needed?

Previously, when the RayCluster spec of a RayService was updated, one of two things would happen:

  1. A new cluster would be rolled out via "zero-downtime upgrade", or
  2. In the case where only the Replicas and WorkersToDelete fields changed, nothing would happen. (This behavior was added in #1037, "[Bug] RayService restarts repeatedly with Autoscaler", to prevent the Autoscaler from inadvertently triggering rollouts when modifying these fields.)

This PR adds a third case: If WorkerGroupSpecs is modified in the following specific way and it doesn't fall into the case above, then the RayService controller will update the RayCluster instance in place without rolling out a new one.

Here is the specific way that triggers the third case:

The existing worker groups are not modified except for Replicas and WorkersToDelete, and one or more entries are added to WorkerGroupSpecs.
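For illustration, here is a minimal sketch of this decision logic. The ClusterAction values and helper names are assumptions modeled on the getClusterAction signature shown later in the diff, not the exact code merged in this PR:

// Sketch only: classify a RayClusterSpec change as "do nothing", "update in
// place", or "roll out a new cluster". Names are illustrative.
package ray

import (
	"reflect"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

type ClusterAction int

const (
	DoNothing  ClusterAction = iota // only Replicas/WorkersToDelete changed (or nothing changed)
	Update                          // worker groups were appended; update the RayCluster in place
	RolloutNew                      // anything else changed; trigger a zero-downtime upgrade
)

// sameGroupIgnoringScaleFields reports whether two worker groups are identical
// once Replicas and ScaleStrategy.WorkersToDelete are ignored.
func sameGroupIgnoringScaleFields(oldGroup, newGroup rayv1.WorkerGroupSpec) bool {
	oldGroup.Replicas = nil
	newGroup.Replicas = nil
	oldGroup.ScaleStrategy.WorkersToDelete = nil
	newGroup.ScaleStrategy.WorkersToDelete = nil
	return reflect.DeepEqual(oldGroup, newGroup)
}

func decideClusterAction(oldSpec, newSpec rayv1.RayClusterSpec) ClusterAction {
	// Removing a worker group, or changing anything outside WorkerGroupSpecs
	// (for example the head group or RayVersion), still triggers a new cluster.
	if len(newSpec.WorkerGroupSpecs) < len(oldSpec.WorkerGroupSpecs) {
		return RolloutNew
	}
	oldRest := *oldSpec.DeepCopy()
	newRest := *newSpec.DeepCopy()
	oldRest.WorkerGroupSpecs = nil
	newRest.WorkerGroupSpecs = nil
	if !reflect.DeepEqual(oldRest, newRest) {
		return RolloutNew
	}
	// Existing worker groups must be untouched except for Replicas/WorkersToDelete.
	for i := range oldSpec.WorkerGroupSpecs {
		if !sameGroupIgnoringScaleFields(oldSpec.WorkerGroupSpecs[i], newSpec.WorkerGroupSpecs[i]) {
			return RolloutNew
		}
	}
	if len(newSpec.WorkerGroupSpecs) > len(oldSpec.WorkerGroupSpecs) {
		return Update // one or more worker groups were appended
	}
	return DoNothing
}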


In general, the updating happens in two places:

  1. For the active RayCluster
  2. For the pending RayCluster

Either of these two clusters may see an update to the spec, so we must handle both places.


In a follow-up, we may add the following optimization: if an existing worker group is modified and one or more entries are added to WorkerGroupSpecs, we should reject the spec. This will require using an admission webhook or storing the previous spec somehow. (If we just store the hash as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)
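For context, the hash-based comparison referenced here works roughly like the sketch below (the helper name and details are assumptions based on the generateRayClusterJsonHash function that appears later in the diff). Because only the digest is persisted, the previous spec itself cannot be recovered from it:

// Sketch only: zero out the fields that should not trigger a new RayCluster,
// JSON-marshal the result, and keep just the hash of it.
package ray

import (
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

func hashRayClusterSpec(spec rayv1.RayClusterSpec) (string, error) {
	muted := *spec.DeepCopy()
	for i := range muted.WorkerGroupSpecs {
		// Changes to these fields alone should not cause a rollout.
		muted.WorkerGroupSpecs[i].Replicas = nil
		muted.WorkerGroupSpecs[i].ScaleStrategy.WorkersToDelete = nil
	}
	data, err := json.Marshal(muted)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return base64.StdEncoding.EncodeToString(sum[:]), nil
}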

Other follow-up issues for this PR:

TODO:

  • Add/update rayservice_controller_test.go

Related issue number

Closes #1643

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
kubectl apply -f /Users/architkulkarni/kuberay/ray-operator/config/samples/ray-service.autoscaler.yaml
watch kubectl get pod

Wait for the worker pod to come up.

Add a new entry to the WorkerGroupSpec, copying the first one but changing the name to worker-group-2.

kubectl apply -f /Users/architkulkarni/kuberay/ray-operator/config/samples/ray-service.autoscaler.yaml
watch kubectl get pod

See the second worker pod come up.

Update worker-group-1 outside of Replicas/WorkersToDelete. For example, change the name to worker-group-111. At the same time, add another entry to WorkerGroupSpecs.

Watch a zero-downtime upgrade get triggered.

Finally, edit the Replicas field of a worker group. Watch that no pods get added or terminated.


Archit Kulkarni added 7 commits December 11, 2023 12:10
architkulkarni and others added 3 commits December 12, 2023 08:32
@architkulkarni architkulkarni marked this pull request as ready for review December 12, 2023 17:03
@kevin85421 kevin85421 self-assigned this Dec 12, 2023
Archit Kulkarni added 2 commits December 12, 2023 11:37
} else if clusterAction == Update {
// Update the active cluster.
r.Log.Info("Updating the active RayCluster instance.")
if activeRayCluster, err = r.constructRayClusterForRayService(rayServiceInstance, activeRayCluster.Name); err != nil {
kevin85421 (Member):

What will happen if there is inconsistency between the RayCluster and RayService's RayClusterSpec? For example, what if a worker group's replica count is set to 0 in RayClusterSpec, but the Ray Autoscaler has already scaled it up to 10? Updating the RayCluster may cause a lot of running Pods to be killed.

architkulkarni (Contributor, Author):

We don't do any special handling for this case, so the user-specified replica count will take precedence over the Ray Autoscaler's value.

One alternative would be to never update Replicas and WorkersToDelete when updating the RayCluster. The downside is that the user can then never override Replicas.

My guess is that the current approach (the first approach) is better, because the user should always have a way to set replicas, and this inconsistent case is an edge case, not the common case. But what are your thoughts?

kevin85421 (Member):

My guess is that the current approach (the first approach) is better

From users' perspectives, it is much better to disregard the users' settings regarding replicas than to delete a Pod that has any running Ray task or actor.

On second thought, if Ray Autoscaling is enabled, the Ray Autoscaler is the only decision maker for deleting Ray Pods after #1253. Hence, users can increase the number of Pods, but can't delete Pods by updating Replicas.

the user should always have a way to set replicas,

Users can update the RayCluster directly. In addition, setting Replicas for existing worker groups is not common; most RayService users use the Ray Autoscaler as well. If users still need to manually update Replicas with Ray Autoscaling enabled, the Ray Autoscaler needs to be improved.
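A hypothetical sketch of the alternative discussed above, i.e. carrying the live cluster's scale-related fields over into the newly constructed spec so that an in-place update never fights the Ray Autoscaler (the helper name is an assumption, not code from this PR):

// Hypothetical: copy Replicas and WorkersToDelete from the live RayCluster into
// the desired spec for worker groups that exist in both, so an in-place update
// does not override what the Ray Autoscaler has already decided.
package ray

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

func preserveScaleFields(desired *rayv1.RayClusterSpec, live rayv1.RayClusterSpec) {
	liveByName := make(map[string]rayv1.WorkerGroupSpec, len(live.WorkerGroupSpecs))
	for _, g := range live.WorkerGroupSpecs {
		liveByName[g.GroupName] = g
	}
	for i := range desired.WorkerGroupSpecs {
		if liveGroup, ok := liveByName[desired.WorkerGroupSpecs[i].GroupName]; ok {
			desired.WorkerGroupSpecs[i].Replicas = liveGroup.Replicas
			desired.WorkerGroupSpecs[i].ScaleStrategy.WorkersToDelete = liveGroup.ScaleStrategy.WorkersToDelete
		}
	}
}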

return err
}

// Update the fetched RayCluster with new changes
kevin85421 (Member):

We replace almost everything in currentRayCluster. Why do we need to get currentRayCluster? Do we need any information from it?

architkulkarni (Contributor, Author):

My first approach was to not get currentRayCluster, but then I got this error when calling r.Update():

rayclusters.ray.io \"rayservice-sample-raycluster-qb8z4\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

After some research it seemed that a common approach was to first get the current object, then apply changes to the object, and then call Update().

Do you know what the best practice is? Is it better to use Patch() here, or is there some third approach?

kevin85421 (Member):

rayclusters.ray.io "rayservice-sample-raycluster-qb8z4" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

Got it. This makes sense.

Do you know what the best practice is? Is it better to use Patch() here, or is there some third approach?

In my understanding, Patch isn't protected by the Kubernetes optimistic concurrency model. I don't understand the use case for Patch. We should avoid using Patch until we understand it more.
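For reference, the get-then-update pattern described in this thread looks roughly like the sketch below with the controller-runtime client (variable and function names are illustrative, not the exact code in this PR):

// Sketch only: fetch the live RayCluster so metadata.resourceVersion is
// populated, overwrite its spec, and call Update. If the object changed in
// between, the optimistic-concurrency check rejects the write and the
// reconciler retries on the next loop.
package ray

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

func updateRayClusterSpec(ctx context.Context, c client.Client, key types.NamespacedName, desired rayv1.RayClusterSpec) error {
	var current rayv1.RayCluster
	if err := c.Get(ctx, key, &current); err != nil {
		return err
	}
	current.Spec = desired
	return c.Update(ctx, &current)
}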

Archit Kulkarni added 2 commits December 14, 2023 10:48
kevin85421 (Member) left a comment:

Why does the PR description use RayCluster instead of RayService for manual testing?

architkulkarni (Contributor, Author) commented Dec 14, 2023:

Why does the PR description use RayCluster instead of RayService for manual testing?

Ah, that was a typo; it should say ray-service.autoscaler.yaml. Updated.

Archit Kulkarni added 3 commits December 14, 2023 12:33
architkulkarni (Contributor, Author):

@kevin85421 I've updated the PR and the PR description per our discussion offline. There is one change:

For the case where an existing worker group has a change outside of Replicas/WorkersToDelete AND we append new worker group(s) at the end, we had discussed that we should reject the new spec.

In this PR, we instead just do a rolling upgrade in this case. It's not a regression, and we can add the rejection behavior in the future in an independent PR if a user asks for it. This might require some more design, such as using an admission webhook or storing the previous spec somehow. (If we just store the hash as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)

The current state of the PR implements the minimal way to enable the feature request in the linked issue, so I think it can be merged as is.

kevin85421 (Member):

In this PR, we instead just do a rolling upgrade in this case. It's not a regression, and we can add the rejection behavior in the future in an independent PR if a user asks for it. This might require some more design, such as using an admission webhook or storing the previous spec somehow. (If we just store the hash as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)

Good point!

@@ -1137,8 +1245,41 @@ func (r *RayServiceReconciler) labelHealthyServePods(ctx context.Context, rayClu
return nil
}

func generateRayClusterJsonHash(rayClusterSpec rayv1.RayClusterSpec) (string, error) {
// Mute all fields that will not trigger new RayCluster preparation. For example,
func getClusterAction(old_spec rayv1.RayClusterSpec, new_spec rayv1.RayClusterSpec) (ClusterAction, error) {
kevin85421 (Member):

This function's coding style (old_spec -> oldSpec) is inconsistent.

architkulkarni (Contributor, Author):

Good call, fixed in e62ac34.

kevin85421 (Member):

The commit still has some snake_case style, e.g. newSpec_without_new_worker_groups.

architkulkarni (Contributor, Author):

Thanks, fixed in 8b0056b.

I wonder if there's a linter that can check for this

Expect(err).NotTo(HaveOccurred(), "failed to update test RayService resource")

// Confirm it didn't switch to a new RayCluster
Consistently(
kevin85421 (Member):

The RayCluster will only switch if the new head Pod is ready, and the Ray Serve applications in the new RayCluster are prepared to serve requests. There is no Pod controller in envtest. Hence, the head Pod will not become ready if you don't manually update it. We may need to check whether the pending RayCluster has been created, rather than whether the active RayCluster has changed.

I checked the existing tests. Some tests check whether the active RayCluster has changed to determine whether the rollout was triggered, without manually updating the head Pod status. Would you mind opening an issue to track the progress? Thanks!
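A rough sketch of the suggested check, in the Ginkgo/Gomega style of the surrounding tests; it assumes the test suite's k8sClient, ctx, and myRayService variables and the Status.PendingServiceStatus.RayClusterName field, so treat the names as assumptions:

// Sketch only: poll the RayService status until a pending RayCluster name
// appears, instead of waiting for the active cluster to switch (which would
// require a ready head Pod that envtest never provides).
Eventually(func() string {
	if err := k8sClient.Get(ctx, client.ObjectKey{Name: myRayService.Name, Namespace: myRayService.Namespace}, myRayService); err != nil {
		return ""
	}
	return myRayService.Status.PendingServiceStatus.RayClusterName
}, time.Second*15, time.Millisecond*500).ShouldNot(BeEmpty(), "expected a pending RayCluster to be created")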

)
})
It("should update the pending RayCluster in place when WorkerGroupSpecs are modified by the user in RayServiceSpec", func() {
// Trigger a new RayCluster preparation by updating the RayVersion.
kevin85421 (Member):

Currently, our tests are not following good practices. They are all coupled together, so we may need to ensure that there is no pendingRayCluster before proceeding with the following logic. In the future, we might need to make them more independent.

kevin85421 (Member):

Would you mind opening an issue to track the progress of https://github.com/ray-project/kuberay/pull/1734/files#r1424680579 if we decide not to cover this issue in this PR? Thanks!

Archit Kulkarni added 3 commits December 20, 2023 15:55
architkulkarni (Contributor, Author):

Thanks for the review - will respond to the remaining comments soon

Archit Kulkarni added 3 commits December 26, 2023 09:24
kevin85421 (Member) left a comment:

I haven't tested this PR manually. Feel free to merge it after CI passes if the PR works as expected.

architkulkarni (Contributor, Author):

@kevin85421 Thanks! I will run through the manual test on the final commit before merging.

By the way, are the rayservice-sample-yamls tests known to be flaky? They are somewhat flaky on this PR. If the tests were 100% green before, that could indicate an issue with the PR, but if they were flaky before, then I'm less worried.

Also, the raycluster-sample-yamls tests are flaky on this PR, but those are almost certainly unrelated.

architkulkarni (Contributor, Author):

The manual test described in the PR description passed. Merging

@architkulkarni architkulkarni merged commit a6cf6e0 into master Dec 26, 2023
25 checks passed
@architkulkarni architkulkarni deleted the worker-group branch December 26, 2023 20:18
Development

Successfully merging this pull request may close these issues: [Feature] Add new worker groups to Ray Service without rollout
2 participants