[Bug] RayService restarts repeatedly with Autoscaler #1037
Conversation
Nice catch for this issue.
```diff
@@ -926,3 +927,31 @@ func (r *RayServiceReconciler) labelHealthyServePods(ctx context.Context, rayClu…
 	return nil
 }
+
+func (r *RayServiceReconciler) generateRayClusterJsonHash(rayClusterSpec rayv1alpha1.RayClusterSpec) (string, error) {
```
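For readers skimming the diff, here is a minimal sketch of the idea behind `generateRayClusterJsonHash`. The stand-in types, the plain SHA-1 of the JSON encoding, and the name `generateJsonHash` are assumptions for illustration only; the actual KubeRay implementation may differ in detail. The Go blocks on this page continue one another as a single illustrative package.

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"encoding/json"
)

// Minimal stand-ins for the real rayv1alpha1 types (illustration only).
type ScaleStrategy struct {
	WorkersToDelete []string `json:"workersToDelete,omitempty"`
}

type WorkerGroupSpec struct {
	GroupName     string        `json:"groupName"`
	Replicas      *int32        `json:"replicas,omitempty"`
	ScaleStrategy ScaleStrategy `json:"scaleStrategy"`
}

type RayClusterSpec struct {
	WorkerGroupSpecs []WorkerGroupSpec `json:"workerGroupSpecs,omitempty"`
}

// generateJsonHash clears the fields the Autoscaler mutates
// (Replicas and WorkersToDelete), serializes the rest to JSON, and
// hashes the bytes. Two specs that differ only in those fields
// therefore produce the same hash.
func generateJsonHash(spec RayClusterSpec) (string, error) {
	// Copy the worker groups so the caller's spec is not mutated.
	groups := make([]WorkerGroupSpec, len(spec.WorkerGroupSpecs))
	copy(groups, spec.WorkerGroupSpecs)
	for i := range groups {
		groups[i].Replicas = nil
		groups[i].ScaleStrategy.WorkersToDelete = nil
	}
	spec.WorkerGroupSpecs = groups

	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha1.Sum(b)
	return base64.StdEncoding.EncodeToString(sum[:]), nil
}
```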
Not sure how difficult it is, but can we add a unit test for this function?
Thank you for the suggestion! I have added a new unit test suite in dc7dd3e.
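The actual suite added in dc7dd3e lives in the KubeRay repo; as a hypothetical illustration of what such a test can assert, using the sketch types above:

```go
package main

import "testing"

// TestHashIgnoresAutoscalerFields: specs that differ only in Replicas
// and WorkersToDelete must hash identically, while any other change
// (here, GroupName) must change the hash.
func TestHashIgnoresAutoscalerFields(t *testing.T) {
	one, five := int32(1), int32(5)
	specA := RayClusterSpec{WorkerGroupSpecs: []WorkerGroupSpec{{
		GroupName: "small-group",
		Replicas:  &one,
	}}}
	specB := RayClusterSpec{WorkerGroupSpecs: []WorkerGroupSpec{{
		GroupName:     "small-group",
		Replicas:      &five,
		ScaleStrategy: ScaleStrategy{WorkersToDelete: []string{"pod-0"}},
	}}}

	hashA, err := generateJsonHash(specA)
	if err != nil {
		t.Fatal(err)
	}
	hashB, err := generateJsonHash(specB)
	if err != nil {
		t.Fatal(err)
	}
	if hashA != hashB {
		t.Errorf("autoscaler-only change altered the hash: %q vs %q", hashA, hashB)
	}

	// Changing any non-autoscaler field must produce a different hash.
	specB.WorkerGroupSpecs[0].GroupName = "large-group"
	hashC, err := generateJsonHash(specB)
	if err != nil {
		t.Fatal(err)
	}
	if hashA == hashC {
		t.Error("GroupName change did not alter the hash")
	}
}
```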
Great work so far! Thanks for the detailed summary in the PR description. I left a few suggestions.
Looks good to me pending Shreyas's comments.
In the bug, what's the field that gets updated that triggers the restart? (Is it `Replicas` or `WorkersToDelete`, or both?)
Nice job! Thanks for addressing all my comments!
To explain this, we need to know that [Without this PR]…
…cluster (#1734)

Previously, when the RayCluster spec of a RayService was updated, one of two things would happen:

- A new cluster would be rolled out via "zero-downtime upgrade", or
- in the case where only the `Replicas` and `WorkersToDelete` fields changed, nothing would happen. (This behavior was added by [Bug] RayService restarts repeatedly with Autoscaler #1037 to prevent the Autoscaler from inadvertently triggering rollouts when modifying these fields.)

This PR adds a third case: if `WorkerGroupSpecs` is modified in the following specific way and the change doesn't fall into the case above, the RayService controller updates the RayCluster instance in place without rolling out a new one. The specific way that triggers the third case: the existing worker groups are unmodified except for `Replicas` and `WorkersToDelete`, and one or more entries are added to `WorkerGroupSpecs`.

In general, the update happens in two places: for the active RayCluster and for the pending RayCluster. Either of these two clusters may see an update to its spec, so we must handle both places.

In a follow-up, we may add the following optimization: if an existing worker group is modified and one or more entries are added to `WorkerGroupSpecs`, we should reject the spec. This will require using an admission webhook or storing the previous spec somehow. (If we just store the hash, as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)

Other follow-up issues for this PR:

- [RayService] Refactor to unify cluster decision for active and pending RayClusters #1761
- [RayService] [CI] Some tests for pending/active clusters may spuriously pass because head pod is not manually set to ready #1768
- [RayService] [Enhancement] Avoid unnecessary pod deletion when updating RayCluster #1769

Signed-off-by: Archit Kulkarni <archit@anyscale.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
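Continuing the illustrative sketch from above, the three-way decision that commit message describes could look like this. The names `clusterAction` and `decideClusterAction` are hypothetical, and the real controller logic is more involved:

```go
package main

type clusterAction int

const (
	doNothing         clusterAction = iota // only Replicas/WorkersToDelete changed
	updateInPlace                          // new worker groups appended, rest untouched
	rolloutNewCluster                      // any other change: zero-downtime upgrade
)

// decideClusterAction compares the goal spec (RayService.Spec.RayClusterSpec)
// against the spec of the running cluster, reusing generateJsonHash, which
// already ignores Replicas and WorkersToDelete.
func decideClusterAction(activeSpec, goalSpec RayClusterSpec) (clusterAction, error) {
	activeHash, err := generateJsonHash(activeSpec)
	if err != nil {
		return rolloutNewCluster, err
	}
	goalHash, err := generateJsonHash(goalSpec)
	if err != nil {
		return rolloutNewCluster, err
	}
	if activeHash == goalHash {
		// Case added by #1037: only autoscaler-managed fields differ.
		return doNothing, nil
	}
	// Case added by #1734: if the goal spec only appends worker groups,
	// hashing the goal spec truncated to the original group count must
	// reproduce the active hash.
	if len(goalSpec.WorkerGroupSpecs) > len(activeSpec.WorkerGroupSpecs) {
		truncated := goalSpec
		truncated.WorkerGroupSpecs = goalSpec.WorkerGroupSpecs[:len(activeSpec.WorkerGroupSpecs)]
		truncatedHash, err := generateJsonHash(truncated)
		if err != nil {
			return rolloutNewCluster, err
		}
		if truncatedHash == activeHash {
			return updateInPlace, nil
		}
	}
	return rolloutNewCluster, nil
}
```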
Why are these changes needed?
PR #655 avoids the creation of a new RayCluster when an active RayCluster triggers Ray autoscaling. However, it is possible for a pending RayCluster to trigger Ray autoscaling. See the section "Case 1 (bug): Trigger Autoscaler scale-up when the RayCluster is pending" in #1030 as an example.
What's the difference between #655 and this PR?
For #655:

- When a RayService creates a RayCluster, it will add an annotation `ray.io/cluster-hash` with the hash value of the "whole" `RayService.Spec.RayClusterSpec`.
- Active RayCluster: Compare the hash value of `RayService.Spec.RayClusterSpec` with the value of the annotation `ray.io/cluster-hash` on the active RayCluster, and create a new RayCluster if they are not equal. (The Ray Autoscaler updates the `RayCluster` spec and not `RayService.Spec.RayClusterSpec`. Therefore, the creation of a new RayCluster will not be triggered when an active one triggers Ray autoscaling.)
- Pending RayCluster: Compare `pendingRayCluster.Spec` and `RayService.Spec.RayClusterSpec` with the function `CompareJsonStruct`. Hence, when `pendingRayCluster.Spec` is updated by Ray autoscaling, the creation of a new pending RayCluster will be triggered.

For #1037 (this PR):
- When a RayService creates a RayCluster, it will add an annotation `ray.io/cluster-hash` with the hash value of "some parts" of `RayService.Spec.RayClusterSpec`.
- Active RayCluster: Compare the hash value of "some parts" of `RayService.Spec.RayClusterSpec` with the value of the annotation `ray.io/cluster-hash` on the active RayCluster.
- Pending RayCluster: Compare the hash values of "some parts" of `pendingRayCluster.Spec` and `RayService.Spec.RayClusterSpec`.
The hash generation differs between #655 and #1037: PR #655 generates a hash value based on the whole RayClusterSpec, whereas PR #1037 sets some fields (the autoscaler-managed `Replicas` and `WorkersToDelete`) to nil before generating the hash.
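Concretely, the active-cluster check then reduces to a hash-and-compare against the stored annotation. A sketch using the illustrative types above; only the `ray.io/cluster-hash` key comes from the PR itself, the helper name is hypothetical:

```go
package main

const rayClusterHashKey = "ray.io/cluster-hash"

// shouldPrepareNewCluster hashes the filtered goal spec and compares it
// with the hash recorded on the active RayCluster at creation time.
func shouldPrepareNewCluster(activeAnnotations map[string]string, goalSpec RayClusterSpec) (bool, error) {
	goalHash, err := generateJsonHash(goalSpec)
	if err != nil {
		return false, err
	}
	return activeAnnotations[rayClusterHashKey] != goalHash, nil
}
```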
In #655, every update to `RayService.Spec.RayClusterSpec` will trigger a new cluster preparation. However, we should categorize updates to `RayService.Spec.RayClusterSpec` into 3 categories. This PR can identify whether or not an update falls under Case 3, but it cannot differentiate between Case 1 and Case 2. In this PR, if an update does not belong to Case 3, we ignore the update.
For example, an update to `WorkersToDelete` in `RayService.Spec.RayClusterSpec` will be ignored; see the change to the test "should update a rayservice object and switch to new Ray Cluster" in rayservice_controller_test.go for more details, and the usage sketch below.

In the follow-up PRs, we need to provide a clear definition to differentiate between Case 1 and Case 2.
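As a usage example of the sketches above, a `WorkersToDelete`-only change maps to the "ignore" outcome:

```go
package main

import "fmt"

func main() {
	two := int32(2)
	active := RayClusterSpec{WorkerGroupSpecs: []WorkerGroupSpec{{
		GroupName: "small-group", Replicas: &two,
	}}}
	// Same group, but the autoscaler has requested a pod deletion.
	goal := RayClusterSpec{WorkerGroupSpecs: []WorkerGroupSpec{{
		GroupName:     "small-group",
		Replicas:      &two,
		ScaleStrategy: ScaleStrategy{WorkersToDelete: []string{"worker-0"}},
	}}}

	action, err := decideClusterAction(active, goal)
	if err != nil {
		panic(err)
	}
	fmt.Println(action == doNothing) // true: the update is ignored, no rollout
}
```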
Related issue number
Closes #1030
Checks
- Reproduce the bug with v0.5.0
- Test this PR