
[Bug] RayService restarts repeatedly with Autoscaler #1037

Merged
merged 12 commits into ray-project:master on May 4, 2023

Conversation

@kevin85421 (Member) commented on Apr 20, 2023:

Why are these changes needed?

PR #655 avoids the creation of a new RayCluster when an active RayCluster triggers Ray autoscaling. However, it is possible for a pending RayCluster to trigger Ray autoscaling. See the section "Case 1 (bug): Trigger Autoscaler scale-up when the RayCluster is pending" in #1030 as an example.

What's the difference between #655 and this PR?

For #655,

  • When a RayService creates a RayCluster, it will add an annotation ray.io/cluster-hash with the hash value of the "whole" RayService.Spec.RayClusterSpec.

  • Active RayCluster: Compare the hash value of RayService.Spec.RayClusterSpec and the value of the annotation ray.io/cluster-hash from the active RayCluster. Create a new RayCluster if they are not equal.

    • Ray autoscaler updates only the RayCluster spec and not the RayService.Spec.RayClusterSpec. Therefore, the creation of a new RayCluster will not be triggered when an active one triggers Ray autoscaling.
  • Pending RayCluster: Compare pendingRayCluster.Spec and RayService.Spec.RayClusterSpec with the function CompareJsonStruct. Hence, when pendingRayCluster.Spec is updated by Ray autoscaling, the creation of a new pending RayCluster will be triggered.

For #1037 (this PR),

  • When a RayService creates a RayCluster, it will add an annotation ray.io/cluster-hash with the hash value of "some parts" of RayService.Spec.RayClusterSpec.
  • Active RayCluster: Compare the hash value of RayService.Spec.RayClusterSpec with the value of the annotation ray.io/cluster-hash on the active RayCluster.
  • Pending RayCluster: Compare the hash values of pendingRayCluster.Spec and RayService.Spec.RayClusterSpec.

The hash generation differs between #655 and #1037.

PR #655 generates a hash value based on the whole RayClusterSpec. On the other hand, in PR #1037, some fields are set to nil before generating the hash.
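Here is a minimal sketch of the idea (simplified stand-in types and illustrative field names, not the actual KubeRay code): the autoscaler-managed fields are cleared before the spec is serialized and hashed, so updates to those fields do not change the hash.

package main

import (
	"crypto/sha1"
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the real rayv1alpha1 types.
type WorkerGroupSpec struct {
	GroupName       string   `json:"groupName"`
	Replicas        *int32   `json:"replicas,omitempty"`
	WorkersToDelete []string `json:"workersToDelete,omitempty"`
	Image           string   `json:"image"`
}

type RayClusterSpec struct {
	RayVersion       string            `json:"rayVersion"`
	WorkerGroupSpecs []WorkerGroupSpec `json:"workerGroupSpecs"`
}

// hashWithoutAutoscalerFields clears Replicas and WorkersToDelete before
// hashing, so autoscaler-driven updates do not change the hash value stored
// in the ray.io/cluster-hash annotation.
func hashWithoutAutoscalerFields(spec RayClusterSpec) (string, error) {
	filtered := spec
	filtered.WorkerGroupSpecs = make([]WorkerGroupSpec, len(spec.WorkerGroupSpecs))
	copy(filtered.WorkerGroupSpecs, spec.WorkerGroupSpecs)
	for i := range filtered.WorkerGroupSpecs {
		filtered.WorkerGroupSpecs[i].Replicas = nil
		filtered.WorkerGroupSpecs[i].WorkersToDelete = nil
	}
	data, err := json.Marshal(filtered)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha1.Sum(data)), nil
}

func main() {
	one, three := int32(1), int32(3)
	before := RayClusterSpec{RayVersion: "2.4.0", WorkerGroupSpecs: []WorkerGroupSpec{
		{GroupName: "small-group", Replicas: &one, Image: "rayproject/ray:2.4.0"},
	}}
	after := before
	after.WorkerGroupSpecs = []WorkerGroupSpec{
		{GroupName: "small-group", Replicas: &three, Image: "rayproject/ray:2.4.0"},
	}
	h1, _ := hashWithoutAutoscalerFields(before)
	h2, _ := hashWithoutAutoscalerFields(after)
	fmt.Println(h1 == h2) // true: a Replicas-only change does not trigger a new cluster
}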

In #655, every update to RayService.Spec.RayClusterSpec triggers the preparation of a new cluster. However, updates to RayService.Spec.RayClusterSpec should be categorized into three categories:

  • Case 1. RayService should ignore the update and do nothing (e.g. WorkersToDelete).
  • Case 2. RayService should not create a new RayCluster, but should update the existing RayCluster (e.g. Replicas).
  • Case 3. RayService should trigger the creation of a new RayCluster (e.g. updating the RayCluster's Docker image).

This PR can identify whether an update falls under Case 3, but it cannot differentiate between Case 1 and Case 2. In this PR, if an update does not belong to Case 3, the update is ignored.

For example, the update of WorkersToDelete for RayService.Spec.RayClusterSpec will be ignored. See the change of the test "should update a rayservice object and switch to new Ray Cluster" in rayservice_controller_test.go for more details.

In the follow-up PRs, we need to provide a clear definition to differentiate between Case 1 and Case 2.

Related issue number

Closes #1030

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Reproduce bug with v0.5.0

helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0
# path: ray-operator/config/samples/
# New RayCluster will be restarted by RayService repeatedly.
kubectl apply -f ray-service.autoscaler.yaml

Test this PR

helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0 --set image.repository=controller,image.tag=latest
# path: ray-operator/config/samples/
# New RayCluster preparation will not be triggered.
kubectl apply -f ray-service.autoscaler.yaml

# Update workerGroupSpecs[0].replicas from 1 to 3. (New cluster preparation should not happen.)
kubectl apply -f ray-service.autoscaler.yaml

# Update rayClusterConfig.rayVersion from '2.4.0' to '2.100.0'. (New cluster preparation should be triggered.)
kubectl apply -f ray-service.autoscaler.yaml

@kevin85421 changed the title from [WIP] [Bug] RayService restarts repeatedly with Autoscaler to [Bug] RayService restarts repeatedly with Autoscaler on Apr 24, 2023
@kevin85421 marked this pull request as ready for review on April 25, 2023 01:07
@sihanwang41 (Contributor) left a comment:


nice catch for this issue.

@@ -926,3 +927,31 @@ func (r *RayServiceReconciler) labelHealthyServePods(ctx context.Context, rayClu

return nil
}

func (r *RayServiceReconciler) generateRayClusterJsonHash(rayClusterSpec rayv1alpha1.RayClusterSpec) (string, error) {
A contributor commented:

Not sure how difficult it is, but can we add a unit test for this function?

@kevin85421 (Member, Author) replied:

Thank you for the suggestion! I have added a new unit test suite in dc7dd3e.
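For reference, here is an illustrative sketch of the kind of property such a test could check (not the committed test suite; it reuses the simplified types and the hashWithoutAutoscalerFields helper sketched in the PR description, placed in a _test.go file that imports "testing"):

func TestHashIgnoresAutoscalerManagedFields(t *testing.T) {
	one, three := int32(1), int32(3)
	base := RayClusterSpec{
		RayVersion: "2.4.0",
		WorkerGroupSpecs: []WorkerGroupSpec{
			{GroupName: "small-group", Replicas: &one, Image: "rayproject/ray:2.4.0"},
		},
	}
	baseHash, err := hashWithoutAutoscalerFields(base)
	if err != nil {
		t.Fatal(err)
	}

	// An autoscaler-style update: only Replicas and WorkersToDelete change.
	scaled := base
	scaled.WorkerGroupSpecs = []WorkerGroupSpec{
		{GroupName: "small-group", Replicas: &three, WorkersToDelete: []string{"worker-pod-a"}, Image: "rayproject/ray:2.4.0"},
	}
	scaledHash, _ := hashWithoutAutoscalerFields(scaled)
	if baseHash != scaledHash {
		t.Errorf("hash changed for an autoscaler-only update")
	}

	// A Case 3 update: changing the Ray version must change the hash.
	upgraded := base
	upgraded.RayVersion = "2.100.0"
	upgradedHash, _ := hashWithoutAutoscalerFields(upgraded)
	if baseHash == upgradedHash {
		t.Errorf("hash did not change for a Ray version update")
	}
}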

@shrekris-anyscale (Contributor) left a comment:

Great work so far! Thanks for the detailed summary in the PR description. I left a few suggestions.

@architkulkarni (Contributor) left a comment:

Looks good to me pending Shreyas's comments.

In the bug, what's the field that gets updated that triggers the restart? (Is it Replicas or WorkersToDelete, or both?)

@shrekris-anyscale (Contributor) left a comment:

Nice job! Thanks for addressing all my comments!

@kevin85421 (Member, Author) replied:

> In the bug, what's the field that gets updated that triggers the restart? (Is it Replicas or WorkersToDelete, or both?)

To explain this, we need to know that RayService.Spec.RayClusterSpec, pendingRayCluster.Spec, and activeRayCluster.Spec are different.

[Without this PR]
If the pending RayCluster triggers autoscaling and updates Replicas (scale up/down) or WorkersToDelete (scale down), the function utils.CompareJsonStruct(pendingRayCluster.Spec, rayServiceInstance.Spec.RayClusterSpec) will return false and trigger a new RayCluster preparation to replace the original pending RayCluster.

In ray-service.autoscaler.yaml, we have 5 serve deployments which require 1.4 CPUs in total, but we only have 1 head (0 CPU) and 1 worker (1 CPU). Hence, the autoscaler will be triggered to update Replicas and KubeRay will create a new worker Pod. At this moment, utils.CompareJsonStruct(pendingRayCluster.Spec, rayServiceInstance.Spec.RayClusterSpec) will return false.

@kevin85421 merged commit 52af139 into ray-project:master on May 4, 2023
19 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
RayService restarts repeatedly with Autoscaler
architkulkarni added a commit that referenced this pull request Dec 26, 2023
…cluster (#1734)

Previously, when the RayCluster spec of a RayService was updated, one of two things would happen:

  • A new cluster would be rolled out via "zero-downtime-upgrade", or
  • In the case where only the Replicas and WorkersToDelete fields changed, nothing would happen. (This behavior was added by #1037 ([Bug] RayService restarts repeatedly with Autoscaler) to prevent the Autoscaler from inadvertently triggering rollouts when modifying these fields.)

This PR adds a third case: if WorkerGroupSpecs is modified in the following specific way and it doesn't fall into the case above, then the RayService controller will update the RayCluster instance in place without rolling out a new one.

Here is the specific way that triggers the third case:

The existing worker groups are not modified except for Replicas and WorkersToDelete, and one or more entries are added to WorkerGroupSpecs.

In general, the update happens in two places:

  • For the active RayCluster
  • For the pending RayCluster

Either of these two clusters may see an update to the spec, so we must handle both places.
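A rough sketch of the resulting three-way decision (illustrative only, not the actual controller code; it reuses the simplified types and the hashWithoutAutoscalerFields helper sketched earlier in this thread):

// decideClusterAction sketches the three-way choice: ignore autoscaler-only
// changes, update the RayCluster in place when worker groups are only
// appended, and otherwise roll out a new cluster.
func decideClusterAction(goal, live RayClusterSpec) (string, error) {
	goalHash, err := hashWithoutAutoscalerFields(goal)
	if err != nil {
		return "", err
	}
	liveHash, err := hashWithoutAutoscalerFields(live)
	if err != nil {
		return "", err
	}
	if goalHash == liveHash {
		// Only Replicas/WorkersToDelete differ (e.g. autoscaler activity).
		return "DoNothing", nil
	}
	if len(goal.WorkerGroupSpecs) > len(live.WorkerGroupSpecs) {
		// Check whether the goal spec matches the live spec once the newly
		// appended worker groups are ignored.
		truncated := goal
		truncated.WorkerGroupSpecs = goal.WorkerGroupSpecs[:len(live.WorkerGroupSpecs)]
		truncatedHash, err := hashWithoutAutoscalerFields(truncated)
		if err != nil {
			return "", err
		}
		if truncatedHash == liveHash {
			return "UpdateInPlace", nil
		}
	}
	// Anything else triggers a zero-downtime upgrade to a new RayCluster.
	return "RolloutNewCluster", nil
}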

In a followup, we may add the following optimization: If an existing worker group is modified and one or more entries to WorkerGroupSpecs is added, we should reject the spec. This will require using an admission webhook or storing the previous spec somehow. (If we just store the hash as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)

Other followup issues for this PR:

  • [RayService] Refactor to unify cluster decision for active and pending RayClusters #1761
  • [RayService] [CI] Some tests for pending/active clusters may spuriously pass because head pod is not manually set to ready #1768
  • [RayService] [Enhancement] Avoid unnecessary pod deletion when updating RayCluster #1769


---------

Signed-off-by: Archit Kulkarni <archit@anyscale.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>