[RayService][Kueue] Support top-level Spec.Suspend#4841
Conversation
Propagate RayService.Spec.RayClusterSpec.Suspend onto the RayCluster only when the RayCluster is first created. After the RayCluster exists, its Suspend is delegated to Kueue: hash comparisons ignore Suspend, and modifyRayCluster preserves the existing cluster's Suspend instead of overwriting it with the RayService spec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When Spec.Suspend is true, the RayService controller deletes every Kubernetes resource it owns (RayClusters, head/serve Services, and Gateway/HTTPRoute when the RayServiceIncrementalUpgrade gate is on) and reports the lifecycle through two new conditions, Suspending and Suspended. The transition is atomic: the first reconcile commits Suspending=True together with the reset of ActiveServiceStatus, PendingServiceStatus, NumServeEndpoints, and ServiceStatus in a single Status update; deletion runs on the next reconcile once that commit is durable, so an errored or interrupted attempt is always resumed. Flipping Spec.Suspend back to false removes the Suspended condition and the regular reconcile recreates the resources. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The suspend behavior is exercised end-to-end on a kind cluster, so the handler-level unit tests are removed for now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run `make helm` so the helm-chart/kuberay-operator/crds copy of the RayService CRD matches config/crd/bases after adding Spec.Suspend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These tests reached into private helpers and duplicated coverage already exercised by the zero-downtime upgrade + Suspend e2e walkthrough; drop them to reduce coupling to internal structure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers four scenarios:
- Suspend a Running service then resume it.
- Atomic suspend: deletion completes even if Spec.Suspend is flipped
back to false mid-suspend; service then exits Suspended and serves
traffic again.
- Service created with Spec.Suspend=true: never spins up resources,
reaches Suspended directly, comes up normally on resume.
- Suspend during a zero-downtime upgrade: both active and pending
clusters are deleted; resuming applies the upgraded spec.
The atomic case surfaced a bug where the controller would stay in
Suspended forever if Spec.Suspend had been flipped to false before the
transition landed: after persisting Suspended=True we returned
ctrl.Result{} with no requeue, and the status-only update did not
re-trigger the watch predicate. Fix by requeuing after the transition so
the next reconcile observes Spec.Suspend and exits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove trailing blank line at end of rayservice_controller_unit_test.go. - Group corev1 with the other k8s.io imports in rayservice_suspend_test.go so goimports leaves it alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the curlRayServiceFruitWithError wrapper; it duplicated what ExecPodCmdWithError already does. Build the curl command inline at the single call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stop asserting that Suspended=True is observed during the atomic flow. That condition is only persisted for ~2s (one requeue interval) after deletion completes if Spec.Suspend has already been flipped to false, which made the test vulnerable to scheduling jitter under load. Instead record the original RayCluster name before suspending and assert (1) that cluster is deleted and (2) the eventually-Ready RayService is backed by a different RayCluster. This proves atomic completion more directly: the underlying cluster was actually torn down and recreated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pendAtomic Pin the underlying RayCluster with a synthetic finalizer so deletion cannot complete, then flip Spec.Suspend back and forth while asserting via Consistently that Suspending stays True. This matches the "RayJob suspend operation shoud be atomic" pattern in rayjob_controller_suspended_test.go and exercises the atomicity property directly instead of inferring it from cluster recreation. After removing the finalizer the test still verifies that the original RayCluster is deleted and a different RayCluster eventually backs the Ready RayService. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The function name and its own doc comment already convey what the call site does; the inline comment was duplicative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The comment block above `clusterSpec := rayService.Spec.RayClusterSpec.DeepCopy()` in constructRayClusterForRayService restated what is already documented on rayClusterSpecForHashing and modifyRayCluster. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The full ./test/e2erayservice suite was nearly exhausting the 30m Go-test timeout, and TestRayServiceSuspendResume was failing because a single curl attempt hung on TCP retransmits past TestTimeoutShort — Eventually couldn't retry. Two fixes: - Split the suspend tests into their own Make target (test-e2e-rayservice-suspend) and Buildkite job; the existing test-e2e-rayservice / rayservice job runs with `-skip Suspend`. Each job gets its own 30m budget. - Add --connect-timeout 3 --max-time 5 to the resumed-service curl so each attempt fails fast and Eventually can actually iterate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
This is a POC PR (kubernetes-sigs/kueue#11264) based on the top-level If this makes sense to you, I will ping other KubeRay folks to review this PR. Thanks! |
|
Thank you @kevin85421 👍 cc @yaroslava-serdiuk who could likely provide the first review pass from the Kueue side. |
|
Thanks @mimowo! @yaroslava-serdiuk: The first step is to make sure the proposal in this KubeRay PR makes sense to you. Then I can ping other KubeRay folks to review it. The Kueue PR kubernetes-sigs/kueue#11264 is a POC based on the API introduced in this KubeRay PR. It's not ready for review, but you can take a quick look to understand what the integration will look like. Thanks! |
|
Thanks @kevin85421 for the change! In general RayService.Spec.Suspend field makes sense to me. |
|
+1 for top-level RayService.Spec.Suspend, which will help Kueue MultiKueue to support RayService as well! Same question for Spec.RayClusterSpec.Suspend, do we still need it |
|
Thanks @yaroslava-serdiuk @hiboyang for the review.
Sure. ContextBefore we dive into more details, it will be helpful to understand the following context:
|
Why are these changes needed?
RayService.spec.RayClusterSpec.Suspendchanges. However, this introduces multiple issues:suspendto true during upgrade, the pending RayCluster will be suspended.The active RayCluster will not be suspended becauseUpgradeInProgressis true (code). However, the active RayCluster will also not be deleted after the pending RayCluster is suspended, since the reconciliation will fail early because the head Pod of the pending RayCluster doesn't exist. That is, a "suspended" RayService still has a running active RayCluster.suspendto false to admit the RayService again, both the pending and active RayCluster CRs will be created again. However, we don't need the pending RayCluster.suspendeither both false or both true (there are bugs that cause the two RayClusters' suspend fields to drift apart — see Issue 1 for details).suspend: false, and the pending RayCluster should havesuspend: trueuntil Kueue admits it.Due to the issues from KubeRay and Kueue mentioned above, the Kueue integration for RayService doesn't work, especially during upgrades.
Changes
This PR adds top-level
RayService.Spec.Suspendand tightens the semantics of the existing nestedSpec.RayClusterSpec.Suspendso the two stop fighting each other. Together they make Kueue-style suspension work cleanly forRayService.1.
Spec.RayClusterSpec.Suspendis now applied only at RayCluster creationBackground: changing
RayService.Spec.RayClusterSpec.Suspendon a live RayService used to be propagated to the existing RayCluster on every reconcile, which fought Kueue any time it toggledSuspenddirectly on the cluster.This PR changes the semantics:
Suspendis copied from the RayService onto the new RayCluster.rayClusterSpecForHashing) ignoreSuspend, so toggling it on the RayService never triggers ashouldUpdateCluster/shouldPrepareNewClusterpath; andmodifyRayClustersnapshots and re-applies the existing cluster'sSuspendinstead of letting the goal spec overwrite it. Ownership ofSuspendon the RayCluster is delegated to Kueue (or whichever queueing controller owns it).2. Top-level
RayService.Spec.Suspendtears down all owned resourcesWhen
Spec.Suspend=true, the controller deletes everything it owns:RayServiceIncrementalUpgradesince those are only registered in the scheme when the feature gate is on)The lifecycle is reported through two new conditions,
SuspendingandSuspended, and follows this state machine:Atomicity is enforced by treating the persisted
Suspendingcondition as a commit point. The first reconcile that observesSpec.Suspend=trueonly stages the status transition — settingSuspending=Truetogether with resettingActiveServiceStatus,PendingServiceStatus,NumServeEndpoints, andServiceStatusin the sameStatus().Update. The actual deletion runs on the next reconcile, once that commit is durable. If a deletion attempt errors out orSpec.Suspendis flipped back to false mid-way,Suspendingstays True in storage and subsequent reconciles continue the cleanup.handleSuspendonly mutatesStatus; the caller persists changes via a singleStatus().Update.Related issue number
Checks
test/e2erayservice/rayservice_suspend_test.go):TestRayServiceSuspendResume— happy path: Running → Suspended (all owned resources gone, status reset) → resume → Ready, serves traffic again.TestRayServiceSuspendAtomic— mirrors the RayJobshould be atomicpattern: pins the underlying RayCluster with a synthetic finalizer, then flipsSpec.Suspendback and forth while assertingConsistently(Suspending=True); after removing the finalizer, verifies the original RayCluster is deleted and a different RayCluster eventually backs the Ready RayService.TestRayServiceCreatedSuspended—Spec.Suspend=truefrom creation: no owned resources are ever created,Suspended=Trueis reached directly; flipping to false brings the service up normally.TestRayServiceSuspendDuringUpgrade— triggers a zero-downtime upgrade, then suspends mid-upgrade: both active and pending RayClusters are deleted; resuming applies the upgraded spec.