
[RayService][Kueue] Support top-level Spec.Suspend#4841

Draft
kevin85421 wants to merge 13 commits into
ray-project:master from
kevin85421:support-rayservice-suspend

@kevin85421
Member

@kevin85421 kevin85421 commented May 17, 2026

Why are these changes needed?

Kueue POC PR based on top-level suspend API: kubernetes-sigs/kueue#11264

  • PR #4739 ("Fix Issue #4686 (rayservice): update RayCluster when Suspend toggles") makes the RayService controller update the underlying RayCluster CRs when RayService.Spec.RayClusterSpec.Suspend changes. However, this introduces multiple issues:
    • Issue 1: The RayService CR is no longer purely declarative.
      • For example, if we set suspend to true during an upgrade, the pending RayCluster will be suspended. The active RayCluster will not be suspended because UpgradeInProgress is true. However, the active RayCluster will also not be deleted after the pending RayCluster is suspended, since the reconciliation fails early because the head Pod of the pending RayCluster doesn't exist. That is, a "suspended" RayService still has a running active RayCluster.
      • Even if we prevent the reconciliation from returning early and suspend the active RayCluster, the behavior is still incorrect. For example, when Kueue sets suspend to false to admit the RayService again, both the pending and active RayCluster CRs will be created again. However, we don't need the pending RayCluster.
      • If the RayService is not undergoing an upgrade, the only RayCluster will be suspended, which is the expected behavior. That is, the goal state of a "suspended" RayService is different from that of a "suspended" RayService during an upgrade.
    • Issue 2: The CR status is incorrect.
    • Issue 3: Doesn't take the upgrade into consideration.
      • Ideally, the logic introduced by PR #4739 should keep the pending and active RayCluster CRs in sync on suspend, either both false or both true (there are bugs that cause the two RayClusters' suspend fields to drift apart; see Issue 1 for details).
        • However, during an upgrade, the active RayCluster should have suspend: false, and the pending RayCluster should have suspend: true until Kueue admits it.
    • Issue 4: Kueue also injects scheduling information into the CR spec before admission. Therefore, suspending and then unsuspending the existing RayCluster CRs may trigger multiple unnecessary upgrades.
  • Kueue also has its own issues: MultiKueue: Support Elastic RayService kubernetes-sigs/kueue#11102 (comment).

Due to the issues from KubeRay and Kueue mentioned above, the Kueue integration for RayService doesn't work, especially during upgrades.

Changes

This PR adds top-level RayService.Spec.Suspend and tightens the semantics of the existing nested Spec.RayClusterSpec.Suspend so the two stop fighting each other. Together they make Kueue-style suspension work cleanly for RayService.

1. Spec.RayClusterSpec.Suspend is now applied only at RayCluster creation

Background: changing RayService.Spec.RayClusterSpec.Suspend on a live RayService used to be propagated to the existing RayCluster on every reconcile, which fought Kueue any time it toggled Suspend directly on the cluster.

This PR changes the semantics:

  • At creation time: Suspend is copied from the RayService onto the new RayCluster.
  • After the RayCluster exists: hash comparisons (rayClusterSpecForHashing) ignore Suspend, so toggling it on the RayService never triggers a shouldUpdateCluster / shouldPrepareNewCluster path; and modifyRayCluster snapshots and re-applies the existing cluster's Suspend instead of letting the goal spec overwrite it. Ownership of Suspend on the RayCluster is delegated to Kueue (or whichever queueing controller owns it).
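The two rules above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not the controller's code: the trimmed-down RayClusterSpec struct and the helpers specForHashing, hashSpec, and applyGoalSpec are hypothetical stand-ins for the ideas behind rayClusterSpecForHashing and modifyRayCluster, and the JSON/SHA-256 hashing is chosen only for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// RayClusterSpec is a hypothetical, trimmed-down stand-in for the real CRD type.
type RayClusterSpec struct {
	RayVersion string `json:"rayVersion"`
	Image      string `json:"image"`
	Suspend    *bool  `json:"suspend,omitempty"`
}

// specForHashing returns a copy with Suspend cleared, so toggling Suspend
// never changes the hash. The argument is passed by value, so the caller's
// spec is left untouched.
func specForHashing(spec RayClusterSpec) RayClusterSpec {
	spec.Suspend = nil
	return spec
}

// hashSpec hashes the spec with Suspend excluded.
func hashSpec(spec RayClusterSpec) (string, error) {
	b, err := json.Marshal(specForHashing(spec))
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha256.Sum256(b)), nil
}

// applyGoalSpec overwrites the existing spec with the goal spec but keeps
// the existing cluster's Suspend value instead of the goal's.
func applyGoalSpec(existing, goal RayClusterSpec) RayClusterSpec {
	updated := goal
	updated.Suspend = existing.Suspend
	return updated
}

func boolPtr(b bool) *bool { return &b }

func main() {
	goal := RayClusterSpec{RayVersion: "2.X.0", Image: "rayproject/ray:2.X.0", Suspend: boolPtr(true)}
	live := goal
	live.Suspend = boolPtr(false) // e.g. a queueing controller unsuspended the cluster

	h1, _ := hashSpec(goal)
	h2, _ := hashSpec(live)
	fmt.Println("hashes equal:", h1 == h2)
	fmt.Println("suspend after modify:", *applyGoalSpec(live, goal).Suspend)
}
```

In this sketch the goal and live specs hash identically even though their Suspend values differ, and applying the goal spec leaves the live cluster's Suspend as the queueing controller set it.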

2. Top-level RayService.Spec.Suspend tears down all owned resources

When Spec.Suspend=true, the controller deletes everything it owns:

  • All RayClusters labeled with the RayService association
  • Head/serve Kubernetes Services
  • Gateway/HTTPRoute (gated on RayServiceIncrementalUpgrade since those are only registered in the scheme when the feature gate is on)

The lifecycle is reported through two new conditions, Suspending and Suspended, and follows this state machine:

(no suspend) --Spec.Suspend=true--> Suspending --owned resources deleted--> Suspended --Spec.Suspend=false--> (no suspend)

Atomicity is enforced by treating the persisted Suspending condition as a commit point. The first reconcile that observes Spec.Suspend=true only stages the status transition — setting Suspending=True together with resetting ActiveServiceStatus, PendingServiceStatus, NumServeEndpoints, and ServiceStatus in the same Status().Update. The actual deletion runs on the next reconcile, once that commit is durable. If a deletion attempt errors out or Spec.Suspend is flipped back to false mid-way, Suspending stays True in storage and subsequent reconciles continue the cleanup.

handleSuspend only mutates Status; the caller persists changes via a single Status().Update.
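The commit-point behavior described above can be modeled as a small in-memory state machine. The sketch below is an illustration only: the RayService and Status structs, the reconcile function, and the injected deleteOwned callback are hypothetical simplifications, and real conditions, API calls, and requeues are replaced by plain fields and repeated function calls.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical, flattened view of the relevant conditions.
type Status struct {
	Suspending    bool
	Suspended     bool
	ActiveCluster string
}

// RayService is a hypothetical in-memory model of the CR and what it owns.
type RayService struct {
	SpecSuspend bool
	Status      Status
	Owned       []string
}

// reconcile performs one reconcile pass. deleteOwned is injected so that a
// failing deletion can be simulated.
func reconcile(svc *RayService, deleteOwned func(*RayService) error) error {
	switch {
	case svc.Status.Suspending:
		// The commit point is already persisted: finish the cleanup even
		// if SpecSuspend was flipped back to false in the meantime.
		if err := deleteOwned(svc); err != nil {
			return err // Suspending stays true; the next pass retries.
		}
		svc.Status.Suspending = false
		svc.Status.Suspended = true
	case svc.Status.Suspended:
		if !svc.SpecSuspend {
			svc.Status.Suspended = false // the next pass recreates resources
		}
	case svc.SpecSuspend:
		// Stage only: set the condition and reset status together.
		// Deletion is deferred to the next pass.
		svc.Status.Suspending = true
		svc.Status.ActiveCluster = ""
	default:
		if len(svc.Owned) == 0 {
			svc.Owned = []string{"raycluster", "head-svc", "serve-svc"}
			svc.Status.ActiveCluster = "raycluster"
		}
	}
	return nil
}

func deleteAll(svc *RayService) error {
	svc.Owned = nil
	return nil
}

func failingDelete(*RayService) error { return errors.New("delete failed") }

func main() {
	svc := &RayService{}
	_ = reconcile(svc, deleteAll) // creates resources
	svc.SpecSuspend = true
	_ = reconcile(svc, deleteAll) // stages Suspending, deletes nothing yet
	svc.SpecSuspend = false       // flipped back mid-way
	fmt.Println(reconcile(svc, failingDelete), svc.Status.Suspending)
	_ = reconcile(svc, deleteAll) // cleanup completes
	fmt.Println(svc.Status.Suspended, len(svc.Owned))
	_ = reconcile(svc, deleteAll) // leaves Suspended
	_ = reconcile(svc, deleteAll) // recreates resources
	fmt.Println(svc.Status.Suspended, len(svc.Owned))
}
```

Checking Suspending before Spec.Suspend is what makes the transition atomic in this model: once the condition is stored, neither a deletion error nor a flipped spec can leave the service half torn down.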

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • E2E tests (test/e2erayservice/rayservice_suspend_test.go):
      • TestRayServiceSuspendResume — happy path: Running → Suspended (all owned resources gone, status reset) → resume → Ready, serves traffic again.
      • TestRayServiceSuspendAtomic — mirrors the RayJob should be atomic pattern: pins the underlying RayCluster with a synthetic finalizer, then flips Spec.Suspend back and forth while asserting Consistently(Suspending=True); after removing the finalizer, verifies the original RayCluster is deleted and a different RayCluster eventually backs the Ready RayService.
      • TestRayServiceCreatedSuspended: Spec.Suspend=true from creation: no owned resources are ever created, Suspended=True is reached directly; flipping to false brings the service up normally.
      • TestRayServiceSuspendDuringUpgrade — triggers a zero-downtime upgrade, then suspends mid-upgrade: both active and pending RayClusters are deleted; resuming applies the upgraded spec.
    • Manual tests on a kind cluster (suspend/resume + suspend-during-upgrade).

kevin85421 and others added 12 commits May 16, 2026 21:10
Propagate RayService.Spec.RayClusterSpec.Suspend onto the RayCluster only
when the RayCluster is first created. After the RayCluster exists, its
Suspend is delegated to Kueue: hash comparisons ignore Suspend, and
modifyRayCluster preserves the existing cluster's Suspend instead of
overwriting it with the RayService spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When Spec.Suspend is true, the RayService controller deletes every
Kubernetes resource it owns (RayClusters, head/serve Services, and
Gateway/HTTPRoute when the RayServiceIncrementalUpgrade gate is on) and
reports the lifecycle through two new conditions, Suspending and
Suspended. The transition is atomic: the first reconcile commits
Suspending=True together with the reset of ActiveServiceStatus,
PendingServiceStatus, NumServeEndpoints, and ServiceStatus in a single
Status update; deletion runs on the next reconcile once that commit is
durable, so an errored or interrupted attempt is always resumed.
Flipping Spec.Suspend back to false removes the Suspended condition and
the regular reconcile recreates the resources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The suspend behavior is exercised end-to-end on a kind cluster, so the
handler-level unit tests are removed for now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run `make helm` so the helm-chart/kuberay-operator/crds copy of the
RayService CRD matches config/crd/bases after adding Spec.Suspend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These tests reached into private helpers and duplicated coverage already
exercised by the zero-downtime upgrade + Suspend e2e walkthrough; drop
them to reduce coupling to internal structure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers four scenarios:
- Suspend a Running service then resume it.
- Atomic suspend: deletion completes even if Spec.Suspend is flipped
  back to false mid-suspend; service then exits Suspended and serves
  traffic again.
- Service created with Spec.Suspend=true: never spins up resources,
  reaches Suspended directly, comes up normally on resume.
- Suspend during a zero-downtime upgrade: both active and pending
  clusters are deleted; resuming applies the upgraded spec.

The atomic case surfaced a bug where the controller would stay in
Suspended forever if Spec.Suspend had been flipped to false before the
transition landed: after persisting Suspended=True we returned
ctrl.Result{} with no requeue, and the status-only update did not
re-trigger the watch predicate. Fix by requeuing after the transition so
the next reconcile observes Spec.Suspend and exits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove trailing blank line at end of rayservice_controller_unit_test.go.
- Group corev1 with the other k8s.io imports in
  rayservice_suspend_test.go so goimports leaves it alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the curlRayServiceFruitWithError wrapper; it duplicated what
ExecPodCmdWithError already does. Build the curl command inline at the
single call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stop asserting that Suspended=True is observed during the atomic flow.
That condition is only persisted for ~2s (one requeue interval) after
deletion completes if Spec.Suspend has already been flipped to false,
which made the test vulnerable to scheduling jitter under load.

Instead record the original RayCluster name before suspending and
assert (1) that cluster is deleted and (2) the eventually-Ready
RayService is backed by a different RayCluster. This proves atomic
completion more directly: the underlying cluster was actually torn down
and recreated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pendAtomic

Pin the underlying RayCluster with a synthetic finalizer so deletion
cannot complete, then flip Spec.Suspend back and forth while asserting
via Consistently that Suspending stays True. This matches the
"RayJob suspend operation shoud be atomic" pattern in
rayjob_controller_suspended_test.go and exercises the atomicity
property directly instead of inferring it from cluster recreation.

After removing the finalizer the test still verifies that the original
RayCluster is deleted and a different RayCluster eventually backs the
Ready RayService.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The function name and its own doc comment already convey what the call
site does; the inline comment was duplicative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The comment block above `clusterSpec := rayService.Spec.RayClusterSpec.DeepCopy()`
in constructRayClusterForRayService restated what is already documented on
rayClusterSpecForHashing and modifyRayCluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kevin85421 kevin85421 changed the title [WIP] Support rayservice suspend [WIP] Support RayService top-level suspend May 17, 2026
@kevin85421 kevin85421 changed the title [WIP] Support RayService top-level suspend [WIP][RayService][Kueue] Support RayService top-level suspend May 17, 2026
@kevin85421 kevin85421 changed the title [WIP][RayService][Kueue] Support RayService top-level suspend [RayService] Support top-level Spec.Suspend May 17, 2026
@kevin85421 kevin85421 changed the title [RayService] Support top-level Spec.Suspend [RayService][Kueue] Support top-level Spec.Suspend May 17, 2026
The full ./test/e2erayservice suite was nearly exhausting the 30m
Go-test timeout, and TestRayServiceSuspendResume was failing because a
single curl attempt hung on TCP retransmits past TestTimeoutShort —
Eventually couldn't retry. Two fixes:

- Split the suspend tests into their own Make target
  (test-e2e-rayservice-suspend) and Buildkite job; the existing
  test-e2e-rayservice / rayservice job runs with `-skip Suspend`. Each
  job gets its own 30m budget.
- Add --connect-timeout 3 --max-time 5 to the resumed-service curl so
  each attempt fails fast and Eventually can actually iterate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
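The reasoning behind the curl timeouts in the commit above (a hung attempt must fail fast so the polling loop can retry) can be shown with a Go analogue. This is not the test's code: the eventually helper and demo function below are hypothetical, and a per-attempt context timeout plays the role that --connect-timeout and --max-time play for curl.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// eventually retries attempt until it succeeds or the overall budget runs
// out. Each attempt gets its own, shorter timeout so a single hung call
// cannot consume the whole budget.
func eventually(overall, perAttempt, interval time.Duration, attempt func(ctx context.Context) error) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), overall)
	defer cancel()
	tries := 0
	for {
		tries++
		actx, acancel := context.WithTimeout(ctx, perAttempt)
		err := attempt(actx)
		acancel()
		if err == nil {
			return tries, nil
		}
		select {
		case <-ctx.Done():
			return tries, fmt.Errorf("gave up after %d tries: %w", tries, err)
		case <-time.After(interval):
		}
	}
}

// demo simulates a probe whose first call hangs until its own timeout
// fires and whose second call succeeds.
func demo() (int, error) {
	calls := 0
	probe := func(ctx context.Context) error {
		calls++
		if calls == 1 {
			<-ctx.Done() // simulated hang, bounded by the per-attempt timeout
			return ctx.Err()
		}
		return nil
	}
	return eventually(2*time.Second, 50*time.Millisecond, 10*time.Millisecond, probe)
}

func main() {
	tries, err := demo()
	fmt.Println(tries, err)
}
```

Without the per-attempt timeout, the first hung call in this sketch would block until the overall budget expired and the loop would never reach a second attempt.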
@kevin85421
Member Author

kevin85421 commented May 18, 2026

Hi @mimowo @hiboyang,

This is a POC PR (kubernetes-sigs/kueue#11264) based on the top-level suspend API. Could you take a look at whether the API and contract proposed in this KubeRay PR make sense? You don't need to review the implementation details — reading the PR description and the e2e tests should be enough to understand the behavior.

If this makes sense to you, I will ping other KubeRay folks to review this PR. Thanks!

@mimowo

mimowo commented May 18, 2026

Thank you @kevin85421 👍

cc @yaroslava-serdiuk who could likely provide the first review pass from the Kueue side.

@kevin85421
Member Author

Thanks @mimowo!

@yaroslava-serdiuk: The first step is to make sure the proposal in this KubeRay PR makes sense to you. Then I can ping other KubeRay folks to review it.

The Kueue PR kubernetes-sigs/kueue#11264 is a POC based on the API introduced in this KubeRay PR. It's not ready for review, but you can take a quick look to understand what the integration will look like.

Thanks!

@yaroslava-serdiuk
Contributor

Thanks @kevin85421 for the change! In general, the RayService.Spec.Suspend field makes sense to me.
Could you please clarify — what is the purpose of Spec.RayClusterSpec.Suspend field if there is RayService.Spec.Suspend field?

@hiboyang

+1 for the top-level RayService.Spec.Suspend, which will help Kueue's MultiKueue support RayService as well! Same question about Spec.RayClusterSpec.Suspend: do we still need it?

@kevin85421
Member Author

Thanks @yaroslava-serdiuk @hiboyang for the review.

Could you please clarify — what is the purpose of Spec.RayClusterSpec.Suspend field if there is RayService.Spec.Suspend field?

Sure.

Context

Before we dive into more details, it will be helpful to understand the following context:

  • RayService.Spec.RayClusterSpec:
    • The RayClusterSpec struct is defined in the RayCluster CRD. Therefore, RayService.Spec.RayClusterSpec.Suspend comes from the RayCluster CRD, not from RayService.
  • RayService's zero-downtime upgrade
    • When users update RayService's spec (e.g., upgrade the image from rayproject/ray:2.X.0 to rayproject/ray:2.{X+1}.0), it will trigger a zero-downtime upgrade.
    • A new RayCluster (let's call it new_cluster below) will be created, and KubeRay will create Ray Serve applications in the new_cluster.
    • Then, KubeRay will delete the old_cluster.

RayService.Spec.Suspend / RayService.Spec.RayClusterSpec.Suspend

Behavior:

  • RayService.Spec.Suspend:
    • true: delete all K8s resources owned by the RayService.
    • false: admit the RayService and consume the quota defined in RayService.Spec.RayClusterSpec.
  • RayService.Spec.RayClusterSpec.Suspend
    • At creation time: RayService.Spec.RayClusterSpec.Suspend is copied from the RayService onto the new RayCluster.
    • After the RayCluster exists: RayService.Spec.RayClusterSpec.Suspend will be ignored, and Kueue's RayService controller manages the underlying RayCluster CRs' suspend values.

Example: kubernetes-sigs/kueue#11264:

  1. A user creates a RayService CR. To simplify the explanation, we assume the RayService has only one head Pod and no worker Pods.
  2. Kueue's webhook sets RayService.Spec.Suspend and RayService.Spec.RayClusterSpec.Suspend to true, suspending the RayService.
    1. RayService.Spec.Suspend: true, RayService.Spec.RayClusterSpec.Suspend: true
  3. Kueue creates a Workload CR W1, which requires resources derived from RayService.Spec.RayClusterSpec (i.e., one head Pod).
  4. W1 will be admitted if the ClusterQueue has enough resources to accommodate one head Pod. Admitting the RayService means setting RayService.Spec.Suspend to false.
    1. RayService.Spec.Suspend: false, RayService.Spec.RayClusterSpec.Suspend: true
  5. Then, the KubeRay RayService controller will create a RayCluster CR (call it old_cluster in this example) with suspend=true, because RayService.Spec.RayClusterSpec.Suspend was set to true by Kueue's mutation webhook in Step 2.
    1. RayService.Spec.Suspend: false, RayService.Spec.RayClusterSpec.Suspend: true, old_cluster.Spec.Suspend: true
  6. The Kueue RayService controller will set old_cluster.Spec.Suspend to false because W1 is admitted. See the function unsuspendAdmittedChildren in the POC PR for more details.
    1. RayService.Spec.Suspend: false, RayService.Spec.RayClusterSpec.Suspend: true, old_cluster.Spec.Suspend: false
  7. The KubeRay RayCluster controller starts reconciling old_cluster.
  8. A user updates the RayService CR's image field to trigger a zero-downtime upgrade.
  9. The KubeRay RayService controller will create a new RayCluster CR (call it new_cluster in this example) with suspend=true, because RayService.Spec.RayClusterSpec.Suspend was set to true by Kueue's mutation webhook in Step 2.
    1. RayService.Spec.Suspend: false, RayService.Spec.RayClusterSpec.Suspend: true, old_cluster.Spec.Suspend: false, new_cluster.Spec.Suspend: true
  10. The Kueue RayService controller will create a new Workload CR W2 (elastic workload slicing must be enabled). To be admitted, W2 requests the sum of the resources of old_cluster and new_cluster (i.e., two head Pods).
  11. When the ClusterQueue has enough resources to admit W2, the Kueue RayService controller will set new_cluster.Spec.Suspend to false, and W1 will be marked as Finished.
    1. RayService.Spec.Suspend: false, RayService.Spec.RayClusterSpec.Suspend: true, old_cluster.Spec.Suspend: false, new_cluster.Spec.Suspend: false
  12. The KubeRay RayCluster controller starts reconciling new_cluster.
  13. When the new_cluster is ready to serve traffic, the KubeRay RayService controller will switch traffic to the new_cluster, then delete the old_cluster.
  14. The Kueue RayService controller will reduce W2's resource requirement to one head Pod.
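The walkthrough above can be traced with a small in-memory model that tracks only the suspend fields and the number of head Pods requested. This is a sketch under stated assumptions: the structs and the helpers kueueWebhook, headPodsRequested, admit, createCluster, and deleteCluster are hypothetical names invented for the example, and Workload objects W1 and W2 are reduced to a head-Pod count compared against a quota.

```go
package main

import "fmt"

// RayCluster and RayService are hypothetical, trimmed-down stand-ins for the CRs.
type RayCluster struct {
	Name    string
	Suspend bool // RayCluster.Spec.Suspend
}

type RayService struct {
	Suspend            bool // RayService.Spec.Suspend
	ClusterSpecSuspend bool // RayService.Spec.RayClusterSpec.Suspend
	Clusters           []*RayCluster
}

// kueueWebhook models step 2: both fields are set to true.
func kueueWebhook(s *RayService) { s.Suspend, s.ClusterSpecSuspend = true, true }

// headPodsRequested models the Workload size: one head Pod per RayCluster,
// and one head Pod before any RayCluster exists (steps 3, 10, and 14).
func headPodsRequested(s *RayService) int {
	if len(s.Clusters) == 0 {
		return 1
	}
	return len(s.Clusters)
}

// admit models steps 4, 6, and 11: if the request fits the quota, unsuspend
// the RayService and every existing child RayCluster.
func admit(s *RayService, quota int) bool {
	if headPodsRequested(s) > quota {
		return false
	}
	s.Suspend = false
	for _, c := range s.Clusters {
		c.Suspend = false
	}
	return true
}

// createCluster models steps 5 and 9: the nested Suspend is copied only here.
func createCluster(s *RayService, name string) *RayCluster {
	c := &RayCluster{Name: name, Suspend: s.ClusterSpecSuspend}
	s.Clusters = append(s.Clusters, c)
	return c
}

// deleteCluster models step 13: the old cluster is removed after the switch.
func deleteCluster(s *RayService, name string) {
	kept := s.Clusters[:0]
	for _, c := range s.Clusters {
		if c.Name != name {
			kept = append(kept, c)
		}
	}
	s.Clusters = kept
}

func main() {
	svc := &RayService{}
	kueueWebhook(svc)                         // step 2
	admit(svc, 1)                             // step 4: W1 fits one head Pod
	old := createCluster(svc, "old_cluster")  // step 5: created suspended
	admit(svc, 1)                             // step 6: old_cluster unsuspended
	next := createCluster(svc, "new_cluster") // step 9: created suspended
	fmt.Println(old.Suspend, next.Suspend, headPodsRequested(svc))
	fmt.Println(admit(svc, 1), admit(svc, 2)) // steps 10-11: W2 needs two head Pods
	deleteCluster(svc, "old_cluster")         // step 13
	fmt.Println(next.Suspend, headPodsRequested(svc)) // step 14
}
```

The model makes the role of the nested field visible: because RayService.Spec.RayClusterSpec.Suspend stays true after step 2, every newly created RayCluster starts suspended and only runs once the larger request has been admitted.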
