[Autoscaler] Add idleTerminationSeconds for cluster-level idle termination #63465

Draft
win5923 wants to merge 1 commit into ray-project:master from win5923:autoscaler-terminate-idle

Conversation

win5923 (Member) commented May 18, 2026

Description

Terminate an idle cluster when the autoscaler is enabled.

When autoscalerOptions.idleTerminationSeconds is set, the V2 autoscaler evaluates a cluster-level idle predicate on every reconcile loop and, when the predicate fires, patches a single annotation on the RayCluster CR:

metadata:
  annotations:
    ray.io/idle-ttl-expired: "true"

The KubeRay operator observes the annotation and decides the terminal action (deleting the RayCluster).
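
As a rough illustration of the annotation hand-off, the patch body could be built as below. This is a minimal sketch, assuming a JSON merge patch against the RayCluster CR; the helper name `build_idle_ttl_patch` is hypothetical and not taken from the PR, and the PR may patch the annotation differently.

```python
import json

# Annotation key and value as given in the PR description.
IDLE_TTL_ANNOTATION = "ray.io/idle-ttl-expired"


def build_idle_ttl_patch() -> dict:
    """Return a merge-patch body that sets only the idle-TTL annotation.

    Hypothetical helper: the autoscaler only marks the cluster as expired;
    deciding what to do with a marked cluster is left to the operator.
    """
    return {"metadata": {"annotations": {IDLE_TTL_ANNOTATION: "true"}}}


if __name__ == "__main__":
    print(json.dumps(build_idle_ttl_patch(), indent=2))
```

Keeping the patch to a single annotation means the autoscaler never issues a delete itself, which matches the split of responsibilities described above.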

Changes

This PR adds autoscalerOptions.idleTerminationSeconds and a four-layer predicate that decides when the cluster is truly idle. The autoscaler emits an annotation; KubeRay owns the lifecycle action.

1. New autoscalerOptions.idleTerminationSeconds field, V2 + KubeRay only

Background: per-node idleTimeoutSeconds only scales down idle worker pods; it does not terminate the cluster.

This PR introduces the field with the following semantics:

  • Set in spec.autoscalerOptions.idleTerminationSeconds.
  • Validated at parse time: numeric, non-negative, and strictly greater than idleTimeoutSeconds (default 60 when unset). Strict > avoids the race where worker scale-down and cluster termination fire on the same reconcile loop, and keeps the predicate's Gate 0 well-behaved.
  • Unset = feature disabled. Existing CRs and existing V1 / non-KubeRay deployments see no behavior change.
  • V1 paths emit a one-time startup warning if the field is set, then ignore it. Three isolation layers (config getter only on V2, predicate only in V2 scheduler, dispatch only in KubeRayProvider) ensure no leakage.
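
The validation rules listed above can be sketched as follows. This is an illustrative sketch only: the function name, the error messages, and the choice to reject booleans are assumptions, not the PR's actual code. The default of 60 for idleTimeoutSeconds comes from the description.

```python
from typing import Optional, Union

# Default stated in the PR description for an unset idleTimeoutSeconds.
DEFAULT_IDLE_TIMEOUT_SECONDS = 60

Number = Union[int, float]


def validate_idle_termination_seconds(
    idle_termination_seconds: Optional[Number],
    idle_timeout_seconds: Optional[Number] = None,
) -> Optional[float]:
    """Hypothetical parse-time check; returns None when the feature is disabled."""
    if idle_termination_seconds is None:
        return None  # unset means the feature is off
    value = idle_termination_seconds
    # bool is a subclass of int in Python, so exclude it explicitly (an assumption).
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("idleTerminationSeconds must be numeric")
    if value < 0:
        raise ValueError("idleTerminationSeconds must be non-negative")
    timeout = (
        DEFAULT_IDLE_TIMEOUT_SECONDS
        if idle_timeout_seconds is None
        else idle_timeout_seconds
    )
    # Strict inequality: equal values would let worker scale-down and
    # cluster termination fire on the same reconcile loop.
    if value <= timeout:
        raise ValueError(
            "idleTerminationSeconds must be strictly greater than idleTimeoutSeconds"
        )
    return float(value)
```

For example, `validate_idle_termination_seconds(300)` is accepted, while `validate_idle_termination_seconds(60)` is rejected because it equals the default per-node timeout.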

2. Four-layer idle predicate in the V2 scheduler + reconciler

Background: a naive min(idle_duration_ms across alive nodes) > threshold is unsound. Drivers register through WorkerPool::RegisterDriver and never enter leased_workers_, so a pure-Python driver on the head keeps idle_duration_ms growing while the node status stays IDLE. Per-node and cluster-level idle also race during scale-down, and the scheduler does not see pending demand placed mid-reconcile.
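
To make the driver blind spot concrete, here is a toy version of the naive check. The function and the numbers are made up for illustration; only the shape of the predicate follows the description.

```python
from typing import Sequence


def naive_cluster_idle(idle_durations_ms: Sequence[int], threshold_ms: int) -> bool:
    """Naive predicate: the least-idle alive node has exceeded the threshold."""
    return min(idle_durations_ms) > threshold_ms


# Hypothetical snapshot: a head-only cluster whose raylet reports ten minutes
# of idleness, even though one pure-Python driver is still attached.
head_idle_ms = 600_000
alive_drivers = 1

# The check says "idle" although alive_drivers is non-zero.
print(naive_cluster_idle([head_idle_ms], threshold_ms=300_000))  # True
```

The node-level idle duration carries no information about the attached driver, which is why a separate driver check is needed.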

The predicate composes four layers, each addressing a distinct failure mode:

  • Gate 0 (scheduler): every worker group must already be at minReplicas. Defers the cluster predicate until per-node idle termination has finished its work.
  • Layer 1 (scheduler): every alive node's idle_duration_ms must exceed idleTerminationSeconds. Aligned with the existing _enforce_idle_termination definition of "alive" = SCHEDULABLE.
  • Layer 2 (scheduler): request.resource_requests, gang_resource_requests, and cluster_resource_constraints must all be empty. Catches the moment between "user submitted task" and "worker assigned".
  • Layer 3 (reconciler): the GCS job table must report zero alive drivers. Closes the raylet's blind spot for drivers on the head. Runs in the reconciler rather than the scheduler, to keep the scheduler free of I/O. Only queries GCS when Gate 0 and Layers 1–2 have already passed, so the steady-state cost is zero extra RPCs. Fails closed on RPC error (treats unknown as "drivers present").
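
The composition of the gate and the three layers can be sketched as below. All class and function names here are hypothetical stand-ins rather than the PR's types; the GCS query is modeled as a plain callable so the example stays self-contained, and "at minReplicas" is read as equality, which is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class WorkerGroup:
    replicas: int
    min_replicas: int


@dataclass(frozen=True)
class Node:
    idle_duration_ms: int
    schedulable: bool = True  # "alive" is taken to mean SCHEDULABLE


@dataclass(frozen=True)
class Demand:
    resource_requests: Sequence = ()
    gang_resource_requests: Sequence = ()
    cluster_resource_constraints: Sequence = ()


def scheduler_layers_pass(groups, nodes, demand, idle_termination_seconds) -> bool:
    """Gate 0 plus Layers 1 and 2: pure checks, no I/O."""
    # Gate 0: every worker group is already at minReplicas.
    if any(g.replicas != g.min_replicas for g in groups):
        return False
    # Layer 1: every alive node has been idle longer than the threshold.
    threshold_ms = idle_termination_seconds * 1000
    alive = [n for n in nodes if n.schedulable]
    if any(n.idle_duration_ms <= threshold_ms for n in alive):
        return False
    # Layer 2: no pending demand of any kind.
    if (
        demand.resource_requests
        or demand.gang_resource_requests
        or demand.cluster_resource_constraints
    ):
        return False
    return True


def no_alive_drivers(count_alive_drivers: Callable[[], int]) -> bool:
    """Layer 3: fail closed, treating any error as 'drivers present'."""
    try:
        return count_alive_drivers() == 0
    except Exception:
        return False


def cluster_is_idle(groups, nodes, demand, idle_termination_seconds,
                    count_alive_drivers: Callable[[], int]) -> bool:
    # Short-circuit: the driver query runs only after the pure checks pass.
    return scheduler_layers_pass(
        groups, nodes, demand, idle_termination_seconds
    ) and no_alive_drivers(count_alive_drivers)
```

Because `and` short-circuits, the driver query is skipped whenever an earlier check fails, which mirrors the "zero extra RPCs in steady state" property described in Layer 3.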

Related issues

Closes #63452


…ation

Signed-off-by: win5923 <ken89@kimo.com>
@win5923 win5923 force-pushed the autoscaler-terminate-idle branch from cad0af4 to 3298172 Compare May 19, 2026 15:47