feat: remove default 8h max TTL for AgentRuntime sandboxes#360
feat: remove default 8h max TTL for AgentRuntime sandboxes#360HarshitPal25 wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the sandbox session duration logic to make MaxSessionDuration optional, allowing sandboxes to run indefinitely without a hard expiry when omitted. The Redis and Valkey store implementations are updated to skip the expiry index when ExpiresAt is zero, and helper functions are modified to prevent falling back to DefaultSandboxTTL. The reviewer points out that removing the fallback to DefaultSandboxTTL inside buildSandboxObject could lead to CodeInterpreter instances running indefinitely if created without an explicit duration (e.g., in unit tests). They suggest explicitly handling this fallback in buildSandboxByCodeInterpreter to maintain the 8-hour default guarantee.
| if params.ttl > 0 { | ||
| shutdownTime := metav1.NewTime(time.Now().Add(params.ttl)) | ||
| sandbox.Spec.Lifecycle.ShutdownTime = &shutdownTime | ||
| } |
There was a problem hiding this comment.
By removing the fallback to DefaultSandboxTTL when params.ttl == 0 inside buildSandboxObject, any CodeInterpreter created without an explicit MaxSessionDuration (for example, in unit tests or direct client-go calls where kubebuilder defaults are not applied) will now run indefinitely without a hard expiry.
To maintain the 8h default guarantee for CodeInterpreter across all environments, the fallback to DefaultSandboxTTL should be explicitly handled in buildSandboxByCodeInterpreter (both in the warm pool path and the standard path).
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #360 +/- ##
==========================================
+ Coverage 47.57% 55.43% +7.86%
==========================================
Files 30 34 +4
Lines 2819 3195 +376
==========================================
+ Hits 1341 1771 +430
+ Misses 1338 1244 -94
- Partials 140 180 +40
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Removes the implicit 8-hour hard TTL for AgentRuntime sandboxes so long-running agent services aren’t forcibly deleted, making spec.maxSessionDuration truly opt-in and propagating “no hard expiry” through sandbox info + store GC indexing.
Changes:
- Make
AgentRuntime.spec.maxSessionDurationoptional (omit => no hard cap). - Only set
Sandbox.Spec.Lifecycle.ShutdownTimewhen a positive TTL is explicitly provided. - Allow
ExpiresAtto be zero (no expiry) and skip indexing such sessions into the expiry sorted-set in Redis/Valkey.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/apis/runtime/v1alpha1/agent_type.go | Makes AgentRuntime max session duration optional in the API type. |
| pkg/workloadmanager/workload_builder.go | Stops defaulting TTL to 8h; only sets ShutdownTime when TTL > 0. |
| pkg/workloadmanager/sandbox_helper.go | Leaves ExpiresAt as zero when ShutdownTime is absent. |
| pkg/workloadmanager/sandbox_helper_test.go | Updates placeholder behavior expectations for zero ExpiresAt. |
| pkg/store/store_redis.go | Allows zero ExpiresAt and skips adding to expiry index when unset. |
| pkg/store/store_valkey.go | Same as Redis: zero ExpiresAt bypasses expiry indexing. |
Comments suppressed due to low confidence (1)
pkg/workloadmanager/workload_builder.go:206
- New behavior (
ttl == 0=> noShutdownTime/ no hard expiry) is introduced here but isn’t covered by tests inpkg/workloadmanager/workload_builder_test.go(current tests always pass a positive ttl). Please add a unit test asserting that whenttlis zero,sandbox.Spec.Lifecycle.ShutdownTimeremains nil so the sandbox is not hard-expired.
// Only set ShutdownTime when MaxSessionDuration is explicitly configured.
// When omitted (ttl == 0), the sandbox runs indefinitely and is only
// cleaned up by idle timeout (SessionTimeout).
if params.ttl > 0 {
shutdownTime := metav1.NewTime(time.Now().Add(params.ttl))
sandbox.Spec.Lifecycle.ShutdownTime = &shutdownTime
}
| // +kubebuilder:default="8h" | ||
| // When omitted, sessions have no maximum duration and will only be cleaned up | ||
| // by idle timeout (SessionTimeout). | ||
| // +optional |
| // Only set ShutdownTime when MaxSessionDuration is explicitly configured. | ||
| // When omitted (ttl == 0), the sandbox runs indefinitely and is only | ||
| // cleaned up by idle timeout (SessionTimeout). | ||
| if params.ttl > 0 { | ||
| shutdownTime := metav1.NewTime(time.Now().Add(params.ttl)) | ||
| sandbox.Spec.Lifecycle.ShutdownTime = &shutdownTime | ||
| } |
| pipe := rs.cli.Pipeline() | ||
| pipe.SetNX(ctx, sessionKey, b, 0) | ||
| pipe.ZAdd(ctx, rs.expiryIndexKey, redisv9.Z{ | ||
| Score: float64(sandboxRedis.ExpiresAt.Unix()), | ||
| Member: sandboxRedis.SessionID, | ||
| }) | ||
| // Only add to expiry index when ExpiresAt is set. | ||
| // Zero ExpiresAt means the sandbox has no maximum lifetime. | ||
| if !sandboxRedis.ExpiresAt.IsZero() { | ||
| pipe.ZAdd(ctx, rs.expiryIndexKey, redisv9.Z{ | ||
| Score: float64(sandboxRedis.ExpiresAt.Unix()), | ||
| Member: sandboxRedis.SessionID, | ||
| }) | ||
| } |
| if sandboxStore.ExpiresAt.IsZero() { | ||
| // Zero ExpiresAt means no maximum lifetime — skip expiry index. | ||
| // The sandbox will still be tracked in the activity index for idle timeout. | ||
| commands := make(valkey.Commands, 0, 3) | ||
| commands = append(commands, vs.cli.B().Setnx().Key(sessionKey).Value(string(b)).Build()) | ||
| commands = append(commands, vs.cli.B().Zadd().Key(vs.lastActivityIndexKey).ScoreMember(). | ||
| ScoreMember(float64(time.Now().Unix()), sandboxStore.SessionID).Build()) | ||
|
|
||
| for i, resp := range vs.cli.DoMulti(ctx, commands...) { | ||
| if err = resp.Error(); err != nil { | ||
| return fmt.Errorf("StoreSandbox: DoMulti failed: %w, command index: %v", err, i) | ||
| } | ||
| } | ||
| return nil | ||
| } |
| // +kubebuilder:default="8h" | ||
| // When omitted, sessions have no maximum duration and will only be cleaned up | ||
| // by idle timeout (SessionTimeout). | ||
| // +optional |
There was a problem hiding this comment.
This needs the generated CRD artifact update as well. Right now the installed AgentRuntime CRD still has default: 8h and still lists maxSessionDuration as required, so the API server will keep defaulting omitted values to 8h and this change will not actually fix #303 in a deployed cluster.
| if params.ttl > 0 { | ||
| shutdownTime := metav1.NewTime(time.Now().Add(params.ttl)) | ||
| sandbox.Spec.Lifecycle.ShutdownTime = &shutdownTime | ||
| } |
There was a problem hiding this comment.
This makes ttl == 0 mean no hard expiry for every caller of buildSandboxObject. That is correct for AgentRuntime, but CodeInterpreter is supposed to keep the 8h max-session default from the issue discussion and PR description. Can we set DefaultSandboxTTL explicitly in the CodeInterpreter paths when Spec.MaxSessionDuration is nil, including the warm-pool placeholder path?
There was a problem hiding this comment.
+1 It's better to align with CodeInterpreter.
| }) | ||
| // Only add to expiry index when ExpiresAt is set. | ||
| // Zero ExpiresAt means the sandbox has no maximum lifetime. | ||
| if !sandboxRedis.ExpiresAt.IsZero() { |
There was a problem hiding this comment.
Can we add store tests for the zero ExpiresAt branch? The important behavior is that the session is still stored and tracked in the last-activity index, but is not added to the expiry index.
There was a problem hiding this comment.
yes ofc we can i was fixing the failure test cases till now i would add store test for the zero expireat now
Remove the 'default: 8h' value and drop maxSessionDuration from the required list in the generated CRD YAML. This matches the updated kubebuilder markers in agent_type.go where +kubebuilder:default was removed and the field marked as +optional. Fixes codegen check failure in PR volcano-sh#360.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Remove the 'default: 8h' value and drop maxSessionDuration from the required list in the generated CRD YAML. This matches the updated kubebuilder markers in agent_type.go where +kubebuilder:default was removed and the field marked as +optional. Fixes codegen check failure in PR volcano-sh#360. Signed-off-by: HarshitPal25 <harshit13082006@gmail.com>
a3d5351 to
879413e
Compare
| pipe.SetNX(ctx, sessionKey, b, 0) | ||
| pipe.ZAdd(ctx, rs.expiryIndexKey, redisv9.Z{ | ||
| Score: float64(sandboxRedis.ExpiresAt.Unix()), | ||
| Member: sandboxRedis.SessionID, | ||
| }) | ||
| // Only add to expiry index when ExpiresAt is set. | ||
| // Zero ExpiresAt means the sandbox has no maximum lifetime. | ||
| if !sandboxRedis.ExpiresAt.IsZero() { | ||
| pipe.ZAdd(ctx, rs.expiryIndexKey, redisv9.Z{ |
| if sandboxStore.ExpiresAt.IsZero() { | ||
| // Zero ExpiresAt means no maximum lifetime — skip expiry index. | ||
| // The sandbox will still be tracked in the activity index for idle timeout. | ||
| commands := make(valkey.Commands, 0, 3) | ||
| commands = append(commands, vs.cli.B().Setnx().Key(sessionKey).Value(string(b)).Build()) |
FAUST-BENCHOU
left a comment
There was a problem hiding this comment.
recommand to rebase first
| Labels: map[string]string{ | ||
| SessionIdLabelKey: params.sessionID, | ||
| WorkloadNameLabelKey: params.workloadName, | ||
| "managed-by": "agentcube-workload-manager", |
There was a problem hiding this comment.
revert all this kind of fmt and rebase to the newest upstream first.We have solved all fmt problem in f17fa17 i think
| }{ | ||
| { | ||
| name: "no ShutdownTime falls back to DefaultSandboxTTL", | ||
| name: "no ShutdownTime means no expiry (sandbox runs indefinitely)", |
There was a problem hiding this comment.
why not remove all DefaultSandboxTTL in ut test like in workload_builder_test.go L44 since we are not using it anymore
| type: object | ||
| type: array | ||
| required: | ||
| - maxSessionDuration |
There was a problem hiding this comment.
also upd example/agent-runtime/agent-runtime.yaml
| if params.ttl > 0 { | ||
| shutdownTime := metav1.NewTime(time.Now().Add(params.ttl)) | ||
| sandbox.Spec.Lifecycle.ShutdownTime = &shutdownTime | ||
| } |
There was a problem hiding this comment.
+1 It's better to align with CodeInterpreter.
AgentRuntime.Spec.MaxSessionDuration had a kubebuilder default of "8h", meaning every AgentRuntime sandbox was forcefully terminated after 8 hours regardless of whether the user configured this field. For long-running agents (e.g., hosting OpenClaw within agentcube), this made AgentRuntime unusable as a persistent runtime. This change makes MaxSessionDuration truly optional: - Remove `+kubebuilder:default="8h"` from AgentRuntimeSpec, replacing it with `+optional`. When omitted, sandboxes have no maximum lifetime. - Update buildSandboxObject() to only set Lifecycle.ShutdownTime when ttl > 0 (i.e., MaxSessionDuration is explicitly configured). When ttl == 0, the sandbox runs indefinitely. - Update buildSandboxPlaceHolder() and buildSandboxInfo() to use zero ExpiresAt when ShutdownTime is nil, instead of falling back to the hardcoded DefaultSandboxTTL constant. - Sandboxes without MaxSessionDuration are still cleaned up by idle timeout (SessionTimeout), which defaults to 15 minutes. - Update tests to reflect the new behavior. Fixes volcano-sh#303 Signed-off-by: HarshitPal25 <harshit13082006@gmail.com>
Remove the 'default: 8h' value and drop maxSessionDuration from the required list in the generated CRD YAML. This matches the updated kubebuilder markers in agent_type.go where +kubebuilder:default was removed and the field marked as +optional. Fixes codegen check failure in PR volcano-sh#360. Signed-off-by: HarshitPal25 <harshit13082006@gmail.com>
879413e to
d5439a1
Compare
- CodeInterpreter explicitly sets DefaultSandboxTTL (8h) when MaxSessionDuration is nil, in both the direct sandbox path and the warm-pool placeholder path. This preserves the agreed-upon default for CodeInterpreter while AgentRuntime remains unlimited by default. - Add TestRedisStore_StoreSandbox_ZeroExpiresAt to verify that a sandbox with zero ExpiresAt is stored, retrievable, absent from the expiry index (not GC'd by TTL), and still tracked in the last- activity index (eligible for idle-timeout GC). - Add TestBuildSandboxObject_NoTTL to verify that ttl==0 produces a Sandbox with nil ShutdownTime (AgentRuntime indefinite lifetime path). - Replace DefaultSandboxTTL with an explicit time.Hour in workload_builder_test.go L44 since that constant is no longer the implied default for every buildSandboxObject caller. - Update example/agent-runtime/agent-runtime.yaml: remove hard-coded maxSessionDuration: 8h and replace it with a commented-out example documenting the optional nature of the field. Signed-off-by: HarshitPal25 <harshit13082006@gmail.com>
…iewers ask volcano-sh#360 - CodeInterpreter now explicitly sets DefaultSandboxTTL (8h) when MaxSessionDuration is nil, in both the direct sandbox path and the warm-pool placeholder path, preserving the agreed-upon behavior while AgentRuntime remains unlimited by default. - Add TestRedisStore_StoreSandbox_ZeroExpiresAt: verifies that a sandbox with zero ExpiresAt is stored and retrievable, absent from the expiry index (not hard-deleted), and still tracked in the last-activity index (idle-timeout GC still works). - Add TestBuildSandboxObject_NoTTL: verifies that ttl==0 produces a Sandbox with nil ShutdownTime (AgentRuntime indefinite lifetime path). - Replace DefaultSandboxTTL with time.Hour in workload_builder_test.go since DefaultSandboxTTL is no longer the implied default for every buildSandboxObject caller. - Update example/agent-runtime/agent-runtime.yaml: remove hard-coded maxSessionDuration: 8h and replace with a commented-out example that documents the field is optional. Signed-off-by: HarshitPal25 <harshit13082006@gmail.com>
d5439a1 to
8ff1fe7
Compare
…iewers ask volcano-sh#360 - CodeInterpreter now explicitly sets DefaultSandboxTTL (8h) when MaxSessionDuration is nil, in both the direct sandbox path and the warm-pool placeholder path, preserving the agreed-upon behavior while AgentRuntime remains unlimited by default. - Add TestRedisStore_StoreSandbox_ZeroExpiresAt: verifies that a sandbox with zero ExpiresAt is stored and retrievable, absent from the expiry index (not hard-deleted), and still tracked in the last-activity index (idle-timeout GC still works). - Add TestBuildSandboxObject_NoTTL: verifies that ttl==0 produces a Sandbox with nil ShutdownTime (AgentRuntime indefinite lifetime path). - Replace DefaultSandboxTTL with time.Hour in workload_builder_test.go since DefaultSandboxTTL is no longer the implied default for every buildSandboxObject caller. - Update example/agent-runtime/agent-runtime.yaml: remove hard-coded maxSessionDuration: 8h and replace with a commented-out example that documents the field is optional. - Fix go.mod: re-sort require entries alphabetically (go mod tidy). Signed-off-by: HarshitPal25 <harshit13082006@gmail.com>
8ff1fe7 to
71a3bdf
Compare
Signed-off-by: HarshitPal25 <harshit13082006@gmail.com>
|
Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages. The list of commits with invalid commit messages:
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
| if sandboxStore.ExpiresAt.IsZero() { | ||
| // Zero ExpiresAt means no maximum lifetime — skip expiry index. | ||
| // The sandbox will still be tracked in the activity index for idle timeout. | ||
| commands := make(valkey.Commands, 0, 3) | ||
| commands = append(commands, vs.cli.B().Setnx().Key(sessionKey).Value(string(b)).Build()) | ||
| commands = append(commands, vs.cli.B().Zadd().Key(vs.lastActivityIndexKey).ScoreMember(). | ||
| ScoreMember(float64(time.Now().Unix()), sandboxStore.SessionID).Build()) | ||
|
|
||
| for i, resp := range vs.cli.DoMulti(ctx, commands...) { | ||
| if err = resp.Error(); err != nil { | ||
| return fmt.Errorf("StoreSandbox: DoMulti failed: %w, command index: %v", err, i) | ||
| } |
There was a problem hiding this comment.
I am seeing this is a little different from redis, is that because valkey sdk does not align with redis?
There was a problem hiding this comment.
the structural difference is purely due to the Valkey SDK's API design, not a logic difference.
Redis (go-redis/v9) has a Pipeline() abstraction that lets you queue a variable number of commands and skip some conditionally before calling Exec():
pipe := rs.cli.Pipeline()
pipe.SetNX(...)
if !sandboxRedis.ExpiresAt.IsZero() {
pipe.ZAdd(...) // conditionally added
}
pipe.ZAdd(...) // lastActivity always added
pipe.Exec(ctx)
Valkey (valkey-go) uses DoMulti() which takes a fixed slice of commands built upfront. Since Go slices can't have "holes", the cleanest pattern with this SDK is to branch the two cases (zero vs non-zero ExpiresAt) into separate command slices with an early return, rather than trying to conditionally append inside a single block:
if sandboxStore.ExpiresAt.IsZero() {
commands := [SETNX, ZADD lastActivity]
vs.cli.DoMulti(ctx, commands...)
return nil
}
commands := [SETNX, ZADD expiryIndex, ZADD lastActivity]
vs.cli.DoMulti(ctx, commands...)
The end result is identical behaviour — zero ExpiresAt skips the expiry index in both stores. The branching is an SDK-driven implementation detail, not a semantic difference.
hzxuzhonghu
left a comment
There was a problem hiding this comment.
Another point from me, we can later expose sandbox lifecycle managerment api through sdk. Otherwise for agentruntime, user may not be able to delete the sandbox
|
@hzxuzhonghu Agreed! This is a great point. Since maxSessionDuration is now opt-in, AgentRuntime sandboxes can run indefinitely — users will need a way to explicitly delete them. Exposing a sandbox lifecycle management API (like a delete/terminate session call) through the SDK would address this nicely. I'll open a separate follow-up issue to track it so it doesn't hold up this PR. |
What this PR does
Fixes #303
Problem
AgentRuntime.spec.maxSessionDurationhad a kubebuilder default of8h, which caused the workload-manager to always setSandbox.Spec.Lifecycle.ShutdownTimetonow + 8h. This meant every AgentRuntime sandbox was hard-deleted after 8 hours, even if it was still actively serving a long-running agent (e.g. hosting OpenClaw or any persistent agent service).Solution
Make
maxSessionDurationtruly opt-in forAgentRuntime:agent_type.go: Removed+kubebuilder:default="8h"fromAgentRuntimeSpec.MaxSessionDuration; added+optionaland a clear comment that omitting it means no hard cap.workload_builder.go:buildSandboxObjectno longer falls back toDefaultSandboxTTLwhenttl == 0.ShutdownTimeis only set whenMaxSessionDurationis explicitly configured by the user.sandbox_helper.go:buildSandboxPlaceHolderandbuildSandboxInfono longer fall back toDefaultSandboxTTLwhenShutdownTimeis absent.ExpiresAtis left as zero time (meaning no hard expiry).store_redis.go: Skips adding to the expiry sorted-set whenExpiresAtis zero, so the GC loop never forcibly expires these sandboxes.store_valkey.go: Same — zeroExpiresAtbypasses the expiry index, only tracking the session in the activity index for idle-timeout cleanup.ShutdownTime→ zeroExpiresAt→ sandbox runs indefinitely, cleaned up only bysessionTimeoutidle logic.CodeInterpreteris unchanged — it retains the 8h default at thebuildSandboxByCodeInterpretercall-site.Behaviour after this change
maxSessionDurationset?24h)