
feat: add --duration flag for self-terminated load runs #31

Merged: bdchatham merged 2 commits into main from feat/run-duration on Apr 30, 2026

bdchatham (Contributor) commented on Apr 30, 2026

Summary

Add a --duration flag so seiload self-terminates cleanly before the K8s Job's activeDeadlineSeconds is reached. The pod exits 0 and the Job condition becomes Complete, instead of the Failed/DeadlineExceeded condition that Kubernetes sets when the deadline fires.

Why this exists

Follow-up to #30. The exit-code fix in #30 is correct in isolation (graceful SIGTERM → exit 0), but on Kubernetes Jobs with a Job-level activeDeadlineSeconds, the Job controller sets condition=Failed, reason=DeadlineExceeded regardless of the container's exit code:

Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.
Kubernetes Job docs

A run in harbor's nightly namespace today (workflow 25189102396) confirmed this: seiload's pod was SIGTERMed at the deadline and likely exited 0 (post-#30), but the Job condition was still set to:

"reason": "DeadlineExceeded",
"message": "Job was active longer than specified deadline"

This means kube_job_failed=1 and KubeJobFailed keeps firing — the original symptom that motivated this whole investigation. The actual fix has to make seiload self-terminate before K8s decides "DeadlineExceeded."

What changes

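```go
// Flag registration: the default of 0 keeps the old run-until-signal behavior.
rootCmd.Flags().Duration("duration", 0, "Run duration; the load test ctx is canceled after this elapses, the existing graceful-shutdown path runs, and the process exits 0. 0 means run until SIGTERM/SIGINT.")

// At run time: bound the load-test ctx only when --duration > 0.
if duration, _ := cmd.Flags().GetDuration("duration"); duration > 0 {
    log.Printf("⏰ Run duration: %s", duration)
    var cancel context.CancelFunc
    ctx, cancel = context.WithTimeout(ctx, duration)
    // defer is function-scoped: cancel runs when the enclosing function returns.
    defer cancel()
}
```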

When --duration is set:

  1. The load-test context is wrapped with WithTimeout.
  2. After duration elapses, ctx is canceled with context.DeadlineExceeded.
  3. Background tasks (dispatcher, logger, block_collector) unwind via ctx.Done().
  4. service.Run returns the wrapped DeadlineExceeded error.
  5. Final stats emit, EmitRunSummary runs, post-summary flush sleeps for 45s.
  6. The existing errors.Is(err, context.DeadlineExceeded) boundary check from #30 ("fix: exit 0 on graceful SIGTERM/deadline shutdown") clears the error.
  7. Process exits 0.

The existing post-summary flush delay still runs by design: it sits in the cleanup pipeline after service.Run returns.

What doesn't change

  • Default is 0 (unlimited) so existing callers without the flag are unaffected.
  • SIGTERM/SIGINT handling in main.go:347-352 is untouched and still works.
  • The exit-code semantics from #30 (Canceled OR DeadlineExceeded → exit 0) already cover both the internal-timeout and external-SIGTERM paths.

Test plan

  • GOWORK=off go build . passes
  • GOWORK=off go vet ./... clean
  • The companion platform PR will pass --duration=${DURATION_MINUTES}m in the seiload args; tomorrow's nightly run will verify that the Job condition flips to Complete

Companion change

Will follow with a small PR on sei-protocol/platform that:

  1. Bumps the seiload image to the new SHA after this merges.
  2. Adds --duration=${DURATION_MINUTES}m to the seiload args in clusters/harbor/nightly/templates/seiload-job.yaml.
  3. Optionally raises JOB_DEADLINE_SECONDS slightly to keep activeDeadlineSeconds as a backstop only.

🤖 Generated with Claude Code

Wraps the load-test ctx in context.WithTimeout when --duration > 0.
When the timeout fires, the existing graceful-shutdown path runs
(background tasks unwind via ctx.Done(), final stats emit, post-
summary flush completes), and the existing errors.Is check at the
exit boundary treats DeadlineExceeded as exit 0.

Motivation: K8s Job-level activeDeadlineSeconds always sets the
Job condition to Failed/DeadlineExceeded regardless of container
exit code. With --duration set inside the deadline, seiload exits
cleanly before K8s force-terminates, so the Job goes Complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham requested a review from amir-deris on April 30, 2026 21:53
bdchatham merged commit 5d3badf into main on Apr 30, 2026
3 checks passed
bdchatham deleted the feat/run-duration branch on April 30, 2026 22:26