
fix: exit 0 on graceful SIGTERM/deadline shutdown#30

Merged
bdchatham merged 1 commit into main from fix/exit-graceful-shutdown on Apr 30, 2026

Conversation


bdchatham (Contributor) commented Apr 30, 2026

Summary

seiload exits non-zero on every graceful shutdown. Treat context.Canceled (and context.DeadlineExceeded for future-proofing) as the expected end-of-run signal at the top-level boundary of runLoadTest.

Root cause

Tracing the chain when Kubernetes SIGTERMs the container at activeDeadlineSeconds:

  1. Signal handler in runLoadTest (main.go:347-352) catches SIGTERM cleanly and the main task closure passed to service.Run returns nil.
  2. service.Run (utils/service/start.go:124-132) cancels its internal context once the main task finishes.
  3. Background tasks see ctx.Done() and return ctx.Err() (= context.Canceled):
    • sender/dispatcher.go:106
    • stats/logger.go:253
    • stats/block_collector.go:87
  4. SpawnBgNamed (utils/service/start.go:101-113) wraps with fmt.Errorf("%s: %w", name, err) — preserves errors.Is traversal.
  5. errgroup returns the first error → runLoadTest returns it (main.go:365) → rootCmd.Run calls log.Fatal(err) (main.go:49) → exit 1.

Net effect: every successful nightly run exits 1, the K8s Job goes to Failed, and kube_job_failed=1 fires KubeJobFailed alerts on the harbor cluster (8 distinct run IDs in the last 14d).
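To illustrate step 4 with a minimal standalone sketch (toy task name, not the seiload code): `fmt.Errorf` with `%w` keeps the `errors.Is` chain intact, which is exactly why the sentinel survives all the way up to the boundary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the graceful-shutdown cancellation

	// Mirrors the SpawnBgNamed wrap: fmt.Errorf("%s: %w", name, err)
	wrapped := fmt.Errorf("dispatcher: %w", ctx.Err())

	fmt.Println(errors.Is(wrapped, context.Canceled)) // true
}
```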

Fix

```go
log.Printf("👋 Shutdown complete")
if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
    err = nil
}
return err
```

This is the conventional Go pattern at the process boundary (it parallels http.ErrServerClosed). The signal handler at main.go:351-352 already declares "SIGTERM is graceful" by returning nil; this check honors that intent. DeadlineExceeded is included defensively — the internal ctx in service.Run is WithCancel today, so DeadlineExceeded can't currently appear, but if anyone later wraps the parent context in WithTimeout we'd regress without it.
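For comparison, a minimal sketch of the analogous net/http idiom (assumed server setup, not code from this repo):

```go
func serve(ctx context.Context) error {
	srv := &http.Server{Addr: ":8080"}
	go func() {
		<-ctx.Done()
		_ = srv.Shutdown(context.Background()) // graceful stop → ErrServerClosed
	}()
	if err := srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
		return err // only real failures propagate
	}
	return nil // graceful close counts as success, same as the boundary check here
}
```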

Why at the boundary, not at task sites

We considered pushing return nil into each background task, but return ctx.Err() at task sites is the honest signal: if a sibling task fails for a real reason (sender error, RPC down) and that failure propagates as ctx cancellation, each background task should surface that fact rather than silently swallow it. The boundary check is the right level — a single point where the process owner decides "this counts as success." The sketch below illustrates the distinction.
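A standalone sketch with toy tasks (the real wiring lives in service.Run, which per step 5 above is errgroup-based):

```go
g, ctx := errgroup.WithContext(context.Background())

// A real failure: recorded first, then cancels ctx for every sibling.
g.Go(func() error {
	return fmt.Errorf("sender: RPC down")
})

// A well-behaved background task: reports why it stopped instead of
// swallowing the cancellation.
g.Go(func() error {
	<-ctx.Done()
	return ctx.Err()
})

// Wait returns the first error — "sender: RPC down". The boundary check
// leaves it intact because it isn't Canceled/DeadlineExceeded.
fmt.Println(g.Wait())
```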

Test plan

  • go build . passes
  • Next nightly run on harbor: the K8s Job should reach condition=Complete, kube_job_failed should stay at 0, and the KubeJobFailed alert should clear for new runs

Production impact

Fixes the upstream cause of KubeJobFailed warnings on nightly/seiload-* Jobs (harbor cluster, nightly namespace). Companion PR in sei-protocol/platform adds (a) EKS auth refresh in the workflow's teardown step and (b) a defense-in-depth GC CronJob, which together address resource leaks discovered while diagnosing this exit-code issue.

🤖 Generated with Claude Code

Background tasks return ctx.Err() on shutdown, which bubbles to
runLoadTest as context.Canceled and is then log.Fatal()'d, so every
graceful shutdown exits 1.

Treat context.Canceled (and DeadlineExceeded for future-proofing) as
the expected end-of-run signal at the process boundary.

Fixes KubeJobFailed alerts for nightly/seiload-* on harbor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham requested a review from amir-deris on April 30, 2026 at 20:25
bdchatham merged commit ae42cb0 into main on Apr 30, 2026
2 checks passed
bdchatham deleted the fix/exit-graceful-shutdown branch on April 30, 2026 at 20:41
bdchatham added a commit that referenced this pull request on Apr 30, 2026
## Summary

Add `--duration` flag so seiload self-terminates cleanly inside K8s Job
`activeDeadlineSeconds`, producing pod exit 0 → Job condition `Complete`
instead of K8s-mandated `Failed/DeadlineExceeded`.

## Why this exists

Followup to #30. The exit-code fix in #30 is correct in isolation
(graceful SIGTERM → exit 0), but on Kubernetes Jobs with **Job-level
`activeDeadlineSeconds`**, the K8s Job controller sets
`condition=Failed, reason=DeadlineExceeded` *regardless of the
container's exit code*:

> Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.
> — [Kubernetes Job docs](https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup)

A run on harbor's `nightly` namespace today (workflow
[25189102396](https://github.com/sei-protocol/platform/actions/runs/25189102396))
confirmed this: seiload's pod was SIGTERMed at the deadline and likely
exited 0 (post-#30), but the Job condition was set:

```
"reason": "DeadlineExceeded",
"message": "Job was active longer than specified deadline"
```

This means `kube_job_failed=1` and `KubeJobFailed` keeps firing — the
original symptom that motivated this whole investigation. The actual fix
has to make seiload self-terminate *before* K8s decides
"DeadlineExceeded."

## What changes

```go
rootCmd.Flags().Duration("duration", 0, "Run duration; the load test ctx is canceled after this elapses, the existing graceful-shutdown path runs, and the process exits 0. 0 means run until SIGTERM/SIGINT.")
```

```go
if duration, _ := cmd.Flags().GetDuration("duration"); duration > 0 {
    log.Printf("⏰ Run duration: %s", duration)
    var cancel context.CancelFunc
    ctx, cancel = context.WithTimeout(ctx, duration)
    defer cancel()
}
```

When `--duration` is set:
1. The load-test context is wrapped with `WithTimeout`.
2. After `duration` elapses, the ctx is canceled and `ctx.Err()` reports
`context.DeadlineExceeded`.
3. Background tasks (dispatcher, logger, block_collector) unwind via
`ctx.Done()`.
4. `service.Run` returns the wrapped DeadlineExceeded error.
5. Final stats emit, `EmitRunSummary` runs, post-summary flush sleeps
for 45s.
6. Existing boundary check from #30 (`errors.Is(err,
context.DeadlineExceeded)`) clears the error.
7. Process exits 0.

The existing post-summary flush delay still runs by design — it sits
*after* `service.Run` returns, in the cleanup pipeline.
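A standalone sketch of the timeout mechanics behind steps 2 and 6 (toy duration, not the seiload wiring):

```go
ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
defer cancel()

<-ctx.Done() // fires once the deadline passes

fmt.Println(ctx.Err())                                      // context deadline exceeded
fmt.Println(errors.Is(ctx.Err(), context.DeadlineExceeded)) // true → cleared at the boundary
```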

## What doesn't change

- Default is `0` (unlimited) so existing callers without the flag are
unaffected.
- SIGTERM/SIGINT handling in `main.go:347-352` is untouched and still
works.
- Exit-code semantics from #30 (Canceled OR DeadlineExceeded → exit 0)
already cover both internal-timeout and external-SIGTERM paths.

## Test plan

- [x] `GOWORK=off go build .` passes
- [x] `GOWORK=off go vet ./...` clean
- [ ] Companion platform PR will pass `--duration=${DURATION_MINUTES}m`
to seiload args; tomorrow's nightly will verify Job condition flips to
`Complete`

## Companion change

Will follow with a small PR on `sei-protocol/platform` that:
1. Bumps the seiload image to the new SHA after this merges.
2. Adds `--duration=${DURATION_MINUTES}m` to seiload args in
`clusters/harbor/nightly/templates/seiload-job.yaml`.
3. Optionally raises `JOB_DEADLINE_SECONDS` slightly to keep
`activeDeadlineSeconds` as a backstop only.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
