
fix: exit 0 on graceful SIGTERM/deadline shutdown#30

Merged
bdchatham merged 1 commit into main from fix/exit-graceful-shutdown on Apr 30, 2026

Conversation


bdchatham (Contributor) commented Apr 30, 2026

Summary

seiload exits non-zero on every graceful shutdown. Treat context.Canceled (and context.DeadlineExceeded for future-proofing) as the expected end-of-run signal at the top-level boundary of runLoadTest.

Root cause

Tracing the chain when Kubernetes SIGTERMs the container at activeDeadlineSeconds:

  1. Signal handler in runLoadTest (main.go:347-352) catches SIGTERM cleanly and the main task closure passed to service.Run returns nil.
  2. service.Run (utils/service/start.go:124-132) cancels its internal context once the main task finishes.
  3. Background tasks see ctx.Done() and return ctx.Err() (= context.Canceled):
    • sender/dispatcher.go:106
    • stats/logger.go:253
    • stats/block_collector.go:87
  4. SpawnBgNamed (utils/service/start.go:101-113) wraps with fmt.Errorf("%s: %w", name, err) — preserves errors.Is traversal.
  5. errgroup returns the first error → runLoadTest returns it (main.go:365) → rootCmd.Run calls log.Fatal(err) (main.go:49) → exit 1.

Net effect: every successful nightly run exits 1, the K8s Job goes to Failed, and kube_job_failed=1 fires KubeJobFailed alerts on the harbor cluster (8 distinct run IDs in the last 14d).
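To illustrate step 4 with a minimal standalone sketch (toy task name, not the seiload code): `fmt.Errorf` with `%w` keeps the `errors.Is` chain intact, which is exactly why the sentinel survives all the way up to the boundary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the graceful-shutdown cancellation

	// Mirrors the SpawnBgNamed wrap: fmt.Errorf("%s: %w", name, err)
	wrapped := fmt.Errorf("dispatcher: %w", ctx.Err())

	fmt.Println(errors.Is(wrapped, context.Canceled)) // true
}
```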

Fix

```go
log.Printf("👋 Shutdown complete")
if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
    err = nil
}
return err
```

This is the conventional Go pattern at the process boundary (it parallels http.ErrServerClosed). The signal handler at main.go:351-352 already declares "SIGTERM is graceful" by returning nil; this check honors that intent. DeadlineExceeded is included defensively — the internal ctx in service.Run is WithCancel today, so DeadlineExceeded can't currently appear, but if anyone later wraps the parent context in WithTimeout we'd regress without it.
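For comparison, a minimal sketch of the analogous net/http idiom (assumed server setup, not code from this repo):

```go
func serve(ctx context.Context) error {
	srv := &http.Server{Addr: ":8080"}
	go func() {
		<-ctx.Done()
		_ = srv.Shutdown(context.Background()) // graceful stop → ErrServerClosed
	}()
	if err := srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
		return err // only real failures propagate
	}
	return nil // graceful close counts as success, same as the boundary check here
}
```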

Why at the boundary, not at task sites

We considered pushing return nil into each background task, but return ctx.Err() at task sites is the honest signal: if a sibling task fails for a real reason (sender error, RPC down) and that failure propagates as ctx cancellation, each background task should surface that fact rather than silently swallow it. The boundary check is the right level — a single point where the process owner decides "this counts as success." The sketch below illustrates the distinction.
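A standalone sketch with toy tasks (the real wiring lives in service.Run, which per step 5 above is errgroup-based):

```go
g, ctx := errgroup.WithContext(context.Background())

// A real failure: recorded first, then cancels ctx for every sibling.
g.Go(func() error {
	return fmt.Errorf("sender: RPC down")
})

// A well-behaved background task: reports why it stopped instead of
// swallowing the cancellation.
g.Go(func() error {
	<-ctx.Done()
	return ctx.Err()
})

// Wait returns the first error — "sender: RPC down". The boundary check
// leaves it intact because it isn't Canceled/DeadlineExceeded.
fmt.Println(g.Wait())
```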

Test plan

  • go build . passes
  • Next nightly run on harbor: the K8s Job should reach condition=Complete, kube_job_failed should stay at 0, and the KubeJobFailed alert should clear for new runs

Production impact

Fixes the upstream cause of KubeJobFailed warnings on nightly/seiload-* Jobs (harbor cluster, nightly namespace). Companion PR in sei-protocol/platform adds (a) EKS auth refresh in the workflow's teardown step and (b) a defense-in-depth GC CronJob, which together address resource leaks discovered while diagnosing this exit-code issue.

🤖 Generated with Claude Code

Background tasks return ctx.Err() on shutdown, which bubbles to
runLoadTest as context.Canceled and is then log.Fatal()'d, so every
graceful shutdown exits 1.

Treat context.Canceled (and DeadlineExceeded for future-proofing) as
the expected end-of-run signal at the process boundary.

Fixes KubeJobFailed alerts for nightly/seiload-* on harbor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham requested a review from amir-deris on April 30, 2026 at 20:25
bdchatham merged commit ae42cb0 into main on Apr 30, 2026
2 checks passed
bdchatham deleted the fix/exit-graceful-shutdown branch on April 30, 2026 at 20:41
bdchatham added a commit that referenced this pull request on Apr 30, 2026
## Summary

Add `--duration` flag so seiload self-terminates cleanly inside K8s Job
`activeDeadlineSeconds`, producing pod exit 0 → Job condition `Complete`
instead of K8s-mandated `Failed/DeadlineExceeded`.

## Why this exists

Followup to #30. The exit-code fix in #30 is correct in isolation
(graceful SIGTERM → exit 0), but on Kubernetes Jobs with **Job-level
`activeDeadlineSeconds`**, the K8s Job controller sets
`condition=Failed, reason=DeadlineExceeded` *regardless of the
container's exit code*:

> Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.
> — [Kubernetes Job docs](https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup)

A run on harbor's `nightly` namespace today (workflow
[25189102396](https://github.com/sei-protocol/platform/actions/runs/25189102396))
confirmed this: seiload's pod was SIGTERMed at the deadline and likely
exited 0 (post-#30), but the Job condition was set:

```
"reason": "DeadlineExceeded",
"message": "Job was active longer than specified deadline"
```

This means `kube_job_failed=1` and `KubeJobFailed` keeps firing — the
original symptom that motivated this whole investigation. The actual fix
has to make seiload self-terminate *before* K8s decides
"DeadlineExceeded."

## What changes

```go
rootCmd.Flags().Duration("duration", 0, "Run duration; the load test ctx is canceled after this elapses, the existing graceful-shutdown path runs, and the process exits 0. 0 means run until SIGTERM/SIGINT.")
```

```go
if duration, _ := cmd.Flags().GetDuration("duration"); duration > 0 {
    log.Printf("⏰ Run duration: %s", duration)
    var cancel context.CancelFunc
    ctx, cancel = context.WithTimeout(ctx, duration)
    defer cancel()
}
```

When `--duration` is set:
1. The load-test context is wrapped with `WithTimeout`.
2. After `duration` elapses, the ctx is canceled and `ctx.Err()` reports
`context.DeadlineExceeded`.
3. Background tasks (dispatcher, logger, block_collector) unwind via
`ctx.Done()`.
4. `service.Run` returns the wrapped DeadlineExceeded error.
5. Final stats emit, `EmitRunSummary` runs, post-summary flush sleeps
for 45s.
6. Existing boundary check from #30 (`errors.Is(err,
context.DeadlineExceeded)`) clears the error.
7. Process exits 0.

The existing post-summary flush delay still runs by design — it sits
*after* `service.Run` returns, in the cleanup pipeline.
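A standalone sketch of the timeout mechanics behind steps 2 and 6 (toy duration, not the seiload wiring):

```go
ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
defer cancel()

<-ctx.Done() // fires once the deadline passes

fmt.Println(ctx.Err())                                      // context deadline exceeded
fmt.Println(errors.Is(ctx.Err(), context.DeadlineExceeded)) // true → cleared at the boundary
```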

## What doesn't change

- Default is `0` (unlimited) so existing callers without the flag are
unaffected.
- SIGTERM/SIGINT handling in `main.go:347-352` is untouched and still
works.
- Exit-code semantics from #30 (Canceled OR DeadlineExceeded → exit 0)
already cover both internal-timeout and external-SIGTERM paths.

## Test plan

- [x] `GOWORK=off go build .` passes
- [x] `GOWORK=off go vet ./...` clean
- [ ] Companion platform PR will pass `--duration=${DURATION_MINUTES}m`
to seiload args; tomorrow's nightly will verify Job condition flips to
`Complete`

## Companion change

Will follow with a small PR on `sei-protocol/platform` that:
1. Bumps the seiload image to the new SHA after this merges.
2. Adds `--duration=${DURATION_MINUTES}m` to seiload args in
`clusters/harbor/nightly/templates/seiload-job.yaml`.
3. Optionally raises `JOB_DEADLINE_SECONDS` slightly to keep
`activeDeadlineSeconds` as a backstop only.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
