fix: address K8s and Platform review findings for SQLite store #46

Merged
bdchatham merged 5 commits into main from feat/sqlite-store-review-fixes
Mar 30, 2026

@bdchatham

Summary

Builds on #45 and addresses all findings from the Kubernetes and Platform specialist reviews.

Must-fix (4 items)

  • Goroutine drain on shutdown: Engine.Wait(), backed by a sync.WaitGroup, ensures all in-flight task goroutines complete before store.Close(). It is wired into serve.go between server stop and DB close.
  • Stale task recovery: RecoverStaleTasks() marks one-shot tasks left as running from a previous crash as failed on startup. Scheduled tasks are preserved for re-evaluation.
  • Transactional migrations: v1 migration DDL + PRAGMA user_version wrapped in BEGIN/COMMIT for atomicity.
  • UTC timestamps: All time.Now() and format calls normalized to .UTC() to prevent incorrect string-based comparisons in SQLite.

Should-fix (3 items)

  • RemoveResult race fix: Completion goroutine checks if its cancel func is still active before saving. Skips save if RemoveResult already cleaned it up.
  • Singleton guard removed: Dropped the storeMu/storeCreated guard on NewSQLiteStore. Rely on serve.go wiring.
  • PVC constraint documented: NewSQLiteStore doc now warns that WAL mode requires block-device-backed storage (not NFS).
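
The RemoveResult race fix above could look roughly like the following sketch. The engine type, its fields, and the method names start, removeResult, and complete are hypothetical stand-ins, and an in-memory map plays the role of the SQLite store.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// engine is a hypothetical, stripped-down stand-in for the real Engine.
type engine struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc // one entry per in-flight task
	saved   map[string]string             // stands in for the result store
}

func newEngine() *engine {
	return &engine{
		cancels: make(map[string]context.CancelFunc),
		saved:   make(map[string]string),
	}
}

// start registers a cancel func for the task, as a submission would.
func (e *engine) start(id string) context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	e.mu.Lock()
	e.cancels[id] = cancel
	e.mu.Unlock()
	return ctx
}

// removeResult cancels the task if it is in flight and deletes its result.
func (e *engine) removeResult(id string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if cancel, ok := e.cancels[id]; ok {
		cancel()
		delete(e.cancels, id)
	}
	delete(e.saved, id)
}

// complete is what the task goroutine would call when its handler returns.
// It saves only if the task is still registered and reports whether it saved.
func (e *engine) complete(id, result string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	cancel, ok := e.cancels[id]
	if !ok {
		return false // removeResult already cleaned up; skip the save
	}
	cancel()
	delete(e.cancels, id)
	e.saved[id] = result
	return true
}

func main() {
	e := newEngine()
	e.start("a")
	e.start("b")
	e.removeResult("b")
	fmt.Println(e.complete("a", "done")) // true
	fmt.Println(e.complete("b", "done")) // false
	fmt.Println(len(e.saved))            // 1
}
```

Checking the map and saving under the same lock is what prevents a removed result from being written back by a late-finishing goroutine in this sketch.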

Test improvements

  • TestE2E_StaleTaskRecovery — verifies crash recovery behavior
  • TestE2E_ShutdownDrainsGoroutines — verifies Wait() drains before close
  • TestE2E_ConcurrentSubmit — replaced time.Sleep with sync.WaitGroup

Test plan

  • go test ./sidecar/... — all tests pass
  • CGO_ENABLED=0 go build . — binary builds
  • gofmt -s — clean

🤖 Generated with Claude Code

bdchatham and others added 4 commits March 30, 2026 13:37
Must-fix items:
- Add sync.WaitGroup to drain in-flight task goroutines before
  store.Close() on shutdown. Engine.Wait() called in serve.go between
  server stop and DB close.
- Mark stale one-shot tasks as failed on startup via
  RecoverStaleTasks(). Scheduled tasks are left as-is for the
  scheduler to re-evaluate.
- Wrap v1 migration in an explicit transaction so DDL and
  PRAGMA user_version are atomic. Future migrations follow the same pattern.
- Normalize all timestamps to UTC before formatting to RFC3339Nano.
  Prevents incorrect string-based comparisons in SQLite when the
  system timezone is not UTC.

Should-fix items:
- Fix RemoveResult vs completion goroutine race: the completing
  goroutine now checks if its cancel func is still in the map before
  saving. If RemoveResult already cleaned it up, the save is skipped.
- Remove process-global singleton guard on NewSQLiteStore. Rely on
  serve.go wiring to call it once.
- Document PVC StorageClass constraint on NewSQLiteStore: WAL mode
  requires block-device-backed storage, not NFS.

Test fixes:
- Replace time.Sleep with sync.WaitGroup in TestE2E_ConcurrentSubmit.
- Add TestE2E_StaleTaskRecovery: verifies one-shot running tasks are
  marked failed on restart while scheduled tasks are preserved.
- Add TestE2E_ShutdownDrainsGoroutines: verifies Wait() blocks until
  a long-running task completes and its result is persisted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…down

Address final review comments:

- Replace RecoverStaleTasks (mark failed) with rehydration: on startup,
  one-shot tasks left as "running" from a previous crash are re-executed
  through the same runTask path as normal submissions.
- Replace Wait() with Shutdown(ctx): uses a context deadline instead of
  blocking forever. Handlers observe ctx.Done() and stop gracefully;
  Shutdown gives a bounded grace period for final store writes.
- Extract runTask() from Submit to share goroutine lifecycle logic
  between new submissions and rehydrated tasks.
- Update serve.go to use Shutdown with a 10-second deadline.
- Update e2e tests: TestE2E_StaleTaskRehydration verifies tasks are
  re-executed (not just marked failed), TestE2E_ShutdownDrainsGoroutines
  uses Shutdown with context timeout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per review: leave graceful task termination out of this PR. Scheduled
tasks are being removed in a follow-up, which eliminates the need for
per-task cancel funcs entirely. The root context from SIGTERM is
sufficient to stop all handlers gracefully.

- Remove cancels map and wg from Engine
- Remove Shutdown() method and serve.go drain call
- Simplify runTask to pass e.ctx directly to handlers
- Simplify RemoveResult to just delete from store
- Remove TestE2E_ShutdownDrainsGoroutines (no longer applicable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move RowScanner to sqlite_store.go as unexported rowScanner (impl
  detail, not part of the interface contract)
- Extract selectColumns constant and queryMany helper to DRY up
  List, ListScheduled, and ListStaleTasks into one-liners
- store.go is now a pure interface file with no implementation imports

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham force-pushed the feat/sqlite-store-review-fixes branch from 7f4ddea to 8324432 Compare March 30, 2026 20:39
Remove all scheduled task (cron) functionality from the engine,
server, client, and OpenAPI spec. This is a significant simplification
that reduces the codebase by ~820 lines.

Engine simplification:
- Remove SubmitScheduled, EvalSchedules, runningTasks overlap guard
- Replace sync.RWMutex with atomic.Bool for the ready flag
- Engine struct is now 4 fields: handlers, ctx, ready, store
- Remove cron.go and cron_test.go entirely
- Drop robfig/cron/v3 dependency
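
For the ready flag change above, a minimal sketch with atomic.Bool (available since Go 1.19) follows. The method names MarkReady and Ready are hypothetical, and the other three Engine fields are omitted.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// engine shows only the ready flag; handlers, ctx, and store are omitted.
type engine struct {
	ready atomic.Bool // zero value is false, so no constructor is needed
}

// MarkReady flips the flag once startup work is finished.
func (e *engine) MarkReady() { e.ready.Store(true) }

// Ready is safe to call concurrently, for example from a health endpoint.
func (e *engine) Ready() bool { return e.ready.Load() }

func main() {
	e := &engine{}
	fmt.Println(e.Ready()) // false
	e.MarkReady()
	fmt.Println(e.Ready()) // true
}
```

A single boolean needs no lock, which is why an atomic can replace the sync.RWMutex once the flag is the only shared state being guarded.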

Store simplification:
- Remove ListScheduled from ResultStore interface (6 methods → 5)
- Remove schedule/next_run_at marshaling from SQLiteStore
- Simplify ListStaleTasks query (no schedule IS NULL filter needed)
- Add v2 schema migration dropping schedule, next_run_at columns
  and idx_task_results_schedule index

Type cleanup:
- Remove ScheduleConfig struct
- Remove Schedule and NextRunAt fields from TaskResult
- Add idempotency doc comment to TaskHandler

API changes:
- Remove schedule field from TaskRequest in OpenAPI spec (v0.7.0)
- Remove ScheduleConfig schema
- Remove nextRunAt from TaskResult
- Remove schedule branch from handlePostTask in server
- Regenerate OpenAPI client

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit 2e580fc into main Mar 30, 2026
2 checks passed
@bdchatham bdchatham deleted the feat/sqlite-store-review-fixes branch March 30, 2026 21:01