fix: address K8s and Platform review findings for SQLite store#46
Merged
fix: address K8s and Platform review findings for SQLite store#46
Conversation
bdchatham
commented
Mar 30, 2026
bdchatham
commented
Mar 30, 2026
Must-fix items: - Add sync.WaitGroup to drain in-flight task goroutines before store.Close() on shutdown. Engine.Wait() called in serve.go between server stop and DB close. - Mark stale one-shot tasks as failed on startup via RecoverStaleTasks(). Scheduled tasks are left as-is for the scheduler to re-evaluate. - Wrap v1 migration in an explicit transaction so DDL and PRAGMA user_version are atomic. Future migrations follow same pattern. - Normalize all timestamps to UTC before formatting to RFC3339Nano. Prevents incorrect string-based comparisons in SQLite when the system timezone is not UTC. Should-fix items: - Fix RemoveResult vs completion goroutine race: the completing goroutine now checks if its cancel func is still in the map before saving. If RemoveResult already cleaned it up, the save is skipped. - Remove process-global singleton guard on NewSQLiteStore. Rely on serve.go wiring to call it once. - Document PVC StorageClass constraint on NewSQLiteStore: WAL mode requires block-device-backed storage, not NFS. Test fixes: - Replace time.Sleep with sync.WaitGroup in TestE2E_ConcurrentSubmit. - Add TestE2E_StaleTaskRecovery: verifies one-shot running tasks are marked failed on restart while scheduled tasks are preserved. - Add TestE2E_ShutdownDrainsGoroutines: verifies Wait() blocks until a long-running task completes and its result is persisted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…down Address final review comments: - Replace RecoverStaleTasks (mark failed) with rehydration: on startup, one-shot tasks left as "running" from a previous crash are re-executed through the same runTask path as normal submissions. - Replace Wait() with Shutdown(ctx): uses a context deadline instead of blocking forever. Handlers observe ctx.Done() and stop gracefully; Shutdown gives a bounded grace period for final store writes. - Extract runTask() from Submit to share goroutine lifecycle logic between new submissions and rehydrated tasks. - Update serve.go to use Shutdown with a 10-second deadline. - Update e2e tests: TestE2E_StaleTaskRehydration verifies tasks are re-executed (not just marked failed), TestE2E_ShutdownDrainsGoroutines uses Shutdown with context timeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per review: leave graceful task termination out of this PR. Scheduled tasks are being removed in a follow-up, which eliminates the need for per-task cancel funcs entirely. The root context from SIGTERM is sufficient to stop all handlers gracefully. - Remove cancels map and wg from Engine - Remove Shutdown() method and serve.go drain call - Simplify runTask to pass e.ctx directly to handlers - Simplify RemoveResult to just delete from store - Remove TestE2E_ShutdownDrainsGoroutines (no longer applicable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move RowScanner to sqlite_store.go as unexported rowScanner (impl detail, not part of the interface contract) - Extract selectColumns constant and queryMany helper to DRY up List, ListScheduled, and ListStaleTasks into one-liners - store.go is now a pure interface file with no implementation imports Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7f4ddea to
8324432
Compare
Remove all scheduled task (cron) functionality from the engine, server, client, and OpenAPI spec. This is a significant simplification that reduces the codebase by ~820 lines. Engine simplification: - Remove SubmitScheduled, EvalSchedules, runningTasks overlap guard - Replace sync.RWMutex with atomic.Bool for the ready flag - Engine struct is now 4 fields: handlers, ctx, ready, store - Remove cron.go and cron_test.go entirely - Drop robfig/cron/v3 dependency Store simplification: - Remove ListScheduled from ResultStore interface (6 methods → 5) - Remove schedule/next_run_at marshaling from SQLiteStore - Simplify ListStaleTasks query (no schedule IS NULL filter needed) - Add v2 schema migration dropping schedule, next_run_at columns and idx_task_results_schedule index Type cleanup: - Remove ScheduleConfig struct - Remove Schedule and NextRunAt fields from TaskResult - Add idempotency doc comment to TaskHandler API changes: - Remove schedule field from TaskRequest in OpenAPI spec (v0.7.0) - Remove ScheduleConfig schema - Remove nextRunAt from TaskResult - Remove schedule branch from handlePostTask in server - Regenerate OpenAPI client Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Builds on #45 and addresses all findings from the Kubernetes and Platform specialist reviews.
Must-fix (4 items)
Engine.Wait()+sync.WaitGroupensures all in-flight task goroutines complete beforestore.Close(). Wired intoserve.gobetween server stop and DB close.RecoverStaleTasks()marks one-shot tasks left asrunningfrom a previous crash asfailedon startup. Scheduled tasks are preserved for re-evaluation.PRAGMA user_versionwrapped inBEGIN/COMMITfor atomicity.time.Now()and format calls normalized to.UTC()to prevent incorrect string-based comparisons in SQLite.Should-fix (3 items)
RemoveResultalready cleaned it up.storeMu/storeCreatedguard onNewSQLiteStore. Rely onserve.gowiring.NewSQLiteStoredoc now warns that WAL mode requires block-device-backed storage (not NFS).Test improvements
TestE2E_StaleTaskRecovery— verifies crash recovery behaviorTestE2E_ShutdownDrainsGoroutines— verifiesWait()drains before closeTestE2E_ConcurrentSubmit— replacedtime.Sleepwithsync.WaitGroupTest plan
go test ./sidecar/...— all tests passCGO_ENABLED=0 go build .— binary buildsgofmt -s— clean🤖 Generated with Claude Code