Step 11: worker scaffold + 5 sweepers + retry + notifications #2
Merged
Closes #1. Bootstraps the secrets-bridge worker service. Multi-replica safe via Redis-backed leader election per sweeper; minimal Postgres + Redis dependency surface (no provider SDKs).

What landed
-----------

Scaffold:
- cmd/worker/{main,config}.go: boot sequence, env-driven config, signal-driven graceful shutdown
- internal/observability/logger.go: slog JSON logger
- internal/probes: /healthz /readyz /metrics on a loopback listener; per-check readiness aggregator (same shape as api)
- Dockerfile: golang:1.25-alpine → distroless/static:nonroot
- .github/workflows/ci.yml: 4 jobs (build / test -race / lint / tidy). Each job checks out a sibling api repo so the local `replace` directive resolves in CI

internal/retry:
- Policy{InitialDelay, MaxDelay, Multiplier, JitterFraction, MaxAttempts}
- DefaultPolicy(): 1s initial, 2x growth, 1h cap (matches issue #1's "capped at 1h"), 20% jitter, retry forever
- ErrPermanent shortcut: callers wrap permanent errors to bail the loop without retrying

internal/notifications:
- Notifier interface (Notify + Name)
- Webhook: posts JSON; FormatSlack=true emits {"text": ...}; 4xx → PermanentStatusError (don't retry); 5xx → StatusError (retry)
- NoOp: logs at the event's severity (used by default)
- Fanout: dispatches to multiple sinks; per-sink errors joined but one bad sink doesn't block others
- Hard rule: plaintext secrets MUST NEVER appear in Title / Detail / Metadata. Notifications are treated like logs for "no plaintext"

internal/scheduler:
- Task interface (Name + Run)
- TaskRegistration{Task, Interval, Retry, Lease}
- Scheduler runs each task in a goroutine. At each tick: acquire Redis lock "worker:sweeper:<name>" with auto-renewal; run; release. Lock contention is the EXPECTED outcome for N-1 of N replicas: it is observed as a metric (worker_scheduler_lock_skipped_total), not logged at warn.
- Metrics: runs (success / failure / skipped_lock), durations, missed-due-to-lock counter
- NewForTest skips the lock for unit tests

internal/sweepers (five tasks):
- WrapsExpired: purge secret_wraps past expires_at (uses api/pkg/storage.SecretWraps.DeleteExpired)
- SecretsStale: flip discovered-secret rows to `missing` after a configurable cutoff (uses api/pkg/storage.Secrets.MarkStaleAsMissing)
- AgentsStale: UPDATE agents SET status='stale' WHERE status='active' AND last_seen_at < cutoff. Raw SQL on the api's pgxpool: the query is worker-specific and doesn't belong in api's domain layer. Same pattern for jobs-recovery.
- JobsRecovery: UPDATE sync_jobs SET status='expired' WHERE status='claimed' AND claim_expires_at < now. Observability / audit-trail companion to api's inline ClaimNext expired-claim reentry; the api side handles the claim-path correctness, the worker explicitly transitions the status so admin views aren't showing forever-claimed rows.
- DiscoverScheduler: enqueue one discover job per configured target every interval. Targets come from SB_DISCOVER_TARGETS_JSON (a future PR will swap this for a provider_connections admin API).

Hard rules respected
--------------------

- Stateless: all state lives in Postgres + Redis (NFR-08)
- No secret values logged, audited, or notified
- Multi-replica safe: leader election per sweeper via Redis lock
- Fail-loud on misconfig: SB_DISCOVER_TARGETS_JSON parses at boot
- Worker imports api/pkg/{storage,runtime} via local `replace` directive (per REFACTOR_PLAN §4 polyrepo rule); CI checks out both repos side-by-side
- No provider SDKs imported by the worker; only api repos for shared storage/runtime types

Verification
------------

- go build / vet / test -race -count=1 ./... all green
- 24 unit tests across retry / notifications / scheduler / sweepers
- Live smoke against api's docker-compose (Postgres 17 + Redis 7): worker boots, registers 4 sweepers (discover skipped because no targets configured), runs the wraps-expired sweep successfully within 2s, /healthz + /readyz + /metrics all respond, worker_scheduler_runs_total{outcome="success",task="wraps-expired"}=1, graceful shutdown drains cleanly on SIGTERM
CI's lint job (golangci-lint v2.12.2) flagged 6 issues. All are minor and unrelated to the worker's behavior; this commit fixes them verbatim.

- errcheck: defer rt.Close() now wraps the error explicitly (runtime.Client.Close returns an error). We discard it because the worker is already shutting down at that point.
- staticcheck SA1019 (×4): prometheus.NewProcessCollector + prometheus.NewGoCollector are deprecated in favor of the collectors subpackage. Swap to collectors.NewProcessCollector + collectors.NewGoCollector in cmd/worker/main.go AND internal/probes/probes.go.
- unused: Scheduler.stopped field was never read. Removed.

Verified locally: go build / vet / test -race -count=1 ./... still green.
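
The errcheck item above is easiest to see in code. The sketch below shows one common way to make a deferred Close discard explicit; `fakeClient` is a hypothetical stand-in for `runtime.Client` (only the fact that `Close` returns an error is taken from the commit message), and the exact form used in cmd/worker/main.go may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeClient is a hypothetical stand-in for runtime.Client. The only property
// borrowed from the commit message is that Close returns an error.
type fakeClient struct{ closed bool }

func (c *fakeClient) Close() error {
	c.closed = true
	return errors.New("connection already torn down")
}

func run(c *fakeClient) string {
	// A bare `defer c.Close()` drops the returned error silently, which is
	// what the commit message says errcheck flagged. Wrapping the call and
	// assigning to the blank identifier makes the discard deliberate.
	defer func() { _ = c.Close() }()
	return "shutdown complete"
}

func main() {
	c := &fakeClient{}
	msg := run(c)
	fmt.Println(msg, c.closed)
}
```

Discarding is a judgment call that fits a shutdown path; on a hot path the error would more likely be logged or returned.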
Closes #1.
Summary
Bootstraps the secrets-bridge worker service. Multi-replica safe via Redis-backed leader election per sweeper; minimal Postgres + Redis dependency surface (no provider SDKs); imports `api/pkg/{storage,runtime}` per REFACTOR_PLAN §4.

What landed

Scaffold
- `cmd/worker/{main,config}.go`: boot, env-driven config, graceful shutdown
- `internal/observability/logger.go`: slog JSON
- `internal/probes/`: /healthz /readyz /metrics on loopback
- `Dockerfile`: golang:1.25-alpine → distroless/static:nonroot
- `.github/workflows/ci.yml`: 4 jobs (build / test -race + Postgres+Redis service / lint / tidy). Each job checks out a sibling api repo so the local `replace` directive resolves in CI

`internal/retry`: exponential-backoff policy. `DefaultPolicy()`: 1s initial, 2x growth, 1h cap (matches issue #1), 20% jitter. `Permanent()` lets callers bail the retry loop.

`internal/notifications`: pluggable sink interface
- `Webhook` (POST JSON; `FormatSlack=true` → `{"text": ...}`; 4xx permanent, 5xx transient)
- `NoOp` (logs at the event's severity; the default)
- `Fanout` (per-sink errors joined; one bad sink doesn't block others)

`internal/scheduler`: Redis-lock leader-elected periodic runner
- One replica acquires `worker:sweeper:<name>` and runs; others skip
- Lock auto-renewal via `runtime.Lock.StartRenewal`; lease loss cancels the run's ctx
- Metrics: `worker_scheduler_runs_total{task,outcome}` (success / failure / skipped_lock), `worker_scheduler_run_duration_seconds`, `worker_scheduler_lock_skipped_total`
- `NewForTest` skips the lock for unit tests

`internal/sweepers`: five tasks:
- `wraps-expired`: purges `secret_wraps` past `expires_at`
- `secrets-stale`: flips stale discovered-secret rows to `missing`
- `agents-stale`: `active` → `stale` when heartbeat stopped
- `jobs-recovery`: transitions `claimed` sync_jobs to `expired` when `claim_expires_at` passed
- `discover-scheduler`: enqueues one discover job per configured target every interval

`agents-stale` + `jobs-recovery` use raw SQL on the api's `pgxpool` because the queries are worker concerns. `discover-scheduler` reads `SB_DISCOVER_TARGETS_JSON`; a future PR swaps that for a `provider_connections` admin API.

Hard rules respected
Verification
- `go build ./...` clean
- `go vet ./...` clean
- `go test -race -count=1 ./...` green (24 tests across retry / notifications / scheduler / sweepers)
- Live smoke: `wraps-expired` ran successfully within 2s
- `/healthz` 200 `{"status":"ok"}`, `/readyz` 200 `{"status":"ready"}`, `/metrics` shows `worker_scheduler_runs_total{outcome="success",task="wraps-expired"} 1`

Open follow-ups (not in scope)
- Swap `SB_DISCOVER_TARGETS_JSON` for a `provider_connections` admin endpoint (needs an api PR adding the repository)
- … (`core.GetMetadata`): needs the `secret_mappings` flow to be the active sync mechanism