Step 11: worker scaffold + 5 sweepers + retry + notifications by haydercyber · Pull Request #2 · secrets-bridge/worker

haydercyber · 2026-05-28T23:42:12Z

Closes #1.

Summary

Bootstraps the secrets-bridge worker service. Multi-replica safe via Redis-backed leader election per sweeper; minimal Postgres + Redis dependency surface (no provider SDKs); imports api/pkg/{storage,runtime} per REFACTOR_PLAN §4.

What landed

Scaffold

cmd/worker/{main,config}.go — boot, env-driven config, graceful shutdown
internal/observability/logger.go — slog JSON
internal/probes/ — /healthz /readyz /metrics on loopback
Dockerfile — golang:1.25-alpine → distroless/static:nonroot
.github/workflows/ci.yml — 4 jobs (build / test -race + Postgres+Redis service / lint / tidy). Each job checks out a sibling api repo so the local replace directive resolves in CI

internal/retry — exponential-backoff policy. DefaultPolicy(): 1s initial, 2x growth, 1h cap (matches issue #1), 20% jitter. Permanent() lets callers bail the retry loop.

internal/notifications — pluggable sink interface

Webhook (POST JSON; FormatSlack=true → {"text": ...}; 4xx permanent, 5xx transient)
NoOp (logs at the event's severity — default)
Fanout (per-sink errors joined; one bad sink doesn't block others)
Hard rule: plaintext secrets MUST NEVER appear in Title / Detail / Metadata

internal/scheduler — Redis-lock leader-elected periodic runner

Tick fires; one replica acquires worker:sweeper:<name> and runs; others skip
Lock auto-renews via runtime.Lock.StartRenewal; lease loss cancels the run's ctx
Metrics: worker_scheduler_runs_total{task,outcome} (success / failure / skipped_lock), worker_scheduler_run_duration_seconds, worker_scheduler_lock_skipped_total
NewForTest skips the lock for unit tests

internal/sweepers — five tasks:

| Sweeper | Owns | Default cadence | Default cutoff |
| --- | --- | --- | --- |
| `wraps-expired` | Purge `secret_wraps` past `expires_at` | 1m | n/a |
| `secrets-stale` | Flip discovered-secret rows to `missing` | 5m | 24h |
| `agents-stale` | Flip agents `active` → `stale` when heartbeat stopped | 1m | 5m |
| `jobs-recovery` | Flip `claimed` sync_jobs to `expired` when `claim_expires_at` passed | 30s | n/a |
| `discover-scheduler` | Enqueue one discover job per configured target | 1h | n/a |

agents-stale and jobs-recovery run raw SQL on the api's pgxpool because the queries are worker-specific and do not belong in api's domain layer. discover-scheduler reads its targets from SB_DISCOVER_TARGETS_JSON; a future PR will swap that for a provider_connections admin API.

Hard rules respected

Stateless (NFR-08): all state in Postgres + Redis
No secret values logged, audited, or notified
Multi-replica safe: leader election per sweeper
Fail-loud on misconfig: targets parse at boot
No provider SDKs imported by the worker
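The fail-loud rule for discover targets could look roughly like the sketch below. The PR does not show the JSON schema, so the Target fields (provider, scope) are invented purely for illustration; only the SB_DISCOVER_TARGETS_JSON variable name, parsing at boot, and skipping the sweeper when no targets are configured come from the source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// Target is hypothetical: the real field names are not given in the PR.
type Target struct {
	Provider string `json:"provider"`
	Scope    string `json:"scope"`
}

// ParseTargets decodes the env value strictly. An empty value means
// "no targets", which is valid and leads to the sweeper being skipped.
func ParseTargets(raw string) ([]Target, error) {
	if strings.TrimSpace(raw) == "" {
		return nil, nil
	}
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields() // typos in field names fail at boot, not at sweep time
	var targets []Target
	if err := dec.Decode(&targets); err != nil {
		return nil, fmt.Errorf("SB_DISCOVER_TARGETS_JSON: %w", err)
	}
	for i, t := range targets {
		if t.Provider == "" {
			return nil, fmt.Errorf("SB_DISCOVER_TARGETS_JSON: target %d: provider is required", i)
		}
	}
	return targets, nil
}

func main() {
	targets, err := ParseTargets(os.Getenv("SB_DISCOVER_TARGETS_JSON"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1) // refuse to start on misconfiguration
	}
	if len(targets) == 0 {
		fmt.Println("discover-scheduler skipped: no targets configured")
		return
	}
	fmt.Printf("discover-scheduler registered with %d target(s)\n", len(targets))
}
```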

Verification

go build ./... clean
go vet ./... clean
go test -race -count=1 ./... green (24 tests across retry / notifications / scheduler / sweepers)
Live smoke against api's docker-compose (Postgres 17 + Redis 7):
- Worker boots, registers 4 sweepers (discover skipped — no targets)
- wraps-expired ran successfully within 2s
- /healthz 200 {"status":"ok"}, /readyz 200 {"status":"ready"}, /metrics shows worker_scheduler_runs_total{outcome="success",task="wraps-expired"} 1
- Graceful shutdown drains on SIGTERM

Open follow-ups (not in scope)

Swap SB_DISCOVER_TARGETS_JSON for a provider_connections admin endpoint (needs an api PR adding the repository)
Slack + email + PagerDuty sink implementations (interface is in; impls fold in cleanly)
Drift-check sweeper (compare source vs. destination checksums via core.GetMetadata) — needs the secret_mappings flow to be the active sync mechanism

Closes #1.

Bootstraps the secrets-bridge worker service. Multi-replica safe via Redis-backed leader election per sweeper; minimal Postgres + Redis dependency surface (no provider SDKs).

What landed
-----------

Scaffold:
- cmd/worker/{main,config}.go: boot sequence, env-driven config, signal-driven graceful shutdown
- internal/observability/logger.go: slog JSON logger
- internal/probes: /healthz /readyz /metrics on a loopback listener; per-check readiness aggregator (same shape as api)
- Dockerfile: golang:1.25-alpine → distroless/static:nonroot
- .github/workflows/ci.yml: 4 jobs (build / test -race / lint / tidy). Each job checks out a sibling api repo so the local `replace` directive resolves in CI

internal/retry:
- Policy{InitialDelay, MaxDelay, Multiplier, JitterFraction, MaxAttempts}
- DefaultPolicy(): 1s initial, 2x growth, 1h cap (matches issue #1's "capped at 1h"), 20% jitter, retry forever
- ErrPermanent shortcut: callers wrap permanent errors to bail the loop without retrying

internal/notifications:
- Notifier interface (Notify + Name)
- Webhook: posts JSON; FormatSlack=true emits {"text": ...}; 4xx → PermanentStatusError (don't retry); 5xx → StatusError (retry)
- NoOp: logs at the event's severity (used by default)
- Fanout: dispatches to multiple sinks; per-sink errors joined but one bad sink doesn't block others
- Hard rule: plaintext secrets MUST NEVER appear in Title / Detail / Metadata. Notifications are treated like logs for "no plaintext"

internal/scheduler:
- Task interface (Name + Run)
- TaskRegistration{Task, Interval, Retry, Lease}
- Scheduler runs each task in a goroutine. At each tick: acquire Redis lock "worker:sweeper:<name>" with auto-renewal; run; release. Lock contention is the EXPECTED outcome for N-1 of N replicas, so it is observed as a metric (worker_scheduler_lock_skipped_total), not logged at warn.
- Metrics: runs (success / failure / skipped_lock), durations, missed-due-to-lock counter
- NewForTest skips the lock for unit tests

internal/sweepers (five tasks):
- WrapsExpired: purge secret_wraps past expires_at (uses api/pkg/storage.SecretWraps.DeleteExpired)
- SecretsStale: flip discovered-secret rows to `missing` after a configurable cutoff (uses api/pkg/storage.Secrets.MarkStaleAsMissing)
- AgentsStale: UPDATE agents SET status='stale' WHERE status='active' AND last_seen_at < cutoff. Raw SQL on the api's pgxpool; the query is worker-specific and doesn't belong in api's domain layer. Same pattern for jobs-recovery.
- JobsRecovery: UPDATE sync_jobs SET status='expired' WHERE status='claimed' AND claim_expires_at < now. Observability / audit-trail companion to api's inline ClaimNext expired-claim reentry; the api side handles the claim-path correctness, the worker explicitly transitions the status so admin views aren't showing forever-claimed rows.
- DiscoverScheduler: enqueue one discover job per configured target every interval. Targets come from SB_DISCOVER_TARGETS_JSON (a future PR will swap this for a provider_connections admin API).

Hard rules respected
--------------------

- Stateless: all state lives in Postgres + Redis (NFR-08)
- No secret values logged, audited, or notified
- Multi-replica safe: leader election per sweeper via Redis lock
- Fail-loud on misconfig: SB_DISCOVER_TARGETS_JSON parses at boot
- Worker imports api/pkg/{storage,runtime} via local `replace` directive (per REFACTOR_PLAN §4 polyrepo rule); CI checks out both repos side-by-side
- No provider SDKs imported by the worker; only api repos for shared storage/runtime types

Verification
------------

- go build / vet / test -race -count=1 ./... all green
- 24 unit tests across retry / notifications / scheduler / sweepers
- Live smoke against api's docker-compose (Postgres 17 + Redis 7): worker boots, registers 4 sweepers (discover skipped because no targets configured), runs the wraps-expired sweep successfully within 2s, /healthz + /readyz + /metrics all respond, worker_scheduler_runs_total{outcome="success",task="wraps-expired"}=1, graceful shutdown drains cleanly on SIGTERM

CI's lint job (golangci-lint v2.12.2) flagged 6 issues. All are minor and unrelated to the worker's behavior; this commit fixes each one as reported.

- errcheck: the deferred rt.Close() call is now wrapped so that its returned error is discarded explicitly (runtime.Client.Close returns an error). We discard it because the worker is already shutting down at that point.
- staticcheck SA1019 (×4): prometheus.NewProcessCollector + prometheus.NewGoCollector are deprecated in favor of the collectors subpackage. Swapped to collectors.NewProcessCollector + collectors.NewGoCollector in cmd/worker/main.go AND internal/probes/probes.go.
- unused: Scheduler.stopped field was never read. Removed.

Verified locally: go build / vet / test -race -count=1 ./... still green.

haydercyber added 2 commits May 28, 2026 23:41

haydercyber merged commit c38b8bb into main May 28, 2026
4 checks passed

haydercyber deleted the feat/step-11-scaffold branch May 28, 2026 23:49
