Step 11: worker scaffold + 5 sweepers + retry + notifications #2

Merged
haydercyber merged 2 commits into main from feat/step-11-scaffold
May 28, 2026

@haydercyber
Contributor

Closes #1.

Summary

Bootstraps the secrets-bridge worker service. Multi-replica safe via Redis-backed leader election per sweeper; minimal Postgres + Redis dependency surface (no provider SDKs); imports api/pkg/{storage,runtime} per REFACTOR_PLAN §4.

What landed

Scaffold

  • cmd/worker/{main,config}.go — boot, env-driven config, graceful shutdown
  • internal/observability/logger.go — slog JSON
  • internal/probes/ — /healthz /readyz /metrics on loopback
  • Dockerfile — golang:1.25-alpine → distroless/static:nonroot
  • .github/workflows/ci.yml — 4 jobs (build / test -race + Postgres+Redis service / lint / tidy). Each job checks out a sibling api repo so the local replace directive resolves in CI
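The probe endpoints can be pictured with a small in-process sketch. This is not the worker's actual `internal/probes` code: the `Check` type, `newMux`, `probe`, `failing`, and the 503 body shape are assumptions made for illustration, and `/metrics` is left out because it needs the Prometheus client. Only the 200 bodies `{"status":"ok"}` and `{"status":"ready"}` come from the smoke test in this PR.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Check reports nil when a dependency is reachable (hypothetical type).
type Check func(ctx context.Context) error

// failing returns a Check that always fails; useful for demos.
func failing(msg string) Check {
	return func(context.Context) error { return errors.New(msg) }
}

func writeJSON(w http.ResponseWriter, code int, v any) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(v)
}

// newMux wires /healthz (liveness) and /readyz (aggregated readiness).
func newMux(checks map[string]Check) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		writeJSON(w, http.StatusOK, map[string]string{"status": "ok"})
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		// Run every check so the response names all failing dependencies.
		failed := map[string]string{}
		for name, check := range checks {
			if err := check(r.Context()); err != nil {
				failed[name] = err.Error()
			}
		}
		if len(failed) > 0 {
			// Assumed failure shape; the PR only shows the success bodies.
			writeJSON(w, http.StatusServiceUnavailable,
				map[string]any{"status": "not_ready", "failed": failed})
			return
		}
		writeJSON(w, http.StatusOK, map[string]string{"status": "ready"})
	})
	return mux
}

// probe issues an in-process GET and returns the status code and body.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	ok := func(context.Context) error { return nil }
	fmt.Println(probe(newMux(map[string]Check{"postgres": ok, "redis": ok}), "/readyz"))
	fmt.Println(probe(newMux(map[string]Check{"postgres": ok, "redis": failing("ping timeout")}), "/readyz"))
}
```

Running all checks instead of stopping at the first failure keeps a single `/readyz` response useful when both Postgres and Redis are down.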

internal/retry — exponential-backoff policy. DefaultPolicy(): 1s initial, 2x growth, 1h cap (matches issue #1), 20% jitter. Permanent() lets callers bail the retry loop.
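A rough sketch of how such a policy could behave is below. It is not the code in `internal/retry`: the field names follow the commit message, but `BaseDelay`, `jittered`, `Do`, and the `permanentError` wrapper are assumptions for illustration, as is reading the 20% jitter as a symmetric plus-or-minus range.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// Policy holds the backoff knobs named in the commit message.
type Policy struct {
	InitialDelay   time.Duration
	MaxDelay       time.Duration
	Multiplier     float64
	JitterFraction float64
	MaxAttempts    int // 0 means retry forever (assumed convention)
}

// DefaultPolicy returns 1s initial, 2x growth, 1h cap, 20% jitter.
func DefaultPolicy() Policy {
	return Policy{InitialDelay: time.Second, MaxDelay: time.Hour, Multiplier: 2, JitterFraction: 0.2}
}

// permanentError marks an error that should end the retry loop.
type permanentError struct{ err error }

func (p *permanentError) Error() string { return p.err.Error() }
func (p *permanentError) Unwrap() error { return p.err }

// Permanent wraps err so Do returns without further attempts.
func Permanent(err error) error { return &permanentError{err: err} }

// BaseDelay is the un-jittered wait before retry n (0-based), capped at MaxDelay.
func (p Policy) BaseDelay(n int) time.Duration {
	d := float64(p.InitialDelay)
	for i := 0; i < n && d < float64(p.MaxDelay); i++ {
		d *= p.Multiplier
	}
	if d > float64(p.MaxDelay) {
		return p.MaxDelay
	}
	return time.Duration(d)
}

// jittered spreads d uniformly within +/- JitterFraction.
func (p Policy) jittered(d time.Duration) time.Duration {
	f := 1 + p.JitterFraction*(2*rand.Float64()-1)
	return time.Duration(float64(d) * f)
}

// Do calls fn until it succeeds, fails permanently, runs out of attempts, or ctx ends.
func Do(ctx context.Context, p Policy, fn func(context.Context) error) error {
	for attempt := 0; ; attempt++ {
		err := fn(ctx)
		if err == nil {
			return nil
		}
		var perm *permanentError
		if errors.As(err, &perm) {
			return perm.err
		}
		if p.MaxAttempts > 0 && attempt+1 >= p.MaxAttempts {
			return err
		}
		timer := time.NewTimer(p.jittered(p.BaseDelay(attempt)))
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
	}
}

func main() {
	p := DefaultPolicy()
	for _, n := range []int{0, 1, 5, 20} {
		fmt.Println(n, p.BaseDelay(n))
	}
	err := Do(context.Background(), p, func(context.Context) error {
		return Permanent(errors.New("bad request"))
	})
	fmt.Println("permanent:", err)
}
```

With these defaults the un-jittered delay reaches the 1h cap on the twelfth doubling, so a task that keeps failing settles at roughly one attempt per hour.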

internal/notifications — pluggable sink interface

  • Webhook (POST JSON; FormatSlack=true → {"text": ...}; 4xx permanent, 5xx transient)
  • NoOp (logs at the event's severity — default)
  • Fanout (per-sink errors joined; one bad sink doesn't block others)
  • Hard rule: plaintext secrets MUST NEVER appear in Title / Detail / Metadata
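The sink contract above might look roughly like the following. This is a sketch under assumptions, not the package's real code: `Event`, `stubSink`, `Outcome`, and `ClassifyStatus` are illustrative names, and treating any status outside 2xx and 4xx as transient is a guess that goes beyond the stated 4xx/5xx rule. The join uses the standard library's `errors.Join` (Go 1.20+).

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Event carries non-secret text only; plaintext secrets must never be placed here.
type Event struct {
	Title, Detail string
	Metadata      map[string]string
}

// Notifier matches the "Notify + Name" shape from the commit message.
type Notifier interface {
	Name() string
	Notify(ctx context.Context, ev Event) error
}

// Fanout dispatches to every sink and joins the per-sink errors.
type Fanout struct{ Sinks []Notifier }

func (f Fanout) Name() string { return "fanout" }

func (f Fanout) Notify(ctx context.Context, ev Event) error {
	var errs []error
	for _, s := range f.Sinks {
		// Keep going after a failure so one bad sink cannot block the rest.
		if err := s.Notify(ctx, ev); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", s.Name(), err))
		}
	}
	return errors.Join(errs...) // nil when no sink failed
}

// Outcome says whether a webhook delivery should be retried.
type Outcome int

const (
	Delivered Outcome = iota
	Transient
	PermanentFailure
)

// ClassifyStatus maps an HTTP status code to an Outcome.
func ClassifyStatus(code int) Outcome {
	switch {
	case code >= 200 && code < 300:
		return Delivered
	case code >= 400 && code < 500:
		return PermanentFailure // 4xx: retrying the same payload will not help
	default:
		return Transient // 5xx, plus anything unexpected (assumption)
	}
}

var errBoom = errors.New("boom")

// stubSink is a test double that counts calls and returns a fixed error.
type stubSink struct {
	name  string
	err   error
	calls *int
}

func (s stubSink) Name() string { return s.name }

func (s stubSink) Notify(context.Context, Event) error {
	if s.calls != nil {
		*s.calls++
	}
	return s.err
}

func main() {
	n := 0
	f := Fanout{Sinks: []Notifier{
		stubSink{name: "bad-webhook", err: errBoom},
		stubSink{name: "noop", calls: &n},
	}}
	err := f.Notify(context.Background(), Event{Title: "agent stale", Detail: "heartbeat stopped"})
	fmt.Println("fanout error:", err, "| later sink calls:", n)
	fmt.Println(ClassifyStatus(404) == PermanentFailure, ClassifyStatus(503) == Transient)
}
```

Because each error is wrapped with `%w`, a caller can still use `errors.Is` or `errors.As` on the joined result to pick out a permanent failure from one sink.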

internal/scheduler — Redis-lock leader-elected periodic runner

  • Tick fires; one replica acquires worker:sweeper:<name> and runs; others skip
  • Lock auto-renews via runtime.Lock.StartRenewal; lease loss cancels the run's ctx
  • Metrics: worker_scheduler_runs_total{task,outcome} (success / failure / skipped_lock), worker_scheduler_run_duration_seconds, worker_scheduler_lock_skipped_total
  • NewForTest skips the lock for unit tests
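The tick behaviour can be illustrated with an in-memory stand-in for the Redis lock. Everything here apart from the `worker:sweeper:<name>` key and the three outcome labels is hypothetical: `Locker`, `memLocker`, `RunOnce`, and `simulateContention` are names invented for this sketch, and lease renewal and lease-loss cancellation are deliberately omitted.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Locker is a hypothetical stand-in for the Redis-backed lock.
type Locker interface {
	TryAcquire(key string) (release func(), ok bool)
}

// memLocker keeps lock state in process memory; real replicas would share Redis.
type memLocker struct {
	mu   sync.Mutex
	held map[string]bool
}

func newMemLocker() *memLocker { return &memLocker{held: map[string]bool{}} }

func (m *memLocker) TryAcquire(key string) (func(), bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.held[key] {
		return nil, false
	}
	m.held[key] = true
	return func() {
		m.mu.Lock()
		delete(m.held, key)
		m.mu.Unlock()
	}, true
}

// RunOnce performs one tick and returns the outcome label for the runs metric.
func RunOnce(ctx context.Context, l Locker, name string, run func(context.Context) error) string {
	release, ok := l.TryAcquire("worker:sweeper:" + name)
	if !ok {
		return "skipped_lock" // expected for N-1 of N replicas
	}
	defer release()
	if err := run(ctx); err != nil {
		return "failure"
	}
	return "success"
}

// simulateContention has replica B tick while replica A still holds the lock.
func simulateContention() (a, b string) {
	l := newMemLocker()
	a = RunOnce(context.Background(), l, "wraps-expired", func(ctx context.Context) error {
		b = RunOnce(ctx, l, "wraps-expired", func(context.Context) error { return nil })
		return nil
	})
	return a, b
}

func main() {
	a, b := simulateContention()
	fmt.Println("replica A:", a)
	fmt.Println("replica B:", b)
}
```

Returning the outcome as a label instead of logging it mirrors the idea that lock contention is normal and belongs in a counter, not in warn-level logs.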

internal/sweepers — five tasks:

| Sweeper | Owns | Default cadence | Default cutoff |
|---|---|---|---|
| wraps-expired | Purge secret_wraps past expires_at | 1m | - |
| secrets-stale | Flip discovered-secret rows to missing | 5m | 24h |
| agents-stale | Flip agents active → stale when heartbeat stopped | 1m | 5m |
| jobs-recovery | Flip claimed sync_jobs to expired when claim_expires_at passed | 30s | - |
| discover-scheduler | Enqueue one discover job per configured target | 1h | - |

agents-stale + jobs-recovery use raw SQL on the api's pgxpool because the queries are worker concerns. discover-scheduler reads SB_DISCOVER_TARGETS_JSON; a future PR swaps that for a provider_connections admin API.
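Parsing the targets at boot, so that a bad value stops the worker before any sweeper starts, could be sketched as follows. The JSON schema of SB_DISCOVER_TARGETS_JSON is not shown in this PR, so the `Target` fields (`provider`, `scope`) and the `ParseTargets` helper are purely hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// Target is a hypothetical shape; the real schema is not part of this PR.
type Target struct {
	Provider string `json:"provider"`
	Scope    string `json:"scope"`
}

// ParseTargets decodes the env value strictly. Empty input means "no targets".
func ParseTargets(raw string) ([]Target, error) {
	if strings.TrimSpace(raw) == "" {
		return nil, nil
	}
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields() // reject typos in field names instead of ignoring them
	var targets []Target
	if err := dec.Decode(&targets); err != nil {
		return nil, fmt.Errorf("SB_DISCOVER_TARGETS_JSON: %w", err)
	}
	for i, t := range targets {
		if t.Provider == "" {
			return nil, fmt.Errorf("SB_DISCOVER_TARGETS_JSON: target %d: provider is required", i)
		}
	}
	return targets, nil
}

func main() {
	targets, err := ParseTargets(os.Getenv("SB_DISCOVER_TARGETS_JSON"))
	if err != nil {
		// Fail loudly at boot; do not start with a half-valid config.
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	if len(targets) == 0 {
		fmt.Println("discover-scheduler skipped: no targets configured")
		return
	}
	fmt.Printf("discover-scheduler registered with %d target(s)\n", len(targets))
}
```

An unset variable is treated as "no targets", which matches the smoke test where only 4 sweepers were registered.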

Hard rules respected

  • Stateless (NFR-08): all state in Postgres + Redis
  • No secret values logged, audited, or notified
  • Multi-replica safe: leader election per sweeper
  • Fail-loud on misconfig: targets parse at boot
  • No provider SDKs imported by the worker

Verification

  • go build ./... clean
  • go vet ./... clean
  • go test -race -count=1 ./... green (24 tests across retry / notifications / scheduler / sweepers)
  • Live smoke against api's docker-compose (Postgres 17 + Redis 7):
    • Worker boots, registers 4 sweepers (discover skipped — no targets)
    • wraps-expired ran successfully within 2s
    • /healthz 200 {"status":"ok"}, /readyz 200 {"status":"ready"}, /metrics shows worker_scheduler_runs_total{outcome="success",task="wraps-expired"} 1
    • Graceful shutdown drains on SIGTERM

Open follow-ups (not in scope)

  • Swap SB_DISCOVER_TARGETS_JSON for a provider_connections admin endpoint (needs an api PR adding the repository)
  • Slack + email + PagerDuty sink implementations (interface is in; impls fold in cleanly)
  • Drift-check sweeper (compare source vs. destination checksums via core.GetMetadata) — needs the secret_mappings flow to be the active sync mechanism

Closes #1. Bootstraps the secrets-bridge worker service. Multi-replica
safe via Redis-backed leader election per sweeper; minimal Postgres
+ Redis dependency surface (no provider SDKs).

What landed
-----------
Scaffold:
- cmd/worker/{main,config}.go — boot sequence, env-driven config,
  signal-driven graceful shutdown
- internal/observability/logger.go — slog JSON logger
- internal/probes — /healthz /readyz /metrics on a loopback listener;
  per-check readiness aggregator (same shape as api)
- Dockerfile — golang:1.25-alpine → distroless/static:nonroot
- .github/workflows/ci.yml — 4 jobs (build / test -race / lint /
  tidy). Each job checks out a sibling api repo so the local
  `replace` directive resolves in CI

internal/retry:
- Policy{InitialDelay, MaxDelay, Multiplier, JitterFraction, MaxAttempts}
- DefaultPolicy(): 1s initial, 2x growth, 1h cap (matches issue #1's
  "capped at 1h"), 20% jitter, retry forever
- ErrPermanent shortcut: callers wrap permanent errors to bail the
  loop without retrying

internal/notifications:
- Notifier interface (Notify + Name)
- Webhook: posts JSON; FormatSlack=true emits {"text": ...};
  4xx → PermanentStatusError (don't retry); 5xx → StatusError (retry)
- NoOp: logs at the event's severity (used by default)
- Fanout: dispatches to multiple sinks; per-sink errors joined but
  one bad sink doesn't block others
- Hard rule: plaintext secrets MUST NEVER appear in Title / Detail /
  Metadata. Notifications are treated like logs for "no plaintext"

internal/scheduler:
- Task interface (Name + Run)
- TaskRegistration{Task, Interval, Retry, Lease}
- Scheduler runs each task in a goroutine. At each tick: acquire
  Redis lock "worker:sweeper:<name>" with auto-renewal; run; release.
  Lock contention is the EXPECTED outcome for N-1 of N replicas —
  observed as a metric (worker_scheduler_lock_skipped_total),
  not logged at warn.
- Metrics: runs (success / failure / skipped_lock), durations,
  missed-due-to-lock counter
- NewForTest skips the lock for unit tests

internal/sweepers — five tasks:
- WrapsExpired — purge secret_wraps past expires_at (uses
  api/pkg/storage.SecretWraps.DeleteExpired)
- SecretsStale — flip discovered-secret rows to `missing` after a
  configurable cutoff (uses api/pkg/storage.Secrets.MarkStaleAsMissing)
- AgentsStale — UPDATE agents SET status='stale' WHERE
  status='active' AND last_seen_at < cutoff. Raw SQL on the api's
  pgxpool — the query is worker-specific and doesn't belong in api's
  domain layer. Same pattern for jobs-recovery.
- JobsRecovery — UPDATE sync_jobs SET status='expired' WHERE
  status='claimed' AND claim_expires_at < now. Observability /
  audit-trail companion to api's inline ClaimNext expired-claim
  reentry; the api side handles the claim-path correctness, the
  worker explicitly transitions the status so admin views aren't
  showing forever-claimed rows.
- DiscoverScheduler — enqueue one discover job per configured target
  every interval. Targets come from SB_DISCOVER_TARGETS_JSON (a
  future PR will swap this for a provider_connections admin API).

Hard rules respected
--------------------
- Stateless: all state lives in Postgres + Redis (NFR-08)
- No secret values logged, audited, or notified
- Multi-replica safe: leader election per sweeper via Redis lock
- Fail-loud on misconfig: SB_DISCOVER_TARGETS_JSON parses at boot
- Worker imports api/pkg/{storage,runtime} via local `replace`
  directive (per REFACTOR_PLAN §4 polyrepo rule) — CI checks out
  both repos side-by-side
- No provider SDKs imported by the worker; only the api repo's shared
  storage/runtime packages

Verification
------------
- go build / vet / test -race -count=1 ./... all green
- 24 unit tests across retry / notifications / scheduler / sweepers
- Live smoke against api's docker-compose (Postgres 17 + Redis 7):
  worker boots, registers 4 sweepers (discover skipped because no
  targets configured), runs the wraps-expired sweep successfully
  within 2s, /healthz + /readyz + /metrics all respond,
  worker_scheduler_runs_total{outcome="success",task="wraps-expired"}=1,
  graceful shutdown drains cleanly on SIGTERM

CI's lint job (golangci-lint v2.12.2) flagged 6 issues. All are minor
and unrelated to the worker's behavior; this commit fixes each one as
reported.

- errcheck: the deferred rt.Close() call is now wrapped so its error
  is discarded explicitly (runtime.Client.Close returns an error).
  Discarding is safe because the worker is already shutting down at
  that point.
- staticcheck SA1019 (×4): prometheus.NewProcessCollector +
  prometheus.NewGoCollector are deprecated in favor of the
  collectors subpackage. Swap to collectors.NewProcessCollector +
  collectors.NewGoCollector in cmd/worker/main.go AND
  internal/probes/probes.go.
- unused: Scheduler.stopped field was never read. Removed.

Verified locally: go build / vet / test -race -count=1 ./... still
green.
@haydercyber merged commit c38b8bb into main on May 28, 2026
4 checks passed
@haydercyber deleted the feat/step-11-scaffold branch on May 28, 2026 23:49
Development

Successfully merging this pull request may close these issues.

[Step 11] Scaffold worker — job dispatch, retries, scheduled scans, notifications
