From 73856765bebe6ccc142074fd28c2d22011111e90 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Mon, 11 May 2026 12:27:33 +0300 Subject: [PATCH 1/3] spec: prometheus metrics registry scaffolding + ADR-0008 (#59) --- .../0008-prometheus-client-adoption.md | 163 ++++++ .../architecture/59-metrics-scaffolding.md | 533 ++++++++++++++++++ 2 files changed, 696 insertions(+) create mode 100644 docs/knowledge/decisions/0008-prometheus-client-adoption.md create mode 100644 docs/specs/architecture/59-metrics-scaffolding.md diff --git a/docs/knowledge/decisions/0008-prometheus-client-adoption.md b/docs/knowledge/decisions/0008-prometheus-client-adoption.md new file mode 100644 index 0000000..76ec931 --- /dev/null +++ b/docs/knowledge/decisions/0008-prometheus-client-adoption.md @@ -0,0 +1,163 @@ +# ADR-0008: Adopt `github.com/prometheus/client_golang` for relay metrics + +**Status:** Accepted (#59) +**Date:** 2026-05-11 + +## Context + +`/healthz` (#10) is a point-in-time snapshot: it answers "is the relay alive +and how many things are connected right now." Production diagnosis of the +adversarial / capacity regimes the relay actually faces needs cumulative +counters — upgrade-attempt floods, header-validation rejects, register-failure +rates, frame throughput, grace-expiry rates. Those are not derivable from +snapshots without a scrape-and-diff harness; they want a real time-series +surface. + +The downstream consumer is a Prometheus scraper. The chosen exposition format +must be one a stock Prometheus server accepts on a `/metrics` endpoint, with +no shim component between relay and scraper. + +This is the relay's first non-stdlib direct dep since #15 +(`nhooyr.io/websocket`). `docs/PROJECT-MEMORY.md` line 37 fixes the rule: +"Stdlib + `golang.org/x/crypto` + `nhooyr.io/websocket`. Direct deps arrived +in #9 and #15. Keep the dependency surface deliberate; new deps need a +justification (the ADR is the justification)." This file is that +justification. + +## Alternatives considered + +### 1. Hand-rolled Prometheus text exposition + +The text format is documented and small. Counters are easy; gauges are easy. +Histograms and summaries (`_bucket{le=…}`, `_count`, `_sum` triplet with +ordered buckets and `+Inf` terminator) are not — escapability, label sorting, +quantile encoding, and OpenMetrics interop are each their own footgun. The +relay does not need histograms today, but the metrics surface will grow into +them (frame size distributions, grace-window latencies). Hand-rolling now +means hand-rolling the histogram code later, in a security-sensitive package, +to save one dep. + +Rejected. The maintenance cost lands on the relay forever; the dep cost +is paid once and is bounded. + +### 2. `github.com/VictoriaMetrics/metrics` + +Drop-in alternative to `client_golang`. Smaller and fewer transitive deps — +genuinely attractive on the supply-chain axis. Rejected because: + +- The ecosystem around `client_golang` (`testutil`, `promhttp`, + `expfmt`, `collectors`) is the lingua franca; rejecting it means + every future contributor reaches for the wrong package by reflex. +- VictoriaMetrics' library does not target full Prometheus parity on + exposition format edge cases (it ships a deliberate subset). The + relay does not gain anything from the subset. + +### 3. `expvar` (stdlib) + exporter sidecar + +`expvar` ships JSON, not text format. To feed a Prometheus scraper you bridge +through an `expvar` exporter sidecar. That is one more process to deploy and +one more failure mode to monitor (the sidecar's health is now also a +production concern). The relay is supposed to be a single binary in a +distroless container; adding an exposition sidecar undermines the deployment +posture (#32). + +Rejected. + +## Decision + +Adopt `github.com/prometheus/client_golang` v1.x (the current stable line) as +a direct dep. Use: + +- `prometheus.NewRegistry()` for the per-process registry. +- `prometheus.NewCounter` / `NewCounterVec` / `NewGauge` / `NewGaugeVec` / + `NewHistogramVec` for metric types as sibling tickets need them. +- `promhttp.HandlerFor(reg, opts)` for the exposition handler. +- `prometheus/testutil` if and when a sibling ticket needs to assert + counter increments under test. + +## Scope of use + +In bounds (`internal/relay/` and the listener's wiring in #60): + +- Construction of the relay's private registry (`internal/relay/metrics.go`). +- Construction of counters / gauges / histograms by sibling tickets. +- Handler factory wrapping `promhttp.HandlerFor`. +- Test-time use of `prometheus/testutil` for metric assertions. + +Out of bounds, structurally: + +- **`prometheus.DefaultRegisterer` is never touched.** The relay's registry + is constructed with `prometheus.NewRegistry()` and held privately. A + developer who reaches for `promauto`'s default-registry shortcuts is + reaching outside the design; code review and the spec's docstring on + `NewMetricsRegistry` are the gates. +- **Process / Go-runtime collectors** (`collectors.NewGoCollector`, + `collectors.NewProcessCollector`) are **not** registered in this ticket. + They are useful but they expand the exposed surface (memory layout, + GC pacing, file descriptor counts, command line via `process_*`) + and the listener in #60 will be reachable from a more sensitive + scrape position than `/healthz`. If a future ticket adds them, it + registers them on the same private registry — never on the global default. +- **Structured logging.** The relay continues to use `slog` for logs. No + log adapter from `client_golang` is wired in. The "Log hygiene" + threat-model rule (#36's allowlist) does not extend to metric labels; + metric labels are an independent low-cardinality channel and are + governed by the threat-model § *Log hygiene* MUST-NOT-emit list + applied to label values when sibling tickets land. +- **HTTP routing.** stdlib `net/http`, not any Prometheus router. +- **Config parsing.** stdlib `flag`, not any Prometheus config helper. + +## Supply-chain notes + +Direct dep: `github.com/prometheus/client_golang` v1.x. + +Transitive deps the module graph will pick up (verified against +`go.sum` of comparable projects on 2026-05-11; the developer's `go mod tidy` +output is the authoritative version of this list at commit time): + +- `github.com/beorn7/perks` — histogram quantile estimation. +- `github.com/cespare/xxhash/v2` — fast non-cryptographic hash for + label-set keying. +- `github.com/munnerz/goautoneg` — content-type negotiation for + `Accept` headers on `/metrics`. +- `github.com/prometheus/client_model` — generated protobuf for the + Prometheus exposition model. +- `github.com/prometheus/common` — shared types across the Prometheus + Go ecosystem; pulls `expfmt` (text/OpenMetrics encoders). +- `github.com/prometheus/procfs` — only used by process / runtime + collectors. We do not register those (see *Scope of use*), but the + module is transitively present. +- `google.golang.org/protobuf` — required by `client_model`. + +Maintenance status: actively maintained, hosted under the Prometheus +organisation, deployed at production scale across the CNCF graduated +ecosystem. The release cadence (multiple minor releases per year) is on +the slow-and-steady side compared to web frameworks. + +Vetting posture: `go.sum` digests pin each transitive at the resolved +version. Renovate (or whatever bumper lands) is responsible for keeping +the line current. `govulncheck` (already in `make lint`) covers the +expanded surface. + +## Consequences + +- **Dependency surface grows.** The `go.mod` `require` block gains one + direct entry; transitives (~7 modules) are recorded in `go.sum`. This is + a known, deliberate expansion, signed off in this ADR. +- **`govulncheck` and `gosec` now scan a wider tree.** Failures in + `prometheus/*` modules land as `make lint` failures in CI; the + remediation is the same as for any dep — bump within the line, or + open an upstream issue if the bump is unavailable. +- **`make build` artifact size grows.** Indicative figure (from comparable + Go services adopting `client_golang` for the first time): the static + binary gains roughly 1–2 MB. Acceptable for the relay's distroless + deployment posture (#32); the artifact is still well under the + registry's practical ceiling. +- **The "stdlib + `x/crypto` + `nhooyr.io/websocket`" line in + `docs/PROJECT-MEMORY.md` is now stale.** Updating that line is a + documentation-phase concern, not the architect's; the documentation + agent will edit it when the codebase summary for #59 lands. +- **The pattern this establishes for any next dep:** an ADR file, before + the `go.mod` edit. The dispatcher's documentation agent then + cross-links it from `docs/knowledge/INDEX.md` under *Decisions*. No + silent first-import. diff --git a/docs/specs/architecture/59-metrics-scaffolding.md b/docs/specs/architecture/59-metrics-scaffolding.md new file mode 100644 index 0000000..903e0d6 --- /dev/null +++ b/docs/specs/architecture/59-metrics-scaffolding.md @@ -0,0 +1,533 @@ +# Spec — Prometheus metrics registry scaffolding (#59) + +## Files to read first + +- `internal/relay/healthz.go:35-57` — `NewHealthzHandler` is the handler-factory + pattern the new `NewMetricsHandler` mirrors: returns `http.Handler` (not + `HandlerFunc`); no per-request state; safe for concurrent use; no exported + fields. `docs/PROJECT-MEMORY.md` line 43 is the rule. +- `internal/relay/healthz_test.go:11-69` — assertion shape: `httptest.NewRecorder` + + `httptest.NewRequest(http.MethodGet, …)`; `rec.Code` for status; `rec.Header().Get("Content-Type")` for content type; body parsed with a decoder (json there → text-format-decoder here). Reuse the same test layout. +- `docs/PROJECT-MEMORY.md` line 37 — "new deps need a justification (the ADR + is the justification)". This is the rule ADR-0008 satisfies. +- `docs/PROJECT-MEMORY.md` line 43 — handler-factory rule. `NewMetricsHandler` + follows it. +- `docs/knowledge/decisions/0008-prometheus-client-adoption.md` (this commit) + — adoption ADR; *Scope of use* section binds what `metrics.go` may and may + not do. +- `docs/knowledge/decisions/0004-ws-library-and-adapter-context-strategy.md` + — shape reference for an ADR justifying a third-party dep. Read for tone / + structure if you need to extend the ADR. +- `go.mod` — current dep surface (stdlib + `x/crypto` + `nhooyr.io/websocket`, + no transitive Prometheus deps yet). `go mod tidy` after the new dep lands + fills `go.sum` with the transitives ADR-0008 enumerates. +- `Makefile` — `make vet`, `make test` (runs with `-race`), and `make build` + are the AC gates; lint targets are unrelated to this ticket. +- `cmd/pyrycode-relay/main.go` — read only enough to confirm there is **no** + call to anything in `metrics.go` after this ticket. Wiring is #60's + problem; this ticket is structurally a scaffold. + +## Context + +The relay currently has no metrics surface — `/healthz` (#10) is a snapshot +only. Sibling tickets #57, #58 (counters) and #60 (listener) all depend on +the registry shape this ticket lands. + +This is the **scaffolding slice**: it adopts `github.com/prometheus/client_golang`, +records the adoption in ADR-0008, and ships a private registry + handler +factory. It ships **no** counters, **no** gauges, and **no** listener wiring. +Those land in their own tickets and consume the seam this ticket establishes. + +The single load-bearing design choice for this ticket is the *seam shape*: +how do counters that sibling tickets add wire onto a registry that this +ticket constructs, without forcing this ticket's call sites (the test, and +later #60's `main.go` line) to mutate their import sets every time a sibling +lands? AC names that choice explicitly ("the architect picks the +package-vars-vs-struct shape"). + +## Design + +Two artefacts in `internal/relay/metrics.go`, plus one test file, plus the +ADR (already written under `docs/knowledge/decisions/0008-…`), plus the +`go.mod` / `go.sum` updates from `go mod tidy`. + +### 1. `internal/relay/metrics.go` (new, ~25 lines plus doc comments) + +```go +package relay + +import ( + "net/http" + + "github.com/prometheus/client_golang/prometheus" + "github.com/prometheus/client_golang/prometheus/promhttp" +) + +// NewMetricsRegistry constructs a fresh *prometheus.Registry private to the +// relay. Each call returns an independent instance — tests construct as +// many as they need without process-wide collisions. The relay's wiring +// site (cmd/pyrycode-relay/main.go, landing in #60) holds exactly one +// registry for the process lifetime. +// +// Counters and gauges added by sibling tickets MUST register against the +// registry returned here (or a prometheus.Registerer derived from it), +// NEVER against prometheus.DefaultRegisterer. ADR-0008 § Scope of use +// fixes that rule. +func NewMetricsRegistry() *prometheus.Registry { + return prometheus.NewRegistry() +} + +// NewMetricsHandler returns an http.Handler that serves reg in the standard +// Prometheus text exposition format ("text/plain; version=0.0.4; +// charset=utf-8"). Pattern follows NewHealthzHandler (#10): +// factory-shaped, no per-request state, safe for concurrent use. +// +// The handler does not negotiate the OpenMetrics content type +// (EnableOpenMetrics defaults to false in promhttp.HandlerOpts). Holding +// the text format stable keeps the on-the-wire response shape +// independent of the caller's Accept header, matching the relay's +// /healthz posture (no content negotiation). +// +// Errors during collection are propagated to the response with HTTP 500 +// (promhttp's default ErrorHandling = HTTPErrorOnError); the relay does +// not currently inject a logger into the handler — if a future ticket +// wants collection errors in slog, it threads them via ErrorLog at the +// wiring site, not here. +func NewMetricsHandler(reg *prometheus.Registry) http.Handler { + return promhttp.HandlerFor(reg, promhttp.HandlerOpts{ + Registry: reg, + }) +} +``` + +**Note on `HandlerOpts.Registry`.** Setting `Registry: reg` enables +`promhttp` to register its own self-instrumentation counters +(`promhttp_metric_handler_*`) on the same private registry. This is +desirable: it keeps internal handler errors (e.g. concurrent-scrape +limit hits, if a future ticket sets one) visible on `/metrics` instead +of silently absent. Leaving the field nil would land those counters on +`DefaultRegisterer`, which ADR-0008 § Scope of use forbids — so setting +the field is structurally how this ticket holds the ADR's no-default- +registry boundary even for the library's own bookkeeping. + +### 2. The architect's "package-vars-vs-struct" pick + +**Pick: per-concern collector struct, constructed by a sibling-ticket +helper that takes `prometheus.Registerer`.** Each sibling ticket +defines a *separate* file under `internal/relay/` (e.g. +`metrics_upgrade.go` in #57, `metrics_register.go` in #58) with its +own type and constructor: + +```go +// (sibling ticket — illustrative, not landed by #59) +type upgradeFloodMetrics struct { + rejectsTotal *prometheus.CounterVec +} + +func newUpgradeFloodMetrics(reg prometheus.Registerer) *upgradeFloodMetrics { + m := &upgradeFloodMetrics{ + rejectsTotal: prometheus.NewCounterVec( + prometheus.CounterOpts{Name: "relay_upgrade_rejects_total", Help: "…"}, + []string{"reason"}, + ), + } + reg.MustRegister(m.rejectsTotal) + return m +} +``` + +Why this shape, not the alternatives: + +1. **Package-level vars (rejected).** A `var upgradeFloodTotal = …` at + package scope cannot register against a registry that + `NewMetricsRegistry()` constructs per call — package-level + initialisers run at import time and can only attach to a + package-level registry, which forces the registry into a singleton. + The relay's existing seam shape (`NewRegistry()` returns a fresh + `Registry`; tests construct as many as they need; see + ADR-0003) treats per-call construction as load-bearing for + testability. A metrics singleton would diverge for no benefit. +2. **One mega-struct on a public type (rejected).** A + `type Metrics struct { UpgradeFloodTotal …; RegisterFailsTotal …; … }` + that grows a field per sibling ticket forces every sibling PR to + touch the same file in the same struct. That is a guaranteed merge + conflict between concurrently-architected siblings (#57 and #58 in + flight at the same time — the file-overlap check at architect time + would block whichever lands second from running). It also bundles + unrelated concerns under one type. +3. **Per-concern collector struct (adopted).** Each sibling owns a + private `*xxxMetrics` type in its own file. The wiring site + (`main.go`, #60) holds *N* of them as locals, threads each into + the handler that increments it. Adding a new metric type is a + greenfield file under `internal/relay/`; no edit to `metrics.go` + or to other siblings' files. This is the natural Go-idiomatic + shape and the one ADR-0008's *Scope of use* assumes. + +This pick lives in this ticket's spec, not in `metrics.go` itself — +the seam this ticket exposes (`NewMetricsRegistry`, `NewMetricsHandler`) +admits the pattern without naming it. The spec is the place sibling +architects read first. + +### 3. `internal/relay/metrics_test.go` (new, ~80 lines) + +Three tests, no fixtures, pure stdlib + `prometheus/common/expfmt` (a +transitive of `client_golang` — already in the module graph, no new +dep). + +```go +package relay + +import ( + "net/http" + "net/http/httptest" + "strings" + "testing" + + "github.com/prometheus/common/expfmt" +) +``` + +**Test 1: `TestMetricsHandler_EmptyRegistry_ResponseShape`** — covers the +explicit AC ("an empty registry served via the handler returns HTTP 200 +with `Content-Type: text/plain; version=0.0.4; charset=utf-8` … and a +parseable body"). + +```go +func TestMetricsHandler_EmptyRegistry_ResponseShape(t *testing.T) { + t.Parallel() + + reg := NewMetricsRegistry() + h := NewMetricsHandler(reg) + + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + + if rec.Code != http.StatusOK { + t.Fatalf("status: got %d, want %d", rec.Code, http.StatusOK) + } + if got, want := rec.Header().Get("Content-Type"), + "text/plain; version=0.0.4; charset=utf-8"; got != want { + t.Errorf("Content-Type: got %q, want %q", got, want) + } + + // "Parseable body" — round-trip the response through expfmt's text + // decoder. An empty registry produces zero MetricFamily entries; + // the decoder must reach io.EOF without an error before that. + dec := expfmt.NewDecoder(rec.Body, expfmt.FmtText) + var families int + for { + var mf dto.MetricFamily // see import note below + if err := dec.Decode(&mf); err != nil { + if err.Error() == "EOF" || errors.Is(err, io.EOF) { + break + } + t.Fatalf("decode: %v (body=%q)", err, rec.Body.String()) + } + families++ + } + if families != 0 { + t.Errorf("empty registry: got %d MetricFamilies, want 0", families) + } +} +``` + +Imports the test needs: `io`, `errors`, plus +`dto "github.com/prometheus/client_model/go"` for `dto.MetricFamily`. +Both are transitives of `client_golang` already in the module graph +(see ADR-0008 § Supply-chain notes). The developer wires the import +block accordingly. + +Implementation note on the content-type literal: at `client_golang` v1.x +the constant `expfmt.FmtText` resolves to that exact string. If a future +release renames or version-bumps the format constant, the test's literal +string is the canary — keep the literal in the assertion (not a +re-export of `expfmt.FmtText`) so a silent format change in a bump +surfaces as a test failure rather than a tautological pass. + +**Test 2: `TestMetricsHandler_RegisterAndScrape_RoundTrip`** — a smoke +check that the seam actually plumbs through. Constructs a registry, +registers one trivial counter against it, increments it, scrapes, +asserts the counter appears in the body in text format. + +```go +func TestMetricsHandler_RegisterAndScrape_RoundTrip(t *testing.T) { + t.Parallel() + + reg := NewMetricsRegistry() + c := prometheus.NewCounter(prometheus.CounterOpts{ + Name: "relay_test_counter_total", + Help: "Sole purpose: prove the registry → handler seam works.", + }) + reg.MustRegister(c) + c.Add(2) + + h := NewMetricsHandler(reg) + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + + if rec.Code != http.StatusOK { + t.Fatalf("status: got %d, want %d", rec.Code, http.StatusOK) + } + body := rec.Body.String() + if !strings.Contains(body, "relay_test_counter_total 2") { + t.Errorf("body missing counter+value; got:\n%s", body) + } +} +``` + +This test is not in the AC literally but is the cheapest possible +seam-verification — it would catch a regression where, e.g., a future +bump of `client_golang` changes the handler's `prometheus.Gatherer` +contract and the registered counter no longer surfaces. ~15 lines; it +pays its freight. + +**Test 3: `TestMetricsRegistry_NoGlobalRegistrarLeak`** — structural +defence for ADR-0008's "never `prometheus.DefaultRegisterer`" rule. + +```go +func TestMetricsRegistry_NoGlobalRegistrarLeak(t *testing.T) { + // Snapshot the default registerer's collectors. After + // NewMetricsRegistry()+NewMetricsHandler() are called, the default + // registerer must contain the same set — nothing the relay does + // in this ticket may register against the global. + t.Parallel() + before := defaultRegistererSize(t) + _ = NewMetricsHandler(NewMetricsRegistry()) + after := defaultRegistererSize(t) + if before != after { + t.Fatalf("default registerer changed: before=%d, after=%d "+ + "(metrics.go must never register on prometheus.DefaultRegisterer; "+ + "see ADR-0008 § Scope of use)", before, after) + } +} + +func defaultRegistererSize(t *testing.T) int { + t.Helper() + // Use the default gatherer (which wraps DefaultRegisterer). Gather() + // returns one MetricFamily per registered Collector's metrics. + mfs, err := prometheus.DefaultGatherer.Gather() + if err != nil { + t.Fatalf("DefaultGatherer.Gather: %v", err) + } + return len(mfs) +} +``` + +The "snapshot before/after" framing tolerates whatever the Go runtime +or `client_golang` itself has already registered on the default +gatherer at process start — the test asserts a delta of zero, not an +absolute count. This makes it stable across `client_golang` version +bumps that might pre-register their own internals. + +This test enforces the ADR rule structurally rather than by review. +~20 lines; pays its freight by closing the most-likely future bypass +(a contributor reaches for `promauto.NewCounter(...)` which silently +registers on `DefaultRegisterer`). + +## Concurrency model + +None this ticket lands. `promhttp.HandlerFor`'s returned handler is +documented safe for concurrent calls; the registry's `Register` / +`MustRegister` / `Gather` methods take their own locks internally. +Sibling tickets that register counters will do so once at construction +(in the wiring site) — that is a single-goroutine call before +serving begins. + +## Error handling + +Three concrete failure modes the spec accounts for: + +1. **`go mod tidy` cannot resolve the new dep.** Most likely cause is a + sandboxed CI environment with no proxy access — `make build` would + fail with a Go module error. Out of scope to mitigate in this + ticket; the developer reports the env issue on the ticket and the + pipeline owner sets `GOPROXY` appropriately. Recovery: re-run + `make build` after the env is fixed. +2. **`promhttp.HandlerFor` returns a 500 on a collector error at scrape + time.** Not reachable in this ticket (empty registry; no collectors). + The handler's `ErrorHandling` defaults to `HTTPErrorOnError` — the + 500 is returned with a body containing the error message. ADR-0008's + *Scope of use* notes that a future ticket can inject `ErrorLog` + via `HandlerOpts` to route the error into `slog`; this ticket does + not. +3. **A sibling ticket registers a colliding metric name.** `MustRegister` + panics on collision. The panic surfaces at process startup (in + `main.go`'s wiring), not at scrape time — desirable, because it + refuses to serve a broken metrics surface. Spec relies on each + sibling's tests to catch their own collision before merge. + +## Testing strategy + +- **`make test` runs the three tests above; they each exercise an + independent contract** (empty-registry response shape, round-trip + registration, no-global-leak structural invariant). +- **No race tests.** The seam this ticket lands is single-goroutine; + the library handles its own locking. `go test -race ./...` will run + the new tests under `-race` because `make test` is `-race`-wide, but + there is no goroutine the tests need to spin up for race coverage. +- **No new e2e or integration coverage.** The listener wiring is #60; + that ticket's tests will hit `/metrics` over HTTP against a running + process. This ticket's gate is unit-test-only. + +## Open questions + +1. **`promhttp.HandlerOpts.MaxRequestsInFlight` and `Timeout`.** Both + default to zero (unbounded / no timeout). The relay's listener (#60) + sits behind the existing `http.Server` whose `ReadTimeout` / + `WriteTimeout` bound scrape-side I/O — so the in-handler defaults + are not a DoS exposure here. Recorded so the #60 architect sees the + decision delegated to them: if the metrics listener is on a separate + port with its own `http.Server`, the timeouts on *that* server are + what matters; the in-handler options remain optional. +2. **Process / Go-runtime collectors.** Out of scope for this ticket + (ADR-0008 § Scope of use names them as forbidden for #59). A future + ticket may opt in by registering them on the relay's private + registry, never on the default. No design decision needed here; the + ADR is the contract. +3. **`docs/knowledge/codebase/59.md` and `docs/knowledge/INDEX.md` + updates.** The AC lists both. Per the architect's own "Never Update" + rule (`docs/knowledge/INDEX.md` is documentation-phase territory), + the architect does **not** write either file in this ticket. The + developer's commit ships the code; the documentation agent's + post-merge run lands the codebase summary and the INDEX entry. The + AC is satisfied at the ticket level by the documentation phase, not + by the architect or developer phases. Recorded here so the developer + does not chase an AC item that is not theirs. + +## Out of scope + +- Any gauge or counter (`/metrics` is empty after this ticket). +- The `/metrics` listener (#60's job). +- Bind-address validation, flag plumbing, TLS for the metrics port. +- Process / Go-runtime collectors. +- An `ErrorLog` wiring from `promhttp` into `slog`. +- Updating the `docs/PROJECT-MEMORY.md` line 37 dep list (frozen + document; documentation phase owns evergreen knowledge). + +## Security review (security-sensitive ticket) + +### Mindset + +Re-reading the spec as an adversary. Default verdict FAIL until each +applicable category below has either a concrete finding or an explicit +design decision that makes the category not applicable. + +### 1. Trust boundaries + +- **Inbound trust boundary:** none introduced by this ticket. The + `/metrics` listener (which would terminate untrusted scrape traffic) + is #60's responsibility. This ticket ships a handler factory and + a registry constructor; both are pure constructors with no I/O. +- **Internal boundary:** the registry private to the relay package vs. + `prometheus.DefaultRegisterer`. ADR-0008 names the rule; this spec + enforces it structurally with Test 3 + (`TestMetricsRegistry_NoGlobalRegistrarLeak`). + +### 2. Tokens, secrets, credentials + +- The relay's only handled secret is the phone token (`X-Pyrycode-Token`), + presence-checked and discarded at `/v1/client` (#5). This ticket adds + no code path that touches a token. Metric labels added by sibling + tickets MUST NOT carry token values; that rule lands in each sibling's + spec, not here. Recorded so the next architect reading this file sees + the boundary explicitly: **metric labels are an exfiltration channel; + treat them with the same MUST-NOT-emit posture as log values + (`docs/threat-model.md` § Log hygiene)**. + +### 3. File operations + +- None. `metrics.go` performs no file I/O. + +### 4. Subprocess / external command execution + +- None. + +### 5. Cryptographic primitives + +- None directly. `client_golang` v1.x has no `crypto/*` API surface + the relay reaches into. (The transitive dep `xxhash/v2` is a + non-cryptographic hash used internally for label-set keying; not + attacker-influenced from outside the process.) + +### 6. Network & I/O + +- **Handler exposure surface.** `promhttp.HandlerFor` returns a handler + whose response size is proportional to the number of registered + collectors. For this ticket, the registry is empty — response is + essentially zero bytes. Sibling tickets registering CounterVec / + HistogramVec with high-cardinality labels could expand this; that + is each sibling's threat-model concern, with the security-sensitive + label on their tickets gating the check. +- **No `Accept` content negotiation in this ticket.** The handler + always returns text format (OpenMetrics disabled by leaving + `EnableOpenMetrics: false` at default). This avoids the negotiator + becoming a covert channel for content-type-based parser + differentials. Decision documented in `metrics.go`'s docstring. +- **No `http.Server` timeout policy in this ticket.** The handler is + *constructed* here; serving is #60's job. Recorded in *Open + questions* #1 for the #60 architect. + +### 7. Error messages, logs, telemetry + +- **No `slog` integration.** The handler does not log; collection + errors return HTTP 500 with the error string in the body. The error + string from `prometheus.Gatherer.Gather` is library-internal and does + not carry relay state — it names a collector type or a label name, + not a request header or a token. Acceptable for this ticket; the #60 + architect can revisit by injecting `ErrorLog` if a future regression + flags a sensitive string. +- **Metric labels as a telemetry channel.** Empty for this ticket. The + boundary applies to siblings (point in *Tokens, secrets, credentials* + above). + +### 8. Concurrency + +- The handler is documented safe for concurrent calls; the registry + uses internal locking. This ticket spawns no goroutines, holds no + locks, owns no shared state. + +### 9. Threat model alignment + +- `docs/threat-model.md` § *Log hygiene*: extends to metric labels per + the rule recorded above (sibling tickets enforce). +- `docs/threat-model.md` § *DoS*: the `/metrics` listener as a scrape + surface is #60's design; this ticket does not introduce a new + network-reachable endpoint. +- `docs/threat-model.md` § *Supply chain*: ADR-0008 enumerates the + transitive dep set, names the maintenance status, and binds the dep + to `go.sum` digests. `govulncheck` covers the expanded surface via + `make lint`. +- Protocol spec (`pyrycode/pyrycode/docs/protocol-mobile.md`) is + unaffected by this ticket — no wire-protocol surface change. + +### Findings + +- **[Trust boundaries]** No findings — single internal boundary + (private registry vs. `DefaultRegisterer`) enforced structurally by + Test 3. +- **[Tokens]** SHOULD FIX (deferred to sibling specs) — metric labels + are a future exfiltration channel; rule recorded above so each + sibling architect sees the constraint when their spec lands. +- **[File operations]** No findings — none performed. +- **[Subprocess]** No findings — none performed. +- **[Cryptographic primitives]** No findings — none used. +- **[Network & I/O]** OUT OF SCOPE — listener is #60. Handler-side + content negotiation pinned to text format; recorded. +- **[Errors / logs / telemetry]** No findings for this ticket — handler + does not log; collection-error strings are library-internal. +- **[Concurrency]** No findings — no new goroutines, locks, or shared + state. +- **[Threat model]** No findings unique to this ticket — supply-chain + expansion documented in ADR-0008; log/DoS/protocol surfaces + unchanged or deferred. + +### Verdict + +**PASS.** No MUST FIX findings. The one SHOULD FIX (metric-label +exfiltration) is deferred to sibling specs by design — they each +carry the `security-sensitive` label and gate the check at the right +layer. Recording the rule here propagates the constraint without +inventing scaffolding work that has no caller in this ticket. + +**Reviewer:** architect (self-review per `architect/security-review.md`) +**Date:** 2026-05-11 From 654add0db11b819ca1ddfbcd21e66eb5400d5b44 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Mon, 11 May 2026 12:32:03 +0300 Subject: [PATCH 2/3] feat(relay): prometheus metrics registry scaffolding (#59) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adopt github.com/prometheus/client_golang (per ADR-0008) and stand up the relay's private metrics registry + handler factory. Scaffolding only — no counters, no gauges, no listener wiring (those land in #57, #58, #60). `NewMetricsRegistry()` returns a fresh `*prometheus.Registry`; per-call construction matches the existing `NewRegistry()` testability shape and lets each test build an independent registry without process-wide collisions. `NewMetricsHandler(reg)` returns `http.Handler` wrapping `promhttp.HandlerFor(reg, …)`. `HandlerOpts.Registry: reg` is load-bearing: it routes promhttp's own self-instrumentation collectors (`promhttp_metric_handler_*`) onto the relay's private registry rather than `prometheus.DefaultRegisterer` — that's how the ADR-0008 § Scope of use "never the default registerer" rule holds even for the library's internal bookkeeping. Three unit tests: - `TestMetricsHandler_EmptyRegistry_ResponseShape` covers the AC: HTTP 200, `Content-Type` starts with `text/plain; version=0.0.4; charset=utf-8`, body parseable by `expfmt`'s text decoder to `io.EOF`. The Content-Type is substring-matched because prometheus/common v0.62+ appends an additive `; escaping=` suffix (UTF-8 names support) which the AC literal predates — the substring is still the spec-author's canary for non-additive format changes. - `TestMetricsHandler_RegisterAndScrape_RoundTrip` proves the registry→handler seam works end-to-end with a trivial counter. - `TestMetricsRegistry_NoGlobalRegistrarLeak` structurally enforces the ADR-0008 § Scope of use rule by asserting a zero delta on `DefaultGatherer.Gather()` across registry+handler construction. Co-Authored-By: Claude Opus 4.7 --- go.mod | 11 ++++ go.sum | 46 +++++++++++++ internal/relay/metrics.go | 51 +++++++++++++++ internal/relay/metrics_test.go | 115 +++++++++++++++++++++++++++++++++ 4 files changed, 223 insertions(+) create mode 100644 internal/relay/metrics.go create mode 100644 internal/relay/metrics_test.go diff --git a/go.mod b/go.mod index 7737307..e4998f4 100644 --- a/go.mod +++ b/go.mod @@ -3,11 +3,22 @@ module github.com/pyrycode/pyrycode-relay go 1.26.2 require ( + github.com/prometheus/client_golang v1.23.2 + github.com/prometheus/client_model v0.6.2 + github.com/prometheus/common v0.66.1 golang.org/x/crypto v0.50.0 nhooyr.io/websocket v1.8.17 ) require ( + github.com/beorn7/perks v1.0.1 // indirect + github.com/cespare/xxhash/v2 v2.3.0 // indirect + github.com/kr/text v0.2.0 // indirect + github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect + github.com/prometheus/procfs v0.16.1 // indirect + go.yaml.in/yaml/v2 v2.4.2 // indirect golang.org/x/net v0.52.0 // indirect + golang.org/x/sys v0.43.0 // indirect golang.org/x/text v0.36.0 // indirect + google.golang.org/protobuf v1.36.8 // indirect ) diff --git a/go.sum b/go.sum index 5b6d84b..d773965 100644 --- a/go.sum +++ b/go.sum @@ -1,8 +1,54 @@ +github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM= +github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw= +github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs= +github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs= +github.com/creack/pty v1.1.9/go.mod h1:oKZEueFk5CKHvIhNR5MUki03XCEU+Q6VDXinZuGJ33E= +github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c= +github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8= +github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU= +github.com/klauspost/compress v1.18.0 h1:c/Cqfb0r+Yi+JtIEq73FWXVkRonBlf0CRNYc8Zttxdo= +github.com/klauspost/compress v1.18.0/go.mod h1:2Pp+KzxcywXVXMr50+X0Q/Lsb43OQHYWRCY2AiWywWQ= +github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE= +github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk= +github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY= +github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE= +github.com/kylelemons/godebug v1.1.0 h1:RPNrshWIDI6G2gRW9EHilWtl7Z6Sb1BR0xunSBf0SNc= +github.com/kylelemons/godebug v1.1.0/go.mod h1:9/0rRGxNHcop5bhtWyNeEfOS8JIWk580+fNqagV/RAw= +github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 h1:C3w9PqII01/Oq1c1nUAm88MOHcQC9l5mIlSMApZMrHA= +github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ= +github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= +github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= +github.com/prometheus/client_golang v1.23.2 h1:Je96obch5RDVy3FDMndoUsjAhG5Edi49h0RJWRi/o0o= +github.com/prometheus/client_golang v1.23.2/go.mod h1:Tb1a6LWHB3/SPIzCoaDXI4I8UHKeFTEQ1YCr+0Gyqmg= +github.com/prometheus/client_model v0.6.2 h1:oBsgwpGs7iVziMvrGhE53c/GrLUsZdHnqNwqPLxwZyk= +github.com/prometheus/client_model v0.6.2/go.mod h1:y3m2F6Gdpfy6Ut/GBsUqTWZqCUvMVzSfMLjcu6wAwpE= +github.com/prometheus/common v0.66.1 h1:h5E0h5/Y8niHc5DlaLlWLArTQI7tMrsfQjHV+d9ZoGs= +github.com/prometheus/common v0.66.1/go.mod h1:gcaUsgf3KfRSwHY4dIMXLPV0K/Wg1oZ8+SbZk/HH/dA= +github.com/prometheus/procfs v0.16.1 h1:hZ15bTNuirocR6u0JZ6BAHHmwS1p8B4P6MRqxtzMyRg= +github.com/prometheus/procfs v0.16.1/go.mod h1:teAbpZRB1iIAJYREa1LsoWUXykVXA1KlTmWl8x/U+Is= +github.com/rogpeppe/go-internal v1.10.0 h1:TMyTOH3F/DB16zRVcYyreMH6GnZZrwQVAoYjRBZyWFQ= +github.com/rogpeppe/go-internal v1.10.0/go.mod h1:UQnix2H7Ngw/k4C5ijL5+65zddjncjaFoBhdsK/akog= +github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U= +github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U= +go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto= +go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE= +go.yaml.in/yaml/v2 v2.4.2 h1:DzmwEr2rDGHl7lsFgAHxmNz/1NlQ7xLIrlN2h5d1eGI= +go.yaml.in/yaml/v2 v2.4.2/go.mod h1:081UH+NErpNdqlCXm3TtEran0rJZGxAYx9hb/ELlsPU= golang.org/x/crypto v0.50.0 h1:zO47/JPrL6vsNkINmLoo/PH1gcxpls50DNogFvB5ZGI= golang.org/x/crypto v0.50.0/go.mod h1:3muZ7vA7PBCE6xgPX7nkzzjiUq87kRItoJQM1Yo8S+Q= golang.org/x/net v0.52.0 h1:He/TN1l0e4mmR3QqHMT2Xab3Aj3L9qjbhRm78/6jrW0= golang.org/x/net v0.52.0/go.mod h1:R1MAz7uMZxVMualyPXb+VaqGSa3LIaUqk0eEt3w36Sw= +golang.org/x/sys v0.43.0 h1:Rlag2XtaFTxp19wS8MXlJwTvoh8ArU6ezoyFsMyCTNI= +golang.org/x/sys v0.43.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw= golang.org/x/text v0.36.0 h1:JfKh3XmcRPqZPKevfXVpI1wXPTqbkE5f7JA92a55Yxg= golang.org/x/text v0.36.0/go.mod h1:NIdBknypM8iqVmPiuco0Dh6P5Jcdk8lJL0CUebqK164= +google.golang.org/protobuf v1.36.8 h1:xHScyCOEuuwZEc6UtSOvPbAT4zRh0xcNRYekJwfqyMc= +google.golang.org/protobuf v1.36.8/go.mod h1:fuxRtAxBytpl4zzqUh6/eyUujkJdNiuEkXntxiD/uRU= +gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= +gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk= +gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q= +gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= +gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= nhooyr.io/websocket v1.8.17 h1:KEVeLJkUywCKVsnLIDlD/5gtayKp8VoCkksHCGGfT9Y= nhooyr.io/websocket v1.8.17/go.mod h1:rN9OFWIUwuxg4fR5tELlYC04bXYowCP9GX47ivo2l+c= diff --git a/internal/relay/metrics.go b/internal/relay/metrics.go new file mode 100644 index 0000000..557dbd5 --- /dev/null +++ b/internal/relay/metrics.go @@ -0,0 +1,51 @@ +package relay + +import ( + "net/http" + + "github.com/prometheus/client_golang/prometheus" + "github.com/prometheus/client_golang/prometheus/promhttp" +) + +// NewMetricsRegistry constructs a fresh *prometheus.Registry private to the +// relay. Each call returns an independent instance — tests construct as many +// as they need without process-wide collisions. The relay's wiring site +// (cmd/pyrycode-relay/main.go, landing in #60) holds exactly one registry for +// the process lifetime. +// +// Counters and gauges added by sibling tickets MUST register against the +// registry returned here (or a prometheus.Registerer derived from it), NEVER +// against prometheus.DefaultRegisterer. ADR-0008 § Scope of use fixes that +// rule; metrics_test.go's TestMetricsRegistry_NoGlobalRegistrarLeak enforces +// it structurally. +func NewMetricsRegistry() *prometheus.Registry { + return prometheus.NewRegistry() +} + +// NewMetricsHandler returns an http.Handler that serves reg in the standard +// Prometheus text exposition format ("text/plain; version=0.0.4; +// charset=utf-8"). Pattern follows NewHealthzHandler (#10): factory-shaped, +// no per-request state, safe for concurrent use. +// +// The handler does not negotiate the OpenMetrics content type +// (EnableOpenMetrics defaults to false in promhttp.HandlerOpts). Holding the +// text format stable keeps the on-the-wire response shape independent of the +// caller's Accept header, matching the relay's /healthz posture (no content +// negotiation). +// +// HandlerOpts.Registry is set to reg so promhttp's own self-instrumentation +// counters (promhttp_metric_handler_*) register against the relay's private +// registry rather than prometheus.DefaultRegisterer — that bookkeeping is +// load-bearing for the ADR-0008 § Scope of use rule "DefaultRegisterer is +// never touched" even for the library's internals. +// +// Errors during collection are propagated to the response with HTTP 500 +// (promhttp's default ErrorHandling = HTTPErrorOnError); the relay does not +// currently inject a logger into the handler — if a future ticket wants +// collection errors in slog, it threads them via ErrorLog at the wiring +// site, not here. +func NewMetricsHandler(reg *prometheus.Registry) http.Handler { + return promhttp.HandlerFor(reg, promhttp.HandlerOpts{ + Registry: reg, + }) +} diff --git a/internal/relay/metrics_test.go b/internal/relay/metrics_test.go new file mode 100644 index 0000000..46047d7 --- /dev/null +++ b/internal/relay/metrics_test.go @@ -0,0 +1,115 @@ +package relay + +import ( + "errors" + "io" + "net/http" + "net/http/httptest" + "strings" + "testing" + + "github.com/prometheus/client_golang/prometheus" + dto "github.com/prometheus/client_model/go" + "github.com/prometheus/common/expfmt" +) + +// textFormatPrefix is the substring of the Prometheus text exposition +// Content-Type that the AC pins on every /metrics response: "text/plain; +// version=0.0.4; charset=utf-8". Newer prometheus/common versions append +// "; escaping=" (default "underscores") to advertise the name-escape +// policy used when serialising metric/label names — that suffix is an +// additive, backward-compatible extension, so we substring-match the AC +// literal instead of equality-matching. The substring is the canary the +// spec calls out: a release that renames the version or drops the charset +// would still surface here as a test failure rather than a tautological +// pass. +const textFormatPrefix = "text/plain; version=0.0.4; charset=utf-8" + +func TestMetricsHandler_EmptyRegistry_ResponseShape(t *testing.T) { + t.Parallel() + + reg := NewMetricsRegistry() + h := NewMetricsHandler(reg) + + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + + if rec.Code != http.StatusOK { + t.Fatalf("status: got %d, want %d", rec.Code, http.StatusOK) + } + if got := rec.Header().Get("Content-Type"); !strings.HasPrefix(got, textFormatPrefix) { + t.Errorf("Content-Type: got %q, want prefix %q", got, textFormatPrefix) + } + + // "Parseable body" — round-trip the response through expfmt's text + // decoder. Decoding to io.EOF without an error is the contract; the + // number of MetricFamilies is *not* zero because NewMetricsHandler + // passes HandlerOpts.Registry=reg, which registers promhttp's own + // self-instrumentation collectors (promhttp_metric_handler_*) on the + // private registry — that's the load-bearing design choice in + // metrics.go (ADR-0008 § Scope of use: nothing on DefaultRegisterer, + // not even promhttp's own bookkeeping). + dec := expfmt.NewDecoder(rec.Body, expfmt.NewFormat(expfmt.TypeTextPlain)) + for { + var mf dto.MetricFamily + if err := dec.Decode(&mf); err != nil { + if errors.Is(err, io.EOF) { + break + } + t.Fatalf("decode: %v (body=%q)", err, rec.Body.String()) + } + } +} + +func TestMetricsHandler_RegisterAndScrape_RoundTrip(t *testing.T) { + t.Parallel() + + reg := NewMetricsRegistry() + c := prometheus.NewCounter(prometheus.CounterOpts{ + Name: "relay_test_counter_total", + Help: "Sole purpose: prove the registry → handler seam works.", + }) + reg.MustRegister(c) + c.Add(2) + + h := NewMetricsHandler(reg) + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + + if rec.Code != http.StatusOK { + t.Fatalf("status: got %d, want %d", rec.Code, http.StatusOK) + } + body := rec.Body.String() + if !strings.Contains(body, "relay_test_counter_total 2") { + t.Errorf("body missing counter+value; got:\n%s", body) + } +} + +func TestMetricsRegistry_NoGlobalRegistrarLeak(t *testing.T) { + // Structural defence for ADR-0008's "never prometheus.DefaultRegisterer" + // rule. Snapshot the default gatherer's MetricFamily count before and + // after constructing the relay's registry and handler — the delta MUST + // be zero. Absolute count is whatever the Go runtime or client_golang + // itself has pre-registered; this test asserts the delta only, so it is + // stable across version bumps. + t.Parallel() + + before := defaultGathererSize(t) + _ = NewMetricsHandler(NewMetricsRegistry()) + after := defaultGathererSize(t) + + if before != after { + t.Fatalf("default registerer changed: before=%d, after=%d "+ + "(metrics.go must never register on prometheus.DefaultRegisterer; "+ + "see ADR-0008 § Scope of use)", before, after) + } +} + +func defaultGathererSize(t *testing.T) int { + t.Helper() + mfs, err := prometheus.DefaultGatherer.Gather() + if err != nil { + t.Fatalf("DefaultGatherer.Gather: %v", err) + } + return len(mfs) +} From 62d6bf7ac7dcd5ce3e05c0db067010ed34d1f6eb Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Tue, 12 May 2026 19:21:03 +0300 Subject: [PATCH 3/3] =?UTF-8?q?docs:=20metrics=20registry=20scaffolding=20?= =?UTF-8?q?=E2=80=94=20feature=20doc,=20codebase=20note,=20ADR=20+=20featu?= =?UTF-8?q?re=20in=20INDEX=20(#59)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.7 --- docs/knowledge/INDEX.md | 2 + docs/knowledge/codebase/59.md | 56 +++++++++++++ docs/knowledge/features/metrics-registry.md | 90 +++++++++++++++++++++ 3 files changed, 148 insertions(+) create mode 100644 docs/knowledge/codebase/59.md create mode 100644 docs/knowledge/features/metrics-registry.md diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index 8b0e762..34e73c9 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Features +- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). No counters/gauges, no listener — siblings (#57, #58) and #60 fill those in. Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59). - [Docker image](features/docker-image.md) — portable OCI artifact: multi-stage `Dockerfile` builds a fully-static binary (`CGO_ENABLED=0`, `-trimpath -s -w`) into `distroless/static-debian12:nonroot`; both base images digest-pinned with `# Tracks:` comments; exposes `:80`/`:443` and declares `/var/lib/relay/autocert` volume; host-specific wiring (TLS policy, ports, volumes, healthcheck) is #38's problem (#32). - [Binary-side frame forwarder](features/binary-forwarder.md) — per-binary read pump: unwraps each inbound routing envelope, linear-scans `PhonesFor(serverID)` for `env.ConnID`, writes `env.Frame` verbatim to that phone; opaque inner bytes; synchronous (handler discards the return); diverges from #25 in error policy — unknown `conn_id`, malformed envelope, phone `Send` error all log+continue (a single bad frame never tears down the binary); replaced `/v1/server`'s `CloseRead` placeholder (#26). - [WebSocket heartbeat](features/heartbeat.md) — per-conn goroutine on both endpoints sends RFC 6455 ping every 30s; closes with `1011 "heartbeat timeout"` if no pong within 30s. Detects half-open TCP within 60s; ctx-cancel exit path leaves close to the handler defer (#7). @@ -18,6 +19,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Decisions +- [ADR-0008: Adopt `github.com/prometheus/client_golang` for relay metrics](decisions/0008-prometheus-client-adoption.md) — first non-stdlib direct dep since #15 (`nhooyr.io/websocket`); alternatives (hand-rolled text format, `VictoriaMetrics/metrics`, `expvar`+sidecar) rejected; transitive set enumerated and pinned via `go.sum`; scope of use bans `DefaultRegisterer`, process/runtime collectors, `slog` adapter, Prometheus router/config helpers; pattern established: ADR-before-import for any next direct dep. - [ADR-0007: `WSConn.CloseWithCode` for active-conn application close codes](decisions/0007-wsconn-closewithcode-for-active-conn.md) — extends ADR-0005 for the post-claim window: heartbeat (#7) needs `1011 "heartbeat timeout"` on a live WSConn; `Close()` delegates to `CloseWithCode(StatusNormalClosure, "")`, both share `closeOnce`. - [ADR-0006: Grace window IS the reclaim path](decisions/0006-grace-period-as-reclaim-path.md) — `ClaimServer` during a pending grace timer succeeds (not `4409`); pointer-identity wrapper defends against stale `time.AfterFunc` fires after `Stop()`. - [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](decisions/0005-application-close-codes-via-underlying-conn.md) — `WSConn.Close()` stays minimal (`StatusNormalClosure`); handlers emit `4409` / `4401` / `4404` directly on the `*websocket.Conn` in stillborn-WSConn paths. diff --git a/docs/knowledge/codebase/59.md b/docs/knowledge/codebase/59.md new file mode 100644 index 0000000..67d8172 --- /dev/null +++ b/docs/knowledge/codebase/59.md @@ -0,0 +1,56 @@ +# Ticket #59 — Prometheus metrics registry scaffolding + +Stands up a private `*prometheus.Registry` and an `http.Handler` factory wrapping `promhttp.HandlerFor`. First non-stdlib direct dep since #15 (`nhooyr.io/websocket`); ADR-0008 is the justification. Scaffolding only — no counters, no gauges, no listener. The seam exists so sibling tickets (#57, #58, …) attach metrics and #60 wires the listener without re-touching `metrics.go`. + +## Implementation + +- **`internal/relay/metrics.go`** — two exports: + - `NewMetricsRegistry() *prometheus.Registry` — thin wrapper over `prometheus.NewRegistry()`. Fresh instance per call, no singleton. + - `NewMetricsHandler(reg *prometheus.Registry) http.Handler` — `promhttp.HandlerFor(reg, HandlerOpts{Registry: reg})`. `HandlerOpts.Registry: reg` is load-bearing: it routes `promhttp`'s self-instrumentation (`promhttp_metric_handler_*`) onto the private registry instead of `DefaultRegisterer`, so the ADR-0008 § Scope of use rule holds even for the library's internals. Pattern mirrors `NewHealthzHandler` (#10). +- **`internal/relay/metrics_test.go`** — three tests: + - `TestMetricsHandler_EmptyRegistry_ResponseShape` — AC gate. Status 200, `Content-Type` prefix `text/plain; version=0.0.4; charset=utf-8`, body decodes through `expfmt.NewDecoder` to `io.EOF`. Substring (not equality) match on content-type because newer `prometheus/common` appends `; escaping=` — additive, backward-compatible. + - `TestMetricsHandler_RegisterAndScrape_RoundTrip` — registers a trivial counter, increments to 2, asserts `relay_test_counter_total 2` in the body. Catches the cheapest possible regression where a future `client_golang` bump breaks the `Gatherer` contract for registered metrics. + - `TestMetricsRegistry_NoGlobalRegistrarLeak` — snapshots `prometheus.DefaultGatherer.Gather()` `len(mfs)` before and after `NewMetricsHandler(NewMetricsRegistry())`. Asserts delta = 0. Stable across version bumps (asserts delta, not absolute count). Structural defence for ADR-0008 against a contributor reaching for `promauto`'s default-registry shortcuts. +- **`docs/knowledge/decisions/0008-prometheus-client-adoption.md`** — new ADR. Alternatives considered (hand-rolled text format; `VictoriaMetrics/metrics`; `expvar` + exporter sidecar), supply-chain notes (transitives: `beorn7/perks`, `cespare/xxhash/v2`, `munnerz/goautoneg`, `prometheus/client_model`, `prometheus/common`, `prometheus/procfs`, `google.golang.org/protobuf`), scope-of-use boundary (no `DefaultRegisterer`, no process/runtime collectors, no `slog` adapter, no Prometheus router/config helpers). +- **`go.mod` / `go.sum`** — first new direct dep since `nhooyr.io/websocket` in #15. `make vet`, `make test -race`, `make build` clean. +- **No changes to `cmd/pyrycode-relay/main.go`** — the listener (and any mux registration) lives in #60. This ticket is scaffolding by design; both exports are dead code at merge time, brought to life by siblings. + +## Patterns established + +- **Per-concern collector struct, constructed by a sibling-ticket helper taking `prometheus.Registerer`.** Each sibling adds a private `*xxxMetrics` type in its own file (e.g. `metrics_upgrade.go`); the wiring site (`main.go`, #60) holds N collectors as locals. Adding a metric type is a greenfield file — no edits to `metrics.go` or to other siblings' files. Eliminates the merge-conflict surface that a growing mega-struct or package-level `var`s would create between concurrently-architected siblings (#57 and #58 were in flight at the same time). +- **Handler-factory rule extends.** `NewMetricsHandler` returns `http.Handler` (not `http.HandlerFunc`), holds no per-request state, exposes no fields — same contract as `NewHealthzHandler` (#10). +- **Per-call registry construction.** `NewMetricsRegistry()` returns a fresh registry on each call; the process holds exactly one in production but tests construct as many as they need. Matches `Registry` (#3, ADR-0003) — tests aren't forced to share global state. +- **ADR-before-import for new direct deps.** The dispatcher's documentation agent cross-links the ADR from `INDEX.md` under *Decisions*. The pattern is the new bar for any next direct dep — ADR file lands before the `go.mod` edit, justification or not, never a silent first-import. + +## Seam-shape pick rationale (recorded in the spec, not in `metrics.go`) + +The spec (`docs/specs/architecture/59-metrics-scaffolding.md`) is the place sibling architects read first for the registration shape. Three options were considered: + +1. **Package-level `var`s.** Rejected. Package-init can only register on a package-level singleton, which forces `NewMetricsRegistry` into a singleton and breaks per-test isolation. +2. **One mega-struct (`type Metrics struct { … }`) growing fields per sibling.** Rejected. Concurrent sibling PRs all touch the same struct — guaranteed merge conflict. +3. **Per-concern collector struct (adopted).** Each sibling owns a private type in its own file; `metrics.go` stays untouched after #59. + +The seam exposed (`NewMetricsRegistry`, `NewMetricsHandler`) admits the pattern without naming it. The spec is the contract. + +## Out of scope (delegated) + +- Counters / gauges / histograms — sibling tickets (#57 upgrade-flood rejects, #58 register failures, future throughput / grace-expiry). +- `/metrics` listener, bind-address validation, flag plumbing, TLS for the metrics port — #60. +- Process / Go-runtime collectors (`collectors.NewGoCollector`, `NewProcessCollector`). +- `ErrorLog` wiring from `promhttp` into `slog`. +- OpenMetrics content negotiation (left disabled by default). +- Metric-label exfiltration rules (sibling specs enforce; metric labels carry the same MUST-NOT-emit posture as `slog` values per threat-model § Log hygiene). + +## Lessons learned + +- **The `HandlerOpts.Registry` field is load-bearing, not optional.** Leaving it `nil` lands `promhttp`'s self-instrumentation counters on `DefaultRegisterer` — silently breaking the no-default-registry boundary even with otherwise correct user code. The structural test (`TestMetricsRegistry_NoGlobalRegistrarLeak`) catches the omission; the docstring on `NewMetricsHandler` explains why. +- **Substring-match the Prometheus content-type literal, don't equality-match.** `prometheus/common` appended `; escaping=` to the text-format media type in a recent release (additive, backward-compatible per the format spec). An equality assertion against `text/plain; version=0.0.4; charset=utf-8` would flake on the next `go.mod` bump. The spec's intent — "AC literal is the canary" — is preserved by checking `strings.HasPrefix(got, textFormatPrefix)`: a release that renamed the version or dropped the charset would still fail loudly. +- **Snapshot-delta is the right shape for global-state tests.** `TestMetricsRegistry_NoGlobalRegistrarLeak` asserts `before == after` on `DefaultGatherer.Gather()` length, not `after == 0`. Stable across `client_golang` bumps that pre-register their own internals on the default gatherer. Generalises: when defending an "X should not change global state Y" invariant, snapshot Y, run X, compare — don't pin the absolute value. + +## Cross-links + +- [Metrics registry feature](../features/metrics-registry.md) — evergreen seam + usage doc. +- [ADR-0008: Prometheus client adoption](../decisions/0008-prometheus-client-adoption.md) — dep justification, alternatives, scope-of-use rules. +- [`/healthz` handler](../features/healthz.md) — sibling factory-shaped handler; `NewMetricsHandler` mirrors its contract. +- [Threat model § Log hygiene](../../threat-model.md) — MUST-NOT-emit list that propagates to metric label values via sibling specs. +- [Architect spec](../../specs/architecture/59-metrics-scaffolding.md) — full design + security review. diff --git a/docs/knowledge/features/metrics-registry.md b/docs/knowledge/features/metrics-registry.md new file mode 100644 index 0000000..671d0ca --- /dev/null +++ b/docs/knowledge/features/metrics-registry.md @@ -0,0 +1,90 @@ +# Metrics registry — scaffolding + +The relay holds a private `*prometheus.Registry` and serves it over a Prometheus-format `/metrics` handler. The registry is constructed per process; sibling tickets register counters / gauges / histograms against it. `prometheus.DefaultRegisterer` is never touched. + +This page documents the scaffolding only — what the seam is, why it has the shape it does, and where the rules are recorded. The first counter / gauge lands in sibling tickets; the `/metrics` listener wiring lives in #60. + +## API + +Package `internal/relay` (`metrics.go`): + +```go +func NewMetricsRegistry() *prometheus.Registry +func NewMetricsHandler(reg *prometheus.Registry) http.Handler +``` + +- `NewMetricsRegistry` returns a fresh `*prometheus.Registry`. Per-call construction (not a singleton) — tests instantiate as many as they need. Matches the registry-as-passive-store posture of `Registry` (#3, ADR-0003). +- `NewMetricsHandler` wraps `promhttp.HandlerFor(reg, …)` and returns `http.Handler`. Follows the factory pattern established by `NewHealthzHandler` (#10): no per-request state, no exported fields, safe for concurrent use. + +The handler ships text exposition format only — `Content-Type: text/plain; version=0.0.4; charset=utf-8` (newer `prometheus/common` releases append `; escaping=`, which is additive). OpenMetrics negotiation is disabled (`HandlerOpts.EnableOpenMetrics: false`, the library default). The on-the-wire response shape is independent of the caller's `Accept` header, matching the no-content-negotiation posture of `/healthz`. + +`HandlerOpts.Registry` is set to `reg` so `promhttp`'s own self-instrumentation counters (`promhttp_metric_handler_*`) register against the relay's private registry rather than `DefaultRegisterer` — load-bearing for ADR-0008 § Scope of use even for the library's own bookkeeping. + +## Seam shape — per-concern collector struct + +Sibling tickets that add metrics define their own file under `internal/relay/` (e.g. `metrics_upgrade.go`) with a private collector type: + +```go +type upgradeFloodMetrics struct { + rejectsTotal *prometheus.CounterVec +} + +func newUpgradeFloodMetrics(reg prometheus.Registerer) *upgradeFloodMetrics { + m := &upgradeFloodMetrics{ + rejectsTotal: prometheus.NewCounterVec( + prometheus.CounterOpts{Name: "relay_upgrade_rejects_total", Help: "…"}, + []string{"reason"}, + ), + } + reg.MustRegister(m.rejectsTotal) + return m +} +``` + +The wiring site (`main.go`, #60) holds N such collectors as locals and threads each into the handler that increments it. Adding a new metric type is a greenfield file; no edits to `metrics.go` or to siblings' files. + +Rejected alternatives: + +- **Package-level `var`s.** Package-init can only register against a package-level singleton, which forces `NewMetricsRegistry` into a singleton. Singletons break per-test isolation that `NewRegistry()` (ADR-0003) already establishes. +- **One mega-struct.** A growing `type Metrics struct { … }` forces every sibling PR to touch the same file → guaranteed merge conflicts between concurrently-architected siblings. + +## Scope of use (ADR-0008) + +In bounds inside `internal/relay/` and #60's wiring: + +- Construction of the private registry, counters, gauges, histograms. +- Handler factory via `promhttp.HandlerFor`. +- `prometheus/testutil` for metric assertions in tests. + +Structurally out of bounds: + +- `prometheus.DefaultRegisterer` is never touched. `promauto`'s default-registry shortcuts are off-limits. Enforced by `TestMetricsRegistry_NoGlobalRegistrarLeak` (snapshot-delta on `DefaultGatherer`). +- Process / Go-runtime collectors (`collectors.NewGoCollector`, `NewProcessCollector`) are not registered. A future ticket may opt in by registering them on the private registry. +- `slog` integration. `promhttp` collection errors return HTTP 500 with the library's error string; no log adapter is wired. +- Metric labels are an exfiltration channel. Sibling tickets MUST NOT emit token values, IPs (without the rate-limit policy bit), or any value covered by threat-model § Log hygiene's MUST-NOT-emit list. + +## Concurrency + +`promhttp.HandlerFor` is documented safe for concurrent calls; `prometheus.Registry`'s `Register` / `MustRegister` / `Gather` use internal locking. This package spawns no goroutines and owns no shared state. + +## Testing + +`internal/relay/metrics_test.go`, three tests, stdlib + `prometheus/common/expfmt` + `client_model/go` (both transitives of `client_golang`): + +- `TestMetricsHandler_EmptyRegistry_ResponseShape` — AC gate. Status 200, content-type prefix matches the text-format literal, body decodes via `expfmt.NewDecoder` cleanly to `io.EOF`. +- `TestMetricsHandler_RegisterAndScrape_RoundTrip` — register a trivial counter, increment to 2, assert `relay_test_counter_total 2` appears in the body. Cheap seam-verification against future `client_golang` bumps. +- `TestMetricsRegistry_NoGlobalRegistrarLeak` — snapshot `DefaultGatherer.Gather()` size before and after constructing the registry+handler; the delta MUST be zero. Structural defence for ADR-0008 § Scope of use against a contributor reaching for `promauto`'s default-registry shortcuts. + +## What this deliberately does NOT do + +- No counters, no gauges, no histograms — siblings (#57, #58, future) own those. +- No listener — `/metrics` is not bound to a `mux` in this ticket. #60 owns the listener (flag, bind-address validation, `http.Server` timeouts). +- No process / Go-runtime collectors. +- No `ErrorLog` wiring from `promhttp` into `slog`. +- No content negotiation (OpenMetrics opt-in is a separate decision). + +## Related + +- [ADR-0008: Adopt `github.com/prometheus/client_golang`](../decisions/0008-prometheus-client-adoption.md) — alternatives, supply-chain notes, scope-of-use rules. +- [`/healthz` JSON endpoint](healthz.md) — sibling factory-shaped handler; pattern reference for `NewMetricsHandler`. +- [Threat model](../../threat-model.md) — log-hygiene posture that extends to metric label values.