unnamedlab · Apr 30, 2026
diff --git a/infra/observability/grafana-dashboards/README.md b/infra/observability/grafana-dashboards/README.md
@@ -4,23 +4,32 @@ This directory holds the dashboard inventory for the components covered
 by [ADR-0012 — Data-plane SLOs, SLIs and error budgets][adr-0012].
 
 For T17 we deliberately **prefer official upstream dashboards** over
-hand-written JSON: every component listed below already publishes a
-maintained dashboard on [grafana.com/grafana/dashboards][gcom] that is
-updated against the same metric names our PrometheusRules
-(`../prometheus-rules/`) target. Forking those dashboards into this
-repo would force us to track every upstream rename of a panel or
-metric — work that delivers nothing to OpenFoundry users.
-
-The handful of dashboards that are **specific to OpenFoundry** are
-shipped as JSON in this directory. The first wave (this directory)
-covers the **on-call critical path** — the ADR-0012 §3 freeze decision
-hinges on the *Data Plane SLO Overview* dashboard, and the two SLIs
-that have the largest uninstrumented surface today
-(DataFusion/Iceberg scans, NATS control events) get a dedicated
-per-SLI dashboard so a regression is debuggable without writing
-ad-hoc PromQL. The remaining per-SLO dashboards listed in the table
-below stay marked **TBD** until their backing histograms are emitted
-in production.
+hand-written JSON for the **per-component health** view: every
+component listed below already publishes a maintained dashboard on
+[grafana.com/grafana/dashboards][gcom] that is updated against the
+same metric names our PrometheusRules (`../prometheus-rules/`)
+target. Forking those dashboards into this repo would force us to
+track every upstream rename of a panel or metric — work that
+delivers nothing to OpenFoundry users.
+
+What we **do ship as JSON** in this directory is the OpenFoundry-specific
+material that upstream cannot provide:
+
+1. The **per-SLO dashboards** that map 1:1 to the SLIs in
+   ADR-0012 §2 — they bind a fixed PromQL expression, the SLO bound
+   from §1, and the multi-window burn-rate page thresholds from §3
+   into a single view that on-call uses to decide a freeze.
+2. **Operator/fleet overviews** for the components whose alerts
+   (`../prometheus-rules/`) target metric names not yet covered by
+   any upstream dashboard at the time of writing — currently
+   Lakekeeper (no upstream), CNPG (fleet view across one cluster
+   per bounded context, beyond the per-instance upstream
+   dashboard 20417) and Flink (uptime / checkpoints / savepoint
+   age tied to the T15 maintenance schedule).
+
+Every per-SLO dashboard is now shipped as JSON. Where the backing
+histogram is not yet emitted by the producing service in production,
+the panels render `N/A` until the metric starts flowing.
 
 ## Dashboard inventory
 
@@ -37,23 +46,28 @@ in production.
 | Apache Flink      | *Flink Dashboard*                               | **[14911][gc-14911]** | Prometheus | Job uptime, checkpoints, savepoints — pairs with `flink.yaml`. |
 | NATS / JetStream  | *NATS Server Dashboard*                         | **[2279][gc-2279]**   | Prometheus | Built for the official prometheus-nats-exporter. |
 | NATS / JetStream  | *JetStream Dashboard*                           | **[14862][gc-14862]** | Prometheus | Per-stream / per-consumer view; pairs with `nats.yaml`. |
-| Lakekeeper        | *Lakekeeper service overview* — **TBD**         | n/a                 | Prometheus | No upstream dashboard; will be added here once the SLI route labels stabilise. |
+| Lakekeeper        | *Lakekeeper service overview*                   | n/a                 | Prometheus | OpenFoundry-specific; see [`lakekeeper-overview.json`](./lakekeeper-overview.json). RED metrics and sqlx pool. Pairs with `lakekeeper.yaml`. |
+| CloudNativePG     | *CNPG cluster fleet*                            | n/a                 | Prometheus | OpenFoundry-specific fleet view across the per-bounded-context clusters in `infra/k8s/cnpg/clusters/` — see [`cnpg-overview.json`](./cnpg-overview.json). Pairs with `cnpg.yaml`. |
+| Apache Flink      | *Flink jobs overview*                           | n/a                 | Prometheus | OpenFoundry-specific; see [`flink-overview.json`](./flink-overview.json). Uptime, checkpoints, savepoint age (T15 maintenance). Pairs with `flink.yaml`. |
 
 ### Per-SLO (OpenFoundry-specific)
 
-These map 1:1 to the dashboards listed in ADR-0012 §4. Three are
-shipped as JSON in this directory (closing T17); the rest stay **TBD**
-until the corresponding histograms land in production. The proposed
-UIDs are reserved.
+These map 1:1 to the dashboards listed in ADR-0012 §4. All seven are
+now shipped as JSON in this directory (closing T17). Each one only
+yields useful values when the producing service is emitting the
+backing histogram with the labels listed in ADR-0012 §2; until then
+the panels render `N/A`, but importing the dashboard is still the
+right move so on-call sees the wiring the moment metrics start
+flowing.
 
 | Dashboard | UID | Backing SLI from ADR-0012 | File |
 |---|---|---|---|
 | Data Plane SLO Overview              | `dp-slo-overview`  | aggregates the six SLIs | [`dp-slo-overview.json`](./dp-slo-overview.json) |
-| Flight SQL — point query SLO         | `dp-slo-flightsql` | §2.1 | **TBD** |
+| Flight SQL — point query SLO         | `dp-slo-flightsql` | §2.1 | [`dp-slo-flightsql.json`](./dp-slo-flightsql.json) |
 | DataFusion / Iceberg scan SLO        | `dp-slo-datafusion`| §2.2 | [`dp-slo-datafusion.json`](./dp-slo-datafusion.json) |
-| Kafka producer ack SLO               | `dp-slo-kafka`     | §2.3 | **TBD** |
-| ClickHouse dashboard query SLO       | `dp-slo-clickhouse`| §2.4 | **TBD** |
-| Vespa hybrid query SLO               | `dp-slo-vespa`     | §2.5 | **TBD** |
+| Kafka producer ack SLO               | `dp-slo-kafka`     | §2.3 | [`dp-slo-kafka.json`](./dp-slo-kafka.json) |
+| ClickHouse dashboard query SLO       | `dp-slo-clickhouse`| §2.4 | [`dp-slo-clickhouse.json`](./dp-slo-clickhouse.json) |
+| Vespa hybrid query SLO               | `dp-slo-vespa`     | §2.5 | [`dp-slo-vespa.json`](./dp-slo-vespa.json) |
 | NATS control event SLO               | `dp-slo-nats`      | §2.6 | [`dp-slo-nats.json`](./dp-slo-nats.json) |
 
 Every shipped dashboard: