Merged
66 changes: 40 additions & 26 deletions infra/observability/grafana-dashboards/README.md
@@ -4,23 +4,32 @@ This directory holds the dashboard inventory for the components covered
by [ADR-0012 — Data-plane SLOs, SLIs and error budgets][adr-0012].

For T17 we deliberately **prefer official upstream dashboards** over
hand-written JSON: every component listed below already publishes a
maintained dashboard on [grafana.com/grafana/dashboards][gcom] that is
updated against the same metric names our PrometheusRules
(`../prometheus-rules/`) target. Forking those dashboards into this
repo would force us to track every upstream rename of a panel or
metric — work that delivers nothing to OpenFoundry users.

The handful of dashboards that are **specific to OpenFoundry** are
shipped as JSON in this directory. The first wave (this directory)
covers the **on-call critical path** — the ADR-0012 §3 freeze decision
hinges on the *Data Plane SLO Overview* dashboard, and the two SLIs
that have the largest uninstrumented surface today
(DataFusion/Iceberg scans, NATS control events) get a dedicated
per-SLI dashboard so a regression is debuggable without writing
ad-hoc PromQL. The remaining per-SLO dashboards listed in the table
below stay marked **TBD** until their backing histograms are emitted
in production.
hand-written JSON for the **per-component health** view: every
component listed below already publishes a maintained dashboard on
[grafana.com/grafana/dashboards][gcom] that is updated against the
same metric names our PrometheusRules (`../prometheus-rules/`)
target. Forking those dashboards into this repo would force us to
track every upstream rename of a panel or metric — work that
delivers nothing to OpenFoundry users.

What we **do ship as JSON** in this directory is the OpenFoundry-specific
material that upstream cannot provide:

1. The **per-SLO dashboards** that map 1:1 to the SLIs in
ADR-0012 §2 — they bind a fixed PromQL expression, the SLO bound
from §1, and the multi-window burn-rate page thresholds from §3
into a single view that on-call uses to decide a freeze.
2. **Operator/fleet overviews** for the components whose alerts
(`../prometheus-rules/`) target metric names not yet covered by
any upstream dashboard at the time of writing — currently
Lakekeeper (no upstream), CNPG (fleet view across one cluster
per bounded context, beyond the per-instance upstream
dashboard 20417) and Flink (uptime / checkpoints / savepoint
age tied to the T15 maintenance schedule).

Per-SLO dashboards remain marked **TBD** only when the backing
histogram is not yet emitted by the producing service in production;
the rest are now shipped as JSON.
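
To make the "multi-window burn-rate page thresholds" mentioned above concrete, the following is a minimal sketch of the decision a per-SLO dashboard visualises. It is not taken from ADR-0012: the names `BurnWindow`, `burn_rate` and `should_page` are hypothetical, and the window lengths and threshold used in the usage note are placeholders for whatever §3 actually specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    """One long/short window pair (hypothetical shape, not from ADR-0012)."""
    long_error_ratio: float   # fraction of bad events over the long window
    short_error_ratio: float  # fraction of bad events over the short window
    threshold: float          # burn-rate multiple that should trigger a page


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors accrue."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(window: BurnWindow, slo_target: float) -> bool:
    """Page only when BOTH windows exceed the threshold.

    Requiring the short window as well means a burst that has already
    stopped does not page on-call.
    """
    return (
        burn_rate(window.long_error_ratio, slo_target) >= window.threshold
        and burn_rate(window.short_error_ratio, slo_target) >= window.threshold
    )
```

For example, with an assumed 99.9% target and an assumed threshold of 14.4, `should_page(BurnWindow(0.02, 0.02, 14.4), 0.999)` is true, while the same long-window ratio paired with a recovered short window (`0.001`) is not.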

## Dashboard inventory

@@ -37,23 +46,28 @@ in production.
| Apache Flink | *Flink Dashboard* | **[14911][gc-14911]** | Prometheus | Job uptime, checkpoints, savepoints — pairs with `flink.yaml`. |
| NATS / JetStream | *NATS Server Dashboard* | **[2279][gc-2279]** | Prometheus | Built for the official prometheus-nats-exporter. |
| NATS / JetStream | *JetStream Dashboard* | **[14862][gc-14862]** | Prometheus | Per-stream / per-consumer view; pairs with `nats.yaml`. |
| Lakekeeper | *Lakekeeper service overview* — **TBD** | n/a | Prometheus | No upstream dashboard; will be added here once the SLI route labels stabilise. |
| Lakekeeper | *Lakekeeper service overview* | n/a | Prometheus | OpenFoundry-specific — see [`lakekeeper-overview.json`](./lakekeeper-overview.json). RED metrics + sqlx pool, pairs with `lakekeeper.yaml`. |
| CloudNativePG | *CNPG cluster fleet* | n/a | Prometheus | OpenFoundry-specific fleet view across the per-bounded-context clusters in `infra/k8s/cnpg/clusters/` — see [`cnpg-overview.json`](./cnpg-overview.json). Pairs with `cnpg.yaml`. |
| Apache Flink | *Flink jobs overview* | n/a | Prometheus | OpenFoundry-specific — see [`flink-overview.json`](./flink-overview.json). Uptime, checkpoints, savepoint age (T15 maintenance), pairs with `flink.yaml`. |

### Per-SLO (OpenFoundry-specific)

These map 1:1 to the dashboards listed in ADR-0012 §4. Three are
shipped as JSON in this directory (closing T17); the rest stay **TBD**
until the corresponding histograms land in production. The proposed
UIDs are reserved.
These map 1:1 to the dashboards listed in ADR-0012 §4. All seven are
now shipped as JSON in this directory (closing T17). Each one only
yields useful values when the producing service is emitting the
backing histogram with the labels listed in ADR-0012 §2; until then
the panels render `N/A`, but importing the dashboard is still the
right move so on-call sees the wiring the moment metrics start
flowing.
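
Importing a dashboard before its metrics exist can be scripted. The sketch below builds the request body for Grafana's dashboard HTTP API (`POST /api/dashboards/db`). The helper names `build_import_payload` and `to_request_body` are hypothetical, and how the request is authenticated and sent is left out because it depends on the deployment.

```python
import json
from typing import Any, Optional


def build_import_payload(
    dashboard: dict[str, Any], folder_uid: Optional[str] = None
) -> dict[str, Any]:
    """Wrap a dashboard JSON document in the body Grafana expects on import."""
    body = dict(dashboard)   # shallow copy so the caller's dict is untouched
    body["id"] = None        # a null id lets Grafana match on uid instead
    payload: dict[str, Any] = {"dashboard": body, "overwrite": True}
    if folder_uid is not None:
        payload["folderUid"] = folder_uid
    return payload


def to_request_body(payload: dict[str, Any]) -> bytes:
    """Serialise the payload as UTF-8 JSON, ready to use as an HTTP body."""
    return json.dumps(payload).encode("utf-8")
```

Because `overwrite` is set, re-running the import after a dashboard file changes replaces the stored version rather than failing on the existing UID.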

| Dashboard | UID | Backing SLI from ADR-0012 | File |
|---|---|---|---|
| Data Plane SLO Overview | `dp-slo-overview` | aggregates the six SLIs | [`dp-slo-overview.json`](./dp-slo-overview.json) |
| Flight SQL — point query SLO | `dp-slo-flightsql` | §2.1 | **TBD** |
| Flight SQL — point query SLO | `dp-slo-flightsql` | §2.1 | [`dp-slo-flightsql.json`](./dp-slo-flightsql.json) |
| DataFusion / Iceberg scan SLO | `dp-slo-datafusion`| §2.2 | [`dp-slo-datafusion.json`](./dp-slo-datafusion.json) |
| Kafka producer ack SLO | `dp-slo-kafka` | §2.3 | **TBD** |
| ClickHouse dashboard query SLO | `dp-slo-clickhouse`| §2.4 | **TBD** |
| Vespa hybrid query SLO | `dp-slo-vespa` | §2.5 | **TBD** |
| Kafka producer ack SLO | `dp-slo-kafka` | §2.3 | [`dp-slo-kafka.json`](./dp-slo-kafka.json) |
| ClickHouse dashboard query SLO | `dp-slo-clickhouse`| §2.4 | [`dp-slo-clickhouse.json`](./dp-slo-clickhouse.json) |
| Vespa hybrid query SLO | `dp-slo-vespa` | §2.5 | [`dp-slo-vespa.json`](./dp-slo-vespa.json) |
| NATS control event SLO | `dp-slo-nats` | §2.6 | [`dp-slo-nats.json`](./dp-slo-nats.json) |
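
Since the UIDs in this table are reserved, a small check can catch a JSON file whose `uid` drifts from its row. This is a sketch only: `find_uid_mismatches` is a hypothetical helper, the mapping is copied from the table above, and it assumes each file is a plain Grafana dashboard export with a top-level `uid` field.

```python
import json
from pathlib import Path

# File name -> reserved UID, copied from the table above.
EXPECTED_UIDS = {
    "dp-slo-overview.json": "dp-slo-overview",
    "dp-slo-flightsql.json": "dp-slo-flightsql",
    "dp-slo-datafusion.json": "dp-slo-datafusion",
    "dp-slo-kafka.json": "dp-slo-kafka",
    "dp-slo-clickhouse.json": "dp-slo-clickhouse",
    "dp-slo-vespa.json": "dp-slo-vespa",
    "dp-slo-nats.json": "dp-slo-nats",
}


def find_uid_mismatches(directory: Path, expected: dict[str, str]) -> list[str]:
    """Return one human-readable problem per missing file or wrong uid."""
    problems: list[str] = []
    for filename, uid in sorted(expected.items()):
        path = directory / filename
        if not path.is_file():
            problems.append(f"{filename}: missing")
            continue
        actual = json.loads(path.read_text(encoding="utf-8")).get("uid")
        if actual != uid:
            problems.append(f"{filename}: uid is {actual!r}, expected {uid!r}")
    return problems
```

Calling `find_uid_mismatches(Path("."), EXPECTED_UIDS)` from this directory returns an empty list when every file is present and carries its reserved UID.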

Every shipped dashboard: