Add 7 Grafana dashboards to close T17 observability gap by Copilot · Pull Request #70 · unnamedlab/OpenFoundry

Copilot · 2026-04-30T09:35:48Z

The audit flagged that infra/observability/grafana-dashboards/ shipped only 3 JSON dashboards (dp-slo-overview, dp-slo-datafusion, dp-slo-nats) while infra/observability/prometheus-rules/ defined alerts for 7 components — leaving on-call without panels for Kafka, ClickHouse, CNPG, Vespa, Lakekeeper and Flink.

Per-SLO dashboards (ADR-0012 §2)

Generated from the existing dp-slo-nats.json template, substituting the metric, selector, latency bound (le), SLO target (99.5 %) and the multi-window page thresholds. With a 0.5 % error budget, a 14.4× burn rate over 1h corresponds to an error ratio of 0.072, and a 6× burn rate over 6h to 0.030:

dp-slo-flightsql.json — §2.1, flight_sql_query_duration_seconds, le="0.020"
dp-slo-kafka.json — §2.3, kafka_producer_request_latency_seconds, le="0.025"
dp-slo-clickhouse.json — §2.4, clickhouse_query_duration_seconds, le="0.200"
dp-slo-vespa.json — §2.5, vespa_query_latency_seconds, le="0.080"
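
The page thresholds quoted above follow from the 99.5 % target by simple arithmetic. A minimal sketch of that calculation, assuming the thresholds are burn-rate multipliers applied to the error budget (the `BurnWindow` class and `page_thresholds` function are hypothetical names for illustration, not part of the repository):

```python
from dataclasses import dataclass

SLO_TARGET = 0.995               # 99.5 % target from the PR description
ERROR_BUDGET = 1.0 - SLO_TARGET  # 0.5 % of requests may miss the latency bound


@dataclass(frozen=True)
class BurnWindow:
    """One alerting window and the burn-rate multiplier that pages on it."""
    window: str
    burn_rate: float

    def error_ratio_threshold(self, budget: float = ERROR_BUDGET) -> float:
        # A burn rate of N means the budget is being spent N times faster
        # than the SLO allows, i.e. the error ratio equals N * budget.
        return self.burn_rate * budget


# Multipliers as stated in the PR description.
PAGE_WINDOWS = [BurnWindow("1h", 14.4), BurnWindow("6h", 6.0)]


def page_thresholds() -> dict:
    """Map each window to the error ratio at which the page line is drawn."""
    return {w.window: round(w.error_ratio_threshold(), 6) for w in PAGE_WINDOWS}


if __name__ == "__main__":
    for window, ratio in page_thresholds().items():
        print(f"{window}: page when error ratio exceeds {ratio}")
```

Because only `SLO_TARGET` feeds the calculation, a component with a different target would get its own threshold lines without touching the multipliers.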

Each dashboard ships the same 9-panel layout as the NATS one: p50/p99/p99.9 stats, 30-day compliance, request rate, latency timeseries with the SLO line, 1h-vs-6h burn-rate with both page thresholds drawn, and budget remaining.
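
The PR text does not show the panel queries themselves. As a rough illustration of how a burn-rate panel could be derived from each metric and `le` bound, the sketch below builds a PromQL expression for the share of requests slower than the bound. It assumes the metrics are standard Prometheus histograms exposing `_bucket` and `_count` series and that no extra label selectors are needed; the real dashboards may differ on both points, and `slow_request_ratio` is a hypothetical helper.

```python
# Metric names and latency bounds as listed in the PR description.
SLOS = {
    "flightsql": ("flight_sql_query_duration_seconds", "0.020"),
    "kafka": ("kafka_producer_request_latency_seconds", "0.025"),
    "clickhouse": ("clickhouse_query_duration_seconds", "0.200"),
    "vespa": ("vespa_query_latency_seconds", "0.080"),
}


def slow_request_ratio(metric: str, le: str, window: str) -> str:
    """Build PromQL for the fraction of requests slower than `le` seconds.

    Assumes `<metric>_bucket` and `<metric>_count` exist (histogram
    convention); selectors used by the real dashboards are omitted.
    """
    good = f'sum(rate({metric}_bucket{{le="{le}"}}[{window}]))'
    total = f"sum(rate({metric}_count[{window}]))"
    return f"1 - ({good} / {total})"


if __name__ == "__main__":
    for name, (metric, le) in SLOS.items():
        for window in ("1h", "6h"):
            print(f"{name} {window}: {slow_request_ratio(metric, le, window)}")
```

Note that `le` is a string label, so the value in the query has to match the exporter's bucket label exactly (for example `"0.02"` versus `"0.020"`).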

Operator / fleet overviews

These dashboards chart the exact series the existing PrometheusRules already alert on; no new metrics are introduced:

lakekeeper-overview.json — RED (http_requests_total, http_request_duration_seconds_bucket) + sqlx pool saturation; mirrors lakekeeper.yaml.
cnpg-overview.json — fleet view across the per-bounded-context clusters in infra/k8s/cnpg/clusters/: primaries reachable, max replica lag with the 1 GiB threshold, switchover events, WAL-archiver failures.
flink-overview.json — flink_jobmanager_job_uptime, failed checkpoints over 30 m, time since last checkpoint vs the 1 h alert line, latest savepoint age vs the 24 h T15 maintenance line.

README

infra/observability/grafana-dashboards/README.md: per-SLO inventory flipped from TBD → shipped; the per-component table now lists the three OpenFoundry-specific overviews alongside the upstream-preferred entries (rationale for keeping the rest on grafana.com/IDs is preserved).

All 10 JSONs declare __inputs[].name = DS_PROMETHEUS so they import unmodified into any Grafana with a Prometheus datasource.
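
A claim like this is easy to check mechanically. The following is a minimal sketch of such a check, assuming each dashboard file is a Grafana export whose top-level `__inputs` key holds a list of objects with a `name` field; the function names are hypothetical and no such script is part of this PR.

```python
import json
from pathlib import Path


def declares_prometheus_input(dashboard: dict, name: str = "DS_PROMETHEUS") -> bool:
    """Return True if the dashboard's __inputs list contains an entry with this name."""
    inputs = dashboard.get("__inputs", [])
    return any(isinstance(item, dict) and item.get("name") == name for item in inputs)


def dashboards_missing_input(directory: str) -> list:
    """List the *.json files in `directory` that do not declare the input."""
    missing = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            dashboard = json.load(handle)
        if not declares_prometheus_input(dashboard):
            missing.append(path.name)
    return missing


if __name__ == "__main__":
    # Directory path taken from the PR description.
    print(dashboards_missing_input("infra/observability/grafana-dashboards"))
```

Run from the repository root, an empty list would indicate that every dashboard in the directory declares the datasource input.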

docs(observability): add 7 Grafana dashboards (SLO + overview) closin…

f7f4f19

…g T17 audit gap Agent-Logs-Url: https://github.com/unnamedlab/OpenFoundry/sessions/da69e7eb-5294-4f0f-8243-6eb251188a0d Co-authored-by: unnamedlab <272794385+unnamedlab@users.noreply.github.com>

Copilot AI assigned Copilot and unnamedlab Apr 30, 2026

Copilot created this pull request from a session on behalf of unnamedlab April 30, 2026 09:35 View session

unnamedlab marked this pull request as ready for review April 30, 2026 09:35

unnamedlab merged commit b88b022 into main Apr 30, 2026

unnamedlab pushed a commit that referenced this pull request May 18, 2026

Merge pull request #70 from DioCrafts/migration/spark-removal-phase-c…

1b8820b

…-3-aggregate feat(pipeline-build-service): promote aggregate to available (ADR-0045 Phase C.3)

unnamedlab merged 1 commit into main from copilot/audit-openfoundry-repository
