Add 7 Grafana dashboards to close T17 observability gap#70

Merged: unnamedlab merged 1 commit into main from copilot/audit-openfoundry-repository on Apr 30, 2026

Copilot AI commented Apr 30, 2026

The audit flagged that infra/observability/grafana-dashboards/ shipped only 3 JSON dashboards (dp-slo-overview, dp-slo-datafusion, dp-slo-nats) while infra/observability/prometheus-rules/ defined alerts for 7 components — leaving on-call without panels for Kafka, ClickHouse, CNPG, Vespa, Lakekeeper and Flink.

Per-SLO dashboards (ADR-0012 §2)

Generated from the existing dp-slo-nats.json template, substituting the metric, selector, latency bound (le), SLO target (99.5 %) and the multi-window page thresholds, expressed as error ratios against the 0.5 % error budget (1h window: 14.4 × 0.005 = 0.072; 6h window: 6 × 0.005 = 0.030):

  • dp-slo-flightsql.json — §2.1, flight_sql_query_duration_seconds, le="0.020"
  • dp-slo-kafka.json — §2.3, kafka_producer_request_latency_seconds, le="0.025"
  • dp-slo-clickhouse.json — §2.4, clickhouse_query_duration_seconds, le="0.200"
  • dp-slo-vespa.json — §2.5, vespa_query_latency_seconds, le="0.080"
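
The substitution step can be pictured with a minimal Python sketch. It is not the generator used for this PR: the inline `TEMPLATE` is a hypothetical one-panel stand-in for dp-slo-nats.json, and the NATS metric name `nats_publish_latency_seconds` and bound `le="0.010"` are placeholders chosen for illustration. Only the Kafka metric and `le="0.025"` come from the list above.

```python
import json

# Hypothetical stand-in for dp-slo-nats.json (the real file has more panels).
# The NATS metric name and le="0.010" are assumed placeholders, not repo values.
TEMPLATE = {
    "title": "SLO / NATS",
    "uid": "dp-slo-nats",
    "panels": [
        {
            "title": "30-day compliance",
            "targets": [
                {
                    "expr": 'sum(rate(nats_publish_latency_seconds_bucket{le="0.010"}[30d]))'
                            " / sum(rate(nats_publish_latency_seconds_count[30d]))"
                }
            ],
        }
    ],
}


def render(template, replacements):
    """Return a new structure with every string value rewritten; the input is not mutated."""
    def walk(node):
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(value) for value in node]
        if isinstance(node, str):
            for old, new in replacements.items():
                node = node.replace(old, new)
        return node

    return walk(template)


kafka = render(
    TEMPLATE,
    {
        "nats_publish_latency_seconds": "kafka_producer_request_latency_seconds",
        'le="0.010"': 'le="0.025"',
        "dp-slo-nats": "dp-slo-kafka",
        "NATS": "Kafka",
    },
)
print(json.dumps(kafka, indent=2))
```

Plain string replacement keeps the panel layout identical across components, at the cost of requiring that the replaced tokens are unambiguous within the template.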

Each dashboard ships the same 9-panel layout as the NATS one: p50/p99/p99.9 stats, 30-day compliance, request rate, latency timeseries with the SLO line, 1h-vs-6h burn-rate with both page thresholds drawn, and budget remaining.
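
The two page thresholds drawn on the burn-rate panel follow directly from the SLO target. The sketch below reproduces that arithmetic; it assumes the 30-day compliance window mentioned above is also the budget window, and the constant and function names are illustrative rather than taken from the dashboards.

```python
SLO_TARGET = 0.995               # 99.5 % of requests within the latency bound
ERROR_BUDGET = 1.0 - SLO_TARGET  # 0.5 % of requests may miss it
BUDGET_WINDOW_HOURS = 30 * 24    # assumed 30-day budget window

# Burn-rate multipliers per alert window, as listed in the PR description.
PAGE_WINDOWS = {"1h": (1, 14.4), "6h": (6, 6.0)}


def page_threshold(burn_rate, budget=ERROR_BUDGET):
    """Error ratio at which a window burning at `burn_rate` should page."""
    return burn_rate * budget


def budget_spent(window_hours, burn_rate, total_hours=BUDGET_WINDOW_HOURS):
    """Fraction of the whole budget consumed if the burn lasts one full window."""
    return burn_rate * window_hours / total_hours


thresholds = {
    name: round(page_threshold(rate), 6) for name, (_, rate) in PAGE_WINDOWS.items()
}
spent = {
    name: round(budget_spent(hours, rate), 4)
    for name, (hours, rate) in PAGE_WINDOWS.items()
}
print(thresholds)
print(spent)
```

Under these assumptions the 1h threshold corresponds to spending about 2 % of the 30-day budget in one hour, and the 6h threshold to about 5 % in six hours.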

Operator / fleet overviews

Panelling the exact series the existing PrometheusRules already alert on — no new metrics introduced:

  • lakekeeper-overview.json — RED (http_requests_total, http_request_duration_seconds_bucket) + sqlx pool saturation; mirrors lakekeeper.yaml.
  • cnpg-overview.json — fleet view across the per-bounded-context clusters in infra/k8s/cnpg/clusters/: primaries reachable, max replica lag with the 1 GiB threshold, switchover events, WAL-archiver failures.
  • flink-overview.json — flink_jobmanager_job_uptime, failed checkpoints over 30 m, time since last checkpoint vs the 1 h alert line, latest savepoint age vs the 24 h T15 maintenance line.
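
As a rough illustration of what RED panels over the two Lakekeeper series might query, the sketch below builds PromQL strings in Python. The `job` selector, the `status` label used to pick out 5xx responses, and the 5m rate window are all assumptions; the actual expressions in lakekeeper-overview.json may differ.

```python
def red_queries(job, window="5m"):
    """Build Rate / Errors / Duration PromQL strings for one job.

    Assumes a `job` label on both series and a `status` label on
    http_requests_total; neither is confirmed by the PR description.
    """
    selector = f'{{job="{job}"}}'
    errors_selector = f'{{job="{job}",status=~"5.."}}'
    total = f"sum(rate(http_requests_total{selector}[{window}]))"
    return {
        "rate": total,
        "errors": f"sum(rate(http_requests_total{errors_selector}[{window}])) / {total}",
        "duration_p99": (
            "histogram_quantile(0.99, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{selector}[{window}])))"
        ),
    }


queries = red_queries("lakekeeper")
for name, expr in queries.items():
    print(f"{name}: {expr}")
```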

README

infra/observability/grafana-dashboards/README.md: per-SLO inventory flipped from TBD → shipped; the per-component table now lists the three OpenFoundry-specific overviews alongside the upstream-preferred entries (rationale for keeping the rest on grafana.com/IDs is preserved).

All 10 JSONs declare __inputs[].name = DS_PROMETHEUS so they import unmodified into any Grafana with a Prometheus datasource.
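
A check along these lines could confirm the `__inputs` claim before importing. It is a sketch, not part of the PR: the helper names are hypothetical, and it only assumes that each file is a Grafana export whose top-level `__inputs` is a list of objects with a `name` key.

```python
import json
from pathlib import Path


def declares_prometheus_input(dashboard):
    """True if the dashboard export lists an input named DS_PROMETHEUS."""
    inputs = dashboard.get("__inputs", [])
    return any(entry.get("name") == "DS_PROMETHEUS" for entry in inputs)


def files_missing_input(directory):
    """Names of *.json files in `directory` that lack the DS_PROMETHEUS input."""
    missing = []
    for path in sorted(Path(directory).glob("*.json")):
        dashboard = json.loads(path.read_text(encoding="utf-8"))
        if not declares_prometheus_input(dashboard):
            missing.append(path.name)
    return missing


# In-memory examples; on a checkout one would call
# files_missing_input("infra/observability/grafana-dashboards") instead.
good = {"__inputs": [{"name": "DS_PROMETHEUS", "type": "datasource"}], "panels": []}
bad = {"panels": []}
print(declares_prometheus_input(good), declares_prometheus_input(bad))
```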

…g T17 audit gap

Agent-Logs-Url: https://github.com/unnamedlab/OpenFoundry/sessions/da69e7eb-5294-4f0f-8243-6eb251188a0d

Co-authored-by: unnamedlab <272794385+unnamedlab@users.noreply.github.com>
unnamedlab marked this pull request as ready for review April 30, 2026 09:35
unnamedlab merged commit b88b022 into main Apr 30, 2026
unnamedlab pushed a commit that referenced this pull request May 18, 2026
…-3-aggregate

feat(pipeline-build-service): promote aggregate to available (ADR-0045 Phase C.3)