A hands-on lab that demonstrates why your monitoring pipeline is a single point of failure — and how meta-monitoring fixes it.
This lab is the companion to a three-part article series on building intentional, reliable, and cost-effective observability for cloud-native systems at scale.
Each article tackles a different dimension of observability maturity — and each one builds on the lessons of the previous.
Observability costs don't explode because of traffic — they explode because of unchecked cardinality. A single unbounded label like request_id or collection_id can silently generate hundreds of thousands of active time series. This article covers how a single high-cardinality label caused an 81% cost increase in Grafana Cloud, and how relabeling rules and metric auditing bring costs back under control.
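A minimal sketch of the kind of fix discussed there, assuming the offending label is request_id and a hypothetical payment-service scrape job (the lab's real config lives in prometheus/prometheus.yml):

```yaml
scrape_configs:
  - job_name: payment-service            # illustrative job name
    static_configs:
      - targets: ["payment-service:8000"]
    metric_relabel_configs:
      # Drop the unbounded request_id label at scrape time, before it can
      # fan out into hundreds of thousands of active series.
      - action: labeldrop
        regex: request_id
```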
Most alerts assume data exists — they trigger when a metric crosses a threshold. But when a metric disappears entirely, the alert never fires. There's no alarm, no page, just silence. This article introduces Prometheus's absent() function as a way to detect when critical metrics stop reporting — turning silence into an actionable signal.
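A minimal example of such a rule, assuming a hypothetical payment_requests_total metric (the lab's actual rules live in prometheus/alerts/absent.yml):

```yaml
groups:
  - name: missing-data
    rules:
      - alert: PaymentMetricsMissing
        # absent() returns 1 only when no payment_requests_total series exists,
        # so a silently vanished metric becomes a firing alert.
        expr: absent(payment_requests_total)
        for: 5m
        labels:
          severity: critical
```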
You've got cardinality under control. You're alerting on missing data. But what happens when Prometheus itself goes down? Your absent() rules can't evaluate. Your Watchdog heartbeat stops. Your dashboards freeze. And nobody gets notified — because the notification path runs through the very system that failed. This lab lets you experience that blind spot firsthand and demonstrates the meta-monitoring patterns that solve it.
Most teams build alerting around thresholds and missing data — but never test what happens when the monitoring stack itself fails.
This lab lets you:
- Run a full Prometheus + AlertManager + Grafana stack locally
- Kill Prometheus and watch dashboards freeze, alerts go silent, and absent() rules become useless
- Kill AlertManager and see Prometheus detect the failure but fail to notify anyone
- Understand why meta-monitoring (Watchdog alerts, external heartbeats, blackbox probes) is the missing layer
┌──────────────────────────────────────────────────────────┐
│ Docker Network │
│ │
│ ┌──────────────┐ scrapes ┌───────────────────┐ │
│ │ Prometheus │◄────────────────│ Payment Service │ │
│ │ :9090 │ │ :8000 │ │
│ └──────┬───────┘ └───────────────────┘ │
│ │ │
│ alerts│ ┌───────────────────┐ │
│ ▼ │ Blackbox Exporter │ │
│ ┌──────────────┐ │ :9115 │ │
│ │ AlertManager │ │ (probes health │ │
│ │ :9093 │ │ endpoints) │ │
│ └──────────────┘ └───────────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Grafana │ ← Meta-Monitoring Dashboard │
│ │ :3000 │ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────┘
- Docker and Docker Compose
- Ports 3000, 8000, 9090, 9093, 9115 available
# Clone the repo
git clone https://github.com/rtmuller/observability-reliability-lab.git
cd observability-reliability-lab
# Start everything
docker compose up -d --build
# Verify all services are running
docker compose ps

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | — |
| AlertManager | http://localhost:9093 | — |
| Blackbox | http://localhost:9115 | — |
| Payment App | http://localhost:8000/metrics | — |
Open Grafana and navigate to Dashboards → Meta-Monitoring — Observability Reliability to see the pre-built dashboard.
Simulates Prometheus being OOMKilled or evicted from a node.
./chaos/kill-prometheus.sh

What happens:
Prometheus Health Endpoint:
CONNECTION REFUSED (Prometheus is down)
Payment Service:
OK (app is fine, but nobody is watching it)
Grafana:
Cannot reach Prometheus — dashboards are frozen
AlertManager Active Alerts:
Watchdog status=active (stale — will expire)
Blackbox probe for Prometheus:
probe_success = 0 (FAIL — Prometheus unreachable)
Blackbox Exporter detects Prometheus is down. But Prometheus is the one that reads blackbox results. Nobody is consuming the data. The detection is useless.
Simulates AlertManager crashing or becoming unreachable.
./chaos/kill-alertmanager.sh

What happens:
Prometheus Firing Alerts:
[critical] TargetDown target=alertmanager:9093
[critical] MonitoringComponentDown target=http://alertmanager:9093/-/healthy
Prometheus Notification Delivery:
notifications_dropped_total = 10
notifications_errors_total = 9
Blackbox probe for AlertManager:
probe_success = 0 (FAIL — AlertManager unreachable)
Prometheus knows AlertManager is down. It fires critical alerts. But it delivers alerts through AlertManager — the very thing that's broken. 10 dropped notifications. Nobody gets paged.
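For reference, a blackbox-driven meta-alert like the MonitoringComponentDown above can be expressed roughly as follows (a sketch; the lab's actual rule lives in prometheus/alerts/watchdog.yml and its job label may differ). Note that even this alert is delivered through AlertManager, which is why the Watchdog and external heartbeat described below are the real backstop.

```yaml
- alert: MonitoringComponentDown
  # probe_success is exported by the Blackbox Exporter; the job name below
  # is an assumed label for the meta-monitoring scrape job.
  expr: probe_success{job="blackbox-monitoring"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} is failing its health probe"
```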
./chaos/restore-all.sh

An alert that always fires. If it stops, your pipeline is broken.
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none

Route it to an external heartbeat service (Healthchecks.io, PagerDuty, Deadman's Snitch). If the heartbeat stops arriving, the external service alerts you through an independent path.
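A sketch of what that routing can look like in alertmanager.yml, using Healthchecks.io as the example heartbeat; the ping URL is a placeholder, and the lab's own config names its receiver DeadMansSwitch:

```yaml
route:
  routes:
    - matchers:
        - alertname="Watchdog"
      receiver: deadmans-switch
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 1m        # keep the heartbeat arriving continuously

receivers:
  - name: deadmans-switch
    webhook_configs:
      # Placeholder URL: if pings stop arriving, the external service pages
      # you through a path that does not depend on this stack.
      - url: https://hc-ping.com/<your-check-uuid>
```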
Probes the health endpoints of monitoring components themselves:
- http://prometheus:9090/-/healthy
- http://alertmanager:9093/-/healthy
- http://grafana:3000/api/health
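On the scrape side this follows the standard blackbox-exporter pattern; the sketch below assumes an http_2xx module and a container reachable as blackbox:9115, see prometheus/prometheus.yml and blackbox/blackbox.yml for the lab's actual settings:

```yaml
scrape_configs:
  - job_name: blackbox-monitoring        # assumed job name
    metrics_path: /probe
    params:
      module: [http_2xx]                 # module defined in blackbox.yml
    static_configs:
      - targets:
          - http://prometheus:9090/-/healthy
          - http://alertmanager:9093/-/healthy
          - http://grafana:3000/api/health
    relabel_configs:
      # Standard indirection: the listed target becomes the ?target= parameter
      # and the exporter itself becomes the address Prometheus scrapes.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115
```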
From Article 2 — detects when metrics disappear. But as this lab demonstrates, absent() only works if Prometheus is alive to evaluate it.
.
├── docker-compose.yml # Full stack definition
├── app/
│ ├── Dockerfile # Sample payment service
│ ├── main.py # Python app with Prometheus metrics
│ └── requirements.txt
├── prometheus/
│ ├── prometheus.yml # Scrape config + meta-monitoring
│ └── alerts/
│ ├── watchdog.yml # Watchdog, TargetDown, meta-alerts
│ └── absent.yml # absent() rules
├── alertmanager/
│ └── alertmanager.yml # Routing with DeadMansSwitch receiver
├── blackbox/
│ └── blackbox.yml # HTTP health probe config
├── grafana/
│ ├── datasources.yml # Auto-provisioned Prometheus source
│ └── dashboards/
│ ├── dashboard.yml # Provisioning config
│ └── meta-monitoring.json # Pre-built dashboard
└── chaos/
├── kill-prometheus.sh # Stop Prometheus
├── kill-alertmanager.sh # Stop AlertManager
└── restore-all.sh # Restore all services
docker compose down -v

MIT
Author: Rafael Muller — Staff Cloud Engineer at Airbnb, working on platform infrastructure at 8M+ listings scale. Writing on Medium.