Observability Reliability Lab

A hands-on lab that demonstrates why your monitoring pipeline is a single point of failure — and how meta-monitoring fixes it.

The Observability Trilogy

This lab is the companion to a three-part article series on building intentional, reliable, and cost-effective observability for cloud-native systems at scale.

Each article tackles a different dimension of observability maturity — and each one builds on the lessons of the previous.

Article 1 — Beyond Monitoring: The Hidden Cost of Observability at Scale

Observability costs don't explode because of traffic — they explode because of unchecked cardinality. A single unbounded label like request_id or collection_id can silently generate hundreds of thousands of active time series. This article covers how a single high-cardinality label caused an 81% cost increase in Grafana Cloud, and how relabeling rules and metric auditing bring costs back under control.
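
To make the cardinality arithmetic concrete, here is a minimal Python sketch. The label names and value counts are hypothetical, not figures from the article. The only point is that the worst-case number of active series for a metric is the product of the distinct values of each of its labels.

```python
from math import prod


def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on active time series for one metric:
    the product of the number of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sets for a single metric (illustrative numbers only).
bounded = {"method": 5, "status": 6, "region": 4}
# Adding one label with 1000 distinct values multiplies the bound by 1000.
unbounded = {**bounded, "request_id": 1000}

print(max_series(bounded))    # 120
print(max_series(unbounded))  # 120000
```

In this sketch, adding one label multiplies the bound by 1000 while traffic stays the same, which is why dropping or relabeling the unbounded label is usually the first thing to try.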

Article 2 — The Silent Killer: Why "No Data" Is Often Worse Than Bad Data

Most alerts assume data exists — they trigger when a metric crosses a threshold. But when a metric disappears entirely, the alert never fires. There's no alarm, no page, just silence. This article introduces Prometheus's absent() function as a way to detect when critical metrics stop reporting — turning silence into an actionable signal.
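
The difference can be illustrated with a small Python model. This is not how Prometheus evaluates rules internally. It is a hedged sketch in which a metric is simply a list of recent samples, `threshold_alert` stands in for an expression such as `metric > limit`, and `absent_alert` stands in for `absent(metric)`. Both function names are hypothetical.

```python
def threshold_alert(samples: list[float], limit: float) -> bool:
    """Fires only if some existing sample exceeds the limit.
    With no samples there is nothing to compare, so it stays silent."""
    return any(value > limit for value in samples)


def absent_alert(samples: list[float]) -> bool:
    """Fires exactly when the metric has no samples at all."""
    return len(samples) == 0


scenarios: dict[str, list[float]] = {
    "healthy": [0.2, 0.3],
    "degraded": [0.2, 0.9],
    "missing": [],  # the exporter stopped reporting
}

for name, samples in scenarios.items():
    print(name, threshold_alert(samples, limit=0.5), absent_alert(samples))
```

Only the "missing" scenario makes `absent_alert` return True, and it is also the one scenario in which `threshold_alert` can never fire.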

Article 3 — The Observability Reliability Paradox (this lab)

You've got cardinality under control. You're alerting on missing data. But what happens when Prometheus itself goes down? Your absent() rules can't evaluate. Your Watchdog heartbeat stops. Your dashboards freeze. And nobody gets notified — because the notification path runs through the very system that failed. This lab lets you experience that blind spot firsthand and demonstrates the meta-monitoring patterns that solve it.

What This Lab Demonstrates

Most teams build alerting around thresholds and missing data — but never test what happens when the monitoring stack itself fails.

This lab lets you:

Run a full Prometheus + AlertManager + Grafana stack locally
Kill Prometheus and watch dashboards freeze, alerts go silent, and absent() rules become useless
Kill AlertManager and see Prometheus detect the failure but fail to notify anyone
Understand why meta-monitoring (Watchdog alerts, external heartbeats, blackbox probes) is the missing layer

Architecture

┌──────────────────────────────────────────────────────────┐
│                     Docker Network                        │
│                                                           │
│  ┌──────────────┐     scrapes      ┌───────────────────┐ │
│  │  Prometheus   │◄────────────────│  Payment Service   │ │
│  │  :9090        │                 │  :8000             │ │
│  └──────┬───────┘                 └───────────────────┘ │
│         │                                                 │
│    alerts│         ┌───────────────────┐                  │
│         ▼         │ Blackbox Exporter  │                  │
│  ┌──────────────┐ │  :9115            │                  │
│  │ AlertManager  │ │  (probes health   │                  │
│  │  :9093        │ │   endpoints)      │                  │
│  └──────────────┘ └───────────────────┘                  │
│                                                           │
│  ┌──────────────┐                                         │
│  │   Grafana     │  ← Meta-Monitoring Dashboard           │
│  │  :3000        │                                        │
│  └──────────────┘                                         │
└──────────────────────────────────────────────────────────┘

Prerequisites

Docker and Docker Compose
Ports 3000, 8000, 9090, 9093, 9115 available

Quick Start

# Clone the repo
git clone https://github.com/rtmuller/observability-reliability-lab.git
cd observability-reliability-lab

# Start everything
docker compose up -d --build

# Verify all services are running
docker compose ps

Access the UIs

Service        URL                              Credentials
Grafana        http://localhost:3000            admin / admin
Prometheus     http://localhost:9090            none
AlertManager   http://localhost:9093            none
Blackbox       http://localhost:9115            none
Payment App    http://localhost:8000/metrics    none

Open Grafana and navigate to Dashboards → Meta-Monitoring — Observability Reliability to see the pre-built dashboard.

Chaos Scenarios

Scenario 1: Kill Prometheus

Simulates Prometheus being OOMKilled or evicted from a node.

./chaos/kill-prometheus.sh

What happens:

Prometheus Health Endpoint:
  CONNECTION REFUSED (Prometheus is down)

Payment Service:
  OK (app is fine, but nobody is watching it)

Grafana:
  Cannot reach Prometheus — dashboards are frozen

AlertManager Active Alerts:
  Watchdog       status=active (stale — will expire)

Blackbox probe for Prometheus:
  probe_success = 0 (FAIL — Prometheus unreachable)

Blackbox Exporter detects that Prometheus is down, but Prometheus is the component that scrapes the Blackbox results. With Prometheus gone, nothing consumes the probe data, so the detection never reaches anyone.

Scenario 2: Kill AlertManager

Simulates AlertManager crashing or becoming unreachable.

./chaos/kill-alertmanager.sh

What happens:

Prometheus Firing Alerts:
  [critical] TargetDown              target=alertmanager:9093
  [critical] MonitoringComponentDown target=http://alertmanager:9093/-/healthy

Prometheus Notification Delivery:
  notifications_dropped_total = 10
  notifications_errors_total  = 9

Blackbox probe for AlertManager:
  probe_success = 0 (FAIL — AlertManager unreachable)

Prometheus knows AlertManager is down and fires critical alerts. But it delivers those alerts through AlertManager, the very component that is broken. The result is 10 dropped notifications, and nobody gets paged.
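
The delivery counters can also be read directly from Prometheus's own metrics endpoint. The sketch below is an assumption-laden illustration, not the lab's script. It assumes the shortened names in the output above correspond to `prometheus_notifications_dropped_total` and `prometheus_notifications_errors_total`, the names Prometheus exposes about itself, and it parses a hypothetical sample of the text exposition format without timestamps. The helper name `sum_counters` is made up for this example.

```python
def sum_counters(exposition: str, names: set[str]) -> dict[str, float]:
    """Sum sample values per metric name from Prometheus text-format output.

    Comment lines and metrics not listed in `names` are ignored.
    Assumes samples carry no trailing timestamp.
    """
    totals = {name: 0.0 for name in names}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        # Strip the label set, if any, to get the bare metric name.
        name = series.split("{", 1)[0]
        if name in totals:
            totals[name] += float(value)
    return totals


# Hypothetical sample; in the lab this text would come from
# http://localhost:9090/metrics while AlertManager is stopped.
SAMPLE = """\
# TYPE prometheus_notifications_dropped_total counter
prometheus_notifications_dropped_total 10
prometheus_notifications_errors_total{alertmanager="http://alertmanager:9093/api/v2/alerts"} 9
prometheus_notifications_sent_total{alertmanager="http://alertmanager:9093/api/v2/alerts"} 3
"""

WATCHED = {
    "prometheus_notifications_dropped_total",
    "prometheus_notifications_errors_total",
}

totals = sum_counters(SAMPLE, WATCHED)
delivery_broken = totals["prometheus_notifications_dropped_total"] > 0
print(totals, delivery_broken)
```

A check like this is only useful when it runs outside the stack it inspects. If it runs inside Prometheus as an alert rule, it goes out through the same broken path.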

Restore Everything

./chaos/restore-all.sh

Key Concepts

Watchdog Alert (DeadMansSwitch)

An alert that always fires. If it stops, your pipeline is broken.

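# Always-firing alert: vector(1) returns a sample on every evaluation.
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none  # not an incident; exists only to prove the pipeline is alive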

Route it to an external heartbeat service (Healthchecks.io, PagerDuty, Dead Man's Snitch). If the heartbeat stops arriving, the external service alerts you through an independent path.
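
The logic on the receiving side of such a heartbeat is small. The following is a minimal sketch of the dead man's switch idea, not the implementation of any of the services named above. The class name `HeartbeatMonitor`, the 120-second timeout, and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Tracks the last heartbeat and reports when it is overdue."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = clock()

    def ping(self) -> None:
        """Record a heartbeat, e.g. each time the Watchdog notification arrives."""
        self._last_seen = self._clock()

    def is_overdue(self) -> bool:
        """True when no heartbeat has arrived within the timeout."""
        return self._clock() - self._last_seen > self._timeout


# Simulated clock so the example runs instantly.
now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=120, clock=lambda: now[0])

now[0] = 60.0
monitor.ping()                # Watchdog arrives at t=60
now[0] = 150.0
print(monitor.is_overdue())   # False: 90s since the last heartbeat
now[0] = 300.0
print(monitor.is_overdue())   # True: 240s of silence, time to page someone
```

The inversion is the point of the design: the external service pages on the absence of a signal, so it keeps working when Prometheus or AlertManager can no longer send anything.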

Blackbox Health Probes

Blackbox Exporter probes the health endpoints of the monitoring components themselves:

http://prometheus:9090/-/healthy
http://alertmanager:9093/-/healthy
http://grafana:3000/api/health
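
The same check can be sketched in a few lines of Python, which is handy for running a probe from somewhere outside the monitored stack. This is a rough stand-in for what an HTTP probe reports, not Blackbox Exporter's actual logic. The `probe` function is hypothetical, and the URLs assume the script runs on the Docker host, using the published ports from "Access the UIs" rather than the container hostnames above.

```python
import http.client
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> int:
    """Return 1 if the URL answers with HTTP 2xx within the timeout, else 0.

    Mirrors the 1/0 convention of probe_success.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 1 if 200 <= response.status < 300 else 0
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return 0


# Assumed host-side URLs for the lab's components.
TARGETS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}

if __name__ == "__main__":
    for name, url in TARGETS.items():
        print(f"{name}: probe_success = {probe(url)}")
```

Running this after either chaos script should show a 0 for the stopped component, as long as the published ports match the table in "Access the UIs".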

absent() Rules

Introduced in Article 2, absent() rules detect when metrics disappear. As this lab demonstrates, though, absent() only works if Prometheus is alive to evaluate it.

File Structure

.
├── docker-compose.yml              # Full stack definition
├── app/
│   ├── Dockerfile                  # Sample payment service
│   ├── main.py                     # Python app with Prometheus metrics
│   └── requirements.txt
├── prometheus/
│   ├── prometheus.yml              # Scrape config + meta-monitoring
│   └── alerts/
│       ├── watchdog.yml            # Watchdog, TargetDown, meta-alerts
│       └── absent.yml              # absent() rules
├── alertmanager/
│   └── alertmanager.yml            # Routing with DeadMansSwitch receiver
├── blackbox/
│   └── blackbox.yml                # HTTP health probe config
├── grafana/
│   ├── datasources.yml             # Auto-provisioned Prometheus source
│   └── dashboards/
│       ├── dashboard.yml           # Provisioning config
│       └── meta-monitoring.json    # Pre-built dashboard
└── chaos/
    ├── kill-prometheus.sh          # Stop Prometheus
    ├── kill-alertmanager.sh        # Stop AlertManager
    └── restore-all.sh              # Restore all services

Cleanup

docker compose down -v

License

MIT

Author: Rafael Muller — Staff Cloud Engineer at Airbnb, working on platform infrastructure at 8M+ listings scale. Writing at Medium.
