Observability Reliability Lab

A hands-on lab that demonstrates why your monitoring pipeline is a single point of failure — and how meta-monitoring fixes it.


The Observability Trilogy

This lab is the companion to a three-part article series on building intentional, reliable, and cost-effective observability for cloud-native systems at scale.

Each article tackles a different dimension of observability maturity — and each one builds on the lessons of the previous.

Article 1 argues that observability costs don't explode because of traffic; they explode because of unchecked cardinality. A single unbounded label like request_id or collection_id can silently generate hundreds of thousands of active time series. The article covers how one such label caused an 81% cost increase in Grafana Cloud, and how relabeling rules and metric auditing bring costs back under control.
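To make the cardinality effect concrete, here is a minimal Python sketch. It is not code from the lab or the article: the sample traffic and the `drop_label` helper are hypothetical, and in Prometheus the equivalent step would be done with relabeling rules. It counts distinct label sets (each one is an active time series) before and after an unbounded label is removed.

```python
from typing import Iterable

LabelSet = dict[str, str]


def count_series(samples: Iterable[LabelSet]) -> int:
    """Count distinct label sets; each distinct set is one active time series."""
    return len({frozenset(labels.items()) for labels in samples})


def drop_label(samples: Iterable[LabelSet], name: str) -> list[LabelSet]:
    """Return copies of the label sets with one label removed (hypothetical helper)."""
    return [{k: v for k, v in labels.items() if k != name} for labels in samples]


# Hypothetical traffic: 3 endpoints, 1,000 requests each,
# every request tagged with a unique request_id.
samples = [
    {"endpoint": endpoint, "request_id": f"{endpoint}-{i}"}
    for endpoint in ("/pay", "/refund", "/status")
    for i in range(1000)
]

with_request_id = count_series(samples)
without_request_id = count_series(drop_label(samples, "request_id"))
print(with_request_id, without_request_id)
```

With the unbounded label present, the series count grows with request volume; once it is dropped, the count is bounded by the number of endpoints.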

Article 2 starts from the observation that most alerts assume data exists: they trigger when a metric crosses a threshold. But when a metric disappears entirely, the alert never fires. There's no alarm, no page, just silence. The article introduces Prometheus's absent() function as a way to detect when critical metrics stop reporting, turning silence into an actionable signal.
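The difference between a threshold alert and a missing-data alert can be sketched in a few lines of Python. This only models the idea behind absent(); it is not how Prometheus evaluates rules, and the metric name `payment_error_rate` is a made-up example.

```python
def threshold_alert(series: dict[str, float], metric: str, limit: float) -> bool:
    """Fires only when the metric exists AND its value exceeds the limit."""
    value = series.get(metric)
    return value is not None and value > limit


def absent_alert(series: dict[str, float], metric: str) -> bool:
    """Models the idea behind absent(): fires when the metric has no samples."""
    return metric not in series


METRIC = "payment_error_rate"  # hypothetical metric name

healthy = {METRIC: 0.2}
failing = {METRIC: 7.5}
silent: dict[str, float] = {}  # the service stopped reporting entirely

for name, series in (("healthy", healthy), ("failing", failing), ("silent", silent)):
    print(name, threshold_alert(series, METRIC, 5.0), absent_alert(series, METRIC))
```

In the silent case the threshold alert stays quiet, and only the absence check turns the missing metric into a signal.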

Article 3 — The Observability Reliability Paradox (this lab)

You've got cardinality under control. You're alerting on missing data. But what happens when Prometheus itself goes down? Your absent() rules can't evaluate. Your Watchdog heartbeat stops. Your dashboards freeze. And nobody gets notified — because the notification path runs through the very system that failed. This lab lets you experience that blind spot firsthand and demonstrates the meta-monitoring patterns that solve it.


What This Lab Demonstrates

Most teams build alerting around thresholds and missing data — but never test what happens when the monitoring stack itself fails.

This lab lets you:

  • Run a full Prometheus + AlertManager + Grafana stack locally
  • Kill Prometheus and watch dashboards freeze, alerts go silent, and absent() rules become useless
  • Kill AlertManager and see Prometheus detect the failure but fail to notify anyone
  • Understand why meta-monitoring (Watchdog alerts, external heartbeats, blackbox probes) is the missing layer
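The reasoning behind these scenarios can be written down as a small truth table. The sketch below is a simplified model, not part of the lab: `StackState` and both functions are hypothetical names, and it assumes the Watchdog heartbeat is routed through AlertManager to an external service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackState:
    prometheus_up: bool
    alertmanager_up: bool


def in_band_page_delivered(state: StackState) -> bool:
    """A page needs Prometheus to evaluate the rule AND AlertManager to route it."""
    return state.prometheus_up and state.alertmanager_up


def external_heartbeat_alerts(state: StackState) -> bool:
    """The Watchdog heartbeat uses the same path, so it stops when either
    component is down; the external service alerts on the missing heartbeat."""
    return not (state.prometheus_up and state.alertmanager_up)


scenarios = {
    "all healthy": StackState(prometheus_up=True, alertmanager_up=True),
    "scenario 1 (Prometheus down)": StackState(prometheus_up=False, alertmanager_up=True),
    "scenario 2 (AlertManager down)": StackState(prometheus_up=True, alertmanager_up=False),
}

for name, state in scenarios.items():
    print(name, in_band_page_delivered(state), external_heartbeat_alerts(state))
```

In both failure scenarios the in-band path delivers nothing, and the only signal that reaches a human is the one raised outside the stack.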

Architecture

┌──────────────────────────────────────────────────────────┐
│                     Docker Network                        │
│                                                           │
│  ┌──────────────┐     scrapes      ┌───────────────────┐ │
│  │  Prometheus   │◄────────────────│  Payment Service   │ │
│  │  :9090        │                 │  :8000             │ │
│  └──────┬───────┘                 └───────────────────┘ │
│         │                                                 │
│    alerts│         ┌───────────────────┐                  │
│         ▼         │ Blackbox Exporter  │                  │
│  ┌──────────────┐ │  :9115            │                  │
│  │ AlertManager  │ │  (probes health   │                  │
│  │  :9093        │ │   endpoints)      │                  │
│  └──────────────┘ └───────────────────┘                  │
│                                                           │
│  ┌──────────────┐                                         │
│  │   Grafana     │  ← Meta-Monitoring Dashboard           │
│  │  :3000        │                                        │
│  └──────────────┘                                         │
└──────────────────────────────────────────────────────────┘

Prerequisites

  • Docker and Docker Compose
  • Ports 3000, 8000, 9090, 9093, 9115 available

Quick Start

# Clone the repo
git clone https://github.com/rtmuller/observability-reliability-lab.git
cd observability-reliability-lab

# Start everything
docker compose up -d --build

# Verify all services are running
docker compose ps

Access the UIs

Service        URL                              Credentials
Grafana        http://localhost:3000            admin / admin
Prometheus     http://localhost:9090
AlertManager   http://localhost:9093
Blackbox       http://localhost:9115
Payment App    http://localhost:8000/metrics

Open Grafana and navigate to Dashboards → Meta-Monitoring — Observability Reliability to see the pre-built dashboard.


Chaos Scenarios

Scenario 1: Kill Prometheus

Simulates Prometheus being OOMKilled or evicted from a node.

./chaos/kill-prometheus.sh

What happens:

Prometheus Health Endpoint:
  CONNECTION REFUSED (Prometheus is down)

Payment Service:
  OK (app is fine, but nobody is watching it)

Grafana:
  Cannot reach Prometheus — dashboards are frozen

AlertManager Active Alerts:
  Watchdog       status=active (stale — will expire)

Blackbox probe for Prometheus:
  probe_success = 0 (FAIL — Prometheus unreachable)

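The Blackbox Exporter probe shows that Prometheus is down. But Blackbox Exporter only runs a probe when something requests it, and in this stack that is Prometheus itself, which also stores and evaluates the results. With Prometheus down, nobody is consuming the data, so the detection is useless.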
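A probe like the one shown above can be requested directly from Blackbox Exporter's /probe endpoint, without Prometheus in the loop. The following Python sketch builds such a request URL and reads probe_success from the text-format response. The module name `http_2xx` is an assumption (it is the name used in Blackbox Exporter's example configuration; the lab's blackbox.yml may use another), and the sample output is abbreviated for illustration.

```python
from urllib.parse import urlencode


def probe_url(blackbox: str, target: str, module: str = "http_2xx") -> str:
    """Build an on-demand probe URL for Blackbox Exporter's /probe endpoint.

    The default module name is an assumption; use the one defined in blackbox.yml.
    """
    return f"{blackbox}/probe?{urlencode({'target': target, 'module': module})}"


def probe_succeeded(exposition_text: str) -> bool:
    """Read the probe_success sample from Prometheus text-format output."""
    for line in exposition_text.splitlines():
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    raise ValueError("probe_success not found in probe output")


url = probe_url("http://localhost:9115", "http://prometheus:9090/-/healthy")
print(url)

# Abbreviated, illustrative output of a failed probe.
failed_probe = (
    "# TYPE probe_success gauge\n"
    "probe_success 0\n"
)
print(probe_succeeded(failed_probe))
```

Anything that fetches this URL on a schedule from outside the stack can act on the result, which is what turns the probe into an independent signal.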

Scenario 2: Kill AlertManager

Simulates AlertManager crashing or becoming unreachable.

./chaos/kill-alertmanager.sh

What happens:

Prometheus Firing Alerts:
  [critical] TargetDown              target=alertmanager:9093
  [critical] MonitoringComponentDown target=http://alertmanager:9093/-/healthy

Prometheus Notification Delivery:
  notifications_dropped_total = 10
  notifications_errors_total  = 9

Blackbox probe for AlertManager:
  probe_success = 0 (FAIL — AlertManager unreachable)

Prometheus knows AlertManager is down. It fires critical alerts. But it delivers alerts through AlertManager — the very thing that's broken. 10 dropped notifications. Nobody gets paged.
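The delivery failure is visible in Prometheus's own counters. As a rough sketch, and assuming the abbreviated names above correspond to the prometheus_notifications_dropped_total and prometheus_notifications_errors_total counters on Prometheus's /metrics endpoint, the Python below compares two scrapes and reports whether delivery is failing. The sample texts are illustrative, and the parser is deliberately simple: it assumes no timestamps after the sample value.

```python
def counter_total(exposition_text: str, name: str) -> float:
    """Sum every sample of a counter, with or without labels, from text-format metrics."""
    total = 0.0
    for line in exposition_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        # Split on the last space: everything before it is the metric plus labels.
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
    return total


def delivery_is_failing(before: str, after: str) -> bool:
    """True if dropped or errored notifications increased between two scrapes."""
    names = (
        "prometheus_notifications_dropped_total",
        "prometheus_notifications_errors_total",
    )
    return any(counter_total(after, n) > counter_total(before, n) for n in names)


# Illustrative scrapes taken before and after AlertManager is stopped.
before = (
    "prometheus_notifications_dropped_total 0\n"
    'prometheus_notifications_errors_total{alertmanager="http://alertmanager:9093/api/v2/alerts"} 0\n'
)
after = (
    "prometheus_notifications_dropped_total 10\n"
    'prometheus_notifications_errors_total{alertmanager="http://alertmanager:9093/api/v2/alerts"} 9\n'
)
print(delivery_is_failing(before, after))
```

A check like this only helps if it runs somewhere that can notify you without going through AlertManager.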

Restore Everything

./chaos/restore-all.sh

Key Concepts

Watchdog Alert (DeadMansSwitch)

An alert that always fires. If it stops, your pipeline is broken.

- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none

Route it to an external heartbeat service (Healthchecks.io, PagerDuty, Dead Man's Snitch). If the heartbeat stops arriving, the external service alerts you through an independent path.
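What the external service does with the heartbeat can be sketched as follows. This is a minimal model of a dead man's switch receiver, not the behavior of any specific provider; the `HeartbeatMonitor` class and the 120-second timeout are hypothetical. Time is passed in explicitly so the logic is easy to test.

```python
class HeartbeatMonitor:
    """Receiver side of a dead man's switch: alert when pings stop arriving."""

    def __init__(self, timeout_seconds: float, now: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_ping = now

    def ping(self, now: float) -> None:
        """Record a heartbeat, e.g. each time the Watchdog notification arrives."""
        self.last_ping = now

    def is_overdue(self, now: float) -> bool:
        """True once no ping has arrived within the timeout window."""
        return now - self.last_ping > self.timeout_seconds


monitor = HeartbeatMonitor(timeout_seconds=120, now=0)
monitor.ping(now=60)
print(monitor.is_overdue(now=150))  # 90 s since the last ping
print(monitor.is_overdue(now=200))  # 140 s since the last ping
```

The alert condition is inverted compared with a normal alert: silence is the trigger, so a failure anywhere between Prometheus and the receiver is caught.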

Blackbox Health Probes

Blackbox Exporter probes the health endpoints of the monitoring components themselves:

  • http://prometheus:9090/-/healthy
  • http://alertmanager:9093/-/healthy
  • http://grafana:3000/api/health
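The same endpoints can be checked from the host, independently of the stack. The sketch below is one possible way to do that with Python's standard library; it assumes the published localhost ports listed under Access the UIs, and `check_all` is a hypothetical helper that accepts an injectable fetch function so it can be exercised without a running stack.

```python
import urllib.error
import urllib.request
from typing import Callable

# Host-side URLs, assuming the ports published by the lab's Docker Compose setup.
HEALTH_ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code, or 0 if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def check_all(fetch: Callable[[str], int] = http_status) -> dict[str, bool]:
    """Map each component to True if its health endpoint answered 200."""
    return {name: fetch(url) == 200 for name, url in HEALTH_ENDPOINTS.items()}


if __name__ == "__main__":
    for name, healthy in check_all().items():
        print(f"{name}: {'OK' if healthy else 'DOWN'}")
```

Run on a schedule from a machine outside the Docker network, a check like this keeps working when Prometheus or AlertManager is the component that failed.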

absent() Rules

Introduced in Article 2, absent() rules detect when metrics disappear. But as this lab demonstrates, absent() only works if Prometheus is alive to evaluate it.


File Structure

.
├── docker-compose.yml              # Full stack definition
├── app/
│   ├── Dockerfile                  # Sample payment service
│   ├── main.py                     # Python app with Prometheus metrics
│   └── requirements.txt
├── prometheus/
│   ├── prometheus.yml              # Scrape config + meta-monitoring
│   └── alerts/
│       ├── watchdog.yml            # Watchdog, TargetDown, meta-alerts
│       └── absent.yml              # absent() rules
├── alertmanager/
│   └── alertmanager.yml            # Routing with DeadMansSwitch receiver
├── blackbox/
│   └── blackbox.yml                # HTTP health probe config
├── grafana/
│   ├── datasources.yml             # Auto-provisioned Prometheus source
│   └── dashboards/
│       ├── dashboard.yml           # Provisioning config
│       └── meta-monitoring.json    # Pre-built dashboard
└── chaos/
    ├── kill-prometheus.sh          # Stop Prometheus
    ├── kill-alertmanager.sh        # Stop AlertManager
    └── restore-all.sh              # Restore all services

Cleanup

docker compose down -v

License

MIT


Author: Rafael Muller — Staff Cloud Engineer at Airbnb, working on platform infrastructure at 8M+ listings scale. Writing at Medium.
