# DevOps Observability & SRE

A practical guide to observability, Site Reliability Engineering principles, and incident management.

## Table of Contents
1. [Three Pillars of Observability](#three-pillars)
2. [SLIs, SLOs, and SLAs](#slis-slos-slas)
3. [RED and USE Methods](#red-use-methods)
4. [Distributed Tracing with OpenTelemetry](#distributed-tracing)
5. [Alerting Best Practices & On-Call](#alerting)
6. [Incident Management & Post-Mortems](#incident-management)

In [None]:
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from dataclasses import dataclass, field
from typing import Optional, List
import json
import uuid

---
<a id='three-pillars'></a>
## 1. Three Pillars of Observability

Observability is the ability to infer a system's internal state from its external outputs: logs, metrics, and traces.

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    THREE PILLARS OF OBSERVABILITY                       │
├─────────────────────┬─────────────────────┬─────────────────────────────┤
│        LOGS         │       METRICS       │          TRACES             │
├─────────────────────┼─────────────────────┼─────────────────────────────┤
│ • Discrete events   │ • Numeric values    │ • Request journey           │
│ • High cardinality  │ • Aggregated        │ • Causality chain           │
│ • Debug details     │ • Time-series       │ • Latency breakdown         │
│ • Unstructured/     │ • Low cardinality   │ • Cross-service             │
│   Structured        │ • Dashboards        │   correlation               │
├─────────────────────┼─────────────────────┼─────────────────────────────┤
│ WHEN: Debugging,    │ WHEN: Alerting,     │ WHEN: Debugging distributed │
│ Forensics           │ Trends, SLOs        │ systems, Latency analysis   │
└─────────────────────┴─────────────────────┴─────────────────────────────┘
```

### Comparison Matrix

| Aspect | Logs | Metrics | Traces |
|--------|------|---------|--------|
| **Storage Cost** | High | Low | Medium |
| **Query Speed** | Slow | Fast | Medium |
| **Cardinality** | Unlimited | Limited | Medium |
| **Best For** | Root cause | Alerting | Request flow |
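
The cardinality row is easier to judge with numbers. A minimal sketch, assuming a hypothetical `http_requests_total` metric and made-up label counts: every unique combination of label values becomes its own time series, so a single unbounded label such as a user ID multiplies the series count.

```python
from math import prod

def series_count(label_cardinalities: dict) -> int:
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical label sets for an illustrative http_requests_total metric
bounded = {"method": 5, "status_class": 5, "endpoint": 40}
with_user_id = {**bounded, "user_id": 100_000}

print(f"Bounded labels: {series_count(bounded):,} series")
print(f"With user_id:   {series_count(with_user_id):,} series")
```

This is why per-user or per-request identifiers usually belong in logs and trace attributes rather than in metric labels.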

In [None]:
# Structured Logging Example
from datetime import timezone

class StructuredLogger:
    """Structured JSON logger for observability."""
    
    def __init__(self, service: str, version: str):
        self.service = service
        self.version = version
    
    def log(self, level: str, message: str, **context):
        entry = {
            # datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
            "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
            "level": level,
            "service": self.service,
            "version": self.version,
            "message": message,
            **context
        }
        print(json.dumps(entry, indent=2))

# Usage
logger = StructuredLogger("payment-service", "1.2.3")
logger.log(
    "INFO", 
    "Payment processed",
    trace_id="abc123",
    user_id="user_456",
    amount=99.99,
    duration_ms=45
)

---
<a id='slis-slos-slas'></a>
## 2. SLIs, SLOs, and SLAs

```
┌─────────────────────────────────────────────────────────────────┐
│                    SLI → SLO → SLA Pipeline                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│  │     SLI     │───▶│     SLO     │───▶│     SLA     │         │
│  │  (Measure)  │    │  (Target)   │    │ (Contract)  │         │
│  └─────────────┘    └─────────────┘    └─────────────┘         │
│                                                                 │
│  • What you       • Internal goal    • External promise        │
│    measure        • Error budget     • Legal/financial         │
│  • Quantitative   • Team-owned       • Customer-facing         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Definitions

| Term | Definition | Example |
|------|------------|--------|
| **SLI** (Service Level Indicator) | Quantitative measure of service | Request latency p99 |
| **SLO** (Service Level Objective) | Target value for an SLI | p99 latency < 200ms |
| **SLA** (Service Level Agreement) | Contract with consequences | 99.9% uptime or credits |
| **Error Budget** | Allowed unreliability (1 − SLO) | 0.1% = 43.2 min per 30 days |
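
The error budget in the table is time-based. When the SLI counts requests instead, the same 0.1% becomes a number of allowed failures. A minimal sketch with made-up traffic figures; `request_error_budget` and its field names are illustrative and not taken from any library:

```python
def request_error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Request-based view of an error budget for one SLO window."""
    allowed_failures = total_requests * (1 - slo_target)
    sli = (total_requests - failed_requests) / total_requests if total_requests else 1.0
    consumed = (
        round(failed_requests / allowed_failures * 100, 1)
        if allowed_failures else float("inf")
    )
    return {
        "sli": sli,
        "allowed_failures": round(allowed_failures),
        "budget_consumed_percent": consumed,
        "slo_met": sli >= slo_target,
    }

# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% target
budget_status = request_error_budget(0.999, 1_000_000, 400)
for key, value in budget_status.items():
    print(f"{key}: {value}")
```

Request-based budgets weight busy periods more heavily than quiet ones, which a time-based budget does not.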

In [None]:
@dataclass
class SLOCalculator:
    """Calculate SLO metrics and error budgets."""
    
    target_availability: float  # e.g., 0.999 for 99.9%
    window_days: int = 30
    
    @property
    def error_budget_percent(self) -> float:
        """Allowed unavailability percentage."""
        return (1 - self.target_availability) * 100
    
    @property
    def error_budget_minutes(self) -> float:
        """Allowed downtime in minutes."""
        return self.window_days * 24 * 60 * (1 - self.target_availability)
    
    def remaining_budget(self, downtime_minutes: float) -> dict:
        """Calculate remaining error budget."""
        total = self.error_budget_minutes
        remaining = total - downtime_minutes
        return {
            "total_budget_min": round(total, 2),
            "consumed_min": round(downtime_minutes, 2),
            "remaining_min": round(remaining, 2),
            # A 100% target leaves no budget; avoid dividing by zero in that case
            "consumed_percent": round((downtime_minutes / total) * 100, 2) if total > 0 else 100.0,
            "budget_exhausted": remaining <= 0
        }
    
    def current_availability(self, good_events: int, total_events: int) -> float:
        """Calculate current availability from events."""
        return good_events / total_events if total_events > 0 else 1.0

# Example calculations
slo = SLOCalculator(target_availability=0.999, window_days=30)

print("=== 99.9% SLO (30-day window) ===")
print(f"Error Budget: {slo.error_budget_percent:.2f}%")
print(f"Allowed Downtime: {slo.error_budget_minutes:.1f} minutes ({slo.error_budget_minutes/60:.2f} hours)")
print()

# Scenario: 20 minutes of downtime so far
status = slo.remaining_budget(downtime_minutes=20)
print("=== Budget Status (20 min downtime) ===")
for k, v in status.items():
    print(f"{k}: {v}")

In [None]:
# Common SLO targets and their error budgets
slo_targets = [
    (0.99, "99%", "Two nines"),
    (0.999, "99.9%", "Three nines"),
    (0.9999, "99.99%", "Four nines"),
    (0.99999, "99.999%", "Five nines"),
]

data = []
for target, label, name in slo_targets:
    calc = SLOCalculator(target, window_days=30)
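    data.append({
        "SLO": label,
        "Name": name,
        "Downtime/Month": f"{calc.error_budget_minutes:.2f} min",
        # Scale the 30-day budget to 365 days (multiplying by 12 would cover only 360 days)
        "Downtime/Year": f"{calc.error_budget_minutes * 365 / calc.window_days / 60:.2f} hrs"
    })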

pd.DataFrame(data)

In [None]:
# Visualize error budget burn-down (remaining budget over the 30-day window)
np.random.seed(42)
days = 30
dates = pd.date_range(start='2026-01-01', periods=days, freq='D')

# Simulate daily incidents
daily_downtime = np.random.exponential(scale=1.5, size=days)
daily_downtime[15] = 12  # Major incident on day 15
cumulative_downtime = np.cumsum(daily_downtime)

slo = SLOCalculator(target_availability=0.999, window_days=30)
budget_total = slo.error_budget_minutes
budget_remaining = budget_total - cumulative_downtime

fig = go.Figure()

# Budget burn line
fig.add_trace(go.Scatter(
    x=dates, y=budget_remaining,
    mode='lines+markers',
    name='Remaining Budget',
    line=dict(color='#2ecc71', width=2),
    fill='tozeroy',
    fillcolor='rgba(46, 204, 113, 0.3)'
))

# Danger zone
fig.add_hline(y=0, line_dash="dash", line_color="red", 
              annotation_text="Budget Exhausted")
fig.add_hline(y=budget_total * 0.2, line_dash="dash", line_color="orange",
              annotation_text="20% Warning")

fig.update_layout(
    title="Error Budget Burn-Down (99.9% SLO)",
    xaxis_title="Date",
    yaxis_title="Remaining Budget (minutes)",
    template="plotly_white",
    height=400
)
fig

---
<a id='red-use-methods'></a>
## 3. RED and USE Methods

RED and USE are complementary: RED describes how a service handles its requests, while USE describes how busy and healthy the underlying resources are.

```
┌────────────────────────────────┬────────────────────────────────┐
│         RED Method             │         USE Method             │
│    (Request-driven services)   │      (Resource monitoring)     │
├────────────────────────────────┼────────────────────────────────┤
│                                │                                │
│  R - Rate (requests/sec)       │  U - Utilization (% busy)      │
│  E - Errors (failed reqs)      │  S - Saturation (queue depth)  │
│  D - Duration (latency)        │  E - Errors (error count)      │
│                                │                                │
├────────────────────────────────┼────────────────────────────────┤
│  Best for: APIs, microservices │  Best for: CPU, memory, disk   │
│  Focus: User experience        │  Focus: Resource health        │
└────────────────────────────────┴────────────────────────────────┘
```

### When to Use Each

| Component | Method | Key Metrics |
|-----------|--------|-------------|
| API Gateway | RED | RPS, error rate, p50/p99 latency |
| Database | USE | CPU util, connection pool, disk I/O |
| Message Queue | USE | Queue depth, consumer lag |
| Web Service | RED | Requests, 5xx rate, response time |
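
For the USE rows in the table, the three signals can be held in a small structure per resource. A minimal sketch, assuming hand-entered sample values for a hypothetical disk; the `USEMetrics` class and its field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class USEMetrics:
    """USE method snapshot for one resource (illustrative fields)."""
    resource: str
    busy_seconds: float     # time the resource spent doing work
    window_seconds: float   # length of the observation window
    queue_depth: int        # work waiting for the resource
    error_count: int

    @property
    def utilization(self) -> float:
        """Percent of the window the resource was busy."""
        if self.window_seconds <= 0:
            return 0.0
        return self.busy_seconds / self.window_seconds * 100

    @property
    def saturated(self) -> bool:
        """Queued work means demand exceeded capacity during the window."""
        return self.queue_depth > 0

# Hypothetical sample for a single disk over one minute
disk = USEMetrics("disk-sda", busy_seconds=51.0, window_seconds=60.0,
                  queue_depth=4, error_count=0)
print(f"{disk.resource}: utilization={disk.utilization:.0f}%, "
      f"saturated={disk.saturated}, errors={disk.error_count}")
```

Utilization and saturation are reported separately on purpose: a resource can sit well below 100% utilization on average and still queue work during short bursts.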

In [None]:
@dataclass
class REDMetrics:
    """RED method metrics for a service."""
    
    total_requests: int
    failed_requests: int
    latencies_ms: list  # List of request latencies
    window_seconds: int = 60
    
    @property
    def rate(self) -> float:
        """Requests per second."""
        return self.total_requests / self.window_seconds
    
    @property
    def error_rate(self) -> float:
        """Error percentage."""
        return (self.failed_requests / self.total_requests * 100) if self.total_requests > 0 else 0
    
    @property
    def duration_percentiles(self) -> dict:
        """Latency percentiles."""
        if not self.latencies_ms:
            return {}
        arr = np.array(self.latencies_ms)
        return {
            "p50": np.percentile(arr, 50),
            "p90": np.percentile(arr, 90),
            "p99": np.percentile(arr, 99)
        }

# Simulate service metrics
np.random.seed(42)
latencies = np.random.lognormal(mean=3.5, sigma=0.5, size=1000).tolist()

red = REDMetrics(
    total_requests=1000,
    failed_requests=15,
    latencies_ms=latencies,
    window_seconds=60
)

print("=== RED Metrics ===")
print(f"Rate: {red.rate:.1f} req/sec")
print(f"Error Rate: {red.error_rate:.2f}%")
print(f"Duration: p50={red.duration_percentiles['p50']:.0f}ms, "
      f"p90={red.duration_percentiles['p90']:.0f}ms, "
      f"p99={red.duration_percentiles['p99']:.0f}ms")

In [None]:
# Visualize RED metrics over time
hours = 24
timestamps = pd.date_range(start='2026-01-01', periods=hours, freq='h')  # 'H' is deprecated in pandas 2.2+

np.random.seed(42)
rate = 100 + 50 * np.sin(np.linspace(0, 2*np.pi, hours)) + np.random.normal(0, 10, hours)
error_rate = np.abs(np.random.normal(1, 0.5, hours))
error_rate[14:16] = [8, 12]  # Spike during incident
p99_latency = 150 + np.random.normal(0, 20, hours)
p99_latency[14:16] = [450, 520]  # Latency spike

fig = make_subplots(rows=3, cols=1, shared_xaxes=True,
                    subplot_titles=('Rate (req/s)', 'Error Rate (%)', 'P99 Latency (ms)'))

fig.add_trace(go.Scatter(x=timestamps, y=rate, mode='lines', 
                         line=dict(color='#3498db')), row=1, col=1)
fig.add_trace(go.Scatter(x=timestamps, y=error_rate, mode='lines',
                         line=dict(color='#e74c3c'), fill='tozeroy'), row=2, col=1)
fig.add_trace(go.Scatter(x=timestamps, y=p99_latency, mode='lines',
                         line=dict(color='#9b59b6')), row=3, col=1)

# SLO threshold lines
fig.add_hline(y=5, line_dash="dash", line_color="orange", row=2, col=1)
fig.add_hline(y=200, line_dash="dash", line_color="orange", row=3, col=1)

fig.update_layout(height=500, title="RED Dashboard", showlegend=False,
                  template="plotly_white")
fig

---
<a id='distributed-tracing'></a>
## 4. Distributed Tracing with OpenTelemetry

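OpenTelemetry is a vendor-neutral standard (APIs, SDKs, and a wire protocol) for emitting traces, metrics, and logs. The cells below use simplified stand-in classes rather than the OpenTelemetry SDK to show how a trace is structured.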

```
┌─────────────────────────────────────────────────────────────────────────┐
│                      DISTRIBUTED TRACE ANATOMY                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  TRACE (trace_id: abc123)                                               │
│  ├── SPAN: API Gateway (50ms)                                           │
│  │   └── SPAN: Auth Service (15ms)                                      │
│  │       └── SPAN: Token Validation (5ms)                               │
│  │                                                                      │
│  ├── SPAN: Order Service (120ms)                                        │
│  │   ├── SPAN: Inventory Check (30ms)                                   │
│  │   │   └── SPAN: DB Query (10ms)                                      │
│  │   └── SPAN: Payment Processing (80ms)                                │
│  │       └── SPAN: External API Call (70ms) ◄── Bottleneck              │
│  │                                                                      │
│  └── SPAN: Notification (20ms)                                          │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Trace** | End-to-end request journey (unique trace_id) |
| **Span** | Single operation within a trace |
| **Context Propagation** | Passing trace context between services |
| **Baggage** | User-defined key-value pairs propagated alongside the trace context |
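
Context propagation is the least visible of these concepts, so a sketch may help. The example below hand-builds and parses a W3C Trace Context `traceparent` header (`version-trace_id-parent_id-flags`) using only the standard library. The helper names `inject_traceparent` and `extract_traceparent` are hypothetical and are not the OpenTelemetry SDK functions; the validation is also simplified (for example, it does not reject all-zero IDs).

```python
import re
import uuid
from typing import Optional, Tuple

# version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def inject_traceparent(headers: dict, trace_id: str, span_id: str,
                       sampled: bool = True) -> dict:
    """Caller side: copy the headers and add the current trace context."""
    flags = "01" if sampled else "00"
    return {**headers, "traceparent": f"00-{trace_id}-{span_id}-{flags}"}

def extract_traceparent(headers: dict) -> Optional[Tuple[str, str]]:
    """Callee side: return (trace_id, parent_span_id), or None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None

# Upstream service starts a trace and calls a downstream service
trace_id = uuid.uuid4().hex        # 32 hex characters
span_id = uuid.uuid4().hex[:16]    # 16 hex characters
outgoing = inject_traceparent({"accept": "application/json"}, trace_id, span_id)

# Downstream service continues the same trace, using the caller's span as parent
context = extract_traceparent(outgoing)
print(outgoing["traceparent"])
print(context)
```

In practice an instrumentation library does this automatically for HTTP and RPC clients; the point here is only that the `trace_id` travels with the request while each service contributes its own spans.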

In [None]:
@dataclass
class Span:
    """Simplified span representation."""
    name: str
    duration_ms: float
    service: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    status: str = "OK"
    attributes: dict = field(default_factory=dict)

@dataclass  
class Trace:
    """Simplified trace with spans."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex[:32])
    spans: List[Span] = field(default_factory=list)
    
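    def add_span(self, span: Span):
        self.spans.append(span)
        
    @property
    def total_duration(self) -> float:
        """End-to-end duration; assumes the longest span is the root enclosing all others."""
        return max(s.duration_ms for s in self.spans) if self.spans else 0
    
    def self_time(self, span: Span) -> float:
        """Time spent in the span itself, excluding direct children (assumed sequential)."""
        child_ms = sum(s.duration_ms for s in self.spans if s.parent_id == span.span_id)
        return max(span.duration_ms - child_ms, 0)
    
    def find_bottleneck(self) -> Optional[Span]:
        """Span with the most self time; ranking by raw duration would always pick the root."""
        return max(self.spans, key=self.self_time) if self.spans else None

# Simulate a trace
trace = Trace()
root = Span("HTTP GET /orders", 200, "api-gateway")
order = Span("get_order", 150, "order-service", parent_id=root.span_id)
trace.add_span(root)
trace.add_span(Span("authenticate", 25, "auth-service", parent_id=root.span_id))
trace.add_span(order)
# The DB query runs inside get_order, so it is a child of that span, not of the root
trace.add_span(Span("db_query", 120, "postgres", parent_id=order.span_id, 
                    attributes={"db.statement": "SELECT * FROM orders"}))

bottleneck = trace.find_bottleneck()
print(f"Trace ID: {trace.trace_id}")
print(f"Total Duration: {trace.total_duration}ms")
print(f"Bottleneck: {bottleneck.name} ({trace.self_time(bottleneck)}ms self time)")
print("\nSpans:")
for span in trace.spans:
    print(f"  [{span.service}] {span.name}: {span.duration_ms}ms")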

In [None]:
# Visualize trace as waterfall/Gantt chart
spans_data = [
    {"Service": "api-gateway", "Operation": "HTTP GET /orders", "Start": 0, "Duration": 200},
    {"Service": "auth-service", "Operation": "authenticate", "Start": 5, "Duration": 25},
    {"Service": "order-service", "Operation": "get_order", "Start": 35, "Duration": 150},
    {"Service": "postgres", "Operation": "db_query", "Start": 40, "Duration": 120},
    {"Service": "cache", "Operation": "cache_lookup", "Start": 38, "Duration": 5},
]

df = pd.DataFrame(spans_data)
df["End"] = df["Start"] + df["Duration"]

colors = {"api-gateway": "#3498db", "auth-service": "#2ecc71", 
          "order-service": "#e74c3c", "postgres": "#9b59b6", "cache": "#f39c12"}

fig = go.Figure()

for i, row in df.iterrows():
    fig.add_trace(go.Bar(
        x=[row["Duration"]],
        y=[row["Operation"]],
        orientation='h',
        base=row["Start"],
        name=row["Service"],
        marker_color=colors[row["Service"]],
        text=f"{row['Duration']}ms",
        textposition="inside",
        showlegend=i < 5
    ))

fig.update_layout(
    title="Trace Waterfall - Request /orders",
    xaxis_title="Time (ms)",
    yaxis=dict(autorange="reversed"),  # list the root span at the top, like a trace viewer
    barmode='overlay',
    height=300,
    template="plotly_white"
)
fig

---
<a id='alerting'></a>
## 5. Alerting Best Practices & On-Call

```
┌─────────────────────────────────────────────────────────────────────────┐
│                     ALERTING HIERARCHY                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌───────────┐                                                         │
│   │  PAGE     │ ◄── Immediate action required (SLO breach, outage)      │
│   │  (P1/P2)  │     → Wake someone up                                   │
│   └─────┬─────┘                                                         │
│         │                                                               │
│   ┌─────▼─────┐                                                         │
│   │  TICKET   │ ◄── Needs attention within hours/days                   │
│   │  (P3)     │     → Create work item                                  │
│   └─────┬─────┘                                                         │
│         │                                                               │
│   ┌─────▼─────┐                                                         │
│   │   LOG     │ ◄── Informational, trends to watch                      │
│   │  (P4/P5)  │     → Dashboard/logging only                            │
│   └───────────┘                                                         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Alert Quality Checklist

| Principle | Bad Example | Good Example |
|-----------|-------------|--------------|
| **Actionable** | CPU > 80% | Error budget burning at > 2x the sustainable rate |
| **Symptom-based** | Disk usage high | Users seeing timeouts |
| **Has runbook** | Alert with no context | Link to remediation steps |
| **Appropriate urgency** | Page for warnings | Page only for user impact |
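
The burn-rate example in the checklist can be made concrete. A burn rate of 1 spends the budget exactly over the SLO window; higher values spend it proportionally faster. The sketch below assumes a 30-day window and uses the multi-window idea commonly described for SLO alerting (page only when both a long and a short window exceed the threshold). The function names and sample error ratios are illustrative.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1 - slo_target)

def budget_consumed(rate: float, alert_window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the whole error budget spent if `rate` holds for the alert window."""
    return rate * alert_window_hours / (slo_window_days * 24)

def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only if the burn is both sustained (long window) and still ongoing (short window)."""
    return long_window_rate > threshold and short_window_rate > threshold

# Hypothetical observations against a 99.9% SLO
rate_1h = burn_rate(error_ratio=0.02, slo_target=0.999)    # 2% of requests failing over 1h
rate_5m = burn_rate(error_ratio=0.018, slo_target=0.999)   # still failing in the last 5m

print(f"1h burn rate: {rate_1h:.1f}x, 5m burn rate: {rate_5m:.1f}x")
print(f"Budget spent by 1h at 14.4x: {budget_consumed(14.4, 1):.1%}")
print(f"Days to exhaustion at 14.4x: {30 / 14.4:.1f}")
print(f"Page: {should_page(rate_1h, rate_5m)}")
```

The short window keeps the alert from paging after the problem has already stopped, which is one way to satisfy the "appropriate urgency" principle.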

In [None]:
from enum import Enum

class Severity(Enum):
    CRITICAL = "P1"  # Page immediately
    HIGH = "P2"      # Page during business hours
    MEDIUM = "P3"    # Create ticket
    LOW = "P4"       # Log only

@dataclass
class AlertRule:
    """Structured alert definition."""
    name: str
    severity: Severity
    condition: str
    runbook_url: str
    for_duration: str = "5m"  # Must be true for this duration
    
    def to_prometheus(self) -> str:
        """Generate Prometheus alerting rule."""
        return f"""
- alert: {self.name}
  expr: {self.condition}
  for: {self.for_duration}
  labels:
    severity: {self.severity.value}
  annotations:
    runbook_url: {self.runbook_url}
"""

# Define SLO-based alerts
alerts = [
    AlertRule(
        name="HighErrorBudgetBurn",
        severity=Severity.CRITICAL,
        condition="error_budget_burn_rate > 14.4",  # burns 2% of a 30-day budget per hour; exhausted in ~2 days if sustained
        runbook_url="https://runbooks.example.com/error-budget",
        for_duration="5m"
    ),
    AlertRule(
        name="HighLatencyP99",
        severity=Severity.HIGH,
        condition='histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5',
        runbook_url="https://runbooks.example.com/latency",
        for_duration="10m"
    )
]

for alert in alerts:
    print(alert.to_prometheus())

In [None]:
# On-call schedule visualization
on_call_data = [
    {"Engineer": "Alice", "Week": "Week 1", "Shift": "Primary"},
    {"Engineer": "Bob", "Week": "Week 1", "Shift": "Secondary"},
    {"Engineer": "Bob", "Week": "Week 2", "Shift": "Primary"},
    {"Engineer": "Carol", "Week": "Week 2", "Shift": "Secondary"},
    {"Engineer": "Carol", "Week": "Week 3", "Shift": "Primary"},
    {"Engineer": "Alice", "Week": "Week 3", "Shift": "Secondary"},
]

df_oncall = pd.DataFrame(on_call_data)

fig = px.bar(df_oncall, x="Week", y="Engineer", color="Shift",
             barmode="group", title="On-Call Rotation",
             color_discrete_map={"Primary": "#e74c3c", "Secondary": "#3498db"})
fig.update_layout(template="plotly_white", height=300)
fig
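
The hand-written schedule above follows a simple rule: each week's secondary becomes the next week's primary. A minimal sketch of generating such a rotation, assuming a fixed team list and no holidays or swaps; `build_rotation` is a hypothetical helper:

```python
def build_rotation(engineers: list, weeks: int) -> list:
    """Round-robin schedule: each week's secondary becomes next week's primary."""
    if len(engineers) < 2:
        raise ValueError("Need at least two engineers for primary and secondary shifts")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        label = f"Week {week + 1}"
        schedule.append({"Engineer": primary, "Week": label, "Shift": "Primary"})
        schedule.append({"Engineer": secondary, "Week": label, "Shift": "Secondary"})
    return schedule

rotation = build_rotation(["Alice", "Bob", "Carol"], weeks=3)
for entry in rotation:
    print(f"{entry['Week']}: {entry['Engineer']} ({entry['Shift']})")
```

Shadowing as secondary before taking primary gives each engineer a week of context on open issues before they carry the pager.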

---
<a id='incident-management'></a>
## 6. Incident Management & Post-Mortems

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    INCIDENT LIFECYCLE                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐             │
│  │ DETECT   │──▶│ RESPOND  │──▶│ RESOLVE  │──▶│ LEARN    │             │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘             │
│       │              │              │              │                    │
│       ▼              ▼              ▼              ▼                    │
│  • Alerting     • Triage        • Fix/         • Post-mortem           │
│  • Monitoring   • Communicate     Rollback     • Action items          │
│  • User report  • Incident cmd  • Verify       • Share learnings       │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│  METRICS:  MTTD          MTTA         MTTR         Post-mortem SLA     │
│            (detect)     (ack)        (resolve)      (within 48h)       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Incident Severity Levels

| Severity | Impact | Response | Example |
|----------|--------|----------|---------|
| **SEV1** | Total outage | All hands, exec comms | Site down |
| **SEV2** | Major feature down | On-call + backup | Payments failing |
| **SEV3** | Degraded service | On-call | Slow responses |
| **SEV4** | Minor issue | Next business day | UI bug |

In [None]:
@dataclass
class IncidentMetrics:
    """Calculate incident response metrics."""
    
    incidents: list  # List of (detect_time, ack_time, resolve_time)
    
    def calculate_metrics(self) -> dict:
        mtta = []  # Mean Time to Acknowledge 
        mttr = []  # Mean Time to Resolve
        
        for incident in self.incidents:
            detect, ack, resolve = incident
            mtta.append((ack - detect).total_seconds() / 60)
            mttr.append((resolve - detect).total_seconds() / 60)
        
        return {
            "MTTA (min)": round(np.mean(mtta), 1),
            "MTTR (min)": round(np.mean(mttr), 1),
            "Incident Count": len(self.incidents)
        }

# Sample incidents
incidents = [
    (datetime(2026, 1, 5, 10, 0), datetime(2026, 1, 5, 10, 5), datetime(2026, 1, 5, 10, 45)),
    (datetime(2026, 1, 12, 14, 0), datetime(2026, 1, 12, 14, 3), datetime(2026, 1, 12, 15, 20)),
    (datetime(2026, 1, 20, 2, 0), datetime(2026, 1, 20, 2, 15), datetime(2026, 1, 20, 3, 30)),
]

metrics = IncidentMetrics(incidents)
print("=== Incident Metrics (January 2026) ===")
for k, v in metrics.calculate_metrics().items():
    print(f"{k}: {v}")

In [None]:
# MTTR trend over months
months = ['Oct', 'Nov', 'Dec', 'Jan', 'Feb']
mttr_values = [95, 72, 58, 45, 38]  # Improving trend
incident_counts = [8, 6, 7, 4, 3]

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Bar(x=months, y=incident_counts, name="Incidents", 
           marker_color='#3498db', opacity=0.7),
    secondary_y=False
)

fig.add_trace(
    go.Scatter(x=months, y=mttr_values, name="MTTR (min)", 
               mode='lines+markers', line=dict(color='#e74c3c', width=3)),
    secondary_y=True
)

fig.update_layout(
    title="Incident Trend & MTTR Improvement",
    template="plotly_white",
    height=350
)
fig.update_yaxes(title_text="Incident Count", secondary_y=False)
fig.update_yaxes(title_text="MTTR (minutes)", secondary_y=True)
fig

### Post-Mortem Template

```markdown
## Incident Post-Mortem: [Title]

**Date:** YYYY-MM-DD  
**Severity:** SEV2  
**Duration:** 45 minutes  
**Author:** [Name]  

### Summary
Brief 2-3 sentence description of what happened.

### Impact
- Users affected: ~5,000
- Revenue impact: $X
- Error budget consumed: 15%

### Timeline
| Time (UTC) | Event |
|------------|-------|
| 10:00 | Deployment started |
| 10:15 | Alert fired - Error rate > 5% |
| 10:18 | On-call acknowledged |
| 10:25 | Root cause identified |
| 10:30 | Rollback initiated |
| 10:45 | Service restored |

### Root Cause
Database connection pool exhausted due to missing timeout.

### Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add connection timeout | @alice | 2026-02-10 | Done |
| Add pool size alert | @bob | 2026-02-12 | In Progress |
| Update runbook | @carol | 2026-02-15 | TODO |

### Lessons Learned
- What went well: Fast detection, clear communication
- What went wrong: Missing monitoring for connection pool
- Where we got lucky: Low traffic period
```
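
The timeline in the template already contains the response metrics for this incident. A minimal sketch of deriving them, assuming all times fall on the same day and that user impact began when the deployment started (an assumption; the template does not say so). The dictionary keys are illustrative names.

```python
from datetime import datetime

# Timeline copied from the template (UTC, same day)
timeline = {
    "deploy_started": "10:00",
    "alert_fired": "10:15",
    "acknowledged": "10:18",
    "root_cause_identified": "10:25",
    "rollback_initiated": "10:30",
    "service_restored": "10:45",
}

def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both given as 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

phases = {
    "time_to_detect": minutes_between(timeline["deploy_started"], timeline["alert_fired"]),
    "time_to_acknowledge": minutes_between(timeline["alert_fired"], timeline["acknowledged"]),
    "time_to_mitigate": minutes_between(timeline["acknowledged"], timeline["service_restored"]),
    "total_duration": minutes_between(timeline["deploy_started"], timeline["service_restored"]),
}

for phase, minutes in phases.items():
    print(f"{phase}: {minutes} min")
```

Breaking the 45 minutes into phases shows where the time went: here detection took longer than acknowledgement, which matches the "missing monitoring" lesson in the template.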

---
## Quick Reference

| Topic | Key Takeaway |
|-------|-------------|
| **Three Pillars** | Logs (debug), Metrics (alert), Traces (flow) |
| **SLO/Error Budget** | 99.9% ≈ 43 min of downtime allowed per 30 days |
| **RED Method** | Rate, Errors, Duration - for services |
| **USE Method** | Utilization, Saturation, Errors - for resources |
| **Tracing** | trace_id links spans across services |
| **Alerting** | Page for symptoms, not causes; have runbooks |
| **Incidents** | Detect → Respond → Resolve → Learn |