# Observability Overview

Observability is the ability to understand the internal state of a system by examining its external outputs. The **three pillars of observability** are:

```
┌─────────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY PILLARS                        │
├─────────────────────┬─────────────────────┬─────────────────────┤
│       LOGS          │      METRICS        │      TRACES         │
├─────────────────────┼─────────────────────┼─────────────────────┤
│ • What happened     │ • How much/many     │ • Request journey   │
│ • Discrete events   │ • Aggregatable      │ • Causality         │
│ • High cardinality  │ • Low cardinality   │ • Latency breakdown │
│ • Debug & audit     │ • Alerting & SLOs   │ • Dependencies      │
└─────────────────────┴─────────────────────┴─────────────────────┘
```

**Key Concepts:**
- **Logs**: Immutable records of discrete events
- **Metrics**: Numeric measurements aggregated over time
- **Traces**: End-to-end request flow across services

## 1. Logging Best Practices

### Structured Logging

**Structured logging** emits log records in a machine-parseable format (typically JSON) instead of free-form text.

| Aspect | Unstructured | Structured |
|--------|-------------|------------|
| Format | Free-form text | JSON/key-value pairs |
| Parsing | Regex required | Direct field access |
| Querying | Slow, error-prone | Fast, reliable |
| Context | Often missing | Rich metadata |

### Log Levels
```
DEBUG   → Detailed diagnostic info (dev only)
INFO    → Normal operations, milestones
WARNING → Unexpected but recoverable events
ERROR   → Failures requiring attention
CRITICAL→ System-wide failures
```
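
Since DEBUG output is meant for development only, the active level is usually chosen per environment rather than hard-coded. The sketch below shows one way to do that with the standard library. The `LOG_LEVEL` variable name, the `demo_service` logger name, and the fallback to INFO are assumptions made for illustration.

```python
import logging
import os
from typing import Optional

# Level names from the list above mapped to the stdlib constants.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_log_level(name: Optional[str], default: int = logging.INFO) -> int:
    """Translate a level name (case-insensitive) into a logging constant."""
    if not name:
        return default
    return _LEVELS.get(name.strip().upper(), default)


# LOG_LEVEL is a hypothetical variable name; unset or unknown values fall back to INFO.
level = resolve_log_level(os.environ.get("LOG_LEVEL"))
logging.getLogger("demo_service").setLevel(level)
```

Falling back to a default instead of raising keeps a typo in deployment configuration from preventing the service from starting.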

### Best Practices
1. **Use correlation IDs** for request tracing
2. **Log at service boundaries** (entry/exit points)
3. **Include context** (user_id, request_id, action)
4. **Never log sensitive data** (passwords, tokens, PII)
5. **Use consistent field names** across services

```python
import json
import logging
import uuid
from contextvars import ContextVar
from datetime import datetime, timezone

# Context variable for the correlation ID: each thread and each asyncio task
# sees its own value, so concurrent requests do not overwrite one another.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")


class StructuredFormatter(logging.Formatter):
    """JSON formatter for structured logging."""

    def format(self, record: logging.LogRecord) -> str:
        # Use the time the record was created, not the time it is formatted.
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        log_entry = {
            "timestamp": created.isoformat().replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno,
        }
        # Merge fields passed via extra={"extra_fields": {...}}
        if hasattr(record, "extra_fields"):
            log_entry.update(record.extra_fields)
        # default=str keeps non-JSON-serializable values from raising
        return json.dumps(log_entry, default=str)


# Configure logger
logger = logging.getLogger("order_service")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter())
logger.addHandler(handler)


# Example usage
def process_order(order_id: str, user_id: str):
    # Shortened to 8 characters for readability; keep the full UUID in
    # production to avoid collisions.
    correlation_id.set(str(uuid.uuid4())[:8])

    logger.info("Processing order", extra={"extra_fields": {"order_id": order_id, "user_id": user_id}})
    # ... business logic ...
    logger.info("Order completed", extra={"extra_fields": {"order_id": order_id, "status": "success"}})


process_order("ORD-12345", "USR-001")
```
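
Best practice 4 (never log sensitive data) is easier to uphold when it is enforced in one place instead of at every call site. The following is a minimal sketch of a `logging.Filter` that masks values before a formatter sees them. It assumes structured fields arrive on the record as an `extra_fields` dict, as in the example above. The `SENSITIVE_KEYS` deny-list and the `[REDACTED]` placeholder are illustrative choices, not a complete PII policy.

```python
import logging

# Hypothetical deny-list; a real service would maintain its own.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn"}


def redact(fields: dict) -> dict:
    """Return a copy of fields with sensitive values masked, recursing into nested dicts."""
    clean = {}
    for key, value in fields.items():
        if str(key).lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


class RedactingFilter(logging.Filter):
    """Mask sensitive keys in record.extra_fields before formatting."""

    def filter(self, record: logging.LogRecord) -> bool:
        fields = getattr(record, "extra_fields", None)
        if isinstance(fields, dict):
            record.extra_fields = redact(fields)
        return True  # never drop the record, only sanitize it
```

Attaching it with `handler.addFilter(RedactingFilter())` applies the masking to every record that passes through that handler. Key-based masking does not catch secrets embedded in the message string itself, so it complements code review and does not replace it.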

## 2. Metrics

Metrics are numeric measurements collected over time. They are **aggregatable** and ideal for alerting.

### Metric Types

```
┌─────────────────────────────────────────────────────────────────────────┐
│                           METRIC TYPES                                  │
├───────────────┬─────────────────────────────────────────────────────────┤
│   COUNTER     │ Monotonically increasing value (resets on restart)     │
│               │ Examples: requests_total, errors_total, bytes_sent     │
├───────────────┼─────────────────────────────────────────────────────────┤
│   GAUGE       │ Value that can go up or down                           │
│               │ Examples: temperature, queue_depth, active_connections │
├───────────────┼─────────────────────────────────────────────────────────┤
│   HISTOGRAM   │ Distribution of values in configurable buckets         │
│               │ Examples: request_duration, response_size              │
│               │ Provides: count, sum, bucket counts → percentiles      │
├───────────────┼─────────────────────────────────────────────────────────┤
│   SUMMARY     │ Similar to histogram but calculates quantiles          │
│               │ client-side (less flexible, rarely used now)           │
└───────────────┴─────────────────────────────────────────────────────────┘
```
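
To make the semantics of the first three types concrete, here is a toy, in-memory sketch written with the standard library only. It is not the API of any metrics client. The class and method names are assumptions, and the histogram uses inclusive upper bounds plus an implicit +Inf bucket, exposed as cumulative counts.

```python
import bisect
from typing import List, Sequence


class Counter:
    """Monotonically increasing total."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that can move in either direction."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


class Histogram:
    """Counts observations per bucket and tracks count and sum."""

    def __init__(self, buckets: Sequence[float]) -> None:
        self.bounds = sorted(buckets)                      # inclusive upper bounds
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, i.e. the smallest bucket that fits.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.sum += value

    def cumulative_counts(self) -> List[int]:
        """Counts of observations <= each bound, ending with the +Inf bucket."""
        running, out = 0, []
        for n in self.bucket_counts:
            running += n
            out.append(running)
        return out


latency = Histogram(buckets=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.1, 0.3, 2.0):
    latency.observe(seconds)
print(latency.cumulative_counts())  # [2, 3, 3, 4]
```

Because a histogram stores only bucket counts, percentiles derived from it are estimates whose accuracy depends on how well the bucket bounds match the real distribution.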

### Golden Signals (Google SRE)

| Signal | Description | Metric Type |
|--------|-------------|-------------|
| **Latency** | Time to serve a request | Histogram |
| **Traffic** | Requests per second | Counter |
| **Errors** | Rate of failed requests | Counter |
| **Saturation** | Resource utilization | Gauge |

### RED Method (Microservices)
- **R**ate: Requests per second
- **E**rrors: Failed requests per second  
- **D**uration: Request latency distribution
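
As a worked illustration of the three RED numbers, the sketch below computes them from raw request samples over a fixed window. In practice a metrics backend derives them from counters and histograms. The `Request` record, the `red_summary` helper, and the nearest-rank percentile method are assumptions chosen to keep the arithmetic visible.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    duration_s: float
    ok: bool


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]; values must be non-empty."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def red_summary(requests: List[Request], window_s: float) -> Dict[str, Optional[float]]:
    """Rate, Errors, and Duration for requests observed during window_s seconds."""
    durations = [r.duration_s for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": len(requests) / window_s,
        "errors_per_s": errors / window_s,
        "p50_s": percentile(durations, 50) if durations else None,
        "p95_s": percentile(durations, 95) if durations else None,
    }


# Ten requests in a 5-second window, two of them failed.
samples = [Request(duration_s=i / 10, ok=(i not in (3, 7))) for i in range(1, 11)]
print(red_summary(samples, window_s=5.0))
```

Reporting duration as percentiles instead of an average keeps a slow tail visible, which is the reason Duration is described as a distribution.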

## 3. Distributed Tracing

Distributed tracing tracks requests as they flow through multiple services.

```
                         DISTRIBUTED TRACE EXAMPLE
                         
    Trace ID: abc123 (spans entire request lifecycle)
    ═══════════════════════════════════════════════════════════════════
    
    ┌─────────────────────────────────────────────────────────────────┐
    │ Span A: API Gateway          [trace:abc123, span:001]           │
    │ ████████████████████████████████████████████████████  (200ms)   │
    │    │                                                            │
    │    ├──► ┌──────────────────────────────────────────────────┐    │
    │    │    │ Span B: Order Service   [parent:001, span:002]   │    │
    │    │    │ █████████████████████████████████████  (150ms)   │    │
    │    │    │    │                                             │    │
    │    │    │    ├──► ┌────────────────────────────────┐       │    │
    │    │    │    │    │ Span C: DB Query  [span:003]   │       │    │
    │    │    │    │    │ ████████████  (50ms)           │       │    │
    │    │    │    │    └────────────────────────────────┘       │    │
    │    │    │    │                                             │    │
    │    │    │    └──► ┌────────────────────────────────┐       │    │
    │    │    │         │ Span D: Cache    [span:004]    │       │    │
    │    │    │         │ ████  (10ms)                   │       │    │
    │    │    │         └────────────────────────────────┘       │    │
    │    │    └──────────────────────────────────────────────────┘    │
    └─────────────────────────────────────────────────────────────────┘
```

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Trace** | End-to-end journey of a request (collection of spans) |
| **Span** | Single unit of work with start time, duration, metadata |
| **Trace ID** | Unique identifier for the entire request flow |
| **Span ID** | Unique identifier for a single operation |
| **Parent Span ID** | Links child span to parent (creates hierarchy) |
| **Baggage** | Key-value pairs propagated across service boundaries |
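
The relationship between span IDs and parent span IDs can be shown with a small sketch that rebuilds the hierarchy from a flat list of spans, using the durations from the diagram above. The `Span` dataclass and helper names are hypothetical. The sketch also assumes that spans C and D are children of span B, as the diagram's nesting suggests, and that child spans run sequentially.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int


def render_tree(spans: List[Span]) -> List[str]:
    """Return indented lines showing the parent/child hierarchy."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f"{span.name} ({span.duration_ms}ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Span duration minus the summed duration of its direct children."""
    child_total: Dict[str, int] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


trace_spans = [
    Span("001", None, "API Gateway", 200),
    Span("002", "001", "Order Service", 150),
    Span("003", "002", "DB Query", 50),
    Span("004", "002", "Cache", 10),
]
print("\n".join(render_tree(trace_spans)))
print(self_times(trace_spans))
```

Self time separates the work a service does itself from the time it spends waiting on downstream calls. Here the Order Service accounts for 90ms of its 150ms span.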

### Context Propagation
Trace context is passed between services in HTTP headers, as defined by the W3C Trace Context standard:
```
traceparent: 00-{trace_id}-{span_id}-{flags}
tracestate: vendor=value (optional vendor-specific data)
```
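
A service that receives a `traceparent` header parses it, then sends a new header downstream that keeps the trace ID and substitutes its own span ID. The sketch below shows that round trip with the standard library only. The function names are hypothetical, it handles only version `00`, and it ignores `tracestate`, so treat it as an illustration and not a compliant implementation.

```python
import re
import secrets
from typing import NamedTuple, Optional


class TraceContext(NamedTuple):
    trace_id: str   # 32 lowercase hex characters
    span_id: str    # 16 lowercase hex characters
    sampled: bool   # lowest bit of the flags byte


_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
if ctx is not None:
    print(child_traceparent(ctx))
```

Returning `None` for a malformed header lets the caller start a new trace instead of propagating a corrupt one.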

## 4. Health Checks & Readiness Probes

Health checks let an orchestrator such as Kubernetes decide when to restart a container and when to route traffic to it.

```
┌──────────────────────────────────────────────────────────────────────┐
│                    KUBERNETES PROBE TYPES                            │
├──────────────────┬───────────────────────────────────────────────────┤
│  LIVENESS PROBE  │ "Is the container alive?"                         │
│                  │ • Failure → Container restart                     │
│                  │ • Check: Process running, not deadlocked          │
│                  │ • Endpoint: GET /health/live → 200 OK             │
├──────────────────┼───────────────────────────────────────────────────┤
│ READINESS PROBE  │ "Can the container serve traffic?"                │
│                  │ • Failure → Remove from load balancer             │
│                  │ • Check: Dependencies ready, warmed up            │
│                  │ • Endpoint: GET /health/ready → 200 OK            │
├──────────────────┼───────────────────────────────────────────────────┤
│  STARTUP PROBE   │ "Has the container started successfully?"         │
│                  │ • Disables liveness/readiness until success       │
│                  │ • For slow-starting containers                    │
│                  │ • Failure → Container restart                     │
└──────────────────┴───────────────────────────────────────────────────┘
```

### Health Check Best Practices

1. **Liveness**: Keep simple—don't check external dependencies
2. **Readiness**: Verify critical dependencies (DB, cache, queues)
3. **Response format**: Return JSON with component status
4. **Timeouts**: Set appropriate probe timeouts (the Kubernetes default `timeoutSeconds` of 1s may be too short)

```python
# Example response structure
{
    "status": "healthy",  # or "unhealthy", "degraded"
    "checks": {
        "database": {"status": "up", "latency_ms": 5},
        "redis": {"status": "up", "latency_ms": 2},
        "external_api": {"status": "degraded", "latency_ms": 500}
    }
}
```
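
One way to produce a response of this shape is to run each dependency check and fold the results into an overall status and HTTP code. The sketch below is framework-agnostic. The `run_readiness` name, the `"down"` status, and the choice to return 200 for a degraded service (so it keeps receiving traffic) are assumptions that a real service would adapt to its own policy.

```python
import json
from typing import Callable, Dict, Tuple

CheckResult = Dict[str, object]


def run_readiness(checks: Dict[str, Callable[[], CheckResult]]) -> Tuple[int, str]:
    """Run every check and return (http_status, json_body)."""
    results: Dict[str, CheckResult] = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:  # a crashing check counts as a failed dependency
            results[name] = {"status": "down", "error": str(exc)}

    # A result without a "status" key is treated as down.
    statuses = {str(r.get("status", "down")) for r in results.values()}
    if "down" in statuses:
        overall, code = "unhealthy", 503
    elif "degraded" in statuses:
        overall, code = "degraded", 200
    else:
        overall, code = "healthy", 200
    return code, json.dumps({"status": overall, "checks": results})


# Stand-in checks; real ones would ping the database, cache, and so on.
code, body = run_readiness({
    "database": lambda: {"status": "up", "latency_ms": 5},
    "redis": lambda: {"status": "up", "latency_ms": 2},
    "external_api": lambda: {"status": "degraded", "latency_ms": 500},
})
print(code, body)
```

A liveness endpoint would not call this function at all, in line with the advice to keep liveness free of external dependencies.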

## 5. OpenTelemetry (OTel)

OpenTelemetry is a **vendor-neutral** standard for collecting telemetry data (traces, metrics, logs).

```
┌─────────────────────────────────────────────────────────────────────────┐
│                     OPENTELEMETRY ARCHITECTURE                          │
└─────────────────────────────────────────────────────────────────────────┘
                                                                          
    ┌──────────────┐     ┌──────────────┐     ┌──────────────┐            
    │  Service A   │     │  Service B   │     │  Service C   │            
    │   (OTel SDK) │     │   (OTel SDK) │     │   (OTel SDK) │            
    └──────┬───────┘     └──────┬───────┘     └──────┬───────┘            
           │                    │                    │                    
           │    OTLP Protocol   │    OTLP Protocol   │                    
           ▼                    ▼                    ▼                    
    ┌─────────────────────────────────────────────────────────────┐       
    │                    OTEL COLLECTOR                           │       
    │  ┌──────────┐    ┌──────────────┐    ┌───────────────┐      │       
    │  │ Receivers│ →  │  Processors  │ →  │   Exporters   │      │       
    │  │(OTLP,etc)│    │(batch,filter)│    │(Jaeger,Prom,.)│      │       
    │  └──────────┘    └──────────────┘    └───────────────┘      │       
    └──────────────────────────┬──────────────────────────────────┘       
                               │                                          
           ┌───────────────────┼───────────────────┐                      
           ▼                   ▼                   ▼                      
    ┌────────────┐      ┌────────────┐      ┌────────────┐                
    │   Jaeger   │      │ Prometheus │      │   Loki     │                
    │  (Traces)  │      │ (Metrics)  │      │   (Logs)   │                
    └────────────┘      └────────────┘      └────────────┘                
```

### Core Components

| Component | Purpose |
|-----------|--------|
| **SDK** | Implements the API: sampling, span/metric processing, export |
| **API** | Vendor-neutral interface that application code calls to create spans/metrics |
| **Collector** | Receive, process, export telemetry |
| **OTLP** | OpenTelemetry Protocol (gRPC/HTTP) |

### Auto-Instrumentation
OTel provides automatic instrumentation for common libraries:
- HTTP clients/servers (requests, Flask, FastAPI)
- Database clients (psycopg2, pymongo, redis)
- Message queue clients (kafka-python for Kafka, pika for RabbitMQ)

```python
# OpenTelemetry basic setup (conceptual example)
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

# Configure the tracer
resource = Resource.create({"service.name": "order-service", "service.version": "1.0.0"})
provider = TracerProvider(resource=resource)

# Export spans to console (use OTLP exporter for production)
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Get a tracer
tracer = trace.get_tracer(__name__)


# Example: Creating spans
def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)

        # Nested span for validation
        with tracer.start_as_current_span("validate_payment"):
            # ... validation logic ...
            pass

        # Nested span for charging; bind it to a name so the event is
        # recorded on charge_card rather than on the outer process_payment span
        with tracer.start_as_current_span("charge_card") as charge_span:
            # ... payment processing ...
            charge_span.add_event("payment_successful", {"transaction_id": "TXN-456"})


print("OpenTelemetry configured! Run process_payment() to see trace output.")
```

## Summary: Observability Stack

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    MODERN OBSERVABILITY STACK                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   COLLECTION          STORAGE             VISUALIZATION                 │
│   ──────────          ───────             ─────────────                 │
│                                                                         │
│   OpenTelemetry  ───► Prometheus    ───►  Grafana (Metrics)             │
│   (Unified SDK)       (Metrics)                                         │
│                                                                         │
│   Structured     ───► Elasticsearch ───►  Kibana (Logs)                 │
│   Logging             / Loki                                            │
│                                                                         │
│   OTel Tracing   ───► Jaeger/Tempo  ───►  Jaeger UI / Grafana (Traces)  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Key Takeaways

| Topic | Key Points |
|-------|------------|
| **Logging** | Use structured (JSON), include correlation IDs, never log secrets |
| **Metrics** | Counter (totals), Gauge (current), Histogram (distributions) |
| **Tracing** | Trace ID links spans across services, propagate via headers |
| **Health Checks** | Liveness (restart), Readiness (traffic), keep liveness simple |
| **OpenTelemetry** | Vendor-neutral standard, auto-instrumentation, use Collector |