# Telemetry & Display (Observability)

**Telemetry** is data about how a system behaves in production. **Display** is how humans consume it (dashboards, search, alerts) to operate and improve the system.


## Goals
- Know the main telemetry signals: **logs**, **metrics**, **traces** (and events).
- Understand the end-to-end flow: instrument -> collect -> store -> query -> visualize -> alert.
- See where **ELK (Elasticsearch + Logstash + Kibana)** and **Prometheus + Grafana** fit.
- Learn a practical incident workflow with examples.


## Prerequisites
- Basic Linux/process concepts (services, files, stdout).
- Basic networking (HTTP, ports).
- Basic distributed-systems vocabulary (latency, availability, scaling).


## The 3 main signals
- **Logs**: discrete records ("something happened") with context. Best for debugging and forensics.
- **Metrics**: numeric time series ("how much / how often") like request rate, error rate, CPU, p95 latency. Best for alerting and trend analysis.
- **Traces**: end-to-end request paths across services (spans) with timing. Best for finding where time is spent.

A good mental model:
- Metrics tell you **something is wrong**.
- Traces help you find **where**.
- Logs tell you **why**.


## Why collect telemetry?
- **Detect** problems early (alerts on SLOs and golden signals).
- **Diagnose** incidents faster (search logs, drill down by service/region/version).
- **Validate** changes (did the new deploy increase error rate/latency?).
- **Plan** capacity (what will we need next month?).
- **Support** security/compliance (audit trails, anomaly detection).

Common SRE-style "golden signals": **latency**, **traffic**, **errors**, **saturation**.
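As a rough illustration of three of the four golden signals, the sketch below summarizes one time window of request records. The `Request` shape and the `golden_signals` name are assumptions made for this example. Saturation is left out because it comes from resource data (CPU, memory, queue depth), not from request records.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One served request; a hypothetical record shape for this sketch."""
    duration_s: float  # how long the request took
    status: int        # HTTP status code


def golden_signals(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize traffic, errors, and latency for one time window."""
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    if len(durations) >= 2:
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        # "inclusive" interpolates inside the observed range only.
        p95 = quantiles(durations, n=20, method="inclusive")[-1]
    else:
        p95 = durations[0] if durations else 0.0
    return {
        "traffic_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests) if requests else 0.0,
        "latency_p95_s": p95,
    }
```

In a real system these numbers are usually computed by the metrics backend from counters and histograms, not from raw request lists.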


## Typical architecture (high level)
A practical pipeline looks like:

```text
[Apps] -> [Agent/Collector] -> [Storage] -> [Query/Viz] -> [Alerts]
```

Examples of each stage:
- **Apps**: produce structured logs, expose metrics, emit traces.
- **Agent/Collector**: Fluent Bit/Filebeat/OpenTelemetry Collector.
- **Storage**: Elasticsearch (logs), Prometheus (metrics), tracing backend.
- **Query/Viz**: Kibana (logs), Grafana (dashboards).
- **Alerts**: Alertmanager / Grafana Alerting / Kibana alerting.


## Where the ELK stack fits (logs)
ELK is commonly used for **centralized log search + analysis**:

```text
app -> stdout/file -> shipper -> Logstash -> Elasticsearch -> Kibana
```

- **Logstash**: parses/transforms/enriches events.
- **Elasticsearch**: indexes and stores events for fast search/aggregations.
- **Kibana**: UI for searching and visualizing log data.

Practical example: "show me all 5xx errors for checkout in eu-west-1 after the last deploy".
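One way to express that question is an Elasticsearch Query DSL body, sketched here as a plain Python dict. The field names (`service`, `region`, `http.status`, `@timestamp`) and the deploy timestamp are assumptions about how the logs were indexed, and the `term` clauses assume those fields are mapped as `keyword`.

```python
import json


def build_5xx_query(service: str, region: str, deployed_at: str) -> dict:
    """Build a Query DSL body for 5xx logs of one service since a deploy.

    Field names are assumptions about the index mapping, not an ELK standard.
    """
    return {
        "query": {
            "bool": {
                # "filter" clauses must all match and do not affect scoring.
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"region": region}},
                    {"range": {"http.status": {"gte": 500, "lt": 600}}},
                    {"range": {"@timestamp": {"gte": deployed_at}}},
                ]
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],  # newest first
        "size": 50,
    }


if __name__ == "__main__":
    # Hypothetical deploy time, for illustration only.
    body = build_5xx_query("checkout", "eu-west-1", "2026-01-07T20:00:00Z")
    print(json.dumps(body, indent=2))
```

The resulting body can be sent to an index's `_search` endpoint or adapted into a Kibana saved search.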


## Where Prometheus + Grafana fits (metrics)
Prometheus is commonly used for **metrics collection + alerting**, Grafana for **dashboards**:

```text
app/exporter -> /metrics <- Prometheus -> (Alertmanager) -> paging
                              \-> Grafana dashboards
```

- **Prometheus**: scrapes metrics endpoints and stores labeled time series.
- **Grafana**: visualizes metrics (and can query logs too).
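To make "scrapes metrics endpoints" concrete, the sketch below renders one counter in the Prometheus text exposition format, which is what a `/metrics` endpoint returns. The `render_counter` helper is hypothetical and written for this example. Real services would normally let a client library produce this output.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(
    name: str,
    help_text: str,
    samples: list[tuple[dict[str, str], float]],
) -> str:
    """Render one counter family: HELP and TYPE lines, then one line per label set."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        label_part = f"{{{pairs}}}" if pairs else ""
        lines.append(f"{name}{label_part} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"method": "POST", "route": "/pay"}, 3)],
    ), end="")
```

Each distinct label set becomes its own line, and therefore its own time series once Prometheus stores it.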

Practical example: alert when error rate > 2% for 10 minutes; then drill into latency by endpoint.
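The "for 10 minutes" part matters: the condition has to hold continuously before anyone is paged. The sketch below imitates that pending-then-firing behavior over a list of evaluations. The function name and the sample format are assumptions for illustration. It is not how Prometheus is configured (that is done with alerting rules).

```python
from typing import Iterable, Optional


def first_firing_time(
    samples: Iterable[tuple[float, float]],
    threshold: float = 0.02,
    hold_s: float = 600.0,
) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, or None.

    samples: (timestamp_seconds, error_ratio) pairs in time order,
    one per rule evaluation.
    """
    pending_since: Optional[float] = None
    for ts, error_ratio in samples:
        if error_ratio > threshold:
            if pending_since is None:
                pending_since = ts      # condition just became true: pending
            if ts - pending_since >= hold_s:
                return ts               # held long enough: firing
        else:
            pending_since = None        # any dip below the threshold resets the clock
    return None
```

A short blip below the threshold restarts the wait, which is what keeps brief spikes from paging.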


## How you connect signals (correlation)
The main trick is to make telemetry **joinable**:
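- Put a stable **service name**, **env**, **region**, and **version** on logs, metrics, and traces. Put **request_id/trace_id** on logs and traces only: as metric labels they would create a new time series for every request.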
- Prefer **structured logs** (JSON) over free-form strings.

Example: include a trace ID in every log record:
```text
{
  "ts": "2026-01-07T20:13:11Z",
  "level": "error",
  "service": "checkout",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "msg": "payment provider timeout",
  "http": {"method": "POST", "path": "/pay", "status": 504}
}
```
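A record like the one above can be produced with Python's standard `logging` module and a small JSON formatter. This is a minimal sketch: the `JsonFormatter` and `make_logger` names are made up for this example, and the per-request fields are passed through `extra`.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            # Fields passed via `extra=` become attributes on the record.
            "service": getattr(record, "service", None),
            "trace_id": getattr(record, "trace_id", None),
            "msg": record.getMessage(),
        }
        http = getattr(record, "http", None)
        if http is not None:
            payload["http"] = http
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger(sys.stdout)
    log.error(
        "payment provider timeout",
        extra={
            "service": "checkout",
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "http": {"method": "POST", "path": "/pay", "status": 504},
        },
    )
```

Writing one JSON object per line to stdout keeps the application unaware of the shipper that later collects it.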


## Pseudocode: instrument + emit telemetry
A minimal pattern in most services:

```text
function handle_request(req):
  # Start a trace so downstream calls and log lines share one trace id.
  trace = tracer.start_trace(req)
  with logger.context(trace_id=trace.id, request_id=req.id, service=SERVICE, version=VERSION):
    # Count every request; keep labels low-cardinality (method and route, never user ids).
    metrics.counter("http_requests_total", labels={method: req.method, route: req.route}).inc()
    timer = metrics.histogram("http_request_duration_seconds", labels={route: req.route}).start_timer()
    try:
      resp = do_work(req)
      logger.info("request_complete", status=resp.status)
      return resp
    except Exception as e:
      # Count and log the failure, then re-raise so the caller still sees it.
      metrics.counter("http_errors_total", labels={route: req.route}).inc()
      logger.error("request_failed", error=str(e))
      raise
    finally:
      # Runs on success and failure: record the duration and close the trace.
      timer.observe_duration()
      trace.end()
```
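The same pattern can be run with only the Python standard library. In this sketch the counters and the duration list are in-memory stand-ins for a real metrics client, and the trace ID is just a random hex string. All names here are assumptions for illustration, not a library API.

```python
import logging
import time
import uuid
from collections import Counter, defaultdict

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("checkout")

# In-memory stand-ins for a metrics client (hypothetical, illustration only).
request_count: Counter = Counter()   # (method, route) -> number of requests
error_count: Counter = Counter()     # route -> number of failed requests
durations_s = defaultdict(list)      # route -> observed durations in seconds


def handle_request(method: str, route: str, do_work):
    """Wrap do_work() with a request counter, a timer, and trace-tagged logs."""
    trace_id = uuid.uuid4().hex      # 32 hex characters, used to join log lines
    request_count[(method, route)] += 1
    start = time.perf_counter()
    try:
        status = do_work()
        log.info("request_complete trace_id=%s route=%s status=%s", trace_id, route, status)
        return status
    except Exception as exc:
        error_count[route] += 1
        log.error("request_failed trace_id=%s route=%s error=%s", trace_id, route, exc)
        raise
    finally:
        # Runs on success and failure, so every request is timed.
        durations_s[route].append(time.perf_counter() - start)


if __name__ == "__main__":
    handle_request("POST", "/pay", lambda: 200)
    print(dict(request_count), dict(error_count))
```

Swapping the stand-ins for a real metrics and tracing client changes the calls, not the shape: count on entry, time in `finally`, log with the trace ID.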


## Example incident workflow (end-to-end)
1. **Prometheus alert**: "checkout error rate > 2% for 10m".
2. Open **Grafana**: confirm spike, identify top failing route and region.
3. Open **Kibana**: search logs with a KQL query such as `service:checkout AND http.status >= 500`, filtered to that region/time.
4. Use `trace_id`/`request_id` from log events to correlate to traces (if present).
5. Roll back / fix / mitigate; then validate that metrics return to normal and error logs stop appearing.
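Step 4 is essentially a join on `trace_id`. The sketch below takes error log events and a list of spans and returns, for each failed trace, the slowest leaf span (a span with no children), which is often where the time went. The record shapes are assumptions made for this example; real tracing backends expose richer span data.

```python
from collections import defaultdict


def slowest_leaf_span_per_failed_trace(
    log_events: list[dict],
    spans: list[dict],
) -> dict[str, dict]:
    """Map each trace_id found in error logs to that trace's slowest leaf span.

    Assumed shapes: logs carry "level" and "trace_id"; spans carry
    "trace_id", "span_id", "parent_id", "name", and "duration_ms".
    """
    failed = {
        e["trace_id"]
        for e in log_events
        if e.get("level") == "error" and e.get("trace_id")
    }
    # A span is a leaf if no other span names it as parent.
    parent_ids = {s["parent_id"] for s in spans if s["parent_id"] is not None}
    by_trace = defaultdict(list)
    for span in spans:
        if span["trace_id"] in failed and span["span_id"] not in parent_ids:
            by_trace[span["trace_id"]].append(span)
    return {
        trace_id: max(group, key=lambda s: s["duration_ms"])
        for trace_id, group in by_trace.items()
    }
```

Leaf spans are used because a parent span's duration always includes its children, so the root would otherwise win every time.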


## Pitfalls (common)
- **Unstructured logs**: hard to search and aggregate.
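- **High-cardinality labels** in metrics (like user_id): every distinct label value creates a new time series, so Prometheus memory and query cost can grow out of control.
- **No sampling strategy** for traces: storing every trace becomes too expensive at high traffic.
- **No retention/ILM** (index lifecycle management): storage costs grow without control.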
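Two of these pitfalls are easy to quantify. The sketch below estimates the worst-case number of time series for one metric from its label cardinalities, and shows a simple deterministic head-sampling decision based on the trace ID. Both helpers and the example numbers are assumptions for illustration.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same trace_id always gets the same decision."""
    # Map the low 32 bits of the hex trace id to a number in [0, 1).
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < sample_rate


if __name__ == "__main__":
    labels = {"method": 5, "route": 20, "status": 10}
    print(series_count(labels))                            # bounded label set
    print(series_count({**labels, "user_id": 100_000}))    # same metric plus user_id
    print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 0.10))
```

Because the decision depends only on the trace ID, every service that sees the same trace makes the same keep-or-drop choice without coordination.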

## References
- Elastic Stack overview: https://www.elastic.co/what-is/elk-stack
- Prometheus docs: https://prometheus.io/docs/introduction/overview/
- Grafana docs: https://grafana.com/docs/grafana/latest/
- OpenTelemetry: https://opentelemetry.io/docs/
