# Prometheus

**Prometheus** is an open-source monitoring system and time-series database. It collects **metrics** (numeric time series) by scraping HTTP endpoints and lets you query them with **PromQL** for dashboards and alerting.


## Goals
- Understand the Prometheus model: **metrics + labels + scraping**.
- Learn how apps expose `/metrics` and how Prometheus collects them.
- See PromQL examples and a realistic alert rule.


## What is it?
- A **time-series database (TSDB)** optimized for metrics.
- A **pull-based collector**: Prometheus scrapes targets on a schedule.
- A query language (**PromQL**) for slicing/aggregating time series.

A metric is identified by:
- a metric name (e.g., `http_requests_total`)
- a set of **labels** (e.g., `{service="checkout", method="POST", status="200"}`)
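
To make the identity rule concrete, here is a minimal Python sketch using only the standard library. `series_key`, `append_sample`, and the in-memory `store` are hypothetical helpers for illustration, not part of Prometheus or its client libraries. The sketch shows that the same metric name with different label values yields separate series, and that the order in which labels are written does not matter.

```python
def series_key(name, labels):
    """Return a hashable identity for a series: metric name plus its label set."""
    # Sorting makes the key independent of the order the labels were written in.
    return (name, tuple(sorted(labels.items())))

# A toy in-memory store: one list of (timestamp, value) samples per series.
store = {}

def append_sample(name, labels, timestamp, value):
    store.setdefault(series_key(name, labels), []).append((timestamp, value))

append_sample("http_requests_total",
              {"service": "checkout", "method": "POST", "status": "200"}, 0, 10)
# Same labels in a different order: still the same series.
append_sample("http_requests_total",
              {"status": "200", "method": "POST", "service": "checkout"}, 15, 12)
# A different status value: a new, separate series.
append_sample("http_requests_total",
              {"service": "checkout", "method": "POST", "status": "500"}, 15, 1)

print(len(store))  # 2
```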


## Why is it used?
- **Alerting** on SLOs / golden signals (error rate, p95 latency, saturation).
- **Trend analysis** (capacity, traffic growth).
- **Fast aggregation** across dimensions (service, region, route) when labels are well-designed.

Prometheus is especially common in Kubernetes and cloud-native stacks.


## Architecture (common components)
- **Prometheus server**: scrapes targets and stores time series.
- **Exporters**: expose metrics for systems that cannot be instrumented directly (e.g., `node_exporter` for host metrics, `postgres_exporter` for PostgreSQL).
- **Service discovery**: finds targets dynamically (Kubernetes, Consul).
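- **Recording rules**: precompute expensive or frequently used expressions and store the results as new time series, so dashboards and alerts can query them cheaply.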
- **Alertmanager**: routes alerts to email/Slack/PagerDuty with grouping/silencing.


## How it is used
### 1) Instrument your app
Use a Prometheus client library to create counters/gauges/histograms and expose `/metrics`.

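A sketch using the Python client (`prometheus_client`); `SERVICE`, `do_work`, and the `req`/`resp` objects stand in for your application code:
```python
from prometheus_client import Counter, Histogram, start_http_server

SERVICE = "checkout"
requests = Counter("http_requests_total", "HTTP requests handled",
                   ["service", "route", "status"])
latency = Histogram("http_request_duration_seconds", "Request duration in seconds",
                    ["service", "route"])

def handle_request(req):
    status = "500"  # recorded if do_work raises before a response exists
    # time() observes elapsed seconds on exit, even when an exception propagates.
    with latency.labels(service=SERVICE, route=req.route).time():
        try:
            resp = do_work(req)  # application logic, defined elsewhere
            status = str(resp.status)
            return resp
        finally:
            requests.labels(service=SERVICE, route=req.route, status=status).inc()

start_http_server(8080)  # serves the metrics over HTTP on port 8080
```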

Example of what `/metrics` can look like:
```text
# TYPE http_requests_total counter
http_requests_total{service="checkout",route="/pay",status="200"} 12034
http_requests_total{service="checkout",route="/pay",status="500"} 42
```
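
To show how little structure this text format needs, the following is a rough Python sketch that parses lines like the ones above into `(name, labels, value)` tuples. `parse_metrics` is a hypothetical helper written for this note; it deliberately ignores escaping inside label values, optional timestamps, and other details of the real exposition format, so treat it as an illustration rather than a replacement for a proper parser.

```python
import re

# Matches lines such as: name{label="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def parse_metrics(text):
    """Parse simple exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples

text = '''# TYPE http_requests_total counter
http_requests_total{service="checkout",route="/pay",status="200"} 12034
http_requests_total{service="checkout",route="/pay",status="500"} 42
'''

for name, labels, value in parse_metrics(text):
    print(name, labels["status"], value)
```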


### 2) Configure scraping
A minimal `prometheus.yml`:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'checkout'
    static_configs:
      - targets: ['checkout:8080']
```

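In Kubernetes, targets are usually discovered automatically: `kubernetes_sd_configs` lists pods, services, or endpoints, and relabeling rules select and rewrite targets based on their labels/annotations.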
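
As an illustration of what a static scrape config resolves to, the sketch below mirrors the YAML above as a Python dict and derives the URL Prometheus would request for each target. `scrape_urls` is a hypothetical helper, and the dict is hand-written rather than parsed from the file. The fallbacks used here (`http`, `/metrics`, and a `1m` interval) follow Prometheus's documented defaults.

```python
# Hand-written mirror of the prometheus.yml shown above.
config = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {"job_name": "checkout", "static_configs": [{"targets": ["checkout:8080"]}]},
    ],
}

def scrape_urls(cfg):
    """Yield (job, url, interval) for every statically configured target."""
    default_interval = cfg.get("global", {}).get("scrape_interval", "1m")
    for job in cfg.get("scrape_configs", []):
        scheme = job.get("scheme", "http")          # default scheme
        path = job.get("metrics_path", "/metrics")  # default metrics path
        interval = job.get("scrape_interval", default_interval)
        for group in job.get("static_configs", []):
            for target in group.get("targets", []):
                yield job["job_name"], f"{scheme}://{target}{path}", interval

for job, url, interval in scrape_urls(config):
    print(job, url, interval)  # checkout http://checkout:8080/metrics 15s
```

Prometheus also attaches `job` and `instance` labels to every series it scrapes, so the job name and target address shown here end up as queryable labels.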


### 3) Query with PromQL (examples)
- Requests per second (RPS):
```promql
sum(rate(http_requests_total{service="checkout"}[5m]))
```

- Error rate (ratio):
```promql
sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
  /
sum(rate(http_requests_total{service="checkout"}[5m]))
```
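
To build intuition for what `rate()` does with a counter, here is a simplified Python sketch. `simple_rate` is a hypothetical helper and the sample values are made up; the real `rate()` additionally extrapolates to the edges of the range window, so its results can differ slightly. The sketch does show the two key ideas: per-second increase over the window, and treating a drop in the counter as a reset.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the window covered by `samples`.

    `samples` is a time-ordered list of (timestamp_seconds, value) pairs.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a process restart);
        # count the new value as growth from zero.
        increase += (curr - prev) if curr >= prev else curr
    return increase / elapsed

# Hypothetical samples scraped every 15s over one minute.
all_requests = [(0, 1000), (15, 1030), (30, 1060), (45, 1090), (60, 1120)]
errors_5xx = [(0, 40), (15, 41), (30, 41), (45, 42), (60, 43)]

rps = simple_rate(all_requests)              # 120 requests / 60 s
error_ratio = simple_rate(errors_5xx) / rps  # (3 / 60) / rps
print(rps, error_ratio)
```

Dividing one rate by another, as in the error-ratio query, cancels the "per second" unit and leaves a plain fraction.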

- p95 latency (histogram):
```promql
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)
)
```
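
The p95 value is an estimate interpolated from bucket counts, not an exact percentile. The following Python sketch approximates the idea behind `histogram_quantile` for classic histograms: find the bucket containing the target rank, then interpolate linearly inside it. The function below is a simplified re-implementation written for this note (it assumes `0 < q <= 1` and well-formed, sorted cumulative buckets), and the bucket values are invented.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) sorted by upper
    bound and ending with the (math.inf, total) bucket, like the `le` series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan

# Hypothetical cumulative counts per `le` bound (seconds).
buckets = [(0.1, 50), (0.25, 80), (0.5, 94), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # about 0.6 (seconds)
```

Because of the even-spread assumption, accuracy depends on bucket boundaries: the estimate can be off by up to the width of the bucket the quantile lands in.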


## Example alert rule (YAML)
Alert when the 5xx error ratio stays above 2% for 10 minutes:

```yaml
groups:
- name: checkout.rules
  rules:
  - alert: High5xxRate
    expr: |
      (sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
        /
       sum(rate(http_requests_total{service="checkout"}[5m]))) > 0.02
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "Checkout 5xx rate > 2%"
```
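
The `for: 10m` clause means the expression must hold continuously before the alert fires; until then the alert is only pending. The Python sketch below models that state machine. `alert_states` is a hypothetical helper, the evaluation interval and ratios are invented, and newer options such as `keep_firing_for` are not modelled.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp, condition_is_true) evaluations to alert states.

    States use Prometheus naming: inactive, pending, firing.
    """
    states = []
    active_since = None  # timestamp at which the condition first became true
    for timestamp, is_true in evaluations:
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states

# Hypothetical error ratios, evaluated every 5 minutes against the 2% threshold.
ratios = [0.01, 0.03, 0.03, 0.01, 0.03, 0.03, 0.03]
evaluations = [(i * 300, ratio > 0.02) for i, ratio in enumerate(ratios)]

print(alert_states(evaluations, for_seconds=600))
# ['inactive', 'pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

The brief recovery in the middle resets the timer, which is why a `for` duration suppresses alerts on short spikes.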


## Pitfalls
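- **High-cardinality labels** (`user_id`, `request_id`): every distinct combination of label values is a separate time series, so unbounded labels can exhaust Prometheus memory and slow down queries.
- Metrics are for aggregates; use logs/traces for per-request debugging.
- Scraping assumes Prometheus can reach its targets; use push patterns only when necessary (e.g., Pushgateway for short-lived batch jobs that exit before they can be scraped).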
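
A quick back-of-the-envelope calculation shows why cardinality matters. In the Python sketch below, `series_count` is a hypothetical helper and the label counts are invented; it computes the worst-case number of series for one metric as the product of distinct values per label.

```python
import math

def series_count(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())

bounded = {"service": 5, "route": 20, "status": 6}
unbounded = {**bounded, "user_id": 100_000}

print(series_count(bounded))    # 600
print(series_count(unbounded))  # 60000000
```

For a histogram the count grows further, since each label combination gets one series per bucket plus the `_sum` and `_count` series.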

## Exercises
- Define 3 golden-signal panels for a service (RPS, error rate, p95 latency).
- Write one alert rule and route it through Alertmanager.

## References
- Prometheus overview: https://prometheus.io/docs/introduction/overview/
- PromQL basics: https://prometheus.io/docs/prometheus/latest/querying/basics/
