# Grafana

**Grafana** is an observability UI for building dashboards and alerts across many data sources (Prometheus, Elasticsearch, Loki, Tempo, etc.). It is commonly paired with Prometheus for metrics visualization.


## Goals
- Understand what Grafana is responsible for (dashboards, exploration, alerting).
- See how it connects to Prometheus/Elasticsearch.
- Learn common dashboard patterns with example queries.


## Why is it used?
- A central place to **visualize** metrics (and often logs/traces too).
- **Shareable dashboards** (teams, on-call, incident channels).
- **Alerting** based on queries with routing and notification policies.
- Dashboard features that cut duplication and add context: template variables, annotations (for example deploy markers), and transformations.


## How it is used (typical workflow)
1. Add a **data source** (Prometheus is the most common).
2. Build dashboard panels by writing queries (PromQL for Prometheus).
3. Add **variables** (service, env, region) so one dashboard works everywhere.
4. Add alerts for critical panels and route notifications.
5. Use annotations to correlate deploys/incidents with metrics changes.
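
Step 5 can be automated from a deploy pipeline. The sketch below builds a deploy-marker annotation and prepares a request for Grafana's annotations HTTP endpoint (`POST /api/annotations`). It is a minimal illustration, not a complete client: the base URL, the token, the service name, and the tag scheme are all hypothetical, and the request is prepared but deliberately not sent.

```python
import json
import time
import urllib.request


def build_deploy_annotation(service, version, when_epoch_s=None):
    """Return the JSON body for a deploy-marker annotation."""
    if when_epoch_s is None:
        when_epoch_s = time.time()
    return {
        "time": int(when_epoch_s * 1000),  # annotation time in epoch milliseconds
        "tags": ["deploy", f"service:{service}"],  # hypothetical tag scheme
        "text": f"Deployed {service} {version}",
    }


def build_request(base_url, token, payload):
    """Prepare (but do not send) a POST to the annotations endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values for illustration only.
payload = build_deploy_annotation("checkout", "v1.4.2", when_epoch_s=1_700_000_000)
request = build_request("http://grafana:3000", "HYPOTHETICAL_TOKEN", payload)
print(request.get_method(), request.full_url)
print(json.dumps(payload, indent=2))
# To actually create the annotation: urllib.request.urlopen(request)
```

A dashboard can then show these markers by adding an annotation query that filters on the `deploy` tag.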


## Example: a "service health" dashboard (Prometheus)
Common panels:
- **RPS** (requests per second):
```promql
sum(rate(http_requests_total{service="$service"}[5m]))
```

- **Error ratio** (fraction of requests answered with a 5xx status):
```promql
sum(rate(http_requests_total{service="$service",status=~"5.."}[5m]))
  /
sum(rate(http_requests_total{service="$service"}[5m]))
```

- **p95 latency**:
```promql
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
)
```
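
To read the p95 panel correctly it helps to know that `histogram_quantile` estimates the quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The Python sketch below is a simplified model of that idea, not Prometheus's actual implementation: it ignores several edge cases, and the bucket bounds and counts are made-up sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes the highest bucket has upper bound +Inf and
    that the lowest bucket starts at 0.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0 or not 0 <= q <= 1:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts per `le` bound, in seconds.
sample = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(round(histogram_quantile(0.90, sample), 3))  # prints 0.417
print(histogram_quantile(0.995, sample))           # prints 1.0
```

The result is only as precise as the bucket layout: a quantile that falls in a wide bucket, or in the `+Inf` bucket, is an approximation bounded by the bucket edges.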


## Alerting (conceptual)
Conceptually, each Grafana alert rule runs a loop like this:

```text
every N minutes:
  value = evaluate(query)
  if value crosses threshold for M minutes:
    send notification to oncall
```
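
The "for M minutes" part of the loop above is a small state machine: a breach first puts the rule into a pending state, and the rule only fires once the breach has lasted for the whole pending period. The Python sketch below models that behaviour under simplifying assumptions. The class, field names, and state names are hypothetical and only loosely mirror Grafana's own alert states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float
    for_seconds: float                      # the "M minutes" pending period
    breach_started: Optional[float] = None  # when the current breach began

    def evaluate(self, value: float, now: float) -> str:
        """Feed one evaluation result and return the resulting state."""
        if value <= self.threshold:
            # Back under the threshold: reset any pending or firing breach.
            self.breach_started = None
            return "normal"
        if self.breach_started is None:
            self.breach_started = now
        if now - self.breach_started >= self.for_seconds:
            return "firing"
        return "pending"


# Error ratio sampled every 60 s; threshold 5%, pending period 2 minutes.
rule = AlertRule(threshold=0.05, for_seconds=120)
samples = [(0, 0.01), (60, 0.08), (120, 0.09), (180, 0.10), (240, 0.02)]
states = [rule.evaluate(value, now) for now, value in samples]
print(states)  # ['normal', 'pending', 'pending', 'firing', 'normal']
```

A notification would be sent on the transition into `firing`, not on every evaluation, which is what keeps a short spike from paging anyone.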

Good practice: page on symptoms of **user impact** (error ratio, latency) rather than on causes such as high CPU usage, which may not affect users at all.


## Provisioning (example snippets)
Grafana can be configured as code: at startup it reads YAML files from its provisioning directory (for example `provisioning/datasources/`).

Example: datasource provisioning
```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
    isDefault: true
```

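You can also provision dashboards (JSON files loaded through a dashboard provider) and alerting resources such as alert rules, contact points, and notification policies.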
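
Dashboard JSON is tedious to write by hand, so it is often generated. The Python sketch below assembles a minimal "service health" dashboard with a `service` variable and the three panels from the example above. Treat it as an illustration under assumptions: the field names follow the dashboard JSON model as commonly exported, the `uid` and titles are hypothetical, and a real export contains more fields, so compare against a dashboard exported from your own Grafana version before provisioning it.

```python
import json

SELECTOR = '{service="$service"}'


def panel(panel_id, title, expr, x):
    """Return one time-series panel placed on the 24-column grid."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": {"h": 8, "w": 8, "x": x, "y": 0},
        "targets": [{"refId": "A", "expr": expr}],
    }


def service_health_dashboard():
    rps = f"sum(rate(http_requests_total{SELECTOR}[5m]))"
    errors = (
        'sum(rate(http_requests_total{service="$service",status=~"5.."}[5m]))'
        f" / {rps}"
    )
    p95 = (
        "histogram_quantile(0.95, "
        f"sum(rate(http_request_duration_seconds_bucket{SELECTOR}[5m])) by (le))"
    )
    return {
        "uid": "service-health",  # hypothetical UID
        "title": "Service health",
        "templating": {
            "list": [
                {
                    "name": "service",
                    "type": "query",
                    # Populate the dropdown from the label's known values.
                    "query": "label_values(http_requests_total, service)",
                }
            ]
        },
        "panels": [
            panel(1, "RPS", rps, x=0),
            panel(2, "Error ratio", errors, x=8),
            panel(3, "p95 latency", p95, x=16),
        ],
    }


dashboard = service_health_dashboard()
print(json.dumps(dashboard, indent=2))
```

Writing the printed JSON to a file inside a provisioned dashboards folder keeps the dashboard in version control alongside the service it monitors.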


## Pitfalls
- Dashboards without variables often get duplicated per env/service.
- Alerts that page on everything cause alert fatigue.
- Dashboards that show only infrastructure stats and never encode your SLOs hide whether users are actually affected.

## Exercises
- Create a dashboard with a `service` variable and three golden-signal panels (traffic, errors, latency).
- Add an annotation stream for deployments.

## References
- Grafana docs: https://grafana.com/docs/grafana/latest/
- Alerting docs: https://grafana.com/docs/grafana/latest/alerting/
