v1.3.0 — Prometheus Observability
What's New
Full Prometheus metrics instrumentation for both the webhook server and the Kafka consumer worker.
New Metrics
| Metric | Type | Description |
|---|---|---|
| `aiops_alerts_processed_total` | Counter | Total alerts processed, labelled by outcome (`no_action`, `auto_fixed`, `escalated`) |
| `aiops_alert_duration_seconds` | Histogram | End-to-end alert processing time from `fetch_logs` to the terminal node |
| `aiops_fix_verified_total` | Counter | PCF config-fix verification results, labelled by result (`success`, `failure`) |
| `aiops_safety_gate_rejected_total` | Counter | Times `execute_tool` refused to act because the PLMN or fixable-typo whitelist check failed |
| `aiops_llm_duration_seconds` | Histogram | qwen-max LLM call latency |
| `aiops_llm_tokens_total` | Counter | LLM token usage, labelled by type (`prompt`, `completion`) |
| `aiops_rag_duration_seconds` | Histogram | RAG knowledge-base query latency per alert |
| `aiops_rag_chunks_returned` | Histogram | Number of unique RAG chunks returned per alert |
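
For orientation, here is a minimal sketch of how a few of the metrics in the table could be declared and updated with `prometheus_client`. It is not the contents of `agent/metrics.py`: the Python variable names, the help strings, and the use of a private `CollectorRegistry` are assumptions made for illustration, and the label names (`outcome`, `type`) are taken from the descriptions above.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# Assumption: a private registry keeps this sketch isolated from the
# process-wide default registry; the real module may use the default one.
registry = CollectorRegistry()

# Variable names below are hypothetical; metric names come from the table.
ALERTS_PROCESSED = Counter(
    "aiops_alerts_processed_total",
    "Total alerts processed, by outcome",
    ["outcome"],
    registry=registry,
)
ALERT_DURATION = Histogram(
    "aiops_alert_duration_seconds",
    "End-to-end alert processing time in seconds",
    registry=registry,
)
LLM_TOKENS = Counter(
    "aiops_llm_tokens_total",
    "LLM token usage, by type",
    ["type"],
    registry=registry,
)

# Record one processed alert that took 2.5 s and used 640 tokens in total.
ALERTS_PROCESSED.labels(outcome="auto_fixed").inc()
ALERT_DURATION.observe(2.5)
LLM_TOKENS.labels(type="prompt").inc(512)
LLM_TOKENS.labels(type="completion").inc(128)

# Render the registry in the Prometheus text exposition format.
exposition = generate_latest(registry).decode("utf-8")
```

Keeping every metric object in one module, as this release does, avoids registering the same metric name twice when several nodes import it.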
Metrics Endpoints
- Webhook server: `GET /metrics` on port 8000 (FastAPI endpoint)
- Worker process: `GET /metrics` on port 9100 (`prometheus_client` background HTTP server)
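
The two exposure styles can be sketched roughly as follows. The helper names `render_metrics` and `start_worker_metrics`, the demo counter, and the private registry are hypothetical. The actual FastAPI handler in `webhook/server.py` is not shown in these notes, so the first helper only builds the body and content type that such a handler would presumably return.

```python
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    Counter,
    generate_latest,
    start_http_server,
)

# Assumption: a private registry and a demo counter, for illustration only.
registry = CollectorRegistry()
DEMO_REQUESTS = Counter(
    "demo_requests_total", "Hypothetical demo counter", registry=registry
)
DEMO_REQUESTS.inc()


def render_metrics(reg: CollectorRegistry) -> tuple[bytes, str]:
    """Build the payload a web handler for GET /metrics could return.

    In a FastAPI app this pair would typically be wrapped in a Response
    with the given media type.
    """
    return generate_latest(reg), CONTENT_TYPE_LATEST


def start_worker_metrics(port: int = 9100, reg: CollectorRegistry = registry) -> None:
    """Start prometheus_client's background HTTP server for a worker process.

    The server runs in a daemon thread, so it does not block the consumer loop.
    """
    start_http_server(port, registry=reg)


body, content_type = render_metrics(registry)
```

Calling `start_worker_metrics()` once at worker startup would make the metrics available on port 9100, matching the second endpoint above.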
Changes
- `agent/metrics.py`: new module; single source of truth for all metric objects
- `agent/state.py`: added `alert_start_time: Optional[float]` to `AgentState`
- `agent/graph.py`: instrumented `fetch_logs`, `rag_lookup`, `_analyze_with_llm`, `execute_tool`, `decide`, `verify_fix`, `notify`
- `webhook/server.py`: added `GET /metrics` endpoint
- `agent/worker.py`: added `start_http_server(9100)` at startup
- `k8s/aiops/worker.yaml`: exposed `containerPort: 9100` for Prometheus scraping
- `requirements.txt`: added `prometheus_client>=0.20.0`
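
To illustrate how `alert_start_time` might feed `aiops_alert_duration_seconds`, the sketch below stamps the state in `fetch_logs` and observes the elapsed time in a terminal node. Several details are assumptions rather than the actual `agent/graph.py` code: that `notify` is the terminal node, that the state carries an `outcome` field, that `AgentState` is a `TypedDict`, and that `time.monotonic()` is the clock used.

```python
import time
from typing import Optional, TypedDict

from prometheus_client import CollectorRegistry, Counter, Histogram

# Assumption: private registry so the sketch does not touch the default one.
registry = CollectorRegistry()
ALERT_DURATION = Histogram(
    "aiops_alert_duration_seconds",
    "End-to-end alert processing time in seconds",
    registry=registry,
)
ALERTS_PROCESSED = Counter(
    "aiops_alerts_processed_total",
    "Total alerts processed, by outcome",
    ["outcome"],
    registry=registry,
)


class AgentState(TypedDict, total=False):
    alert_start_time: Optional[float]  # field added in this release
    outcome: str  # hypothetical field, used here only to pick the label


def fetch_logs(state: AgentState) -> AgentState:
    # First node: remember when processing of this alert began.
    state["alert_start_time"] = time.monotonic()
    return state


def notify(state: AgentState) -> AgentState:
    # Assumed terminal node: record the duration and the final outcome.
    start = state.get("alert_start_time")
    if start is not None:
        ALERT_DURATION.observe(time.monotonic() - start)
    ALERTS_PROCESSED.labels(outcome=state.get("outcome", "no_action")).inc()
    return state


final_state = notify(fetch_logs(AgentState(outcome="escalated")))
```

Carrying the start time in the state, rather than in a module-level variable, keeps the measurement correct when several alerts are processed concurrently.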