v1.3.0 — Prometheus Observability

timyl released this 24 May 06:11

What's New

This release adds full Prometheus metrics instrumentation to both the webhook server and the Kafka consumer worker.

New Metrics

| Metric | Type | Description |
| --- | --- | --- |
| aiops_alerts_processed_total | Counter | Total alerts processed, labelled by outcome (no_action, auto_fixed, escalated) |
| aiops_alert_duration_seconds | Histogram | End-to-end alert processing time, measured from fetch_logs to the terminal node |
| aiops_fix_verified_total | Counter | PCF config-fix verification results, labelled by result (success, failure) |
| aiops_safety_gate_rejected_total | Counter | Number of times execute_tool refused to act because the PLMN or fixable-typo whitelist check failed |
| aiops_llm_duration_seconds | Histogram | Latency of qwen-max LLM calls |
| aiops_llm_tokens_total | Counter | LLM token usage, labelled by type (prompt, completion) |
| aiops_rag_duration_seconds | Histogram | RAG knowledge-base query latency per alert |
| aiops_rag_chunks_returned | Histogram | Number of unique RAG chunks returned per alert |
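
As a rough illustration of how the metrics in this table could be declared with prometheus_client, here is a minimal sketch. The metric names and types come from the table; the Python variable names, the label names (outcome, result, type), the help strings, the chunk-count buckets, and the private CollectorRegistry are all assumptions made for this example and may not match agent/metrics.py.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# Assumption: a private registry keeps this sketch re-runnable in isolation.
# The real module may register on the library's default registry instead.
REGISTRY = CollectorRegistry()

ALERTS_PROCESSED = Counter(
    "aiops_alerts_processed_total",
    "Total alerts processed",
    ["outcome"],  # assumed label name: no_action, auto_fixed, escalated
    registry=REGISTRY,
)
ALERT_DURATION = Histogram(
    "aiops_alert_duration_seconds",
    "End-to-end alert processing time in seconds",
    registry=REGISTRY,
)
FIX_VERIFIED = Counter(
    "aiops_fix_verified_total",
    "PCF config-fix verification results",
    ["result"],  # assumed label name: success, failure
    registry=REGISTRY,
)
SAFETY_GATE_REJECTED = Counter(
    "aiops_safety_gate_rejected_total",
    "Times execute_tool refused to act after a whitelist check",
    registry=REGISTRY,
)
LLM_DURATION = Histogram(
    "aiops_llm_duration_seconds",
    "LLM call latency in seconds",
    registry=REGISTRY,
)
LLM_TOKENS = Counter(
    "aiops_llm_tokens_total",
    "LLM token usage",
    ["type"],  # assumed label name: prompt, completion
    registry=REGISTRY,
)
RAG_DURATION = Histogram(
    "aiops_rag_duration_seconds",
    "RAG knowledge-base query latency in seconds",
    registry=REGISTRY,
)
RAG_CHUNKS = Histogram(
    "aiops_rag_chunks_returned",
    "Unique RAG chunks returned per alert",
    buckets=(0, 1, 2, 3, 5, 8, 13, 21),  # assumed buckets for small counts
    registry=REGISTRY,
)
```

Callers would then record values with calls such as `ALERTS_PROCESSED.labels(outcome="auto_fixed").inc()` or `RAG_CHUNKS.observe(4)`. Count-valued histograms usually need explicit buckets, since the library defaults are tuned for latencies in seconds.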

Metrics Endpoints

  • Webhook server: GET /metrics on port 8000 (FastAPI endpoint)
  • Worker process: GET /metrics on port 9100 (prometheus_client background HTTP server)
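
The two exposure styles listed above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the helper names render_metrics, create_webhook_app, and start_worker_metrics are hypothetical, and the private registry with a single counter stands in for the real metric objects. FastAPI is imported lazily so that the worker-side helper does not depend on it.

```python
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    Counter,
    generate_latest,
    start_http_server,
)

# Assumption: private registry with one sample metric for the sketch.
REGISTRY = CollectorRegistry()
ALERTS_PROCESSED = Counter(
    "aiops_alerts_processed_total",
    "Total alerts processed",
    ["outcome"],
    registry=REGISTRY,
)


def render_metrics() -> tuple[bytes, str]:
    """Return the exposition payload and its content type."""
    return generate_latest(REGISTRY), CONTENT_TYPE_LATEST


def create_webhook_app():
    """Hypothetical factory: a FastAPI app serving GET /metrics."""
    from fastapi import FastAPI, Response  # lazy import, webhook side only

    app = FastAPI()

    @app.get("/metrics")
    def metrics() -> Response:
        body, content_type = render_metrics()
        return Response(content=body, media_type=content_type)

    return app


def start_worker_metrics(port: int = 9100) -> None:
    """Start prometheus_client's background HTTP server for the worker."""
    start_http_server(port, registry=REGISTRY)
```

In this sketch the webhook app would be served by an ASGI server on port 8000, while the worker would call `start_worker_metrics()` once at startup before entering its consume loop.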

Changes

  • agent/metrics.py — new module; single source of truth for all metric objects
  • agent/state.py — added alert_start_time: Optional[float] to AgentState
  • agent/graph.py — instrumented fetch_logs, rag_lookup, _analyze_with_llm, execute_tool, decide, verify_fix, notify
  • webhook/server.py — added GET /metrics endpoint
  • agent/worker.py — added start_http_server(9100) at startup
  • k8s/aiops/worker.yaml — exposed containerPort: 9100 for Prometheus scraping
  • requirements.txt — added prometheus_client>=0.20.0
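
To show how alert_start_time could tie the first and last graph nodes together, here is a minimal sketch of end-to-end timing. It assumes that fetch_logs stamps the start time, that notify is the terminal node, that both run in the same process (so a monotonic clock is safe), and that the state carries an outcome field; the node bodies, the outcome field, and the private registry are hypothetical and not taken from agent/graph.py.

```python
import time
from typing import Optional, TypedDict

from prometheus_client import CollectorRegistry, Counter, Histogram

# Assumption: private registry and local metric objects for the sketch.
REGISTRY = CollectorRegistry()
ALERTS_PROCESSED = Counter(
    "aiops_alerts_processed_total",
    "Total alerts processed",
    ["outcome"],
    registry=REGISTRY,
)
ALERT_DURATION = Histogram(
    "aiops_alert_duration_seconds",
    "End-to-end alert processing time in seconds",
    registry=REGISTRY,
)


class AgentState(TypedDict, total=False):
    alert_start_time: Optional[float]
    outcome: str  # hypothetical field, used here only to pick the label


def fetch_logs(state: AgentState) -> AgentState:
    # First node: record when processing of this alert began.
    state["alert_start_time"] = time.monotonic()
    return state


def notify(state: AgentState) -> AgentState:
    # Assumed terminal node: observe the elapsed time and count the outcome.
    started = state.get("alert_start_time")
    if started is not None:
        ALERT_DURATION.observe(time.monotonic() - started)
    ALERTS_PROCESSED.labels(outcome=state.get("outcome", "no_action")).inc()
    return state
```

Carrying the start time in the state rather than in a module-level variable keeps the measurement correct when several alerts are in flight at once. For single calls such as an LLM request, wrapping the call in `with LLM_DURATION.time():` would be a simpler alternative.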