Release v1.3.0 — Prometheus Observability · timyl/aiops-agent

What's New

This release adds full Prometheus metrics instrumentation for both the webhook server and the Kafka consumer worker.

| Metric | Type | Description |
| --- | --- | --- |
| `aiops_alerts_processed_total` | Counter | Total alerts processed, labelled by `outcome` (`no_action`, `auto_fixed`, `escalated`) |
| `aiops_alert_duration_seconds` | Histogram | End-to-end alert processing time, from `fetch_logs` to the terminal node |
| `aiops_fix_verified_total` | Counter | PCF config-fix verification results, labelled by `result` (`success`, `failure`) |
| `aiops_safety_gate_rejected_total` | Counter | Number of times `execute_tool` refused to act because the PLMN or fixable-typo whitelist check failed |
| `aiops_llm_duration_seconds` | Histogram | Latency of qwen-max LLM calls |
| `aiops_llm_tokens_total` | Counter | LLM token usage, labelled by `type` (`prompt`, `completion`) |
| `aiops_rag_duration_seconds` | Histogram | RAG knowledge-base query latency per alert |
| `aiops_rag_chunks_returned` | Histogram | Number of unique RAG chunks returned per alert |
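
As a rough illustration of how metrics like these are typically declared with `prometheus_client`, here is a minimal sketch. It is not the contents of `agent/metrics.py`: the variable names, help strings, the private `registry`, the bucket boundaries, and the `record_alert` helper are all assumptions made for the example. Only the metric names and label names come from the table above.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# Assumption: a private registry keeps this sketch isolated from the
# process-wide default registry; the real module may use the default.
registry = CollectorRegistry()

ALERTS_PROCESSED = Counter(
    "aiops_alerts_processed_total",
    "Total alerts processed, by outcome",
    ["outcome"],  # no_action, auto_fixed, escalated
    registry=registry,
)

ALERT_DURATION = Histogram(
    "aiops_alert_duration_seconds",
    "End-to-end alert processing time in seconds",
    registry=registry,  # default buckets; the real buckets are not stated
)

LLM_TOKENS = Counter(
    "aiops_llm_tokens_total",
    "LLM token usage, by type",
    ["type"],  # prompt, completion
    registry=registry,
)

RAG_CHUNKS = Histogram(
    "aiops_rag_chunks_returned",
    "Unique RAG chunks returned per alert",
    buckets=(0, 1, 2, 3, 5, 8, 13),  # hypothetical bucket boundaries
    registry=registry,
)


def record_alert(outcome: str, seconds: float) -> None:
    """Hypothetical helper: count one finished alert and record its duration."""
    ALERTS_PROCESSED.labels(outcome=outcome).inc()
    ALERT_DURATION.observe(seconds)
```

Keeping every metric object in one module means each name is registered exactly once, which avoids the duplicate-registration error `prometheus_client` raises when the same metric is declared twice in one registry.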

Webhook server — GET /metrics on port 8000 (FastAPI endpoint)
Worker process — GET /metrics on port 9100 (prometheus_client background HTTP server)
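
The two exposition styles can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `render_metrics` and `start_worker_metrics` are hypothetical names, and the sketch reads from the default `REGISTRY`. In a FastAPI route, the returned pair would typically be wrapped as `Response(content=body, media_type=content_type)`.

```python
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    REGISTRY,
    generate_latest,
    start_http_server,
)


def render_metrics() -> tuple[bytes, str]:
    """Hypothetical helper for the webhook server's GET /metrics handler.

    Returns the Prometheus text exposition payload and its content type.
    """
    return generate_latest(REGISTRY), CONTENT_TYPE_LATEST


def start_worker_metrics(port: int = 9100) -> None:
    """Hypothetical helper for the worker process.

    start_http_server serves /metrics from a daemon thread, so the Kafka
    consumer loop can keep running in the main thread.
    """
    start_http_server(port)
```

The worker has no web framework of its own, which is presumably why it relies on the background HTTP server bundled with `prometheus_client` rather than a route.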

agent/metrics.py — new module; single source of truth for all metric objects
agent/state.py — added alert_start_time: Optional[float] to AgentState
agent/graph.py — instrumented fetch_logs, rag_lookup, _analyze_with_llm, execute_tool, decide, verify_fix, notify
webhook/server.py — added GET /metrics endpoint
agent/worker.py — added start_http_server(9100) at startup
k8s/aiops/worker.yaml — exposed containerPort: 9100 for Prometheus scraping
requirements.txt — added prometheus_client>=0.20.0
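
To show how an `alert_start_time` field in the state can drive an end-to-end duration histogram, here is a minimal standard-library sketch. The node bodies, the `finish_alert` helper, the use of `time.monotonic()`, and the `observe` callback (standing in for something like a histogram's `observe` method) are assumptions for illustration; only the field name and its `Optional[float]` type come from the notes above.

```python
import time
from typing import Callable, Optional, TypedDict


class AgentState(TypedDict, total=False):
    # Field named in the release notes; other state fields are omitted here.
    alert_start_time: Optional[float]


def fetch_logs(state: AgentState) -> AgentState:
    """First node: stamp the start time (clock choice is an assumption)."""
    state["alert_start_time"] = time.monotonic()
    return state


def finish_alert(
    state: AgentState, observe: Callable[[float], None]
) -> Optional[float]:
    """Hypothetical terminal-node helper: report elapsed seconds, if known."""
    start = state.get("alert_start_time")
    if start is None:
        # The alert never passed through fetch_logs, so nothing is recorded.
        return None
    elapsed = time.monotonic() - start
    observe(elapsed)
    return elapsed
```

Carrying the timestamp in the state rather than in a module-level variable keeps the measurement correct when several alerts are in flight at once.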