Releases: timyl/aiops-agent
v1.3.1 — ConfigMap Hot-Reload & Observability Fixes
What's New
Post-v1.3.0 patch release. All S1–S5 regression scenarios pass.
ConfigMap Hot-Reload (Q1)
- `agent/graph.py`: every node (`fetch_metrics`, `fetch_nrf_logs`, `_analyze_with_llm`, `_analyze_with_rules`, `execute_tool`) now calls `_load_config()` on each alert instead of caching the config at import time
- `k8s/aiops/worker.yaml`: switched from a `subPath` mount to a directory mount (`mountPath: /app/config`); K8s now auto-syncs ConfigMap changes within ~60 s, no `rollout restart` required
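The per-alert reload can be sketched as below. This is a minimal illustration, not the project's `_load_config()`: the names `load_config_fresh`, `handle_alert`, and `demo`, the `allowed_plmns` key, and the sample PLMN values are all hypothetical, and JSON stands in for the real YAML file only so the sketch needs nothing outside the standard library.

```python
import json
import os
import tempfile


def load_config_fresh(path):
    """Read and parse the config file on every call; nothing is cached."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def handle_alert(alert, config_path):
    """Hypothetical node: reads the whitelist at call time, not import time."""
    config = load_config_fresh(config_path)
    return alert["plmn"] in config["allowed_plmns"]


def demo():
    """Show that an edit to the file is visible to the very next alert."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "agent_config.json")
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"allowed_plmns": ["00101"]}, fh)
        before = handle_alert({"plmn": "99970"}, path)
        # Simulate an updated ConfigMap being synced into the mounted directory.
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"allowed_plmns": ["00101", "99970"]}, fh)
        after = handle_alert({"plmn": "99970"}, path)
    return before, after
```

Re-reading on each alert only helps if the file on disk actually changes, which is why the mount type matters: Kubernetes does not propagate ConfigMap updates to volumes mounted with `subPath`, whereas a directory mount is refreshed by the kubelet.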
Safety Gate Metrics Cleanup (Q2)
- `webhook/server.py`: removed `import agent.metrics`; the webhook `/metrics` endpoint no longer exposes worker metric series stuck at 0
Prometheus ServiceMonitor for Worker
- `k8s/aiops/monitoring.yaml`: added the `aiops-worker-metrics` headless Service (port 9100) and a `ServiceMonitor` for Prometheus scraping; previously these were only defined in YAML but not committed
Grafana Dashboard & Cost Panel (Q3)
- `docs/grafana-dashboard-aiops.json`: committed the full Grafana 10.3.3 dashboard JSON (14 panels, 4 sections)
- LLM Cost panel description updated: "Relative trend only — coefficients are rough estimates (~10× cheaper than actual ¥0.04/1K input, ¥0.12/1K output). Free-tier DashScope users: real cost is $0."
Load Test Race Condition Fix
- `scripts/load_test.py`: `wait_lock_released()` now uses a two-phase wait: Phase 1 waits for the Redis lock to appear (up to 20 s), Phase 2 waits for the lock to disappear. This eliminates a false-positive pass when the Kafka consumer hasn't acquired the lock yet
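A two-phase wait of this kind might look like the sketch below. It is an illustration under assumptions, not the code in `scripts/load_test.py`: the `lock_exists` callable, the `release_timeout` and `poll_interval` parameters, the injectable `sleep` and `clock`, and the choice to return `False` on timeout are all hypothetical.

```python
import time


def wait_lock_released(lock_exists, appear_timeout=20.0, release_timeout=300.0,
                       poll_interval=0.5, sleep=time.sleep, clock=time.monotonic):
    """Return True only if the lock first appears and then disappears."""
    # Phase 1: wait for the lock to appear, so "no lock yet" is not
    # mistaken for "lock already released".
    deadline = clock() + appear_timeout
    while not lock_exists():
        if clock() >= deadline:
            return False  # the consumer never acquired the lock
        sleep(poll_interval)

    # Phase 2: wait for the lock to disappear.
    deadline = clock() + release_timeout
    while lock_exists():
        if clock() >= deadline:
            return False  # the lock was never released
        sleep(poll_interval)
    return True
```

With redis-py, `lock_exists` could plausibly be something like `lambda: bool(client.exists(lock_key))`, where `client` and `lock_key` are placeholders for the test's own Redis connection and key name.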
Documentation
- `docs/manual-test-procedure.md`: complete S1–S6 manual test procedure with full commands, no cross-references or omissions
Regression Test Results
| Scenario | Result | Duration |
|---|---|---|
| S1 PLMN Mismatch | ✅ PASS | ~3.5 min |
| S2 No Action | ✅ PASS | ~21 s |
| S3 Dedup | ✅ PASS | ~1.6 min |
| S4 Field Typo | ✅ PASS | ~18 s |
| S5 Concurrent | ✅ PASS | ~1.8 min |
v1.3.0 — Prometheus Observability
What's New
Full Prometheus metrics instrumentation for both the webhook server and the Kafka consumer worker.
New Metrics
| Metric | Type | Description |
|---|---|---|
| `aiops_alerts_processed_total` | Counter | Total alerts processed, labelled by outcome (no_action, auto_fixed, escalated) |
| `aiops_alert_duration_seconds` | Histogram | End-to-end alert processing time from fetch_logs to terminal node |
| `aiops_fix_verified_total` | Counter | PCF config-fix verification results, labelled by result (success, failure) |
| `aiops_safety_gate_rejected_total` | Counter | Times execute_tool refused to act due to PLMN or fixable-typo whitelist check |
| `aiops_llm_duration_seconds` | Histogram | qwen-max LLM call latency |
| `aiops_llm_tokens_total` | Counter | LLM token usage, labelled by type (prompt, completion) |
| `aiops_rag_duration_seconds` | Histogram | RAG knowledge-base query latency per alert |
| `aiops_rag_chunks_returned` | Histogram | Number of unique RAG chunks returned per alert |
Metrics Endpoints
- Webhook server: `GET /metrics` on port 8000 (FastAPI endpoint)
- Worker process: `GET /metrics` on port 9100 (`prometheus_client` background HTTP server)
Changes
- `agent/metrics.py`: new module; single source of truth for all metric objects
- `agent/state.py`: added `alert_start_time: Optional[float]` to `AgentState`
- `agent/graph.py`: instrumented `fetch_logs`, `rag_lookup`, `_analyze_with_llm`, `execute_tool`, `decide`, `verify_fix`, `notify`
- `webhook/server.py`: added `GET /metrics` endpoint
- `agent/worker.py`: added `start_http_server(9100)` at startup
- `k8s/aiops/worker.yaml`: exposed `containerPort: 9100` for Prometheus scraping
- `requirements.txt`: added `prometheus_client>=0.20.0`
v1.1.2 — RAG improvement for unknown fields & ConfigMap externalization
What's Changed
Bug Fix: RAG now consulted for unknown fields
Previously, when a dropped field was not in fixable_typos, the rag_lookup node was skipped entirely. This meant escalation messages relied solely on LLM training knowledge (confidence: medium). RAG is now queried for unknown fields as well, providing explicit correct field names from the knowledge base — raising confidence to high and improving Slack alert quality for human operators.
Feature: agent_config.yaml externalized as K8s ConfigMap
agent_config.yaml (fixable_typos, PLMN whitelist, vendor_fields) is now mounted into the pod via a K8s ConfigMap rather than baked into the Docker image. Config-only changes (adding typos, updating PLMN whitelist) now require only:
kubectl apply -f k8s/aiops/agent-config.yaml && kubectl rollout restart deployment/aiops-worker -n aiops
No image rebuild needed.
Config: expand fixable_typos
- Added `NfSetIdLists` (capital N variant of `nfSetIdLists`)
- Added `localities` (plural typo of `locality`)
Knowledge Accumulation Workflow (validated)
This release validates the end-to-end fault discovery → RAG accumulation → auto-fix promotion workflow:
- New unknown field detected → RAG queried → Slack escalation with correct field name suggested
- Operator confirms → adds field to fixable_typos ConfigMap (no rebuild)
- Next occurrence → auto-fixed automatically
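The promotion loop above can be illustrated with a small sketch. It assumes, hypothetically, that `fixable_typos` maps each known typo to its correct field name and that the RAG query can be modelled as a callable returning a suggested name; `triage_field` and the returned dict shape are invented for illustration and are not the agent's actual API.

```python
def triage_field(field, fixable_typos, rag_lookup):
    """Decide what to do with one dropped field.

    fixable_typos: dict of typo -> correct name (assumed shape).
    rag_lookup: callable returning a suggested correct name, or None.
    """
    if field in fixable_typos:
        # Operator-approved typo: safe to fix automatically.
        return {"action": "auto_fix", "field": field,
                "correct": fixable_typos[field]}
    # Unknown field: consult the knowledge base, then escalate to a human.
    return {"action": "escalate", "field": field,
            "suggestion": rag_lookup(field)}


# First occurrence: unknown field, escalated with a suggestion.
typos = {"NfSetIdLists": "nfSetIdLists"}
knowledge_base = {"localities": "locality"}  # stand-in for the RAG store
first = triage_field("localities", typos, knowledge_base.get)

# Operator confirms and adds the field to the whitelist (no rebuild).
typos["localities"] = first["suggestion"]

# Next occurrence: fixed automatically.
second = triage_field("localities", typos, knowledge_base.get)
```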
v1.1.1 — Function Calling + Safety Gate
v1.1 — Function Calling
- LLM outputs structured tool_calls via bind_tools, replacing string-based fix_action
- New execute_tool node for unified PCF operation dispatch
- decide() router reads tool_call_name, with rules-mode fallback compatibility
- New tool registry: tools/tool_registry.py (update_pcf_plmn / fix_profile_field / notify_only / no_action)
v1.1.1 — Bug Fixes & Safety Hardening
- Unknown field propagation: unknown_fields written to AgentState; LLM receives (unknown — cannot auto-fix) label and correctly routes to notify instead of no_action
- fixable_typos safety gate: execute_tool now enforces whitelist check, blocking LLM from auto-fixing unauthorized fields using training knowledge
- Slack notification fix: NOTIFY_WEBHOOK_URL added to Secret; escalation alerts now delivered correctly
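A whitelist gate in front of tool execution could be sketched as follows. The tool names come from the registry listed above; everything else (the `SafetyGateRejected` exception, the `enforce_safety_gate` function, and the `plmn` and `field` argument keys) is an assumption made for illustration rather than the project's actual code.

```python
class SafetyGateRejected(Exception):
    """Raised when a tool call fails a whitelist check."""


def enforce_safety_gate(tool_name, tool_args, allowed_plmns, fixable_typos):
    """Raise SafetyGateRejected unless the call passes its whitelist check.

    The check runs in plain code, so it holds regardless of what the LLM
    claims about its own confidence.
    """
    if tool_name == "update_pcf_plmn":
        plmn = tool_args.get("plmn")  # argument key is an assumption
        if plmn not in allowed_plmns:
            raise SafetyGateRejected(f"PLMN {plmn!r} is not whitelisted")
    elif tool_name == "fix_profile_field":
        field = tool_args.get("field")  # argument key is an assumption
        if field not in fixable_typos:
            raise SafetyGateRejected(f"field {field!r} is not a known fixable typo")
    elif tool_name not in ("notify_only", "no_action"):
        # Anything outside the registry is refused outright.
        raise SafetyGateRejected(f"unknown tool {tool_name!r}")
```

Raising an exception rather than returning a flag is a design choice in this sketch: a caller cannot proceed to the PCF write by forgetting to inspect a return value.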
v1.0 — Production Hardening
What's New in v1.0
This release focuses on production hardening across four areas: safety, reliability, resilience, and configurability.
Safety
- Defense-in-depth PLMN whitelist: the `auto_fix` node performs a hard code-level check before writing to PCF, independent of LLM confidence. It rejects any PLMN not in the config whitelist, guarding against prompt injection and LLM hallucination.
Configurability
- Config-driven domain knowledge: `ALLOWED_PLMNS`, `_FIXABLE_TYPOS`, and `_VENDOR_FIELDS` extracted from source code into `config/agent_config.yaml`. Operators can update allowed PLMNs or fixable typo lists without touching Python code or rebuilding the image. Supports the `AGENT_CONFIG_PATH` env var for K8s ConfigMap volume mount.
Reliability
- Kafka manual offset commit: `enable_auto_commit=False`; the offset is committed only after a successful `graph.invoke()` and audit log write. Prevents silent message loss if the worker crashes mid-processing.
- LangGraph version pin: `langgraph<1.0.0` to avoid 1.x breaking API changes (`KeyError: '__end__'` in conditional edge routing). Added `END: END` to the `analyze` path_map for correct termination.
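The commit-after-processing pattern can be sketched as below. It assumes a consumer object that is iterable and exposes `commit()`, as a kafka-python `KafkaConsumer` created with `enable_auto_commit=False` does; `consume`, `process`, and `write_audit` are hypothetical names standing in for the worker loop, `graph.invoke()`, and the audit log writer.

```python
def consume(consumer, process, write_audit):
    """Commit each offset only after processing and audit logging succeed."""
    for message in consumer:
        # If either step raises, the loop exits before commit(), so the
        # offset stays uncommitted and the message is redelivered on restart.
        result = process(message.value)
        write_audit(message, result)
        consumer.commit()
```

The trade-off is at-least-once delivery: a crash after processing but before the commit means the same alert is handled again, so downstream steps should tolerate duplicates.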
Resilience
- PCF REST API retry: all PCF API calls now use `tenacity` with 3 attempts, exponential backoff (1–8 s), and a 5 s timeout. Retries on network errors and 5xx responses only; 4xx errors fail immediately.
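The retry policy described above can be illustrated with a hand-rolled, standard-library sketch. It is a stand-in for the behaviour, not the project's `tenacity` configuration: `call_with_retry`, `PcfClientError`, and `PcfServerError` are hypothetical names, and the 5 s timeout is not shown because it would be set on the HTTP client call itself.

```python
import time


class PcfClientError(Exception):
    """4xx response: the request itself is wrong, so retrying cannot help."""


class PcfServerError(Exception):
    """5xx response: possibly transient, so worth retrying."""


# Network-level failures and 5xx responses are retried; 4xx is not listed,
# so it propagates on the first attempt.
RETRYABLE = (PcfServerError, ConnectionError, TimeoutError)


def call_with_retry(func, attempts=3, base_delay=1.0, max_delay=8.0,
                    sleep=time.sleep):
    """Call func() up to `attempts` times with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except RETRYABLE:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time (1 s, 2 s, 4 s, ...), capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
```

Passing `sleep` as a parameter keeps the function testable without real waiting.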
Files Changed
| File | Change |
|---|---|
| `config/agent_config.yaml` | New: operator config for PLMNs and field whitelists |
| `agent/graph.py` | Config loader, auto_fix safety gate, LangGraph END fix |
| `agent/worker.py` | Kafka manual offset commit |
| `tools/pcf_tool.py` | Tenacity retry + 5s timeout |
| `requirements.txt` | Pin langgraph<1.0.0, add pyyaml, tenacity |
| `Dockerfile` | Include config/ directory |
| `k8s/aiops/webhook.yaml` | Service type: LoadBalancer |
Upgrade Notes
- `config/agent_config.yaml` is required at runtime. Default path: `<project_root>/config/agent_config.yaml`. Override with the `AGENT_CONFIG_PATH` env var.
- For K8s: mount `agent_config.yaml` as a ConfigMap volume and set `AGENT_CONFIG_PATH` to the mount path for config updates without image rebuilds.
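The path resolution described in these notes might be implemented along the following lines. `resolve_config_path` and its `env` parameter are hypothetical; only the `AGENT_CONFIG_PATH` variable and the default location come from the notes above.

```python
import os
from pathlib import Path


def resolve_config_path(project_root, env=None):
    """Return the config path, preferring AGENT_CONFIG_PATH when it is set."""
    env = os.environ if env is None else env
    override = env.get("AGENT_CONFIG_PATH")
    if override:
        return Path(override)
    # Fall back to the default location inside the project tree.
    return Path(project_root) / "config" / "agent_config.yaml"
```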
Next: v1.1 will introduce LLM function calling for dynamic tool selection, enabling extensibility to new NF types and REST API operations without graph restructuring.