What's New
Post-v1.3.0 patch release. All S1–S5 regression scenarios pass.
ConfigMap Hot-Reload (Q1)
- `agent/graph.py` — every node (`fetch_metrics`, `fetch_nrf_logs`, `_analyze_with_llm`, `_analyze_with_rules`, `execute_tool`) now calls `_load_config()` on each alert instead of caching at import time
- `k8s/aiops/worker.yaml` — switched from `subPath` mount to directory mount (`mountPath: /app/config`); K8s now auto-syncs ConfigMap changes within ~60 s, no `rollout restart` required
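
The per-alert reload described above can be sketched roughly as follows. This is a minimal illustration, not the actual `_load_config()` from `agent/graph.py`: the function names, the `AIOPS_CONFIG_DIR` environment variable, and the `threshold` key are all hypothetical. It assumes only what a directory-mounted ConfigMap provides, namely one file per key under `/app/config`, plus internal entries whose names start with `..`.

```python
import os
from pathlib import Path

# Mount path taken from the release note; the env var override is hypothetical.
CONFIG_DIR = Path(os.environ.get("AIOPS_CONFIG_DIR", "/app/config"))


def load_config(config_dir: Path = CONFIG_DIR) -> dict[str, str]:
    """Read every ConfigMap key (one file per key) fresh from disk."""
    config: dict[str, str] = {}
    for entry in sorted(config_dir.iterdir()):
        # Kubernetes keeps bookkeeping entries such as `..data` in the
        # mounted directory; they are not configuration keys.
        if entry.name.startswith(".."):
            continue
        if entry.is_file():
            config[entry.name] = entry.read_text(encoding="utf-8").strip()
    return config


def handle_alert(alert: dict, config_dir: Path = CONFIG_DIR) -> str:
    """Hypothetical node: reloads config on every alert, no module-level cache."""
    config = load_config(config_dir)
    threshold = config.get("threshold", "unset")  # hypothetical key
    return f"{alert['name']} evaluated with threshold={threshold}"
```

Because nothing is cached at import time, a value changed in the ConfigMap is picked up by the next alert once the kubelet has synced the mounted files.
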
Safety Gate Metrics Cleanup (Q2)
- `webhook/server.py` — removed `import agent.metrics`; the webhook `/metrics` endpoint no longer exposes worker metric series stuck at 0
Prometheus ServiceMonitor for Worker
- `k8s/aiops/monitoring.yaml` — added `aiops-worker-metrics` headless Service (port 9100) and `ServiceMonitor` for Prometheus scraping; both were previously defined in YAML but never committed
Grafana Dashboard & Cost Panel (Q3)
- `docs/grafana-dashboard-aiops.json` — committed full Grafana 10.3.3 dashboard JSON (14 panels, 4 sections)
- LLM Cost panel description updated: "Relative trend only — coefficients are rough estimates (~10× cheaper than actual ¥0.04/1K input, ¥0.12/1K output). Free-tier DashScope users: real cost is $0."
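
As a worked example of the per-token prices quoted in the panel description, the sketch below computes a cost from token counts. The function name and the sample token counts are hypothetical, and the dashboard's own query is not reproduced here; only the ¥0.04/1K input and ¥0.12/1K output rates come from the description.

```python
def estimate_cost_cny(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_1k: float = 0.04,   # ¥ per 1K input tokens (from the panel text)
    output_rate_per_1k: float = 0.12,  # ¥ per 1K output tokens (from the panel text)
) -> float:
    """Return the estimated cost in CNY for one batch of LLM calls."""
    input_cost = input_tokens / 1000 * input_rate_per_1k
    output_cost = output_tokens / 1000 * output_rate_per_1k
    return input_cost + output_cost


# Hypothetical workload: 12,000 input tokens and 3,000 output tokens,
# which comes to roughly ¥0.84 at the quoted rates.
sample_cost = estimate_cost_cny(12_000, 3_000)
```

If the description is read as saying the panel's coefficients are about a tenth of these rates, the panel would show a correspondingly smaller number for the same workload, which is why it is labelled as a relative trend only.
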
Load Test Race Condition Fix
- `scripts/load_test.py` — `wait_lock_released()` now uses a two-phase wait: Phase 1 waits for the Redis lock to appear (up to 20 s), Phase 2 waits for the lock to disappear; this eliminates a false-positive pass when the Kafka consumer hasn't acquired the lock yet
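
The two-phase wait can be sketched as below. This is an illustration under stated assumptions rather than the code in `scripts/load_test.py`: the lock check is passed in as a plain callable (`lock_exists`) instead of a Redis client call, `release_timeout` and `poll_interval` are hypothetical parameters, and returning `False` when the lock never appears is an assumed behaviour. Only the 20 s Phase 1 limit comes from the note above.

```python
import time
from typing import Callable


def wait_lock_released(
    lock_exists: Callable[[], bool],
    appear_timeout: float = 20.0,    # Phase 1 limit from the release note
    release_timeout: float = 300.0,  # hypothetical Phase 2 limit
    poll_interval: float = 0.5,      # hypothetical polling period
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the lock was seen and then seen to disappear."""
    # Phase 1: wait until the consumer has actually acquired the lock.
    deadline = clock() + appear_timeout
    while not lock_exists():
        if clock() >= deadline:
            return False  # lock never appeared, so nothing was processed
        sleep(poll_interval)

    # Phase 2: wait until the lock is released again.
    deadline = clock() + release_timeout
    while lock_exists():
        if clock() >= deadline:
            return False  # lock still held when the timeout expired
        sleep(poll_interval)
    return True
```

A single-phase wait that only checks for the lock's absence returns immediately if it runs before the consumer has taken the lock, which is the false positive the fix removes. Injecting `clock` and `sleep` keeps the sketch testable without real delays.
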
Documentation
- `docs/manual-test-procedure.md` — complete S1–S6 manual test procedure with full commands, no cross-references or omissions
Regression Test Results
| Scenario | Result | Duration |
|---|---|---|
| S1 PLMN Mismatch | ✅ PASS | ~3.5 min |
| S2 No Action | ✅ PASS | ~21 s |
| S3 Dedup | ✅ PASS | ~1.6 min |
| S4 Field Typo | ✅ PASS | ~18 s |
| S5 Concurrent | ✅ PASS | ~1.8 min |