v1.3.1 — ConfigMap Hot-Reload & Observability Fixes

Latest

timyl released this 25 May 13:26

a5c8e23

What's New

Post-v1.3.0 patch release. All S1–S5 regression scenarios pass.

ConfigMap Hot-Reload (Q1)

agent/graph.py — every node (fetch_metrics, fetch_nrf_logs, _analyze_with_llm, _analyze_with_rules, execute_tool) now calls _load_config() on each alert instead of caching the config at import time
k8s/aiops/worker.yaml — switched from a subPath mount to a directory mount (mountPath: /app/config). subPath mounts never receive ConfigMap updates; with the directory mount, K8s auto-syncs ConfigMap changes within ~60 s, so no rollout restart is required
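
The difference between caching at import time and loading per alert can be sketched as follows. This is a minimal illustration, not the code in agent/graph.py: the JSON format, the config.json filename, the AIOPS_CONFIG_PATH variable, and the threshold key are all assumptions made for the example. Only the function names _load_config and fetch_metrics and the /app/config mount directory come from the notes above.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical file name and env var; the release only names the /app/config directory.
CONFIG_PATH = Path(os.environ.get("AIOPS_CONFIG_PATH", "/app/config/config.json"))


def _load_config(path: Path = CONFIG_PATH) -> dict:
    """Read the config from disk on every call so that edits are picked up."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def fetch_metrics(alert: dict, path: Path = CONFIG_PATH) -> dict:
    # Load per alert instead of reading a module-level value cached at import.
    config = _load_config(path)
    return {"alert": alert["name"], "threshold": config["threshold"]}


def demo() -> tuple:
    """Simulate a ConfigMap update by rewriting the file between two alerts."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "config.json"
        path.write_text(json.dumps({"threshold": 5}), encoding="utf-8")
        before = fetch_metrics({"name": "demo-alert"}, path)["threshold"]
        path.write_text(json.dumps({"threshold": 9}), encoding="utf-8")
        after = fetch_metrics({"name": "demo-alert"}, path)["threshold"]
        return before, after
```

Calling demo() shows the second alert seeing the new value without a restart. The trade-off is one small file read per node per alert, which is usually negligible next to an LLM call.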

Safety Gate Metrics Cleanup (Q2)

webhook/server.py — removed import agent.metrics; webhook /metrics endpoint no longer exposes worker metric series stuck at 0

Prometheus ServiceMonitor for Worker

k8s/aiops/monitoring.yaml — added the aiops-worker-metrics headless Service (port 9100) and a ServiceMonitor for Prometheus scraping; both previously existed only as uncommitted YAML

Grafana Dashboard & Cost Panel (Q3)

docs/grafana-dashboard-aiops.json — committed full Grafana 10.3.3 dashboard JSON (14 panels, 4 sections)
LLM Cost panel description updated: "Relative trend only — coefficients are rough estimates (~10× cheaper than actual ¥0.04/1K input, ¥0.12/1K output). Free-tier DashScope users: real cost is $0."
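
To make the gap in the panel description concrete, the arithmetic can be sketched as below. The per-1K prices are the ones quoted in the description; treating "~10×" as an exact factor of 10 and the token counts in the usage note are assumptions for illustration, and the function names are hypothetical.

```python
# Actual prices quoted in the panel description, in ¥ per 1K tokens.
ACTUAL_INPUT_PER_1K = 0.04
ACTUAL_OUTPUT_PER_1K = 0.12

# Assumption: "~10x cheaper" is modelled as an exact factor of 10.
PANEL_UNDERESTIMATE_FACTOR = 10.0


def actual_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost at the quoted actual prices."""
    return (
        (input_tokens / 1000) * ACTUAL_INPUT_PER_1K
        + (output_tokens / 1000) * ACTUAL_OUTPUT_PER_1K
    )


def panel_estimate(input_tokens: int, output_tokens: int) -> float:
    """Roughly what the dashboard would show under the factor-of-10 assumption."""
    return actual_cost(input_tokens, output_tokens) / PANEL_UNDERESTIMATE_FACTOR
```

For a call with 1,500 input and 500 output tokens, actual_cost gives about ¥0.12 (0.06 for input plus 0.06 for output), while the panel would show roughly a tenth of that. This is why the panel is labelled as a relative trend only.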

Load Test Race Condition Fix

scripts/load_test.py — wait_lock_released() now uses a two-phase wait: Phase 1 waits for the Redis lock to appear (up to 20 s), Phase 2 waits for the lock to disappear. This eliminates the false-positive pass that occurred when the Kafka consumer had not yet acquired the lock
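
The two-phase wait can be sketched as below. This is an illustration under stated assumptions rather than the code in scripts/load_test.py: the lock check is passed in as a callable so the sketch runs without Redis, and the release_timeout and poll_interval defaults are invented for the example. Only the function name and the 20 s Phase 1 limit come from the note above.

```python
import time
from typing import Callable


def wait_lock_released(
    lock_exists: Callable[[], bool],
    appear_timeout: float = 20.0,    # Phase 1 limit stated in the release note
    release_timeout: float = 300.0,  # assumed value, not from the release note
    poll_interval: float = 0.5,      # assumed value
) -> bool:
    """Return True only if the lock was observed and then released."""
    # Phase 1: wait for the lock to appear. Without this phase, a check made
    # before the consumer acquires the lock would look like "already released".
    deadline = time.monotonic() + appear_timeout
    while not lock_exists():
        if time.monotonic() >= deadline:
            return False  # lock never appeared, so do not report a pass
        time.sleep(poll_interval)

    # Phase 2: wait for the lock to disappear.
    deadline = time.monotonic() + release_timeout
    while lock_exists():
        if time.monotonic() >= deadline:
            return False  # lock still held when the timeout expired
        time.sleep(poll_interval)
    return True
```

With redis-py, the callable could be something like lambda: client.exists(lock_key) > 0, where client and lock_key are placeholders for whatever the load test already uses.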

Documentation

docs/manual-test-procedure.md — complete S1–S6 manual test procedure with full commands, no cross-references or omissions

Regression Test Results

Scenario	Result	Duration
S1 PLMN Mismatch	✅ PASS	~3.5 min
S2 No Action	✅ PASS	~21 s
S3 Dedup	✅ PASS	~1.6 min
S4 Field Typo	✅ PASS	~18 s
S5 Concurrent	✅ PASS	~1.8 min