v1.3.1 — ConfigMap Hot-Reload & Observability Fixes

timyl released this 25 May 13:26

What's New

Post-v1.3.0 patch release. All S1–S5 regression scenarios pass.

ConfigMap Hot-Reload (Q1)

  • agent/graph.py — every node (fetch_metrics, fetch_nrf_logs, _analyze_with_llm, _analyze_with_rules, execute_tool) now calls _load_config() on each alert instead of caching at import time
  • k8s/aiops/worker.yaml — switched from subPath mount to directory mount (mountPath: /app/config); K8s now auto-syncs ConfigMap changes within ~60 s, no rollout restart required
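
The per-alert reload described above can be sketched roughly as follows. This is an illustration, not the code in agent/graph.py: the file name `settings.json`, the JSON format, and the `metrics_window_s` key are assumptions made for the example; only the `/app/config` mount path and the function names `_load_config` and `fetch_metrics` come from the notes. The idea is that Kubernetes refreshes files in a directory-mounted ConfigMap (but not in a subPath mount), so a node that re-reads the file on each alert sees the new values without a restart.

```python
import json
from pathlib import Path

# /app/config is the mountPath from worker.yaml; the file name is hypothetical.
CONFIG_DIR = Path("/app/config")
CONFIG_FILE = "settings.json"


def _load_config(config_dir: Path = CONFIG_DIR) -> dict:
    """Read the config from disk on every call, so no stale copy is kept."""
    with open(config_dir / CONFIG_FILE, encoding="utf-8") as fh:
        return json.load(fh)


def fetch_metrics(alert: dict, config_dir: Path = CONFIG_DIR) -> dict:
    """Example node: loads config per alert rather than at import time."""
    config = _load_config(config_dir)
    # "metrics_window_s" is a hypothetical key used only for illustration.
    return {"alert": alert["name"], "window_s": config["metrics_window_s"]}
```

The trade-off is one small file read per node per alert in exchange for never serving a cached value; at alert-processing rates that cost is likely negligible.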

Safety Gate Metrics Cleanup (Q2)

  • webhook/server.py — removed import agent.metrics; webhook /metrics endpoint no longer exposes worker metric series stuck at 0
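
A likely reason the import alone exposed those series: in common Prometheus client libraries, constructing a metric at module level registers it with a process-wide registry, so any process that imports the module exposes the series even if it never updates them. The sketch below reproduces that effect with a toy registry. `REGISTRY`, `Counter`, `import_worker_metrics`, and the metric name are hypothetical stand-ins, not the real client library or the project's actual metrics.

```python
# Toy process-wide registry; a hypothetical stand-in for a real metrics client.
REGISTRY = {}


class Counter:
    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0
        REGISTRY[name] = self  # registered as a side effect of construction

    def inc(self) -> None:
        self.value += 1


def import_worker_metrics() -> None:
    """Simulates the module-level definitions that run on `import agent.metrics`."""
    Counter("worker_safety_gate_total")  # hypothetical metric name


def render_metrics() -> str:
    """Render every registered series, as a /metrics handler would."""
    return "\n".join(f"{name} {c.value}" for name, c in sorted(REGISTRY.items()))
```

Before `import_worker_metrics()` runs, `render_metrics()` returns an empty string; afterwards it returns the worker series at 0, which is where it stays in a process that never increments it.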

Prometheus ServiceMonitor for Worker

  • k8s/aiops/monitoring.yaml — added aiops-worker-metrics headless Service (port 9100) and ServiceMonitor for Prometheus scraping; previously only defined in YAML but not committed

Grafana Dashboard & Cost Panel (Q3)

  • docs/grafana-dashboard-aiops.json — committed full Grafana 10.3.3 dashboard JSON (14 panels, 4 sections)
  • LLM Cost panel description updated: "Relative trend only — coefficients are rough estimates (~10× cheaper than actual ¥0.04/1K input, ¥0.12/1K output). Free-tier DashScope users: real cost is $0."
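
One way to read that description, as arithmetic: the actual rates are the two figures quoted in the panel text, and the panel's coefficients are assumed here to be about one tenth of them (an interpretation of "~10× cheaper", not a value taken from the dashboard JSON). The token counts are made up for illustration.

```python
# Actual rates quoted in the panel description, in yuan per 1K tokens.
ACTUAL_INPUT_PER_1K = 0.04
ACTUAL_OUTPUT_PER_1K = 0.12
# Assumption: "~10x cheaper" means panel coefficients are about a tenth of actual.
PANEL_SCALE = 0.1


def cost_yuan(input_tokens: int, output_tokens: int, scale: float = 1.0) -> float:
    """Cost in yuan for the given token counts, optionally scaled."""
    actual = (
        (input_tokens / 1000) * ACTUAL_INPUT_PER_1K
        + (output_tokens / 1000) * ACTUAL_OUTPUT_PER_1K
    )
    return actual * scale


# Hypothetical token counts, chosen only for illustration.
actual = cost_yuan(50_000, 10_000)
panel = cost_yuan(50_000, 10_000, scale=PANEL_SCALE)
print(f"actual ~ {actual:.2f} yuan, panel ~ {panel:.2f} yuan")
```

Under these assumptions, 50K input and 10K output tokens cost about ¥3.20 at the actual rates while the panel would show about ¥0.32, which is why the panel is useful for trends but not for absolute cost.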

Load Test Race Condition Fix

  • scripts/load_test.py: wait_lock_released() now uses a two-phase wait. Phase 1 waits for the Redis lock to appear (up to 20 s); Phase 2 waits for the lock to disappear. This eliminates the false-positive pass that occurred when the Kafka consumer had not yet acquired the lock.
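
A minimal sketch of the two-phase wait, under stated assumptions. The real function presumably queries Redis directly; here the check is injected as a `lock_exists` callable so the example runs without a Redis server. The 20 s appear timeout comes from the note above, while `release_timeout`, `poll_interval`, and the choice to return False when the lock never appears are assumptions, not the script's confirmed behavior.

```python
import time
from typing import Callable


def wait_lock_released(
    lock_exists: Callable[[], bool],
    appear_timeout: float = 20.0,    # from the release note: up to 20 s
    release_timeout: float = 300.0,  # hypothetical; not stated in the note
    poll_interval: float = 0.5,      # hypothetical
) -> bool:
    """Return True only if the lock was seen and then seen to be gone."""
    # Phase 1: wait for the lock to appear.
    deadline = time.monotonic() + appear_timeout
    while not lock_exists():
        if time.monotonic() >= deadline:
            return False  # assumption: never seeing the lock counts as failure
        time.sleep(poll_interval)

    # Phase 2: wait for the lock to disappear.
    deadline = time.monotonic() + release_timeout
    while lock_exists():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

A single-phase wait that only polls for "lock absent" would return immediately if the consumer had not yet picked up the message, which is the false positive the fix targets. In practice `lock_exists` could wrap a Redis EXISTS check on the lock key.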

Documentation

  • docs/manual-test-procedure.md — complete S1–S6 manual test procedure with full commands, no cross-references or omissions

Regression Test Results

Scenario            Result     Duration
S1 PLMN Mismatch    ✅ PASS    ~3.5 min
S2 No Action        ✅ PASS    ~21 s
S3 Dedup            ✅ PASS    ~1.6 min
S4 Field Typo       ✅ PASS    ~18 s
S5 Concurrent       ✅ PASS    ~1.8 min