Releases: timyl/aiops-agent

v1.3.1 — ConfigMap Hot-Reload & Observability Fixes

timyl released this 25 May 13:26

What's New

Post-v1.3.0 patch release. All S1–S5 regression scenarios pass.

ConfigMap Hot-Reload (Q1)

  • agent/graph.py — every node (fetch_metrics, fetch_nrf_logs, _analyze_with_llm, _analyze_with_rules, execute_tool) now calls _load_config() on each alert instead of caching at import time
  • k8s/aiops/worker.yaml — switched from subPath mount to directory mount (mountPath: /app/config); K8s now auto-syncs ConfigMap changes within ~60 s, no rollout restart required
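
A minimal sketch of the per-alert reload described above. The project's _load_config() is not shown in these notes, so load_config, handle_alert, the allowed_plmns key, and the default path are hypothetical names. JSON parsing stands in for YAML to keep the sketch stdlib-only; passing yaml.safe_load as parse should handle the real agent_config.yaml.

```python
import json
import os
from pathlib import Path
from typing import Any, Callable, Optional

# Hypothetical default; the notes only state the mount path /app/config.
DEFAULT_CONFIG_PATH = "/app/config/agent_config.yaml"


def load_config(
    path: Optional[str] = None,
    parse: Callable[[str], Any] = json.loads,
) -> Any:
    """Read and parse the config file on every call (no module-level cache)."""
    config_path = Path(path or os.environ.get("AGENT_CONFIG_PATH", DEFAULT_CONFIG_PATH))
    return parse(config_path.read_text(encoding="utf-8"))


def handle_alert(alert: dict, config_path: str) -> bool:
    """Hypothetical node: check per alert whether the PLMN is whitelisted."""
    config = load_config(config_path)  # fresh read, so edits apply to the next alert
    return alert["plmn"] in config["allowed_plmns"]
```

Because nothing is cached at import time, a file updated by the kubelet is picked up by the next alert without restarting the process.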

Safety Gate Metrics Cleanup (Q2)

  • webhook/server.py — removed import agent.metrics; webhook /metrics endpoint no longer exposes worker metric series stuck at 0

Prometheus ServiceMonitor for Worker

  • k8s/aiops/monitoring.yaml — added aiops-worker-metrics headless Service (port 9100) and ServiceMonitor for Prometheus scraping; both were previously defined in YAML but had not been committed

Grafana Dashboard & Cost Panel (Q3)

  • docs/grafana-dashboard-aiops.json — committed full Grafana 10.3.3 dashboard JSON (14 panels, 4 sections)
  • LLM Cost panel description updated: "Relative trend only — coefficients are rough estimates (~10× cheaper than actual ¥0.04/1K input, ¥0.12/1K output). Free-tier DashScope users: real cost is $0."
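
For reference, the list prices quoted in the panel description translate into a per-call cost as sketched below. The function and constant names are hypothetical, and the prices are only those quoted above; they are not read from any DashScope API.

```python
# Per-1K-token list prices quoted in the panel description (CNY).
INPUT_CNY_PER_1K = 0.04
OUTPUT_CNY_PER_1K = 0.12


def estimate_cost_cny(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the list-price cost in CNY for one LLM call."""
    return (
        prompt_tokens / 1000 * INPUT_CNY_PER_1K
        + completion_tokens / 1000 * OUTPUT_CNY_PER_1K
    )
```

For example, a call with 2,000 prompt tokens and 500 completion tokens comes to roughly 0.08 + 0.06 = 0.14 CNY at list price.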

Load Test Race Condition Fix

  • scripts/load_test.py: wait_lock_released() now uses a two-phase wait. Phase 1 waits for the Redis lock to appear (up to 20 s); Phase 2 waits for the lock to disappear. This eliminates the false-positive pass that occurred when the Kafka consumer had not yet acquired the lock.
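
The two-phase wait can be sketched as follows. This is not the script's actual code: the lock_exists callable stands in for the Redis key check, and the release timeout, poll interval, and the choice to return False when the lock never appears are assumptions.

```python
import time
from typing import Callable


def wait_lock_released(
    lock_exists: Callable[[], bool],
    appear_timeout: float = 20.0,
    release_timeout: float = 300.0,  # assumed value
    poll_interval: float = 0.5,      # assumed value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True only if the lock was observed and then released."""
    # Phase 1: wait for the lock to appear.
    deadline = clock() + appear_timeout
    while not lock_exists():
        if clock() >= deadline:
            return False  # lock never appeared, so nothing was processed
        sleep(poll_interval)

    # Phase 2: wait for the lock to disappear.
    deadline = clock() + release_timeout
    while lock_exists():
        if clock() >= deadline:
            return False  # lock still held at timeout
        sleep(poll_interval)
    return True
```

A single-phase wait that only checks for absence would pass immediately if it ran before the consumer acquired the lock, which is the race this fix addresses.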

Documentation

  • docs/manual-test-procedure.md — complete S1–S6 manual test procedure with full commands, no cross-references or omissions

Regression Test Results

| Scenario | Result | Duration |
| --- | --- | --- |
| S1 PLMN Mismatch | ✅ PASS | ~3.5 min |
| S2 No Action | ✅ PASS | ~21 s |
| S3 Dedup | ✅ PASS | ~1.6 min |
| S4 Field Typo | ✅ PASS | ~18 s |
| S5 Concurrent | ✅ PASS | ~1.8 min |

v1.3.0 — Prometheus Observability

timyl released this 24 May 06:11

What's New

Full Prometheus metrics instrumentation for both the webhook server and the Kafka consumer worker.

New Metrics

| Metric | Type | Description |
| --- | --- | --- |
| aiops_alerts_processed_total | Counter | Total alerts processed, labelled by outcome (no_action, auto_fixed, escalated) |
| aiops_alert_duration_seconds | Histogram | End-to-end alert processing time from fetch_logs to terminal node |
| aiops_fix_verified_total | Counter | PCF config-fix verification results, labelled by result (success, failure) |
| aiops_safety_gate_rejected_total | Counter | Times execute_tool refused to act due to PLMN or fixable-typo whitelist check |
| aiops_llm_duration_seconds | Histogram | qwen-max LLM call latency |
| aiops_llm_tokens_total | Counter | LLM token usage, labelled by type (prompt, completion) |
| aiops_rag_duration_seconds | Histogram | RAG knowledge-base query latency per alert |
| aiops_rag_chunks_returned | Histogram | Number of unique RAG chunks returned per alert |

Metrics Endpoints

  • Webhook server: GET /metrics on port 8000 (FastAPI endpoint)
  • Worker process: GET /metrics on port 9100 (prometheus_client background HTTP server)

Changes

  • agent/metrics.py — new module; single source of truth for all metric objects
  • agent/state.py — added alert_start_time: Optional[float] to AgentState
  • agent/graph.py — instrumented fetch_logs, rag_lookup, _analyze_with_llm, execute_tool, decide, verify_fix, notify
  • webhook/server.py — added GET /metrics endpoint
  • agent/worker.py — added start_http_server(9100) at startup
  • k8s/aiops/worker.yaml — exposed containerPort: 9100 for Prometheus scraping
  • requirements.txt — added prometheus_client>=0.20.0

v1.1.2 — RAG improvement for unknown fields & ConfigMap externalization

timyl released this 22 May 15:53

What's Changed

Bug Fix: RAG now consulted for unknown fields

Previously, when a dropped field was not in fixable_typos, the rag_lookup node was skipped entirely. This meant escalation messages relied solely on LLM training knowledge (confidence: medium). RAG is now queried for unknown fields as well, providing explicit correct field names from the knowledge base — raising confidence to high and improving Slack alert quality for human operators.

Feature: agent_config.yaml externalized as K8s ConfigMap

agent_config.yaml (fixable_typos, PLMN whitelist, vendor_fields) is now mounted into the pod via a K8s ConfigMap rather than baked into the Docker image. Config-only changes (adding typos, updating PLMN whitelist) now require only:

kubectl apply -f k8s/aiops/agent-config.yaml && kubectl rollout restart deployment/aiops-worker -n aiops

No image rebuild needed.

Config: expand fixable_typos

  • Added NfSetIdLists (capital N variant of nfSetIdLists)
  • Added localities (plural typo of locality)

Knowledge Accumulation Workflow (validated)

This release validates the end-to-end fault discovery → RAG accumulation → auto-fix promotion workflow:

  1. New unknown field detected → RAG queried → Slack escalation with correct field name suggested
  2. Operator confirms → adds field to fixable_typos ConfigMap (no rebuild)
  3. Next occurrence → auto-fixed automatically
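
The promotion workflow above can be sketched as a single routing decision. The function name and return shape are hypothetical, and fixable_typos is assumed here to map each typo to its correct field name.

```python
from typing import Dict, Optional


def route_dropped_field(
    field: str,
    fixable_typos: Dict[str, str],
    rag_suggestion: Optional[str] = None,
) -> dict:
    """Auto-fix whitelisted typos; escalate everything else with the RAG hint."""
    if field in fixable_typos:
        return {
            "action": "auto_fix",
            "field": field,
            "correct_field": fixable_typos[field],
        }
    return {"action": "escalate", "field": field, "suggested_field": rag_suggestion}


# Step 1: "localities" is not yet whitelisted, so it is escalated with a suggestion.
typos = {"NfSetIdLists": "nfSetIdLists"}
first = route_dropped_field("localities", typos, rag_suggestion="locality")

# Step 2: the operator confirms and adds the entry to the ConfigMap.
typos["localities"] = "locality"

# Step 3: the next occurrence is auto-fixed.
second = route_dropped_field("localities", typos)
```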

v1.1.1 — Function Calling + Safety Gate

timyl released this 22 May 03:56

v1.1.1

v1.1 — Function Calling

  • LLM outputs structured tool_calls via bind_tools, replacing string-based fix_action
  • New execute_tool node for unified PCF operation dispatch
  • decide() router reads tool_call_name, with rules-mode fallback compatibility
  • New tool registry: tools/tool_registry.py (update_pcf_plmn / fix_profile_field / notify_only / no_action)

v1.1.1 — Bug Fixes & Safety Hardening

  • Unknown field propagation: unknown_fields written to AgentState; LLM receives (unknown — cannot auto-fix) label and correctly routes to notify instead of no_action
  • fixable_typos safety gate: execute_tool now enforces whitelist check, blocking LLM from auto-fixing unauthorized fields using training knowledge
  • Slack notification fix: NOTIFY_WEBHOOK_URL added to Secret; escalation alerts now delivered correctly

v1.0 — Production Hardening

timyl released this 21 May 06:00

What's New in v1.0

This release focuses on production hardening across four areas: safety, reliability, resilience, and configurability.

Safety

  • Defense-in-depth PLMN whitelist: the auto_fix node performs a hard code-level check before writing to PCF, independent of LLM confidence. It rejects any PLMN not in the config whitelist, guarding against prompt injection and LLM hallucination.
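
A minimal sketch of such a code-level gate, assuming the whitelist is a list of PLMN strings. SafetyGateRejected, check_plmn_allowed, and the write_to_pcf callable are hypothetical names, not the project's actual API.

```python
from typing import Callable, Iterable


class SafetyGateRejected(Exception):
    """Raised when a requested PLMN is not in the configured whitelist."""


def check_plmn_allowed(plmn: str, allowed_plmns: Iterable[str]) -> None:
    if plmn not in set(allowed_plmns):
        raise SafetyGateRejected(f"PLMN {plmn!r} is not in the configured whitelist")


def auto_fix(
    plmn: str,
    allowed_plmns: Iterable[str],
    write_to_pcf: Callable[[str], None],
) -> str:
    # The check runs before any write and ignores LLM confidence entirely.
    check_plmn_allowed(plmn, allowed_plmns)
    write_to_pcf(plmn)
    return "fixed"
```

Keeping the check in plain code means a hallucinated or injected PLMN fails before the PCF is touched, whatever the model reported.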

Configurability

  • Config-driven domain knowledge: ALLOWED_PLMNS, _FIXABLE_TYPOS, and _VENDOR_FIELDS extracted from source code into config/agent_config.yaml. Operators can update allowed PLMNs or fixable typo lists without touching Python code or rebuilding the image. Supports AGENT_CONFIG_PATH env var for K8s ConfigMap volume mount.

Reliability

  • Kafka manual offset commit: enable_auto_commit=False; the offset is committed only after a successful graph.invoke() and audit log write. Prevents silent message loss if the worker crashes mid-processing.
  • LangGraph version pin: langgraph<1.0.0 to avoid 1.x breaking API changes (KeyError: '__end__' in conditional edge routing). Added END: END to analyze path_map for correct termination.
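
The commit ordering behind the manual offset commit can be sketched without a Kafka client. All four parameters are hypothetical stand-ins: messages for the consumer iterator, invoke_graph for graph.invoke(), write_audit_log for the audit write, and commit_offset for the consumer's commit call.

```python
from typing import Any, Callable, Iterable


def run_worker(
    messages: Iterable[Any],
    invoke_graph: Callable[[Any], Any],
    write_audit_log: Callable[[Any, Any], None],
    commit_offset: Callable[[Any], None],
) -> int:
    """Process messages in order, committing each offset only after success."""
    handled = 0
    for message in messages:
        result = invoke_graph(message)    # may raise; offset stays uncommitted
        write_audit_log(message, result)  # may raise; offset stays uncommitted
        commit_offset(message)            # reached only when both steps succeeded
        handled += 1
    return handled
```

If the worker dies inside invoke_graph or write_audit_log, the uncommitted message is redelivered after restart, so processing should tolerate duplicates.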

Resilience

  • PCF REST API retry — All PCF API calls now use tenacity with 3 attempts, exponential backoff (1–8s), and 5s timeout. Retries on network errors and 5xx responses only; 4xx errors fail immediately.
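
The retry policy above is roughly equivalent to the hand-rolled loop below. This is a stdlib stand-in, not the tenacity configuration used in tools/pcf_tool.py; HttpStatusError, is_retryable, and call_with_retry are hypothetical names, and the 5 s request timeout is left to the caller's HTTP client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class HttpStatusError(Exception):
    """Hypothetical error carrying the HTTP status of a failed PCF call."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(exc: Exception) -> bool:
    """Retry on 5xx responses and network errors; never on 4xx."""
    if isinstance(exc, HttpStatusError):
        return 500 <= exc.status < 600
    return isinstance(exc, (ConnectionError, TimeoutError))


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 3,
    min_wait: float = 1.0,
    max_wait: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on non-retryable errors.
            if attempt == attempts or not is_retryable(exc):
                raise
            # Exponential backoff: 1 s, 2 s, 4 s, ... capped at max_wait.
            sleep(min(max_wait, min_wait * 2 ** (attempt - 1)))
    raise RuntimeError("unreachable")
```

With three attempts, at most two waits occur (1 s, then 2 s) before the final error is raised to the caller.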

Files Changed

| File | Change |
| --- | --- |
| config/agent_config.yaml | New — operator config for PLMNs and field whitelists |
| agent/graph.py | Config loader, auto_fix safety gate, LangGraph END fix |
| agent/worker.py | Kafka manual offset commit |
| tools/pcf_tool.py | Tenacity retry + 5s timeout |
| requirements.txt | Pin langgraph<1.0.0, add pyyaml, tenacity |
| Dockerfile | Include config/ directory |
| k8s/aiops/webhook.yaml | Service type: LoadBalancer |

Upgrade Notes

  • config/agent_config.yaml is required at runtime. Default path: <project_root>/config/agent_config.yaml. Override with AGENT_CONFIG_PATH env var.
  • For K8s: mount agent_config.yaml as a ConfigMap volume and set AGENT_CONFIG_PATH to the mount path for config updates without image rebuilds.

Next: v1.1 will introduce LLM function calling for dynamic tool selection, enabling extensibility to new NF types and REST API operations without graph restructuring.