Releases · timyl/aiops-agent

v1.3.1 — ConfigMap Hot-Reload & Observability Fixes (Latest)

timyl released this 25 May 13:26

v1.3.1

a5c8e23

What's New

Post-v1.3.0 patch release. All S1–S5 regression scenarios pass.

ConfigMap Hot-Reload (Q1)

agent/graph.py — every node (fetch_metrics, fetch_nrf_logs, _analyze_with_llm, _analyze_with_rules, execute_tool) now calls _load_config() on each alert instead of caching at import time
k8s/aiops/worker.yaml — switched from subPath mount to directory mount (mountPath: /app/config); K8s now auto-syncs ConfigMap changes within ~60 s, no rollout restart required
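
The read-on-every-alert pattern above can be sketched as follows. The body of _load_config() is not shown in these notes, so this is an assumption about its shape rather than the project's code. It parses JSON instead of YAML only to stay within the standard library, and the default path and the handle_alert helper are hypothetical. AGENT_CONFIG_PATH is the environment variable named in the v1.0 notes.

```python
import json
import os
from pathlib import Path

# Hypothetical default file name; the notes only give the mount path /app/config.
DEFAULT_CONFIG_PATH = "/app/config/agent_config.json"


def load_config(path=None):
    """Re-read the config file on every call instead of caching at import time."""
    config_path = Path(path or os.environ.get("AGENT_CONFIG_PATH", DEFAULT_CONFIG_PATH))
    with config_path.open(encoding="utf-8") as fh:
        # Assumption: JSON stands in for YAML so the sketch needs no third-party parser.
        return json.load(fh)


def handle_alert(alert, path=None):
    """Hypothetical node: each alert sees whatever the mounted file holds right now."""
    config = load_config(path)
    return alert["field"] in config.get("fixable_typos", {})
```

Reading per alert only helps because of the mount change: Kubernetes refreshes ConfigMaps mounted as a directory, while a subPath mount does not receive updates.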

Safety Gate Metrics Cleanup (Q2)

webhook/server.py — removed import agent.metrics; webhook /metrics endpoint no longer exposes worker metric series stuck at 0

Prometheus ServiceMonitor for Worker

k8s/aiops/monitoring.yaml — added aiops-worker-metrics headless Service (port 9100) and ServiceMonitor for Prometheus scraping; both previously existed only as local YAML and had not been committed

Grafana Dashboard & Cost Panel (Q3)

docs/grafana-dashboard-aiops.json — committed full Grafana 10.3.3 dashboard JSON (14 panels, 4 sections)
LLM Cost panel description updated: "Relative trend only — coefficients are rough estimates (~10× cheaper than actual ¥0.04/1K input, ¥0.12/1K output). Free-tier DashScope users: real cost is $0."
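
To make the gap between the panel and real spend concrete, here is a small worked calculation. The two rates are the actual prices quoted in the panel description. The function names are hypothetical, and the flat 0.1 factor is only an assumption standing in for "~10× cheaper", since the dashboard's real coefficients are not listed here.

```python
# Actual DashScope rates quoted in the panel description, in CNY per 1K tokens.
ACTUAL_INPUT_RATE = 0.04
ACTUAL_OUTPUT_RATE = 0.12


def actual_cost_cny(prompt_tokens, completion_tokens):
    """Cost at the quoted actual rates."""
    return (prompt_tokens / 1000) * ACTUAL_INPUT_RATE + (
        completion_tokens / 1000
    ) * ACTUAL_OUTPUT_RATE


def panel_estimate_cny(prompt_tokens, completion_tokens, scale=0.1):
    """Assumption: the panel's coefficients are modelled as roughly 0.1x the actual rates."""
    return actual_cost_cny(prompt_tokens, completion_tokens) * scale
```

For 50,000 prompt tokens and 10,000 completion tokens the actual cost is about ¥3.20, while a panel using coefficients ten times smaller would show about ¥0.32. The shape of the curve over time is the same in both cases, which is why the panel is described as a relative trend only.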

Load Test Race Condition Fix

scripts/load_test.py — wait_lock_released() now uses two-phase wait: Phase 1 waits for Redis lock to appear (up to 20 s), Phase 2 waits for lock to disappear; eliminates false-positive pass when Kafka consumer hasn't acquired the lock yet
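
A minimal sketch of the two-phase wait, assuming the lock check is passed in as a callable. The real wait_lock_released() in scripts/load_test.py is not shown in these notes, so the signature, the poll interval, and the Phase 2 timeout are hypothetical. Only the 20 s Phase 1 limit comes from the text above.

```python
import time


def wait_lock_released(lock_exists, appear_timeout=20.0, release_timeout=300.0,
                       poll_interval=0.5, clock=time.monotonic, sleep=time.sleep):
    """Return True only if the lock was observed and then disappeared.

    lock_exists is any zero-argument callable; with redis-py it could be
    something like `lambda: bool(client.exists(lock_key))` (assumed key name).
    """
    # Phase 1: wait for the lock to appear, so an idle system is not mistaken
    # for one that has already finished.
    deadline = clock() + appear_timeout
    while not lock_exists():
        if clock() >= deadline:
            return False
        sleep(poll_interval)

    # Phase 2: wait for the lock to disappear again.
    deadline = clock() + release_timeout
    while lock_exists():
        if clock() >= deadline:
            return False
        sleep(poll_interval)
    return True
```

A single-phase wait that only polls until the lock is absent returns immediately when the consumer has not yet picked up the message, which is the false-positive pass this fix removes.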

Documentation

docs/manual-test-procedure.md — complete S1–S6 manual test procedure with full commands, no cross-references or omissions

Regression Test Results

Scenario	Result	Duration
S1 PLMN Mismatch	✅ PASS	~3.5 min
S2 No Action	✅ PASS	~21 s
S3 Dedup	✅ PASS	~1.6 min
S4 Field Typo	✅ PASS	~18 s
S5 Concurrent	✅ PASS	~1.8 min

v1.3.0 — Prometheus Observability

timyl released this 24 May 06:11

v1.3.0

85a6f00

What's New

Full Prometheus metrics instrumentation for both the webhook server and the Kafka consumer worker.

New Metrics

Metric	Type	Description
`aiops_alerts_processed_total`	Counter	Total alerts processed, labelled by `outcome` (`no_action`, `auto_fixed`, `escalated`)
`aiops_alert_duration_seconds`	Histogram	End-to-end alert processing time from `fetch_logs` to terminal node
`aiops_fix_verified_total`	Counter	PCF config-fix verification results, labelled by `result` (`success`, `failure`)
`aiops_safety_gate_rejected_total`	Counter	Times `execute_tool` refused to act due to PLMN or fixable-typo whitelist check
`aiops_llm_duration_seconds`	Histogram	qwen-max LLM call latency
`aiops_llm_tokens_total`	Counter	LLM token usage, labelled by `type` (`prompt`, `completion`)
`aiops_rag_duration_seconds`	Histogram	RAG knowledge-base query latency per alert
`aiops_rag_chunks_returned`	Histogram	Number of unique RAG chunks returned per alert

Metrics Endpoints

Webhook server — GET /metrics on port 8000 (FastAPI endpoint)
Worker process — GET /metrics on port 9100 (prometheus_client background HTTP server)

Changes

agent/metrics.py — new module; single source of truth for all metric objects
agent/state.py — added alert_start_time: Optional[float] to AgentState
agent/graph.py — instrumented fetch_logs, rag_lookup, _analyze_with_llm, execute_tool, decide, verify_fix, notify
webhook/server.py — added GET /metrics endpoint
agent/worker.py — added start_http_server(9100) at startup
k8s/aiops/worker.yaml — exposed containerPort: 9100 for Prometheus scraping
requirements.txt — added prometheus_client>=0.20.0

v1.1.2 — RAG improvement for unknown fields & ConfigMap externalization

timyl released this 22 May 15:53

v1.1.2

ac35203

What's Changed

Bug Fix: RAG now consulted for unknown fields

Previously, when a dropped field was not in fixable_typos, the rag_lookup node was skipped entirely. This meant escalation messages relied solely on LLM training knowledge (confidence: medium). RAG is now queried for unknown fields as well, providing explicit correct field names from the knowledge base — raising confidence to high and improving Slack alert quality for human operators.

Feature: agent_config.yaml externalized as K8s ConfigMap

agent_config.yaml (fixable_typos, PLMN whitelist, vendor_fields) is now mounted into the pod via a K8s ConfigMap rather than baked into the Docker image. Config-only changes (adding typos, updating PLMN whitelist) now require only:

kubectl apply -f k8s/aiops/agent-config.yaml && kubectl rollout restart deployment/aiops-worker -n aiops

No image rebuild needed.

Config: expand fixable_typos

Added NfSetIdLists (capital N variant of nfSetIdLists)
Added localities (plural typo of locality)

Knowledge Accumulation Workflow (validated)

This release validates the end-to-end fault discovery → RAG accumulation → auto-fix promotion workflow:

New unknown field detected → RAG queried → Slack escalation with correct field name suggested
Operator confirms → adds field to fixable_typos ConfigMap (no rebuild)
Next occurrence → fixed automatically

v1.1.1 — Function Calling + Safety Gate

timyl released this 22 May 03:56

v1.1.1

a19c8b4

v1.1 — Function Calling

LLM outputs structured tool_calls via bind_tools, replacing string-based fix_action
New execute_tool node for unified PCF operation dispatch
decide() router reads tool_call_name, with rules-mode fallback compatibility
New tool registry: tools/tool_registry.py (update_pcf_plmn / fix_profile_field / notify_only / no_action)

v1.1.1 — Bug Fixes & Safety Hardening

Unknown field propagation: unknown_fields written to AgentState; LLM receives (unknown — cannot auto-fix) label and correctly routes to notify instead of no_action
fixable_typos safety gate: execute_tool now enforces whitelist check, blocking LLM from auto-fixing unauthorized fields using training knowledge
Slack notification fix: NOTIFY_WEBHOOK_URL added to Secret; escalation alerts now delivered correctly
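
The whitelist check in execute_tool might look roughly like the sketch below. The tool names come from the v1.1 tool registry. The function name, argument keys, config keys, reason strings, and the PLMN value are hypothetical, since the real execute_tool code is not included in these notes.

```python
def safety_gate(tool_name, args, config):
    """Return (allowed, reason). Intended to run before any PCF write,
    regardless of what the LLM proposed."""
    if tool_name == "update_pcf_plmn":
        # Assumed config key: allowed_plmns (the PLMN whitelist).
        if args.get("plmn") not in config.get("allowed_plmns", []):
            return False, "plmn_not_whitelisted"
    elif tool_name == "fix_profile_field":
        # Only typos an operator has explicitly listed may be auto-fixed.
        if args.get("field") not in config.get("fixable_typos", {}):
            return False, "field_not_in_fixable_typos"
    return True, "ok"


# Example config using the two typos added in v1.1.2; the PLMN is a placeholder.
example_config = {
    "allowed_plmns": ["00101"],
    "fixable_typos": {"NfSetIdLists": "nfSetIdLists", "localities": "locality"},
}
```

Keeping the check in plain code rather than in the prompt means a field the LLM "knows" how to fix from training data is still rejected until it appears in fixable_typos.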

v1.0 — Production Hardening

timyl released this 21 May 06:00

v1.0

23d177a

What's New in v1.0

This release focuses on production hardening across four areas: safety, reliability, resilience, and configurability.

Safety

Defense-in-depth PLMN whitelist — auto_fix node performs a hard code-level check before writing to PCF, independent of LLM confidence. Rejects any PLMN not in the config whitelist, guarding against prompt injection and LLM hallucination.

Configurability

Config-driven domain knowledge — ALLOWED_PLMNS, _FIXABLE_TYPOS, and _VENDOR_FIELDS extracted from source code into config/agent_config.yaml. Operators can update allowed PLMNs or fixable typo lists without touching Python code or rebuilding the image. Supports AGENT_CONFIG_PATH env var for K8s ConfigMap volume mount.

Reliability

Kafka manual offset commit — enable_auto_commit=False; offset committed only after successful graph.invoke() + audit log write. Prevents silent message loss if the worker crashes mid-processing.
LangGraph version pin — langgraph<1.0.0 to avoid 1.x breaking API changes (KeyError: '__end__' in conditional edge routing). Added END: END to analyze path_map for correct termination.
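
The manual offset commit above can be sketched as a loop that commits only after both processing steps succeed. This is an illustration, not the code in agent/worker.py: run_worker, process, and write_audit_log are hypothetical names, and the consumer is assumed to be a kafka-python style object created with enable_auto_commit=False, iterable over messages and exposing commit().

```python
def run_worker(consumer, process, write_audit_log):
    """Consume messages, committing each offset only after it is fully handled."""
    for message in consumer:
        # Stand-in for graph.invoke(); an exception here propagates and
        # skips the commit, so the message is redelivered after a restart.
        process(message.value)
        # The audit record must also be written before the offset moves.
        write_audit_log(message)
        consumer.commit()
```

The trade-off is at-least-once delivery: a crash after processing but before the commit causes the same alert to be handled again, so downstream handling needs to tolerate duplicates.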

Resilience

PCF REST API retry — All PCF API calls now use tenacity with 3 attempts, exponential backoff (1–8s), and 5s timeout. Retries on network errors and 5xx responses only; 4xx errors fail immediately.
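
The retry policy can be illustrated with a hand-rolled stand-in for tenacity, written with the standard library only. It is not the code in tools/pcf_tool.py: PcfHttpError, is_retryable, and call_with_retry are hypothetical names, and the 5 s per-request timeout is assumed to be applied inside the wrapped call.

```python
import time


class PcfHttpError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed PCF call."""

    def __init__(self, status):
        super().__init__(f"PCF returned HTTP {status}")
        self.status = status


def is_retryable(exc):
    """Retry network errors and 5xx responses; 4xx responses fail immediately."""
    if isinstance(exc, PcfHttpError):
        return 500 <= exc.status < 600
    return isinstance(exc, (ConnectionError, TimeoutError))


def call_with_retry(call, attempts=3, base_delay=1.0, max_delay=8.0, sleep=time.sleep):
    """Run call() up to `attempts` times with exponential backoff capped at max_delay."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:
            if attempt == attempts or not is_retryable(exc):
                raise
            # Waits of 1 s, then 2 s, doubling up to the 8 s cap.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

With three attempts only the 1 s and 2 s waits are ever used; the 8 s cap matters only if the attempt count is raised.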

Files Changed

File	Change
`config/agent_config.yaml`	New — operator config for PLMNs and field whitelists
`agent/graph.py`	Config loader, auto_fix safety gate, LangGraph END fix
`agent/worker.py`	Kafka manual offset commit
`tools/pcf_tool.py`	Tenacity retry + 5s timeout
`requirements.txt`	Pin langgraph<1.0.0, add pyyaml, tenacity
`Dockerfile`	Include config/ directory
`k8s/aiops/webhook.yaml`	Service type: LoadBalancer

Upgrade Notes

config/agent_config.yaml is required at runtime. Default path: <project_root>/config/agent_config.yaml. Override with AGENT_CONFIG_PATH env var.
For K8s: mount agent_config.yaml as a ConfigMap volume and set AGENT_CONFIG_PATH to the mount path for config updates without image rebuilds.

Next: v1.1 will introduce LLM function calling for dynamic tool selection, enabling extensibility to new NF types and REST API operations without graph restructuring.
