timyl/aiops-agent

AIOps Agent — 5G NF Intelligent Operations

An autonomous AIOps agent that monitors 5G core network NF registration failures, localizes root causes, applies configuration fixes, and verifies recovery — end-to-end in minutes.

Stack: LangGraph · qwen-max (function calling) / OCI Generative AI · Alibaba Bailian RAG / OCI OpenSearch RAG · Kafka · Redis · Kubernetes · Prometheus · Elasticsearch


Background

Modern 5G core networks generate thousands of alerts per day across dozens of network functions. Traditional NOC operations rely on manual triage — engineers read logs, cross-reference metrics, and apply fixes by hand. This process is slow, error-prone, and does not scale as network complexity grows.

This agent was built to augment NOC operations for a 5G core network deployment. It addresses three pain points:

  • Reactive to proactive: correlates Prometheus metrics + Elasticsearch logs + NF config state to diagnose faults before they impact subscribers
  • Human-in-the-loop where it matters: auto-fixes high-confidence, bounded faults (config errors, field typos, PLMN mismatches); escalates ambiguous or out-of-scope faults to on-call engineers with a structured diagnosis
  • Predictive analysis foundation: the same pipeline — alert → evidence collection → LLM reasoning → action — can be extended to anomaly forecasting and capacity planning

Operations Efficiency

Workflow Comparison

[Figure: Process Comparison]

Performance Benchmark

[Figure: Benchmark Chart]

Alert triage breakdown (20-rule PCF ruleset):

Auto-fix (no human needed)   ███████░░░░░░░░░░░░░  35%   (7 / 20 alert types)
Assisted diagnosis + notify  ██████████░░░░░░░░░░  50%   (10 / 20 alert types)
Escalate immediately         ███░░░░░░░░░░░░░░░░░  15%   (3 / 20 alert types)

Architecture

AlertManager ──► Kafka (aiops-alerts topic)
                     │
                     ▼
              aiops-worker (Kafka Consumer)
              ├── Redis dedup lock (SET NX EX 300)
              └── Semaphore(3) ── LangGraph Agent
                                      │
                     ┌────────────────┼─────────────────────┐
                     ▼                ▼                     ▼
               fetch_logs      fetch_metrics        fetch_nrf_logs
                     └────────────────┼─────────────────────┘
                                      ▼
                                  rag_lookup  ──► Vector Knowledge Base
                                      ▼
                                   analyze   ──► LLM (qwen-max, function calling)
                                      │            bind_tools → tool_calls
                                      ▼
                                   decide()  [deterministic routing on tool_call_name]
                          ┌──────────┴──────────────────┐
                          ▼                             ▼
                    execute_tool                      notify ──► Slack
                    ├── PLMN whitelist gate
                    ├── fixable_typos whitelist gate
                    └── pcf_update_plmn / pcf_fix_field
                          │
                          ▼
                      verify_fix  (Prometheus rate / field key check)
                          │
                          ▼
                    incidents.jsonl  (audit log)

Webhook flow: AlertManager → aiops-webhook (FastAPI, Kafka producer), which publishes the alert and returns 200 immediately → the Kafka consumer (aiops-worker) processes it asynchronously.
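The webhook flow above can be sketched as a handler that only enqueues and returns. This is a minimal sketch, not the code in webhook/server.py: it assumes an Alertmanager-style payload with an `alerts` list, and `publish` is a hypothetical stand-in for the Kafka producer's send call so the example runs without a broker.

```python
import json
from typing import Callable

TOPIC = "aiops-alerts"  # topic name taken from the architecture diagram


def handle_alertmanager_webhook(
    payload: dict, publish: Callable[[str, bytes], None]
) -> dict:
    """Enqueue every alert in the payload and return at once.

    `publish(topic, value)` is a hypothetical stand-in for a Kafka
    producer's send call; no diagnosis happens in the request path.
    """
    alerts = payload.get("alerts", [])
    for alert in alerts:
        publish(TOPIC, json.dumps(alert, sort_keys=True).encode("utf-8"))
    # The HTTP layer would turn this into a 200 response.
    return {"status": "queued", "count": len(alerts)}
```

Keeping the handler free of any evidence fetching or LLM call is what lets AlertManager get its response immediately, whatever the worker backlog looks like.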

Safety design: Two independent whitelists in execute_tool enforce boundaries regardless of LLM output — PLMN whitelist for update_pcf_plmn, fixable_typos whitelist for fix_profile_field. Unauthorized calls are rejected and escalated to Slack.
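A rough sketch of the two gates, under stated assumptions: the whitelist contents below are example values, `safety_gate` is a hypothetical name, and checking the proposed correction as well as the wrong field name is an extra assumption of this sketch rather than something the source confirms.

```python
from typing import Any

# Assumed example values; the real lists live in the agent's configuration.
ALLOWED_PLMNS = {("510", "011"), ("208", "93"), ("001", "01")}
FIXABLE_TYPOS = {"nfSetIdLists": "nfSetIdList", "plmnLists": "plmnList"}


def safety_gate(tool_name: str, args: dict[str, Any]) -> tuple[bool, str]:
    """Return (allowed, reason) for an LLM-proposed tool call."""
    if tool_name == "update_pcf_plmn":
        plmn = (args.get("mcc"), args.get("mnc"))
        if plmn in ALLOWED_PLMNS:
            return True, "PLMN confirmed in whitelist"
        return False, f"PLMN {plmn[0]}/{plmn[1]} not in whitelist"
    if tool_name == "fix_profile_field":
        wrong, correct = args.get("wrong_name"), args.get("correct_name")
        # Sketch assumption: the correction must match the approved mapping.
        if wrong in FIXABLE_TYPOS and FIXABLE_TYPOS[wrong] == correct:
            return True, "field confirmed in fixable_typos whitelist"
        return False, f"field {wrong!r} not in fixable_typos whitelist"
    # Anything else is not an auto-fix and goes to a human.
    return False, f"tool {tool_name!r} is not an auto-fix tool"
```

Because the gate looks only at the tool name and arguments, a confident but wrong LLM diagnosis cannot widen what the agent is allowed to change.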


Observability

Real-time metrics from a load-test run (29 alerts, 5 concurrent scenarios).

[Figure: Grafana dashboard, AIOps Agent v1.3]

Key results: 69% auto-fixed · 17% escalated to human · 100% fix verification rate
Latency: end-to-end p50 ~30 s · LLM call p50 ~5.4 s (qwen-max)

| Metric | Description |
|---|---|
| aiops_alerts_processed_total | Alert outcomes by type: auto_fixed, escalated, no_action |
| aiops_alert_duration_seconds | End-to-end processing latency per alert |
| aiops_fix_verified_total | PCF config fix verification pass/fail rate |
| aiops_safety_gate_rejected_total | Blocked LLM calls outside whitelist |
| aiops_llm_duration_seconds / aiops_llm_tokens_total | LLM latency and token usage |
| aiops_rag_duration_seconds / aiops_rag_chunks_returned | RAG knowledge base query performance |

Metrics are scraped from two endpoints via a Prometheus ServiceMonitor: the webhook's /metrics on port 8000 and the worker's /metrics on port 9100.


Fault Scenarios

The agent covers the full PCF alert ruleset. Actions fall into three tiers:

  • Auto-fix — agent applies the fix autonomously, verifies, and closes the incident
  • Assisted — agent diagnoses root cause, correlates evidence across sources, notifies on-call with a structured report
  • Escalate — agent detects the fault but scope is outside safe auto-fix boundary; pages immediately with context

| # | Scenario | Severity | Detection | Agent Action | MTTR |
|---|---|---|---|---|---|
| 1 | NF Registration: PLMN Mismatch | Critical | Prometheus retry rate > 3/2min | Auto-fix: correct plmnList via PCF REST API, verify retry rate drops | ~2 min |
| 2 | NF Registration: Silent Field Drop | Critical | NRF WARN logs in ES (field rejected) | Auto-fix: rename typo field via PCF REST API, confirm NRF accepts | ~30 sec |
| 3 | NF Registration: Unknown Field | Critical | NRF WARN logs in ES (unrecognized attr) | Escalate: Slack alert with field name; outside auto-fix whitelist | immediate |
| 4 | Policy Control Service Down | Critical | Service health metric = 0 | Assisted: identify crashed pod, correlate with OOM/crash logs, recommend restart sequence | ~3 min |
| 5 | Session Management High Error Rate | Critical | SM ingress error rate > 10% (24h) | Assisted: cross-correlate with UDR/CHF errors to isolate root NF, structured Slack report | ~5 min |
| 6 | Session Management Traffic Overload | Major | SM request rate > 90% max MPS | Assisted: confirm burst pattern, recommend HPA scale-out, notify capacity team | ~2 min |
| 7 | Diameter Connector High Error Rate | Critical | Diameter ingress error rate > 10% | Assisted: check SCP peer health, correlate Diameter + SCP alerts, escalate with correlation summary | ~4 min |
| 8 | Diameter Connector Traffic Saturation | Major | Diameter request rate > 90% max MPS | Assisted: traffic surge detected, identify source NF, recommend load balancing review | ~2 min |
| 9 | UDR Connectivity Timeout Spike | Major | UDR timeout rate > 10% of requests | Assisted: query ES for timeout patterns and duration trends, suggest connection pool or retry tuning | ~5 min |
| 10 | UDR Service High Error Rate | Critical | UDR egress error rate > 10% (24h) | Assisted: correlate with DB tier health, determine whether UDR or underlying DB is root cause | ~5 min |
| 11 | CHF Connectivity Timeout Spike | Major | CHF timeout rate > 10% of requests | Assisted: analyze charging server response trends, notify billing ops with pattern data | ~5 min |
| 12 | CHF Service High Error Rate | Critical | CHF egress error rate > 10% (24h) | Assisted: spending limit control failure analysis, notify billing team with impact estimate | ~4 min |
| 13 | Policy Datastore High Error Rate | Critical | PolicyDS ingress error rate > 10% | Assisted: correlate with DB tier alert, determine if DB or policy engine is source | ~5 min |
| 14 | Policy Database Tier Unreachable | Critical | DB health indicator = 0 | Assisted: ES log analysis for crash reason (OOM/disk/network), trigger restart runbook via notify | ~8 min |
| 15 | Pod CPU Congestion | Critical | Pod CPU congestion state = congested | Auto-fix: trigger HPA scale-out, mark congested pod, notify if congestion persists post-scale | ~2 min |
| 16 | Pod Memory Congestion | Critical | Pod memory congestion state = congested | Assisted: identify memory leak signature in logs, recommend pod restart or JVM tuning | ~3 min |
| 17 | Request Queue Congestion | Critical | Pending request queue state = congested | Assisted: identify bottleneck service, correlate with CPU/memory alerts, recommend scale path | ~4 min |
| 18 | Egress Peer Unreachable | Major | SCP peer health status ≠ 0 | Assisted: diagnose peer connectivity, identify if peer or network path is down, suggest failover | ~3 min |
| 19 | All Egress Peers in Peer-Set Down | Critical | Peer available count = 0 across peer-set | Escalate: total egress path failure, page on-call immediately with peer-set name and last-seen time | immediate |
| 20 | SMSC Connection Loss | Major | Active SMSC connection count = 0 for 10 min | Assisted: check SMSC logs for disconnect reason, attempt reconnect trigger, notify messaging ops | ~5 min |

Tech Stack Decisions

| Component | Choice | Why |
|---|---|---|
| Agent framework | LangGraph | Explicit node graph + conditional edges; LLM controls diagnosis only, not routing |
| LLM | OpenAI GPT / OCI Generative AI (Llama 3) | OpenAI-compatible API; pluggable (swap endpoint via env var) |
| RAG | OCI OpenSearch vector index / managed KB | Provides 3GPP field name context for silent-drop detection; swappable backend |
| Message queue | Kafka KRaft | Decouples AlertManager from agent; survives burst alerts; enables multi-consumer patterns |
| Dedup lock | Redis SET NX EX 300 | Distributed across potential multi-pod deployments; prevents duplicate runs on same alert |
| Observability | Elasticsearch + Prometheus | PCF/NRF logs go to ES, metrics to Prometheus |
| Config fix | PCF REST API (3GPP SBA) | Direct NF control plane: no manual SSH, auditable, reversible |

Prompt Engineering

Four-layer separation of concerns:

SYSTEM_PROMPT   — role definition + tool catalog (4 tools, stable)
ANALYSIS_PROMPT — parameterized evidence template ({allowed_plmns}, {field_errors}, etc.)
RAG chunks      — 3GPP field knowledge (retrieved at runtime, not hardcoded)
decide()        — deterministic routing on tool_call_name in Python (not LLM)
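The parameterized layer can be pictured as an ordinary format string filled from collected evidence. The template below is a hypothetical, shortened version whose wording mirrors the evidence section visible in the sample runs; it is not the contents of prompts.py, and `build_analysis_prompt` and the evidence keys are illustrative names.

```python
# Hypothetical, shortened template; not the actual ANALYSIS_PROMPT.
ANALYSIS_PROMPT = (
    "Alert: {alert_name} | namespace: {namespace}\n"
    "NRF retry rate (2min): {retry_rate}\n"
    "PCF plmnList: {plmn_list}\n"
    "Allowed PLMNs: {allowed_plmns}\n"
    "Silent-drop field errors: {field_errors}\n"
    "=== KNOWLEDGE BASE ===\n"
    "{rag_chunks}"
)


def build_analysis_prompt(evidence: dict, rag_chunks: list[str]) -> str:
    """Fill the template from collected evidence plus retrieved chunks."""
    return ANALYSIS_PROMPT.format(
        alert_name=evidence["alert_name"],
        namespace=evidence["namespace"],
        retry_rate=evidence["retry_rate"],
        plmn_list=evidence["plmn_list"],
        allowed_plmns=",".join(evidence["allowed_plmns"]),
        # Empty lists are rendered explicitly so the LLM sees "none".
        field_errors=evidence["field_errors"] or "none",
        rag_chunks="\n---\n".join(rag_chunks) or "(no chunks retrieved)",
    )
```

Keeping the stable role text in SYSTEM_PROMPT and only the per-incident values here means a new alert changes the arguments, not the prompt wording.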

Function calling flow: LLM receives tool schemas via llm.bind_tools(TOOLS) and responds with a structured tool_calls object — no string parsing. execute_tool dispatches on tool_call_name and enforces whitelists independently of LLM reasoning.

LLM role  →  diagnosis + intent expression (tool_calls)
Code role →  routing (decide) + enforcement (execute_tool whitelists) + execution (PCF API)
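The split between LLM role and code role can be illustrated with a routing function that never inspects free text. This is a sketch, not the decide() in agent/graph.py: the state keys, the "end" route name, and the rule that only high confidence reaches execute_tool are assumptions made for illustration.

```python
AUTO_FIX_TOOLS = {"update_pcf_plmn", "fix_profile_field"}


def decide(state: dict) -> str:
    """Pick the next graph node from the tool name and confidence alone."""
    tool = state.get("tool_call_name")
    # Assumed threshold: only high-confidence fixes go to execution.
    if tool in AUTO_FIX_TOOLS and state.get("confidence") == "high":
        return "execute_tool"
    if tool == "no_action":
        return "end"  # hypothetical route name for a healthy system
    # notify_only, lower confidence, or a missing/unknown tool call
    return "notify"
```

A function like this can be covered by plain unit tests, which is the practical meaning of routing being testable and auditable.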

Repository Structure

aiops-agent/
├── agent/
│   ├── graph.py          # LangGraph nodes + edges + decide() routing
│   ├── prompts.py        # SYSTEM_PROMPT + ANALYSIS_PROMPT
│   ├── state.py          # TypedDict state definition
│   ├── worker.py         # Kafka consumer + Redis dedup + Semaphore
│   └── log_fmt.py        # Structured log formatter
├── tools/
│   ├── es_tool.py         # Elasticsearch log query
│   ├── prometheus_tool.py # Prometheus metrics query
│   ├── pcf_tool.py        # PCF config read/write (3GPP SBA REST)
│   ├── rag_tool.py        # RAG retrieval (OCI OpenSearch / Alibaba Bailian)
│   └── tool_registry.py   # LangChain @tool definitions + TOOLS (bind) + TOOL_MAP (dispatch)
├── webhook/
│   └── server.py         # FastAPI: AlertManager → Kafka producer
├── k8s/
│   ├── aiops/            # Agent microservice manifests
│   │   ├── configmap.yaml
│   │   ├── webhook.yaml  # Deployment + ClusterIP Service
│   │   └── worker.yaml   # Deployment
│   ├── kafka/            # Kafka KRaft StatefulSet
│   ├── redis/            # Redis Deployment
│   └── alertmanager/     # AlertmanagerConfig CR + PrometheusRule
├── Dockerfile
└── requirements.txt

Quick Start

Prerequisites

  • Kubernetes cluster with aiops namespace
  • Kafka + Redis deployed in aiops namespace (see k8s/kafka/ and k8s/redis/)
  • Prometheus + Elasticsearch + 5G NF accessible from cluster nodes

1. Configure secrets

# Copy and fill in your credentials (do NOT commit this file)
cp k8s/aiops/secret.yaml.example k8s/aiops/secret.yaml
kubectl apply -f k8s/aiops/secret.yaml

2. Deploy to K8s

kubectl apply -f k8s/aiops/configmap.yaml
kubectl apply -f k8s/aiops/webhook.yaml
kubectl apply -f k8s/aiops/worker.yaml

# AlertManager routing
kubectl apply -f k8s/alertmanager/aiops-webhook.yaml

3. Verify

kubectl get pods -n aiops
kubectl logs -n aiops deploy/aiops-webhook
kubectl logs -n aiops deploy/aiops-worker -f

4. Trigger a fault (demo)

curl -X POST http://<alertmanager-host>/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"NfRegistrationFailure","namespace":"<nf-namespace>","NfType":"PCF"},"status":"firing"}]'

Cloud Migration Path

| Self-hosted | OCI | AWS |
|---|---|---|
| Kubernetes (Kubespray) | OKE | EKS |
| Elasticsearch | OCI OpenSearch | Amazon OpenSearch |
| RAG knowledge base | OCI Generative AI + OpenSearch vector | Bedrock Knowledge Bases |
| LLM (GPT / Llama) | OCI Generative AI Service | Amazon Bedrock |
| incidents.jsonl | OCI NoSQL Database | DynamoDB |
| FastAPI webhook | OCI API Gateway + Functions | API Gateway + Lambda |
| Slack notify | OCI Notifications | SNS |

Key Engineering Trade-offs

Why Kafka instead of direct webhook → agent?
Decouples ingestion from processing. AlertManager gets an instant 200 response. The worker controls concurrency (Semaphore(3)) and deduplication (Redis). Offsets are committed only after a run completes, so after a restart any uncommitted alerts are re-consumed.
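The concurrency cap can be sketched with a semaphore around each agent run. This sketch uses asyncio for brevity, which is an assumption (the actual worker may use threads); `run_worker` and `handle` are hypothetical names, and the `committed` list stands in for Kafka offset commits.

```python
import asyncio


async def run_worker(alerts: list[str], handle, limit: int = 3) -> list[str]:
    """Process alerts with at most `limit` agent runs in flight."""
    sem = asyncio.Semaphore(limit)
    committed: list[str] = []

    async def process(alert: str) -> None:
        async with sem:
            await handle(alert)
            # Commit only after the run finishes, so a crash before this
            # point leaves the alert to be re-consumed.
            committed.append(alert)

    await asyncio.gather(*(process(a) for a in alerts))
    return committed
```

With ten queued alerts and `limit=3`, seven coroutines wait on the semaphore while three runs hold it, so a burst of alerts does not turn into a burst of concurrent LLM calls.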

Why Redis for dedup instead of in-memory set?
In-memory state is lost on pod restart. Redis survives worker restarts and works across multiple worker replicas.
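The dedup lock amounts to one atomic SET with NX and EX. In the sketch below, `FakeRedis` is an in-memory stand-in so the example runs without a server; its `set(name, value, nx=..., ex=...)` call shape follows redis-py, and the key layout and function name are hypothetical.

```python
import time


class FakeRedis:
    """In-memory stand-in that mimics SET with NX and EX for the sketch."""

    def __init__(self) -> None:
        self._expiry: dict[str, float] = {}

    def set(self, name, value, nx=False, ex=None):
        now = time.monotonic()
        live = self._expiry.get(name, 0.0) > now
        if nx and live:
            return None  # redis-py also returns None when NX fails
        self._expiry[name] = now + (ex if ex is not None else float("inf"))
        return True


def acquire_alert_lock(client, alert_name: str, namespace: str, ttl: int = 300) -> bool:
    """True if this worker won the right to process the alert."""
    key = f"aiops:dedup:{namespace}:{alert_name}"  # hypothetical key layout
    return bool(client.set(key, "1", nx=True, ex=ttl))
```

The first worker to call `acquire_alert_lock` for a given alert gets True; every other replica gets False until the 300-second TTL expires, with no separate unlock step needed.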

Why deterministic decide() instead of LLM routing?
LLM handles ambiguous log interpretation. Routing logic (PLMN whitelist check, confidence threshold, fix boundary) is code — testable, auditable, zero hallucination risk.

Why separate webhook + worker pods?
Different resource profiles: webhook is lightweight (Kafka producer only), worker is CPU/memory intensive (LLM calls, concurrent threads). Scale independently.


Deployment — aiops Namespace

dscl1@bastion:~$ kubectl get pods -n aiops 
NAME                            READY   STATUS    RESTARTS   AGE
aiops-webhook-869978964-klp7l   1/1     Running   0          9h
aiops-worker-7cd48b7c78-c72n6   1/1     Running   0          6s
aiops-worker-7cd48b7c78-mb2v9   1/1     Running   0          9h
aiops-worker-7cd48b7c78-mbbwk   1/1     Running   0          6s
aiops-worker-7cd48b7c78-rhvl9   1/1     Running   0          6s
aiops-worker-7cd48b7c78-z6prj   1/1     Running   0          6s
kafka-0                         1/1     Running   0          16h
redis-845d787d54-fdgtm          1/1     Running   0          17h

$ kubectl get svc -n aiops
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
aiops-webhook    ClusterIP   10.233.61.70    <none>        8000/TCP          97m
kafka            NodePort    10.233.43.141   <none>        9092:30092/TCP    8h
kafka-headless   ClusterIP   None            <none>        9092/TCP,9093/TCP 8h
redis            NodePort    10.233.44.168   <none>        6379:30379/TCP    9h

$ kubectl get deployment -n aiops
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
aiops-webhook   1/1     1            1           97m
aiops-worker    1/1     1            1           97m
redis           1/1     1            1           9h

Sample Agent Runs

Scenario 1 — PLMN Mismatch (auto-fix in ~2 min)

Alert fires → agent fetches evidence → LLM calls update_pcf_plmn via function calling → PLMN whitelist gate passes → PCF REST API fix → Prometheus rate verified.

[ALERT]  NfRegistrationFailure | ns=occnp2 | NfType=PCF | status=firing
[AGENT]  Starting LangGraph agent for occnp2...

[GRAPH]  → fetch_logs      PCF ERROR/WARN logs retrieved from Elasticsearch
[GRAPH]  → fetch_metrics   nrf_retry_rate=10.7/2min  (threshold > 3)
[GRAPH]  → fetch_nrf_logs  no dropped fields detected
[GRAPH]  → rag_lookup      skipped — no dropped fields detected
[GRAPH]  → analyze         (mode=llm)

[LLM]    > POST /compatible-mode/v1/chat/completions  model=qwen-max
[LLM]    ┌── Evidence sent to LLM ─────────────────────────────────────
[LLM]    │  NRF retry rate: 10.7  [active failure loop > 10]
[LLM]    │  PCF plmnList: [{'mcc': '510', 'mnc': '088'}]
[LLM]    │  Allowed PLMNs: 510/011, 208/93, 001/01
[LLM]    │  Silent-drop field errors: none
[LLM]    │  PCF logs: WARN nrf-client: PLMN 510/088 not in allowed list
[LLM]    └────────────────────────────────────────────────────────────
[LLM]    < 200 OK  duration=5.1s
[LLM]    ── Observations ────────────────────────────────────
[LLM]    [1] NRF retry rate 10.7 confirms active registration failure loop.
[LLM]    [2] PCF plmnList 510/088 is not in the allowed PLMN list.
[LLM]    [3] No silent-drop field errors detected.
[LLM]    ── Diagnosis ───────────────────────────────────────
[LLM]    root_cause → PCF plmnList contains PLMN 510/088 not accepted by NRF.
[LLM]    confidence → high
[LLM]    tool_call  → update_pcf_plmn({'mcc': '510', 'mnc': '011'})
[LLM]    ────────────────────────────────────────────────────

[GRAPH]  → decide  route=execute_tool  tool=update_pcf_plmn  confidence=high
[GRAPH]  → execute_tool  tool=update_pcf_plmn  args={'mcc': '510', 'mnc': '011'}
[SAFETY] PLMN 510/011 ✓ confirmed in whitelist
[EXECUTE] > PUT /nrf-client-nfmanagement/nfProfileList
[EXECUTE] > plmnList=[{'mcc':'510','mnc':'011'}]  (was [{'mcc':'510','mnc':'088'}])
[PCF]    < 200 OK  duration=0.048s

[GRAPH]  → verify_fix  attempt 1/3  (sleeping 30s)
[VERIFY] nrf_retry_rate=6.3 >= 3  still recovering — retry
[GRAPH]  → verify_fix  attempt 2/3  (sleeping 60s)
[VERIFY] nrf_retry_rate=0.0 < 3  ✓ registration restored

[AGENT]  ===== Run Complete =====
[AGENT]  Root cause  : PCF plmnList contains PLMN 510/088 not accepted by NRF.
[AGENT]  Fix applied : True  |  Fix verified: True
[AUDIT]  Incident saved → /data/incidents.jsonl
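The verification loop in this run can be sketched as a bounded poll with a growing delay. The 30 s and 60 s waits and the threshold of 3 come from the log above; the third delay, the function signature, and the injected `get_retry_rate` and `sleep` callables are assumptions for illustration.

```python
import time
from typing import Callable


def verify_fix(
    get_retry_rate: Callable[[], float],
    threshold: float = 3.0,
    delays: tuple[int, ...] = (30, 60, 120),  # third value is assumed
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the retry rate after a fix, waiting longer before each attempt."""
    for attempt, delay in enumerate(delays, start=1):
        sleep(delay)  # give Prometheus time to reflect the new config
        rate = get_retry_rate()
        if rate < threshold:
            return True  # registration restored
        print(f"attempt {attempt}/{len(delays)}: rate={rate} >= {threshold}")
    return False
```

Injecting `sleep` keeps the loop testable without waiting; a False result is the natural point to hand the incident to a human.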

Scenario 2 — Silent-Drop Field Typo (auto-fix in ~30 sec)

NRF silently drops a misspelled field; PCF thinks registration succeeded (HTTP 200) but NF profile is incomplete. Agent detects via NRF WARN logs + RAG lookup, fixes via PCF REST API, verifies by checking profile keys.

 [INFO] [WORKER] Consumed offset=23 partition=0
 [INFO] [ALERT] NfRegistrationFailure | ns=occnp2 | NfType=PCF | status=firing
 [INFO] [AGENT] Starting LangGraph agent for occnp2...
 [INFO] [GRAPH] → fetch_logs
 [INFO] [TOOL/ES]  > GET http://172.16.100.91:30200/k8s-2026.05.27/_search
 [INFO] [TOOL/ES]  > filter: namespace=occnp2  level IN [ERROR,WARN]  @timestamp >= now-5m  (since 05:44:48Z)
 [INFO] [TOOL/ES]  < 200 OK  duration=0.782s  returned=46 lines  (ERROR=0, WARN=46)
 [INFO] [GRAPH] → fetch_metrics
 [INFO] [TOOL/PM]  > GET http://172.16.100.91:30504/api/v1/query
 [INFO] [TOOL/PM]  > query: increase(occnp_nrfclient_nw_conn_out_request_total{MessageType="AutonomousNfRegistration",namespace="occnp2"}[2m])
 [INFO] [TOOL/PM]  < 200 OK  duration=0.010s
 [INFO] [TOOL/PM]    nrf_retry_rate=1.3/2min  (threshold=3, ✓ normal)
 [INFO] [TOOL/PM]    pcf_local_status=PCF_LOCAL_REGISTERED
 [INFO] [TOOL/PCF] > GET http://172.16.100.231:8000/PCF/nf-common-component/v1/nrf-client-nfmanagement/nfProfileList
 [INFO] [TOOL/PCF] < 200 OK  duration=0.005s
 [INFO] [TOOL/PCF]   nfStatus=REGISTERED  plmnList=[{'mcc': '510', 'mnc': '011'}]
 [INFO] [TOOL/PCF]   plmn 510/011 → ✓ in NRF allowed list
 [INFO] [GRAPH] → fetch_nrf_logs
 [INFO] [TOOL/ES]  > GET http://172.16.100.91:30200/k8s-2026.05.27/_search
 [INFO] [TOOL/ES]  > filter: namespace=ocnrf1  level IN [ERROR,WARN]  @timestamp >= now-5m  (since 05:44:49Z)
 [INFO] [TOOL/ES]  < 200 OK  duration=0.006s  returned=1 lines  (WARN=1)
 [INFO] [TOOL/ES]  ⚠ PCF field typo detected (auto-fixable): 'nfSetIdLists'
 [INFO] [TOOL/ES]    [2026-05-27T05:49:25] WARN ocnrf-nfregistration-b5bb7fcdd-gbmpg [requesterNfType=PCF]: Allow only VendorSpecific attributes. Value of enableF5 is true and acceptAdditionalAttributes is false. The foll
 [INFO] [GRAPH] → rag_lookup  (querying knowledge base for ['nfSetIdLists'])
 [INFO] [LLM]   > RAG query: field name 'nfSetIdLists'
 [INFO] [LLM]   < RAG returned 5 chunk(s)
 [INFO] [LLM]     · priority: 3GPP TS 29.510 correct field name. Integer 0-65535. Lower value = higher priority for NF s...
 [INFO] [LLM]     · Oracle NRF silent drop: when NRF receives an NF Profile PUT/PATCH request with unknown or non-standa...
 [INFO] [LLM]     · nfSetIdList: 3GPP TS 29.510 section 6.1.6.2.2 — correct field name. List of NF Set identifiers the N...
 [INFO] [LLM]     · scpInfo: vendor-specific field, not in 3GPP TS 29.510 core NF profile schema. Some vendor implementa...
 [INFO] [LLM]     · olcHSupportInd: vendor-specific indicator field. Not a standard 3GPP TS 29.510 NF Profile field. Ora...
 [INFO] [GRAPH] → analyze  (mode=llm)
 [INFO] [LLM]   > POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions  model=qwen-max  temperature=0  tools=['update_pcf_plmn', 'fix_profile_field', 'notify_only', 'no_action']
 [INFO] [LLM]   ┌── SYSTEM PROMPT ────────────────────────────────────────────
 [INFO] [LLM]   │  You are a 5G core network SRE agent specializing in NF registration diagnostics.
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  Your job: read the evidence provided and output a structured diagnosis, then call the appropriate tool.
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  Output:
 [INFO] [LLM]   │  - root_cause: one concise sentence describing the confirmed fault
 [INFO] [LLM]   │  - confidence:
 [INFO] [LLM]   │      "high"   — evidence is unambiguous, single clear cause
 [INFO] [LLM]   │      "medium" — most likely cause but minor uncertainty remains
 [INFO] [LLM]   │      "low"    — signals conflict or insufficient evidence to determine root cause
 [INFO] [LLM]   │  - Call exactly one tool to express your decision:
 [INFO] [LLM]   │      update_pcf_plmn(mcc, mnc)             — PCF plmnList contains a PLMN not accepted by NRF; correct value is mcc/mnc
 [INFO] [LLM]   │      fix_profile_field(wrong_name, correct_name) — field name typo in PCF profile dropped silently by NRF
 [INFO] [LLM]   │      notify_only(reason)                   — fault detected but outside auto-fix scope; human intervention required
 [INFO] [LLM]   │      no_action(reason)                     — system is healthy; alert is stale or self-resolved
 [INFO] [LLM]   ├── USER MESSAGE (ANALYSIS_PROMPT filled) ────────────────
 [INFO] [LLM]   │  Analyze this NF registration incident. Think step by step.
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  Alert: NfRegistrationFailure | namespace: occnp2
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  === EVIDENCE ===
 [INFO] [LLM]   │  NRF retry rate (2min): 1.3  [healthy < 3, active failure loop > 10]
 [INFO] [LLM]   │  PCF local status: PCF_LOCAL_REGISTERED
 [INFO] [LLM]   │  PCF plmnList: [{'mcc': '510', 'mnc': '011'}]
 [INFO] [LLM]   │  Allowed PLMNs: 510/011,208/93,001/01,001/001,505/02
 [INFO] [LLM]   │  Silent-drop field errors (auto-detected from NRF logs): ['nfSetIdLists']
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  === PCF LOGS (ERROR/WARN) ===
 [INFO] [LLM]   │  [2026-05-27T05:49:25] WARN occnp-occnp-nrf-client-nfmanagement-6c966f7847-jh94h: Performance data not available for npcf
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  === NRF LOGS (WARN) ===
 [INFO] [LLM]   │  [2026-05-27T05:49:25] WARN ocnrf-nfregistration-b5bb7fcdd-gbmpg [requesterNfType=PCF]: Allow only VendorSpecific attribu
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  === KNOWLEDGE BASE (retrieved for this incident) ===
 [INFO] [LLM]   │  priority: 3GPP TS 29.510 correct field name. Integer 0-65535. Lower value = higher priority for NF selection.nfSetIdLists: INVALID — not a 3GPP TS 29.510 field. Common typo: extra 's' appended to nfSetIdList. Oracle NRF behavior: returns HTTP 200 but silently drops this field. NRF WARN log: 'The following attributes have been dropped/ignored [nfSetIdLists]'. Business impact: downstream NFs cannot discover this PCF by NF Set membership. Fix: rename field to nfSetIdList (remove trailing 's').
 [INFO] [LLM]   │  ---
 [INFO] [LLM]   │  Oracle NRF silent drop: when NRF receives an NF Profile PUT/PATCH request with unknown or non-standard fields, it returns HTTP 200 OK but omits those fields from storage. A WARN log is emitted: 'The following attributes have been dropped/ignored [fieldName]'. The registration appears successful from the PCF perspective (nrf_rate stays normal), but the profile stored in NRF is incomplete. Detection: query NRF WARN logs, not Prometheus metrics.
 [INFO] [LLM]   │  ---
 [INFO] [LLM]   │  nfSetIdList: 3GPP TS 29.510 section 6.1.6.2.2 — correct field name. List of NF Set identifiers the NF belongs to. Type: array of strings. Example: ['pcfSet-A', 'pcfSet-B']. NRF validates and stores this field.nfServiceList: 3GPP TS 29.510 correct field name. Map of NF service instances exposed by this NF. Key: service name. NRF indexes this for service discovery.plmnList: 3GPP TS 29.510 correct field name. List of PLMN IDs the NF serves. Each entry: {mcc, mnc}. NRF validates mcc/mnc against its allowed list.
 [INFO] [LLM]   │  ---
 [INFO] [LLM]   │  scpInfo: vendor-specific field, not in 3GPP TS 29.510 core NF profile schema. Some vendor implementations include this; Oracle NRF may drop it with WARN log. Not safe to auto-fix — could be intentional vendor extension.servingScope: vendor-specific or operator-defined field. Not in 3GPP TS 29.510 standard NF profile. Oracle NRF may silently drop. Requires operator review before removal.lcHSupportInd: vendor-specific indicator field. Not a standard 3GPP TS 29.510 NF Profile field. Oracle NRF silently drops. Likely safe to remove if not used by any consumer NF.
 [INFO] [LLM]   │  ---
 [INFO] [LLM]   │  olcHSupportInd: vendor-specific indicator field. Not a standard 3GPP TS 29.510 NF Profile field. Oracle NRF silently drops. Likely safe to remove if not used by any consumer NF.
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  Analyze the evidence, then call the appropriate tool with correct parameters.
 [INFO] [LLM]   │  Also include in your text response:
 [INFO] [LLM]   │  {
 [INFO] [LLM]   │    "observations": [
 [INFO] [LLM]   │      "<key fact 1 from the evidence>",
 [INFO] [LLM]   │      "<key fact 2>",
 [INFO] [LLM]   │      "<key fact 3 if any>"
 [INFO] [LLM]   │    ],
 [INFO] [LLM]   │    "root_cause": "<one concise sentence>",
 [INFO] [LLM]   │    "confidence": "high" | "medium" | "low"
 [INFO] [LLM]   │  }
 [INFO] [LLM]   └────────────────────────────────────────────────────────────
 [INFO] [LLM]   tokens: prompt=1770  completion=187
 [INFO] [LLM]   < 200 OK  duration=23.0s
 [INFO] [LLM]   ── Tool Call ───────────────────────────────────────
 [INFO] [LLM]   tool → fix_profile_field  args={'wrong_name': 'nfSetIdLists', 'correct_name': 'nfSetIdList'}
 [INFO] [LLM]   ── Observations ────────────────────────────────────
 [INFO] [LLM]   [1] NRF retry rate is within a healthy range (1.3, which is < 3).
 [INFO] [LLM]   [2] The PCF's plmnList contains a PLMN that is in the allowed PLMNs list (510/011).
 [INFO] [LLM]   [3] There is a silent-drop field error for 'nfSetIdLists' as detected from NRF logs, which is a known typo of the correct field name 'nfSetIdList'.
 [INFO] [LLM]   ── Diagnosis ───────────────────────────────────────
 [INFO] [LLM]   root_cause → The NF registration failure is due to a field name typo in the PCF profile ('nfSetIdLists' instead of 'nfSetIdList') that is silently dropped by the NRF.
 [INFO] [LLM]   confidence → high
 [INFO] [LLM]   tool_call  → fix_profile_field({'wrong_name': 'nfSetIdLists', 'correct_name': 'nfSetIdList'})
 [INFO] [LLM]   ────────────────────────────────────────────────────
 [INFO] [GRAPH] → decide  route=execute_tool  tool=fix_profile_field  confidence=high
 [INFO] [GRAPH] → execute_tool  tool=fix_profile_field  args={'wrong_name': 'nfSetIdLists', 'correct_name': 'nfSetIdList'}
 [INFO] [SAFETY] Field 'nfSetIdLists' ✓ confirmed in fixable_typos whitelist
 [INFO] [FIX]   > calling fix_profile_field(**{'wrong_name': 'nfSetIdLists', 'correct_name': 'nfSetIdList'})
 [INFO] [EXECUTE] fix_profile_field completed  duration=0.055s
 [INFO] [GRAPH] → verify_fix  (field fix — verifying PCF profile keys, sleeping 5s...)
 [INFO] [TOOL/PCF] < profile keys: ['load', 'nfType', 'pcfInfo', 'capacity', 'locality', 'nfStatus', 'plmnList', 'nfServices', 'nfSetIdList', 'nfInstanceId', 'ipv4Addresses', 'nfServiceList', 'heartBeatTimer']
 [INFO] [VERIFY] profile has 'nfSetIdList', 'nfSetIdLists' absent  ✓ Field fix confirmed
 [INFO] [AGENT] ===== Run Complete =====
 [INFO] [AGENT] Root cause  : The NF registration failure is due to a field name typo in the PCF profile ('nfSetIdLists' instead of 'nfSetIdList') that is silently dropped by the NRF.
 [INFO] [AGENT] Fix action  : fix_field:nfSetIdLists:nfSetIdList
 [INFO] [AGENT] Confidence  : high
 [INFO] [AGENT] Fix applied : True
 [INFO] [AGENT] Fix verified: True
 [INFO] [AGENT] ==========================
 [INFO] [AUDIT] Incident saved → /data/incidents.jsonl
 [INFO] [KAFKA] Committed offset=24 partition=0
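Detection and verification for this scenario can be sketched in two small functions. The regular expression targets the NRF message format quoted in the knowledge-base chunk ('... dropped/ignored [fieldName]'); treating the bracket contents as a comma-separated list is an assumption, and both function names are hypothetical.

```python
import re

# Message format quoted in the knowledge-base chunk in the run above.
DROPPED_RE = re.compile(r"dropped/ignored \[([^\]]*)\]")


def dropped_fields(nrf_warn_lines: list[str]) -> list[str]:
    """Collect field names that NRF reported as dropped, in first-seen order."""
    found: list[str] = []
    for line in nrf_warn_lines:
        for group in DROPPED_RE.findall(line):
            # Assumption: several fields would be comma-separated.
            for name in (part.strip() for part in group.split(",")):
                if name and name not in found:
                    found.append(name)
    return found


def field_fix_verified(profile_keys: list[str], wrong: str, correct: str) -> bool:
    """The fix holds when the correct key is present and the typo is gone."""
    return correct in profile_keys and wrong not in profile_keys
```

Checking profile keys rather than the retry rate matters here because the retry rate stays normal throughout a silent drop.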

Scenario 3 — Unknown Field (safety gate blocks → Slack escalation)

Field name is not in the fixable_typos whitelist. Even if the LLM infers a correction using 3GPP training knowledge, execute_tool rejects the call at the code layer and escalates to Slack.

[GRAPH]  → fetch_nrf_logs  ⚠ unknown dropped field: 'nfSetIdListxxx' (not in fixable_typos)
[GRAPH]  → rag_lookup  skipped — unknown fields ['nfSetIdListxxx'] — no RAG needed
[GRAPH]  → analyze  (mode=llm)
[LLM]    ── Diagnosis ───────────────────────────────────────
[LLM]    root_cause → Field 'nfSetIdListxxx' is causing NRF to silently drop it.
[LLM]    confidence → high
[LLM]    tool_call  → fix_profile_field({'wrong_name': 'nfSetIdListxxx', 'correct_name': 'nfSetIdList'})
[LLM]    ────────────────────────────────────────────────────

[GRAPH]  → decide  route=execute_tool  tool=fix_profile_field  confidence=high
[GRAPH]  → execute_tool  tool=fix_profile_field
[SAFETY] Field 'nfSetIdListxxx' not in fixable_typos whitelist — refusing.
[SAFETY] Approved: ['ipv4Address', 'nfServiceLists', 'nfSetIdLists', 'plmnLists']
[GRAPH]  → notify  (human escalation — no applicable auto-fix for this fault)
[ESCALATION] Alert: NfRegistrationFailure | ns=occnp2
[ESCALATION] Root cause: Field 'nfSetIdListxxx' dropped by NRF — not in approved fixable list
[ESCALATION] → Human operator intervention required
[NOTIFY] Slack notified  status=200
[AUDIT]  Incident saved → /data/incidents.jsonl

[Figure: Slack escalation alert]


Audit Log (incidents.jsonl)

Every agent run appends one structured JSON record containing the full decision trail. The file can be queried line by line, and the same schema maps to DynamoDB or OCI NoSQL in production.

{"ts":"2026-05-18T03:22:47","alert_name":"NfRegistrationFailure","namespace":"5gpcf","root_cause":"PCF plmnList contains invalid PLMN (510/088), not in NRF allowed list.","fix_action":"update_plmn:510:011","confidence":"high","fix_applied":true,"fix_verified":true,"error":null}
{"ts":"2026-05-18T04:09:22","alert_name":"NfRegistrationFailure","namespace":"5gpcf","root_cause":"Field 'nfSetIdLists' silently dropped by NRF — typo of 'nfSetIdList'.","fix_action":"fix_field:nfSetIdLists:nfSetIdList","confidence":"high","fix_applied":true,"fix_verified":true,"error":null}
{"ts":"2026-05-18T11:08:06","alert_name":"NfRegistrationFailure","namespace":"5gpcf","root_cause":"PCF plmnList contains invalid PLMN (510/089), not accepted by NRF.","fix_action":"update_plmn:510:011","confidence":"high","fix_applied":true,"fix_verified":true,"error":null}
{"ts":"2026-05-18T13:23:13","alert_name":"NfRegistrationFailure","namespace":"5gpcf","root_cause":"Field 'nfSetIdLists' incorrectly named, silently dropped by NRF.","fix_action":"fix_field:nfSetIdLists:nfSetIdList","confidence":"high","fix_applied":true,"fix_verified":true,"error":null}
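Because each record is one JSON object per line, the log can be summarized with a few lines of code. The sketch below uses the field names from the records above; `to_jsonl` and `summarize` are hypothetical helpers, not functions from the repository.

```python
import json
from collections import Counter
from typing import Iterable


def to_jsonl(record: dict) -> str:
    """Serialize one incident as a single line ready to append."""
    return json.dumps(record, ensure_ascii=False) + "\n"


def summarize(lines: Iterable[str]) -> dict:
    """Count runs by fix type and report how many fixes were verified."""
    by_action: Counter = Counter()
    total = verified = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        total += 1
        # fix_action looks like "update_plmn:510:011"; keep the prefix.
        by_action[(rec.get("fix_action") or "none").split(":")[0]] += 1
        verified += bool(rec.get("fix_verified"))
    return {"total": total, "verified": verified, "by_action": dict(by_action)}
```

An open file handle is an iterable of lines, so `summarize(open("/data/incidents.jsonl", encoding="utf-8"))` would work directly on the audit log.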
