AIOps Agent — 5G NF Intelligent Operations

An autonomous AIOps agent that monitors 5G core network NF registration failures, localizes root causes, applies configuration fixes, and verifies recovery — end-to-end in minutes.

Stack: LangGraph · qwen-max (function calling) / OCI Generative AI · Alibaba Bailian RAG / OCI OpenSearch RAG · Kafka · Redis · Kubernetes · Prometheus · Elasticsearch

Background

Modern 5G core networks generate thousands of alerts per day across dozens of network functions. Traditional NOC operations rely on manual triage — engineers read logs, cross-reference metrics, and apply fixes by hand. This process is slow, error-prone, and does not scale as network complexity grows.

This agent was built to augment NOC operations for a 5G core network deployment. It addresses three pain points:

Reactive to proactive: correlates Prometheus metrics + Elasticsearch logs + NF config state to diagnose faults before they impact subscribers
Human-in-the-loop where it matters: auto-fixes high-confidence, bounded faults (config errors, field typos, PLMN mismatches); escalates ambiguous or out-of-scope faults to on-call engineers with a structured diagnosis
Predictive analysis foundation: the same pipeline — alert → evidence collection → LLM reasoning → action — can be extended to anomaly forecasting and capacity planning

Operations Efficiency

Workflow Comparison

Performance Benchmark

Alert triage breakdown (20-rule PCF ruleset):

Auto-fix (no human needed)   ███████░░░░░░░░░░░░░  35%   (7 / 20 alert types)
Assisted diagnosis + notify  ██████████░░░░░░░░░░  50%   (10 / 20 alert types)
Escalate immediately         ███░░░░░░░░░░░░░░░░░  15%   (3 / 20 alert types)

Architecture

AlertManager ──► Kafka (aiops-alerts topic)
                     │
                     ▼
              aiops-worker (Kafka Consumer)
              ├── Redis dedup lock (SET NX EX 300)
              └── Semaphore(3) ── LangGraph Agent
                                      │
                     ┌────────────────┼─────────────────────┐
                     ▼                ▼                     ▼
               fetch_logs      fetch_metrics        fetch_nrf_logs
                     └────────────────┼─────────────────────┘
                                      ▼
                                  rag_lookup  ──► Vector Knowledge Base
                                      ▼
                                   analyze   ──► LLM (qwen-max, function calling)
                                      │            bind_tools → tool_calls
                                      ▼
                                   decide()  [deterministic routing on tool_call_name]
                          ┌──────────┴──────────────────┐
                          ▼                             ▼
                    execute_tool                      notify ──► Slack
                    ├── PLMN whitelist gate
                    ├── fixable_typos whitelist gate
                    └── pcf_update_plmn / pcf_fix_field
                          │
                          ▼
                      verify_fix  (Prometheus rate / field key check)
                          │
                          ▼
                    incidents.jsonl  (audit log)

Webhook flow: AlertManager → aiops-webhook (FastAPI, Kafka producer) → returns 200 immediately → Kafka consumer processes the alert asynchronously

Safety design: Two independent whitelists in execute_tool enforce boundaries regardless of LLM output — PLMN whitelist for update_pcf_plmn, fixable_typos whitelist for fix_profile_field. Unauthorized calls are rejected and escalated to Slack.
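
The gate logic described above can be sketched roughly as follows. This is a minimal illustration rather than the repository's code: the whitelist contents are copied from the sample logs in this README, while `check_tool_call` and `GateResult` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass

# Values copied from the sample runs in this README. Loading them from
# configuration instead of hardcoding is an assumption about real deployments.
ALLOWED_PLMNS = {("510", "011"), ("208", "93"), ("001", "01")}
FIXABLE_TYPOS = {"ipv4Address", "nfServiceLists", "nfSetIdLists", "plmnLists"}


@dataclass(frozen=True)
class GateResult:
    allowed: bool
    reason: str


def check_tool_call(tool_name: str, args: dict) -> GateResult:
    """Hypothetical gate: decide whether an LLM-proposed tool call may run."""
    if tool_name == "update_pcf_plmn":
        mcc, mnc = str(args.get("mcc")), str(args.get("mnc"))
        if (mcc, mnc) in ALLOWED_PLMNS:
            return GateResult(True, f"PLMN {mcc}/{mnc} in whitelist")
        return GateResult(False, f"PLMN {mcc}/{mnc} not in whitelist")
    if tool_name == "fix_profile_field":
        wrong = args.get("wrong_name")
        if wrong in FIXABLE_TYPOS:
            return GateResult(True, f"field {wrong!r} in fixable_typos")
        return GateResult(False, f"field {wrong!r} not in fixable_typos")
    # A tool without a gate is rejected by default (fail closed).
    return GateResult(False, f"no gate defined for tool {tool_name!r}")
```

Because the check depends only on the tool name and its arguments, a rejected call can be escalated without consulting the LLM again.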

Observability

Real-time metrics from a load-test run (29 alerts, 5 concurrent scenarios).

Key results: 69% auto-fixed · 17% escalated to human · 100% fix verification rate
Latency: end-to-end p50 ~30 s · LLM call p50 ~5.4 s (qwen-max)

Metric	Description
`aiops_alerts_processed_total`	Alert outcomes by type: `auto_fixed`, `escalated`, `no_action`
`aiops_alert_duration_seconds`	End-to-end processing latency per alert
`aiops_fix_verified_total`	PCF config fix verification pass/fail rate
`aiops_safety_gate_rejected_total`	Blocked LLM calls outside whitelist
`aiops_llm_duration_seconds` / `aiops_llm_tokens_total`	LLM latency and token usage
`aiops_rag_duration_seconds` / `aiops_rag_chunks_returned`	RAG knowledge base query performance

Metrics are scraped from two endpoints, webhook /metrics on port 8000 and worker /metrics on port 9100, via a Prometheus ServiceMonitor.

Fault Scenarios

The agent covers the full PCF alert ruleset. Actions fall into three tiers:

Auto-fix — agent applies the fix autonomously, verifies, and closes the incident
Assisted — agent diagnoses root cause, correlates evidence across sources, notifies on-call with a structured report
Escalate — agent detects the fault but scope is outside safe auto-fix boundary; pages immediately with context

#	Scenario	Severity	Detection	Agent Action	MTTR
1	NF Registration — PLMN Mismatch	Critical	Prometheus retry rate > 3/2min	Auto-fix: correct `plmnList` via PCF REST API, verify retry rate drops	~2 min
2	NF Registration — Silent Field Drop	Critical	NRF WARN logs in ES (field rejected)	Auto-fix: rename typo field via PCF REST API, confirm NRF accepts	~30 sec
3	NF Registration — Unknown Field	Critical	NRF WARN logs in ES (unrecognized attr)	Escalate: Slack alert with field name; outside auto-fix whitelist	immediate
4	Policy Control Service Down	Critical	Service health metric = 0	Assisted: identify crashed pod, correlate with OOM/crash logs, recommend restart sequence	~3 min
5	Session Management High Error Rate	Critical	SM ingress error rate > 10% (24h)	Assisted: cross-correlate with UDR/CHF errors to isolate root NF, structured Slack report	~5 min
6	Session Management Traffic Overload	Major	SM request rate > 90% max MPS	Assisted: confirm burst pattern, recommend HPA scale-out, notify capacity team	~2 min
7	Diameter Connector High Error Rate	Critical	Diameter ingress error rate > 10%	Assisted: check SCP peer health, correlate Diameter + SCP alerts, escalate with correlation summary	~4 min
8	Diameter Connector Traffic Saturation	Major	Diameter request rate > 90% max MPS	Assisted: traffic surge detected, identify source NF, recommend load balancing review	~2 min
9	UDR Connectivity Timeout Spike	Major	UDR timeout rate > 10% of requests	Assisted: query ES for timeout patterns and duration trends, suggest connection pool or retry tuning	~5 min
10	UDR Service High Error Rate	Critical	UDR egress error rate > 10% (24h)	Assisted: correlate with DB tier health, determine whether UDR or underlying DB is root cause	~5 min
11	CHF Connectivity Timeout Spike	Major	CHF timeout rate > 10% of requests	Assisted: analyze charging server response trends, notify billing ops with pattern data	~5 min
12	CHF Service High Error Rate	Critical	CHF egress error rate > 10% (24h)	Assisted: spending limit control failure analysis, notify billing team with impact estimate	~4 min
13	Policy Datastore High Error Rate	Critical	PolicyDS ingress error rate > 10%	Assisted: correlate with DB tier alert, determine if DB or policy engine is source	~5 min
14	Policy Database Tier Unreachable	Critical	DB health indicator = 0	Assisted: ES log analysis for crash reason (OOM/disk/network), trigger restart runbook via notify	~8 min
15	Pod CPU Congestion	Critical	Pod CPU congestion state = congested	Auto-fix: trigger HPA scale-out, mark congested pod, notify if congestion persists post-scale	~2 min
16	Pod Memory Congestion	Critical	Pod memory congestion state = congested	Assisted: identify memory leak signature in logs, recommend pod restart or JVM tuning	~3 min
17	Request Queue Congestion	Critical	Pending request queue state = congested	Assisted: identify bottleneck service, correlate with CPU/memory alerts, recommend scale path	~4 min
18	Egress Peer Unreachable	Major	SCP peer health status ≠ 0	Assisted: diagnose peer connectivity, identify if peer or network path is down, suggest failover	~3 min
19	All Egress Peers in Peer-Set Down	Critical	Peer available count = 0 across peer-set	Escalate: total egress path failure, page on-call immediately with peer-set name and last-seen time	immediate
20	SMSC Connection Loss	Major	Active SMSC connection count = 0 for 10 min	Assisted: check SMSC logs for disconnect reason, attempt reconnect trigger, notify messaging ops	~5 min

Tech Stack Decisions

Component	Choice	Why
Agent framework	LangGraph	Explicit node graph + conditional edges; LLM controls diagnosis only, not routing
LLM	qwen-max / OCI Generative AI (Llama 3)	OpenAI-compatible API; pluggable, swap endpoint via env var
RAG	OCI OpenSearch vector index / managed KB	Provides 3GPP field name context for silent-drop detection; swappable backend
Message queue	Kafka KRaft	Decouples AlertManager from agent; survives burst alerts; enables multi-consumer patterns
Dedup lock	Redis `SET NX EX 300`	Distributed across potential multi-pod deployments; prevents duplicate runs on same alert
Observability	Elasticsearch + Prometheus	PCF/NRF logs go to ES, metrics to Prometheus
Config fix	PCF REST API (3GPP SBA)	Direct NF control plane — no manual SSH, auditable, reversible

Prompt Engineering

Four-layer separation of concerns:

SYSTEM_PROMPT   — role definition + tool catalog (4 tools, stable)
ANALYSIS_PROMPT — parameterized evidence template ({allowed_plmns}, {field_errors}, etc.)
RAG chunks      — 3GPP field knowledge (retrieved at runtime, not hardcoded)
decide()        — deterministic routing on tool_call_name in Python (not LLM)
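
To illustrate the second and third layers, here is a small sketch of how a parameterized evidence template might be filled at runtime. The template text is a shortened, hypothetical version modeled on the prompt shown in the Scenario 2 log. Only `{allowed_plmns}` and `{field_errors}` are placeholder names taken from this README; the other placeholders and the `build_analysis_prompt` helper are assumptions.

```python
# Hypothetical, shortened template modeled on the Scenario 2 log.
# Placeholders other than {allowed_plmns} and {field_errors} are assumptions.
ANALYSIS_PROMPT = (
    "Alert: {alert_name} | namespace: {namespace}\n"
    "NRF retry rate (2min): {retry_rate}\n"
    "PCF plmnList: {plmn_list}\n"
    "Allowed PLMNs: {allowed_plmns}\n"
    "Silent-drop field errors: {field_errors}\n"
    "=== KNOWLEDGE BASE ===\n"
    "{rag_chunks}\n"
)


def build_analysis_prompt(evidence: dict, rag_chunks: list[str]) -> str:
    """Fill the template; RAG chunks are joined at runtime, not hardcoded."""
    return ANALYSIS_PROMPT.format(
        alert_name=evidence["alert_name"],
        namespace=evidence["namespace"],
        retry_rate=evidence["retry_rate"],
        plmn_list=evidence["plmn_list"],
        allowed_plmns=",".join(evidence["allowed_plmns"]),
        field_errors=evidence["field_errors"] or "none",
        rag_chunks="\n---\n".join(rag_chunks) or "(none retrieved)",
    )
```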

Function calling flow: LLM receives tool schemas via llm.bind_tools(TOOLS) and responds with a structured tool_calls object — no string parsing. execute_tool dispatches on tool_call_name and enforces whitelists independently of LLM reasoning.

LLM role  →  diagnosis + intent expression (tool_calls)
Code role →  routing (decide) + enforcement (execute_tool whitelists) + execution (PCF API)
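
A deterministic router of this kind can be very small. The sketch below is an assumption about its shape, not the repository's `decide()`: the state field names are hypothetical, and since this README does not show how `no_action` is routed, the example simply sends every non-fix tool to `notify`.

```python
from typing import Optional, TypedDict


class AgentState(TypedDict, total=False):
    # Hypothetical subset of the graph state; field names are assumptions.
    tool_call_name: Optional[str]
    tool_call_args: dict


# Tools that change PCF configuration go to execute_tool. Everything else,
# including a missing or unrecognized tool call, goes to notify.
FIX_TOOLS = {"update_pcf_plmn", "fix_profile_field"}


def decide(state: AgentState) -> str:
    """Return the next node name based only on tool_call_name."""
    if state.get("tool_call_name") in FIX_TOOLS:
        return "execute_tool"
    return "notify"
```

In LangGraph, a function like this would typically be registered through `add_conditional_edges`. Because it is plain Python with no model call, it can be covered by ordinary unit tests.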

Repository Structure

aiops-agent/
├── agent/
│   ├── graph.py          # LangGraph nodes + edges + decide() routing
│   ├── prompts.py        # SYSTEM_PROMPT + ANALYSIS_PROMPT
│   ├── state.py          # TypedDict state definition
│   ├── worker.py         # Kafka consumer + Redis dedup + Semaphore
│   └── log_fmt.py        # Structured log formatter
├── tools/
│   ├── es_tool.py         # Elasticsearch log query
│   ├── prometheus_tool.py # Prometheus metrics query
│   ├── pcf_tool.py        # PCF config read/write (3GPP SBA REST)
│   ├── rag_tool.py        # RAG retrieval (OCI OpenSearch / Alibaba Bailian)
│   └── tool_registry.py   # LangChain @tool definitions + TOOLS (bind) + TOOL_MAP (dispatch)
├── webhook/
│   └── server.py         # FastAPI: AlertManager → Kafka producer
├── k8s/
│   ├── aiops/            # Agent microservice manifests
│   │   ├── configmap.yaml
│   │   ├── webhook.yaml  # Deployment + ClusterIP Service
│   │   └── worker.yaml   # Deployment
│   ├── kafka/            # Kafka KRaft StatefulSet
│   ├── redis/            # Redis Deployment
│   └── alertmanager/     # AlertmanagerConfig CR + PrometheusRule
├── Dockerfile
└── requirements.txt

Quick Start

Prerequisites

Kubernetes cluster with aiops namespace
Kafka + Redis deployed in aiops namespace (see k8s/kafka/ and k8s/redis/)
Prometheus + Elasticsearch + 5G NF accessible from cluster nodes

1. Configure secrets

# Copy and fill in your credentials (do NOT commit this file)
cp k8s/aiops/secret.yaml.example k8s/aiops/secret.yaml
kubectl apply -f k8s/aiops/secret.yaml

2. Deploy to K8s

kubectl apply -f k8s/aiops/configmap.yaml
kubectl apply -f k8s/aiops/webhook.yaml
kubectl apply -f k8s/aiops/worker.yaml

# AlertManager routing
kubectl apply -f k8s/alertmanager/aiops-webhook.yaml

3. Verify

kubectl get pods -n aiops
kubectl logs -n aiops deploy/aiops-webhook
kubectl logs -n aiops deploy/aiops-worker -f

4. Trigger a fault (demo)

curl -X POST http://<alertmanager-host>/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"NfRegistrationFailure","namespace":"<nf-namespace>","NfType":"PCF"},"status":"firing"}]'

Cloud Migration Path

Self-hosted	OCI	AWS
Kubernetes (Kubespray)	OKE	EKS
Elasticsearch	OCI OpenSearch	Amazon OpenSearch
RAG knowledge base	OCI Generative AI + OpenSearch vector	Bedrock Knowledge Bases
LLM (qwen-max / Llama)	OCI Generative AI Service	Amazon Bedrock
incidents.jsonl	OCI NoSQL Database	DynamoDB
FastAPI webhook	OCI API Gateway + Functions	API Gateway + Lambda
Slack notify	OCI Notifications	SNS

Key Engineering Trade-offs

Why Kafka instead of direct webhook → agent?
Decouples ingestion from processing. AlertManager gets an immediate 200 response. The worker controls concurrency (Semaphore(3)) and deduplication (Redis). On restart, uncommitted offsets are re-consumed.
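
The concurrency cap can be sketched with the standard library alone. In the example below, `BoundedAgentRunner` is a hypothetical stand-in for the worker: the Kafka consumer loop and the LangGraph run are replaced by a thread pool and a short sleep, so only the Semaphore(3) behavior is illustrated.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BoundedAgentRunner:
    """Hypothetical worker core: at most max_concurrent agent runs at once."""

    def __init__(self, max_concurrent: int = 3):
        self._slots = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0  # highest number of overlapping runs observed

    def run(self, alert: dict) -> str:
        with self._slots:  # blocks while max_concurrent runs are in flight
            with self._lock:
                self._active += 1
                self.peak_active = max(self.peak_active, self._active)
            try:
                time.sleep(0.01)  # placeholder for evidence fetch + LLM call
            finally:
                with self._lock:
                    self._active -= 1
        return f"processed {alert['id']}"

    def consume(self, alerts: list[dict]) -> list[str]:
        # More threads than slots, so the semaphore is what limits overlap.
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(self.run, alerts))
```

A burst of ten alerts is accepted immediately, but no more than three runs overlap at any time.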

Why Redis for dedup instead of in-memory set?
In-memory state is lost on pod restart. Redis survives worker restarts and works across multiple worker replicas.
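
The dedup lock relies on the atomic `SET key value NX EX 300` semantics. The sketch below uses a small in-memory stand-in (`FakeRedis`, hypothetical) that mimics the `nx` and `ex` arguments of redis-py's `set()` so that it runs without a server. The key format in `dedup_key` is also an assumption; the README does not state how alerts are fingerprinted.

```python
import time
from typing import Optional


class FakeRedis:
    """In-memory stand-in mimicking nx/ex behavior of redis-py's set()."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data: dict[str, tuple[str, float]] = {}

    def set(self, name: str, value: str, ex: Optional[int] = None, nx: bool = False):
        now = self._clock()
        entry = self._data.get(name)
        if entry is not None and entry[1] <= now:
            entry = None  # key has expired
        if nx and entry is not None:
            return None  # redis-py returns None when NX prevents the write
        expires = now + ex if ex is not None else float("inf")
        self._data[name] = (value, expires)
        return True


def dedup_key(alert: dict) -> str:
    """Hypothetical fingerprint: alert name plus namespace."""
    labels = alert["labels"]
    return f"aiops:dedup:{labels['alertname']}:{labels['namespace']}"


def try_acquire(client, alert: dict, ttl_seconds: int = 300) -> bool:
    """True only for the first caller to claim the alert within the TTL."""
    return bool(client.set(dedup_key(alert), "1", nx=True, ex=ttl_seconds))
```

With a real Redis client passed as `client`, the same call would hold across worker replicas, which an in-process set cannot do.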

Why deterministic decide() instead of LLM routing?
LLM handles ambiguous log interpretation. Routing logic (PLMN whitelist check, confidence threshold, fix boundary) is code — testable, auditable, zero hallucination risk.

Why separate webhook + worker pods?
Different resource profiles: webhook is lightweight (Kafka producer only), worker is CPU/memory intensive (LLM calls, concurrent threads). Scale independently.

Deployment — `aiops` Namespace

$ kubectl get pods -n aiops
NAME                            READY   STATUS    RESTARTS   AGE
aiops-webhook-869978964-klp7l   1/1     Running   0          9h
aiops-worker-7cd48b7c78-c72n6   1/1     Running   0          6s
aiops-worker-7cd48b7c78-mb2v9   1/1     Running   0          9h
aiops-worker-7cd48b7c78-mbbwk   1/1     Running   0          6s
aiops-worker-7cd48b7c78-rhvl9   1/1     Running   0          6s
aiops-worker-7cd48b7c78-z6prj   1/1     Running   0          6s
kafka-0                         1/1     Running   0          16h
redis-845d787d54-fdgtm          1/1     Running   0          17h

$ kubectl get svc -n aiops
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
aiops-webhook    ClusterIP   10.233.61.70    <none>        8000/TCP          97m
kafka            NodePort    10.233.43.141   <none>        9092:30092/TCP    8h
kafka-headless   ClusterIP   None            <none>        9092/TCP,9093/TCP 8h
redis            NodePort    10.233.44.168   <none>        6379:30379/TCP    9h

$ kubectl get deployment -n aiops
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
aiops-webhook   1/1     1            1           97m
aiops-worker    1/1     1            1           97m
redis           1/1     1            1           9h

Sample Agent Runs

Scenario 1 — PLMN Mismatch (auto-fix in ~2 min)

Alert fires → agent fetches evidence → LLM calls update_pcf_plmn via function calling → PLMN whitelist gate passes → PCF REST API fix → Prometheus rate verified.

[ALERT]  NfRegistrationFailure | ns=occnp2 | NfType=PCF | status=firing
[AGENT]  Starting LangGraph agent for occnp2...

[GRAPH]  → fetch_logs      PCF ERROR/WARN logs retrieved from Elasticsearch
[GRAPH]  → fetch_metrics   nrf_retry_rate=10.7/2min  (threshold > 3)
[GRAPH]  → fetch_nrf_logs  no dropped fields detected
[GRAPH]  → rag_lookup      skipped — no dropped fields detected
[GRAPH]  → analyze         (mode=llm)

[LLM]    > POST /compatible-mode/v1/chat/completions  model=qwen-max
[LLM]    ┌── Evidence sent to LLM ─────────────────────────────────────
[LLM]    │  NRF retry rate: 10.7  [active failure loop > 10]
[LLM]    │  PCF plmnList: [{'mcc': '510', 'mnc': '088'}]
[LLM]    │  Allowed PLMNs: 510/011, 208/93, 001/01
[LLM]    │  Silent-drop field errors: none
[LLM]    │  PCF logs: WARN nrf-client: PLMN 510/088 not in allowed list
[LLM]    └────────────────────────────────────────────────────────────
[LLM]    < 200 OK  duration=5.1s
[LLM]    ── Observations ────────────────────────────────────
[LLM]    [1] NRF retry rate 10.7 confirms active registration failure loop.
[LLM]    [2] PCF plmnList 510/088 is not in the allowed PLMN list.
[LLM]    [3] No silent-drop field errors detected.
[LLM]    ── Diagnosis ───────────────────────────────────────
[LLM]    root_cause → PCF plmnList contains PLMN 510/088 not accepted by NRF.
[LLM]    confidence → high
[LLM]    tool_call  → update_pcf_plmn({'mcc': '510', 'mnc': '011'})
[LLM]    ────────────────────────────────────────────────────

[GRAPH]  → decide  route=execute_tool  tool=update_pcf_plmn  confidence=high
[GRAPH]  → execute_tool  tool=update_pcf_plmn  args={'mcc': '510', 'mnc': '011'}
[SAFETY] PLMN 510/011 ✓ confirmed in whitelist
[EXECUTE] > PUT /nrf-client-nfmanagement/nfProfileList
[EXECUTE] > plmnList=[{'mcc':'510','mnc':'011'}]  (was [{'mcc':'510','mnc':'088'}])
[PCF]    < 200 OK  duration=0.048s

[GRAPH]  → verify_fix  attempt 1/3  (sleeping 30s)
[VERIFY] nrf_retry_rate=6.3 >= 3  still recovering — retry
[GRAPH]  → verify_fix  attempt 2/3  (sleeping 60s)
[VERIFY] nrf_retry_rate=0.0 < 3  ✓ registration restored

[AGENT]  ===== Run Complete =====
[AGENT]  Root cause  : PCF plmnList contains PLMN 510/088 not accepted by NRF.
[AGENT]  Fix applied : True  |  Fix verified: True
[AUDIT]  Incident saved → /data/incidents.jsonl
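
The verification loop in this run (up to three attempts, checking the retry rate against the threshold of 3) could look roughly like the sketch below. The 30 s and 60 s delays come from the log above; the third delay, the function signature, and the injectable `sleep` are assumptions made so the example is testable.

```python
import time
from typing import Callable, Sequence

RETRY_THRESHOLD = 3.0  # retries per 2 min, matching the alert rule


def verify_fix(
    fetch_retry_rate: Callable[[], float],
    delays: Sequence[float] = (30, 60, 120),  # third delay is an assumption
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait, re-read the retry rate, and succeed once it is below threshold."""
    for delay in delays:
        sleep(delay)
        if fetch_retry_rate() < RETRY_THRESHOLD:
            return True
    return False  # still failing after the last attempt
```

Increasing delays give the rate window time to drain after the fix, which is why the first reading in the log (6.3) is still above the threshold.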

Scenario 2 — Silent-Drop Field Typo (auto-fix in ~30 sec)

NRF silently drops a misspelled field; PCF thinks registration succeeded (HTTP 200) but NF profile is incomplete. Agent detects via NRF WARN logs + RAG lookup, fixes via PCF REST API, verifies by checking profile keys.

 [INFO] [WORKER] Consumed offset=23 partition=0
 [INFO] [ALERT] NfRegistrationFailure | ns=occnp2 | NfType=PCF | status=firing
 [INFO] [AGENT] Starting LangGraph agent for occnp2...
 [INFO] [GRAPH] → fetch_logs
 [INFO] [TOOL/ES]  > GET http://172.16.100.91:30200/k8s-2026.05.27/_search
 [INFO] [TOOL/ES]  > filter: namespace=occnp2  level IN [ERROR,WARN]  @timestamp >= now-5m  (since 05:44:48Z)
 [INFO] [TOOL/ES]  < 200 OK  duration=0.782s  returned=46 lines  (ERROR=0, WARN=46)
 [INFO] [GRAPH] → fetch_metrics
 [INFO] [TOOL/PM]  > GET http://172.16.100.91:30504/api/v1/query
 [INFO] [TOOL/PM]  > query: increase(occnp_nrfclient_nw_conn_out_request_total{MessageType="AutonomousNfRegistration",namespace="occnp2"}[2m])
 [INFO] [TOOL/PM]  < 200 OK  duration=0.010s
 [INFO] [TOOL/PM]    nrf_retry_rate=1.3/2min  (threshold=3, ✓ normal)
 [INFO] [TOOL/PM]    pcf_local_status=PCF_LOCAL_REGISTERED
 [INFO] [TOOL/PCF] > GET http://172.16.100.231:8000/PCF/nf-common-component/v1/nrf-client-nfmanagement/nfProfileList
 [INFO] [TOOL/PCF] < 200 OK  duration=0.005s
 [INFO] [TOOL/PCF]   nfStatus=REGISTERED  plmnList=[{'mcc': '510', 'mnc': '011'}]
 [INFO] [TOOL/PCF]   plmn 510/011 → ✓ in NRF allowed list
 [INFO] [GRAPH] → fetch_nrf_logs
 [INFO] [TOOL/ES]  > GET http://172.16.100.91:30200/k8s-2026.05.27/_search
 [INFO] [TOOL/ES]  > filter: namespace=ocnrf1  level IN [ERROR,WARN]  @timestamp >= now-5m  (since 05:44:49Z)
 [INFO] [TOOL/ES]  < 200 OK  duration=0.006s  returned=1 lines  (WARN=1)
 [INFO] [TOOL/ES]  ⚠ PCF field typo detected (auto-fixable): 'nfSetIdLists'
 [INFO] [TOOL/ES]    [2026-05-27T05:49:25] WARN ocnrf-nfregistration-b5bb7fcdd-gbmpg [requesterNfType=PCF]: Allow only VendorSpecific attributes. Value of enableF5 is true and acceptAdditionalAttributes is false. The foll
 [INFO] [GRAPH] → rag_lookup  (querying knowledge base for ['nfSetIdLists'])
 [INFO] [LLM]   > RAG query: field name 'nfSetIdLists'
 [INFO] [LLM]   < RAG returned 5 chunk(s)
 [INFO] [LLM]     · priority: 3GPP TS 29.510 correct field name. Integer 0-65535. Lower value = higher priority for NF s...
 [INFO] [LLM]     · Oracle NRF silent drop: when NRF receives an NF Profile PUT/PATCH request with unknown or non-standa...
 [INFO] [LLM]     · nfSetIdList: 3GPP TS 29.510 section 6.1.6.2.2 — correct field name. List of NF Set identifiers the N...
 [INFO] [LLM]     · scpInfo: vendor-specific field, not in 3GPP TS 29.510 core NF profile schema. Some vendor implementa...
 [INFO] [LLM]     · olcHSupportInd: vendor-specific indicator field. Not a standard 3GPP TS 29.510 NF Profile field. Ora...
 [INFO] [GRAPH] → analyze  (mode=llm)
 [INFO] [LLM]   > POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions  model=qwen-max  temperature=0  tools=['update_pcf_plmn', 'fix_profile_field', 'notify_only', 'no_action']
 [INFO] [LLM]   ┌── SYSTEM PROMPT ────────────────────────────────────────────
 [INFO] [LLM]   │  You are a 5G core network SRE agent specializing in NF registration diagnostics.
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  Your job: read the evidence provided and output a structured diagnosis, then call the appropriate tool.
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  Output:
 [INFO] [LLM]   │  - root_cause: one concise sentence describing the confirmed fault
 [INFO] [LLM]   │  - confidence:
 [INFO] [LLM]   │      "high"   — evidence is unambiguous, single clear cause
 [INFO] [LLM]   │      "medium" — most likely cause but minor uncertainty remains
 [INFO] [LLM]   │      "low"    — signals conflict or insufficient evidence to determine root cause
 [INFO] [LLM]   │  - Call exactly one tool to express your decision:
 [INFO] [LLM]   │      update_pcf_plmn(mcc, mnc)             — PCF plmnList contains a PLMN not accepted by NRF; correct value is mcc/mnc
 [INFO] [LLM]   │      fix_profile_field(wrong_name, correct_name) — field name typo in PCF profile dropped silently by NRF
 [INFO] [LLM]   │      notify_only(reason)                   — fault detected but outside auto-fix scope; human intervention required
 [INFO] [LLM]   │      no_action(reason)                     — system is healthy; alert is stale or self-resolved
 [INFO] [LLM]   ├── USER MESSAGE (ANALYSIS_PROMPT filled) ────────────────
 [INFO] [LLM]   │  Analyze this NF registration incident. Think step by step.
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  Alert: NfRegistrationFailure | namespace: occnp2
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  === EVIDENCE ===
 [INFO] [LLM]   │  NRF retry rate (2min): 1.3  [healthy < 3, active failure loop > 10]
 [INFO] [LLM]   │  PCF local status: PCF_LOCAL_REGISTERED
 [INFO] [LLM]   │  PCF plmnList: [{'mcc': '510', 'mnc': '011'}]
 [INFO] [LLM]   │  Allowed PLMNs: 510/011,208/93,001/01,001/001,505/02
 [INFO] [LLM]   │  Silent-drop field errors (auto-detected from NRF logs): ['nfSetIdLists']
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  === PCF LOGS (ERROR/WARN) ===
 [INFO] [LLM]   │  [2026-05-27T05:49:25] WARN occnp-occnp-nrf-client-nfmanagement-6c966f7847-jh94h: Performance data not available for npcf
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  === NRF LOGS (WARN) ===
 [INFO] [LLM]   │  [2026-05-27T05:49:25] WARN ocnrf-nfregistration-b5bb7fcdd-gbmpg [requesterNfType=PCF]: Allow only VendorSpecific attribu
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  === KNOWLEDGE BASE (retrieved for this incident) ===
 [INFO] [LLM]   │  priority: 3GPP TS 29.510 correct field name. Integer 0-65535. Lower value = higher priority for NF selection.nfSetIdLists: INVALID — not a 3GPP TS 29.510 field. Common typo: extra 's' appended to nfSetIdList. Oracle NRF behavior: returns HTTP 200 but silently drops this field. NRF WARN log: 'The following attributes have been dropped/ignored [nfSetIdLists]'. Business impact: downstream NFs cannot discover this PCF by NF Set membership. Fix: rename field to nfSetIdList (remove trailing 's').
 [INFO] [LLM]   │  ---
 [INFO] [LLM]   │  Oracle NRF silent drop: when NRF receives an NF Profile PUT/PATCH request with unknown or non-standard fields, it returns HTTP 200 OK but omits those fields from storage. A WARN log is emitted: 'The following attributes have been dropped/ignored [fieldName]'. The registration appears successful from the PCF perspective (nrf_rate stays normal), but the profile stored in NRF is incomplete. Detection: query NRF WARN logs, not Prometheus metrics.
 [INFO] [LLM]   │  ---
 [INFO] [LLM]   │  nfSetIdList: 3GPP TS 29.510 section 6.1.6.2.2 — correct field name. List of NF Set identifiers the NF belongs to. Type: array of strings. Example: ['pcfSet-A', 'pcfSet-B']. NRF validates and stores this field.nfServiceList: 3GPP TS 29.510 correct field name. Map of NF service instances exposed by this NF. Key: service name. NRF indexes this for service discovery.plmnList: 3GPP TS 29.510 correct field name. List of PLMN IDs the NF serves. Each entry: {mcc, mnc}. NRF validates mcc/mnc against its allowed list.
 [INFO] [LLM]   │  ---
 [INFO] [LLM]   │  scpInfo: vendor-specific field, not in 3GPP TS 29.510 core NF profile schema. Some vendor implementations include this; Oracle NRF may drop it with WARN log. Not safe to auto-fix — could be intentional vendor extension.servingScope: vendor-specific or operator-defined field. Not in 3GPP TS 29.510 standard NF profile. Oracle NRF may silently drop. Requires operator review before removal.lcHSupportInd: vendor-specific indicator field. Not a standard 3GPP TS 29.510 NF Profile field. Oracle NRF silently drops. Likely safe to remove if not used by any consumer NF.
 [INFO] [LLM]   │  ---
 [INFO] [LLM]   │  olcHSupportInd: vendor-specific indicator field. Not a standard 3GPP TS 29.510 NF Profile field. Oracle NRF silently drops. Likely safe to remove if not used by any consumer NF.
 [INFO] [LLM]   │  
 [INFO] [LLM]   │  Analyze the evidence, then call the appropriate tool with correct parameters.
 [INFO] [LLM]   │  Also include in your text response:
 [INFO] [LLM]   │  {
 [INFO] [LLM]   │    "observations": [
 [INFO] [LLM]   │      "<key fact 1 from the evidence>",
 [INFO] [LLM]   │      "<key fact 2>",
 [INFO] [LLM]   │      "<key fact 3 if any>"
 [INFO] [LLM]   │    ],
 [INFO] [LLM]   │    "root_cause": "<one concise sentence>",
 [INFO] [LLM]   │    "confidence": "high" | "medium" | "low"
 [INFO] [LLM]   │  }
 [INFO] [LLM]   └────────────────────────────────────────────────────────────
 [INFO] [LLM]   tokens: prompt=1770  completion=187
 [INFO] [LLM]   < 200 OK  duration=23.0s
 [INFO] [LLM]   ── Tool Call ───────────────────────────────────────
 [INFO] [LLM]   tool → fix_profile_field  args={'wrong_name': 'nfSetIdLists', 'correct_name': 'nfSetIdList'}
 [INFO] [LLM]   ── Observations ────────────────────────────────────
 [INFO] [LLM]   [1] NRF retry rate is within a healthy range (1.3, which is < 3).
 [INFO] [LLM]   [2] The PCF's plmnList contains a PLMN that is in the allowed PLMNs list (510/011).
 [INFO] [LLM]   [3] There is a silent-drop field error for 'nfSetIdLists' as detected from NRF logs, which is a known typo of the correct field name 'nfSetIdList'.
 [INFO] [LLM]   ── Diagnosis ───────────────────────────────────────
 [INFO] [LLM]   root_cause → The NF registration failure is due to a field name typo in the PCF profile ('nfSetIdLists' instead of 'nfSetIdList') that is silently dropped by the NRF.
 [INFO] [LLM]   confidence → high
 [INFO] [LLM]   tool_call  → fix_profile_field({'wrong_name': 'nfSetIdLists', 'correct_name': 'nfSetIdList'})
 [INFO] [LLM]   ────────────────────────────────────────────────────
 [INFO] [GRAPH] → decide  route=execute_tool  tool=fix_profile_field  confidence=high
 [INFO] [GRAPH] → execute_tool  tool=fix_profile_field  args={'wrong_name': 'nfSetIdLists', 'correct_name': 'nfSetIdList'}
 [INFO] [SAFETY] Field 'nfSetIdLists' ✓ confirmed in fixable_typos whitelist
 [INFO] [FIX]   > calling fix_profile_field(**{'wrong_name': 'nfSetIdLists', 'correct_name': 'nfSetIdList'})
 [INFO] [EXECUTE] fix_profile_field completed  duration=0.055s
 [INFO] [GRAPH] → verify_fix  (field fix — verifying PCF profile keys, sleeping 5s...)
 [INFO] [TOOL/PCF] < profile keys: ['load', 'nfType', 'pcfInfo', 'capacity', 'locality', 'nfStatus', 'plmnList', 'nfServices', 'nfSetIdList', 'nfInstanceId', 'ipv4Addresses', 'nfServiceList', 'heartBeatTimer']
 [INFO] [VERIFY] profile has 'nfSetIdList', 'nfSetIdLists' absent  ✓ Field fix confirmed
 [INFO] [AGENT] ===== Run Complete =====
 [INFO] [AGENT] Root cause  : The NF registration failure is due to a field name typo in the PCF profile ('nfSetIdLists' instead of 'nfSetIdList') that is silently dropped by the NRF.
 [INFO] [AGENT] Fix action  : fix_field:nfSetIdLists:nfSetIdList
 [INFO] [AGENT] Confidence  : high
 [INFO] [AGENT] Fix applied : True
 [INFO] [AGENT] Fix verified: True
 [INFO] [AGENT] ==========================
 [INFO] [AUDIT] Incident saved → /data/incidents.jsonl
 [INFO] [KAFKA] Committed offset=24 partition=0
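
The rename and the key-based verification at the end of this run can be illustrated on a plain dictionary. Both helpers below are hypothetical; the actual fix goes through the PCF REST API, which is not reproduced here.

```python
def rename_profile_field(profile: dict, wrong_name: str, correct_name: str) -> dict:
    """Return a copy of the profile with wrong_name renamed to correct_name."""
    if wrong_name not in profile:
        return dict(profile)  # nothing to fix
    if correct_name in profile:
        raise ValueError(f"{correct_name!r} already present; refusing to overwrite")
    # Rebuild the dict so the order of the other keys is preserved.
    return {
        (correct_name if key == wrong_name else key): value
        for key, value in profile.items()
    }


def field_fix_verified(profile_keys, wrong_name: str, correct_name: str) -> bool:
    """The fix holds when the correct key exists and the typo is gone."""
    keys = set(profile_keys)
    return correct_name in keys and wrong_name not in keys
```

Checking profile keys rather than the retry rate matters here because a silent drop leaves the Prometheus metric in its normal range.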

Scenario 3 — Unknown Field (safety gate blocks → Slack escalation)

Field name is not in the fixable_typos whitelist. Even if the LLM infers a correction using 3GPP training knowledge, execute_tool rejects the call at the code layer and escalates to Slack.

[GRAPH]  → fetch_nrf_logs  ⚠ unknown dropped field: 'nfSetIdListxxx' (not in fixable_typos)
[GRAPH]  → rag_lookup  skipped — unknown fields ['nfSetIdListxxx'] — no RAG needed
[GRAPH]  → analyze  (mode=llm)
[LLM]    ── Diagnosis ───────────────────────────────────────
[LLM]    root_cause → Field 'nfSetIdListxxx' is causing NRF to silently drop it.
[LLM]    confidence → high
[LLM]    tool_call  → fix_profile_field({'wrong_name': 'nfSetIdListxxx', 'correct_name': 'nfSetIdList'})
[LLM]    ────────────────────────────────────────────────────

[GRAPH]  → decide  route=execute_tool  tool=fix_profile_field  confidence=high
[GRAPH]  → execute_tool  tool=fix_profile_field
[SAFETY] Field 'nfSetIdListxxx' not in fixable_typos whitelist — refusing.
[SAFETY] Approved: ['ipv4Address', 'nfServiceLists', 'nfSetIdLists', 'plmnLists']
[GRAPH]  → notify  (human escalation — no applicable auto-fix for this fault)
[ESCALATION] Alert: NfRegistrationFailure | ns=occnp2
[ESCALATION] Root cause: Field 'nfSetIdListxxx' dropped by NRF — not in approved fixable list
[ESCALATION] → Human operator intervention required
[NOTIFY] Slack notified  status=200
[AUDIT]  Incident saved → /data/incidents.jsonl
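
The split between fixable typos and unknown fields starts with parsing the NRF WARN message. The sketch below assumes the message wording quoted in the knowledge-base chunk of Scenario 2 and assumes that several dropped fields would be comma-separated, which this README does not confirm. The function names are hypothetical, and the whitelist is the "Approved" list from the log above.

```python
import re

# Wording taken from the knowledge-base chunk in Scenario 2. The
# comma-separated form for several fields is an assumption.
DROPPED_RE = re.compile(r"attributes have been dropped/ignored \[([^\]]*)\]")
FIXABLE_TYPOS = {"ipv4Address", "nfServiceLists", "nfSetIdLists", "plmnLists"}


def dropped_fields(log_lines):
    """Collect field names reported as dropped, in first-seen order."""
    seen = []
    for line in log_lines:
        match = DROPPED_RE.search(line)
        if not match:
            continue
        for name in match.group(1).split(","):
            name = name.strip()
            if name and name not in seen:
                seen.append(name)
    return seen


def classify(fields):
    """Split dropped fields into auto-fixable typos and unknown fields."""
    fixable = [f for f in fields if f in FIXABLE_TYPOS]
    unknown = [f for f in fields if f not in FIXABLE_TYPOS]
    return fixable, unknown
```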

Audit Log (`incidents.jsonl`)

Every agent run appends one structured record containing the full decision trail. The file can be queried line by line, and each record maps to a DynamoDB or OCI NoSQL item in a production deployment.

{"ts":"2026-05-18T03:22:47","alert_name":"NfRegistrationFailure","namespace":"5gpcf","root_cause":"PCF plmnList contains invalid PLMN (510/088), not in NRF allowed list.","fix_action":"update_plmn:510:011","confidence":"high","fix_applied":true,"fix_verified":true,"error":null}
{"ts":"2026-05-18T04:09:22","alert_name":"NfRegistrationFailure","namespace":"5gpcf","root_cause":"Field 'nfSetIdLists' silently dropped by NRF — typo of 'nfSetIdList'.","fix_action":"fix_field:nfSetIdLists:nfSetIdList","confidence":"high","fix_applied":true,"fix_verified":true,"error":null}
{"ts":"2026-05-18T11:08:06","alert_name":"NfRegistrationFailure","namespace":"5gpcf","root_cause":"PCF plmnList contains invalid PLMN (510/089), not accepted by NRF.","fix_action":"update_plmn:510:011","confidence":"high","fix_applied":true,"fix_verified":true,"error":null}
{"ts":"2026-05-18T13:23:13","alert_name":"NfRegistrationFailure","namespace":"5gpcf","root_cause":"Field 'nfSetIdLists' incorrectly named, silently dropped by NRF.","fix_action":"fix_field:nfSetIdLists:nfSetIdList","confidence":"high","fix_applied":true,"fix_verified":true,"error":null}
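
Appending to and querying a JSON Lines file needs only the standard library. The helpers below are a hypothetical sketch; the field names match the records above, but the function names and the `verified_fix_rate` summary are assumptions.

```python
import json
from pathlib import Path


def append_incident(path: Path, record: dict) -> None:
    """Append one incident as a single JSON line."""
    line = json.dumps(record, ensure_ascii=False)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def load_incidents(path: Path) -> list[dict]:
    """Read all records back, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def verified_fix_rate(incidents: list[dict]) -> float:
    """Share of applied fixes that were also verified."""
    applied = [i for i in incidents if i.get("fix_applied")]
    if not applied:
        return 0.0
    return sum(1 for i in applied if i.get("fix_verified")) / len(applied)
```

Writing one complete line per run keeps the file append-only, so earlier records are never rewritten.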
