# AI Agent for IT Infrastructure: Automated Anomaly Response

**Copyright (c) 2026 Shrikara Kaudambady**

This notebook demonstrates a simulated AI Agent for IT Operations (AIOps). The agent follows an **Observe-Orient-Decide-Act (OODA)** loop to autonomously handle an infrastructure anomaly.

**Scenario:** The agent monitors server CPU utilization. When it detects an anomalous spike, it automatically performs a root cause analysis by running a series of (simulated) diagnostic checks and then recommends a solution.

### Part 1: The Event (Observe)

The process begins when the monitoring system detects an event that violates normal operating thresholds. Here, we'll generate a time-series dataset of CPU usage and inject a sudden, sustained spike to simulate an anomaly. This spike is the **trigger** for our AI agent.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

# Generate a baseline of normal CPU usage
timestamps = pd.date_range(start='2026-01-01', periods=120, freq='min')  # 2 hours of data; 'min' replaces the deprecated 'T' alias
baseline_usage = np.random.normal(loc=30, scale=5, size=120)
baseline_usage = np.clip(baseline_usage, 20, 45)

# Inject an anomaly
anomaly_start_index = 80
anomaly = np.linspace(95, 100, 40)
cpu_usage = np.copy(baseline_usage)
cpu_usage[anomaly_start_index:] = anomaly + np.random.normal(0, 1, 40)
cpu_usage = np.clip(cpu_usage, 0, 100)

df = pd.DataFrame({'timestamp': timestamps, 'cpu_percent': cpu_usage})
anomaly_timestamp = df['timestamp'][anomaly_start_index]

# Visualize the event
plt.figure(figsize=(15, 6))
plt.plot(df['timestamp'], df['cpu_percent'], label='CPU Utilization (%)')
plt.axvline(anomaly_timestamp, color='r', linestyle='--', label=f'Anomaly Detected at {anomaly_timestamp.time()}')
plt.title('Server CPU Utilization Monitoring')
plt.xlabel('Time')
plt.ylabel('CPU Usage (%)')
plt.legend()
plt.grid(True)
plt.show()
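
The cell above hard-codes where the anomaly begins (`anomaly_start_index`), so the "detection" is assumed rather than computed. A monitoring system would need to locate that point from the data. The following is a minimal sketch of one way to do this, assuming a fixed threshold of 90% held for 5 consecutive samples. The helper name `detect_sustained_spike`, the threshold, and the duration are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def detect_sustained_spike(series, threshold=90.0, min_duration=5):
    """Return the index label where `series` first stays above `threshold`
    for `min_duration` consecutive samples, or None if it never does."""
    above = series > threshold
    # The rolling sum equals min_duration only when every sample in the window is above threshold
    run = above.astype(int).rolling(min_duration).sum()
    hits = run[run == min_duration]
    if hits.empty:
        return None
    # The matching window ends at hits.index[0]; step back to the first sample of the run
    end_pos = series.index.get_loc(hits.index[0])
    return series.index[end_pos - min_duration + 1]

# Rebuild a small synthetic series shaped like the one above (values are illustrative)
rng = np.random.default_rng(42)
demo_ts = pd.date_range(start='2026-01-01', periods=120, freq='min')
demo_cpu = np.clip(rng.normal(30, 5, 120), 20, 45)
demo_cpu[80:] = 97.0
demo = pd.Series(demo_cpu, index=demo_ts)

detected_at = detect_sustained_spike(demo)
print(f"Sustained spike begins at {detected_at}")
```

Requiring several consecutive samples above the threshold is a simple guard against alerting on a single noisy reading. Production systems often use statistical baselines instead of a fixed threshold.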

### Part 2: Automated Diagnostics (Orient)

The agent has observed the anomaly. Now, it must **orient** itself by gathering more data to understand the context. We'll define several functions that simulate running diagnostic commands on the affected server. In a real system, these would execute shell commands or API calls.

In [None]:
def check_top_processes(timestamp):
    """Simulates running `ps aux` or `top` to find high-CPU processes."""
    print("  - [Diagnostic] Checking top processes...")
    data = [
        {'pid': 101, 'user': 'root', 'cpu_percent': 95.2, 'memory_percent': 75.1, 'command': 'java -jar /opt/app/server.jar'},
        {'pid': 203, 'user': 'mysql', 'cpu_percent': 2.1, 'memory_percent': 15.3, 'command': 'mysqld'},
        {'pid': 1, 'user': 'root', 'cpu_percent': 0.1, 'memory_percent': 0.5, 'command': 'systemd'}
    ]
    return pd.DataFrame(data)

def check_app_logs(timestamp):
    """Simulates checking the application's log file for recent errors."""
    print("  - [Diagnostic] Grepping application logs for errors...")
    log_time = timestamp - pd.Timedelta(minutes=2)
    return f'''
    {log_time - pd.Timedelta(minutes=1)} [INFO] Request received: /api/data
    {log_time} [ERROR] OutOfMemoryError: Java heap space. Dumping heap to /tmp/heap.bin
    {log_time + pd.Timedelta(seconds=30)} [WARN] High garbage collection overhead detected.
    '''

def check_recent_deployments(timestamp):
    """Simulates checking a deployment log to rule out a recent code change."""
    print("  - [Diagnostic] Checking deployment history...")
    deploy_time = timestamp - pd.Timedelta(days=3)
    return f"Last deployment was on {deploy_time}. No recent changes found."
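
The functions above return canned data. As a rough illustration of what a live version might look like, the sketch below shells out to `ps` and parses its output into the same kind of DataFrame. It assumes a Unix-like host where `ps -eo pid,user,pcpu,comm` is available and prints CPU values with a dot as the decimal separator. The names `parse_ps_output` and `check_top_processes_live` are hypothetical. The parser is exercised on sample text so the cell runs on any machine.

```python
import subprocess

import pandas as pd

def parse_ps_output(text):
    """Parse `ps -eo pid,user,pcpu,comm` style output into a DataFrame sorted by CPU."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        # maxsplit=3 keeps command names that contain spaces in one piece
        pid, user, cpu, command = line.split(None, 3)
        rows.append({'pid': int(pid), 'user': user,
                     'cpu_percent': float(cpu), 'command': command})
    return pd.DataFrame(rows).sort_values('cpu_percent', ascending=False, ignore_index=True)

def check_top_processes_live(timeout=5):
    """Hypothetical live counterpart to check_top_processes(); requires `ps` on the host."""
    result = subprocess.run(['ps', '-eo', 'pid,user,pcpu,comm'],
                            capture_output=True, text=True, check=True, timeout=timeout)
    return parse_ps_output(result.stdout)

# Sample text standing in for real command output
sample_output = """
  PID USER     %CPU COMMAND
  101 root     95.2 java
  203 mysql     2.1 mysqld
    1 root      0.1 systemd
"""
top = parse_ps_output(sample_output)
print(top)
```

Keeping the parsing separate from the command execution makes the diagnostic testable without access to the affected server.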

### Part 3: The Decision Engine (Decide)

With the diagnostic data collected, the agent's "brain" must **decide** on the most probable cause. We'll create a simple rule-based function that analyzes the evidence and forms a conclusion.

In [None]:
def run_anomaly_diagnostics(timestamp):
    """The main agent logic that runs diagnostics and makes a decision."""
    print("\n--- AI Agent Anomaly Report ---")
    print(f"Trigger: CPU usage anomaly detected at {timestamp.time()}\n")
    
    print("[Step 1] Running diagnostics...")
    top_processes = check_top_processes(timestamp)
    app_logs = check_app_logs(timestamp)
    deployment_info = check_recent_deployments(timestamp)
    print("[Step 1] Diagnostics complete.\n")
    time.sleep(1)  # Brief pause so the staged output is easier to follow

    print("[Step 2] Analyzing results...")
    # Find the process with the highest CPU
    high_cpu_process = top_processes.loc[top_processes['cpu_percent'].idxmax()]
    
    # Define rules for decision making
    suspected_cause = "Undetermined"
    recommended_action = "Escalate to human operator for manual investigation."
    
    is_memory_error = 'OutOfMemoryError' in app_logs or 'Java heap space' in app_logs
    is_app_server = 'java' in high_cpu_process['command']
    
    if high_cpu_process['cpu_percent'] > 90 and is_app_server and is_memory_error:
        suspected_cause = f"The '{high_cpu_process['command'].split()[0]}' application is unresponsive due to a Java heap space error."
        recommended_action = f"Restart the '{high_cpu_process['command'].split()[0]}' process (PID: {high_cpu_process['pid']}) to clear its memory state."
    
    print("[Step 2] Analysis complete.\n")
    time.sleep(1)
    
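    # Compile the final report
    # Select the first [ERROR] line by content instead of relying on a fixed line position
    error_line = next((line.strip() for line in app_logs.splitlines() if '[ERROR]' in line), 'none found')
    report = {
        'findings': [
            f"Process '{high_cpu_process['command'].split()[0]}' (PID: {high_cpu_process['pid']}) is using {high_cpu_process['cpu_percent']}% CPU.",
            f"Found critical error in application logs: '{error_line}'.",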
            deployment_info
        ],
        'cause': suspected_cause,
        'action': recommended_action
    }
    
    return report
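
The decision logic above encodes a single rule in one `if` statement. As more failure modes are covered, one option is to express the rules as data and evaluate them in order. The sketch below assumes a simplified evidence dictionary. The `decide` helper, the `RULES` table, the `hours_since_deploy` field, and the second rule are hypothetical additions for illustration. Only the first rule mirrors the notebook's heap-space check.

```python
def decide(evidence, rules):
    """Return (cause, action) from the first rule whose condition matches the evidence."""
    for rule in rules:
        if rule['condition'](evidence):
            return rule['cause'], rule['action']
    # No rule matched: fall back to the same default the agent uses
    return "Undetermined", "Escalate to human operator for manual investigation."

# Hypothetical rule table; rules are checked top to bottom
RULES = [
    {
        'condition': lambda e: (e['cpu_percent'] > 90
                                and 'java' in e['command']
                                and 'OutOfMemoryError' in e['logs']),
        'cause': "Java heap space error",
        'action': "Restart the java process",
    },
    {
        'condition': lambda e: e['cpu_percent'] > 90 and e['hours_since_deploy'] < 1,
        'cause': "Regression introduced by a recent deployment",
        'action': "Roll back the latest deployment",
    },
]

# Evidence shaped like the simulated diagnostics above
evidence = {
    'cpu_percent': 95.2,
    'command': 'java -jar /opt/app/server.jar',
    'logs': '[ERROR] OutOfMemoryError: Java heap space.',
    'hours_since_deploy': 72,
}
cause, action = decide(evidence, RULES)
print(f"Probable Cause: {cause}\nRecommended Action: {action}")
```

Because the first matching rule wins, more specific rules should be listed before more general ones.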

### Part 4: The Response (Act)

Finally, the agent must **act**. In this simulation, the action is to present its findings in a clear, human-readable report. In a real-world scenario, this could mean automatically creating a high-priority ticket, posting a message to a Slack channel, or even attempting a safe, automated remediation step.

In [None]:
# Trigger the agent's workflow
final_report = run_anomaly_diagnostics(anomaly_timestamp)

print("[Step 3] Generating final report...\n")
print("========================================")
print("              FINDINGS")
print("----------------------------------------")
for finding in final_report['findings']:
    print(f"- {finding}")
print("\n----------------------------------------")
print("           CONCLUSION")
print("----------------------------------------")
print(f"Probable Cause: {final_report['cause']}")
print(f"\nRecommended Action: {final_report['action']}")
print("========================================")
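
For the chat-notification option mentioned above, the report can be serialized into a JSON body. The sketch below builds a payload with a single `text` field, which is the basic shape accepted by Slack incoming webhooks. It does not send anything: the webhook URL and the HTTP POST are left out on purpose. The function name `build_alert_payload` and the stand-in report are assumptions for illustration.

```python
import json

def build_alert_payload(report):
    """Format the agent's report dict as a JSON body with a single `text` field."""
    lines = ["*AI Agent Anomaly Report*", "Findings:"]
    lines += [f"- {finding}" for finding in report['findings']]
    lines.append(f"Probable Cause: {report['cause']}")
    lines.append(f"Recommended Action: {report['action']}")
    return json.dumps({'text': "\n".join(lines)})

# Stand-in report with the same keys that run_anomaly_diagnostics() returns
sample_report = {
    'findings': ["Process 'java' (PID: 101) is using 95.2% CPU."],
    'cause': "Java heap space error.",
    'action': "Restart the 'java' process (PID: 101).",
}
payload = build_alert_payload(sample_report)
print(payload)
```

Building the payload separately from sending it keeps the formatting easy to test and lets the same report feed a ticketing system or another channel.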

### Conclusion

This notebook has demonstrated a simplified but powerful concept in AIOps: an autonomous agent for incident response. By following the **Observe-Orient-Decide-Act** loop, the agent was able to:

1.  **Observe:** Detect a CPU spike.
2.  **Orient:** Correlate the spike with high CPU usage from a specific process and a critical memory error in the logs.
3.  **Decide:** Conclude that a Java heap space error in the application was the probable root cause.
4.  **Act:** Recommend a specific, actionable remediation step (restarting the process).

Implementing such agents can reduce the Mean Time to Resolution (MTTR) for common, well-understood incidents, freeing human engineers to focus on more complex problems.