<a href="https://colab.research.google.com/github/sbht04/ai-agents/blob/main/autonomous_it_support_agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment: The Autonomous IT Support Agent**

This is a coding assignment for the **Level 1 IT Incident Responder**.

This assignment assumes that learners have the `openai` library installed and an API key stored as an enabled notebook secret named `OPENAI_APIKEY`.

---

**Objective:** You are building an AI agent that acts as the "first responder" for server incidents. It must:

1. **Investigate:** Check server health and logs when a user reports an issue.
2. **Act:** If CPU or memory usage is critical (>90%), it should **Restart** the service.
3. **Escalate:** If the issue is complex or the logs show a major problem that a restart will not fix, it should **Escalate** to a human.

**Your Task:** Complete the code below by filling in the sections marked `### TODO`.
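
The three rules above can be sketched as a plain decision function, which gives a reference point when judging the agent's behavior in the test scenarios. This is a minimal sketch and not part of the assignment: `decide_action` and `DEPENDENCY_MARKERS` are hypothetical names, the 90% threshold is applied to memory as well as CPU (an assumption taken from the system prompt in Part 3), and the marker strings are simply borrowed from the sample logs.

```python
import json

# Hypothetical markers for problems a restart is unlikely to fix.
DEPENDENCY_MARKERS = ("connection refused", "dependency unreachable")


def decide_action(health_json: str, logs: list[str], threshold: float = 90.0) -> str:
    """Return 'restart', 'escalate', or 'none' for one server."""
    health = json.loads(health_json)
    if "error" in health:
        return "escalate"  # unknown server: hand off to a human

    # Escalate first: dependency failures take priority over a restart.
    if any(marker in line.lower() for line in logs for marker in DEPENDENCY_MARKERS):
        return "escalate"

    # Metrics arrive as strings such as "98%", so strip the sign before comparing.
    cpu = float(health["cpu"].rstrip("%"))
    memory = float(health["memory"].rstrip("%"))
    if cpu > threshold or memory > threshold:
        return "restart"

    return "none"


print(decide_action('{"cpu": "98%", "memory": "40%"}', ["[WARN] CPU threshold exceeded 90%"]))
print(decide_action('{"cpu": "10%", "memory": "15%"}', ["[ERROR] Connection refused: elastic-cluster-main:9200"]))
print(decide_action('{"cpu": "12%", "memory": "60%"}', ["[INFO] Health check: OK"]))
```

The agent itself makes this decision through the model rather than fixed rules, so its choices may differ from this sketch in borderline cases.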

In [None]:
import os
import json
from openai import OpenAI
from google.colab import userdata

In [None]:
OPENAI_APIKEY = userdata.get('OPENAI_APIKEY')
client = OpenAI(api_key=OPENAI_APIKEY)

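# Optional: verify the key loaded without printing it (never print secrets)
if not OPENAI_APIKEY:
    raise ValueError("OPENAI_APIKEY is empty. Add it as a notebook secret and enable access.")
print("API Key loaded successfully.")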

API Key loaded successfully.


# Initialize Client

==========================================
## PART 1: DEFINE THE TOOLS (BUSINESS LOGIC)
==========================================

In [None]:
# --- Already implemented: Tool 1 (Check Health) ---
def get_server_health(server_id: str) -> str:
    """Returns CPU and Memory usage for a given server."""
    print(f"-> TOOL: Checking health for {server_id}...")

    metrics = {
        # Scenario 1: High CPU (Needs Restart)
        "payment-server-01": {"cpu": "98%", "memory": "40%", "status": "Warning"},

        # Scenario 2: Healthy (No Action Needed)
        "db-node-02": {"cpu": "12%", "memory": "60%", "status": "Healthy"},

        # Scenario 3: High Memory Leak (Needs Restart or Escalation)
        "auth-service-03": {"cpu": "45%", "memory": "95%", "status": "Critical"},

        # Scenario 4: Network/Dependency Failure (Needs Escalation)
        "search-index-09": {"cpu": "10%", "memory": "15%", "status": "Error"},

        # Scenario 5: Completely Normal
        "frontend-node-04": {"cpu": "25%", "memory": "30%", "status": "Healthy"},
    }

    result = metrics.get(server_id, {"error": "Server not found. Check the ID."})
    return json.dumps(result)


In [None]:
def fetch_recent_logs(server_id: str, lines: int = 5) -> str:
    """Returns the last N lines of logs."""
    print(f"-> TOOL: Fetching last {lines} log lines for {server_id}...")

    # Different logs for different servers to trigger different agent behaviors
    log_database = {
        "payment-server-01": [
            "[INFO] Request received /pay/v1",
            "[WARN] CPU threshold exceeded 90%",
            "[WARN] Thread pool exhaustion",
            "[CRITICAL] Process hung, not accepting new connections",
            "[ERROR] Timeout waiting for thread"
        ],
        "db-node-02": [
            "[INFO] Backup started",
            "[INFO] Backup completed successfully",
            "[INFO] User query executed in 12ms",
            "[INFO] Health check: OK",
            "[INFO] Replication sync active"
        ],
        "auth-service-03": [
            "[INFO] Token validated user_882",
            "[WARN] Garbage collection taking too long (>5s)",
            "[ERROR] java.lang.OutOfMemoryError: Java heap space",
            "[CRITICAL] Application crashing due to memory leak",
            "[INFO] Restarting context..."
        ],
        "search-index-09": [
            "[INFO] Indexing started",
            "[ERROR] Connection refused: elastic-cluster-main:9200",
            "[ERROR] Failed to write document ID 4432",
            "[CRITICAL] Dependency Unreachable: Search Engine is down",
            "[ERROR] Retrying in 30s..."
        ],
        "frontend-node-04": [
            "[INFO] GET /home 200 OK",
            "[INFO] GET /assets/logo.png 200 OK",
            "[INFO] GET /login 200 OK",
            "[INFO] GET /api/v1/status 200 OK",
            "[INFO] Health check passed"
        ]
    }

    # Default logs if server not found in specific list
    default_logs = ["[INFO] System stable", "[INFO] Heartbeat signal received"]

    logs = log_database.get(server_id, default_logs)
    # Slice from the end so the result really is the last N lines
    return json.dumps({"logs": logs[-lines:] if lines > 0 else []})
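
The docstring of `fetch_recent_logs` promises the last N lines, so the slice direction matters. The short sketch below contrasts head and tail slicing on a list; `tail` is a hypothetical helper and not part of the assignment code.

```python
def tail(entries: list[str], n: int) -> list[str]:
    """Return the last n entries (all of them if n exceeds the length)."""
    # entries[-0:] would return the whole list, so guard non-positive n.
    return entries[-n:] if n > 0 else []


sample = ["line 1", "line 2", "line 3", "line 4", "line 5"]
print(sample[:2])       # first two entries
print(tail(sample, 2))  # last two entries
```

With the five-entry sample logs used here the difference only shows when fewer than five lines are requested.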

---
### ----- Implement code below -----
---


In [None]:
# --- TASK 1: Implement the Restart Tool ---
def restart_service(server_id: str) -> str:
    """Simulates restarting the service on a server and returns a JSON status string."""
    print("-> TOOL: Restarting service...")

    # --- Error Case 1: Missing or empty server_id ---
    if not server_id or not isinstance(server_id, str):
        return json.dumps({
            "status": "error",
            "message": "Invalid server_id provided"
        })

    # --- Error Case 2: Simulated failure scenario ---

    if server_id.lower() == "unreachable":
        return json.dumps({
            "status": "error",
            "message": f"Server '{server_id}' is unreachable. Restart failed."
        })

    # --- Success Case ---
    return json.dumps({
        "status": "success",
        "message": f"Server '{server_id}' restarted successfully"
    })




In [None]:
# --- TASK 2: Implement the Escalation Tool ---
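def escalate_to_engineer(summary: str) -> str:
    """Simulates opening an escalation ticket for a human engineer and returns a JSON status string."""
    print("-> TOOL: Escalating to human...")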

    # --- Error Case 1: Missing or invalid summary ---
    if not summary or not isinstance(summary, str):
        return json.dumps({
            "status": "error",
            "message": "Invalid summary provided"
        })

    # --- Success Case ---
    return json.dumps({
        "status": "success",
        "message": "Escalation ticket created successfully",
        "summary": summary
    })



In [None]:
# Map functions for the agent execution loop
AVAILABLE_FUNCTIONS = {
    "get_server_health": get_server_health,
    "fetch_recent_logs": fetch_recent_logs,
    "restart_service": restart_service,
    "escalate_to_engineer": escalate_to_engineer,
}
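
A name-to-function map like this lets the loop call whichever tool the model names. The sketch below shows one way to make that dispatch defensive, so that an unknown name or bad arguments produce a JSON error instead of an exception. `dispatch_tool`, `REGISTRY`, and the stub `ping` tool are hypothetical and only stand in for the real tools.

```python
import json


def ping(server_id: str) -> str:
    """Stub tool used only for this illustration."""
    return json.dumps({"status": "success", "server_id": server_id})


REGISTRY = {"ping": ping}


def dispatch_tool(name: str, raw_arguments: str, registry: dict) -> str:
    """Call registry[name] with JSON-encoded keyword arguments; always return a JSON string."""
    func = registry.get(name)
    if func is None:
        return json.dumps({"status": "error", "message": f"Unknown tool '{name}'"})
    try:
        kwargs = json.loads(raw_arguments)
        return func(**kwargs)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not match the function signature.
        return json.dumps({"status": "error", "message": f"Bad arguments for '{name}': {exc}"})


print(dispatch_tool("ping", '{"server_id": "payment-server-01"}', REGISTRY))
print(dispatch_tool("reboot_everything", "{}", REGISTRY))
```

Returning the error as a tool result, rather than raising, gives the model a chance to correct its call on the next turn.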

==========================================
## PART 2: DEFINE THE AGENT SCHEMA
==========================================


In [None]:
tools_schema = [
    {
        "type": "function",
        "function": {
            "name": "get_server_health",
            "description": "Checks the current CPU and memory usage of a specific server.",
            "parameters": {
                "type": "object",
                "properties": {
                    "server_id": {"type": "string", "description": "The ID of the server, e.g., 'payment-server-01'"}
                },
                "required": ["server_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "fetch_recent_logs",
            "description": "Retrieves the most recent log entries from a server to diagnose errors.",
            "parameters": {
                "type": "object",
                "properties": {
                    "server_id": {"type": "string", "description": "The ID of the server."},
                    "lines": {"type": "integer", "description": "Number of log lines to fetch."}
                },
                "required": ["server_id"]
            }
        }
    },
    # --- >>>> TASK 3: Define Schema for restart_service ---
    {
        "type": "function",
        "function": {
            "name": "restart_service",
            "description": "Restarts the service running on a specific server.",
            "parameters": {
                "type": "object",
                "properties": {
                    "server_id": {"type": "string", "description": "The ID of the server, e.g., 'payment-server-01'"}
                },
                "required": ["server_id"]
            }
        }
    },
    # --- >>>> TASK 4: Define Schema for escalate_to_engineer ---
    {
        "type": "function",
        "function": {
            "name": "escalate_to_engineer",
            "description": "Escalates the issue to a human engineer when automated fixes fail or the error is unknown.",
            "parameters": {
                "type": "object",
                "properties": {
                    "summary": {"type": "string", "description": "A short summary of the problem and the evidence found."}
                },
                "required": ["summary"]
            }
        }
    }
]
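
Because the schema is written by hand, it can drift from the Python functions it describes. The sketch below shows one possible consistency check that compares the parameter names declared in each schema entry with the function's signature. `schema_mismatches`, `demo_tool`, and `demo_schema` are hypothetical names used only for illustration; the demo schema deliberately omits one parameter.

```python
import inspect


def schema_mismatches(tools_schema: list, functions: dict) -> list:
    """Return human-readable problems; an empty list means schema and code agree."""
    problems = []
    for entry in tools_schema:
        spec = entry["function"]
        name = spec["name"]
        func = functions.get(name)
        if func is None:
            problems.append(f"{name}: no Python function registered")
            continue
        declared = set(spec["parameters"]["properties"])
        actual = set(inspect.signature(func).parameters)
        if declared != actual:
            problems.append(f"{name}: schema has {sorted(declared)}, function has {sorted(actual)}")
    return problems


def demo_tool(server_id: str, lines: int = 5) -> str:
    """Stub tool used only for this illustration."""
    return "{}"


demo_schema = [{
    "type": "function",
    "function": {
        "name": "demo_tool",
        "description": "Illustrative only.",
        "parameters": {
            "type": "object",
            "properties": {"server_id": {"type": "string"}},  # 'lines' left out on purpose
            "required": ["server_id"],
        },
    },
}]

print(schema_mismatches(demo_schema, {"demo_tool": demo_tool}))
```

In this notebook the same check could be run as `schema_mismatches(tools_schema, AVAILABLE_FUNCTIONS)` before starting the agent.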


==========================================
## PART 3: THE AGENT EXECUTION LOOP
==========================================


In [None]:
def run_it_agent(user_issue: str):
    print(f"\n--- New Incident: {user_issue} ---")

    messages = [
        {"role": "system", "content": "You are a Level 1 IT Responder. Investigate server issues. "
                                      "If CPU or Memory is > 90%, restart the service. If logs show critical dependency errors (like connection refused) that a restart won't fix, escalate to an engineer."},
        {"role": "user", "content": user_issue}
    ]

    while True:
        print("\n[AI Thinking...]")
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools_schema,
            tool_choice="auto"
        )

        response_msg = response.choices[0].message
        messages.append(response_msg)

        if response_msg.tool_calls:
            for tool_call in response_msg.tool_calls:
                func_name = tool_call.function.name
                func_args = json.loads(tool_call.function.arguments)

                # Retrieve the actual python function based on name
                function_to_call = AVAILABLE_FUNCTIONS.get(func_name)

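                if function_to_call:
                    # Execute the function
                    tool_output = function_to_call(**func_args)
                else:
                    # Every tool call needs a reply, or the next API request is rejected
                    tool_output = json.dumps({"error": f"Unknown tool '{func_name}'"})

                # Append tool result back to the conversation
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "name": func_name,
                    "content": tool_output
                })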

        else:
            print(f"\n[FINAL RESPONSE]: {response_msg.content}")
            break
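
The loop above runs until the model replies without tool calls, so a model that keeps requesting tools would never exit. One common safeguard is a turn limit. The sketch below shows the idea with stand-in functions instead of a real API call; `run_bounded`, `fake_model`, and `stuck_model` are hypothetical names.

```python
from typing import Callable, Optional


def run_bounded(next_step: Callable[[list], Optional[str]], max_turns: int = 5) -> str:
    """Call next_step(history) until it returns a final answer or the turn limit is hit.

    next_step returns None while more tool work is needed, or the final text when done.
    """
    history: list = []
    for turn in range(1, max_turns + 1):
        answer = next_step(history)
        if answer is not None:
            return answer
        history.append(f"turn {turn}: tool work")
    return f"Stopped after {max_turns} turns without a final answer."


def fake_model(history: list) -> Optional[str]:
    """Stand-in for the chat completion call: asks for tools twice, then answers."""
    return None if len(history) < 2 else "Service restarted."


def stuck_model(history: list) -> Optional[str]:
    """Stand-in that never finishes, to show the limit taking effect."""
    return None


print(run_bounded(fake_model))
print(run_bounded(stuck_model, max_turns=3))
```

In `run_it_agent`, the same effect could be had by replacing `while True` with a bounded `for` loop and escalating when the limit is reached.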

==========================================
## PART 4: TEST SCENARIOS
==========================================


In [None]:
# Scenario A: Should trigger a restart (CPU is 98%)
run_it_agent("The payment-server-01 is extremely slow and timing out.")


--- New Incident: The payment-server-01 is extremely slow and timing out. ---

[AI Thinking...]
-> TOOL: Checking health for payment-server-01...
-> TOOL: Fetching last 10 log lines for payment-server-01...

[AI Thinking...]
-> TOOL: Restarting service...

[AI Thinking...]

[FINAL RESPONSE]: The service on `payment-server-01` was consuming 98% CPU, which likely caused the slowness and timeouts. I have restarted the service, and it was successful. The server should be operating normally again. Please let me know if there are any further issues!


In [None]:
# Scenario B: Vague report on a healthy server (db-node-02)
# Agent should see normal metrics and clean logs -> No action / Report healthy
run_it_agent("Something is wrong with db-node-02")


--- New Incident: Something is wrong with db-node-02 ---

[AI Thinking...]
-> TOOL: Checking health for db-node-02...
-> TOOL: Fetching last 50 log lines for db-node-02...

[AI Thinking...]

[FINAL RESPONSE]: The server "db-node-02" appears to be functioning normally based on the health check with CPU usage at 12% and memory usage at 60%. The status is marked as "Healthy".

The recent logs show normal operations such as backup processes and health checks, with no critical errors reported. If there are specific issues you are experiencing with "db-node-02," please provide more details so I can assist you further.


In [None]:
# Custom Python Script to fetch the list of models from OpenAI
try:
    # Fetch the list of models from OpenAI
    models = client.models.list()

    # Extract the IDs and sort them alphabetically
    model_ids = sorted(model.id for model in models)

    print(f"Found {len(model_ids)} available models:\n")
    for model_id in model_ids:
        print(f"- {model_id}")
except Exception as e:
    print(f"Error fetching models: {e}")

Found 1 available models:

- gpt-4o


In [None]:
# Scenario C: The High Memory Case (auth-service-03)
# Agent should see Memory 95% + OutOfMemoryError logs -> Restart
run_it_agent("Users are reporting login failures on auth-service-03.")

print("\n" + "="*50 + "\n")


--- New Incident: Users are reporting login failures on auth-service-03. ---

[AI Thinking...]
-> TOOL: Checking health for auth-service-03...
-> TOOL: Fetching last 10 log lines for auth-service-03...

[AI Thinking...]
-> TOOL: Restarting service...

[AI Thinking...]

[FINAL RESPONSE]: The auth-service-03 server experienced a critical memory issue, specifically a memory leak that caused the application to crash. I've restarted the service, and the server has been successfully restarted. Please ask users to try logging in again. If the problem persists, let me know so we can escalate it to an engineer for further investigation.




In [None]:
# Scenario D: The Dependency Failure (search-index-09)
# Agent should see healthy CPU but "Connection Refused" logs -> Escalate
run_it_agent("Search isn't working. Can you check search-index-09?")

print("\n" + "="*50 + "\n")


--- New Incident: Search isn't working. Can you check search-index-09? ---

[AI Thinking...]
-> TOOL: Checking health for search-index-09...
-> TOOL: Fetching last 10 log lines for search-index-09...

[AI Thinking...]
-> TOOL: Escalating to human...

[AI Thinking...]

[FINAL RESPONSE]: I've reviewed the issue with `search-index-09`. The server's CPU and memory usage are normal, but the logs indicate a critical dependency error: "Connection refused" to the main search engine cluster at `elastic-cluster-main:9200`. This issue has been escalated to an engineer for further investigation, and an escalation ticket has been successfully created.




In [None]:
# Scenario E: The Healthy Server (frontend-node-04)
# Agent should see normal stats and 200 OK logs -> Do nothing / Report healthy
run_it_agent("Check frontend-node-04 just to be safe.")


--- New Incident: Check frontend-node-04 just to be safe. ---

[AI Thinking...]
-> TOOL: Checking health for frontend-node-04...

[AI Thinking...]

[FINAL RESPONSE]: The server "frontend-node-04" is currently healthy with CPU usage at 25% and memory usage at 30%. There are no immediate concerns to address.
