# Day 2 — Exercise 6: Robust Agent Error Handling (Enterprise Level)

## Background & Plan

**Objective:** Design an enterprise-grade agent capable of handling tool failures gracefully through comprehensive error handling and logging.

### Why this topic matters
In production environments, services and tools can be unreliable due to network issues, rate limits, or bugs. Without robust error handling, a single failure can cascade and degrade user experience. Structured logging and monitoring provide visibility into system health and help identify recurring problems.

### What we'll build (final objective)
We will enhance a simple, unreliable agent by adding retry logic, timeouts, and a fallback mechanism. We'll capture structured logs and analyze them to compute metrics like success rate and average retries. This mirrors enterprise practices where agents must be resilient and observable.

### How steps unfold (basic → intermediate → advanced)
- **Stage A — Basic Setup:** Configure structured logging and create a baseline agent that simulates failures. Test its behaviour.
- **Stage B — Intermediate Enhancements:** Implement retry and timeout decorators and add a fallback mechanism. Compose these to build a robust agent. Test it thoroughly.
- **Stage C — Advanced Analysis:** Collect logs from multiple runs, parse them into a structured format (a DataFrame), and compute metrics to understand performance and failure patterns.

### Requirements
- Python ≥ 3.9
- Standard libraries: `logging`, `time`, `random`, `functools`, `signal` (the timeout in Step 5 relies on `SIGALRM`, which is available only on Unix-like systems)
- Optional: `pandas` for tabular analysis (install with `pip install pandas` if not available)


## Plan Outline
- **Stage A — Setup Logging and Baseline Agent**
  - Step 1: Configure structured logging
  - Step 2: Define a baseline agent that fails randomly
  - Step 3: Perform a basic test of the baseline agent

- **Stage B — Add Retries, Timeouts, and Fallbacks**
  - Step 4: Implement a retry decorator with exponential backoff
  - Step 5: Implement a timeout decorator
  - Step 6: Implement a fallback function for unrecoverable failures
  - Step 7: Compose a robust agent using retry, timeout, and fallback
  - Step 8: Test the robust agent

- **Stage C — Collect Logs and Analyze Metrics**
  - Step 9: Run multiple trials and collect detailed logs
  - Step 10: Compute metrics from the logs (success rate, retries, elapsed time)
  - Step 11: Explore the log data (preview table; optional visualization)


### Stage A — Setup Logging and Baseline Agent

We begin by configuring structured logging and defining a baseline agent that occasionally fails. This stage establishes a baseline for later improvements.

#### Step 1: Configure structured logging

In [1]:
import logging

# Set up logging with timestamp, level, and message
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s'
)

logging.info('Logging configured.')


2025-09-20 02:21:29,387 [INFO] Logging configured.


We configure the logging module to include timestamps and log levels. This configuration will be used throughout the notebook to capture structured events.
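
The format string above produces human-readable text lines. If the logs are later parsed by a machine, one common option is to emit each record as a JSON object instead. The following is a minimal sketch under that assumption; `JsonFormatter`, the logger name `agent.json`, and the `attempt` and `tool` fields are hypothetical names chosen for illustration and are not used elsewhere in this exercise.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("attempt", "tool"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

json_logger = logging.getLogger("agent.json")  # illustrative logger name
json_logger.addHandler(handler)
json_logger.setLevel(logging.INFO)
json_logger.propagate = False  # avoid duplicate output through the root logger

json_logger.warning("Attempt failed", extra={"attempt": 1, "tool": "baseline_agent"})
```

Using a separate named logger keeps this sketch from interfering with the `basicConfig` setup used by the rest of the notebook.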

#### Step 2: Define a baseline agent function

In [2]:
import random

def baseline_agent(x: int) -> int:
    '''Baseline agent that doubles the input but fails randomly.'''
    # Simulate a 50% chance of failure
    if random.random() < 0.5:
        raise RuntimeError('Simulated tool failure')
    return x * 2

print('Baseline agent defined.')


Baseline agent defined.


The `baseline_agent` multiplies its input by 2 but raises a `RuntimeError` half of the time to simulate a flaky external tool.

#### Step 3: Perform a basic test of the baseline agent

In [3]:
# Execute baseline agent on several inputs to observe its behaviour
for i in range(5):
    try:
        result = baseline_agent(i)
        print(f'Call {i}: SUCCESS → {result}')
    except Exception as e:
        print(f'Call {i}: FAILURE → {e}')


Call 0: SUCCESS → 0
Call 1: FAILURE → Simulated tool failure
Call 2: SUCCESS → 4
Call 3: SUCCESS → 6
Call 4: SUCCESS → 8


In this run only one of the five calls failed, but over many calls roughly half will fail, because each call fails independently with probability 0.5. In the next stage we'll implement strategies to handle these failures.
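
Five calls are too few to judge the failure rate. As a rough check, the sketch below repeats the call many times and reports the observed fraction of failures. The helper `estimate_failure_rate` and the fixed seed are assumptions added for illustration; `baseline_agent` is redefined only so that the block runs on its own.

```python
import random


def baseline_agent(x: int) -> int:
    """Same behaviour as the baseline agent above: fails about half the time."""
    if random.random() < 0.5:
        raise RuntimeError("Simulated tool failure")
    return x * 2


def estimate_failure_rate(trials: int, seed: int = 0) -> float:
    """Call the agent `trials` times and return the fraction of failed calls."""
    random.seed(seed)  # fixed seed so the estimate is reproducible
    failures = 0
    for i in range(trials):
        try:
            baseline_agent(i)
        except RuntimeError:
            failures += 1
    return failures / trials


rate = estimate_failure_rate(10_000)
print(f"Observed failure rate over 10,000 calls: {rate:.3f}")
```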

### Stage B — Add Retries, Timeouts, and Fallbacks

We now enhance our agent by adding retry logic with exponential backoff, a timeout to prevent hanging calls, and a fallback function to provide a default response when all retries fail.

#### Step 4: Implement a retry decorator

In [4]:
import time
from functools import wraps

def retry(max_retries: int = 3, backoff_base: float = 2.0):
    '''Return a decorator that retries the wrapped function upon exception.'''
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    logging.warning(f'Attempt {attempt} failed: {e}')
                    if attempt == max_retries:
                        logging.error('Max retries reached; re-raising exception.')
                        raise
                    # Exponential backoff to avoid rapid retries
                    wait = backoff_base ** attempt
                    logging.info(f'Waiting {wait:.1f}s before retrying...')
                    time.sleep(wait)
        return wrapper
    return decorator


The `retry` decorator calls the function up to `max_retries` times. After each failed attempt except the last, it sleeps for `backoff_base ** attempt` seconds before retrying; after the final failure it re-raises the exception.
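
To make the delays concrete, the sketch below computes the sleep schedule implied by the `backoff_base ** attempt` expression in the decorator. `backoff_schedule` is a hypothetical helper written for this illustration; it does not sleep, it only lists the delays.

```python
def backoff_schedule(max_retries: int = 3, backoff_base: float = 2.0) -> list[float]:
    """Delays slept between attempts by the retry decorator above.

    No delay follows the final attempt, because the exception is re-raised there.
    """
    return [backoff_base ** attempt for attempt in range(1, max_retries)]


print(backoff_schedule())             # defaults: [2.0, 4.0]
print(backoff_schedule(3, 1.5))       # settings used in Step 7: [1.5, 2.25]
print(sum(backoff_schedule(3, 1.5)))  # worst-case time spent sleeping: 3.75
```

With three attempts and a base of 1.5, a call that fails every time spends about 3.75 seconds waiting before the final exception is raised.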

#### Step 5: Implement a timeout decorator

In [5]:
import signal

class TimeoutError(Exception):
    '''Raised when a function call exceeds the allotted time.

    Note: within this notebook, this name shadows Python's built-in TimeoutError.
    '''


def timeout(seconds: int):
    '''Return a decorator that aborts the function if it runs longer than `seconds`.'''
    def decorator(func):
        def _handle_timeout(signum, frame):
            raise TimeoutError('Function call timed out')

        @wraps(func)
        def wrapper(*args, **kwargs):
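            # Register the alarm handler, remembering the previous one
            previous = signal.signal(signal.SIGALRM, _handle_timeout)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                # Cancel the alarm and restore the previous handler
                signal.alarm(0)
                signal.signal(signal.SIGALRM, previous)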
        return wrapper
    return decorator


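The `timeout` decorator uses the Unix `SIGALRM` signal to interrupt a long-running call, so it works only on Unix-like systems and only in the main thread, and `signal.alarm` accepts whole seconds only. It raises a `TimeoutError` if the wrapped function doesn't return within the specified number of seconds.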
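
Where signals are not available (for example on Windows or in a worker thread), one possible alternative is to run the call in a helper thread and stop waiting after a deadline. The sketch below is an assumption-laden illustration, not part of the exercise: `thread_timeout` and `slow_tool` are hypothetical names, and unlike the signal-based version the timed-out call is not interrupted; it keeps running in the background until it finishes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from functools import wraps


def thread_timeout(seconds: float):
    """Hypothetical portable variant: wait at most `seconds` for the result."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            executor = ThreadPoolExecutor(max_workers=1)
            future = executor.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except FutureTimeout:
                raise TimeoutError(f"{func.__name__} exceeded {seconds}s") from None
            finally:
                # Do not block on the worker; a timed-out call keeps running.
                executor.shutdown(wait=False)
        return wrapper
    return decorator


@thread_timeout(0.1)
def slow_tool() -> str:
    """Stand-in for a tool call that takes too long."""
    time.sleep(0.5)
    return "done"


try:
    slow_tool()
except TimeoutError as exc:
    print(f"Timed out: {exc}")
```

The trade-off is portability and sub-second deadlines in exchange for losing the ability to abort the underlying work.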

#### Step 6: Implement a fallback function

In [6]:
def fallback_agent(x: int) -> int:
    '''Fallback used when baseline agent repeatedly fails. Returns a default value.'''
    logging.info('Fallback agent invoked.')
    return -1  # Return a sentinel value indicating failure


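The fallback agent provides a safe default value when the primary agent cannot produce a result. The sentinel must be a value the primary agent can never legitimately return; `-1` qualifies here because `x * 2` is never `-1` for an integer `x`. In a real system, the fallback might call a simpler heuristic or return a cached response.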
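
As an illustration of the cached-response idea, the sketch below keeps the last good result per input and returns `None` when nothing is cached, which avoids relying on a numeric sentinel. The names `_last_good`, `remember`, and `cached_fallback` are hypothetical and are not used by the rest of the exercise.

```python
from typing import Dict, Optional

# Hypothetical cache of the last good result per input.
_last_good: Dict[int, int] = {}


def remember(x: int, result: int) -> int:
    """Record a successful result so that a later fallback can reuse it."""
    _last_good[x] = result
    return result


def cached_fallback(x: int) -> Optional[int]:
    """Return the last known good result for `x`, or None if there is none."""
    return _last_good.get(x)


remember(3, 6)
print(cached_fallback(3))  # 6
print(cached_fallback(4))  # None
```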

#### Step 7: Compose a robust agent using retry, timeout, and fallback

In [7]:
# First wrap the baseline agent with timeout
@timeout(seconds=1)
def timed_baseline(x: int) -> int:
    return baseline_agent(x)

# Decorate with retry. Note: the try/except below handles a failure first,
# so retry never sees an exception and the fallback runs on the first failure.
@retry(max_retries=3, backoff_base=1.5)
def robust_agent(x: int) -> int:
    try:
        return timed_baseline(x)
    except (RuntimeError, TimeoutError):
        return fallback_agent(x)

print('Robust agent composed.')


Robust agent composed.


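We first apply a timeout to the baseline agent and then decorate `robust_agent` with `retry`. Note that the `try`/`except` inside `robust_agent` catches `RuntimeError` and `TimeoutError` on the first failure and returns the fallback value, so `retry` never sees an exception and no retries take place, as the log output in Step 8 confirms. For retries to run, the fallback must sit outside the retried call.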
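
One way to arrange the pieces so that retries run before the fallback is to put `retry` innermost and the fallback outermost. The sketch below assumes a condensed stand-in for the Step 4 decorator (with a hypothetical `unit` parameter that shortens the delays), a hypothetical `with_fallback` decorator, and deterministic stand-ins `flaky` and `always_fails` in place of `timed_baseline`, so that the behaviour is repeatable.

```python
import time
from functools import wraps


def retry(max_retries: int = 3, backoff_base: float = 1.5, unit: float = 1.0):
    """Condensed stand-in for the Step 4 decorator; `unit` scales the delay."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise
                    time.sleep(unit * backoff_base ** attempt)
        return wrapper
    return decorator


def with_fallback(fallback):
    """Hypothetical decorator: call `fallback` only after `func` finally fails."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                return fallback(*args, **kwargs)
        return wrapper
    return decorator


calls = {"count": 0}


def flaky(x: int) -> int:
    """Deterministic stand-in for timed_baseline: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Simulated tool failure")
    return x * 2


def always_fails(x: int) -> int:
    """Stand-in for a tool that never recovers."""
    raise RuntimeError("Simulated tool failure")


# Order matters: retry is innermost, so it sees every exception;
# the fallback is outermost, so it runs only after the last retry fails.
robust = with_fallback(lambda x: -1)(retry(max_retries=3, unit=0.01)(flaky))
hopeless = with_fallback(lambda x: -1)(retry(max_retries=3, unit=0.01)(always_fails))

first = robust(5)
second = hopeless(5)
print(first)   # 10, after two failed attempts
print(second)  # -1, after three failed attempts
```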

#### Step 8: Test the robust agent

In [8]:
for i in range(5):
    result = robust_agent(i)
    status = 'SUCCESS' if result != -1 else 'FALLBACK'
    print(f'Robust call {i}: {status} → {result}')


2025-09-20 02:21:29,492 [INFO] Fallback agent invoked.
2025-09-20 02:21:29,493 [INFO] Fallback agent invoked.
2025-09-20 02:21:29,498 [INFO] Fallback agent invoked.


Robust call 0: SUCCESS → 0
Robust call 1: FALLBACK → -1
Robust call 2: FALLBACK → -1
Robust call 3: FALLBACK → -1
Robust call 4: SUCCESS → 8


Observe that the robust agent returns the sentinel value (-1) rather than raising an exception, so a tool failure does not crash the calling pipeline. The log contains no `Attempt N failed` warnings: each fallback here follows a single failed call.

### Stage C — Collect Logs and Analyze Metrics

Finally, we'll run the robust agent across multiple inputs, capture detailed logs including retry counts and elapsed times, and analyze the results. This provides insight into system performance and reliability.

#### Step 9: Run multiple trials and collect detailed logs

In [9]:
import pandas as pd
import time

# Retry timed_baseline manually so that attempts and elapsed time can be recorded
def call_with_logging(x: int):
    start = time.time()
    tries = 0
    result = None
    error = None
    while True:
        tries += 1
        try:
            result = timed_baseline(x)
            break
        except Exception as e:
            if tries >= 3:
                result = fallback_agent(x)
                error = type(e).__name__
                break
            time.sleep(1.5 ** tries)
    elapsed = time.time() - start
    status = 'success' if result != -1 else 'fallback'
    return {'input': x, 'status': status, 'result': result, 'error': error, 'retries': tries - 1, 'elapsed_s': elapsed}

# Run the robust agent for inputs 0–9
log_records = [call_with_logging(i) for i in range(10)]
log_df = pd.DataFrame(log_records)
log_df


2025-09-20 02:21:30,553 [INFO] NumExpr defaulting to 8 threads.
2025-09-20 02:21:34,597 [INFO] Fallback agent invoked.


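| | input | status | result | error | retries | elapsed_s |
|---|---|---|---|---|---|---|
| 0 | 0 | success | 0 | | 0 | 1.3e-05 |
| 1 | 1 | success | 2 | | 0 | 2e-06 |
| 2 | 2 | success | 4 | | 0 | 2e-06 |
| 3 | 3 | fallback | -1 | RuntimeError | 2 | 3.759263 |
| 4 | 4 | success | 8 | | 0 | 4.8e-05 |
| 5 | 5 | success | 10 | | 1 | 1.505483 |
| 6 | 6 | success | 12 | | 0 | 1.4e-05 |
| 7 | 7 | success | 14 | | 0 | 1.1e-05 |
| 8 | 8 | success | 16 | | 0 | 9e-06 |
| 9 | 9 | success | 18 | | 0 | 1.1e-05 |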


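The `call_with_logging` function does not call `robust_agent`. Instead, it implements its own retry-then-fallback loop around `timed_baseline` (up to three attempts, sleeping `1.5 ** tries` seconds between them) so that it can record the number of retries, the total elapsed time, and whether a fallback occurred. We run it on inputs 0–9 and store the results in a DataFrame.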

#### Step 10: Compute metrics from the logs

In [11]:
# Compute success rate and average number of retries
success_rate = (log_df['status'] == 'success').mean()
average_retries = log_df['retries'].mean()
average_elapsed = log_df['elapsed_s'].mean()

print(f'Success rate: {success_rate:.2f}')
print(f'Average retries: {average_retries:.2f}')
print(f'Average elapsed time (s): {average_elapsed:.2f}')

# Breakdown by error type
error_counts = log_df['error'].value_counts(dropna=False)
print('Error types and counts:')
print(error_counts)


Success rate: 0.90
Average retries: 0.30
Average elapsed time (s): 0.53
Error types and counts:
error
None            9
RuntimeError    1
Name: count, dtype: int64


We calculate the overall success rate, average number of retries, and average elapsed time. We also list the types of errors encountered during the runs. These metrics help identify where improvements are needed.
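
Since `pandas` is listed as optional, the same metrics can also be computed with the standard library alone. The sketch below assumes a list of record dictionaries with the keys returned by `call_with_logging`; the four records shown are hypothetical sample data, not results from the run above, and `summarize` is a helper name introduced for this illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical records with the same keys that call_with_logging returns.
records = [
    {"input": 0, "status": "success", "result": 0, "error": None,
     "retries": 0, "elapsed_s": 0.0},
    {"input": 1, "status": "success", "result": 2, "error": None,
     "retries": 1, "elapsed_s": 1.5},
    {"input": 2, "status": "fallback", "result": -1, "error": "RuntimeError",
     "retries": 2, "elapsed_s": 3.75},
    {"input": 3, "status": "success", "result": 6, "error": None,
     "retries": 0, "elapsed_s": 0.0},
]


def summarize(rows):
    """Compute the Step 10 metrics from a list of record dictionaries."""
    return {
        "success_rate": sum(r["status"] == "success" for r in rows) / len(rows),
        "average_retries": mean([r["retries"] for r in rows]),
        "average_elapsed_s": mean([r["elapsed_s"] for r in rows]),
        "errors": Counter(r["error"] for r in rows),
    }


summary = summarize(records)
print(f"Success rate: {summary['success_rate']:.2f}")
print(f"Average retries: {summary['average_retries']:.2f}")
print(f"Average elapsed time (s): {summary['average_elapsed_s']:.2f}")
print(f"Error types and counts: {dict(summary['errors'])}")
```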

#### Step 11: Explore the log data

In [12]:
# Display the first few rows of the log DataFrame
log_df.head()


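| | input | status | result | error | retries | elapsed_s |
|---|---|---|---|---|---|---|
| 0 | 0 | success | 0 | | 0 | 1.3e-05 |
| 1 | 1 | success | 2 | | 0 | 2e-06 |
| 2 | 2 | success | 4 | | 0 | 2e-06 |
| 3 | 3 | fallback | -1 | RuntimeError | 2 | 3.759263 |
| 4 | 4 | success | 8 | | 0 | 4.8e-05 |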


The DataFrame preview shows a detailed record of each run. You could further visualize the distribution of retries or elapsed times using plots (e.g., histograms).
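
As a dependency-free stand-in for a plotted histogram, the sketch below prints a text bar per retry count. The `retries` list is copied from the `retries` column of the table in Step 9; `text_histogram` is a hypothetical helper, and a plotting library would be the more usual choice for real analysis.

```python
from collections import Counter

# Retry counts taken from the `retries` column of the table in Step 9.
retries = [0, 0, 0, 2, 0, 1, 0, 0, 0, 0]


def text_histogram(values):
    """Return one line per distinct value, with a bar as long as its count."""
    counts = Counter(values)
    return [
        f"{value} retries | {'#' * counts[value]} ({counts[value]})"
        for value in sorted(counts)
    ]


for line in text_histogram(retries):
    print(line)
```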

## Wrap-Up

In this enterprise-level exercise, we went beyond basic error handling:

- **Structured Logging:** We configured a logger to include timestamps and severity levels.
- **Retry & Backoff:** Implemented a reusable `retry` decorator with exponential backoff.
- **Timeouts:** Added a timeout decorator to prevent hanging calls.
- **Fallback Mechanism:** Provided a safe fallback when all retries fail.
- **Robust Composition:** Combined these elements to create a `robust_agent`.
- **Detailed Monitoring:** Collected detailed logs across runs, including retries and elapsed time.
- **Metrics Analysis:** Computed success rates and other metrics to assess reliability.

### Learning Outcomes
- You can now design agents that handle failures gracefully and provide structured logs for monitoring.
- You learned how to use decorators to add cross-cutting concerns (retries and timeouts).
- You gained experience analyzing logs to derive operational insights.

### Next Steps
- Integrate these patterns with real tool APIs and external services.
- Export logs and metrics to centralized monitoring systems (e.g., Elasticsearch for logs, Prometheus for metrics).
- Implement alerting based on failure rates or latency thresholds.


### Quick Install

In [13]:
# Install pandas if needed for DataFrame analysis
# Uncomment the following line if pandas is not installed
# !pip install pandas
print('For log analysis, pandas is required. If missing, install it via pip.')


For log analysis, pandas is required. If missing, install it via pip.
