# Notebook 9 (Industrial Edition): Redundant Execution for Fault Tolerance

## Introduction: Building Resilient and Predictable AI Systems

This notebook explores **Redundant Execution**, a pattern for making AI systems more reliable and their latency more predictable. The concept is straightforward: for a critical but potentially unreliable step, run two or more identical agents in parallel, use the result from the first one that finishes successfully, and discard (or, where possible, cancel) the rest. The technique mitigates the risk of unreliable dependencies such as external APIs, and even the stochastic behavior of LLMs themselves.

### Why is this essential for production systems?

Production systems cannot afford to fail. However, agents often rely on external services (APIs, databases) that can be slow or fail intermittently. A single failure can cascade and bring down an entire workflow. Redundant execution provides a powerful defense:

1.  **Fault Tolerance:** If one agent fails (e.g., due to a network error), the other can still succeed, ensuring the overall process continues. This drastically increases the system's uptime and success rate.
2.  **Latency Consistency:** Network calls often suffer from "long-tail latency," where most calls are fast but a small percentage are extremely slow. Redundant execution protects against this by ensuring the process completes as fast as the *fastest* parallel call, not the slowest.
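
The fault-tolerance claim can be put in numbers. As a rough sketch, assume every replica fails independently with the same probability `p` (an idealization: replicas that share one flaky dependency tend to fail together). Then all `n` replicas fail with probability `p ** n`. The function name below is illustrative, not part of any library.

```python
def redundant_success_rate(p_fail: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent attempts succeeds."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All replicas must fail for the redundant step to fail.
    return 1.0 - p_fail ** replicas


for n in (1, 2, 3):
    print(f"{n} replica(s): {redundant_success_rate(0.20, n):.1%} success")
```

With a 20% per-call failure rate, the independence assumption gives 80% success for one replica, 96% for two, and 99.2% for three. Each extra replica also multiplies cost, so two is a common starting point.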

### Role in a Large-Scale System: Guaranteeing Mission-Critical System Reliability & Uptime

This is a core infrastructure pattern for building robust, mission-critical systems:
- **Financial Trading:** Ensuring an order is executed even if one API endpoint is down.
- **Real-time Bidding:** Getting an ad bid back within a strict time limit, even if one model server is slow.
- **Critical Customer Support:** Guaranteeing that a crucial data lookup for a frustrated customer succeeds quickly.

We will build a simple agent that relies on an unreliable, simulated tool. We will then run it with and without redundant execution and compare the two on speed and success rate.

## Part 1: Setup and Environment

We'll install our standard libraries. No external tool APIs are needed as we will simulate unreliability to have full control over the experiment.

In [None]:
%pip install -U langchain langgraph langsmith langchain-huggingface transformers accelerate bitsandbytes torch

### 1.2: API Keys and Environment Configuration

We will need our LangSmith and Hugging Face keys for tracing and model access.

In [None]:
import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("LANGCHAIN_API_KEY")
_set_env("HUGGING_FACE_HUB_TOKEN")

# Configure LangSmith for tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Industrial - Redundant Execution"

## Part 2: Components for the Resilient Agent

We need to define our LLM and a special tool that is *intentionally* unreliable to simulate real-world conditions.

### 2.1: The Language Model (LLM)

We will use `meta-llama/Meta-Llama-3-8B-Instruct` as the brain for our identical agents.

In [None]:
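from langchain_huggingface import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# 4-bit quantization is configured through BitsAndBytesConfig; passing
# load_in_4bit=True straight to from_pretrained is deprecated.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto", quantization_config=quant_config)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512, do_sample=False)
llm = HuggingFacePipeline(pipeline=pipe)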

print("LLM Initialized. Ready to power our resilient agents.")

LLM Initialized. Ready to power our resilient agents.


### 2.2: The Unreliable Tool

This is the core of our simulation. We'll create a tool that:
- Has a **20% chance of failing** outright (raising an exception).
- Has a **10% chance of being very slow** (a long-tail latency event).
- Is otherwise fast.

In [None]:
from langchain_core.tools import tool
import time
import random

@tool
def get_critical_data(query: str) -> str:
    """Fetches critical data from an external service that can be slow or fail."""
    instance_id = random.randint(1000, 9999)
    print(f"--- [Tool Instance {instance_id}] Attempting to fetch data for query: '{query}' ---")
    
    # Simulate unreliability
    roll = random.random()
    if roll < 0.20: # 20% failure chance
        print(f"--- [Tool Instance {instance_id}] FAILED: Network connection error. ---")
        raise ConnectionError("Failed to connect to the external service.")
    elif roll < 0.30: # 10% long-tail latency chance
        slow_duration = random.uniform(5, 7)
        print(f"--- [Tool Instance {instance_id}] SLOW: Experiencing high latency. Will take {slow_duration:.2f}s. ---")
        time.sleep(slow_duration)
    else: # 70% normal, fast execution
        fast_duration = random.uniform(0.5, 1.0)
        print(f"--- [Tool Instance {instance_id}] FAST: Executing normally. Will take {fast_duration:.2f}s. ---")
        time.sleep(fast_duration)
    
    result = f"Data for '{query}' successfully retrieved by instance {instance_id}."
    print(f"--- [Tool Instance {instance_id}] SUCCESS: {result} ---")
    return result
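
Before involving an LLM, it can help to see what these odds imply for one call versus two redundant calls. The sketch below samples the same 20% / 10% / 70% split without sleeping. `sample_call` and `simulate` are hypothetical helpers written for this illustration. They assume independent replicas and ignore LLM overhead, so treat the numbers as a rough guide rather than a prediction of the agent's wall-clock time.

```python
import random
from typing import Optional, Tuple


def sample_call(rng: random.Random) -> Optional[float]:
    """Draw one simulated tool call: latency in seconds, or None on failure."""
    roll = rng.random()
    if roll < 0.20:
        return None                   # 20%: connection error
    if roll < 0.30:
        return rng.uniform(5, 7)      # 10%: long-tail latency
    return rng.uniform(0.5, 1.0)      # 70%: normal, fast call


def simulate(replicas: int, trials: int = 50_000, seed: int = 0) -> Tuple[float, float]:
    """Return (success rate, share of successful runs that took 5s or more)."""
    rng = random.Random(seed)
    successes = 0
    slow = 0
    for _ in range(trials):
        draws = [sample_call(rng) for _ in range(replicas)]
        latencies = [t for t in draws if t is not None]
        if latencies:
            successes += 1
            # First-success semantics: the run is as fast as its fastest success.
            if min(latencies) >= 5:
                slow += 1
    return successes / trials, (slow / successes if successes else 0.0)


for n in (1, 2):
    rate, slow_share = simulate(n)
    print(f"{n} replica(s): success={rate:.3f}, slow share of successes={slow_share:.3f}")
```

Analytically, one replica succeeds about 80% of the time and 12.5% of those successes are slow. Two replicas succeed about 96% of the time, and a run is slow only when neither replica is fast, which is roughly 5% of successful runs.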

## Part 3: The Baseline - A Simple, Unreliable Agent

First, let's build and run a standard agent without any fault tolerance. We expect its performance and success rate to be inconsistent.

In [None]:
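from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from langchain_huggingface import ChatHuggingFace

simple_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a reliable agent. Your job is to use the provided tool to get critical data based on the user's request."),
    ("human", "{input}"),
    # Required by create_tool_calling_agent: holds intermediate tool calls and their results
    ("placeholder", "{agent_scratchpad}"),
])

# create_tool_calling_agent needs a chat model that exposes .bind_tools(); the raw
# HuggingFacePipeline does not, so we wrap it. Whether tool calls are actually emitted
# and parsed depends on the model's chat template and the serving backend.
chat_llm = ChatHuggingFace(llm=llm)

simple_agent = create_tool_calling_agent(chat_llm, [get_critical_data], simple_prompt)
simple_executor = AgentExecutor(agent=simple_agent, tools=[get_critical_data])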

In [None]:
simple_results = []
num_runs = 5
for i in range(num_runs):
    print(f"--- Running Simple Agent (Attempt {i+1}/{num_runs}) ---")
    start_time = time.time()
    try:
        result = simple_executor.invoke({"input": "Please fetch the user's profile"})
        end_time = time.time()
        simple_results.append(("SUCCESS", end_time - start_time, result))
        print(f"SUCCESS in {end_time - start_time:.2f}s. Result: {result}\n")
    except Exception as e:
        end_time = time.time()
        simple_results.append(("FAILURE", end_time - start_time, str(e)))
        print(f"FAILURE in {end_time - start_time:.2f}s. Reason: {e}\n")

--- Running Simple Agent (Attempt 1/5) ---
--- [Tool Instance 8342] Attempting to fetch data for query: 'user_profile' ---
--- [Tool Instance 8342] FAST: Executing normally. Will take 0.82s. ---
--- [Tool Instance 8342] SUCCESS: Data for 'user_profile' successfully retrieved by instance 8342. ---
SUCCESS in 6.21s. Result: {'output': "Data for 'user_profile' successfully retrieved by instance 8342."}

--- Running Simple Agent (Attempt 2/5) ---
--- [Tool Instance 1573] Attempting to fetch data for query: 'user_profile' ---
--- [Tool Instance 1573] FAILED: Network connection error. ---
FAILURE in 5.34s. Reason: ConnectionError('Failed to connect to the external service.')

--- Running Simple Agent (Attempt 3/5) ---
--- [Tool Instance 9123] Attempting to fetch data for query: 'user_profile' ---
--- [Tool Instance 9123] SLOW: Experiencing high latency. Will take 6.78s. ---
--- [Tool Instance 9123] SUCCESS: Data for 'user_profile' successfully retrieved by instance 9123. ---
SUCCESS in 11.99
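
To compare runs by more than eyeballing the logs, the `(status, duration, payload)` tuples collected in `simple_results` can be reduced to a few summary numbers. The helper below is a minimal sketch. `summarize` is a hypothetical name, and the sample list is made up purely to show the input shape; it is not output from this notebook.

```python
from statistics import mean
from typing import Any, Dict, List, Tuple

Run = Tuple[str, float, Any]  # (status, duration in seconds, payload)


def summarize(runs: List[Run]) -> Dict[str, Any]:
    """Summarize success rate and latency of successful runs."""
    ok_durations = [duration for status, duration, _ in runs if status == "SUCCESS"]
    return {
        "runs": len(runs),
        "success_rate": len(ok_durations) / len(runs) if runs else 0.0,
        "mean_success_latency_s": mean(ok_durations) if ok_durations else None,
        "max_success_latency_s": max(ok_durations) if ok_durations else None,
    }


# Made-up sample data, only to illustrate the expected tuple layout.
sample_runs = [("SUCCESS", 1.0, "payload-a"), ("FAILURE", 0.5, "error"), ("SUCCESS", 3.0, "payload-b")]
print(summarize(sample_runs))
```

With only five runs per configuration the estimates are noisy; a larger `num_runs` gives a fairer comparison.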

## Part 4: Building the Redundant Execution Graph

Now, let's build the fault-tolerant version. The key is to use a `ThreadPoolExecutor` to launch two identical agent executions and `as_completed` to take the result of whichever one finishes successfully first. If the first to finish raised an exception, we fall back to the other.

### 4.1: Defining the Graph State and Node
The graph is very simple. It has one state to hold the final result and one node that orchestrates the redundant execution.

In [None]:
from typing import TypedDict, Optional, Any
from concurrent.futures import ThreadPoolExecutor, as_completed

class RedundantState(TypedDict):
    input: str
    result: Optional[Any]
    error: Optional[str]
    performance_log: Optional[str]

def redundant_executor_node(state: RedundantState):
    """Executes two identical agents in parallel and returns the first successful result."""
    print("--- [Redundant Executor] Starting 2 agents in parallel... ---")
    start_time = time.time()

    # Deliberately not a `with` block: leaving one calls shutdown(wait=True), which
    # would block until the slower agent finished and erase the latency benefit.
    executor = ThreadPoolExecutor(max_workers=2)
    # Submit two identical tasks
    futures = [executor.submit(simple_executor.invoke, {"input": state['input']}) for _ in range(2)]

    first_result = None
    try:
        for future in as_completed(futures):
            try:
                # result() re-raises the task's exception, so reaching the next line means success
                first_result = future.result()
                print("--- [Redundant Executor] A task finished successfully. Cancelling others. ---")
                break
            except Exception as e:
                # If one fails, we just wait for the next one to complete.
                print(f"--- [Redundant Executor] A task failed with error: {e}. Waiting for the other. ---")
    finally:
        # Return without waiting. cancel_futures (Python 3.9+) only drops tasks that have
        # not started yet; a call that is already running finishes in the background.
        executor.shutdown(wait=False, cancel_futures=True)

    execution_time = time.time() - start_time
    log = f"Redundant execution completed in {execution_time:.2f}s."
    print(f"--- [Redundant Executor] {log} ---")

    if first_result is not None:
        return {"result": first_result, "performance_log": log, "error": None}
    else:
        return {"result": None, "performance_log": log, "error": "Both redundant executions failed."}
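
One detail of `concurrent.futures` matters for the latency claim: leaving a `with ThreadPoolExecutor(...)` block calls `shutdown(wait=True)`, so a `break` inside the block still waits for every task that is already running. The sketch below times both styles, using a hypothetical `work` function as a stand-in for an agent call. It assumes Python 3.9+ for `cancel_futures`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(seconds: float) -> float:
    """Stand-in for an agent call that takes `seconds` to finish."""
    time.sleep(seconds)
    return seconds


def first_with_context_manager(delays):
    """Take the first result, but leave the pool through a `with` block."""
    start = time.perf_counter()
    winner = None
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(work, d) for d in delays]
        for future in as_completed(futures):
            winner = future.result()
            break
    # The `with` exit above blocked until the slowest task was done.
    return winner, time.perf_counter() - start


def first_without_waiting(delays):
    """Take the first result and return without waiting for the stragglers."""
    start = time.perf_counter()
    winner = None
    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(work, d) for d in delays]
    try:
        for future in as_completed(futures):
            winner = future.result()
            break
    finally:
        # Drops queued tasks; tasks already running keep going in the background.
        pool.shutdown(wait=False, cancel_futures=True)
    return winner, time.perf_counter() - start


print(first_with_context_manager([0.1, 0.8]))
print(first_without_waiting([0.1, 0.8]))
```

The first call returns only after the slower task has finished, while the second returns shortly after the faster one. Python threads cannot be interrupted from outside, so the straggler still consumes resources until it completes. Real cancellation needs cooperative timeouts inside the task or an async client that supports it.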

### 4.2: Assembling and Running the Resilient Graph

In [None]:
from langgraph.graph import StateGraph, END

workflow = StateGraph(RedundantState)
workflow.add_node("redundant_executor", redundant_executor_node)
workflow.set_entry_point("redundant_executor")
workflow.add_edge("redundant_executor", END)
app = workflow.compile()

# Run the resilient graph multiple times
redundant_results = []
for i in range(num_runs):
    print(f"--- Running Redundant Agent (Attempt {i+1}/{num_runs}) ---")
    start_time = time.time()
    result = app.invoke({"input": "Please fetch the user's profile"})
    end_time = time.time()
    if result['error']:
        redundant_results.append(("FAILURE", end_time - start_time, result['error']))
        print(f"FAILURE in {end_time - start_time:.2f}s.\n")
    else:
        redundant_results.append(("SUCCESS", end_time - start_time, result['result']))
        print(f"SUCCESS in {end_time - start_time:.2f}s.\n")

--- Running Redundant Agent (Attempt 1/5) ---
--- [Redundant Executor] Starting 2 agents in parallel... ---
--- [Tool Instance 3451] Attempting to fetch data for query: 'user_profile' ---
--- [Tool Instance 6789] Attempting to fetch data for query: 'user_profile' ---
--- [Tool Instance 3451] FAST: Executing normally. Will take 0.75s. ---
--- [Tool Instance 6789] FAST: Executing normally. Will take 0.91s. ---
--- [Tool Instance 3451] SUCCESS: Data for 'user_profile' successfully retrieved by instance 3451. ---
--- [Redundant Executor] A task finished successfully. Cancelling others. ---
--- [Redundant Executor] Redundant execution completed in 6.15s. ---
SUCCESS in 6.15s.

--- Running Redundant Agent (Attempt 2/5) ---
--- [Redundant Executor] Starting 2 agents in parallel... ---
--- [Tool Instance 2345] Attempting to fetch data for query: 'user_profile' ---
--- [Tool Instance 5432] Attempting to fetch data for query: 'user_profile' ---
--- [Tool Instance 2345] FAST: Executing normally. 