# Notebook 4 (Industrial Edition): Speculative Execution & Pre-fetching

## Introduction: The Art of Anticipation for a Faster AI

This notebook explores a sophisticated parallelism pattern designed to dramatically reduce *perceived latency* in interactive AI systems: **Speculative Execution & Pre-fetching**. The principle is to anticipate the agent's most likely next action—usually a slow, data-gathering tool call—and begin executing it *in parallel* with the agent's primary reasoning process.

### Why is this pattern so impactful?

In many agentic workflows, the sequence is: `User Input -> Agent Thinks (LLM call) -> Agent Acts (Tool call)`. The user waits during both the thinking and acting phases. Speculative execution overlaps these two phases. While the agent is thinking, the system makes an educated guess about the upcoming action and starts it. If the guess is correct and the tool finishes before the LLM does, the tool call's latency is hidden entirely behind the LLM's inference time; if the tool is slower, the user waits only for the remainder. If the guess is wrong, the speculative work is wasted and the agent falls back to the normal sequential path.
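
The overlap can be put into numbers with a small latency model. This is only a sketch: the function name `perceived_latency` and the `hit_rate` parameter are illustrative, and it assumes a wrong guess costs the full sequential path (LLM call followed by a tool call of the same duration).

```python
def perceived_latency(llm_s: float, tool_s: float, hit_rate: float) -> float:
    """Expected wait for the 'think, then act' phase with speculation enabled.

    Assumption: on a miss, the tool the agent really wanted takes tool_s
    seconds and runs only after the LLM call has finished.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hit = max(llm_s, tool_s)   # correct guess: both ran side by side
    miss = llm_s + tool_s      # wrong guess: plain sequential execution
    return hit_rate * hit + (1.0 - hit_rate) * miss


sequential = perceived_latency(4.0, 3.0, hit_rate=0.0)
always_hit = perceived_latency(4.0, 3.0, hit_rate=1.0)
print(f"sequential: {sequential:.1f}s, speculative (all hits): {always_hit:.1f}s")
# prints: sequential: 7.0s, speculative (all hits): 4.0s
```

Under this model the benefit scales with how often the guess is right, which is why the pattern suits workflows where the next action is highly predictable.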

### Role in a Large-Scale System: Creating Proactive & Hyper-Responsive User Experiences

This is a key architectural pattern for any high-throughput, user-facing system where responsiveness is a primary feature. It's the difference between an AI that feels reactive and one that feels proactive and intelligent.
- **Customer Support Chatbots:** Pre-fetching user account details and recent orders the moment a chat begins.
- **Data Analysis Tools:** Speculatively running a common default query on a dashboard as soon as it's loaded.
- **Code Assistants:** Pre-fetching relevant documentation for a function as the developer is typing its name.

We will build a customer support agent that speculatively fetches a user's order history, demonstrating how this pattern can eliminate tool-call latency from the user's perspective.

## Part 1: Setup and Environment

We'll install our standard libraries. For this notebook, we don't need any external tool APIs, as we will simulate a slow database lookup to precisely control and measure the latency.

In [None]:
%pip install -U langchain langgraph langsmith langchain-huggingface transformers accelerate bitsandbytes torch

### 1.2: API Keys and Environment Configuration

We will need our LangSmith and Hugging Face keys for tracing and model access.

In [None]:
import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("LANGCHAIN_API_KEY")
_set_env("HUGGING_FACE_HUB_TOKEN")

# Configure LangSmith for tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Industrial - Speculative Execution"

## Part 2: Core Components for the Support Agent

Our system will consist of an LLM, a standard tool for looking up order history, and a special pre-fetching mechanism.

### 2.1: The Language Model (LLM)

We will use `meta-llama/Meta-Llama-3-8B-Instruct` as our agent's brain, configured for tool calling.

In [None]:
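from langchain_huggingface import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# 4-bit quantization settings are passed through BitsAndBytesConfig;
# the bare load_in_4bit=True keyword is deprecated in recent transformers releases.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)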

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    do_sample=False
)

llm = HuggingFacePipeline(pipeline=pipe)

print("LLM Initialized. Ready to power our proactive support agent.")

LLM Initialized. Ready to power our proactive support agent.


### 2.2: The Simulated Slow Tool

To demonstrate the pattern, we need a tool that is realistically slow. We'll create a mock database lookup tool that takes a fixed amount of time to run. This allows us to precisely measure the impact of our speculative execution.

In [None]:
from langchain_core.tools import tool
import time
import json

DATABASE_LATENCY_SECONDS = 3

@tool
def get_order_history(user_id: str) -> str:
    """Fetches the order history for a given user from the database. A slow operation."""
    print(f"--- [DATABASE] Starting query for user_id: {user_id}. This will take {DATABASE_LATENCY_SECONDS} seconds. ---")
    time.sleep(DATABASE_LATENCY_SECONDS)
    
    # Mock data for demonstration
    mock_db = {
        "user123": [
            {"order_id": "A123", "item": "QuantumLeap AI Processor", "status": "Shipped"},
            {"order_id": "B456", "item": "Smart Coffee Mug", "status": "Delivered"}
        ]
    }
    result = mock_db.get(user_id, [])
    print(f"--- [DATABASE] Query finished for user_id: {user_id}. ---")
    return json.dumps(result)

### 2.3: Binding the Tool to the LLM

We make the LLM aware of the tool it can use, as per standard agentic procedure.

In [None]:
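from langchain_huggingface import ChatHuggingFace

tools = [get_order_history]
# HuggingFacePipeline has no bind_tools; the ChatHuggingFace wrapper provides it.
llm_with_tools = ChatHuggingFace(llm=llm).bind_tools(tools)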

## Part 3: Building the Speculative Execution Graph

This is where the architecture becomes unique. The graph's entry point will be a special node that kicks off two processes in parallel: the LLM's reasoning and the speculative tool call.

### 3.1: Defining the Graph State

The state needs to track the messages, the user ID, and a special field to hold the result of our pre-fetched data. We'll also include our performance log.

In [None]:
from typing import TypedDict, Annotated, List, Optional
from langchain_core.messages import BaseMessage
import operator
from concurrent.futures import Future

class GraphState(TypedDict):
    messages: Annotated[List[BaseMessage], operator.add]
    user_id: str
    # This will hold the result of the speculative tool call, if it runs.
    # We use a Future object to handle the asynchronous result.
    # Note: a Future cannot be pickled, so this state is not suitable for a checkpointer as-is.
    prefetched_data: Optional[Future]
    # This will hold the actual tool call decided by the LLM
    agent_decision: Optional[BaseMessage]
    performance_log: Annotated[List[str], operator.add]
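
Because a `Future` cannot be pickled, one alternative is to keep the futures in a process-local registry and store only a string key in the graph state. The following is a minimal sketch of that idea, not part of the notebook's graph; `start_prefetch`, `take_prefetch`, and `_PREFETCH_REGISTRY` are hypothetical names.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional

# Hypothetical process-local store: the state keeps a string key, not the Future
_executor = ThreadPoolExecutor(max_workers=2)
_PREFETCH_REGISTRY: Dict[str, Future] = {}


def start_prefetch(fn: Callable[..., Any], *args: Any) -> str:
    """Submit fn(*args) to the pool and return a serializable key for it."""
    key = str(uuid.uuid4())
    _PREFETCH_REGISTRY[key] = _executor.submit(fn, *args)
    return key


def take_prefetch(key: str) -> Optional[Future]:
    """Remove and return the Future for key, or None if it is unknown."""
    return _PREFETCH_REGISTRY.pop(key, None)


# Usage: the key is a plain string and can live in any serializable state
prefetch_key = start_prefetch(lambda user_id: f"orders for {user_id}", "user123")
future = take_prefetch(prefetch_key)
print(future.result(timeout=5))
```

The trade-off is that the registry lives in one process, so a graph resumed on another worker would find no entry for the key and would have to run the tool normally.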

### 3.2: Defining the Graph Nodes

Our graph will have several specialized nodes:

1.  **`entry_point`**: This is the key node. It starts the speculative `get_order_history` call in a background thread and, at the same time, calls the main LLM agent.
2.  **`tool_executor_node`**: This node is responsible for executing the tool call that the agent *actually* decided on. It has special logic: if the agent wants to call `get_order_history`, this node will first check if the data has been pre-fetched. If so, it gets the result instantly. If not (or for any other tool), it executes the call normally.
3.  **`final_answer_node`**: A simple node that calls the LLM one last time to synthesize the final answer for the user.

In [None]:
from concurrent.futures import ThreadPoolExecutor
import uuid

thread_pool = ThreadPoolExecutor(max_workers=5)

# Node 1: Entry Point (Speculation and Agent Call)
def entry_point(state: GraphState):
    """Starts the speculative pre-fetch and the main agent reasoning in parallel."""
    print("--- [ORCHESTRATOR] Entry point started. --- ")
    start_time = time.time()
    
    # 1. Start the speculative pre-fetch in a background thread
    print("--- [ORCHESTRATOR] Starting speculative pre-fetch of order history... ---")
    prefetched_data_future = thread_pool.submit(get_order_history.invoke, {"user_id": state['user_id']})
    
    # 2. In parallel, start the main agent reasoning process
    print("--- [ORCHESTRATOR] Starting main agent LLM call... ---")
    agent_response = llm_with_tools.invoke(state['messages'])
    
    execution_time = time.time() - start_time
    log_entry = f"[Orchestrator] LLM reasoning completed in {execution_time:.2f}s."
    print(log_entry)
    
    return {
        "prefetched_data": prefetched_data_future,
        "agent_decision": agent_response,
        "performance_log": [log_entry]
    }
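
The overlap in `entry_point` relies on `submit` returning immediately, so the main thread can carry on while the worker thread waits on the slow call. The stand-alone sketch below shows the same effect with `time.sleep` standing in for both the database lookup and the LLM call; the function names and the short delays are illustrative assumptions, not the notebook's real workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(delay_s: float) -> str:
    """Stand-in for the database tool."""
    time.sleep(delay_s)
    return "orders"


def slow_reasoning(delay_s: float) -> str:
    """Stand-in for the LLM call."""
    time.sleep(delay_s)
    return "call get_order_history"


def run_sequential(tool_s: float, llm_s: float) -> float:
    start = time.perf_counter()
    slow_reasoning(llm_s)
    slow_lookup(tool_s)
    return time.perf_counter() - start


def run_overlapped(tool_s: float, llm_s: float) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_lookup, tool_s)  # returns immediately
        slow_reasoning(llm_s)                      # runs on the main thread meanwhile
        future.result()                            # already done if tool_s <= llm_s
    return time.perf_counter() - start


print(f"sequential: {run_sequential(0.15, 0.2):.2f}s")
print(f"overlapped: {run_overlapped(0.15, 0.2):.2f}s")
```

Threads help here because sleeping and I/O-bound waits release the interpreter lock; two CPU-bound pure-Python tasks would not overlap this way.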

In [None]:
from langchain_core.messages import ToolMessage

# Node 2: Tool Executor (with pre-fetch checking)
def tool_executor_node(state: GraphState):
    """Executes the agent's chosen tool, leveraging pre-fetched data if available."""
    print("--- [TOOL EXECUTOR] Node started. --- ")
    start_time = time.time()
    
    agent_decision = state['agent_decision']
    tool_call = agent_decision.tool_calls[0]
    tool_name = tool_call['name']
    tool_args = tool_call['args']
    
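    # Check if the desired tool call matches our speculation.
    # Note: the pre-fetch used state['user_id']; the arguments supplied by the
    # LLM (tool_args) are not compared against it in this demo.
    if tool_name == "get_order_history":
        print("--- [TOOL EXECUTOR] Agent wants order history. Checking pre-fetch... ---")
        prefetched_future = state['prefetched_data']
        was_ready = prefetched_future.done()
        # result() returns immediately if the pre-fetch has finished;
        # otherwise it blocks until the background call completes.
        tool_result = prefetched_future.result()
        if was_ready:
            print("--- [TOOL EXECUTOR] Pre-fetch successful! Using cached data instantly. ---")
        else:
            print("--- [TOOL EXECUTOR] Pre-fetch was still running; waited for it to finish. ---")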
    else:
        # If the agent wants a different tool, we would execute it normally here
        print(f"--- [TOOL EXECUTOR] Agent wants a different tool ({tool_name}). Executing normally. ---")
        # For this demo, we'll assume only get_order_history exists
        tool_result = "Tool not implemented for this demo."
    
    tool_message = ToolMessage(content=tool_result, tool_call_id=tool_call['id'])
    
    execution_time = time.time() - start_time
    # Note: This time represents how long it took to get the result from this node's perspective
    log_entry = f"[ToolExecutor] Resolved tool call in {execution_time:.2f}s."
    print(log_entry)
    
    return {
        "messages": [agent_decision, tool_message],
        "performance_log": [log_entry]
    }
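
The node above covers the happy path only. Two further cases are worth handling in a real system: the speculative call may fail, and the agent may not want the pre-fetched data at all. The sketch below shows one way to deal with both using only `concurrent.futures`; `resolve_prefetch` and `discard_speculation` are hypothetical helpers, not part of the graph defined in this notebook.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable


def resolve_prefetch(future: Future, fallback: Callable[[], Any]) -> Any:
    """Return the speculative result, or run fallback if the pre-fetch raised."""
    try:
        return future.result()
    except Exception:
        # The speculative call failed; execute the tool call normally instead
        return fallback()


def discard_speculation(future: Future) -> bool:
    """Called on a miss. Returns True if the call was cancelled before it started.

    A call that is already running cannot be interrupted; cancel() then
    returns False and the work simply finishes unused.
    """
    return future.cancel()


# Usage with a single worker, so the second submission stays queued
pool = ThreadPoolExecutor(max_workers=1)
busy = pool.submit(time.sleep, 0.2)
queued = pool.submit(lambda: "never needed")
print(discard_speculation(queued))
pool.shutdown(wait=True)
```

Cancelling only helps for calls still waiting in the queue, so a miss on an already running pre-fetch still costs the backend one unnecessary query.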

In [None]:
from langchain_core.messages import AIMessage

# Node 3: Final Answer Synthesizer
def final_answer_node(state: GraphState):
    """Generates the final response to the user."""
    print("--- [SYNTHESIZER] Generating final answer... ---")
    start_time = time.time()
    
    # Only the message history is sent to the LLM; the Future in
    # prefetched_data stays in the graph state and never reaches the model.
    final_response = llm.invoke(state['messages'])
    # HuggingFacePipeline.invoke returns a plain string, so wrap it to keep
    # the message history a list of message objects.
    if isinstance(final_response, str):
        final_response = AIMessage(content=final_response)
    
    execution_time = time.time() - start_time
    log_entry = f"[Synthesizer] Final LLM call took {execution_time:.2f}s."
    print(log_entry)
    
    return {
        "messages": [final_response],
        "performance_log": [log_entry]
    }

### 3.3: Defining Graph Edges and Assembling the Graph

The routing is relatively simple: after the entry point, we check if the agent decided to call a tool. If so, we go to the tool executor; otherwise, the graph ends (in this demo the agent is expected to call a tool, and on the no-tool path the speculative pre-fetch is simply left unused). After the tool executor, we always go to the final answer node.

In [None]:
from langgraph.graph import StateGraph, END

def should_call_tool(state: GraphState) -> str:
    if state['agent_decision'].tool_calls:
        return "execute_tool"
    return END # Or route to a final answer node if no tool is needed

# Define the graph
workflow = StateGraph(GraphState)
workflow.add_node("entry_point", entry_point)
workflow.add_node("execute_tool", tool_executor_node)
workflow.add_node("final_answer", final_answer_node)

# Build the graph
workflow.set_entry_point("entry_point")
workflow.add_conditional_edges("entry_point", should_call_tool)
workflow.add_edge("execute_tool", "final_answer")
workflow.add_edge("final_answer", END)

app = workflow.compile()

print("Graph constructed and compiled successfully.")
print("The proactive support agent is ready.")

Graph constructed and compiled successfully.
The proactive support agent is ready.


### 3.4: Visualizing the Graph

**Diagram Description:** The `__start__` node points to `entry_point`. From `entry_point`, a conditional edge either goes to `execute_tool` or `__end__`. The `execute_tool` node has a single edge to `final_answer`, which then points to `__end__`.

In [None]:
# from IPython.display import Image
# Image(app.get_graph().draw_png())

## Part 4: Running and Analyzing the Speculative Workflow

Now, we'll run the graph and pay close attention to the timing logs. We expect the orchestrator's LLM call time to be the main determinant of the initial latency, with the 3-second database latency being hidden behind it.

In [None]:
from langchain_core.messages import HumanMessage
import json

inputs = {
    "messages": [HumanMessage(content="Hi, can you tell me the status of my recent orders?")],
    "user_id": "user123"
}

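step_counter = 1
final_state = None

# "updates" yields {node_name: update} after each node; "values" yields the full state.
# With a list of modes, each streamed item is a (mode, chunk) tuple.
for mode, chunk in app.stream(inputs, stream_mode=["updates", "values"]):
    if mode == "values":
        final_state = chunk  # the last snapshot is the final state
        continue
    node_name = list(chunk.keys())[0]
    print(f"\n{'*' * 100}")
    print(f"**Step {step_counter}: {node_name.replace('_', ' ').title()} Node Execution**")
    print(f"{'*' * 100}")
    
    step_counter += 1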

****************************************************************************************************
**Step 1: Entry Point Node Execution**
****************************************************************************************************
--- [ORCHESTRATOR] Entry point started. --- 
--- [ORCHESTRATOR] Starting speculative pre-fetch of order history... ---
--- [DATABASE] Starting query for user_id: user123. This will take 3 seconds. ---
--- [ORCHESTRATOR] Starting main agent LLM call... ---
--- [DATABASE] Query finished for user_id: user123. ---
[Orchestrator] LLM reasoning completed in 4.21s.

----------------------------------------------------------------------------------------------------
Analysis: This is the critical step. The database query (3s) and the LLM call (4.21s) were initiated at roughly the same time. The database query finished while the LLM was still thinking. The total time for this step was 4.21s, the time of the longer of the two parallel operations. The 3s datab