## Multi-Agent RAG with Subagents Architecture

This notebook implements a supervisor pattern in which a central main agent coordinates specialized subagents by calling them as tools. Each subagent is stateless and focused on a specific domain; the supervisor manages all conversation context.

## Architecture Overview

**Subagents Pattern:**
- Main agent (supervisor) maintains conversation state and routes tasks
- Subagents are invoked as tools and return results to the supervisor
- Each subagent operates in its own isolated context window
- Supervisor synthesizes results from multiple subagents
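
The control flow in the list above can be sketched without any framework. This is only a minimal illustration under stated assumptions: `research_subagent`, `analysis_subagent`, and the keyword router in `Supervisor.route` are hypothetical stand-ins, whereas the notebook lets the LLM decide which subagent to call.

```python
from typing import Callable, Dict, List


def research_subagent(query: str) -> str:
    """Hypothetical stateless subagent: sees only the query it is given."""
    return f"[research] findings for: {query}"


def analysis_subagent(query: str) -> str:
    """Hypothetical stateless subagent for comparisons."""
    return f"[analysis] comparison for: {query}"


class Supervisor:
    """Owns the conversation history and calls subagents like tools."""

    def __init__(self, subagents: Dict[str, Callable[[str], str]]) -> None:
        self.subagents = subagents
        self.history: List[Dict[str, str]] = []

    def route(self, user_msg: str) -> List[str]:
        # Toy keyword router (assumption); the notebook lets the LLM decide.
        names = ["research"]
        if "compare" in user_msg.lower():
            names.append("analysis")
        return names

    def handle(self, user_msg: str) -> str:
        self.history.append({"role": "user", "content": user_msg})
        # Each subagent receives only the query string, never the history.
        results = [self.subagents[name](user_msg) for name in self.route(user_msg)]
        answer = "\n".join(results)
        self.history.append({"role": "assistant", "content": answer})
        return answer


supervisor = Supervisor({"research": research_subagent, "analysis": analysis_subagent})
reply = supervisor.handle("Compare transformers and RNNs")
print(reply)
```

Only the supervisor's `history` grows across turns; the subagents keep no state between calls.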

**Key Benefits:**
- Centralized control flow through supervisor
- Context isolation prevents bloat in main conversation
- Easy to add new specialized subagents
- Parallel execution when tasks are independent
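
Parallel execution is not demonstrated elsewhere in this notebook, so here is a hedged sketch of what it could look like when two subagent calls do not depend on each other. The helper `run_independent` and the two fake subagents are hypothetical; threads are used on the assumption that real subagent calls mostly wait on network I/O.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

SubagentCall = Tuple[Callable[[str], str], str]  # (subagent function, query)


def run_independent(calls: Dict[str, SubagentCall]) -> Dict[str, str]:
    """Run each subagent call in its own thread and collect results by name."""
    if not calls:
        return {}
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = {name: pool.submit(fn, query) for name, (fn, query) in calls.items()}
        # result() blocks until that call finishes and re-raises its exception, if any
        return {name: future.result() for name, future in futures.items()}


def fake_research(query: str) -> str:
    """Hypothetical stand-in for a research subagent."""
    return f"research: {query}"


def fake_analysis(query: str) -> str:
    """Hypothetical stand-in for an analysis subagent."""
    return f"analysis: {query}"


parallel_results = run_independent({
    "research": (fake_research, "What is attention?"),
    "analysis": (fake_analysis, "Transformers vs RNNs"),
})
print(parallel_results)
```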

## Setup and Dependencies

In [25]:
import os
from dotenv import load_dotenv

load_dotenv()
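# load_dotenv() already copies .env values into os.environ.
# Fail early with a clear message instead of assigning None, which raises TypeError.
if not os.getenv("GROQ_API_KEY"):
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")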

## Build Knowledge Base with RAG

In [26]:
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents from Wikipedia for realistic knowledge base
# Using transformers in ML as the domain for this example
loader = WikipediaLoader(query="Transformer (deep learning)", load_max_docs=8)
documents = loader.load()

print(f"Loaded {len(documents)} Wikipedia documents")
print(f"First document preview: {documents[0].page_content[:150]}...")

Loaded 8 Wikipedia documents
First document preview: In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechanism, in which text is converted...


In [27]:
# Chunk documents for better retrieval
# Smaller chunks improve precision, overlap maintains context continuity
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

Split into 45 chunks
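
To make the role of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also prefers to break on separators such as paragraph and line breaks); the function `sliding_chunks` is a hypothetical illustration of the overlap idea only.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

The last four characters of each piece reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from either chunk.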


In [28]:
from langchain_huggingface import HuggingFaceEmbeddings

# Initialize embeddings model for semantic search
# all-mpnet-base-v2 provides a good balance of quality and speed
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

In [29]:
from langchain_chroma import Chroma

# Create vector store with local persistence
# Persistent storage avoids re-indexing on every run
vector_store = Chroma(
    collection_name="transformer_knowledge",
    embedding_function=embeddings,
    persist_directory="./chroma_db_subagents"
)

# Add documents to vector store
# Note: the collection is persisted, so re-running this cell adds the same chunks again
vector_store.add_documents(chunks)
print(f"Vector store created with {len(chunks)} document chunks")

Vector store created with 45 document chunks
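
Because the collection is persisted, running the indexing cell a second time is likely to store every chunk twice. One common way around this is to derive a stable ID from each chunk's text. The sketch below only computes such IDs; the helper `chunk_ids` is hypothetical, and passing its result to the vector store (for example as an `ids` argument to `add_documents`) is an assumption you should check against the version you have installed.

```python
import hashlib
from typing import List


def chunk_ids(texts: List[str]) -> List[str]:
    """Derive a stable ID from each chunk's text so identical text always maps to the same ID."""
    return [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]


# Two separate "runs" over the same chunk texts produce the same IDs
ids_first_run = chunk_ids(["chunk about attention", "chunk about RNNs"])
ids_second_run = chunk_ids(["chunk about attention", "chunk about RNNs"])
print(ids_first_run == ids_second_run)
```

Two chunks with identical text share an ID under this scheme, so deduplicate the chunk list first if that can happen in your data.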


In [30]:
# Test retrieval to verify vector store is working
test_query = "What are the advantages of transformer architecture?"
test_results = vector_store.similarity_search(test_query, k=2)

print(f"Query: {test_query}")
print(f"\nTop result preview:")
print(test_results[0].page_content[:200] + "...")

Query: What are the advantages of transformer architecture?

Top result preview:
The modern version of the transformer was proposed in the 2017 paper "Attention Is All You Need" by researchers at Google. The predecessors of transformers were developed as an improvement over previo...
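
Conceptually, `similarity_search` ranks stored chunk embeddings by how close they are to the query embedding. The toy example below shows that ranking with cosine similarity on hand-written three-dimensional vectors. It is a sketch, not the vector store's implementation: the vectors, the document names, and the helpers `normalize` and `top_k` are made up for illustration. For unit-length vectors (which `normalize_embeddings=True` requests), ranking by dot product, cosine similarity, or Euclidean distance gives the same order.

```python
import math
from typing import Dict, List, Tuple


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def top_k(query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k documents with the highest cosine similarity to the query."""
    q = normalize(query_vec)
    scored = [
        (doc_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings, for illustration only
docs = {
    "attention": [0.9, 0.1, 0.0],
    "rnn": [0.1, 0.9, 0.0],
    "history": [0.0, 0.2, 0.9],
}
hits = top_k([1.0, 0.2, 0.0], docs, k=2)
print([doc_id for doc_id, _ in hits])
```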


## Initialize LLM

In [31]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    model="openai/gpt-oss-120b",
    temperature=0.1
)

# Quick test
response = llm.invoke("Say 'LLM initialized successfully'")
print(response.content)

LLM initialized successfully.


## Create Specialized Subagents

Each subagent focuses on a specific domain with its own tools and prompt. Subagents are stateless - they receive a query, execute their task, and return results.

### Research Subagent

Handles information retrieval from the knowledge base. Translates natural language queries into effective vector searches and synthesizes findings.

In [32]:
from langchain.tools import tool

@tool
def search_knowledge_base(query: str) -> str:
    """Search the transformer knowledge base for relevant information.
    
    Returns top 3 most relevant document chunks based on semantic similarity.
    """
    results = vector_store.similarity_search(query, k=3)
    
    # Format results with clear separation
    formatted_results = []
    for i, doc in enumerate(results, 1):
        formatted_results.append(
            f"Source {i}:\n{doc.page_content}\n"
        )
    
    return "\n".join(formatted_results)

In [33]:
from langchain.agents import create_agent

RESEARCH_AGENT_PROMPT = (
    "You are a research specialist focused on deep learning and transformers. "
    "Use the search_knowledge_base tool to find relevant information. "
    "Synthesize findings from multiple sources when available. "
    "Always cite which sources informed your answer. "
    "If information is not found, state that clearly."
)

research_agent = create_agent(
    llm,
    tools=[search_knowledge_base],
    system_prompt=RESEARCH_AGENT_PROMPT
)

In [34]:
# Test research agent independently
test_query = "Explain the self-attention mechanism in transformers"

for step in research_agent.stream(
    {"messages": [{"role": "user", "content": test_query}]},
    stream_mode="values"
):
    latest_msg = step["messages"][-1]
    if hasattr(latest_msg, 'tool_calls') and latest_msg.tool_calls:
        print(f"\nCalling tool: {latest_msg.tool_calls[0]['name']}")
    elif latest_msg.content and latest_msg.type == "ai":
        print(f"\nResearch Agent Response:\n{latest_msg.content}")


Calling tool: search_knowledge_base

Calling tool: search_knowledge_base

Calling tool: search_knowledge_base

Calling tool: search_knowledge_base

Calling tool: search_knowledge_base

Calling tool: search_knowledge_base

Research Agent Response:
**Self-attention in Transformers – a concise, step-by-step description**

Below is a synthesis of the most relevant points that appear in the transformer knowledge base, together with explicit citations to the source excerpts that mention each idea.

---

### 1. What self-attention does  

*“Each element in the input sequence attends to all others, enabling the model to capture global dependencies.”*【source 1†L1-L4】  

In other words, for every token (word, sub-word, image patch, …) the model looks at every other token in the same sequence and decides how much information to borrow from each of them. This replaces the recurrent or convolutional mechanisms that were used in earlier sequence models.

---

### 2. From tok

### Analysis Subagent

Performs comparative analysis and technical evaluation. Complements the research agent by retrieving material on two concepts at once and contrasting them.

In [35]:
@tool
def compare_concepts(concept_a: str, concept_b: str) -> str:
    """Compare two technical concepts from the knowledge base.
    
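    Retrieves an excerpt about each concept so the agent can build a structured comparison.
    """
    # Retrieve the best-matching chunk for each concept
    results_a = vector_store.similarity_search(concept_a, k=1)
    results_b = vector_store.similarity_search(concept_b, k=1)
    
    # Guard against empty results so the tool never raises IndexError
    text_a = results_a[0].page_content[:300] if results_a else "No information found"
    text_b = results_b[0].page_content[:300] if results_b else "No information found"
    
    return (
        f"Information about {concept_a}:\n"
        f"{text_a}...\n\n"
        f"Information about {concept_b}:\n"
        f"{text_b}..."
    )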

In [36]:
ANALYSIS_AGENT_PROMPT = (
    "You are a technical analyst specializing in ML architectures. "
    "Use compare_concepts to analyze differences between approaches. "
    "Provide structured comparisons highlighting key distinctions. "
    "Focus on technical accuracy and practical implications."
)

analysis_agent = create_agent(
    llm,
    tools=[compare_concepts],
    system_prompt=ANALYSIS_AGENT_PROMPT
)

In [37]:
# Test analysis agent
test_query = "Compare transformers with RNNs for sequence processing"

for step in analysis_agent.stream(
    {"messages": [{"role": "user", "content": test_query}]},
    stream_mode="values"
):
    latest_msg = step["messages"][-1]
    if hasattr(latest_msg, 'tool_calls') and latest_msg.tool_calls:
        print(f"\nCalling tool: {latest_msg.tool_calls[0]['name']}")
    elif latest_msg.content and latest_msg.type == "ai":
        print(f"\nAnalysis Agent Response:\n{latest_msg.content}")


Calling tool: compare_concepts

Analysis Agent Response:
**Transformers vs. Recurrent Neural Networks (RNNs) for Sequence Processing**  
*(Technical analyst view – focus on architecture, performance, and practical trade-offs)*  

| Aspect | **Transformers** | **Recurrent Neural Networks (RNNs)** |
|--------|------------------|--------------------------------------|
| **Core Computational Paradigm** | *Self-attention* over the whole sequence; each token attends to every other token in a single (or few) layers. | *Sequential recurrence*: hidden state \(h_t = f(h_{t-1}, x_t)\) is updated step-by-step. |
| **Parallelism & Throughput** | Fully parallelizable across time steps during both forward and backward passes (matrix-multiplication on \(Q,K,V\)). Enables massive GPU/TPU utilization; training speed scales roughly linearly with sequence length (up to memory limits). | Inherently sequential; each time step depends on the previous hidden state. Limits parallelism to batch dim

## Wrap Subagents as Tools

This is the key architectural step. Each subagent is wrapped as a tool that the supervisor can invoke. The supervisor sees high-level capabilities, not implementation details.

In [38]:
@tool
def research_information(query: str) -> str:
    """Research information from the knowledge base.
    
    Use this when the user needs factual information, definitions, or
    explanations from the transformer knowledge base. Handles semantic search
    and information synthesis.
    
    Input: Natural language query about transformers or deep learning
    """
    result = research_agent.invoke({
        "messages": [{"role": "user", "content": query}]
    })
    
    # Return only the final response to supervisor
    # Supervisor doesn't need to see intermediate tool calls
    return result["messages"][-1].content

In [39]:
@tool
def analyze_comparison(request: str) -> str:
    """Analyze and compare technical concepts.
    
    Use this when the user wants to compare different approaches, understand
    tradeoffs, or analyze technical differences. Provides structured comparative
    analysis.
    
    Input: Natural language request for comparison or analysis
    """
    result = analysis_agent.invoke({
        "messages": [{"role": "user", "content": request}]
    })
    
    return result["messages"][-1].content
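
Both wrappers above follow the same shape: build a one-message payload, invoke the subagent, and return only the last message. That shared shape can be sketched as a small factory. Everything here is a hypothetical stand-in: `make_subagent_wrapper` and `fake_agent_invoke` are not part of the notebook or of any library, and the fake agent returns plain dicts rather than message objects.

```python
from typing import Any, Callable, Dict


def make_subagent_wrapper(
    agent_invoke: Callable[[Dict[str, Any]], Dict[str, Any]]
) -> Callable[[str], str]:
    """Build a function that sends one request to a subagent and returns only its final answer."""

    def call(request: str) -> str:
        result = agent_invoke({"messages": [{"role": "user", "content": request}]})
        final = result["messages"][-1]
        # Accept both dict-style messages and objects with a .content attribute
        return final["content"] if isinstance(final, dict) else final.content

    return call


def fake_agent_invoke(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical agent: appends an intermediate tool message and a final answer."""
    question = payload["messages"][-1]["content"]
    return {
        "messages": payload["messages"] + [
            {"role": "tool", "content": "intermediate search output"},
            {"role": "assistant", "content": f"answer to: {question}"},
        ]
    }


ask = make_subagent_wrapper(fake_agent_invoke)
summary = ask("What is attention?")
print(summary)
```

The notebook keeps one explicit `@tool` function per subagent instead, since each tool's docstring is what describes that capability to the supervisor.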

## Create Supervisor Agent

The supervisor orchestrates subagents, making high-level routing decisions. It maintains conversation state and synthesizes results from multiple subagents.

In [40]:
SUPERVISOR_PROMPT = (
    "You are a helpful AI assistant with access to specialized subagents. "
    "You can research information and perform technical analysis. "
    "\n\nAvailable capabilities:"
    "\n- research_information: Look up facts and explanations"
    "\n- analyze_comparison: Compare technical approaches"
    "\n\nFor complex questions, you may need to use multiple subagents in sequence. "
    "Break down requests and coordinate subagent results into coherent responses."
)

supervisor_agent = create_agent(
    llm,
    tools=[research_information, analyze_comparison],
    system_prompt=SUPERVISOR_PROMPT
)
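
The test cells in this notebook repeat the same loop to label each streamed step. As a sketch, that labeling logic can live in one helper. The function `describe_step` is hypothetical, and the `SimpleNamespace` objects are stand-ins for the message objects an agent stream yields; only the attributes used by the notebook's loops (`type`, `content`, `name`, `tool_calls`) are assumed.

```python
from types import SimpleNamespace
from typing import Any, List


def describe_step(msg: Any) -> List[str]:
    """Turn the latest message of a streamed step into printable log lines."""
    tool_calls = getattr(msg, "tool_calls", None)
    if tool_calls:
        return [f"Supervisor calling: {call['name']}" for call in tool_calls]
    if msg.type == "tool":
        return [f"Subagent completed: {msg.name}"]
    if msg.type == "ai" and msg.content:
        return [f"Final Response:\n{msg.content}"]
    return []  # e.g. the user's own message


# Stand-in messages (assumption: shaped like the ones inspected in this notebook)
steps = [
    SimpleNamespace(type="human", content="What is attention?"),
    SimpleNamespace(type="ai", content="", tool_calls=[{"name": "research_information"}]),
    SimpleNamespace(type="tool", name="research_information", content="raw findings"),
    SimpleNamespace(type="ai", content="Attention weighs tokens.", tool_calls=[]),
]
log_lines = [line for step in steps for line in describe_step(step)]
print("\n".join(log_lines))
```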

## Test Complete Multi-Agent System

### Example 1: Simple Single-Domain Query

In [42]:
query = "What is the purpose of positional encoding in transformers?"

print(f"User Query: {query}")
print("\n" + "="*80 + "\n")

for step in supervisor_agent.stream(
    {"messages": [{"role": "user", "content": query}]},
    stream_mode="values"
):
    latest_msg = step["messages"][-1]
    
    if hasattr(latest_msg, 'tool_calls') and latest_msg.tool_calls:
        for tool_call in latest_msg.tool_calls:
            print(f"\nSupervisor calling: {tool_call['name']}")
    
    elif latest_msg.type == "tool":
        print(f"\nSubagent completed: {latest_msg.name}")
    
    elif latest_msg.content and latest_msg.type == "ai":
        print(f"\nFinal Response:\n{latest_msg.content}")

User Query: What is the purpose of positional encoding in transformers?



Final Response:
**Positional encoding (or positional embedding) is the mechanism that gives a Transformer any notion of the order of tokens in a sequence.**  

### Why it’s needed
- **Self-attention is order-agnostic.**  
  In a Transformer each token attends to every other token via dot-product attention. The attention computation itself treats the input as a set, not a sequence, so without extra information the model cannot distinguish “the first word” from “the last word” if the words themselves are identical.
- **Sequence-level tasks require order.**  
  Tasks such as language modeling, translation, or any sequential prediction depend on the relative and absolute positions of tokens (e.g., “cat sat on the mat” ≠ “mat sat on the cat”). Positional encodings inject this ordering information so the model can learn patterns that depend on token positions.

### How it works
1. **Create 

### Example 2: Complex Multi-Domain Query

In [44]:
complex_query = (
    "First, explain what attention mechanism is. "
    "Then compare how attention works in transformers versus traditional RNNs."
)

print(f"User Query: {complex_query}")
print("\n" + "="*80 + "\n")

for step in supervisor_agent.stream(
    {"messages": [{"role": "user", "content": complex_query}]},
    stream_mode="values"
):
    latest_msg = step["messages"][-1]
    
    if hasattr(latest_msg, 'tool_calls') and latest_msg.tool_calls:
        for tool_call in latest_msg.tool_calls:
            print(f"\nSupervisor calling: {tool_call['name']}")
    
    elif latest_msg.type == "tool":
        print(f"\nSubagent completed: {latest_msg.name}")
    
    elif latest_msg.content and latest_msg.type == "ai":
        print(f"\nFinal Response:\n{latest_msg.content}")

User Query: First, explain what attention mechanism is. Then compare how attention works in transformers versus traditional RNNs.



Supervisor calling: research_information

Subagent completed: research_information

Supervisor calling: analyze_comparison

Subagent completed: analyze_comparison

Final Response:
**What is an attention mechanism?**  

In deep-learning models an **attention mechanism** is a differentiable module that lets the network decide, for each output element, how much “focus” to give to each part of its input.  
It works by computing a set of *attention scores* that measure the relevance of every input token (or hidden state) to the current processing step, turning those scores into a probability distribution with a soft-max, and then taking a weighted sum of the input representations (the *values*). The result – the *context vector* – is fed to the next layer (decoder, classifier, etc.).  

The most common formulation is the **query-key-value** par

### Example 3: Multi-Turn Conversation

In [45]:
# Simulate conversation with follow-up questions
conversation = [
    "What are transformer models?",
    "How do they differ from LSTM networks?",
    "Which one should I use for a sequence-to-sequence task?"
]

state = {"messages": []}

for i, user_msg in enumerate(conversation, 1):
    print(f"\n{'='*80}")
    print(f"Turn {i}: {user_msg}")
    print(f"{'='*80}\n")
    
    # Add user message to state
    state["messages"].append({"role": "user", "content": user_msg})
    
    # Get response
    result = supervisor_agent.invoke(state)
    
    # Update state with full conversation history
    state = result
    
    # Print only the final AI response
    final_response = result["messages"][-1].content
    print(f"Assistant: {final_response}\n")


Turn 1: What are transformer models?

Assistant: **Transformer models** are a family of deep-learning architectures that process sequences (or sets) by repeatedly applying **self-attention**, feed-forward networks, and residual-norm layers, allowing every element of the input to directly “attend” to all other elements without using recurrence or convolution.

---

### Core Ideas

| Concept | What it means | Why it matters |
|---------|---------------|----------------|
| **Self-attention** | Each token computes a weighted sum of *all* other tokens, where the weights (attention scores) are learned. | Captures **global dependencies** in a single layer, unlike RNNs that see only nearby context step-by-step. |
| **Multi-head** | The attention operation is performed in parallel across several “heads”, each learning different relational patterns. | Increases expressive power; the model can attend to many aspects of the data simultaneously. |
| **Positional encoding**

## Advanced: Context Engineering

Control how information flows between supervisor and subagents. By default, subagents receive only the request string. You can customize this to pass additional context.

In [46]:
from langchain.tools import tool, ToolRuntime

@tool
def research_with_context(query: str, runtime: ToolRuntime) -> str:
    """Research information with the original user message as context.
    
    This version passes the first user message of the conversation to the
    subagent, helping it resolve ambiguities in the supervisor's query.
    """
    # Extract the first human message from the conversation history
    original_message = next(
        (msg for msg in runtime.state["messages"] if msg.type == "human"),
        None
    )
    
    # Build enhanced prompt with context
    enhanced_prompt = (
        f"Original user inquiry: {original_message.content if original_message else 'N/A'}\n\n"
        f"Your specific task: {query}"
    )
    
    result = research_agent.invoke({
        "messages": [{"role": "user", "content": enhanced_prompt}]
    })
    
    return result["messages"][-1].content
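
The tool above forwards the first human message. In a longer conversation the most recent user turns may be more useful, so a variant could forward the last few instead. The sketch below shows only the prompt-building step; `build_context_prompt` and its `max_user_turns` parameter are hypothetical, and `SimpleNamespace` objects stand in for the message objects found in the agent state.

```python
from types import SimpleNamespace
from typing import Any, List


def build_context_prompt(messages: List[Any], query: str, max_user_turns: int = 2) -> str:
    """Prefix the subagent task with the most recent user messages."""
    human_texts = [m.content for m in messages if m.type == "human"]
    # Slicing with -0 would return everything, so handle zero explicitly
    recent = human_texts[-max_user_turns:] if max_user_turns > 0 else []
    context = "\n".join(f"- {text}" for text in recent) if recent else "N/A"
    return f"Recent user messages:\n{context}\n\nYour specific task: {query}"


# Stand-in conversation mirroring the test state used in this section
history = [
    SimpleNamespace(type="human", content="Tell me about transformer attention mechanisms"),
    SimpleNamespace(type="ai", content="Transformers use self-attention..."),
    SimpleNamespace(type="human", content="Can you elaborate on how it processes sequences?"),
]
prompt = build_context_prompt(history, "Explain sequence processing")
print(prompt)
```

Keeping the window small preserves the context isolation that motivates the pattern: the subagent sees enough to resolve "it", not the whole transcript.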

In [47]:
# Create supervisor with context-aware subagent
supervisor_with_context = create_agent(
    llm,
    tools=[research_with_context, analyze_comparison],
    system_prompt=SUPERVISOR_PROMPT
)

In [48]:
# Test with ambiguous follow-up
test_state = {
    "messages": [
        {"role": "user", "content": "Tell me about transformer attention mechanisms"},
        {"role": "assistant", "content": "Transformers use self-attention..."},
        {"role": "user", "content": "Can you elaborate on how it processes sequences?"}
    ]
}

result = supervisor_with_context.invoke(test_state)
print("Response with context:")
print(result["messages"][-1].content)

Response with context:
### How Transformers Process Sequences – A Step-by-Step Walk-through  

Below is a concise yet complete description of the pipeline a Transformer (e.g., the original **Vaswani et al., 2017** model or its modern descendants) follows to turn an input sequence of symbols (words, sub-words, characters, etc.) into contextualized representations. The core of this pipeline is the **self-attention** mechanism, but it works together with several other components that together give the model its power.

---

## 1. Input Preparation  

| Step | What happens | Why it matters |
|------|--------------|----------------|
| **Tokenization** | The raw text is split into discrete tokens (e.g., WordPiece, BPE, SentencePiece). | Provides a finite vocabulary that the model can embed. |
| **Embedding lookup** | Each token index is mapped to a dense vector **\(x_i \in \mathbb{R}^{d_{\text{model}}}\)** via a learned embedding matrix **\(E \in \mathbb{R}^{|V|\times d_{\text{

## Key Architectural Patterns

**1. Tool Per Agent Pattern:**
- Each subagent wrapped as a distinct tool
- Fine-grained control over input/output
- Clear responsibility boundaries

**2. Context Isolation:**
- Subagents operate in clean context windows
- Prevents context pollution in main conversation
- Supervisor maintains global state

**3. Information Flow:**
- Supervisor passes queries to subagents
- Subagents return only final results
- Supervisor synthesizes multiple results

**When to Use This Pattern:**
- Multiple distinct domains requiring specialized handling
- Complex tasks requiring sequential subagent coordination
- A requirement for centralized control flow
- Subagents don't need direct user interaction

## Summary

This notebook demonstrated:
- Building specialized subagents with domain-specific tools
- Wrapping subagents as tools for supervisor orchestration
- Creating a supervisor that coordinates multiple subagents
- Handling both simple and complex multi-domain queries
- Context engineering for enhanced subagent capabilities

The subagents pattern provides clean separation of concerns, easy extensibility, and centralized workflow control.