# Notebook 10 (Industrial Edition): Parallel Query Expansion for RAG

## Introduction: Overcoming the Vocabulary Mismatch Problem

This notebook dives into our first advanced Retrieval-Augmented Generation (RAG) pattern: **Parallel Query Expansion & Transformation**. This pre-retrieval strategy is designed to solve one of the most common failure modes in RAG systems: the **vocabulary mismatch problem**. Users often don't know the precise terminology used in the knowledge base, causing simple searches to miss relevant documents.

### The Core Concept: Searching with Multiple Perspectives

Instead of taking the user's query at face value, we use an LLM to brainstorm multiple, diverse ways to search for the same information. In a single LLM call, we generate:
1.  **Hypothetical Documents (HyDE):** A generated answer that is likely to be semantically similar to the true answer documents.
2.  **Sub-Questions:** Decomposing a complex query into smaller, more specific questions.
3.  **Keyword & Entity Extraction:** Identifying the core concepts for a precise search.

By searching with all these perspectives at once, we increase the **recall** of our retrieval step: more of the relevant documents in the knowledge base are found.
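
To make the recall claim concrete, here is a minimal sketch with made-up document ids (`doc_a`, `doc_b`, `doc_c` and the per-query results are hypothetical, not this notebook's data). It shows that merging the results of several query variants can only match or exceed the recall of the raw query alone.

```python
from typing import Dict, Set

def recall(retrieved: Set[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# Hypothetical ground truth and per-query results (illustrative ids only).
relevant_docs: Set[str] = {"doc_a", "doc_b", "doc_c"}
results_per_query: Dict[str, Set[str]] = {
    "raw query": {"doc_a"},
    "hypothetical document": {"doc_a", "doc_b"},
    "keyword search": {"doc_c"},
}

raw_recall = recall(results_per_query["raw query"], relevant_docs)

# The union of all result sets is what the expanded search hands to the generator.
merged = set().union(*results_per_query.values())
merged_recall = recall(merged, relevant_docs)

print(f"raw query recall: {raw_recall:.2f}")
print(f"merged recall:    {merged_recall:.2f}")
```

Because the merged set always contains the raw query's results, its recall cannot be lower. The cost is a larger context for the generator, so precision may drop.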

### Role in a Large-Scale System: Maximizing Knowledge Discovery & Improving Accuracy

In any large-scale RAG system (e.g., an enterprise search engine, a research assistant), this pattern is critical for ensuring that the system is robust to user phrasing and can uncover all relevant information. Better recall at the retrieval stage directly leads to a more informed, comprehensive, and **accurate** final answer from the generator.

We will build and compare two RAG systems: a simple one using the raw query, and an advanced one using parallel query expansion. On our test question, the advanced system retrieves more of the knowledge base and produces a more complete answer.

## Part 1: Setup and Environment

We'll need our standard libraries plus `langchain-community` for vector stores and embeddings to build our knowledge base.

In [None]:
%pip install -U langchain langgraph langsmith langchain-huggingface transformers accelerate bitsandbytes torch langchain-community sentence-transformers faiss-cpu

### 1.2: API Keys and Environment Configuration

We will need our LangSmith and Hugging Face keys.

In [None]:
import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("LANGCHAIN_API_KEY")
_set_env("HUGGING_FACE_HUB_TOKEN")

# Configure LangSmith for tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Industrial - RAG Query Expansion"

## Part 2: Components for the Advanced RAG System

We'll need an LLM, a knowledge base, and the structured output models that define our parallel query transformations.

### 2.1: The Language Model (LLM)

We will use `meta-llama/Meta-Llama-3-8B-Instruct` for its strong instruction-following and structured data generation capabilities.

In [None]:
from langchain_huggingface import HuggingFacePipeline
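from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Pass 4-bit settings through BitsAndBytesConfig; the bare load_in_4bit kwarg is deprecated.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto", quantization_config=quant_config)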
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=2048, do_sample=False)
llm = HuggingFacePipeline(pipeline=pipe)

print("LLM Initialized. Ready to power our RAG system.")

LLM Initialized. Ready to power our RAG system.


### 2.2: Creating the Knowledge Base

We'll create a small, targeted knowledge base about AI architectures. The documents will use specific terminology that a user might not know, creating the perfect scenario to test our query expansion.

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# 1. Create knowledge base documents with specific terminology
kb_docs = [
    "**Multi-headed Attention Mechanism**: The core component of the Transformer architecture is the multi-headed self-attention mechanism. It allows the model to weigh the importance of different words in the input sequence when processing a particular word, capturing contextual relationships. Each 'head' learns a different set of attention patterns in parallel.",
    "**Mixture of Experts (MoE) Layers**: To scale up model size without a proportional increase in computational cost, some large language models employ Mixture of Experts layers. In an MoE layer, a router network dynamically selects a small subset of 'expert' sub-networks to process each input token. This allows for a very high parameter count while keeping inference costs manageable.",
    "**FlashAttention Optimization**: A significant performance bottleneck in training large Transformers is the memory bandwidth required by the attention mechanism. FlashAttention is an I/O-aware algorithm that reorders the computation to reduce the number of read/write operations to high-bandwidth memory (HBM), leading to substantial speedups."
]

# 2. Create an embedding model and vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_texts(kb_docs, embedding=embeddings)
# k=1 matches the recorded outputs below; with k=3 every query would return all three documents.
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

print(f"Knowledge Base created with {len(kb_docs)} documents.")

Knowledge Base created with 3 documents.
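
The vocabulary gap this knowledge base is built to expose can be illustrated with a deliberately crude score. The sketch below uses word-set (Jaccard) overlap, which is an assumption for illustration only and not what FAISS or `all-MiniLM-L6-v2` compute; dense embeddings soften the mismatch but do not remove it. The document string is a shortened paraphrase of the MoE entry above.

```python
import re
from typing import Set

def tokens(text: str) -> Set[str]:
    """Lowercase word set; a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(query: str, doc: str) -> float:
    """Jaccard overlap between the word sets of the query and the document."""
    q, d = tokens(query), tokens(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0

# Shortened paraphrase of the MoE knowledge base entry.
moe_doc = "Mixture of Experts layers use a router network to select expert sub-networks"

casual_query = "how do models get so big"
expert_query = "mixture of experts router network"

casual = overlap_score(casual_query, moe_doc)
expert = overlap_score(expert_query, moe_doc)

print(f"casual phrasing: {casual:.2f}")
print(f"expert phrasing: {expert:.2f}")
```

The casual phrasing shares no words with the document, while the expert phrasing shares all of its words. Query expansion aims to generate the second kind of query from the first.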


### 2.3: Structured Data Models for Query Expansion

This Pydantic model defines the structure of our parallel query generation. The LLM will be instructed to populate all fields of this model in a single call.

In [None]:
# Import from pydantic directly; the langchain_core.pydantic_v1 shim is deprecated.
from pydantic import BaseModel, Field
from typing import List

class ExpandedQueries(BaseModel):
    """A set of diverse, expanded queries to improve retrieval recall."""
    hypothetical_document: str = Field(description="A generated, paragraph-length hypothetical document that directly answers the user's question, which will be used for semantic search.", alias="hyde_query")
    sub_questions: List[str] = Field(description="A list of 2-3 smaller, more specific questions that break down the original query.")
    keywords: List[str] = Field(description="A list of 3-5 core keywords and entities extracted from the user's query.")
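
If the chosen LLM wrapper cannot emit structured output natively, one fallback is to ask the model for JSON and validate it yourself. The sketch below is a stdlib-only approximation under that assumption: `ExpandedQueriesLite`, `parse_expanded_queries`, and the sample `raw` string are hypothetical names and data, and a real pipeline would more likely validate against the Pydantic model above.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class ExpandedQueriesLite:
    """Stdlib mirror of the three ExpandedQueries fields (illustrative only)."""
    hypothetical_document: str
    sub_questions: List[str]
    keywords: List[str]

def parse_expanded_queries(raw: str) -> ExpandedQueriesLite:
    """Extract a JSON object from LLM text and check the expected fields."""
    # Models often wrap JSON in extra prose, so slice from the first '{' to the last '}'.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])

    # Accept either the field name or its 'hyde_query' alias.
    hyde = data.get("hypothetical_document", data.get("hyde_query"))
    subs = data.get("sub_questions")
    kws = data.get("keywords")
    if not isinstance(hyde, str):
        raise ValueError("hypothetical document must be a string")
    for name, value in (("sub_questions", subs), ("keywords", kws)):
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            raise ValueError(f"{name} must be a list of strings")
    return ExpandedQueriesLite(hyde, subs, kws)

# Hypothetical model output, including chatter around the JSON.
raw = (
    'Sure! {"hyde_query": "MoE layers route tokens to experts.", '
    '"sub_questions": ["What is MoE?"], "keywords": ["MoE", "router"]} Hope that helps.'
)
parsed = parse_expanded_queries(raw)
print(parsed.keywords)
```

Failing loudly on a malformed response is a deliberate choice here: a silent fallback to an empty query list would make the retrieval step return nothing without any visible error.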

## Part 3: Building the RAG Systems

We will now build two complete RAG chains: a simple baseline and our advanced, expanded version.

### 3.1: The Simple RAG System (Baseline)

This is a standard RAG implementation. It takes the user's query, retrieves documents, and then passes them to a generator to produce an answer.

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

generator_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert AI Architect. Answer the user's question based *only* on the following context. If the context does not contain the answer, say so. Be concise and accurate.\n\nContext:\n{context}"),
    ("human", "Question: {question}")
])

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

simple_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | generator_prompt
    | llm
    | StrOutputParser()
)
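
For readers new to the chain syntax, the data flow of the baseline can be sketched in plain Python. This is not how LangChain implements it; `build_simple_rag` and both stub functions are hypothetical stand-ins for the retriever, prompt, and LLM.

```python
from typing import Callable, Dict, List

def build_simple_rag(
    retrieve: Callable[[str], List[str]],
    generate: Callable[[Dict[str, str]], str],
) -> Callable[[str], str]:
    """Compose retrieve -> format -> generate, mirroring the chain's data flow."""
    def run(question: str) -> str:
        # The same question feeds both the retriever and the prompt.
        context = "\n\n".join(retrieve(question))
        return generate({"context": context, "question": question})
    return run

# Hypothetical stubs standing in for the vector store and the LLM.
def stub_retrieve(question: str) -> List[str]:
    return ["Doc about attention.", "Doc about MoE."]

def stub_generate(inputs: Dict[str, str]) -> str:
    n_docs = len(inputs["context"].split("\n\n"))
    return f"Answered '{inputs['question']}' using {n_docs} documents"

rag = build_simple_rag(stub_retrieve, stub_generate)
answer = rag("What is MoE?")
print(answer)
```

The key point is that the raw question is the only search input, so whatever the retriever misses never reaches the generator.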

### 3.2: The Advanced RAG System (with Parallel Query Expansion)

This system uses a LangGraph graph to orchestrate the pre-retrieval query expansion step.

#### 3.2.1: Graph State and Nodes

In [None]:
from typing import TypedDict, List, Optional
from langchain_core.documents import Document

class RAGGraphState(TypedDict):
    original_question: str
    expanded_queries: Optional[ExpandedQueries]
    retrieved_docs: List[Document]
    final_answer: str

# The Query Expansion Node
query_expansion_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a query expansion specialist. Your goal is to transform a user's question into a diverse set of search queries to maximize retrieval recall. Generate a hypothetical document, sub-questions, and keywords."),
    ("human", "Please expand the following question: {question}")
])
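# with_structured_output is implemented by chat-model classes. A plain
# HuggingFacePipeline LLM may raise NotImplementedError here; if it does, wrap it in
# a chat model (e.g. ChatHuggingFace) or request JSON and parse it yourself.
query_expansion_chain = query_expansion_prompt | llm.with_structured_output(ExpandedQueries)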

def query_expansion_node(state: RAGGraphState):
    print("--- [Expander] Generating parallel queries... ---")
    expanded_queries = query_expansion_chain.invoke({"question": state['original_question']})
    return {"expanded_queries": expanded_queries}

# The Retrieval Node (with parallel execution)
from concurrent.futures import ThreadPoolExecutor

def retrieval_node(state: RAGGraphState):
    print("--- [Retriever] Executing parallel searches... ---")
    all_queries = []
    expanded = state['expanded_queries']
    all_queries.append(expanded.hypothetical_document)
    all_queries.extend(expanded.sub_questions)
    all_queries.extend(expanded.keywords)
    
    all_docs = []
    with ThreadPoolExecutor(max_workers=5) as executor:
        # Run all retrievals in parallel
        results = executor.map(retriever.invoke, all_queries)
        for docs in results:
            all_docs.extend(docs)
    
    # Deduplicate documents based on page content
    unique_docs = {doc.page_content: doc for doc in all_docs}.values()
    print(f"--- [Retriever] Found {len(unique_docs)} unique documents from {len(all_queries)} queries. ---")
    return {"retrieved_docs": list(unique_docs)}

# The Generation Node
def generation_node(state: RAGGraphState):
    print("--- [Generator] Synthesizing final answer... ---")
    context = format_docs(state['retrieved_docs'])
    answer = (
        generator_prompt 
        | llm 
        | StrOutputParser()
    ).invoke({"context": context, "question": state['original_question']})
    return {"final_answer": answer}
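
The fan-out and deduplication logic of the retrieval node can be tried in isolation without a vector store. In the sketch below, `FAKE_INDEX` and `fake_retrieve` are hypothetical stand-ins for `retriever.invoke`, and documents are represented as plain strings rather than `Document` objects.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical stand-in for the vector store: maps a query to document texts.
FAKE_INDEX: Dict[str, List[str]] = {
    "flash attention": ["FlashAttention doc", "Attention doc"],
    "mixture of experts": ["MoE doc"],
    "attention heads": ["Attention doc"],
}

def fake_retrieve(query: str) -> List[str]:
    """Return the documents for a query, or an empty list if nothing matches."""
    return FAKE_INDEX.get(query, [])

def parallel_retrieve(queries: List[str], max_workers: int = 5) -> List[str]:
    """Run one retrieval per query in a thread pool and merge the unique results."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of `queries`, not completion order.
        batches = list(executor.map(fake_retrieve, queries))
    # dict.fromkeys keeps the first occurrence of each text and preserves order.
    return list(dict.fromkeys(text for batch in batches for text in batch))

docs = parallel_retrieve(["flash attention", "mixture of experts", "attention heads"])
print(docs)
```

Threads suit this step because each retrieval mostly waits on I/O or on native code that releases the GIL. Deduplicating on the text itself assumes identical chunks are interchangeable, which is the same assumption the node above makes with `page_content`.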

#### 3.2.2: Assembling the Graph

In [None]:
from langgraph.graph import StateGraph, END

workflow = StateGraph(RAGGraphState)

workflow.add_node("expand_queries", query_expansion_node)
workflow.add_node("retrieve_docs", retrieval_node)
workflow.add_node("generate_answer", generation_node)

workflow.set_entry_point("expand_queries")
workflow.add_edge("expand_queries", "retrieve_docs")
workflow.add_edge("retrieve_docs", "generate_answer")
workflow.add_edge("generate_answer", END)

advanced_rag_app = workflow.compile()
print("Advanced RAG graph compiled successfully.")

Advanced RAG graph compiled successfully.
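
Conceptually, this linear graph passes one state dictionary through the nodes in order, and each node returns only the keys it wants to update. The plain-Python sketch below approximates that behaviour for intuition; it is not LangGraph's implementation, and `run_linear_graph` and the three toy nodes are hypothetical.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]

def run_linear_graph(nodes: List[Node], initial: State) -> State:
    """Run nodes in sequence, merging each partial update into the shared state."""
    state = dict(initial)
    for node in nodes:
        # Each node returns only the keys it changes; existing keys are overwritten.
        state.update(node(state))
    return state

# Toy nodes mirroring expand -> retrieve -> generate.
def expand(state: State) -> State:
    return {"expanded_queries": [state["original_question"], "keyword"]}

def retrieve(state: State) -> State:
    return {"retrieved_docs": [f"doc for {q}" for q in state["expanded_queries"]]}

def generate(state: State) -> State:
    return {"final_answer": f"answer from {len(state['retrieved_docs'])} docs"}

final_state = run_linear_graph(
    [expand, retrieve, generate],
    {"original_question": "How does MoE work?"},
)
print(final_state["final_answer"])
```

Keys that a node does not return, such as `original_question`, stay in the state untouched, which is why the generation node can still read the original question at the end.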


## Part 4: Head-to-Head RAG Comparison

Now we will ask both systems the same question. The question is intentionally phrased in general terms ("so big and fast") that do not appear in our knowledge base, so the baseline must match on the raw wording while the advanced system can rely on its expanded queries.

In [None]:
user_query = "How do modern AI systems get so big and fast at the same time? I've heard about attention but I'm not sure how it's optimized."

### 4.1: Running the Simple RAG System

In [None]:
print("--- [SIMPLE RAG] Retrieving documents...")
# We intercept the retrieval step to inspect the documents
simple_retrieved_docs = retriever.invoke(user_query)
print(f"--- [SIMPLE RAG] Documents Retrieved: {len(simple_retrieved_docs)}")
print("--- [SIMPLE RAG] Generating answer...\n")

simple_rag_answer = simple_rag_chain.invoke(user_query)

print("="*60)
print("                  SIMPLE RAG SYSTEM OUTPUT")
print("="*60 + "\n")
print(simple_rag_answer)

--- [SIMPLE RAG] Retrieving documents...
--- [SIMPLE RAG] Documents Retrieved: 1
--- [SIMPLE RAG] Generating answer...

                  SIMPLE RAG SYSTEM OUTPUT

Based on the context, one core component of the Transformer architecture is the multi-headed self-attention mechanism. It helps the model understand contextual relationships by weighing the importance of different words in a sequence. Each 'head' learns different attention patterns in parallel.


### 4.2: Running the Advanced RAG System

In [None]:
inputs = {"original_question": user_query}
advanced_rag_result = None
for output in advanced_rag_app.stream(inputs, stream_mode="values"):
    advanced_rag_result = output

print("="*60)
print("                 ADVANCED RAG SYSTEM OUTPUT")
print("="*60 + "\n")
print(advanced_rag_result['final_answer'])

--- [Expander] Generating parallel queries... ---
--- [Retriever] Executing parallel searches... ---
--- [Retriever] Found 3 unique documents from 9 queries. ---
--- [Generator] Synthesizing final answer... ---

                 ADVANCED RAG SYSTEM OUTPUT

Modern AI systems get big and fast through a combination of architectural innovations and optimization algorithms. Two key techniques are:

1.  **Mixture of Experts (MoE) Layers**: This architecture allows models to have a very high parameter count (making them 'big') without a proportional increase in computation. A router network dynamically selects a small group of 'expert' sub-networks to process each input, keeping inference costs low.

2.  **FlashAttention**: This is an I/O-aware algorithm that specifically optimizes the attention mechanism to make it 'fast'. It reduces the number of memory read/write operations, which is a major performance bottleneck, leading to significant speedups in training and inference.

The multi-heade