### Query Expansion for Better Retrieval

**The Problem**: User queries are often too short or use different terminology than documents. A query like "transformers" might miss documents about "attention mechanisms" or "self-attention layers".

**The Solution**: Query expansion generates multiple variations of the original query to cast a wider net.

**Why It Matters**:
- Better recall: Catch documents using synonyms or related terms
- Overcome vocabulary mismatch: User language vs document language
- More robust retrieval: Single query might miss relevant docs
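
To make the vocabulary mismatch concrete, here is a minimal sketch using a toy keyword matcher instead of a real vector store. The three-document corpus, the `keyword_search` and `expanded_search` helpers, and the hand-written query variations are all made up for illustration; a real system would generate the variations with an LLM.

```python
# Toy corpus (hypothetical): only doc_a uses the word "transformers".
corpus = {
    "doc_a": "transformers stack encoder and decoder layers",
    "doc_b": "self-attention layers weigh every token against every other token",
    "doc_c": "attention mechanisms compute a weighted sum of value vectors",
}

def keyword_search(query: str) -> set[str]:
    """Return ids of documents sharing at least one word with the query."""
    query_words = set(query.lower().split())
    return {
        doc_id
        for doc_id, text in corpus.items()
        if query_words & set(text.lower().split())
    }

def expanded_search(queries: list[str]) -> set[str]:
    """Union of the matches for every query variation."""
    hits: set[str] = set()
    for q in queries:
        hits |= keyword_search(q)
    return hits

single = keyword_search("transformers")
expanded = expanded_search(["transformers", "self-attention", "attention mechanisms"])

print(sorted(single))    # only the document that uses the query's exact wording
print(sorted(expanded))  # all three documents
```

Dense embeddings soften this problem compared with exact keyword matching, but the same effect shows up whenever the query and the relevant passage are phrased differently.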

**Techniques Covered**:
1. Multi-query expansion: Generate 3-5 alternative phrasings
2. Query rewriting: Add technical terms and context
3. Step-back prompting: Ask broader questions for context
4. HyDE: Generate hypothetical answers and search with them
5. Query decomposition: Break complex queries into sub-queries
6. Agent integration: Use all techniques with LangChain agents

**The Trade-off**: More queries usually mean better recall, but every extra LLM call and retrieval adds latency and cost. Choose based on your requirements.

In [1]:
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_groq import ChatGroq
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.agents import create_agent
from langchain.tools import tool

import os
from dotenv import load_dotenv

load_dotenv()
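# load_dotenv() has already copied .env values into os.environ.
# Fail early with a clear message if the key is missing
# (assigning None to os.environ would raise a TypeError).
if not os.getenv("GROQ_API_KEY"):
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")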



### Load Wikipedia Data

Using Wikipedia for this demo because:
- Rich, technical content on transformers and deep learning
- Good test case for terminology mismatch (formal vs informal language)
- Real-world complexity

In [2]:
# Load Wikipedia articles on transformers
loader = WikipediaLoader(
    query="Transformer (deep learning)",
    load_max_docs=10
)

documents = loader.load()

print(f"Loaded {len(documents)} Wikipedia pages")
print(f"\nFirst document preview:")
print(documents[0].page_content[:200] + "...")
print(f"\nArticle: {documents[0].metadata.get('title', 'Unknown')}")





Loaded 9 Wikipedia pages

First document preview:
In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and e...

Article: Transformer (deep learning)


In [3]:
# Split documents into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from {len(documents)} articles")
print(f"\nSample chunk:\n{chunks[10].page_content[:200]}...")

Created 104 chunks from 9 articles

Sample chunk:
. A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network which computes answers to queries. This was later shown to be...


In [4]:
# Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# Base retriever (no expansion yet)
base_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20}
)

print("Vector store ready")
print("Retriever: MMR with k=4 (diverse results)")

Vector store ready
Retriever: MMR with k=4 (diverse results)


In [21]:
# Initialize LLM for query expansion
llm = ChatGroq(
    model="meta-llama/llama-4-maverick-17b-128e-instruct",
    temperature=0.3  # Some creativity for variations
)

print("LLM initialized for query expansion")

LLM initialized for query expansion


### Baseline: No Query Expansion

First, test retrieval without expansion to establish a baseline.

In [6]:
test_query = "How do attention mechanisms work in transformers?"

print(f"Query: {test_query}")
print("\n" + "="*70)
print("BASELINE RETRIEVAL (No Expansion)")
print("="*70)

baseline_results = base_retriever.invoke(test_query)

print(f"\nRetrieved {len(baseline_results)} documents:\n")
for i, doc in enumerate(baseline_results, 1):
    print(f"[{i}] {doc.page_content[:120]}...")
    print()

Query: How do attention mechanisms work in transformers?

BASELINE RETRIEVAL (No Expansion)

Retrieved 4 documents:

[1] The modern version of the transformer was proposed in the 2017 paper "Attention Is All You Need" by researchers at Googl...

[2] == Variants ==

Many variants of attention implement soft weights, such as...

[3] Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of using in...

[4] Mamba-2 serves as a successor to Mamba by introducing a new theoretical and computational framework called Structured St...



### Technique 1: Multi-Query Expansion

**Concept**: Generate 3-5 alternative phrasings, retrieve with all, merge results.

**Why It Works**: Different phrasings match different documents, improving recall.

In [7]:
multiquery_prompt = PromptTemplate.from_template("""
Generate 3 different versions of this question to retrieve relevant documents.
Use different technical terminology and synonyms while keeping the core intent.

Original question: {question}

Provide exactly 3 alternative versions, one per line:
""")

multiquery_chain = multiquery_prompt | llm | StrOutputParser()

print("Multi-query expansion chain ready")

Multi-query expansion chain ready


In [8]:
def multiquery_retrieval(query: str, retriever, top_k: int = 4):
    """Retrieve using multiple query variations."""
    variations_text = multiquery_chain.invoke({"question": query})
    variations = [
        line.strip().lstrip('0123456789.-) ')
        for line in variations_text.strip().split("\n")
        if line.strip()
    ]
    
    all_queries = [query] + variations
    
    print(f"Searching with {len(all_queries)} queries:")
    for i, q in enumerate(all_queries, 1):
        print(f"  {i}. {q}")
    print()
    
    all_docs = []
    seen_content = set()
    
    for q in all_queries:
        docs = retriever.invoke(q)
        for doc in docs:
            content_hash = hash(doc.page_content)
            if content_hash not in seen_content:
                seen_content.add(content_hash)
                all_docs.append(doc)
    
    print(f"Retrieved {len(all_docs)} unique documents\n")
    return all_docs[:top_k]

# Test
print("=" * 70)
print("MULTI-QUERY EXPANSION")
print("=" * 70)
multiquery_results = multiquery_retrieval(test_query, base_retriever, top_k=4)
for i, doc in enumerate(multiquery_results, 1):
    print(f"\n[{i}] {doc.page_content[:120]}...")

MULTI-QUERY EXPANSION
Searching with 4 queries:
  1. How do attention mechanisms work in transformers?
  2. What is the underlying process of attention mechanisms in transformer architectures?
  3. Can you explain how self‑attention operates within a transformer model?
  4. Describe the functioning of attention modules in transformer‑based networks.

Retrieved 10 unique documents


[1] The modern version of the transformer was proposed in the 2017 paper "Attention Is All You Need" by researchers at Googl...

[2] == Variants ==

Many variants of attention implement soft weights, such as...

[3] Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of using in...

[4] Mamba-2 serves as a successor to Mamba by introducing a new theoretical and computational framework called Structured St...
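
In the run above, the four documents returned are the same four as the baseline. That follows from the merge step: `multiquery_retrieval` appends documents in query order and then keeps the first `top_k`, so the original query's results fill every slot. One common alternative is reciprocal rank fusion (RRF), which rewards documents that several variations agree on. The sketch below is an assumption about how the merge could be done, not part of the original notebook; `rrf_merge` is a hypothetical helper, it works on plain string ids rather than `Document` objects, and `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict

def rrf_merge(ranked_lists: list[list[str]], top_k: int = 4, k: int = 60) -> list[str]:
    """Fuse several ranked result lists with reciprocal rank fusion.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1, so items returned by many queries rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; sorted() is stable, so ties keep first-seen order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

# Hypothetical chunk ids returned for the original query and two variations.
results_per_query = [
    ["c1", "c2", "c3", "c4"],
    ["c5", "c2", "c6", "c1"],
    ["c2", "c5", "c7", "c8"],
]
fused = rrf_merge(results_per_query, top_k=4)
print(fused)
```

Here `c2` ranks first because all three queries return it, even though it was never the top hit for the original query. To apply the same idea to retriever output, one option is to key the scores on `doc.page_content` and keep a mapping back to the `Document` objects.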


### Technique 2: Query Rewriting

**Concept**: Rewrite query to be more specific and technical (1 improved version → 1 search).

**When to Use**: Vague or informal queries need enhancement.

In [9]:
rewriting_prompt = PromptTemplate.from_template("""
Rewrite this question to be more specific and include relevant technical terms.
Add key concepts and synonyms for better retrieval.

Original query: {query}

Rewritten query:
""")

rewriting_chain = rewriting_prompt | llm | StrOutputParser()

# Test
test_queries = ["how transformers work", "what is attention"]
print("Query Rewriting Examples")
print("=" * 70)
for query in test_queries:
    rewritten = rewriting_chain.invoke({"query": query})
    print(f"\nOriginal:  {query}")
    print(f"Rewritten: {rewritten}")

Query Rewriting Examples

Original:  how transformers work
Rewritten: **Rewritten query:**  
How do transformer neural network architectures operate—specifically the self‑attention mechanism, multi‑head attention, positional encoding, layer normalization, and feed‑forward sub‑layers—and what are the underlying principles (e.g., attention‑based deep learning, encoder‑decoder structure, transformer models, and attention mechanisms) that enable their performance?

Original:  what is attention
Rewritten: **Rewritten query:**  

*What is the attention mechanism in deep learning—specifically the scaled‑dot‑product, self‑attention, and multi‑head attention used in Transformer architectures—and how do related concepts such as context weighting, query‑key‑value interactions, alignment scores, and focus/weighting of input representations function and differ?*
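
The rewritten outputs above start with a `**Rewritten query:**` label and carry markdown emphasis. If such a string is passed straight to a retriever, the label is embedded along with the query. A minimal cleanup sketch follows; `clean_llm_query` is a hypothetical helper, and the regex assumes the label is a short phrase (up to 40 characters) ending in a colon, which would also strip a legitimate short prefix such as "Attention:" from a real query.

```python
import re

def clean_llm_query(text: str) -> str:
    """Strip a leading 'Label:' prefix and markdown emphasis from an LLM-generated query."""
    cleaned = text.strip()
    # Drop a leading label such as "**Rewritten query:**" (optional asterisks on both sides).
    cleaned = re.sub(r"^\**[^:\n*]{1,40}:\**\s*", "", cleaned)
    # Remove surrounding emphasis markers, then collapse internal whitespace.
    cleaned = cleaned.strip("*_ \n")
    return " ".join(cleaned.split())

raw = "**Rewritten query:**  \n\n*What is the attention mechanism in deep learning?*"
print(clean_llm_query(raw))
```

The same cleanup applies to any chain output that is later used as a search query. Tightening the prompt (for example, asking for the query text only) is a complementary option.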


### Technique 3: Step-Back Prompting

**Concept**: Generate a broader, conceptual version of the query to find background documents.

**Example**:
- Specific: "How do I configure FAISS indexing?"
- Step-back: "What are vector databases and how do they work?"

**Why It Works**: Specific queries miss foundational documents. Broader queries provide context.

**Pattern**: Search with BOTH step-back and original, merge results.

In [10]:
stepback_prompt = PromptTemplate.from_template("""
Given a specific question, generate a broader "step-back" question that would help 
understand the general context and foundational concepts.

Examples:
- Specific: "How does scaled dot-product attention work?"
  Step-back: "What are the different types of attention mechanisms?"

- Specific: "What is the role of positional encoding?"
  Step-back: "How do transformers handle sequence information?"

Original question: {question}

Generate a broader step-back question:
""")

stepback_chain = stepback_prompt | llm | StrOutputParser()

print("Step-back prompting chain ready")

Step-back prompting chain ready


In [11]:
def stepback_retrieval(query: str, retriever, top_k: int = 4):
    """Retrieve using both original and step-back queries."""
    stepback_query = stepback_chain.invoke({"question": query}).strip()
    
    print(f"Original query: {query}")
    print(f"Step-back query: {stepback_query}")
    print()
    
    # Retrieve from both
    original_docs = retriever.invoke(query)
    stepback_docs = retriever.invoke(stepback_query)
    
    # Interleave results
    combined = []
    seen = set()
    
    for i in range(max(len(original_docs), len(stepback_docs))):
        if i < len(original_docs):
            doc = original_docs[i]
            if doc.page_content not in seen:
                combined.append(doc)
                seen.add(doc.page_content)
        
        if i < len(stepback_docs):
            doc = stepback_docs[i]
            if doc.page_content not in seen:
                combined.append(doc)
                seen.add(doc.page_content)
        
        if len(combined) >= top_k:
            break
    
    return combined[:top_k]

# Test
print("=" * 70)
print("STEP-BACK PROMPTING")
print("=" * 70)
stepback_results = stepback_retrieval(
    "What is the purpose of multi-head attention?",
    base_retriever,
    top_k=4
)
print("\nFinal results (interleaved):")
for i, doc in enumerate(stepback_results, 1):
    print(f"\n[{i}] {doc.page_content[:120]}...")

STEP-BACK PROMPTING
Original query: What is the purpose of multi-head attention?
Step-back query: **Step‑back question:**  
*How do attention mechanisms, including multi‑head attention, enable transformers to process and represent sequence information?*


Final results (interleaved):

[1] . At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens v...

[2] Additional surveys of the attention mechanism in deep learning are provided by Niu et al. and Soydaner.
The major breakt...

[3] == Career ==
Noam Shazeer joined Google in 2000. One of his first major achievements was improving the spelling correcto...

[4] == Variants ==

Many variants of attention implement soft weights, such as...


### Technique 4: HyDE (Hypothetical Document Embeddings)

**Concept**: Generate a hypothetical answer, then search with that instead of the query.

**The Insight**: 
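- Queries and documents are embedded by the same model, but a short question and a long declarative passage often land in different regions of the vector space
- "How do transformers work?" (query) can sit far from "Transformers use self-attention..." (document)
- A hypothetical answer is phrased like a document, so it tends to land closer to real documents in embedding space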

**Process**:
1. Query: "How do transformers process sequences?"
2. Generate hypothetical answer: "Transformers process sequences using self-attention mechanisms that..."
3. Embed the hypothetical answer (not the query)
4. Search for documents similar to hypothetical answer

**When to Use**: Complex questions where you want documents containing answers, not documents repeating questions.
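
The intuition can be illustrated with a toy similarity measure. The sketch below uses bag-of-words cosine similarity as a crude stand-in for a learned embedding model, so it only shows the direction of the effect; the three example strings and the `bow_cosine` helper are made up for illustration.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

query = "how do transformers process sequences"
hypothetical = "transformers process sequences using self-attention over all tokens in parallel"
document = "self-attention lets transformers relate all tokens in a sequence in parallel"

sim_query = bow_cosine(query, document)
sim_hyde = bow_cosine(hypothetical, document)

print(f"query vs document:        {sim_query:.2f}")
print(f"hypothetical vs document: {sim_hyde:.2f}")
```

The hypothetical answer shares far more vocabulary with the document than the question does. The hypothetical answer may contain factual errors; HyDE relies on its phrasing and topic, not its correctness, to find real documents.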

In [12]:
hyde_prompt = PromptTemplate.from_template("""
Generate a concise, factual answer to the following question as if you were writing 
technical documentation. The answer should be 2-3 sentences.

Question: {question}

Hypothetical answer:
""")

hyde_chain = hyde_prompt | llm | StrOutputParser()

print("HyDE chain ready")

HyDE chain ready


In [13]:
def hyde_retrieval(query: str, vectorstore, top_k: int = 4):
    """Retrieve using HyDE - search with hypothetical answer."""
    hypothetical_answer = hyde_chain.invoke({"question": query}).strip()
    
    print(f"Original query: {query}")
    print(f"\nHypothetical answer:\n{hypothetical_answer}")
    print()
    
    # Search using the hypothetical answer
    docs = vectorstore.similarity_search(hypothetical_answer, k=top_k)
    
    return docs

# Test
print("=" * 70)
print("HyDE (HYPOTHETICAL DOCUMENT EMBEDDINGS)")
print("=" * 70)
hyde_results = hyde_retrieval(
    "How do transformers handle long sequences?",
    vectorstore,
    top_k=4
)
print("Retrieved documents:")
for i, doc in enumerate(hyde_results, 1):
    print(f"\n[{i}] {doc.page_content[:120]}...")

HyDE (HYPOTHETICAL DOCUMENT EMBEDDINGS)
Original query: How do transformers handle long sequences?

Hypothetical answer:
Transformers process long sequences by applying self‑attention, which compares every token to every other token, giving a quadratic O(N²) time and memory cost in the sequence length N. To mitigate this, modern architectures replace full attention with efficient variants—such as sliding‑window, dilated, or sparse attention, linear‑complexity kernels, and memory‑augmented or recurrent mechanisms—that limit the number of token‑pair interactions while preserving contextual information. These approaches enable handling of sequences that are orders of magnitude longer than the original transformer design.

Retrieved documents:

[1] Transformers have the advantage of having no recurrent units, therefore requiring less training time than earlier recurr...

[2] Additional surveys of the attention mechanism in deep learning are provided by Niu et al. and Soydaner.
The major br

### Technique 5: Query Decomposition

**Concept**: Break complex queries into simpler sub-queries, retrieve for each, then combine.

**When to Use**: 
- Multi-part questions: "Explain transformers and how they differ from RNNs"
- Questions requiring multiple pieces of information
- Comparison questions

**Process**:
1. Decompose: "How do transformers work?" + "How do RNNs work?" + "What's the difference?"
2. Retrieve independently for each sub-query
3. Combine results

**Benefit**: Each sub-query focuses on one aspect, improving precision.
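
Step 1 depends on turning free-form LLM output into a clean list of sub-queries. The following is a minimal parsing sketch, assuming the model returns one sub-question per line, possibly numbered and possibly preceded by a header line. `parse_sub_questions` is a hypothetical helper; it keeps only lines ending in a question mark, so a sub-question phrased as a statement would be dropped.

```python
import re

def parse_sub_questions(llm_output: str, max_questions: int = 4) -> list[str]:
    """Extract sub-questions from LLM output, skipping headers and blank lines."""
    questions = []
    for line in llm_output.splitlines():
        # Remove list numbering or bullets such as "1.", "2)", "-", "*".
        text = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        # Keep only lines that read as questions; this drops headers like "Sub-questions:".
        if text.endswith("?"):
            questions.append(text)
    return questions[:max_questions]

sample = """Sub-questions:
1. How do transformers work?
2. How do RNNs work?

3. What's the difference?"""

sub_questions = parse_sub_questions(sample)
print(sub_questions)
```

Capping the list with `max_questions` also bounds the number of retrievals per request, which keeps latency predictable.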

In [14]:
decomposition_prompt = PromptTemplate.from_template("""
Break down this complex question into 2-4 simpler sub-questions that, 
when answered together, would fully address the original question.

Example:
Complex: "Compare attention mechanisms in transformers and RNNs"
Sub-questions:
1. How do attention mechanisms work in transformers?
2. How do attention mechanisms work in RNNs?
3. What are the key differences?

Complex question: {question}

List sub-questions (one per line):
""")

decomposition_chain = decomposition_prompt | llm | StrOutputParser()

print("Query decomposition chain ready")

Query decomposition chain ready


In [15]:
def decomposed_retrieval(query: str, retriever, k_per_query: int = 2):
    """Decompose query and retrieve for each sub-query."""
    decomposition_text = decomposition_chain.invoke({"question": query})
    sub_queries = [
        q.strip().lstrip('0123456789.-) ') 
        for q in decomposition_text.strip().split("\n") 
        if q.strip()
    ]
    
    print(f"Original query: {query}")
    print(f"\nDecomposed into {len(sub_queries)} sub-queries:")
    for i, sq in enumerate(sub_queries, 1):
        print(f"  {i}. {sq}")
    print()
    
    all_docs = []
    seen = set()
    
    for sq in sub_queries:
        docs = retriever.invoke(sq)
        for doc in docs[:k_per_query]:
            if doc.page_content not in seen:
                all_docs.append(doc)
                seen.add(doc.page_content)
    
    print(f"Retrieved {len(all_docs)} unique documents\n")
    return all_docs

# Test
print("=" * 70)
print("QUERY DECOMPOSITION")
print("=" * 70)
complex_query = "Explain the transformer architecture and its advantages over RNNs"
decomposed_results = decomposed_retrieval(
    complex_query,
    base_retriever,
    k_per_query=2
)
print("Retrieved documents:")
for i, doc in enumerate(decomposed_results[:4], 1):
    print(f"\n[{i}] {doc.page_content[:120]}...")

QUERY DECOMPOSITION
Original query: Explain the transformer architecture and its advantages over RNNs

Decomposed into 4 sub-queries:
  1. What are the core components and workflow of the transformer architecture (e.g., encoder‑decoder layers, multi‑head self‑attention, positional encoding, feed‑forward networks)?
  2. How does self‑attention operate within a transformer and how does it differ from the sequential processing used in RNNs?
  3. What specific advantages do transformers have over RNNs regarding parallelism, handling long‑range dependencies, training efficiency, and scalability?
  4. In which practical tasks or scenarios do transformers outperform RNNs, and what trade‑offs (e.g., computational cost, data requirements) should be considered?

Retrieved 6 unique documents

Retrieved documents:

[1] In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechani...

[2] . The first representational layer may attempt to id

### Comparing All Techniques

Let's compare all approaches on the same query.

In [16]:
import time

comparison_query = "How does self-attention work in transformers?"

print("=" * 80)
print("COMPARISON: Query Expansion Techniques")
print("=" * 80)
print(f"\nQuery: {comparison_query}\n")

# Baseline
print("[1] BASELINE")
start = time.time()
baseline = base_retriever.invoke(comparison_query)
time_baseline = time.time() - start
print(f"Time: {time_baseline:.3f}s | Docs: {len(baseline)}")

# Multi-query
print("\n[2] MULTI-QUERY")
start = time.time()
multiquery = multiquery_retrieval(comparison_query, base_retriever, top_k=4)
time_multiquery = time.time() - start
print(f"Time: {time_multiquery:.3f}s ({time_multiquery/time_baseline:.1f}x)")

# Step-back
print("\n[3] STEP-BACK")
start = time.time()
stepback = stepback_retrieval(comparison_query, base_retriever, top_k=4)
time_stepback = time.time() - start
print(f"Time: {time_stepback:.3f}s ({time_stepback/time_baseline:.1f}x)")

# HyDE
print("\n[4] HyDE")
start = time.time()
hyde = hyde_retrieval(comparison_query, vectorstore, top_k=4)
time_hyde = time.time() - start
print(f"Time: {time_hyde:.3f}s ({time_hyde/time_baseline:.1f}x)")

print("\n" + "=" * 80)

COMPARISON: Query Expansion Techniques

Query: How does self-attention work in transformers?

[1] BASELINE
Time: 0.348s | Docs: 4

[2] MULTI-QUERY
Searching with 4 queries:
  1. How does self-attention work in transformers?
  2. What is the mechanism behind self‑attention in transformer architectures?
  3. How does the self‑attention module operate within a transformer model?
  4. Can you explain the inner workings of self‑attention in transformer networks?

Retrieved 7 unique documents

Time: 0.624s (1.8x)

[3] STEP-BACK
Original query: How does self-attention work in transformers?
Step-back query: **Step‑back question:**  
*What are the core principles and components of attention mechanisms that enable transformers to model sequential data?*

Time: 0.457s (1.3x)

[4] HyDE
Original query: How does self-attention work in transformers?

Hypothetical answer:
Self‑attention computes a weighted sum of token representations, where the weights are derived from pairwise similarity scores betw
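
The comparison above takes a single measurement per technique, and single-run timings tend to be noisy, especially when an LLM call over the network is involved. A minimal sketch of a repeat-and-take-the-median helper is shown below. `time_call` and `fake_retrieval` are hypothetical names; the fake function stands in for any of the retrieval functions so the sketch runs on its own.

```python
import statistics
import time
from typing import Any, Callable

def time_call(fn: Callable[..., Any], *args: Any, repeats: int = 5, **kwargs: Any) -> tuple[float, Any]:
    """Run fn several times and return (median seconds, last result)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), result

# Stand-in for a retrieval function (assumption: replace with a real one).
def fake_retrieval(query: str) -> list[str]:
    time.sleep(0.01)
    return [f"doc for {query}"]

median_s, docs = time_call(fake_retrieval, "self-attention", repeats=3)
print(f"median: {median_s:.3f}s | docs: {len(docs)}")
```

Repeating a call that hits an LLM multiplies API usage, so a small `repeats` value is usually enough for a rough comparison.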

### Agent Integration: All Techniques with LangChain Agents

Now we'll build a complete agent that can use all expansion techniques.

**Why Use Agents**:
- Agents decide which expansion strategy to use
- Multi-step reasoning: expand → retrieve → analyze → answer
- Can combine multiple techniques

This shows how query expansion fits into production agent systems.

In [17]:
# Define tools for the agent

@tool
def search_multiquery(query: str) -> str:
    """Search using multi-query expansion. Use for complex technical questions requiring comprehensive coverage."""
    docs = multiquery_retrieval(query, base_retriever, top_k=3)
    results = [f"[{i+1}] {doc.page_content[:250]}..." for i, doc in enumerate(docs)]
    return "\n\n".join(results)

@tool
def search_stepback(query: str) -> str:
    """Search with step-back prompting. Use when user needs both specific details and broader context."""
    docs = stepback_retrieval(query, base_retriever, top_k=3)
    results = [f"[{i+1}] {doc.page_content[:250]}..." for i, doc in enumerate(docs)]
    return "\n\n".join(results)

@tool
def search_hyde(query: str) -> str:
    """Search using HyDE. Use for questions where you want documents containing detailed explanations."""
    docs = hyde_retrieval(query, vectorstore, top_k=3)
    results = [f"[{i+1}] {doc.page_content[:250]}..." for i, doc in enumerate(docs)]
    return "\n\n".join(results)

@tool
def search_decomposed(query: str) -> str:
    """Search with query decomposition. Use for complex multi-part or comparison questions."""
    docs = decomposed_retrieval(query, base_retriever, k_per_query=2)
    results = [f"[{i+1}] {doc.page_content[:250]}..." for i, doc in enumerate(docs[:3])]
    return "\n\n".join(results)

tools = [search_multiquery, search_stepback, search_hyde, search_decomposed]

print(f"Defined {len(tools)} tools for the agent")

Defined 4 tools for the agent


In [None]:
# Create agent using LangChain v1.x create_agent
agent = create_agent(
    model=llm,
    tools=tools,
    system_prompt="""
You are a helpful AI assistant specializing in deep learning and transformers.

You have access to different search strategies:
- search_multiquery: For technical questions needing comprehensive coverage
- search_stepback: For questions needing both details and broader context  
- search_hyde: For questions wanting detailed explanations
- search_decomposed: For complex multi-part or comparison questions

Choose the appropriate strategy based on the query type.
Provide clear, accurate answers based on the retrieved information.
"""
)

print("Agent created successfully")
print(f"Available tools: {[tool.name for tool in tools]}")

Agent created successfully
Available tools: ['search_multiquery', 'search_stepback', 'search_hyde', 'search_decomposed']


In [None]:
# Test agent with different query types
test_questions = [
    "Explain self-attention in transformers",
    "Compare transformers and RNNs for sequence modeling",
    "What are the advantages of transformer architecture?"
]

for i, question in enumerate(test_questions, 1):
    print("="*80)
    print(f"QUERY {i}: {question}")
    print("="*80)
    
    result = agent.invoke({
        "messages": [{"role": "user", "content": question}]
    })
    
    print("\nAgent Response:")
    print("-" * 80)
    print(result["messages"][-1].content)
    print("\n")

QUERY 1: Explain self-attention in transformers
Original query: self-attention in transformers

Hypothetical answer:
Self‑attention is the core mechanism in transformer models that allows each token in a sequence to weigh and aggregate information from every other token. For each token, three learned projections produce a query vector **q**, a set of key vectors **k**, and value vectors **v**; attention scores are computed as the scaled dot‑product \( \text{softmax}\big(\frac{q\,k^{\top}}{\sqrt{d_k}}\big) \) and used to form a weighted sum of the values, yielding a context‑aware representation. This operation is performed in parallel for all tokens and across multiple heads to capture diverse relational patterns.


Agent Response:
--------------------------------------------------------------------------------
Self-attention is a key component of the transformer architecture in deep learning. It allows the model to capture global dependencies and relationships between different element

### Production Recommendations

**When to Use Each Technique**:

1. **Multi-Query Expansion**
   - Best for: Terminology mismatch, synonym problems
   - Cost: High (1 LLM call + 3-5 retrievals)
   - Use when: Recall matters more than latency

2. **Query Rewriting**
   - Best for: Vague or informal queries
   - Cost: Low (1 LLM call + 1 retrieval)
   - Use when: You want a quality boost with little added latency

3. **Step-Back Prompting**
   - Best for: Specific questions needing context
   - Cost: Low (1 LLM call + 2 retrievals)
   - Use when: Users need background + details

4. **HyDE**
   - Best for: Technical documentation search
   - Cost: Medium (1 LLM call that generates a full answer + 1 retrieval)
   - Use when: Documents are very technical

5. **Query Decomposition**
   - Best for: Complex multi-part questions
   - Cost: High (1 LLM call + N retrievals, one per sub-query)
   - Use when: Comparison or analysis queries

**Implementation Strategy**:

- Start with: Query rewriting (low cost, good results)
- Add selectively: Multi-query for critical queries
- Use agents: Let them choose strategy based on query type

**Optimization Tips**:
- Cache expanded queries
- Use async for parallel retrieval
- Set confidence thresholds
- Monitor: precision@k, answer quality, latency
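
The first two tips can be sketched together. The example below is a minimal illustration under stated assumptions: `expand_query` and `fake_retrieve` are hypothetical stand-ins for the LLM expansion chain and an async retriever call, and the cache is an in-process `functools.lru_cache`, which would not be shared across workers.

```python
import asyncio
from functools import lru_cache

@lru_cache(maxsize=1024)
def expand_query(query: str) -> tuple[str, ...]:
    """Stand-in for an LLM expansion call; results are cached per query string."""
    # Hypothetical variations; a real implementation would call the expansion chain here.
    return (query, f"{query} explained", f"{query} overview")

async def fake_retrieve(query: str) -> list[str]:
    """Stand-in for an async retriever call (for example, a network request)."""
    await asyncio.sleep(0.01)
    return [f"result for: {query}"]

async def retrieve_all(query: str) -> list[str]:
    """Run retrieval for every variation concurrently and de-duplicate in order."""
    variations = expand_query(query)
    per_query = await asyncio.gather(*(fake_retrieve(v) for v in variations))
    seen, merged = set(), []
    for results in per_query:
        for item in results:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

merged = asyncio.run(retrieve_all("self-attention"))
print(merged)
print(expand_query.cache_info())
```

Inside a Jupyter notebook an event loop is already running, so `await retrieve_all("self-attention")` replaces `asyncio.run(...)`. LangChain runnables expose an `ainvoke` method that could take the place of `fake_retrieve`.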

**Cost per 1000 queries (estimates)**:
- Baseline: ~$0.01
- Query rewriting: +$0.10-0.50
- Multi-query: +$0.30-1.50
- Step-back: +$0.10-0.50
- HyDE: +$0.20-0.80
- Decomposition: +$0.40-2.00
- Agent: +$0.50-2.00