# Research Synthesizer Demo

This notebook demonstrates an **Agentic RAG system** that can:
1. **Ingest** research papers from arXiv
2. **Index** them in a vector database (Chroma)
3. **Query** using basic RAG or an advanced agentic approach
4. **Evaluate** both approaches with LLM-as-judge

## What makes this "Agentic"?
- **Query Decomposition**: Complex questions are broken into simpler sub-questions
- **Multi-hop Retrieval**: Each sub-question retrieves relevant context
- **Synthesis**: Sub-answers are combined into a comprehensive response
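
The three steps above can be sketched as a small control loop. This is a minimal illustration, not the code in `src.agent`: the function `agentic_answer` and the toy stand-ins for the LLM and the retriever are hypothetical and exist only so the sketch runs on its own.

```python
from typing import Callable


def agentic_answer(
    question: str,
    decompose: Callable[[str], list[str]],
    answer_one: Callable[[str], str],
    synthesize: Callable[[str, list[tuple[str, str]]], str],
) -> dict:
    """Hypothetical control loop: decompose, answer each part, synthesize."""
    sub_questions = decompose(question)
    # One retrieval + generation pass per sub-question.
    sub_answers = [(q, answer_one(q)) for q in sub_questions]
    return {
        "sub_questions": sub_questions,
        "answer": synthesize(question, sub_answers),
    }


# Toy stand-ins (assumptions) so the sketch runs without an LLM or an index.
def toy_decompose(question: str) -> list[str]:
    return [f"{question} [aspect {i}]" for i in (1, 2)]


def toy_answer(sub_question: str) -> str:
    return f"answer to: {sub_question}"


def toy_synthesize(question: str, pairs: list[tuple[str, str]]) -> str:
    return " | ".join(answer for _, answer in pairs)


result = agentic_answer("What is RAG?", toy_decompose, toy_answer, toy_synthesize)
print(result["sub_questions"])
print(result["answer"])
```

Passing the three stages in as callables keeps the loop independent of any particular LLM or vector store, which is convenient for testing.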

## Setup

First, let's import our modules and load the index.

In [None]:
import sys
import os
from pathlib import Path

# Change to the project root (parent of notebooks/) and add it to sys.path
# *before* importing from src, so the imports resolve when run from notebooks/.
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
os.chdir(project_root)
sys.path.insert(0, str(project_root))

from src.retriever import load_index, retrieve
from src.query_engine import create_query_engine, query_with_sources
from src.agent import create_synthesis_agent
from src.decomposition import decompose_query
from src.config import get_llm_with_fallback

print(f"Working directory: {os.getcwd()}")

In [2]:
# Load the vector index
print("Loading index...")
index = load_index()
print("Index loaded!")

Loading index...


Loading weights: 100%|██████████| 103/103 [00:00<00:00, 1990.52it/s, Materializing param=pooler.dense.weight]                             
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Index loaded!


## 1. Basic Retrieval

Let's start with simple retrieval to see what documents match a query.
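
As a rough picture of what a similarity score means, the sketch below ranks a few hand-made vectors by cosine similarity. It assumes cosine similarity as the metric, which this notebook does not confirm for the Chroma index, and `cosine` and `top_k` are hypothetical helpers rather than part of `src.retriever`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 3):
    """Return the k (score, text) pairs with the highest similarity to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
toy_chunks = [("chunk a", [1.0, 0.0]), ("chunk b", [0.0, 1.0]), ("chunk c", [1.0, 1.0])]
ranked = top_k([1.0, 0.0], toy_chunks, k=2)
for score, text in ranked:
    print(f"{text}: {score:.4f}")
```

A vector store does the same ranking with an approximate nearest-neighbor index instead of scoring every chunk.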

In [3]:
# Basic retrieval
results = retrieve("What is retrieval augmented generation?", index, top_k=3)

print("Top 3 Retrieved Chunks:")
print("=" * 60)
for i, node in enumerate(results):
    print(f"\n--- Result {i+1} (score: {node.score:.4f}) ---")
    print(f"Title: {node.node.metadata.get('title', 'N/A')}")
    print(f"Text: {node.node.get_content()[:300]}...")

Top 3 Retrieved Chunks:

--- Result 1 (score: 0.4190) ---
Title: AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
Text: A
promising approach to mitigating these challenges is retrieval-augmented generation (RAG), which
enhances the generation process by incorporating real-world images as additional references [8, 3].
While RAG has been extensively explored in the language domain [23, 13], its application to image
and...

--- Result 2 (score: 0.3763) ---
Title: AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
Text: By comprehensively examining images produced by Im-
ageRAG alongside their corresponding retrieved reference images, we identify two critical challenges
inherent in image-level retrieval augmentation approaches. First, these methods tend to overcopy
irrelevant visual elements from retrieved referenc...

--- Result 3 (score: 0.3564) ---
Title: AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
Text: Retrieval-augmented multim

## 2. Basic RAG (Baseline)

Now let's use the full RAG pipeline with answer generation.

In [4]:
# Create basic query engine
query_engine = create_query_engine(index)

# Ask a simple question
question = "What is retrieval augmented generation?"
result = query_with_sources(query_engine, question)

print(f"Question: {question}")
print("\n" + "=" * 60)
print(f"\nAnswer:\n{result['answer']}")
print(f"\n\nSources ({len(result['sources'])})")
for s in result['sources']:
    print(f"  - {s['title']} (score: {s['score']:.4f})")

Using model: moonshotai/kimi-k2.5



Question: What is retrieval augmented generation?


Answer:
 Retrieval-augmented generation (RAG) is a paradigm that enhances generative models by incorporating external knowledge or references during the generation process. Originally developed for natural language processing to retrieve relevant documents, it extends to image generation by utilizing external visual references—ranging from full images to finer granularities such as patches—to condition and guide the creation of visual content. In autoregressive generation frameworks, patch-level retrieval methods offer distinct advantages over image-level approaches, providing more precise conditioning that enhances generation quality and control.


Sources (5)
  - AR-RAG: Autoregressive Retrieval Augmentation for Image Generation (score: 0.4190)
  - AR-RAG: Autoregressive Retrieval Augmentation for Image Generation (score: 0.3763)
  - AR-RAG: Autoregressive Retrieval Augmentation for Image Generation (score: 0.3564)
  - AR-RAG: Autor

## 3. Query Decomposition

For complex questions, we first break them into sub-questions.
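
One plausible way to implement this step is to ask the LLM for a numbered list and parse its reply. The sketch below covers only the parsing half; `parse_sub_questions` is a hypothetical helper, and how `decompose_query` actually prompts and parses is not shown in this notebook.

```python
import re

# Matches lines such as "1. text", "2) text", "- text" or "* text".
_LIST_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)\s*$")


def parse_sub_questions(raw: str, max_questions: int = 4) -> list[str]:
    """Extract list items from an LLM reply, ignoring any surrounding prose."""
    questions = []
    for line in raw.splitlines():
        match = _LIST_ITEM.match(line)
        if match:
            questions.append(match.group(1))
    return questions[:max_questions]


reply = """Here are the sub-questions:
1. What retrieval methods exist?
2) How is effectiveness measured?

- How do the methods compare?
"""
parsed = parse_sub_questions(reply)
print(parsed)
```

Capping the count with `max_questions` bounds the number of downstream retrieval calls; the default of 4 is an arbitrary choice for this sketch.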

In [5]:
# Get LLM for decomposition
llm = get_llm_with_fallback()

# Complex question
complex_question = "Compare different retrieval methods and their effectiveness for RAG systems"

# Decompose
sub_questions = decompose_query(complex_question, llm)

print(f"Original Question:\n{complex_question}")
print("\n" + "=" * 60)
print("\nDecomposed into sub-questions:")
for i, q in enumerate(sub_questions):
    print(f"  {i+1}. {q}")

Using model: moonshotai/kimi-k2.5
Original Question:
Compare different retrieval methods and their effectiveness for RAG systems


Decomposed into sub-questions:
  1. What are the primary categories of retrieval methods used in RAG systems (e.g., sparse, dense, and hybrid approaches)?
  2. What evaluation metrics are commonly used to measure retrieval effectiveness in RAG systems?
  3. How do different retrieval methods compare in terms of retrieval accuracy and result relevance for RAG applications?
  4. What are the computational efficiency and scalability trade-offs between different retrieval methods in RAG systems?


## 4. Agentic RAG (Synthesis Agent)

The synthesis agent:
1. Decomposes the query
2. Answers each sub-question with retrieval
3. Synthesizes all answers into a final response
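
Because each sub-question retrieves its own chunks, the same paper can appear several times in the combined source list. The sketch below shows one way to merge per-sub-question sources, keeping the best-scoring entry per title. `merge_sources` is a hypothetical helper; the `title` and `score` keys mirror the source dictionaries printed in this notebook, but whether `src.agent` deduplicates this way is an assumption.

```python
def merge_sources(per_question_sources: list[list[dict]]) -> list[dict]:
    """Keep the highest-scoring source per title, sorted by score descending."""
    best: dict[str, dict] = {}
    for sources in per_question_sources:
        for source in sources:
            title = source["title"]
            current = best.get(title)
            if current is None or source.get("score", 0.0) > current.get("score", 0.0):
                best[title] = source
    return sorted(best.values(), key=lambda s: s.get("score", 0.0), reverse=True)


# Toy data: two sub-questions whose retrievals overlap on "Paper A".
merged = merge_sources([
    [{"title": "Paper A", "score": 0.41}, {"title": "Paper B", "score": 0.35}],
    [{"title": "Paper A", "score": 0.52}, {"title": "Paper C", "score": 0.30}],
])
for s in merged:
    print(f"{s['title']} (score: {s['score']:.2f})")
```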

In [6]:
# Create synthesis agent
agent = create_synthesis_agent(index)

# Ask a complex question
complex_question = "What do papers say about chunking strategies and their impact on RAG performance?"

print(f"Question: {complex_question}")
print("\n" + "=" * 60)
print("Processing...\n")

result = agent(complex_question)

Using model: moonshotai/kimi-k2.5



Using model: moonshotai/kimi-k2.5



Question: What do papers say about chunking strategies and their impact on RAG performance?

Processing...

Decomposing query: What do papers say about chunking strategies and their impact on RAG performance?
Sub-questions: ['What specific chunking strategies (e.g., fixed-size, semantic, recursive, agentic) have been proposed and evaluated in RAG research literature?', 'How do chunk size and overlap parameters affect retrieval accuracy and recall in RAG systems?', 'What impact do different chunking approaches have on the factual accuracy and coherence of generated responses in RAG pipelines?', 'What comparative analyses exist regarding the trade-offs between semantic chunking versus fixed-size chunking for different document types in RAG?']

Answering sub-question 1: What specific chunking strategies (e.g., fixed-size, semantic, recursive, agentic) have been proposed and evaluated in RAG research literature?

Answering sub-question 2: How do chunk size and overlap parameters affect ret

In [7]:
# Display results
print("Sub-questions generated:")
for i, q in enumerate(result['sub_questions']):
    print(f"  {i+1}. {q}")

print("\n" + "=" * 60)
print("\nFinal Synthesized Answer:")
print(result['answer'])

print("\n" + "=" * 60)
print(f"\nSources ({len(result['sources'])})")
for s in result['sources']:
    print(f"  - {s['title']}")

Sub-questions generated:
  1. What specific chunking strategies (e.g., fixed-size, semantic, recursive, agentic) have been proposed and evaluated in RAG research literature?
  2. How do chunk size and overlap parameters affect retrieval accuracy and recall in RAG systems?
  3. What impact do different chunking approaches have on the factual accuracy and coherence of generated responses in RAG pipelines?
  4. What comparative analyses exist regarding the trade-offs between semantic chunking versus fixed-size chunking for different document types in RAG?


Final Synthesized Answer:
 The provided research context offers limited direct discussion of specific text chunking strategies (e.g., fixed-size, semantic, recursive, or agentic chunking) as explicitly defined in the RAG literature. However, the available documents provide relevant insights regarding **retrieval granularity** and **structured document processing** that inform the impact of chunking approaches on RAG performance.

## Re

## 5. Comparison: Basic RAG vs Agentic RAG

Let's compare both approaches on the same question.
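
To put a number on the latency side of the comparison, each call can be wrapped in a small timer. `timed` is a hypothetical helper, not part of this project; the commented usage lines reuse names from this notebook.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs on its own.
total, seconds = timed(sum, range(1_000_000))
print(f"sum={total} took {seconds:.4f}s")

# In this notebook the same wrapper could be applied as:
#   basic_result, basic_s = timed(query_with_sources, query_engine, test_question)
#   agentic_result, agentic_s = timed(agent, test_question)
```

A single run is noisy because LLM latency varies between calls, so averaging over several questions gives a fairer picture.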

In [8]:
# Same question for both
test_question = "What evaluation metrics are used to assess RAG system performance?"

print("Question:", test_question)
print("\n" + "=" * 60)

# Basic RAG
print("\n[BASIC RAG]")
basic_result = query_with_sources(query_engine, test_question)
print(f"\nAnswer:\n{basic_result['answer'][:500]}...")
print(f"\nSources: {len(basic_result['sources'])}")

print("\n" + "=" * 60)

# Agentic RAG
print("\n[AGENTIC RAG]")
agentic_result = agent(test_question)
print(f"\nSub-questions: {len(agentic_result['sub_questions'])}")
print(f"\nAnswer:\n{agentic_result['answer'][:500]}...")
print(f"\nSources: {len(agentic_result['sources'])}")

Question: What evaluation metrics are used to assess RAG system performance?


[BASIC RAG]

Answer:
 Evaluation metrics for RAG systems vary by modality and task. For text generation and summarization, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics are standard, including ROUGE-1 and ROUGE-2 for unigram and bigram overlap precision, recall, and F1 scores, as well as ROUGE-L and ROUGE-Lsum for measuring longest common subsequences at the sentence and summary levels. For image generation, GenEval benchmarks assess specific compositional capabilities such as Two Object generati...

Sources: 5


[AGENTIC RAG]
Decomposing query: What evaluation metrics are used to assess RAG system performance?
Sub-questions: ['What metrics are used to evaluate the retrieval component of RAG systems, such as precision@k, recall, and mean reciprocal rank?', 'What metrics assess the quality, fluency, and relevance of the generated text outputs in RAG systems?', 'What specific metrics measur

## 6. Run Evaluation

Run an evaluation that compares both approaches on a set of test questions. The demo below scores only the first two questions to keep it fast.
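
LLM-as-judge evaluation usually means asking a model to grade an answer against a rubric and then parsing a numeric score from its reply. The sketch below illustrates that general pattern only: the prompt wording, the 1 to 5 scale, and the helpers `build_judge_prompt` and `parse_score` are all assumptions, and the internals of `src.evaluate` are not shown in this notebook.

```python
import re
from typing import Optional

# Hypothetical rubric; the project's real judge prompt may differ.
JUDGE_TEMPLATE = (
    "You are grading an answer to a research question.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer's relevance and completeness from 1 (poor) to 5 (excellent).\n"
    "Reply with the number only."
)


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the rubric template for one question/answer pair."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in the reply if it lies in [low, high], else None."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None


prompt = build_judge_prompt("What is RAG?", "RAG combines retrieval with generation.")
print(prompt)
print(parse_score("Score: 4"))
```

Returning `None` for missing or out-of-range scores lets the caller skip or retry a malformed judgment instead of silently averaging in a bad value.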

In [9]:
from src.evaluate import load_test_questions, compare_approaches, print_comparison_table

# Load test questions (path is relative to project root now)
questions = load_test_questions("data/test_questions.json")
print(f"Loaded {len(questions)} test questions")

# Show questions
for q in questions[:5]:
    print(f"  - [{q['complexity']}] {q['question'][:50]}...")

Loaded 10 test questions
  - [low] What is retrieval augmented generation?...
  - [low] What embedding models are commonly used for RAG sy...
  - [low] How do retrieval methods in RAG systems work?...
  - [medium] What are the main challenges in implementing RAG s...
  - [medium] Compare dense retrieval versus sparse retrieval me...


In [10]:
# Run quick evaluation on 2 questions (for demo speed)
quick_questions = questions[:2]

print("Running evaluation...")
comparison = compare_approaches(quick_questions)

# Print results
print_comparison_table(comparison)

Running evaluation...

Running Baseline RAG Evaluation
Loading index...



Using model: moonshotai/kimi-k2.5
Creating basic query engine...
Using model: moonshotai/kimi-k2.5




[1/2] What is retrieval augmented generation?...

[2/2] What embedding models are commonly used for RAG sy...

Running Agentic RAG Evaluation
Loading index...



Using model: moonshotai/kimi-k2.5
Creating synthesis agent...
Using model: moonshotai/kimi-k2.5



Using model: moonshotai/kimi-k2.5




[1/2] What is retrieval augmented generation?...
Decomposing query: What is retrieval augmented generation?
Sub-questions: ['What is the fundamental definition and core concept of retrieval augmented generation?', 'What are the key components or architectural elements of a retrieval augmented generation system?', 'How does the retrieval augmented generation process work step-by-step?', 'What are the primary advantages and typical use cases of retrieval augmented generation compared to standard language generation?']

Answering sub-question 1: What is the fundamental definition and core concept of retrieval augmented generation?

Answering sub-question 2: What are the key components or architectural elements of a retrieval augmented generation system?

Answering sub-question 3: How does the retrieval augmented generation process work step-by-step?

Answering sub-question 4: What are the primary advantages and typical use cases of retrieval augmented generation compared to standard lang

## Summary

This demo showed:

1. **Basic Retrieval**: Find relevant document chunks using vector similarity
2. **Basic RAG**: Retrieve chunks, then generate an answer and list its sources
3. **Query Decomposition**: Break complex questions into simpler parts
4. **Agentic RAG**: Multi-hop retrieval with synthesis
5. **Evaluation**: Compare approaches with LLM-as-judge

### Key Insights

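- **Basic RAG** is faster because it answers in a single retrieval-and-generation pass, but it may miss nuances of multi-part questions
- **Agentic RAG** answers each sub-question separately (four per question in the runs shown in full above), which tends to give more comprehensive answers to complex questions
- The tradeoff is latency versus answer quality
- Query decomposition helps cover multiple aspects of a question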