# Build Verbatim: A Developer's Guide to Hallucination-Free RAG

*A hands-on tutorial for developers who want to build trustworthy, verifiable RAG systems*

<p align="center">
  <img src="https://github.com/KRLabsOrg/verbatim-rag/blob/main/assets/chiliground.png?raw=true" alt="ChiliGround Logo" width="400"/>
  <br><em>Chill, I Ground!</em>
</p>

## Part 0: Installation

First, install Verbatim RAG and its dependencies:

In [None]:
!pip install "verbatim-rag==0.1.6"

In [1]:
# Suppress verbose logs for cleaner output
import logging

logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("milvus").setLevel(logging.WARNING)

### Environment Setup

If you plan to use LLM-based extraction (recommended for best accuracy):

In [None]:
import os

# Set your OpenAI API key (replace the empty string with your own key)
os.environ["OPENAI_API_KEY"] = ""

# Or load from .env file
# from dotenv import load_dotenv
# load_dotenv()

---

## Part 1: The Hallucination Problem

If you've built RAG systems before, you've probably seen this:

**Question:** "How much synthetic data was generated?"

**Traditional RAG Answer:** "They generated around 58,000 synthetic training examples..."

**The actual source:** "We constructed a comprehensive dataset of 60k synthetic training examples..."

Notice the problem? The source says "60k", but the LLM answered "around 58,000": an approximation that does not even match the source figure. In medical, legal, or financial domains, this kind of drift could be catastrophic.

### Even with Perfect Retrieval, LLMs Still Hallucinate

Even when retrieval returns exactly the right passages, the generation step can still:
- Round or approximate numbers
- Combine information incorrectly
- Fill in gaps with training knowledge
- Generate plausible-sounding but false information
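
The first failure mode, drifting numbers, is easy to check for mechanically. The sketch below is not part of Verbatim RAG; the helper name `unsupported_numbers` is hypothetical, and the comparison is deliberately literal (a number counts as supported only if the same digits appear in the source).

```python
import re

# Hypothetical helper for illustration; not part of the verbatim-rag package.
# Matches digit runs that may contain thousands separators or decimal points.
NUMBER_PATTERN = re.compile(r"\d[\d,.]*\d|\d")


def unsupported_numbers(answer: str, source: str) -> list[str]:
    """Return numeric tokens that occur in the answer but not in the source."""
    source_numbers = set(NUMBER_PATTERN.findall(source))
    return [n for n in NUMBER_PATTERN.findall(answer) if n not in source_numbers]


source = "We constructed a comprehensive dataset of 60k synthetic training examples."
answer = "They generated around 58,000 synthetic training examples."

flagged = unsupported_numbers(answer, source)
print(flagged)
```

A check like this only catches one symptom after the fact. It cannot tell whether a paraphrase changed the meaning, which is why constraining the answer to exact quotes is the more robust approach.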

**What if we could force the LLM to only use exact text from the source documents?**

That's exactly what Verbatim RAG does.

### Side-by-Side Comparison

Let's see the difference:

In [None]:
# Traditional RAG (typical output)
traditional_answer = """
The study generated around 58,000-60,000 synthetic examples 
using various data augmentation techniques.
"""

# Verbatim RAG (actual output)
verbatim_answer = """
[1] We constructed a comprehensive dataset of 60k synthetic 
training examples.
"""

print("Traditional RAG:")
print(traditional_answer)
print("Problems: Rounded numbers, paraphrased terms, no exact source")
print("\n" + "=" * 60 + "\n")
print("Verbatim RAG:")
print(verbatim_answer)
print("Result: Exact quote from source, verifiable, with citation")

### The Verbatim Solution

Instead of letting the LLM freely generate text based on retrieved documents, Verbatim RAG:

1. **Extracts exact text spans** from source documents
2. **Composes answers** entirely from these verbatim passages
3. **Provides precise citations** linking back to sources

The result? Zero hallucinations. Every claim is directly traceable to the source text.

**Key difference:**
- Traditional RAG: Retrieve → Generate (LLM can invent)
- Verbatim RAG: Retrieve → Extract → Compose (LLM only quotes)
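
To make the Extract and Compose steps concrete, here is a toy sketch that uses only the standard library. It assumes a keyword-overlap rule as a stand-in for the real extractor (Verbatim RAG uses an LLM or a fine-tuned model for that step), and the function names are hypothetical. What it shares with the real pipeline is the constraint: the answer contains nothing but unchanged source sentences and citation markers.

```python
import re

# Toy stand-in for the Extract and Compose steps; names are hypothetical.


def split_sentences(text: str) -> list[str]:
    """Split after sentence-ending punctuation, leaving each sentence unchanged."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_spans(question: str, chunk: str) -> list[str]:
    """Select sentences that share at least one content word with the question."""
    stop = {"what", "was", "were", "the", "a", "an", "of", "how", "is", "on"}
    q_words = set(re.findall(r"\w+", question.lower())) - stop
    return [
        s
        for s in split_sentences(chunk)
        if q_words & set(re.findall(r"\w+", s.lower()))
    ]


def compose(spans: list[str]) -> str:
    """Number each span; nothing is added beyond the citation markers."""
    return "\n".join(f"[{i}] {span}" for i, span in enumerate(spans, 1))


chunk = "We used two approaches. The system achieved 42.01% accuracy on ArchEHR-QA."
spans = extract_spans("What accuracy was achieved?", chunk)
answer = compose(spans)
print(answer)
```

Because `compose` never rewrites a span, every sentence in the answer can be located in the source with a plain substring search.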

---

## Part 2: The Complete System (5-Minute Quickstart)

Let's see Verbatim RAG in action. We'll build a complete hallucination-free system in about 20 lines of code.

In [1]:
from verbatim_rag import VerbatimRAG, VerbatimIndex
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider
from verbatim_rag.schema import DocumentSchema

# 1. Create index (storage layer)
store = LocalMilvusStore("./demo.db", enable_sparse=True, enable_dense=False)
embedder = SpladeProvider("naver/splade-v3", device="cpu")
index = VerbatimIndex(vector_store=store, sparse_provider=embedder)

# 2. Add a document
doc = DocumentSchema(
    content="""
# Methods
We used two approaches: zero-shot LLM extraction and a fine-tuned ModernBERT classifier on 58k synthetic examples.
    
# Results
The system achieved 42.01% accuracy on ArchEHR-QA.
""",
    title="Research Paper",
)
index.add_documents([doc])

# 3. Create RAG (add intelligence layer)
rag = VerbatimRAG(index, model="gpt-4o-mini")

# 4. Ask questions and get exact quotes
response = rag.query("What approaches were used?")
print(response.answer)

2025-11-17 12:17:03,612 - INFO - Created indexes for collection: verbatim_rag
2025-11-17 12:17:03,620 - INFO - Created documents collection: verbatim_rag_documents
2025-11-17 12:17:03,620 - INFO - Connected to Milvus Lite: ./demo.db
2025-11-17 12:17:03,678 - INFO - PyTorch version 2.8.0 available.
2025-11-17 12:17:04,431 - INFO - Load pretrained SparseEncoder: naver/splade-v3
2025-11-17 12:17:06,889 - INFO - Loaded SPLADE model: naver/splade-v3
Batches: 100%|██████████| 1/1 [00:00<00:00,  4.40it/s]
2025-11-17 12:17:08,104 - INFO - Added 3 vectors to Milvus
2025-11-17 12:17:08,107 - INFO - Added 1 documents to Milvus
Adding documents: 100%|██████████| 1/1 [00:01<00:00,  1.20s/it]
Batches: 100%|██████████| 1/1 [00:00<00:00, 12.52it/s]


Extracting relevant spans...
Extracting spans (batch mode)...


2025-11-17 12:17:10,818 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Processing spans...
Generating response...


2025-11-17 12:17:12,197 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


The approaches used in this context include the following:

1. **Fact 1:** [1] We used two approaches: zero-shot LLM extraction and a fine-tuned ModernBERT classifier on 58k synthetic examples.


**Want to use dense embeddings instead?**

In [None]:
# It's very easy to use dense embeddings instead
from verbatim_rag.embedding_providers import SentenceTransformersProvider

embedder_dense = SentenceTransformersProvider(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# The embedding dimension needs to be defined
store_dense = LocalMilvusStore(
    "./demo_dense.db",
    enable_sparse=False,
    enable_dense=True,
    dense_dim=embedder_dense.get_dimension(),
)
index_dense = VerbatimIndex(vector_store=store_dense, dense_provider=embedder_dense)

---

## Part 3: Architecture Deep Dive - The Two-Layer System

Verbatim RAG is built on **two independent layers**. Understanding this separation is key to using the system effectively.

### Visual Overview

<p align="center">
  <img src="https://github.com/KRLabsOrg/verbatim-rag/blob/main/assets/verbatim_architecture.png?raw=true" alt="Verbatim RAG Architecture" width="800"/>
</p>

The architecture shows three main integration patterns:

1. **Full System**: User → VerbatimIndex + Verbatim Core → Answer
2. **Custom RAG Provider**: User → Your existing RAG → Verbatim Core → Answer
3. **VerbatimIndex Only**: User → VerbatimIndex (Chonkie + Milvus + Docling) → Retrieved chunks

### Layer 1: Storage (VerbatimIndex)

**Job:** Find relevant documents

| Component | What It Does | Options |
|-----------|--------------|----------|
| Chunker | Splits documents intelligently | Markdown, Simple, Chonkie |
| Embedder | Converts text to searchable vectors | SPLADE (sparse), SentenceTransformers (dense), OpenAI |
| Vector Store | Stores and retrieves efficiently | LocalMilvus, CloudMilvus |

**Input:** Documents  
**Output:** Relevant chunks for a query

### Layer 2: Verbatim Core (VerbatimRAG)

**Job:** Extract verbatim answers

| Component | What It Does | Options |
|-----------|--------------|----------|
| Span Extractor | Identifies exact text that answers questions | LLM-based, ModernBERT-based |
| Template Manager | Structures extracted spans | Static, Dynamic, Question-specific |
| Verification | Ensures spans exist verbatim in source | Automatic |

**Input:** Question + Relevant chunks  
**Output:** Verbatim answer with citations
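
The verification row in the table is the safety net of this layer: a proposed span only survives if it occurs character-for-character in the chunk it came from. A minimal sketch of that idea follows, assuming a plain substring check; `VerifiedSpan` and `verify_spans` are hypothetical names, not the library's API, and the real implementation may be more lenient about whitespace.

```python
from dataclasses import dataclass


@dataclass
class VerifiedSpan:
    text: str
    start: int  # character offset where the span begins in the chunk
    end: int  # character offset just past the end of the span


def verify_spans(chunk: str, candidates: list[str]) -> list[VerifiedSpan]:
    """Keep only non-empty candidates that occur verbatim in the chunk."""
    verified = []
    for candidate in candidates:
        start = chunk.find(candidate) if candidate else -1
        if start != -1:
            verified.append(VerifiedSpan(candidate, start, start + len(candidate)))
    return verified


chunk = "We fine-tuned ModernBERT on 58k synthetic examples."
candidates = [
    "We fine-tuned ModernBERT on 58k synthetic examples.",  # exact match
    "We fine-tuned ModernBERT on 58,000 synthetic examples.",  # altered number
]
verified = verify_spans(chunk, candidates)
for span in verified:
    print(span.start, span.end, span.text)
```

Storing offsets alongside the text is a design choice that makes it straightforward to highlight the quoted passage in the original document.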

### Why This Separation Matters

**1. Modularity:** Swap any component without affecting others
- Want GPU embeddings? Just change the embedder
- Want cloud storage? Just change the vector store
- Want custom extraction? Just change the span extractor

**2. Flexibility:** Use just the parts you need
- Already have retrieval? Use only Verbatim Core
- Need better chunking? Use only VerbatimIndex
- Want full system? Use both layers

**3. Integration:** Wrap your existing RAG
- Keep your LangChain/LlamaIndex retrieval
- Add Verbatim Core on top
- Get hallucination-free answers without rebuilding

---

## Part 4: Verbatim Core - How Verbatim Prevents Hallucinations

This is the core innovation. Let's see exactly how span extraction prevents hallucinations.

### The Problem with Free Generation

Traditional RAG lets the LLM generate freely:

In [None]:
# Simulation of what traditional RAG does
retrieved_context = "We created 58k examples. The system used synthetic data."

# LLM generates:
traditional_answer = "They generated approximately 58,000 synthetic training samples"
#                     ↑           ↑            ↑           ↑
#                 Invented   Approximated   Changed    Added word

print("Retrieved:", retrieved_context)
print("Generated:", traditional_answer)
print("\nProblems:")
print("- Changed 'We' to 'They'")
print("- Changed '58k' to 'approximately 58,000'")
print("- Changed 'created' to 'generated'")
print("- Added 'training' (not in source)")

### The Verbatim Solution: Extract, Don't Generate

Verbatim RAG constrains the LLM to extraction:

In [None]:
# Same context
retrieved_context = "We created 58k examples. The system used synthetic data."

# LLM extracts (doesn't generate):
extracted_spans = ["We created 58k examples"]

# Answer composed from exact quotes:
verbatim_answer = "[1] We created 58k examples."
#                  ↑    Every word is exact    ↑

print("Retrieved:", retrieved_context)
print("Extracted:", extracted_spans)
print("Answer:", verbatim_answer)
print("\nResult: Every word is verbatim from the source.")

### Step-by-Step: How Extraction Works

Let's see the extraction process in detail:

In [5]:
from verbatim_rag.extractors import LLMSpanExtractor
from verbatim_rag.vector_stores import SearchResult

# Question and retrieved chunks
question = "What methods were used?"

# Simulate retrieved chunks
chunks = [
    SearchResult(
        id="1",
        score=0.95,
        text="We used a zero-shot LLM approach for span extraction.",
        metadata={"title": "Methods"},
    ),
    SearchResult(
        id="2",
        score=0.90,
        text="We fine-tuned ModernBERT on 58k synthetic examples.",
        metadata={"title": "Methods"},
    ),
    SearchResult(
        id="3",
        score=0.60,
        text="The system runs on CPU.",
        metadata={"title": "Implementation"},
    ),
]

# Extract spans
extractor = LLMSpanExtractor(model="gpt-4o-mini")
spans = extractor.extract_spans(question, chunks)

print("Question:", question)
print("\nExtracted spans:")
for doc_text, span_list in spans.items():
    for span in span_list:
        print(f"  - {span}")

print("\nNote: 'The system runs on CPU' was rejected - doesn't answer the question")

Extracting spans (batch mode)...


2025-11-06 11:49:14,052 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Question: What methods were used?

Extracted spans:
  - We used a zero-shot LLM approach for span extraction.
  - We fine-tuned ModernBERT on 58k synthetic examples.

Note: 'The system runs on CPU' was rejected - doesn't answer the question


### Template Composition

Once spans are extracted, the template arranges them:

In [5]:
# Extracted spans (from above)
spans_dict = {
    "chunk_1": ["We used a zero-shot LLM approach for span extraction"],
    "chunk_2": ["We fine-tuned ModernBERT on 58k synthetic examples"],
}

# Template composes them
final_answer = """
The methods used include:

[1] We used a zero-shot LLM approach for span extraction.
[2] We fine-tuned ModernBERT on 58k synthetic examples.
"""

print("Extracted spans:")
for chunk_id, span_list in spans_dict.items():
    print(f"  {chunk_id}: {span_list[0]}")

print("\nComposed answer:")
print(final_answer)
print("Note: Every word is verbatim from source. No generation occurred.")

Extracted spans:
  chunk_1: We used a zero-shot LLM approach for span extraction
  chunk_2: We fine-tuned ModernBERT on 58k synthetic examples

Composed answer:

The methods used include:

[1] We used a zero-shot LLM approach for span extraction.
[2] We fine-tuned ModernBERT on 58k synthetic examples.

Note: Every word is verbatim from source. No generation occurred.
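
The cell above hard-codes the composed answer. As a rough sketch of what a static template does, the hypothetical function below fills a fixed intro line with numbered spans. It assumes the simplest possible template; the dynamic and question-specific modes listed earlier are not modeled here. Note that the intro line is the only text that does not come from the source.

```python
def fill_template(intro: str, spans_by_chunk: dict[str, list[str]]) -> str:
    """Place verbatim spans under a fixed intro line, one citation per span."""
    lines = [intro, ""]
    citation = 1
    for spans in spans_by_chunk.values():
        for span in spans:
            # Add a closing period only if the span lacks end punctuation.
            suffix = "" if span.endswith((".", "!", "?")) else "."
            lines.append(f"[{citation}] {span}{suffix}")
            citation += 1
    return "\n".join(lines)


spans_dict = {
    "chunk_1": ["We used a zero-shot LLM approach for span extraction"],
    "chunk_2": ["We fine-tuned ModernBERT on 58k synthetic examples"],
}
composed = fill_template("The methods used include:", spans_dict)
print(composed)
```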


### Key Insight: The LLM Never Generates Facts

**Traditional RAG:**
- LLM understands the documents
- LLM generates an answer (risk of hallucination)

**Verbatim RAG:**
- LLM understands the documents
- LLM identifies relevant text spans (selection only; each span is verified against the source)
- System composes answer from verbatim quotes

The generation step is replaced by verbatim composition.

### Choosing Your Extractor

You have two options for span extraction:

**Option 1: LLM-based Extractor** (default)

```python
rag = VerbatimRAG(index, model="gpt-4o-mini")
```

Pros:
- Zero-shot: works immediately without training
- Handles complex reasoning
- Best accuracy

Cons:
- Requires API calls (costs money)
- Slower

**Option 2: ModernBERT Extractor** (CPU-friendly)

In [6]:
from verbatim_rag.extractors import ModelSpanExtractor

# Use our fine-tuned ModernBERT model
extractor = ModelSpanExtractor(
    model_path="KRLabsOrg/verbatim-rag-modern-bert-v1",
    device="cpu",  # Works on CPU!
)

rag_bert = VerbatimRAG(index, extractor=extractor)

Loading model from KRLabsOrg/verbatim-rag-modern-bert-v1...
Loading tokenizer from KRLabsOrg/verbatim-rag-modern-bert-v1...
Tokenizer loaded successfully.


Pros:
- Runs entirely on CPU (no GPU needed)
- No API costs
- Fast inference (milliseconds)
- Works offline

Cons:
- Requires model download
- Less flexible than LLM

**Note:** With ModernBERT extractor + SPLADE embeddings, the **entire pipeline runs on CPU** with no LLM calls!

---

## Part 5: VerbatimIndex - Making Documents Searchable

Before we can extract verbatim answers, we need to find relevant documents. The storage layer handles this.

### Component 1: Document Processing


Before we chunk, we need to get documents into our system. This is where **DocumentSchema** comes in.

In [29]:
from verbatim_rag.schema import DocumentSchema

# From a PDF URL (uses Docling to convert PDF → Markdown)
doc = DocumentSchema.from_url(
    url="https://aclanthology.org/2025.bionlp-share.8.pdf",
    title="KR Labs at ArchEHR-QA 2025",
    authors=["Adam Kovacs", "Paul Schmitt", "Gabor Recski"],
    year=2025,
    conference="BioNLP",
    category="nlp",
)

print(f"Document: {doc.title}")
print(f"Content length: {len(doc.content):,} chars")
print(f"Content preview:\n{doc.content[:200]}...")

2025-11-06 13:26:10,930 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-11-06 13:26:10,954 - INFO - Going to convert document batch...
2025-11-06 13:26:10,959 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-11-06 13:26:10,968 - INFO - Accelerator device: 'mps'
2025-11-06 13:26:16,594 - INFO - Accelerator device: 'mps'
2025-11-06 13:26:18,123 - INFO - Accelerator device: 'mps'
2025-11-06 13:26:18,732 - INFO - Processing document 2025.bionlp-share.8.pdf
2025-11-06 13:26:32,829 - INFO - Finished converting document 2025.bionlp-share.8.pdf in 22.90 sec.


Document: KR Labs at ArchEHR-QA 2025
Content length: 25,220 chars
Content preview:
## KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering

Ádám Kovács 1 , Paul Schmitt 2 , Gábor Recski 1 , 2

1 KR Labs lastname@krlabs.eu

2 TU Wien firstname.lastnam...


**What just happened?**

1. Docling downloaded the PDF
2. Converted it to clean markdown (preserving structure!)
3. Created a DocumentSchema with your metadata
4. All custom fields (`conference`, `category`) are preserved

In [30]:
# Create manually
# If you already have the content
doc = DocumentSchema(
    content="# My Document\n\nContent here...",
    title="Custom Document",
    user_id="user123",  # Custom field!
    dataset_id="project_a",  # Another custom field!
)

### Component 2: Smart Chunking

Many RAG systems chunk naively by character count. This cuts through headings and sentences and loses track of which section a piece of text belongs to.

In [6]:
# Bad: Arbitrary fixed-size chunking
def naive_chunk(text, size=100):
    return [text[i : i + size] for i in range(0, len(text), size)]


markdown = """# Introduction
Machine learning has revolutionized AI by enabling systems to learn from data.

## Related Work
Previous approaches used rule-based methods which were limited in scope.

## Our Approach
We propose a novel architecture that combines deep learning with symbolic reasoning.
"""

chunks = naive_chunk(markdown, size=100)
print("Naive chunking (100 char chunks):")
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i} ---")
    print(chunk[:50] + "..." if len(chunk) > 50 else chunk)

print("\nProblems:")
print("- Cut headings in half")
print("- Lost document structure")
print("- No context about which section text belongs to")

Naive chunking (100 char chunks):

--- Chunk 0 ---
# Introduction
Machine learning has revolutionized...

--- Chunk 1 ---
lated Work
Previous approaches used rule-based met...

--- Chunk 2 ---

We propose a novel architecture that combines dee...

Problems:
- Cut headings in half
- Lost document structure
- No context about which section text belongs to


### Verbatim's Structure-Aware Chunking

In [20]:
from verbatim_rag.chunker_providers import MarkdownChunkerProvider

chunker = MarkdownChunkerProvider(
    split_levels=(1, 2, 3, 4),  # Split on H1, H2, H3, H4
    include_preamble=True,  # Include text before first heading
)

chunks = chunker.chunk(markdown)

print("Structure-aware chunking:")
for i, (raw, enhanced) in enumerate(chunks):
    print(f"\n--- Chunk {i} ---")
    print("Raw (for display):")
    print(raw[:80] + "..." if len(raw) > 80 else raw)
    print("\nEnhanced (for embedding):")
    print(enhanced[:100] + "..." if len(enhanced) > 100 else enhanced)

Structure-aware chunking:

--- Chunk 0 ---
Raw (for display):
# Introduction
Machine learning has revolutionized AI by enabling systems to lea...

Enhanced (for embedding):
# Introduction
Machine learning has revolutionized AI by enabling systems to learn from data.



--- Chunk 1 ---
Raw (for display):
## Related Work
Previous approaches used rule-based methods which were limited i...

Enhanced (for embedding):
# Introduction

## Related Work
Previous approaches used rule-based methods which were limited in sc...

--- Chunk 2 ---
Raw (for display):
## Our Approach
We propose a novel architecture that combines deep learning with...

Enhanced (for embedding):
# Introduction

## Our Approach
We propose a novel architecture that combines deep learning with sym...


**Key Innovation:** Two versions of each chunk

1. **Raw chunk:** Original text (used for display/extraction)
2. **Enhanced chunk:** Original + ancestor headings (used for embedding/retrieval)

**Why this matters:** When someone searches for "novel approaches", the enhanced chunk matches better because it has full context:
- "Introduction" tells the model this is introductory content
- "Our Approach" tells the model this is about the contribution
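
To show how the raw and enhanced versions relate, here is a minimal heading-aware splitter written with the standard library only. It is a sketch of the idea, not the `MarkdownChunkerProvider` implementation: the function name is hypothetical, it splits on every heading level, and it would misread `#` comment lines inside fenced code as headings.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str) -> list[tuple[str, str]]:
    """Split on headings and return (raw, enhanced) pairs.

    enhanced = ancestor headings + raw, so each chunk carries its section path.
    """
    chunks: list[tuple[str, str]] = []
    ancestors: list[tuple[int, str]] = []  # stack of (level, heading line)
    current: list[str] = []  # lines of the chunk being built
    current_ancestors: list[str] = []  # ancestor headings of that chunk

    def flush() -> None:
        raw = "\n".join(current).strip()
        if raw:
            prefix = "\n\n".join(current_ancestors)
            chunks.append((raw, f"{prefix}\n\n{raw}" if prefix else raw))

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Headings at the same or a deeper level are not ancestors.
            while ancestors and ancestors[-1][0] >= level:
                ancestors.pop()
            current_ancestors = [heading for _, heading in ancestors]
            ancestors.append((level, line))
            current = [line]
        else:
            current.append(line)
    flush()
    return chunks


sample = """# Introduction
Machine learning has revolutionized AI.

## Related Work
Previous approaches used rule-based methods.

## Our Approach
We propose a novel architecture.
"""
pairs = chunk_markdown(sample)
for raw, enhanced in pairs:
    print(repr(raw), "->", repr(enhanced))
```

The raw text stays untouched so that extracted spans remain exact quotes, while the enhanced text is only ever used for embedding.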

### Component 3: Embeddings

Embeddings convert text to vectors for semantic search.

**Sparse Embeddings (SPLADE):**

In [21]:
from verbatim_rag.embedding_providers import SpladeProvider
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("naver/splade-v3")

embedder = SpladeProvider(
    model_name="naver/splade-v3",
    device="cpu",  # Works on CPU!
)

text = "Machine learning models require training data"
vector = embedder.embed_text(text)

print(f"Vector type: {type(vector)}")
print(f"Non-zero dimensions: {len(vector)}")
print("\nTop 5 terms:")
for idx, weight in sorted(vector.items(), key=lambda x: x[1], reverse=True)[:5]:
    token = tokenizer.convert_ids_to_tokens([idx])[0]
    print(f"{token}: {weight:.3f}")

2025-11-06 13:17:52,171 - INFO - Load pretrained SparseEncoder: naver/splade-v3
2025-11-06 13:17:54,365 - INFO - Loaded SPLADE model: naver/splade-v3
Batches: 100%|██████████| 1/1 [00:00<00:00,  7.52it/s]


Vector type: <class 'dict'>
Non-zero dimensions: 32

Top 5 terms:
machine: 2.419
training: 2.083
learning: 1.881
data: 1.821
train: 1.613


**Dense Embeddings:**

In [10]:
from verbatim_rag.embedding_providers import SentenceTransformersProvider

embedder_dense = SentenceTransformersProvider(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device="cpu",
)

vector_dense = embedder_dense.embed_text(text)
print(f"Dense vector length: {len(vector_dense)}")
print("Dense vector: All 384 dimensions have values")

2025-11-06 11:50:34,103 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
2025-11-06 11:50:36,346 - INFO - Loaded SentenceTransformers model: sentence-transformers/all-MiniLM-L6-v2


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Dense vector length: 384
Dense vector: All 384 dimensions have values


### Decision Guide: Sparse vs Dense

| Factor | Sparse (SPLADE) | Dense (SentenceTransformers) |
|--------|-----------------|------------------------------|
| **Hardware** | Fast on CPU | Faster with GPU |
| **Accuracy** | Great for exact matches | Great for semantic similarity |
| **Interpretability** | High (see which terms match) | Low (black box) |
| **Model size** | ~100MB | ~100-500MB |
| **Use case** | Precise matches | Generic text |

**Recommendation:**
- Start with SPLADE (sparse) for CPU-only deployments
- Use dense if you have GPU and need max semantic similarity
- Use both (hybrid) for best of both worlds
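
How hybrid results are merged is not covered in this tutorial. One widely used option is reciprocal rank fusion (RRF), which combines ranked lists without having to put sparse and dense scores on the same scale. The sketch below is a generic RRF implementation under that assumption, not a description of what Verbatim RAG or Milvus does internally; `k=60` is the conventional default constant.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk ids, ranked best-first by each retriever.
sparse_ranking = ["c1", "c2", "c3"]
dense_ranking = ["c2", "c4", "c1"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever still stays in the list.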

### Component 4: Vector Stores

Vector stores handle efficient storage and retrieval.

**LocalMilvusStore:**

In [None]:
from verbatim_rag.vector_stores import LocalMilvusStore

store_local = LocalMilvusStore(
    db_path="./my_index.db",
    collection_name="my_collection",
    enable_sparse=True,
    enable_dense=False,
)

print("LocalMilvusStore:")
print("- Stores data locally in a file")
print("- No cloud/network needed")
print("- Great for development and small datasets")

**CloudMilvusStore** (for production):

```python
from verbatim_rag.vector_stores import CloudMilvusStore

store_cloud = CloudMilvusStore(
    uri="https://your-milvus-instance.com",
    token="your-token",
    collection_name="production_collection"
)
```

Use for:
- Production deployments
- Large-scale datasets
- Multi-tenant systems
- High availability requirements

---

## Part 6: Building Your Index Step-by-Step

Now let's build the storage layer and add real research papers.

### Step 1: Create DocumentSchemas from PDFs

We'll add 6 research papers from the ACL Anthology. Docling automatically converts each PDF to clean markdown:

In [None]:
from verbatim_rag.schema import DocumentSchema

# Add 6 research papers from ACL Anthology
# These papers demonstrate Verbatim RAG working with real academic literature

paper = DocumentSchema.from_url(
    url="https://aclanthology.org/L16-1417.pdf",
    title="Building Concept Graphs from Monolingual Dictionary Entries",
    doc_type="academic_paper",
    authors=["Gabor Recski"],
    conference="LREC",
    year=2016,
    category="nlp",
    dataset_id="anthology",
)

paper2 = DocumentSchema.from_url(
    url="https://aclanthology.org/2025.bionlp-share.8.pdf",
    title="KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering",
    doc_type="academic_paper",
    authors=["Adam Kovacs", "Paul Schmitt", "Gabor Recski"],
    conference="BioNLP",
    year=2025,
    category="nlp",
    dataset_id="anthology",
)

paper3 = DocumentSchema.from_url(
    url="https://aclanthology.org/2020.lrec-1.448.pdf",
    title="Better Together: Modern Methods Plus Traditional Thinking in NP Alignment",
    doc_type="academic_paper",
    authors=["Adam Kovacs", "Judit Acs", "Andras Kornai", "Gabor Recski"],
    conference="LREC",
    year=2020,
    category="nlp",
    dataset_id="anthology",
)

paper4 = DocumentSchema.from_url(
    url="https://aclanthology.org/2020.msr-1.2.pdf",
    title="BME-TUW at SR'20: Lexical grammar induction for surface realization",
    doc_type="academic_paper",
    authors=[
        "Gábor Recski",
        "Ádám Kovács",
        "Kinga Gémes",
        "Judit Ács",
        "Andras Kornai",
    ],
    conference="MSR",
    year=2020,
    category="nlp",
    dataset_id="anthology",
)

paper5 = DocumentSchema.from_url(
    url="https://aclanthology.org/D19-6304.pdf",
    title="BME-UW at SRST-2019: Surface realization with Interpreted Regular Tree Grammars",
    doc_type="academic_paper",
    authors=["Ádám Kovács", "Evelin Ács", "Judit Ács", "Andras Kornai", "Gábor Recski"],
    conference="SRST",
    year=2019,
    category="nlp",
    dataset_id="anthology",
)

paper6 = DocumentSchema.from_url(
    url="https://aclanthology.org/2020.semeval-1.15.pdf",
    title="BMEAUT at SemEval-2020 Task 2: Lexical entailment with semantic graphs",
    doc_type="academic_paper",
    authors=["Ádám Kovács", "Kinga Gémes", "Andras Kornai", "Gábor Recski"],
    conference="SemEval",
    year=2020,
    category="nlp",
    dataset_id="anthology",
)

print("Created 6 research papers:")
print(f"  1. {paper.title} ({paper.metadata['year']})")
print(f"  2. {paper2.title} ({paper2.metadata['year']})")
print(f"  3. {paper3.title} ({paper3.metadata['year']})")
print(f"  4. {paper4.title} ({paper4.metadata['year']})")
print(f"  5. {paper5.title} ({paper5.metadata['year']})")
print(f"  6. {paper6.title} ({paper6.metadata['year']})")

**What happened:**
1. Docling downloaded each PDF
2. Converted them to clean markdown (preserving structure)
3. Created DocumentSchemas with metadata
4. All custom fields (conference, category, dataset_id) are preserved

**Note:** All custom fields flow through the pipeline. You can filter by them later!

### Step 2: Configure Components

In [12]:
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider
from verbatim_rag.chunker_providers import MarkdownChunkerProvider

# Create vector store
store = LocalMilvusStore(
    db_path="./my_papers.db",
    collection_name="research_papers",
    enable_sparse=True,
    enable_dense=False,
)

# Create embedder
embedder = SpladeProvider(model_name="naver/splade-v3", device="cpu")

# Create chunker
chunker = MarkdownChunkerProvider(split_levels=(1, 2, 3, 4), include_preamble=True)

print("Components configured!")
print(f"  Vector store: {type(store).__name__}")
print(f"  Embedder: {type(embedder).__name__}")
print(f"  Chunker: {type(chunker).__name__}")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2025-11-06 11:52:56,779 - INFO - Created indexes for collection: research_papers
2025-11-06 11:52:56,785 - INFO - Created documents collection: research_papers_documents
2025-11-06 11:52:56,786 - INFO - Connected to Milvus Lite: ./my_papers.db
2025-11-06 11:52:56,797 - INFO - Load pretrained SparseEncoder: naver/splade-v3
2025-11-06 11:52:58,907 - INFO - Loaded SPLADE model: naver/splade-v3


Components configured!
  Vector store: LocalMilvusStore
  Embedder: SpladeProvider
  Chunker: MarkdownChunkerProvider


### Step 3: Assemble Index

In [13]:
from verbatim_rag import VerbatimIndex

# Assemble the storage layer
index = VerbatimIndex(
    vector_store=store, sparse_provider=embedder, chunker_provider=chunker
)

print("Storage layer ready!")

Storage layer ready!


### Step 4: Add All 6 Papers to the Index

In [14]:
# Add all 6 papers in a single batch
index.add_documents([paper, paper2, paper3, paper4, paper5, paper6])

print("All 6 papers indexed!")
print(f"Total documents: {index.inspect()['total_documents']}")
print(f"Total chunks: {index.inspect()['total_chunks']}")

Batches: 100%|██████████| 1/1 [00:03<00:00,  3.19s/it]
2025-11-06 11:53:15,911 - INFO - Added 13 vectors to Milvus
2025-11-06 11:53:15,914 - INFO - Added 1 documents to Milvus
Batches: 100%|██████████| 1/1 [00:04<00:00,  4.75s/it]
2025-11-06 11:53:26,409 - INFO - Added 19 vectors to Milvus
2025-11-06 11:53:26,410 - INFO - Added 1 documents to Milvus
Batches: 100%|██████████| 1/1 [00:02<00:00,  2.66s/it]
2025-11-06 11:53:33,089 - INFO - Added 13 vectors to Milvus
2025-11-06 11:53:33,090 - INFO - Added 1 documents to Milvus
Batches: 100%|██████████| 1/1 [00:03<00:00,  3.67s/it]
2025-11-06 11:53:41,654 - INFO - Added 16 vectors to Milvus
2025-11-06 11:53:41,655 - INFO - Added 1 documents to Milvus
Batches: 100%|██████████| 1/1 [00:02<00:00,  2.44s/it]
2025-11-06 11:53:47,681 - INFO - Added 12 vectors to Milvus
2025-11-06 11:53:47,682 - INFO - Added 1 documents to Milvus
Batches: 100%|██████████| 1/1 [00:03<00:00,  3.18s/it]
2025-11-06 11:53:55

All 6 papers indexed!
Total documents: 6
Total chunks: 88





## Part 7: Working With Your Index

Once you've indexed documents, you should inspect and query your index.

### Inspecting What You Built

In [16]:
# Get overview statistics
stats = index.inspect()

print(f"Total documents: {stats['total_documents']}")
print(f"Total chunks: {stats['total_chunks']}")
print(f"Average chunks per doc: {stats['total_chunks'] / stats['total_documents']:.1f}")

print("\nDocument types:")
for doc_type, count in stats["doc_types"].items():
    print(f"  {doc_type}: {count}")

print("\nSample documents:")
for doc in stats["sample_documents"][:3]:
    print(f"  - {doc['title']} ({doc.get('year', 'N/A')})")

Total documents: 6
Total chunks: 88
Average chunks per doc: 14.7

Document types:
  academic_paper: 6

Sample documents:
  - BME-TUW at SR'20: Lexical grammar induction for surface realization (N/A)
  - BME-UW at SRST-2019: Surface realization with Interpreted Regular Tree Grammars (N/A)
  - KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering (N/A)


### Examine Individual Chunks

In [None]:
# Get first 5 chunks
chunks = index.get_all_chunks(limit=5)

for i, chunk in enumerate(chunks):
    print(f"\n{'=' * 60}")
    print(f"Chunk {i}")
    print(f"{'=' * 60}")

    print(f"\nID: {chunk.id}")
    print(f"Score: {chunk.score}")

    print("\n--- Raw Text (for extraction/display) ---")
    print(chunk.text[:200] + "...")

    print("\n--- Enhanced Text (what was embedded) ---")
    print(chunk.enhanced_text[:300] + "...")

    print("\n--- Metadata ---")
    metadata_keys = ["title", "authors", "year", "conference", "chunk_number"]
    for key in metadata_keys:
        if key in chunk.metadata:
            print(f"  {key}: {chunk.metadata[key]}")

**This shows you:**
- How your chunks look after processing
- What metadata was preserved
- The difference between raw and enhanced text

**Warning:** Not inspecting chunks is like deploying code without testing. Always check a few chunks!
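
Eyeballing a few chunks is a good start, and a small automated report can cover the rest of the index. The helper below is a hypothetical sketch, not part of the library, and the length thresholds are arbitrary assumptions you should tune to your documents. With a real index, the input list could be built from the `text` attribute of the chunks returned by `index.get_all_chunks(...)` as used above.

```python
def chunk_report(
    texts: list[str], min_chars: int = 20, max_chars: int = 4000
) -> dict:
    """Summarize chunk lengths and flag suspicious chunks by position."""
    lengths = [len(t.strip()) for t in texts]
    return {
        "count": len(texts),
        "mean_chars": sum(lengths) / len(lengths) if lengths else 0.0,
        # Very short chunks are often a bare heading with no body text.
        "too_short": [i for i, n in enumerate(lengths) if n < min_chars],
        # Very long chunks suggest the splitter found no headings to split on.
        "too_long": [i for i, n in enumerate(lengths) if n > max_chars],
    }


# Stand-in chunk texts; replace with texts taken from your own index.
texts = [
    "# Introduction\nMachine learning has revolutionized AI.",
    "## 2",
    "x" * 5000,
]
report = chunk_report(texts)
print(report)
```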

### Semantic Search

In [None]:
# Search for information about 4lang (mentioned in the 2016 paper)
results = index.query(text="what is 4lang", k=5)

print("Top 5 results for '4lang':")
for i, result in enumerate(results, 1):
    print(f"\n{i}. {result.metadata['title']}")
    print(f"   Score: {result.score:.3f}")
    print(f"   Year: {result.metadata.get('year', 'N/A')}")
    print(f"   Preview: {result.text[:150]}...")

### Filtering by Metadata

Remember those custom fields? Now you can filter by them:

In [None]:
# Get only 2025 papers
chunks_2025 = index.query(
    text=None,  # No semantic search, just filter
    filter='metadata["year"] == 2025',
    k=100,
)
print(f"Found {len(chunks_2025)} chunks from 2025")

# Get papers from specific conference
chunks_bionlp = index.query(filter='metadata["conference"] == "BioNLP"', k=50)
print(f"Found {len(chunks_bionlp)} chunks from BioNLP")

### Combining Search + Filters

The real power comes from combining semantic search with metadata filters:

In [19]:
# Search for "neural networks" in papers from 2024 onwards
results = index.query(
    text="neural network architectures", filter='metadata["year"] >= 2024', k=10
)

print("Top results for 'neural networks' (2024+):")
for i, result in enumerate(results[:3], 1):
    print(f"\n{i}. {result.metadata['title']}")
    print(f"   Score: {result.score:.3f}")
    print(f"   Year: {result.metadata.get('year', 'N/A')}")


Top results for 'neural networks' (2024+):

1. KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering
   Score: 9.237
   Year: 2025

2. KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering
   Score: 8.142
   Year: 2025

3. KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering
   Score: 2.726
   Year: 2025


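Combining the two restricts results to chunks that satisfy the metadata condition, so every hit above comes from a 2024+ paper no matter how well older papers match the query.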
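
Filter strings like the ones above are easy to mistype when assembled by hand. Here is a minimal sketch of a helper that builds a single condition in the `metadata["field"] <op> value` form used in this section. The name `metadata_condition` and the list of allowed operators are assumptions for illustration; they are not part of Verbatim RAG.

```python
import json

# Hypothetical helper, not part of verbatim_rag.
# Assumption: only the comparison operators listed here are needed.
ALLOWED_OPS = {"==", "!=", ">", ">=", "<", "<="}


def metadata_condition(field: str, op: str, value) -> str:
    """Build one filter condition such as metadata["year"] >= 2024."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, (str, int, float)):
        raise TypeError("value must be a str, int, or float")
    # json.dumps double-quotes strings and leaves numbers bare
    quoted_field = json.dumps(field, ensure_ascii=False)
    quoted_value = json.dumps(value, ensure_ascii=False)
    return f"metadata[{quoted_field}] {op} {quoted_value}"


print(metadata_condition("year", ">=", 2024))
print(metadata_condition("conference", "==", "BioNLP"))
```

The returned string can be passed as the `filter` argument of `index.query(...)` in the same way as the hand-written filters.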

---

## Part 8: Integration Patterns - Wrapping Your Existing RAG

You don't have to rebuild everything. Verbatim RAG can wrap your existing retrieval system.

### The RAGProvider Interface

To integrate with existing systems, implement this simple interface:

In [None]:
from verbatim_rag.providers import RAGProvider
from typing import List, Dict, Any, Optional


class MyCustomRAGProvider(RAGProvider):
    """Wrap your existing RAG system."""

    def retrieve(
        self, question: str, k: int = 5, filter: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        """
        Return context as list of dicts.

        Must return:
        [
            {
                "content": str (required),
                "title": str (optional),
                "source": str (optional),
                "metadata": dict (optional)
            },
            ...
        ]
        """
        # Your retrieval logic here
        raise NotImplementedError

That's it! Just one method that returns a list of dicts with:
- `content` (required): The text content
- `title` (optional): Document title
- `source` (optional): Document source/URL
- `metadata` (optional): Any additional metadata
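
Before plugging a custom provider into Verbatim, it can help to check that `retrieve` really returns this shape. The sketch below is a plain-Python check written for this tutorial; `validate_context` is a hypothetical name, and treating empty `content` as a problem is an assumption rather than a documented rule of the library.

```python
from typing import Any, Dict, List


def validate_context(context: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: List[str] = []
    for i, item in enumerate(context):
        if not isinstance(item, dict):
            problems.append(f"item {i}: expected dict, got {type(item).__name__}")
            continue
        content = item.get("content")
        # Assumption: empty or whitespace-only content is treated as an error
        if not isinstance(content, str) or not content.strip():
            problems.append(f"item {i}: 'content' must be a non-empty string")
        for key in ("title", "source"):
            if key in item and not isinstance(item[key], str):
                problems.append(f"item {i}: '{key}' must be a string")
        if "metadata" in item and not isinstance(item["metadata"], dict):
            problems.append(f"item {i}: 'metadata' must be a dict")
    return problems


sample = [{"content": "Our system achieved 42.01% accuracy.", "title": "Paper 1"}]
print(validate_context(sample))
```

Running this against the output of your own `retrieve` during development catches missing `content` keys early, before any extraction call is made.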

### Example: Wrapping LangChain

Here's a complete working example:

In [None]:
!pip install langchain langchain-openai langchain-community openai faiss-cpu langchain-text-splitters

In [26]:
from verbatim_rag.providers import RAGProvider
from verbatim_rag import verbatim_query

# Your existing LangChain setup
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

from typing import List, Dict, Any, Optional

# 1. Create sample documents
docs = [
    Document(
        page_content="""
# Methods
We used two approaches: zero-shot LLM extraction and a fine-tuned ModernBERT classifier on 58k synthetic examples.

# Results
The system achieved 42.01% accuracy on ArchEHR-QA.
        """,
        metadata={"title": "Research Paper 1", "source": "paper1.pdf", "year": 2025},
    )
]

# 2. Index with LangChain (your existing setup)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
splits = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})


# 3. Wrap your LangChain retriever
class LangChainRAGProvider(RAGProvider):
    def __init__(self, langchain_retriever):
        self.retriever = langchain_retriever

    def retrieve(
        self, question: str, k: int = 5, filter: Optional[str] = None
    ) -> List[Dict[str, Any]]:
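        # Use LangChain's retrieval. Note: `filter` is accepted to match the
        # interface but is not applied here, and the retriever's own `k`
        # (set in search_kwargs above) caps how many documents come back.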
        docs = self.retriever.invoke(question)

        # Convert to Verbatim format
        context = []
        for doc in docs[:k]:
            context.append(
                {
                    "content": doc.page_content,
                    "title": doc.metadata.get("title", ""),
                    "source": doc.metadata.get("source", ""),
                    "metadata": doc.metadata,
                }
            )
        return context


# 4. Use Verbatim with your existing LangChain RAG
provider = LangChainRAGProvider(retriever)
response = verbatim_query(provider, "What methods were used?", k=5)

print(response.answer)


Extracting spans (batch mode)...



What methods were used in the study? Here are the key approaches:

1. **Fact 1:** [1] We used two approaches: zero-shot LLM extraction and a fine-tuned ModernBERT classifier on 58k synthetic examples.


**What you get:**
- Keep your existing LangChain + FAISS setup
- Keep your OpenAI embeddings
- Add hallucination-free span extraction
- Get precise, verifiable citations

**Zero changes** to your current indexing pipeline!

---

## Part 9: Standalone Components

You can also use individual components without the full system.

### VerbatimTransform (Most Common)

The core transformation component that takes ANY context and produces verbatim answers:

In [27]:
from verbatim_rag import VerbatimTransform

# Your retrieval (from anywhere!)
context = [
    {"content": "Our system achieved 42.01% accuracy on ArchEHR-QA."},
    {"content": "Baselines achieved 35.4% on the same benchmark."},
]

# Extract and compose verbatim answer
vt = VerbatimTransform()
response = vt.transform(question="What were the results?", context=context)

print(response.answer)

# Inspect extracted spans
print("\nExtracted spans:")
for doc in response.documents:
    for h in doc.highlights:
        print(f"  - {h.text}")

Extracting spans (batch mode)...



The results of the evaluation are as follows:

1. **Fact 1:** [1] Our system achieved 42.01% accuracy on ArchEHR-QA.
2. **Fact 2:** [2] Baselines achieved 35.4% on the same benchmark.

Extracted spans:
  - Our system achieved 42.01% accuracy on ArchEHR-QA.
  - Baselines achieved 35.4% on the same benchmark.


**Use case:** You have retrieval working, just need extraction and composition.

### LLMSpanExtractor

Just want span extraction?

In [28]:
from verbatim_rag.extractors import LLMSpanExtractor
from verbatim_rag.vector_stores import SearchResult

# Your chunks (from any retrieval system)
chunks = [
    SearchResult(
        id="1", score=0.95, text="Our system achieved 42.01% accuracy.", metadata={}
    )
]

# Extract ONLY the relevant spans
extractor = LLMSpanExtractor(model="gpt-4o-mini")
relevant_spans = extractor.extract_spans("What is the accuracy?", chunks)

# relevant_spans is a dict: {document_text: [span1, span2, ...]}
for doc_text, spans in relevant_spans.items():
    for span in spans:
        if span.strip():
            print(f"  - {span}")

Extracting spans (batch mode)...



  - Our system achieved 42.01% accuracy.


**Use case:** You just need the extraction logic, will handle composition yourself.
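
If you handle composition yourself, a cheap sanity check is to confirm that each span occurs character-for-character in the document it came from. The sketch below assumes the `{document_text: [span, ...]}` shape described in the cell above; `find_non_verbatim` is a hypothetical helper written for this tutorial, and the library may already enforce this internally.

```python
from typing import Dict, List


def find_non_verbatim(spans_by_doc: Dict[str, List[str]]) -> List[str]:
    """Return every non-empty span that is not an exact substring of its document."""
    return [
        span
        for doc_text, spans in spans_by_doc.items()
        for span in spans
        if span.strip() and span not in doc_text
    ]


doc = "Our system achieved 42.01% accuracy."
spans = {doc: ["42.01% accuracy", "around 42% accuracy"]}

# Only the paraphrased span is reported
print(find_non_verbatim(spans))
```

An empty result means every span can be traced back to the source text exactly as written.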

---

## Part 10: Advanced Configuration

### Custom Answer Templates

Want full control over answer formatting?

In [None]:
# Define your custom template
template = """
Thanks for your question!

You can find the answer in the following sources:
[RELEVANT_SENTENCES]

Please let me know if you need more details.
"""

# Use static mode
rag.template_manager.use_static_mode(template)

response = rag.query("What methods were used?")
print(response.answer)

The `[RELEVANT_SENTENCES]` placeholder is automatically replaced with numbered citations.

**When to use:**
- Static mode: Consistent formatting (customer support, reports)
- Contextual mode (default): LLM adapts template to each question
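
To make the placeholder mechanics concrete, here is a rough illustration of what static-mode filling could look like. This is not the library's implementation: `fill_template` is a hypothetical function, and the `[n] span` citation format is an assumption (the recorded outputs earlier in this tutorial use a richer format).

```python
from typing import List

PLACEHOLDER = "[RELEVANT_SENTENCES]"


def fill_template(template: str, spans: List[str]) -> str:
    """Replace the placeholder with one numbered line per span."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template must contain {PLACEHOLDER}")
    # Assumed citation format: "[1] first span", "[2] second span", ...
    numbered = "\n".join(f"[{i}] {span}" for i, span in enumerate(spans, start=1))
    return template.replace(PLACEHOLDER, numbered)


template = "You can find the answer in the following sources:\n[RELEVANT_SENTENCES]\n"
print(fill_template(template, ["We used two approaches.", "The system achieved 42.01% accuracy."]))
```

The useful takeaway is the failure mode: a template without the placeholder has nowhere to put the extracted spans, so it is worth checking for it before switching to static mode.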

### Use Your Own LLM

In [None]:
from verbatim_rag.core import LLMClient

# Configure any OpenAI-compatible endpoint
llm_client = LLMClient(
    model="anthropic/claude-3-5-sonnet",
    api_base="https://api.anthropic.com/v1",
)

rag_claude = VerbatimRAG(index, llm_client=llm_client)

### Hybrid Search (Sparse + Dense)

Combine sparse (SPLADE) and dense (sentence-transformer) embeddings so retrieval can match both exact terms and paraphrased meaning:

In [None]:
from verbatim_rag.embedding_providers import (
    SpladeProvider,
    SentenceTransformersProvider,
)

sparse = SpladeProvider("naver/splade-v3", device="cpu")
dense = SentenceTransformersProvider("all-MiniLM-L6-v2", device="cpu")

store = LocalMilvusStore(
    "./hybrid.db",
    enable_sparse=True,
    enable_dense=True,
    dense_dim=384,  # Dimension of all-MiniLM-L6-v2
)

index = VerbatimIndex(vector_store=store, sparse_provider=sparse, dense_provider=dense)

# Now retrieval uses both sparse and dense embeddings!
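
This tutorial does not say how the store merges sparse and dense results. As a point of reference only, one widely used way to fuse two ranked lists is reciprocal rank fusion (RRF), sketched below in plain Python. The function name, the chunk IDs, and the constant `k=60` are illustrative assumptions, not Verbatim RAG behavior.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


sparse_hits = ["a", "b", "c"]  # hypothetical chunk IDs from sparse search
dense_hits = ["b", "c", "d"]   # hypothetical chunk IDs from dense search
print(reciprocal_rank_fusion([sparse_hits, dense_hits]))
```

Chunks that appear in both lists accumulate score from each, which is why hybrid retrieval tends to favor results that match on both exact terms and meaning.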

---

## Summary

**1. The Problem:**
- Traditional RAG systems hallucinate even with perfect retrieval
- LLMs round numbers, paraphrase terms, and combine facts incorrectly

**2. The Solution:**
- Verbatim RAG extracts exact text spans (no generation)
- Composes answers from verbatim quotes
- Every claim is verifiable

**3. The Architecture:**
- Two independent layers: VerbatimIndex + Verbatim Core
- Modular design: swap any component
- Use parts or the whole system

**4. Integration Options:**
- Full system
- Wrap existing RAG (LangChain, LlamaIndex)
- Standalone components (just extraction)

**5. Production Deployment:**
- CPU-only option (SPLADE + ModernBERT)
- Cloud storage (CloudMilvusStore)

### Next Steps

1. **Try it with your documents**
2. **Experiment with different extractors** (LLM vs ModernBERT)
3. **Integrate with your existing RAG** (if you have one)
4. **Deploy to production** (use cloud storage)

### Resources

- **GitHub**: [github.com/KRLabsOrg/verbatim-rag](https://github.com/KRLabsOrg/verbatim-rag)
- **Research Paper**: [ACL Anthology 2025](https://aclanthology.org/2025.bionlp-share.8/)
- **Models**: [HuggingFace](https://huggingface.co/KRLabsOrg)

### Citation

If you use Verbatim RAG in your research, please cite our paper:

```bibtex
@inproceedings{kovacs-etal-2025-kr,
    title = "{KR} Labs at {A}rch{EHR}-{QA} 2025: A Verbatim Approach for Evidence-Based Question Answering",
    author = "Kovacs, Adam and Schmitt, Paul and Recski, Gabor",
    booktitle = "Proceedings of the 24th Workshop on Biomedical Language Processing",
    year = "2025",
    url = "https://aclanthology.org/2025.bionlp-share.8/",
    pages = "69--74"
}
```