# Build Verbatim: A Developer's Guide to Hallucination-Free RAG

*A hands-on tutorial for developers who want to build trustworthy, verifiable RAG systems*

<p align="center">
  <img src="https://github.com/KRLabsOrg/verbatim-rag/blob/main/assets/chiliground.png?raw=true" alt="ChiliGround Logo" width="400"/>
  <br><em>Chill, I Ground! 🌶️</em>
</p>

## The Hallucination Problem

If you've built RAG systems before, you've probably seen this:

**Question:** "How much synthetic data was generated?"

**Traditional RAG Answer:** "They generated around 60,000 synthetic training examples..."

**The actual source:** "We constructed a comprehensive dataset of 58k synthetic training examples..."

Notice the problem? The LLM rounded "58k" to "around 60,000". In medical, legal, or financial domains, an approximation like this could be catastrophic.

Even with retrieval-augmented generation, LLMs can still:
- Round or approximate numbers
- Combine information incorrectly
- Fill in gaps with training knowledge
- Generate plausible-sounding but false information

**What if we could force the LLM to only use exact text from the source documents?**

That's exactly what Verbatim RAG does.

---

## The Verbatim Solution

Instead of letting the LLM freely generate text based on retrieved documents, Verbatim RAG:

1. **Extracts exact text spans** from source documents
2. **Composes answers** entirely from these verbatim passages
3. **Provides precise citations** linking back to sources

The result? Zero hallucinations. Every claim is directly traceable to the source text.
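
To make "directly traceable" concrete, here is a minimal sketch of the underlying check: a span only counts as verbatim if it occurs character-for-character in the source. The helper `locate_spans` is a hypothetical name used for illustration and is not part of the Verbatim RAG API.

```python
def locate_spans(spans, source):
    """Map each span to its character offset in source, or None if it is not verbatim."""
    return {span: (source.find(span) if span in source else None) for span in spans}


source = (
    "We used two approaches: zero-shot LLM extraction and a fine-tuned "
    "ModernBERT classifier on 58k synthetic examples."
)
spans = ["58k synthetic examples", "around 60k synthetic examples"]

# A rounded or paraphrased span is rejected because it never appears in the source.
for span, offset in locate_spans(spans, source).items():
    status = f"verbatim at offset {offset}" if offset is not None else "NOT in source"
    print(f"{span!r}: {status}")
```

The offset doubles as a citation anchor: it tells you exactly where in the document the quoted text starts.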

But here's what makes Verbatim RAG special: **it's modular and RAG-agnostic**. You can use it as a complete RAG system, or just use its hallucination-prevention components in your existing pipeline.

In this guide, we'll build a Verbatim RAG system from scratch, understanding each component along the way.

## Understanding the Architecture

Verbatim RAG is modular - each step is configurable so you can pick what works best for your use case.

Here's the complete pipeline:

```
Document (pdf, docx, csv, etc.) → Markdown → DocumentSchema → Chunking → Enhanced Chunks → Embedding → Vector Store → Query → Extract Spans → Compose Answer
```

Let's unpack each step.

## Part 0: Installation

First, install Verbatim RAG and its dependencies:

```bash
pip install verbatim-rag
```


## Part 1: The Complete Picture (5-Minute Quickstart)

Let's see Verbatim RAG in action, then understand how it works.

### Working Example

Verbatim RAG is **fully modular** - you can configure every component of the pipeline.

In [None]:
from verbatim_rag import VerbatimRAG, VerbatimIndex
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider
from verbatim_rag.schema import DocumentSchema

# Create index (storage layer)
store = LocalMilvusStore("./demo.db", enable_sparse=True, enable_dense=False)
embedder = SpladeProvider("naver/splade-v3", device="cpu")
index = VerbatimIndex(vector_store=store, sparse_provider=embedder)

# Add a document
doc = DocumentSchema(
    content="""
# Methods
We used two approaches: zero-shot LLM extraction and a fine-tuned ModernBERT classifier on 58k synthetic examples.
    
# Results
The system achieved 42.01% accuracy on ArchEHR-QA.
""",
    title="Research Paper",
)
index.add_documents([doc])

# Create RAG (add intelligence layer)
rag = VerbatimRAG(index, model="gpt-4o-mini")

# Ask questions → get exact quotes
response = rag.query("What approaches were used?")
print(response.answer)


That's it! Now you have a complete RAG system with hallucination prevention. Each component is independent and swappable.

### The Architecture: Two Layers

**Storage Layer (VerbatimIndex):**
1. **Chunker** - Splits documents into searchable pieces
2. **Embedder** - Converts text to vectors for semantic search (dense/sparse)
3. **Vector Store** - Stores and retrieves your data (local/cloud)

**Intelligence Layer (VerbatimRAG):**
1. **Span Extractor** - Identifies exact text that answers questions (LLM/model-based)
2. **Template Manager** - Structures answers from extracted spans (static/contextual)
3. **LLM Client** - Powers extraction and composition

### Roadmap: How We'll Build This

- **Parts 2-4:** Configure storage components (chunker, embedder, vector store)
- **Part 5:** Assemble storage layer (`VerbatimIndex`) - now you can index & retrieve
- **Parts 6-7:** Use the storage layer (inspect chunks, semantic search)
- **Part 8:** Add intelligence layer (`VerbatimRAG`) - now you get hallucination-free answers
- **Parts 9-11:** Integration patterns and complete examples

By the end, you'll understand every component and how to customize them.

### Choose Your Path

**Already have a RAG system?**
→ Skip to [Part 9](#part-9-integration-patterns---wrapping-existing-rag) - Wrap your LangChain/LlamaIndex/custom RAG

**Just want extraction for your existing retrieval?**
→ Skip to [Part 10](#part-10-using-components-standalone) - Use VerbatimTransform standalone

**Want to see a full example first?**
→ Jump to [Part 11](#part-11-building-a-complete-system) - Research paper search engine

**Ready to learn each component?**
→ Continue to Part 2


## Part 2: Chunking - The Foundation

Chunking is where most RAG systems succeed or fail. Bad chunking → bad retrieval → bad answers.

### Why Chunking Matters

Let's say you have a research paper with this structure:

```markdown
# Introduction
Machine learning has revolutionized AI...

## Related Work
Previous approaches used rule-based methods...

## Our Approach
We propose a novel architecture...
```

If you chunk this into arbitrary 512-character blocks, you might get:

```
Chunk 1: "...has revolutionized AI. Previous approaches used rule-based..."
```

**Problem:** This chunk lost all context! Which section is it from? What's the relationship between these sentences?
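
You can reproduce the problem with a few lines of plain Python. This is a minimal sketch of naive fixed-size chunking (the helper name `fixed_size_chunks` and the tiny block size of 40 are chosen only for illustration); it shows blocks cutting through words and ending up without any heading.

```python
def fixed_size_chunks(text, size):
    """Split text into consecutive blocks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


markdown = (
    "# Introduction\n"
    "Machine learning has revolutionized AI.\n\n"
    "## Related Work\n"
    "Previous approaches used rule-based methods.\n"
)

chunks = fixed_size_chunks(markdown, size=40)
for n, chunk in enumerate(chunks):
    # A block that contains no heading line has lost its section context.
    has_heading = any(line.startswith("#") for line in chunk.splitlines())
    print(n, repr(chunk), "| contains a heading line:", has_heading)
```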

### Enter: The Markdown Chunker

In [6]:
from verbatim_rag.chunker_providers import MarkdownChunkerProvider

chunker = MarkdownChunkerProvider(
    split_levels=(1, 2, 3, 4),  # Split on H1, H2, H3, H4
    include_preamble=True,  # Include text before first heading
)

markdown = """# Introduction
Machine learning has revolutionized AI.

## Related Work
Previous approaches used rule-based methods.

## Our Approach
We propose a novel architecture.
"""

chunks = chunker.chunk(markdown)


This chunker does something clever: it returns **two versions** of each chunk:

1. **Raw chunk** - The original text (used for display/extraction)
2. **Enhanced chunk** - Original text **plus ancestor headings** (used for embedding/retrieval)

Let's see what we get:

In [7]:
for i, (raw, enhanced) in enumerate(chunks):
    print(f"\n--- Chunk {i} ---")
    print(f"Raw:\n{raw}\n")
    print(f"Enhanced:\n{enhanced}")


--- Chunk 0 ---
Raw:
# Introduction
Machine learning has revolutionized AI.



Enhanced:
# Introduction
Machine learning has revolutionized AI.



--- Chunk 1 ---
Raw:
## Related Work
Previous approaches used rule-based methods.



Enhanced:
# Introduction

## Related Work
Previous approaches used rule-based methods.



--- Chunk 2 ---
Raw:
## Our Approach
We propose a novel architecture.


Enhanced:
# Introduction

## Our Approach
We propose a novel architecture.
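
If you're curious how the ancestor headings end up in the enhanced version, the idea can be sketched with a heading stack. This is a simplified illustration and not the library's implementation: `heading_chunks` is a hypothetical helper, it ignores any text before the first heading, and it does not handle `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def heading_chunks(markdown):
    """Return (raw, enhanced) pairs; enhanced prepends the ancestor headings."""
    sections = []   # list of (ancestor_heading_lines, lines_of_this_section)
    stack = []      # currently open headings as (level, heading_line)
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Close headings at the same or deeper level; what remains are ancestors.
            while stack and stack[-1][0] >= level:
                stack.pop()
            current = ([heading for _, heading in stack], [line])
            sections.append(current)
            stack.append((level, line))
        elif current is not None:
            current[1].append(line)
    pairs = []
    for ancestors, lines in sections:
        raw = "\n".join(lines).strip()
        enhanced = "\n\n".join(ancestors + [raw])
        pairs.append((raw, enhanced))
    return pairs


markdown = """# Introduction
Machine learning has revolutionized AI.

## Related Work
Previous approaches used rule-based methods.

## Our Approach
We propose a novel architecture.
"""

pairs = heading_chunks(markdown)
for raw, enhanced in pairs:
    print(f"Raw:\n{raw}\n\nEnhanced:\n{enhanced}\n")
```

The stack is what keeps `## Our Approach` from inheriting `## Related Work`: a sibling heading is popped before the new one is pushed, so only true ancestors remain.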




## Part 3: Document Processing

Before we chunk, we need to get documents into our system. This is where **DocumentSchema** comes in.

### Creating a DocumentSchema

DocumentSchema is your structured document with metadata. It works like a typed dict that also accepts arbitrary custom fields:

In [14]:
from verbatim_rag.schema import DocumentSchema

# From a PDF URL (uses Docling to convert PDF → Markdown)
doc = DocumentSchema.from_url(
    url="https://aclanthology.org/2025.bionlp-share.8.pdf",
    title="KR Labs at ArchEHR-QA 2025",
    authors=["Adam Kovacs", "Paul Schmitt", "Gabor Recski"],
    year=2025,
    conference="BioNLP",
    category="nlp",
)

print(f"Document: {doc.title}")
print(f"Content length: {len(doc.content):,} chars")
print(f"Content preview:\n{doc.content[:200]}...")

2025-10-24 17:16:47,154 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-10-24 17:16:47,218 - INFO - Going to convert document batch...
2025-10-24 17:16:47,219 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-10-24 17:16:47,248 - INFO - Loading plugin 'docling_defaults'
2025-10-24 17:16:47,250 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-24 17:16:47,256 - INFO - Loading plugin 'docling_defaults'
2025-10-24 17:16:47,260 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-10-24 17:16:49,512 - INFO - Accelerator device: 'mps'
2025-10-24 17:16:52,097 - INFO - Accelerator device: 'mps'
2025-10-24 17:16:53,140 - INFO - Accelerator device: 'mps'
2025-10-24 17:16:53,746 - INFO - Processing document 2025.bionlp-share.8.pdf
2025-10-24 17:17:01,365 - INFO - Finished converting document 2025.bionlp-share.8.pdf in 15.19 sec.


Document: KR Labs at ArchEHR-QA 2025
Content length: 25,220 chars
Content preview:
## KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering

Ádám Kovács 1 , Paul Schmitt 2 , Gábor Recski 1 , 2

1 KR Labs lastname@krlabs.eu

2 TU Wien firstname.lastnam...



**What just happened?**

1. Docling downloaded the PDF
2. Converted it to clean markdown (preserving structure!)
3. Created a DocumentSchema with your metadata
4. All custom fields (`conference`, `category`) are preserved

### Manual Creation

In [None]:
# If you already have the content
doc = DocumentSchema(
    content="# My Document\n\nContent here...",
    title="Custom Document",
    user_id="user123",  # Custom field!
    dataset_id="project_a",  # Another custom field!
)

**Key Insight:** All custom fields are preserved throughout the pipeline. When you query, you can filter by any of these fields!
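
One way to picture this is a schema that collects every keyword it does not recognize into a metadata dict instead of dropping it. The class below is a hypothetical stand-in written for illustration; it is an assumption about the general pattern, not how `DocumentSchema` is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a schema that keeps unknown fields as metadata."""
    content: str
    title: str = ""
    metadata: dict = field(default_factory=dict)

    @classmethod
    def create(cls, content, title="", **custom_fields):
        # Anything that is not a core field is kept, not dropped.
        return cls(content=content, title=title, metadata=dict(custom_fields))


doc = SketchDocument.create(
    "# My Document\n\nContent here...",
    title="Custom Document",
    user_id="user123",
    dataset_id="project_a",
)
print(doc.metadata)
```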

## Part 4: Embeddings - Making Text Searchable

Now we need to convert text into vectors so we can search by meaning, not just keywords.

### Dense vs Sparse Embeddings

**Dense embeddings**
- Every dimension has a value
- Typically 384-1536 dimensions
- Examples: `all-MiniLM-L6-v2`, OpenAI's `text-embedding-3-small`

**Sparse embeddings**
- Most dimensions are zero
- Only important terms get non-zero weights
- More interpretable!
- Examples: SPLADE, BM25

Let's try SPLADE:

In [15]:
from verbatim_rag.embedding_providers import SpladeProvider

embedder = SpladeProvider(
    model_name="naver/splade-v3",
    device="cpu",  # Works on CPU!
)

# Embed a single text
text = "Machine learning models require training data"
vector = embedder.embed_text(text)

print(f"Vector type: {type(vector)}")
print(f"Non-zero dimensions: {len(vector)}")
print("\nTop 5 terms:")
for idx, weight in sorted(vector.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  Dimension {idx}: {weight:.3f}")

2025-10-24 17:46:04,290 - INFO - Load pretrained SparseEncoder: naver/splade-v3
2025-10-24 17:46:06,914 - INFO - Loaded SPLADE model: naver/splade-v3


Vector type: <class 'dict'>
Non-zero dimensions: 32

Top 5 terms:
  Dimension 3698: 2.419
  Dimension 2731: 2.083
  Dimension 4083: 1.881
  Dimension 2951: 1.821
  Dimension 3345: 1.613
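
Since the sparse vector is a plain dict of dimension index to weight, comparing a query with a chunk can be sketched as a dot product over the dimensions they share. The weights below are made up for illustration, `sparse_dot` is a hypothetical helper, and the assumption that scoring is an inner product reflects common practice for sparse retrieval rather than anything stated in this guide.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension_index: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)


# Made-up weights for illustration only; real SPLADE vectors come from the model.
query_vec = {3698: 2.0, 2731: 1.0}
chunk_a = {3698: 1.5, 4083: 0.7}   # shares dimension 3698 with the query
chunk_b = {2951: 1.2}              # shares nothing with the query

print(sparse_dot(query_vec, chunk_a))
print(sparse_dot(query_vec, chunk_b))
```

Only overlapping dimensions contribute to the score, which is also why sparse vectors are easier to interpret: you can see which terms made a chunk match.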



### Dense Embeddings

Want dense embeddings instead? Just swap the embedder:

In [16]:
from verbatim_rag.embedding_providers import SentenceTransformersProvider

embedder = SentenceTransformersProvider(
    model_name="sentence-transformers/all-MiniLM-L6-v2", device="cpu"
)

vector = embedder.embed_text(text)
print(f"Dense vector length: {len(vector)}")  # 384 dimensions

2025-10-24 17:46:56,319 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
2025-10-24 17:46:59,749 - INFO - Loaded SentenceTransformers model: sentence-transformers/all-MiniLM-L6-v2



Dense vector length: 384
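
Dense vectors are usually compared with cosine similarity. Here is a minimal pure-Python sketch on toy 3-dimensional vectors; `cosine_similarity` is a hypothetical helper, and whether your vector store uses cosine or another metric depends on how it is configured.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Tiny toy vectors; a real all-MiniLM-L6-v2 vector has 384 values.
same_direction = cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```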


## Part 5: Putting It All Together

In Part 1, we showed you the **complete VerbatimRAG system**. Now let's build it step by step.


### Step 1: Build the Storage Layer (VerbatimIndex)

First, we assemble the index with our configured components:

In [17]:
from verbatim_rag import VerbatimIndex
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider
from verbatim_rag.chunker_providers import MarkdownChunkerProvider

# Create each provider
store = LocalMilvusStore(
    db_path="./my_papers.db",
    collection_name="research_papers",
    enable_sparse=True,
    enable_dense=False,
)

embedder = SpladeProvider("naver/splade-v3", device="cpu")
chunker = MarkdownChunkerProvider()

# Assemble the storage layer
index = VerbatimIndex(
    vector_store=store, sparse_provider=embedder, chunker_provider=chunker
)

print("✓ Storage layer ready!")

2025-10-24 17:47:40,752 - INFO - Created indexes for collection: research_papers
2025-10-24 17:47:40,757 - INFO - Created documents collection: research_papers_documents
2025-10-24 17:47:40,758 - INFO - Connected to Milvus Lite: ./my_papers.db
2025-10-24 17:47:40,770 - INFO - Load pretrained SparseEncoder: naver/splade-v3
2025-10-24 17:47:42,936 - INFO - Loaded SPLADE model: naver/splade-v3


✓ Storage layer ready!


### Step 2: Add the Intelligence Layer (VerbatimRAG)

Now let's add hallucination prevention:

In [18]:
from verbatim_rag import VerbatimRAG

# Build complete RAG system with extraction
rag = VerbatimRAG(
    index=index,
    model="gpt-4o-mini",  # For span extraction
    max_display_spans=5,
)

print("✓ Complete RAG system ready!")

✓ Complete RAG system ready!



Now you have:
- **VerbatimIndex**: Storage layer (retrieves relevant documents)
- **VerbatimRAG**: Complete system (retrieves + extracts verbatim spans)

💡 **Key Difference:**
- `index.query("question")` → Returns relevant chunks
- `rag.query("question")` → Returns verbatim answer with citations
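
To give a feel for what "verbatim answer with citations" means, here is a rough sketch of composing an answer purely from quoted spans. The function `compose_answer`, the intro sentence, and the `[n]` citation format are all assumptions made for illustration; the real templates in Verbatim RAG may look different.

```python
def compose_answer(spans, intro="Relevant passages from the sources:"):
    """Build an answer made only of quoted spans, each with a numbered citation."""
    lines = [intro]
    citations = []
    for number, (text, source_title) in enumerate(spans, start=1):
        # The span text is inserted unchanged; only the framing around it is generated.
        lines.append(f'[{number}] "{text}"')
        citations.append((number, source_title))
    return "\n".join(lines), citations


spans = [
    (
        "We used two approaches: zero-shot LLM extraction and a fine-tuned "
        "ModernBERT classifier on 58k synthetic examples.",
        "Research Paper",
    ),
]
answer, citations = compose_answer(spans)
print(answer)
print(citations)
```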

### Add some documents to the index

In [None]:
from verbatim_rag import VerbatimIndex
from verbatim_rag.schema import DocumentSchema

paper = DocumentSchema.from_url(
    url="https://aclanthology.org/L16-1417.pdf",
    title="Building Concept Graphs from Monolingual Dictionary Entries",
    doc_type="academic_paper",
    authors=["Gabor Recski"],
    conference="LREC",
    year=2016,
    category="nlp",
    dataset_id="anthology",
)

paper2 = DocumentSchema.from_url(
    url="https://aclanthology.org/2025.bionlp-share.8.pdf",
    title="KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering",
    doc_type="academic_paper",
    authors=["Adam Kovacs", "Paul Schmitt", "Gabor Recski"],
    conference="BioNLP",
    year=2025,
    category="nlp",
    dataset_id="anthology",
)

paper3 = DocumentSchema.from_url(
    url="https://aclanthology.org/2020.lrec-1.448.pdf",
    title="Better Together: Modern Methods Plus Traditional Thinking in NP Alignment",
    doc_type="academic_paper",
    authors=["Adam Kovacs", "Judit Acs", "Andras Kornai", "Gabor Recski"],
    conference="LREC",
    year=2020,
    category="nlp",
    dataset_id="anthology",
)

paper4 = DocumentSchema.from_url(
    url="https://aclanthology.org/2020.msr-1.2.pdf",
    title="BME-TUW at SR’20: Lexical grammar induction for surface realization",
    doc_type="academic_paper",
    authors=[
        "Gábor Recski",
        "Ádám Kovács",
        "Kinga Gémes",
        "Judit Ács",
        "Andras Kornai",
    ],
    conference="MSR",
    year=2020,
    category="nlp",
    dataset_id="anthology",
)

paper5 = DocumentSchema.from_url(
    url="https://aclanthology.org/D19-6304.pdf",
    title="BME-UW at SRST-2019: Surface realization with Interpreted Regular Tree Grammars",
    doc_type="academic_paper",
    authors=["Ádám Kovács", "Evelin Ács", "Judit Ács", "Andras Kornai", "Gábor Recski"],
    conference="SRST",
    year=2019,
    category="nlp",
    dataset_id="anthology",
)

paper6 = DocumentSchema.from_url(
    url="https://aclanthology.org/2020.semeval-1.15.pdf",
    title="BMEAUT at SemEval-2020 Task 2: Lexical entailment with semantic graphs",
    doc_type="academic_paper",
    authors=["Ádám Kovács", "Kinga Gémes", "Andras Kornai", "Gábor Recski"],
    conference="SemEval",
    year=2020,
    category="nlp",
    dataset_id="anthology",
)

In [None]:
index.add_documents([paper, paper2, paper3, paper4, paper5, paper6])


### What Happens When You Add Documents?

```python
# Add documents
index.add_documents([doc])
```

Behind the scenes, `VerbatimIndex` runs a four-step pipeline:

1. **Chunking:** Calls `chunker.chunk(doc.content)` → Returns `(raw, enhanced)` tuples
2. **Metadata enrichment:** Appends document metadata to enhanced chunks:
   ```
   # Introduction
   Machine learning...

   ---
   Document: KR Labs at ArchEHR-QA 2025
   Source: https://aclanthology.org/2025.bionlp-share.8.pdf
   Authors: ['Adam Kovacs', 'Paul Schmitt', 'Gabor Recski']
   Year: 2025
   Conference: BioNLP
   ```
3. **Batch embedding:** Calls `embedder.embed_batch(enhanced_chunks)`
4. **Storage:** Stores vectors + metadata in Milvus
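
The metadata enrichment in step 2 can be sketched as appending a small footer to the enhanced chunk before it is embedded. This is an approximation of the format shown above, not the library's code: `enrich_chunk`, the key names, and the label mapping are assumptions made for illustration.

```python
def enrich_chunk(enhanced_chunk, metadata):
    """Append a metadata footer to the text that will be embedded."""
    labels = [
        ("title", "Document"),
        ("source", "Source"),
        ("authors", "Authors"),
        ("year", "Year"),
        ("conference", "Conference"),
    ]
    # Only fields that are actually present end up in the footer.
    footer = [f"{label}: {metadata[key]}" for key, label in labels if key in metadata]
    if not footer:
        return enhanced_chunk
    return enhanced_chunk.rstrip() + "\n\n---\n" + "\n".join(footer)


metadata = {
    "title": "KR Labs at ArchEHR-QA 2025",
    "source": "https://aclanthology.org/2025.bionlp-share.8.pdf",
    "authors": ["Adam Kovacs", "Paul Schmitt", "Gabor Recski"],
    "year": 2025,
    "conference": "BioNLP",
}
enriched = enrich_chunk("# Introduction\nMachine learning...", metadata)
print(enriched)
```

Because the footer is part of the embedded text, a query that mentions an author or a conference can match chunks that never mention them in their body.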

💡 **Key Insight:** Raw text is stored for display/extraction, enhanced text is embedded for retrieval!


## Part 6: Inspecting Your Index

Once you've indexed documents, you should inspect what got created. This is crucial for debugging and understanding your system.

### Get Overview Statistics

In [21]:
stats = index.inspect()
print(f"Total documents: {stats['total_documents']}")
print(f"Total chunks: {stats['total_chunks']}")
print(f"Average chunks per doc: {stats['total_chunks'] / stats['total_documents']:.1f}")

print("\nDocument types:")
for doc_type, count in stats["doc_types"].items():
    print(f"  {doc_type}: {count}")

print("\nSample documents:")
for doc in stats["sample_documents"][:3]:
    print(f"  - {doc['title']} ({doc.get('year', 'N/A')})")

Total documents: 6
Total chunks: 88
Average chunks per doc: 14.7

Document types:
  academic_paper: 6

Sample documents:
  - BMEAUT at SemEval-2020 Task 2: Lexical entailment with semantic graphs (N/A)
  - BME-UW at SRST-2019: Surface realization with Interpreted Regular Tree Grammars (N/A)
  - Building Concept Graphs from Monolingual Dictionary Entries (N/A)



### Examine Individual Chunks

Let's look at what actually got stored:

In [None]:
chunks = index.get_all_chunks(limit=5)

for i, chunk in enumerate(chunks):
    print(f"\n{'=' * 60}")
    print(f"Chunk {i}")
    print(f"{'=' * 60}")

    print(f"\nID: {chunk.id}")
    print(f"Score: {chunk.score}")

    print("\n--- Raw Text (for extraction/display) ---")
    print(chunk.text[:200] + "...")

    print("\n--- Enhanced Text (what was embedded) ---")
    print(chunk.enhanced_text[:300] + "...")

    print("\n--- Metadata ---")
    metadata_keys = ["title", "authors", "year", "conference", "chunk_number"]
    for key in metadata_keys:
        if key in chunk.metadata:
            print(f"  {key}: {chunk.metadata[key]}")


**This will show you:**
- How your chunks look after processing
- What metadata was preserved
- The difference between raw and enhanced text

### Filter by Metadata

Remember those custom fields you added? Now you can filter by them:

In [23]:
# Get only 2025 papers
chunks_2025 = index.query(
    text=None,  # No semantic search, just filter
    filter='metadata["year"] == 2025',
    k=100,
)
print(f"Found {len(chunks_2025)} chunks from 2025")

Found 19 chunks from 2025


In [27]:
chunks_bionlp = index.query(filter='metadata["conference"] == "BioNLP"', k=50)
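
Conceptually, a metadata filter keeps only the chunks whose stored fields match your condition, without any semantic scoring. The plain-Python sketch below mimics that on a toy list of dicts; `filter_chunks` and the dict layout are hypothetical, and the real filtering is evaluated by the vector store from the filter expression, not in Python.

```python
chunks = [
    {"text": "chunk from the BioNLP paper", "metadata": {"year": 2025, "conference": "BioNLP"}},
    {"text": "chunk from an LREC paper", "metadata": {"year": 2016, "conference": "LREC"}},
    {"text": "chunk from a SemEval paper", "metadata": {"year": 2020, "conference": "SemEval"}},
]


def filter_chunks(chunks, **conditions):
    """Keep chunks whose metadata equals every given key=value condition."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in conditions.items())
    ]


print(filter_chunks(chunks, year=2025))
print(filter_chunks(chunks, conference="LREC", year=2016))
```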

In [29]:
# Get chunks from a specific document
doc_chunks = index.get_chunks_by_document("13354f18-c365-484c-bf6d-16a26401b73d")