# RAG Pipeline with Vertector

This notebook builds a complete RAG pipeline, covering:
- Document chunking
- Vector store integration
- Semantic search
- Batch ingestion
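
Before the Vertector-specific code, the stages listed above can be sketched with the standard library alone. This is a toy illustration, not how Vertector works internally: the helper names (`chunk_text`, `embed`, `cosine`, `toy_search`) are hypothetical, the "embedding" is a bag-of-words count vector standing in for a real model, and the sample text is made up.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 6) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_search(store: list[tuple[str, Counter]], query: str, top_k: int = 3) -> list[dict]:
    """Rank stored chunks by similarity to the query and keep the best top_k."""
    q = embed(query)
    scored = [{"text": text, "score": cosine(q, vec)} for text, vec in store]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]


# Made-up sample text: three sentences of six words each.
toy_document = (
    "Chunking splits a document into passages. "
    "Embeddings map each passage to vectors. "
    "Search ranks passages by vector similarity."
)

toy_chunks = chunk_text(toy_document, max_words=6)           # 1. chunk
toy_store = [(chunk, embed(chunk)) for chunk in toy_chunks]  # 2. embed and store
toy_results = toy_search(toy_store, "vector similarity search", top_k=2)  # 3. search

for r in toy_results:
    print(f"{r['score']:.3f}  {r['text']}")
```

The cells below follow the same steps, with `HybridChunker` handling chunking and `ChromaAdapter` handling embedding, storage, and search.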

## Setup

In [1]:
from pathlib import Path
from vertector_data_ingestion import (
    UniversalConverter,
    LocalMpsConfig,
    HybridChunker,
    ChromaAdapter,
    ExportFormat,
    setup_logging,
)
from vertector_data_ingestion.models.config import ChunkingConfig

setup_logging(log_level="INFO")

2026-01-03 02:06:27 | INFO     | vertector_data_ingestion.monitoring.logger:setup_logging:51 - Logging initialized at INFO level


## Basic RAG Pipeline

In [2]:
# Configure chunker with Qwen3-Embedding-0.6B (smaller, faster)
chunk_config = ChunkingConfig(
    tokenizer="Qwen/Qwen3-Embedding-0.6B",
    max_tokens=1024,
)

converter = UniversalConverter(LocalMpsConfig())
doc_path = Path("../test_documents/arxiv_sample.pdf")

if doc_path.exists():
    # Step 1: Convert
    print("Step 1: Converting Document")
    doc = converter.convert(doc_path)
    print(f"✓ Converted: {doc.metadata.num_pages} pages")
    
    # Step 2: Chunk with custom config
    print("\nStep 2: Creating Chunks")
    print(f"Using tokenizer: {chunk_config.tokenizer}")
    chunker = HybridChunker(config=chunk_config)
    chunks = chunker.chunk_document(doc)
    print(f"✓ Created: {chunks.total_chunks} chunks")
    
    # Step 3: Store with matching embedding model
    print("\nStep 3: Storing in Vector DB")
    vector_store = ChromaAdapter(
        collection_name="rag_pipeline",
        embedding_model="Qwen/Qwen3-Embedding-0.6B"
    )
    vector_store.add_chunks(chunks.chunks, batch_size=4)
    print(f"✓ Stored: {len(chunks.chunks)} chunks")
    
    # Step 4: Search
    print("\nStep 4: Semantic Search")
    results = vector_store.search("Who is the main author of this paper?", top_k=3)
    for i, result in enumerate(results, 1):
        print(f"\nResult {i}:")
        print(f"  Score: {result['score']:.3f}")
        print(f"  Text: {result['text'][:100]}...")
else:
    print(f"File not found: {doc_path}")

2026-01-03 02:06:33 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2026-01-03 02:06:33 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2026-01-03 02:06:34 | INFO     | vertector_data_ingestion.core.pipeline_router:__init__:55 - Hardware detected: mps
2026-01-03 02:06:34 | INFO     | vertector_data_ingestion.core.universal_converter:__init__:44 - Initialized UniversalConverter on mps
2026-01-03 02:06:34 | INFO     | vertector_data_ingestion.core.universal_converter:_ensure_models_available:67 - Checking model availability...


Step 1: Converting Document
Consider using the pymupdf_layout package for a greatly improved page layout analysis.


2026-01-03 02:06:38 | INFO     | vertector_data_ingestion.core.pipeline_router:determine_pipeline:99 - Using Classic pipeline for PDF with tables: arxiv_sample.pdf
2026-01-03 02:06:38 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:175 - Converting arxiv_sample.pdf with classic pipeline
2026-01-03 02:06:38,377 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-03 02:06:38,459 - INFO - Going to convert document batch...
2026-01-03 02:06:38,460 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 870e160bad93d15722a8ae8d62725e09
2026-01-03 02:06:38,480 - INFO - Loading plugin 'docling_defaults'
2026-01-03 02:06:38,482 - INFO - Registered picture descriptions: ['vlm', 'api']
2026-01-03 02:06:38,490 - INFO - Loading plugin 'docling_defaults'
2026-01-03 02:06:38,502 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrm

✓ Converted: 9 pages

Step 2: Creating Chunks
Using tokenizer: Qwen/Qwen3-Embedding-0.6B


2026-01-03 02:07:11 | INFO     | vertector_data_ingestion.chunkers.hybrid_chunker:__init__:50 - Initialized HybridChunker with max_tokens=1024, merge_peers=True
2026-01-03 02:07:11 | INFO     | vertector_data_ingestion.chunkers.hybrid_chunker:chunk_document:68 - Chunking document: arxiv_sample.pdf (9 pages)
2026-01-03 02:07:12 | INFO     | vertector_data_ingestion.chunkers.hybrid_chunker:chunk_document:99 - Created 29 chunks


✓ Created: 29 chunks

Step 3: Storing in Vector DB


2026-01-03 02:07:19 | INFO     | vertector_data_ingestion.vector.chroma_adapter:__init__:42 - Loading embedding model: Qwen/Qwen3-Embedding-0.6B
2026-01-03 02:07:19,303 - INFO - Use pytorch device_name: mps
2026-01-03 02:07:19,304 - INFO - Load pretrained SentenceTransformer: Qwen/Qwen3-Embedding-0.6B
2026-01-03 02:07:25,125 - INFO - 1 prompt is loaded, with the key: query
2026-01-03 02:07:25,178 - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2026-01-03 02:07:26 | INFO     | vertector_data_ingestion.vector.chroma_adapter:__init__:53 - ChromaDB initialized in-memory
2026-01-03 02:07:26 | INFO     | vertector_data_ingestion.vector.chroma_adapter:__init__:61 - Using collection: rag_pipeline
2026-01-03 02:07:26 | INFO     | vertector_data_in

✓ Stored: 29 chunks

Step 4: Semantic Search




Result 1:
  Score: -0.243
  Text: PDF document conversion, layout segmentation, object-detection, data set, Machine Learning...

Result 2:
  Score: -0.268
  Text: Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. 2022. DocLayNet: ...

Result 3:
  Score: -0.325
  Text: Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com
Christoph Auer IBM Researc...


In [12]:
chunks.chunks[0].chunk_index

0

## Batch Document Ingestion

In [None]:
documents_dir = Path("../test_documents/")

if documents_dir.exists():
    pdf_files = list(documents_dir.glob("*.pdf"))[:5]
    
    if pdf_files:
        print(f"Ingesting {len(pdf_files)} documents...\n")
        
        # Configure with Qwen3-Embedding-0.6B (default)
        chunk_config = ChunkingConfig(
            tokenizer="Qwen/Qwen3-Embedding-0.6B",
            max_tokens=1024,
        )
        
        converter = UniversalConverter()
        chunker = HybridChunker(config=chunk_config)
        vector_store = ChromaAdapter(
            collection_name="rag_pipeline",
            embedding_model="Qwen/Qwen3-Embedding-0.6B"
        )
        
        # Convert all documents
        docs = converter.convert(pdf_files, parallel=True)
        
        # Chunk and store all
        all_chunks = []
        for doc in docs:
            chunks = chunker.chunk_document(doc)
            for chunk in chunks.chunks:
                chunk.metadata["source_file"] = doc.metadata.source_path.name
            all_chunks.extend(chunks.chunks)
        
        vector_store.add_chunks(all_chunks, batch_size=4)
        print(f"\n✓ Ingested {len(all_chunks)} chunks from {len(docs)} documents")
    else:
        print("No PDF files found")
else:
    print(f"Directory not found: {documents_dir}")
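
How `add_chunks` groups its input internally is not shown in this notebook. As a rough sketch, assuming `batch_size` simply splits the chunk list into consecutive slices, the grouping could look like the following; `iter_batches` is a hypothetical helper, not part of Vertector.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 29 placeholder items, matching the chunk count logged for the single-document run.
placeholder_chunks = [f"chunk-{i}" for i in range(29)]
batch_sizes = [len(batch) for batch in iter_batches(placeholder_chunks, batch_size=4)]
print(batch_sizes)
```

A small batch size keeps peak memory low during embedding at the cost of more model calls; the best value depends on the device and the embedding model.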

## Advanced Search

In [None]:
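# Reopen the "rag_pipeline" collection populated above. The earlier log shows
# ChromaDB running in-memory, so run this in the same kernel session.
vector_store = ChromaAdapter(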
    collection_name="rag_pipeline",
    embedding_model="Qwen/Qwen3-Embedding-0.6B"
)

queries = [
    "Who are the authors of this paper?",
    "What are the main findings?",
    "What is the paper about?",
]

for query in queries:
    results = vector_store.search(query, top_k=1)
    if results:
        print(f"Q: {query}")
        print(f"A: {results[0]['text']}\n")
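
The loop above prints the top retrieved chunk as the answer. In a full RAG setup, the retrieved text is usually passed to a language model as context instead. A minimal sketch of that hand-off follows, assuming only that each result is a dict with a `'text'` key as in the search output above; `build_prompt`, its prompt wording, and the sample results are hypothetical.

```python
def build_prompt(query: str, results: list[dict], max_chars: int = 500) -> str:
    """Format retrieved passages and the question into a single prompt string."""
    context_lines = []
    for i, result in enumerate(results, 1):
        # Truncate long passages so the prompt stays within a rough size budget.
        snippet = result["text"][:max_chars]
        context_lines.append(f"[{i}] {snippet}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Placeholder results with the same 'text' and 'score' keys used above.
sample_results = [
    {"text": "First retrieved passage.", "score": -0.24},
    {"text": "Second retrieved passage.", "score": -0.27},
]
prompt = build_prompt("What is the paper about?", sample_results)
print(prompt)
```

Numbering the passages makes it possible to ask the model to cite which passage supports its answer.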

## Summary

This notebook demonstrated:
- A complete RAG pipeline: convert, chunk, store, and search
- Batch document ingestion with the unified `convert()`
- Vector search over the stored chunks

Next: `04_multimodal_integration.ipynb`