# **BUILD A RAG PIPELINE USING CHONKIE CHUNKER & RETAB**

[Chonkie](https://chonkie.ai/) is a powerful and flexible text chunking library designed specifically for RAG pipelines.

_Chunking means splitting text into manageable blocks (sentences, paragraphs, etc.) called “chunks”, which are then embedded._

**More information on Chonkie [here](https://chonkie.ai/).**

In [1]:
# %pip install retab
# %pip install chonkie

In [2]:
# Parse a Document with retab
from dotenv import load_dotenv
from retab import Retab

load_dotenv() # You need to create a .env file containing your RETAB_API_KEY=sk_retab_***

client = Retab()

# Parse the document
response = client.documents.parse(
    document="../assets/docs/ETF-facts.pdf",
    model="gemini-2.5-flash",
    table_parsing_format="markdown",  # Better for RAG
    image_resolution_dpi=150          # Higher quality for technical docs
)

print(response)

document=BaseMIMEData(filename='ETF-facts.pdf', url='data:application/pdf;base64,JVBERi0xLjMKJbrfrOAKMy...', content='JVBERi0xLjMKJbrfrOAKMyAwIG9iago8PC9UeXBlIC9QYWdlCi...', mime_type='application/pdf', extension='pdf') usage=RetabUsage(page_count=3, credits=3.0) pages=["SERIES B • CDN$ • ISC • AS AT JULY 09, 2025\n\n## FUND CODE\nISC\n7882\n\n# Fidelity All-American Equity ETF Fund\n\n## KEY FACTS\nSeries Inception **June 03, 2025**\nNAV **$10.25**\nBenchmark **S&P 500 Index**\nFund aggregate assets\n(all series) as at\nDistributions **Annually**\nAlso available through **ETF CDN$**\n\n### Risk classification\n[Figure: Risk classification bar showing categories: Low, Low to Medium, Medium, Medium to High, High]\n\n## PORTFOLIO MANAGERS\n\nManaged by Geode - Geode Capital Management is a global systematic investment manager. With a robust infrastructure and talented investment professionals, Geode offers clients the scale of a large asset management firm, with the benefits of a versati

In this example, we will use **Chonkie's Sentence Chunker**, which splits text into chunks without cutting sentences, so each chunk starts and ends on a sentence boundary and keeps its local context.

You can find more information [here](https://docs.chonkie.ai/python-sdk/chunkers/sentence-chunker).
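
To make the idea of sentence-preserving chunking with overlap concrete, here is a minimal sketch in plain Python. It is not Chonkie's implementation: the function name `sentence_chunks` is hypothetical, sentences are split with a naive regex, and sizes are counted in whitespace-separated words rather than tokenizer tokens.

```python
import re


def sentence_chunks(text, chunk_size=20, chunk_overlap=5):
    """Greedy sentence packing. Illustrative only, not Chonkie's algorithm.

    Assumption: sizes are measured in words, not tokenizer tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    counts = [len(s.split()) for s in sentences]

    chunks = []
    start = 0
    while start < len(sentences):
        # Always take at least one sentence, then add more while they fit.
        end = start + 1
        size = counts[start]
        while end < len(sentences) and size + counts[end] <= chunk_size:
            size += counts[end]
            end += 1
        chunks.append(" ".join(sentences[start:end]))
        if end == len(sentences):
            break

        # Step back over whole trailing sentences to form the overlap,
        # while always advancing by at least one sentence.
        next_start = end
        overlap = 0
        while (next_start - 1 > start
               and overlap + counts[next_start - 1] <= chunk_overlap):
            next_start -= 1
            overlap += counts[next_start]
        start = next_start
    return chunks


demo_text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
demo_chunks = sentence_chunks(demo_text, chunk_size=6, chunk_overlap=3)
for c in demo_chunks:
    print(c)
```

Each chunk holds whole sentences only, and consecutive chunks share their boundary sentence. A real chunker such as `SentenceChunker` counts tokens with the configured tokenizer and handles sentence detection more carefully.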


In [3]:
# Initialize chunker for RAG
from chonkie import SentenceChunker

chunker = SentenceChunker(
    tokenizer_or_token_counter="gpt2",
    chunk_size=512,
    chunk_overlap=128,
    min_sentences_per_chunk=1
)

# Process each page and create chunks
all_chunks = []
for page_num, page_text in enumerate(response.pages, 1):
    chunks = list(chunker(page_text))
    
    for chunk_idx, chunk in enumerate(chunks):
        chunk_data = {
            "page": page_num,
            "chunk_id": f"page_{page_num}_chunk_{chunk_idx}",
            "text": str(chunk),
            "document": response.document.filename
        }
        all_chunks.append(chunk_data)

print(f"Created {len(all_chunks)} chunks from {response.usage.page_count} pages")

Created 6 chunks from 3 pages
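
Since each chunk record is a plain dictionary, one possible next step is to save the records to disk so that a later embedding or indexing step can reuse them without re-parsing the document. The sketch below assumes a JSON Lines file. The helper names `save_chunks_jsonl` and `load_chunks_jsonl` and the `sample_chunks` list (a stand-in for `all_chunks` with placeholder text) are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_chunks_jsonl(chunks, path):
    """Write one JSON object per line; returns the path written."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in chunks:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def load_chunks_jsonl(path):
    """Read the records back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical stand-in for `all_chunks`, using the same keys as above.
sample_chunks = [
    {"page": 1, "chunk_id": "page_1_chunk_0",
     "text": "placeholder text A", "document": "ETF-facts.pdf"},
    {"page": 1, "chunk_id": "page_1_chunk_1",
     "text": "placeholder text B", "document": "ETF-facts.pdf"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = save_chunks_jsonl(sample_chunks, Path(tmp_dir) / "chunks.jsonl")
    loaded_chunks = load_chunks_jsonl(out_path)

print(f"Round-tripped {len(loaded_chunks)} chunk records")
```

In the notebook itself, `all_chunks` would be passed instead of `sample_chunks`, with a permanent path in place of the temporary directory.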


