# **BUILD A RAG PIPELINE USING CHONKIE CHUNKER & RETAB**

[Chonkie](https://chonkie.ai/) is a powerful and flexible text chunking library designed specifically for RAG pipelines.

_Chunking means splitting text into manageable blocks (sentences, paragraphs, etc.), called “chunks”, which are then embedded._

**More information on Chonkie [here](https://chonkie.ai/).**

In [1]:
# %pip install retab
# %pip install chonkie
# %pip install faiss-cpu

### **1. RETAB PREPROCESSING**

In [2]:
# Parse a Document with retab
from dotenv import load_dotenv
from retab import Retab

load_dotenv() # You need to create a .env file containing your RETAB_API_KEY=sk_retab_***

client = Retab()

# Parse the document
response = client.documents.parse(
    document="../assets/docs/Americas-AI-Action-Plan.pdf",
    model="gemini-2.5-flash",
    table_parsing_format="markdown",  # Better for RAG
    image_resolution_dpi=150          # Higher quality for technical docs
)

print(response)

document=BaseMIMEData(filename='Americas-AI-Action-Plan.pdf', url='data:application/pdf;base64,JVBERi0xLjYNJeLjz9MNCj...', content='JVBERi0xLjYNJeLjz9MNCjk2MSAwIG9iag08PC9MaW5lYXJpem...', mime_type='application/pdf', extension='pdf') usage=RetabUsage(page_count=28, credits=28.0) pages=['THE WHITE HOUSE\n\n[Figure: Executive Office of the President of the United States Seal, with "OFFICE OF SCIENCE AND TECHNOLOGY POLICY" text]\n\n*Winning the Race*\nAMERICA\'S\nAI ACTION PLAN\n\nJULY 2025', "AMERICA'S AI ACTION PLAN\n\n*“Today, a new frontier of scientific discovery lies before us, defined by transformative technologies such as artificial intelligence… Breakthroughs in these fields have the potential to reshape the global balance of power, spark entirely new industries, and revolutionize the way we live and work. As our global competitors race to exploit these technologies, it is a national security imperative for the United States to achieve and maintain unquestioned and unchallenged g

In [3]:
print(dir(response))
print(response.pages)

['__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__class_vars__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__fields__', '__fields_set__', '__firstlineno__', '__format__', '__ge__', '__get_pydantic_core_schema__', '__get_pydantic_json_schema__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__pretty__', '__private_attributes__', '__pydantic_complete__', '__pydantic_computed_fields__', '__pydantic_core_schema__', '__pydantic_custom_init__', '__pydantic_decorators__', '__pydantic_extra__', '__pydantic_fields__', '__pydantic_fields_set__', '__pydantic_generic_metadata__', '__pydantic_init_subclass__', '__pydantic_parent_namespace__', '__pydantic_post_init__', '__pydantic_private__', '__pydantic_root_model__', '__pydantic_serializer__', '__pydantic_setattr_handlers__', '__pydantic_validator

### **2. CHONKIE CHUNKING**

In this example, we will use **Chonkie's Token Chunker**, which splits text into chunks based on token count so that each chunk stays within a specified token limit.

You can find more information [here](https://docs.chonkie.ai/python-sdk/chunkers/token-chunker).
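
To make the idea concrete, here is a minimal sketch of fixed-size token chunking with overlap. It is not Chonkie's implementation: the `sliding_token_chunks` helper is hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
def sliding_token_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The last window already reaches the end of the text
    return chunks

# Assumption: whitespace-separated words stand in for tokenizer tokens.
words = "a b c d e f g h i j".split()
demo_chunks = sliding_token_chunks(words, chunk_size=4, chunk_overlap=1)
for c in demo_chunks:
    print(" ".join(c))
```

With `chunk_size=512` and `chunk_overlap=128`, as used in the cell below, the window would advance 384 tokens at a time under this scheme.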


In [4]:
from chonkie import TokenChunker
import tiktoken

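parsed_content = response.pages  # One text string per parsed page

# Token counts below are measured with the GPT-2 encoding
tokenizer = tiktoken.get_encoding("gpt2")

chunker = TokenChunker(
    tokenizer=tokenizer,
    chunk_size=512,    # Maximum tokens per chunk
    chunk_overlap=128  # Tokens shared between consecutive chunks
)

all_chunks = []

# Chunk each page separately so every chunk keeps its (zero-based) page number
for page_num, page_text in enumerate(parsed_content):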
    chunks = chunker.chunk(page_text)
    for chunk in chunks:
        all_chunks.append({
            "page_num": page_num,
            "chunk_text": chunk.text,
            "token_count": chunk.token_count,
            "start_index": chunk.start_index,
            "end_index": chunk.end_index
        })

# Example: print info about the first 3 chunks
for chunk in all_chunks[:3]:
    print(f"\n[Page {chunk['page_num']}]")
    print(f"Chunk: {chunk['chunk_text'][:100]}...")  # Print first 100 chars
    print(f"Tokens: {chunk['token_count']}, Range: {chunk['start_index']}-{chunk['end_index']}")


[Page 0]
Chunk: THE WHITE HOUSE

[Figure: Executive Office of the President of the United States Seal, with "OFFICE ...
Tokens: 60, Range: 0-195

[Page 1]
Chunk: AMERICA'S AI ACTION PLAN

*“Today, a new frontier of scientific discovery lies before us, defined by...
Tokens: 140, Range: 0-676

[Page 2]
Chunk: AMERICA'S AI ACTION PLAN

# Table of Contents

Introduction ...........................................
Tokens: 512, Range: 0-3188


### **3. OPENAI EMBEDDING**

In [5]:
import openai

texts = [chunk['chunk_text'] for chunk in all_chunks]
embeddings = []

batch_size = 100
for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    # Use a distinct name so the Retab `response` from above is not overwritten
    emb_response = openai.embeddings.create(
        input=batch,
        model="text-embedding-3-small"
    )
    embeddings.extend([d.embedding for d in emb_response.data])


### **4. FAISS INDEXING**

In [6]:
import numpy as np
import faiss

embeddings = np.array(embeddings).astype('float32')
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
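
`IndexFlatL2` performs an exact (brute-force) search: it compares the query against every stored vector and returns the `k` smallest squared Euclidean distances. The pure-Python sketch below illustrates that computation on toy 2-D vectors. The `flat_l2_search` helper is hypothetical and only mirrors the result, not how FAISS is implemented internally.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k stored vectors nearest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # Smallest distance first; ties are broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy data (assumption): four 2-D vectors instead of real embeddings
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(toy_vectors, query=[1.0, 1.0], k=2)
print(distances, indices)
```

A flat index keeps search exact at the cost of scanning every vector, which is a reasonable trade-off for a single document's worth of chunks.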

### **5. RETRIEVAL AND ANSWER GENERATION**

In [7]:
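# Define retrieval function
def retrieve(query, k=4):
    # Embed the query with the same model used for the chunks
    q_emb = openai.embeddings.create(
        input=[query],
        model="text-embedding-3-small"
    ).data[0].embedding
    # D holds squared L2 distances, I the matching row indices into `texts`
    D, I = index.search(np.array([q_emb], dtype='float32'), k)
    # FAISS pads results with -1 when fewer than k vectors are indexed
    return [texts[i] for i in I[0] if i != -1]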

# Generate answer (OpenAI chat model; gpt-3.5-turbo in this example)
def answer_query(query, k=4):
    context = "\n\n---\n\n".join(retrieve(query, k))
    prompt = f"""Use the context below to answer the user's question.

Context: {context}

Question: {query}

Answer:"""
    resp = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2
    )
    return resp.choices[0].message.content.strip()

In [8]:
# Example 
query = "What are the main pillars of America's AI Action Plan?"
print(answer_query(query))

The main pillars of America's AI Action Plan are:

1. Accelerate AI Innovation
2. Build American AI Infrastructure
3. Lead in International AI Diplomacy and Security
