# Day 12: Chunking - Handling Large Documents

RAG works great with short documents.

But what about a 50-page PDF?

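You can't embed it as one unit: embedding models cap their input length, and a single vector for 50 pages is too coarse to match a specific question. You need to **chunk** it.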

## Setup

In [28]:
from google import genai
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path='../.env')
API_KEY = os.environ["GEMINI_API_KEY"]
client = genai.Client(api_key=API_KEY)

## A Long Document

In [29]:
long_document = """
Machine Learning: A Comprehensive Overview

Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed.

Supervised learning uses labeled data. The algorithm learns to map inputs to known outputs. Common applications include spam detection and image classification.

Unsupervised learning works with unlabeled data. The algorithm finds hidden patterns. Common applications include customer segmentation and anomaly detection.

Reinforcement learning involves an agent learning through trial and error. The agent takes actions and receives rewards or penalties.
"""

print(f"Document: {len(long_document)} characters, ~{len(long_document.split())} words")

Document: 637 characters, ~84 words


## Chunking by Paragraph

In [30]:
def chunk_by_paragraph(text):
    """Split text into paragraph-based chunks."""
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    return paragraphs

chunks = chunk_by_paragraph(long_document)

print(f"Created {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:60]}...")

Created 5 chunks:

Chunk 1: Machine Learning: A Comprehensive Overview...
Chunk 2: Machine learning is a subset of artificial intelligence that...
Chunk 3: Supervised learning uses labeled data. The algorithm learns ...
Chunk 4: Unsupervised learning works with unlabeled data. The algorit...
Chunk 5: Reinforcement learning involves an agent learning through tr...
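Paragraph splitting works here because every paragraph is short. In a real 50-page PDF, some paragraphs may be too long for the embedding model. One common alternative, not used in this notebook, is a fixed-size window with overlap. The sketch below is a minimal version of that idea; `chunk_by_words`, `chunk_size`, and `overlap` are hypothetical names and values chosen for illustration.

```python
def chunk_by_words(text, chunk_size=50, overlap=10):
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return windows


# Illustrative input: 120 placeholder words (an assumption, not notebook data).
sample_text = " ".join(f"w{i}" for i in range(120))
for i, window in enumerate(chunk_by_words(sample_text)):
    print(f"Window {i+1}: {len(window.split())} words")
```

The overlap repeats the tail of one window at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk. The trade-off is more chunks to embed and store.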


## Index and Search Chunks

In [31]:
import numpy as np

def cosine_similarity(vec1, vec2):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    vec1, vec2 = np.array(vec1), np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Index all chunks
chunk_embeddings = []
for chunk in chunks:
    response = client.models.embed_content(model="gemini-embedding-001", contents=chunk)
    chunk_embeddings.append(response.embeddings[0].values)

print(f"✅ Indexed {len(chunks)} chunks")

✅ Indexed 5 chunks


In [32]:
# Search
query = "How does an AI agent learn from rewards?"

query_emb = client.models.embed_content(model="gemini-embedding-001", contents=query).embeddings[0].values

scores = [(cosine_similarity(query_emb, emb), i) for i, emb in enumerate(chunk_embeddings)]
scores.sort(reverse=True)

print(f"🔎 Query: '{query}'\n")
print(f"Best match: {chunks[scores[0][1]]}")

🔎 Query: 'How does an AI agent learn from rewards?'

Best match: Reinforcement learning involves an agent learning through trial and error. The agent takes actions and receives rewards or penalties.
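The search above keeps only the single best chunk. A RAG prompt often benefits from the top few chunks instead. The sketch below shows one way to rank all chunks in a single vectorized step. `top_k_chunks` is a hypothetical helper, and the 3-dimensional vectors are toy stand-ins for real embeddings so the example runs without an API key.

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return (score, index) pairs for the k chunks most similar to the query."""
    matrix = np.asarray(chunk_vecs, dtype=float)  # shape: (num_chunks, dim)
    query = np.asarray(query_vec, dtype=float)    # shape: (dim,)
    # Cosine similarity of the query against every chunk at once.
    # Assumes no vector is all zeros (that would divide by zero).
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    best = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(float(scores[i]), int(i)) for i in best]


# Toy vectors for illustration only; real embeddings have many more dimensions.
toy_chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query_vec = [0.9, 0.1, 0.0]

for score, idx in top_k_chunks(toy_query_vec, toy_chunk_vecs, k=2):
    print(f"chunk {idx}: similarity {score:.3f}")
```

With the notebook's data, the call would be `top_k_chunks(query_emb, chunk_embeddings, k=2)`, and the returned indices would look up the text in `chunks`.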


## Key Takeaways

1. Long documents need **chunking** before embedding
2. Each chunk gets its own embedding
3. Search finds the **relevant chunk**, not the whole document

---

**Next:** Day 13: Vector Databases