```{contents}
```

## Chunking

Chunking means **splitting large documents into smaller pieces (chunks)** so that LLMs and vector databases can process them efficiently.

Example:
A 20-page PDF → broken into small sections (chunks) of around 200–1000 tokens.

The way documents are split before they reach an embedding model or LLM directly affects downstream quality.
Good chunking leads to:

* higher retrieval accuracy
* more relevant context
* better summaries
* fewer hallucinations


---

### Why Chunking Is Needed

* Embedding models have input token limits (often 512 tokens, sometimes more), and text beyond the limit is typically truncated
* LLMs have finite context windows, so very large documents cannot be passed in whole
* Retrieval works best on small, self-contained units of meaning

In short:

* Bad chunking → irrelevant retrieval → hallucinations
* Good chunking → precise answers → high RAG quality

---

### Common Chunking Strategies

Below are the **8 major chunking strategies** used in RAG systems.

---

#### Fixed-Size Chunking

Split by token or character count:

* 200 tokens
* 500 tokens
* 1000 characters

**Pros**

* Simple
* Fast
* Works for uniform text

**Cons**

* Can cut sentences in the middle
* Loses context at chunk boundaries
* Can lead to hallucinations when a chunk's meaning is cut off

**Use Case**

General documents where structure doesn’t matter.
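
A minimal sketch of fixed-size splitting, assuming the text is already tokenized into a list. Whitespace splitting stands in for a real tokenizer, and `fixed_size_chunks` is an illustrative name rather than a library function.

```python
def fixed_size_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Assumption: whitespace-separated words approximate tokens.
words = "Chunking splits large documents into smaller pieces for retrieval".split()
for chunk in fixed_size_chunks(words, 4):
    print(" ".join(chunk))
```

Note that the boundaries ignore sentence structure entirely, which is the source of the cons listed above.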

---

#### Sliding Window / Overlapping Chunking

Chunks overlap each other:

Example (chunk size = 200, overlap = 50)

```
Chunk 1: tokens 1–200
Chunk 2: tokens 151–350
Chunk 3: tokens 301–500
```

**Pros**

* Reduces the chance that an important sentence is cut at a chunk boundary
* Gives more context continuity
* Works very well for QA and search

**Cons**

* More storage
* More embeddings
* More compute

**Use Case**

RAG for:

* customer support
* medical/legal documents
* long articles
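
The overlapping window can be sketched as follows, assuming a pre-tokenized list. The function name `sliding_window_chunks` is hypothetical. Each window starts `chunk_size - overlap` tokens after the previous one.

```python
def sliding_window_chunks(tokens, chunk_size, overlap):
    """Return overlapping chunks; consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Token ids 1..500 make the window boundaries easy to read.
tokens = list(range(1, 501))
for i, chunk in enumerate(sliding_window_chunks(tokens, 200, 50), start=1):
    print(f"Chunk {i}: tokens {chunk[0]}-{chunk[-1]}")
```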

---

#### Sentence-Based Chunking

Split text at sentence boundaries using NLP tools (spaCy, NLTK).

**Pros**

* Keeps semantic meaning intact
* No abrupt mid-sentence cuts
* Higher retrieval accuracy

**Cons**

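* Single sentences are often too short to carry enough context
* Short chunks are a poor fit for embedding models that work best with 200–300 tokens, so sentences usually need to be grouped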

**Use Case**

Summarization, QA, reasoning tasks
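
A rough sketch of sentence-based chunking that also groups short sentences up to a word budget. The regex is a naive stand-in for spaCy or NLTK (it will mis-split abbreviations such as "e.g."), and both function names are illustrative.

```python
import re


def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def sentence_chunks(text, max_words):
    """Group whole sentences into chunks of at most max_words words where possible."""
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)  # a sentence is never cut, even if it exceeds max_words
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = ("Machine learning is a field of AI. It uses statistical methods. "
          "Neural networks are widely used.")
for chunk in sentence_chunks(sample, max_words=12):
    print(chunk)
```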

---

#### Paragraph-Based Chunking

Split by paragraphs (`\n\n` or heading markers).

**Pros**

* Natural semantic boundaries
* Easy and fast

**Cons**

* Paragraphs vary in length
* Some paragraphs may be very long (bad for embeddings)

**Use Case**

Documents, reports, books, blogs
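
A small sketch of paragraph splitting, assuming paragraphs are separated by one or more blank lines. `paragraph_chunks` is an illustrative helper, not a library API.

```python
import re


def paragraph_chunks(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


doc = "First paragraph.\nStill first.\n\nSecond paragraph.\n\n\nThird."
for p in paragraph_chunks(doc):
    # Printing the word count makes the length variation visible.
    print(len(p.split()), "words:", repr(p))
```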

---

#### Semantic Chunking (LLM-aware chunking)

Uses an LLM or embeddings to decide where to split.

**How:**

* Detect topic shift
* Identify semantic similarity boundaries
* Create chunks based on meaning, not size

**Pros**

* Best chunk quality
* Very accurate retrieval
* Avoids topic mixing

**Cons**

* Slower
* Needs embeddings or LLM calls
* More costly

**Use Case**

High-quality enterprise RAG (GraphRAG, agentic RAG)
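
The idea of splitting where similarity drops can be sketched without any model. In this toy version a bag-of-words count vector stands in for a real embedding (an assumption made only to keep the example self-contained), and the threshold value is arbitrary.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    # Toy "embedding": lowercase word counts. A real system would call an embedding model.
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold):
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bow_vector(prev), bow_vector(cur)) < threshold:
            chunks.append([cur])  # similarity dropped: treat it as a topic shift
        else:
            chunks[-1].append(cur)
    return [" ".join(c) for c in chunks]


sentences = [
    "Cats are small pets.",
    "Cats are playful pets.",
    "Stock markets fell sharply today.",
]
for chunk in semantic_chunks(sentences, threshold=0.3):
    print(chunk)
```

With real embeddings the structure stays the same; only `bow_vector` would be replaced by a model call, which is where the extra cost comes from.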

---

#### Recursive Character/Text Splitter (LangChain Style)

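Strategy: try the coarsest separator first, and split further only the pieces that are still too long:

* Try splitting by headings
* If too long, split by paragraphs
* If still too long, split by sentences
* If still too long, split by chunk size

LangChain's `RecursiveCharacterTextSplitter` defaults to the separators `["\n\n", "\n", " ", ""]` (paragraphs, lines, words, characters); headings and sentence boundaries require a custom `separators` list.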

**Pros**

* Balances structure + size
* Stable and widely used

**Cons**

* Still generic, not semantic

**Use Case**

Standard RAG implementations
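
A simplified sketch of the recursive idea, not LangChain's actual implementation. The separator list and the name `recursive_split` are assumptions for illustration; small pieces are merged greedily, and oversized pieces are passed to the next, finer separator.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text so that every chunk is at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut by size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


sample = ("Intro paragraph.\n\n"
          "A much longer second paragraph that will not fit in one chunk.")
for chunk in recursive_split(sample, chunk_size=40):
    print(len(chunk), repr(chunk))
```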

---

#### Hybrid Chunking

Combines multiple methods:

* semantic + sliding window
* paragraph + token splitting
* sentence + overlap

**Pros**

* Best trade-off
* High retrieval accuracy
* Maintains context

**Use Case**

Production RAG systems
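
One possible hybrid, sketched under the assumption that words approximate tokens: split by paragraph first, and fall back to overlapping windows only for paragraphs that exceed the budget. `hybrid_chunks` is a hypothetical helper.

```python
import re


def hybrid_chunks(text, max_words, overlap):
    """Paragraph split first; oversized paragraphs become overlapping windows."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    step = max_words - overlap
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        words = paragraph.split()
        if not words:
            continue
        if len(words) <= max_words:
            chunks.append(" ".join(words))  # paragraph fits: keep it whole
            continue
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + max_words]))
            if start + max_words >= len(words):
                break
    return chunks


long_paragraph = " ".join(f"w{i}" for i in range(10))
for chunk in hybrid_chunks("short one\n\n" + long_paragraph, max_words=4, overlap=1):
    print(chunk)
```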

---

#### Structure-Aware Chunking

Splits based on structure in files such as:

* Markdown (`#`, `##`, lists)
* HTML (`<h1>`, `<p>`)
* PDFs using layout
* Tables separated into cell-level chunks

**Pros**

* Very clean, meaningful chunks
* Preserves document flow

**Cons**

* Harder to implement

**Use Case**

Technical docs (API docs, legal docs, research papers)
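
For Markdown, a structure-aware splitter can be sketched as one chunk per heading. This is a minimal illustration with a hypothetical `markdown_sections` helper; it does not handle `#` characters inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_sections(markdown):
    """Return (heading, body) pairs, one per Markdown heading."""
    sections, heading, body = [], None, []

    def flush():
        # Emit the section collected so far, skipping an empty preamble.
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


doc = "# Title\nIntro text.\n\n## Setup\nInstall it.\n## Usage\nRun it."
for title, text in markdown_sections(doc):
    print(f"{title!r}: {text!r}")
```

Keeping the heading alongside each body gives the retriever extra context about where the chunk came from.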

---

**Choosing the Right Chunking Strategy**

| Goal                          | Best Strategy                 |
| ----------------------------- | ----------------------------- |
| QA over long text             | Sliding window + paragraph    |
| Precise factual retrieval     | Semantic chunking             |
| Fast + simple                 | Fixed size                    |
| Very clean document structure | Structure-aware               |
| Cost-optimized                | Sentence-based + mild overlap |
| Enterprise RAG                | Hybrid (semantic + window)    |

---

**Ideal Chunk Size (Rule of Thumb)**

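* **200–400 tokens** is a reasonable starting point for many embedding models
* **Overlap of 10–20%** of the chunk size
* Avoid chunks longer than the embedding model's input limit (often 512 tokens)

Reason:

* Smaller chunks → higher precision
* Bigger chunks → more recall, but slower and more expensive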
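
To see what overlap costs, the number of chunks for a document can be estimated with a little arithmetic. This sketch assumes a plain sliding window over the token sequence; `chunk_count` is an illustrative name.

```python
import math


def chunk_count(total_tokens, chunk_size, overlap):
    """Number of sliding-window chunks needed to cover total_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    # One full first window, then one extra window per (chunk_size - overlap) step.
    return 1 + math.ceil((total_tokens - chunk_size) / (chunk_size - overlap))


# A 10,000-token document with 300-token chunks, without and with 15% overlap.
print(chunk_count(10_000, 300, 0))
print(chunk_count(10_000, 300, 45))
```

In this example a 15% overlap raises the count from 34 to 40 chunks, so embedding and storage costs grow by roughly the same proportion.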

---

**Example Chunking (Sliding Window)**

Text:

```
Machine learning is a field of AI...
It uses statistical methods...
Neural networks are widely used...
```

Chunk size = 20 tokens, overlap = 5 tokens

```
Chunk 1: tokens 1–20
Chunk 2: tokens 16–35
Chunk 3: tokens 31–50
```

This ensures continuity.

---

**One-Sentence Summary**

**Chunking strategies determine how documents are split into meaningful pieces for embedding and retrieval, and better chunking yields better RAG accuracy and fewer hallucinations.**

**Chunking Strategy by Document Type**

| Document Type        | Best Chunking Strategy        | Why                          |
| -------------------- | ----------------------------- | ---------------------------- |
| PDFs                 | Recursive or Hybrid           | maintain structure + meaning |
| Legal contracts      | Semantic                      | meaning-heavy, multi-hop     |
| Medical records      | Semantic                      | context-sensitive            |
| Blogs/articles       | Recursive                     | paragraph structure          |
| Code                 | Fixed or Token-based          | predictable formatting       |
| Transcripts          | Fixed or Semantic             | repetitive, long sequences   |
| Enterprise knowledge | Hybrid (recursive + semantic) | mixed formats                |
| Scientific Papers    | Semantic or Hybrid            | multi-hop reasoning          |


Character-based splitting with `CharacterTextSplitter`:

```python
from langchain_classic.text_splitter import CharacterTextSplitter

text = """
LangChain provides multiple chunking techniques for splitting large documents.
Chunking is essential for RAG because embedding models have token limits.
"""

# CharacterTextSplitter splits on a separator (default "\n\n") and then merges
# pieces up to chunk_size. This text contains no blank line, so it comes back
# as a single chunk even though it is longer than chunk_size=50.
splitter = CharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
)

chunks = splitter.split_text(text)

print("Fixed-size Chunks:")
for i, c in enumerate(chunks):
    print(f"\nChunk {i+1}:\n{c}")
```

Output:

```
Fixed-size Chunks:

Chunk 1:
LangChain provides multiple chunking techniques for splitting large documents.
Chunking is essential for RAG because embedding models have token limits.
```


Recursive splitting with `RecursiveCharacterTextSplitter`:

```python
from langchain_classic.text_splitter import RecursiveCharacterTextSplitter

# Separators are tried in order: paragraphs, lines, words, then characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=80,
    chunk_overlap=20,
    separators=["\n\n", "\n", " ", ""]
)

# Reuses `text` from the previous example.
chunks = splitter.split_text(text)

print("\nRecursive Chunks:")
for i, c in enumerate(chunks):
    print(f"\nChunk {i+1}:\n{c}")
```

Output:

```
Recursive Chunks:

Chunk 1:
LangChain provides multiple chunking techniques for splitting large documents.

Chunk 2:
Chunking is essential for RAG because embedding models have token limits.
```


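Token-based splitting with `TokenTextSplitter`:

```python
from langchain_classic.text_splitter import TokenTextSplitter

# chunk_size and chunk_overlap are counted in tokenizer tokens, not characters,
# so an overlap can begin mid-word (see Chunk 2 in the output).
token_splitter = TokenTextSplitter(
    chunk_size=30,
    chunk_overlap=5
)

# Reuses `text` from the first example.
chunks = token_splitter.split_text(text)

print("\nToken-based Chunks:")
for i, c in enumerate(chunks):
    print(f"\nChunk {i+1}:\n{c}")
```

Output:

```
Token-based Chunks:

Chunk 1:

LangChain provides multiple chunking techniques for splitting large documents.
Chunking is essential for RAG because embedding models have token limits

Chunk 2:
ding models have token limits.

```
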


Semantic splitting with `SemanticChunker`:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

# Loads a sentence-transformers model; it is downloaded on first use.
emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

semantic_splitter = SemanticChunker(embeddings=emb)

# Reuses `text` from the first example.
chunks = semantic_splitter.split_text(text)

print("\nSemantic Chunks:")
for i, c in enumerate(chunks):
    print(f"\nChunk {i+1}:\n{c}")
```

Output:

```
Semantic Chunks:

Chunk 1:

LangChain provides multiple chunking techniques for splitting large documents. Chunking is essential for RAG because embedding models have token limits.

Chunk 2:

```