```{contents}
```
## Text Splitter



A **Text Splitter** in LangChain is a component that **breaks large documents into smaller, overlapping chunks** that are suitable for:

* Embedding
* Retrieval
* LLM context windows

> Text splitters accept raw strings (`split_text`) or **Document objects** (`split_documents`); the latter returns **smaller Document objects**.

They do **not** call LLMs.

---

### Why Text Splitting Is Necessary

LLMs have:

* Context window limits
* Degraded reasoning on very long inputs

Without splitting:

* Tokens overflow
* Important context is lost
* Retrieval becomes inaccurate

Text splitting ensures:

* Each chunk fits the model context
* Semantic meaning is preserved
* Retrieval quality improves

---

### Where Text Splitter Fits in RAG

```
Document Loader
   ↓
Documents
   ↓
Text Splitter
   ↓
Chunks
   ↓
Embeddings
   ↓
Vector Store
   ↓
Retriever
```

Text splitting is an **ingestion-time operation**.
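
The ingestion/query split in the diagram above can be sketched with a toy pipeline. This is a minimal illustration, not LangChain code: `embed`, `cosine`, `vector_store`, and `retrieve` are hypothetical helpers, and the "embedding" is only a bag-of-words count standing in for a real embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# --- Ingestion time: load -> split -> embed -> store ---
document = (
    "Transformers process entire sequences in parallel.\n\n"
    "Recurrent networks process inputs one step at a time.\n\n"
    "GPUs reduce training time for large models."
)
chunks = document.split("\n\n")  # toy splitter: one chunk per paragraph
vector_store = [(chunk, embed(chunk)) for chunk in chunks]  # toy in-memory store


# --- Query time: embed the query and search; no splitting happens here ---
def retrieve(query: str, k: int = 1) -> list[str]:
    query_vector = embed(query)
    ranked = sorted(
        vector_store,
        key=lambda item: cosine(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]


print(retrieve("which hardware reduces training time"))
```

The splitter runs once, before anything is embedded; the retriever only ever sees the stored chunks.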

---

### Core Concepts

#### Chunk Size

Maximum number of characters or tokens per chunk.

Example:

* `chunk_size = 500`

---

#### Chunk Overlap

Number of characters or tokens shared between adjacent chunks.

Example:

* `chunk_overlap = 50`

Purpose:

* Prevents context loss at boundaries
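
How `chunk_size` and `chunk_overlap` interact is easiest to see with fixed-size windows. The sketch below is a simplified illustration with a hypothetical helper, `chunk_with_overlap`; it counts characters and ignores separators, which real splitters do not.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one chunk reappear at the start of the next.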

---

### Basic Text Splitter Demonstration

#### RecursiveCharacterTextSplitter (Most Used)



```python
# Splitters are published in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Sample long text for demonstration
long_text = """
Large language models (LLMs) are very large deep learning models that are pre-trained on vast amounts of data. 
The underlying transformer is a set of neural networks that consist of an encoder and a decoder with self-attention capabilities. 
The encoder and decoder extract meanings from a sequence of text and understand the relationships between words and phrases in it.

Transformer LLMs are capable of unsupervised training, although a more precise explanation is that transformers perform self-learning. 
It is through this process that transformers learn to understand basic grammar, languages, and knowledge.

Unlike earlier recurrent neural networks (RNN) that sequentially process inputs, transformers process entire sequences in parallel. 
This allows the data scientists to use GPUs for training transformer-based LLMs, significantly reducing the training time.
"""

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=50,  # characters shared between adjacent chunks
)

# split_text takes a plain string and returns a list of strings
chunks = splitter.split_text(long_text)
print(f"Created {len(chunks)} chunks")
```

---

### Splitting Documents (Recommended)



```python
from langchain_core.documents import Document

# Create sample documents
documents = [
    Document(
        page_content=long_text,
        metadata={"source": "llm_overview.txt", "page": 1}
    )
]

chunks = splitter.split_documents(documents)

# Display the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk.page_content)
    print(f"Metadata: {chunk.metadata}")
    print("-" * 50)
```

Output:

```
Chunk 1:
Large language models (LLMs) are very large deep learning models that are pre-trained on vast amounts of data. 
The underlying transformer is a set of neural networks that consist of an encoder and a decoder with self-attention capabilities. 
The encoder and decoder extract meanings from a sequence of text and understand the relationships between words and phrases in it.
Metadata: {'source': 'llm_overview.txt', 'page': 1}
--------------------------------------------------
Chunk 2:
Transformer LLMs are capable of unsupervised training, although a more precise explanation is that transformers perform self-learning. 
It is through this process that transformers learn to understand basic grammar, languages, and knowledge.
Metadata: {'source': 'llm_overview.txt', 'page': 1}
--------------------------------------------------
Chunk 3:
Unlike earlier recurrent neural networks (RNN) that sequentially process inputs, transformers process entire sequences in parallel. 
This allows the da
```

Each output chunk is a `Document` with inherited metadata.

---

### How Recursive Splitting Works

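By default, the splitter tries these separators **in order**, moving to the next one only when a piece is still larger than `chunk_size`:

1. Paragraph (`\n\n`)
2. Line (`\n`)
3. Word (` `)
4. Character fallback (`""`)

The default list has no sentence separator. To split on sentence boundaries, pass a custom `separators` list that includes one (for example `". "`).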

This preserves **semantic boundaries** as much as possible.
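
The coarse-to-fine idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical function, it ignores overlap, and its separator list is an assumption passed in as a parameter.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split text into chunks of at most chunk_size characters, coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits, keep merging
            continue
        if current:
            chunks.append(current)  # flush what has been merged so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a little bit longer than the first one.\n\nEnd."
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

The short paragraphs survive intact, and only the paragraph that exceeds the limit is broken further, at word boundaries.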

---

### Common Text Splitter Types

#### CharacterTextSplitter



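```python
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=30
)
```

Simple: it splits on a single separator (`\n\n` by default) and merges the pieces up to `chunk_size`. Because it never falls back to a finer separator, a paragraph longer than `chunk_size` is kept as one oversized chunk, and a custom separator such as a space can cut sentences mid-way.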

---

#### RecursiveCharacterTextSplitter (Recommended)

Best balance of:

* Simplicity
* Semantic preservation

---

#### TokenTextSplitter

```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=256,
    chunk_overlap=32
)
```

Uses tokens instead of characters.
Important for strict token limits.
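
Counting tokens rather than characters changes what "size" means. In the sketch below, whitespace-separated words stand in for real model tokens, which is a loud assumption: an actual token splitter relies on a tokenizer such as `tiktoken`, and its counts will differ. `split_by_tokens` is a hypothetical helper.

```python
def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Window over tokens instead of characters.

    Assumption: whitespace-separated words approximate model tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the last token
    return chunks


text = "one two three four five six seven eight nine ten"
token_chunks = split_by_tokens(text, chunk_size=4, chunk_overlap=1)
for chunk in token_chunks:
    # Token count is bounded by chunk_size; character length is not.
    print(f"{len(chunk.split())} tokens, {len(chunk)} chars: {chunk}")
```

Every chunk holds at most four tokens, while the character lengths vary, which is why a character-based limit cannot guarantee a token budget.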

---

#### Language-Specific Splitters

```python
from langchain_text_splitters import PythonCodeTextSplitter
```

Used for:

* Source code
* Structured formats
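
The idea behind a code-aware splitter is to cut at syntactic boundaries rather than at arbitrary character positions. The following is a rough sketch of that idea only, using a hypothetical `split_python_source` helper; it does not reproduce any LangChain splitter and deliberately ignores decorators, `async def`, and chunk-size limits.

```python
import re


def split_python_source(source: str) -> list[str]:
    """Split Python source at top-level `def` and `class` statements."""
    # The zero-width lookahead keeps the keyword with the block it starts.
    # Indented (nested) definitions do not match because of the `^` anchor.
    parts = re.split(r"(?m)^(?=def |class )", source)
    return [part.strip() for part in parts if part.strip()]


source = (
    "import os\n"
    "\n"
    "def load(path):\n"
    "    return open(path).read()\n"
    "\n"
    "class Parser:\n"
    "    def parse(self, text):\n"
    "        return text.split()\n"
)

blocks = split_python_source(source)
for block in blocks:
    print(block)
    print("-" * 20)
```

The method `parse` stays inside its class because only unindented definitions start a new chunk. In LangChain itself, `RecursiveCharacterTextSplitter.from_language` with a `Language` value builds a splitter from a language-specific separator list.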

---

### Text Splitter vs Document Loader

| Aspect    | Document Loader | Text Splitter     |
| --------- | --------------- | ----------------- |
| Purpose   | Read data       | Chunk data        |
| Input     | Raw source      | Documents         |
| Output    | Documents       | Smaller Documents |
| LLM usage | ❌               | ❌                 |

---

### Text Splitter vs Retriever

| Aspect       | Text Splitter  | Retriever  |
| ------------ | -------------- | ---------- |
| When         | Ingestion time | Query time |
| Function     | Chunking       | Searching  |
| LLM involved | ❌              | ❌          |

---

### Metadata Handling (Critical)

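Each chunk inherits the metadata of its parent document. Fields such as `chunk_id` are not added by the splitter, so attach them yourself after splitting: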

```python
Document(
    page_content="chunk text",
    metadata={
        "source": "file.pdf",
        "page": 2,
        "chunk_id": 3
    }
)
```

This enables:

* Source attribution
* Filtering
* Debugging
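
A minimal sketch of carrying metadata through a split and adding per-chunk fields. `Doc` is a hypothetical stand-in for a document object, and `split_with_metadata`, `chunk_id`, and `start_index` are illustrative names chosen here.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_metadata(doc: Doc, chunk_size: int) -> list[Doc]:
    """Cut doc into fixed-size chunks, copying the parent metadata onto each one."""
    chunks = []
    starts = range(0, len(doc.page_content), chunk_size)
    for chunk_id, start in enumerate(starts):
        metadata = dict(doc.metadata)    # copy, so chunks do not share one dict
        metadata["chunk_id"] = chunk_id  # position of the chunk within its parent
        metadata["start_index"] = start  # character offset in the original text
        chunks.append(Doc(doc.page_content[start:start + chunk_size], metadata))
    return chunks


parent = Doc("x" * 25, {"source": "file.pdf", "page": 2})
pieces = split_with_metadata(parent, chunk_size=10)
for piece in pieces:
    print(piece.metadata)
```

Copying the dictionary per chunk matters: if all chunks shared one dict, writing `chunk_id` for the last chunk would overwrite it for every other chunk.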

---

### Choosing Chunk Size (Guidelines)

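These ranges are starting points, not rules. They are counted in whatever unit the splitter measures (characters by default for `RecursiveCharacterTextSplitter`, tokens for `TokenTextSplitter`), so tune them against your own retrieval results.

| Use Case             | Chunk Size |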
| -------------------- | ---------- |
| General RAG          | 300–800    |
| Dense technical text | 200–400    |
| Narrative text       | 800–1200   |
| Code                 | 100–300    |

---

### Choosing Chunk Overlap

Typical values:

* 10–20% of chunk size

Too small:

* Context loss

Too large:

* Redundant embeddings
* Higher cost
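
The cost of overlap can be estimated with simple arithmetic. Assuming fixed-size windows (an approximation, since separator-aware splitters produce uneven chunks), each new chunk advances by `chunk_size - chunk_overlap`, so the chunk count grows as overlap grows. `estimate_chunks` is a hypothetical helper.

```python
import math


def estimate_chunks(text_length: int, chunk_size: int, overlap_ratio: float) -> tuple[int, int]:
    """Return (chunk_overlap, estimated chunk count) for fixed-size windows."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    chunk_overlap = round(chunk_size * overlap_ratio)
    if text_length <= chunk_size:
        return chunk_overlap, 1
    step = chunk_size - chunk_overlap  # new characters contributed by each chunk
    count = 1 + math.ceil((text_length - chunk_size) / step)
    return chunk_overlap, count


for ratio in (0.0, 0.1, 0.2, 0.5):
    overlap, count = estimate_chunks(10_000, 500, ratio)
    print(f"overlap={overlap:>3} -> about {count} chunks to embed")
```

For a 10,000-character text with 500-character chunks, moving from no overlap to 50% overlap roughly doubles the number of chunks to embed and store, while 10 to 20% adds only a few.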

---

### Common Mistakes

#### Very large chunks

❌ Poor retrieval precision

#### No overlap

❌ Boundary context loss

#### Splitting at query time

❌ Should be ingestion-only

#### Ignoring token limits

❌ Runtime failures

---

### Best Practices

* Use RecursiveCharacterTextSplitter by default
* Tune chunk size per domain
* Always preserve metadata
* Split before embedding
* Validate token counts

---

### Text Splitter in Production RAG

Typical setup:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
```

Stable, predictable, and widely used.

---

### Interview-Ready Summary

> “A Text Splitter in LangChain breaks documents into smaller overlapping chunks to fit LLM context windows and improve retrieval quality. It operates at ingestion time and is a core component of RAG pipelines.”

---

### Rule of Thumb

* **Load → Split → Embed**
* **Smaller chunks → more precise retrieval**
* **Overlap → better continuity**
* **Ingestion time only**

