# **How to Choose the Right Text Splitter for Your RAG Design**

## **Introduction**

Splitting text into smaller, manageable chunks is a critical step in retrieval-augmented generation (RAG) and in related NLP tasks such as embedding generation, semantic search, and summarization. The splitting method determines what each chunk contains, so it directly affects the quality of downstream applications. This article covers three text-splitting classes from the `langchain-text-splitters` library: **`RecursiveCharacterTextSplitter`**, **`MarkdownHeaderTextSplitter`**, and **`SentenceTransformersTokenTextSplitter`**. Each targets a different use case. Practical examples show how to use each class, and the comparison table below summarizes the differences to help you choose the right tool for your needs.

### **Comparison Table**

| Feature/Class                     | RecursiveCharacterTextSplitter       | MarkdownHeaderTextSplitter           | SentenceTransformersTokenTextSplitter |
|-----------------------------------|--------------------------------------|--------------------------------------|----------------------------------------|
| **Primary Use Case**              | General-purpose text splitting       | Splitting Markdown by headers        | Token-based splitting for embeddings   |
| **Splitting Mechanism**           | Recursive splitting by separators    | Header-based splitting               | Tokenization using Sentence Transformers |
| **Customizable Separators**       | Yes                                  | Headers only (`headers_to_split_on`) | No (uses model tokenizer)              |
| **Preserves Document Structure**  | No                                   | Yes (header hierarchy)               | No                                     |
| **Token-Based Splitting**         | No                                   | No                                   | Yes                                    |
| **Chunk Size Control**            | Yes (character-based)                | No (chunk size follows header sections) | Yes (token-based)                   |
| **Overlap Between Chunks**        | Yes                                  | No                                   | Yes                                    |
| **Best For**                      | General text processing              | Markdown documents                   | Embedding generation, NLP tasks        |

The examples below need `langchain-text-splitters`, which provides the three splitter classes and pulls in `langchain-core` for the `Document` class, and `sentence-transformers`, which `SentenceTransformersTokenTextSplitter` uses for tokenization. The following cell installs both with `pip`.

The `-qU` flags keep the installation quiet (minimal output) and upgrade each package to the latest version if it is already installed.

In [None]:
!pip install -qU langchain-text-splitters
!pip install -qU sentence-transformers

---

## **1. RecursiveCharacterTextSplitter**

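The `RecursiveCharacterTextSplitter` tries a list of separators in order (by default `"\n\n"`, `"\n"`, `" "`, and finally the empty string) and moves on to the next separator only when a piece is still larger than `chunk_size`. This keeps paragraphs, then lines, then words together for as long as possible, which makes it a good default for breaking large texts into smaller chunks that retain context.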
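To make the recursive idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and it expects non-empty separators. It only illustrates how coarse separators are tried first and finer ones are used for pieces that are still too long.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch of recursive separator-based splitting (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts of chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph here.\n\nSecond paragraph is a bit longer than the first one."
chunks = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

The short first paragraph survives intact, while the longer second paragraph falls through to the space separator and is split between words.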

### **Example 1: Basic Usage**

This example demonstrates the basic usage of the `RecursiveCharacterTextSplitter`. It initializes the splitter with a specified chunk size and overlap, then splits a sample text into smaller chunks. Finally, it prints each resulting chunk to the console.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

# Sample text
text = "This is a sample text. It will be split into smaller chunks. The goal is to ensure that each chunk is manageable and retains context."

# Split the text
chunks = splitter.split_text(text)

# Output the chunks with index
for index, chunk in enumerate(chunks):
    print(f"Chunk {index + 1}: {chunk}")

### **Example 2: Custom Separators**

In this example, the splitter is initialized with custom separators: newlines, spaces, and periods. The separators are tried in the order listed, and the splitter falls back to the next one only when a piece is still longer than `chunk_size`, so the order of the list controls where splits happen first.

In [None]:
# Initialize the splitter with custom separators
splitter = RecursiveCharacterTextSplitter(
    separators=["\n", " ", "."],  # Tried in order: newline, then space, then period
    chunk_size=50,
    chunk_overlap=10
)

# Sample text
text = "This is a sample text.\nIt will be split into smaller chunks.\nThe goal is to ensure that each chunk is manageable."

# Split the text
chunks = splitter.split_text(text)

# Output the chunks with index
for index, chunk in enumerate(chunks):
    print(f"Chunk {index + 1}: {chunk}")

### **Example 3: Splitting Documents**

This example showcases how to split multiple documents using the `RecursiveCharacterTextSplitter`. It imports the `Document` class, initializes the splitter, and processes a list of sample documents. Each split document's content is then printed.

In [None]:
from langchain_core.documents import Document

# Initialize the splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

# Sample documents
documents = [
    Document(page_content="This is the first document. It contains some text."),
    Document(page_content="This is the second document. It contains more text.")
]

# Split the documents
split_docs = splitter.split_documents(documents)

# Output the split documents with index
for index, doc in enumerate(split_docs):
    print(f"Doc {index + 1}: {doc.page_content}")

### **Example 4: Using with Metadata**

In this example, the splitter is used to create documents that include metadata. It initializes the splitter, prepares sample texts along with their corresponding metadata, and generates documents that pair each text chunk with its metadata. The resulting documents are then printed, displaying both content and metadata.

In [None]:
# Initialize the splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

# Sample text with metadata
texts = ["This is a sample text.", "It will be split into smaller chunks."]
metadatas = [{"source": "doc1"}, {"source": "doc2"}]

# Create documents with metadata
docs = splitter.create_documents(texts, metadatas=metadatas)

# Output the documents with index
for index, doc in enumerate(docs):
    print(f"Doc {index + 1}:\n  page_content: {doc.page_content}\n  metadata: {doc.metadata}")
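The pairing of chunks and metadata can be sketched without the library. In the sketch below, `Doc`, `create_docs`, and `split_sentences` are hypothetical stand-ins used only for illustration: every chunk produced from a text receives its own copy of that text's metadata, so a chunk can always be traced back to its source.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_docs(texts, metadatas, split_fn):
    """Split each text and attach a copy of its metadata to every chunk."""
    docs = []
    for text, metadata in zip(texts, metadatas):
        for chunk in split_fn(text):
            docs.append(Doc(page_content=chunk, metadata=copy.deepcopy(metadata)))
    return docs


def split_sentences(text):
    """Toy splitter for illustration: one chunk per sentence."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


docs = create_docs(
    ["One. Two.", "Three."],
    [{"source": "doc1"}, {"source": "doc2"}],
    split_sentences,
)
for doc in docs:
    print(doc.page_content, doc.metadata)
```

Copying the metadata per chunk means that editing one chunk's metadata later does not silently change its siblings.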

---

## **2. MarkdownHeaderTextSplitter**

The `MarkdownHeaderTextSplitter` splits a Markdown document at the headers you specify and records the headers that apply to each chunk in its metadata, so the hierarchical structure of the document is preserved.
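The mechanism can be sketched in plain Python. This is a simplified illustration, not the library's code: `split_by_headers` is a hypothetical helper that returns `(content, metadata)` tuples and ignores edge cases such as `#` characters inside fenced code blocks.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group lines under their headers; return (content, metadata) pairs."""
    active = {}  # header level (number of '#') -> (name, title)
    chunks, lines = [], []

    def flush():
        content = "\n".join(lines).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append((content, metadata))
        lines.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, name in headers_to_split_on:
            # Require a space after the marker so "## x" does not match "#".
            if stripped.startswith(marker + " "):
                flush()  # close the chunk that belonged to the previous header
                level = len(marker)
                # A new header replaces headers at the same or deeper level.
                for deeper in [lvl for lvl in active if lvl >= level]:
                    del active[deeper]
                active[level] = (name, stripped[len(marker):].strip())
                break
        else:
            lines.append(line)  # ordinary content line
    flush()
    return chunks


sample = "# Guide\nIntro text.\n\n## Setup\nInstall steps.\n\n# Appendix\nExtra notes."
sections = split_by_headers(sample, [("#", "Header 1"), ("##", "Header 2")])
for content, metadata in sections:
    print(metadata, "->", content)
```

Each chunk carries the full path of headers above it, and a new top-level header clears the nested headers that belonged to the previous section.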

### **Example 1: Basic Usage**

This example illustrates the basic usage of the `MarkdownHeaderTextSplitter`. It defines a hierarchy of headers to split on, initializes the splitter with these headers, and processes a sample Markdown text. Each resulting chunk is a `Document` whose metadata holds the headers it sits under; the chunks are then printed.

In [None]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

# Initialize the splitter
splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

# Sample Markdown text
markdown_text = """
# Level-1
This is some text under Header 1.

## Level-2
This is some text under Header 2.

### Level-3
This is some text under Header 3.
"""

# Split the text
chunks = splitter.split_text(markdown_text)

# Output the chunks with index
for index, chunk in enumerate(chunks):
    print(f"Chunk {index + 1}: {chunk}")

### **Example 2: Keeping Headers in Content**

In this example, the splitter is configured to retain the headers within the content of each chunk. By setting `strip_headers=False`, the original headers are preserved in the resulting text chunks.

In [None]:
# Initialize the splitter with headers kept in content
splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)

# Split the text
chunks = splitter.split_text(markdown_text)

# Output the chunks
for chunk in chunks:
    print(f"{chunk.page_content}\n")

### **Example 3: Returning Each Line as a Separate Document**

This example demonstrates how to configure the splitter to treat each line of the Markdown text as a separate document. By setting `return_each_line=True`, the splitter processes and returns each line individually.

In [None]:
# Initialize the splitter to return each line as a separate document
splitter = MarkdownHeaderTextSplitter(headers_to_split_on, return_each_line=True)

# Split the text
chunks = splitter.split_text(markdown_text)

# Output the chunks with index
for index, chunk in enumerate(chunks):
    print(f"Chunk {index + 1}: {chunk}")

### **Example 4: Combining with RecursiveCharacterTextSplitter**

Here, the example shows how to combine `MarkdownHeaderTextSplitter` with `RecursiveCharacterTextSplitter` for more granular splitting. First, the Markdown text is split based on headers, and then each resulting chunk is further split into smaller character-based chunks. The final split chunks are then printed.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the Markdown splitter
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

# Split the Markdown text
md_chunks = markdown_splitter.split_text(markdown_text)

# Initialize the character splitter
char_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)

# Further split the Markdown chunks
final_chunks = char_splitter.split_documents(md_chunks)

# Output the chunks with index
for index, chunk in enumerate(final_chunks):
    print(f"Chunk {index + 1}: {chunk.page_content}")

---

## **3. SentenceTransformersTokenTextSplitter**

The `SentenceTransformersTokenTextSplitter` splits text into chunks of at most `tokens_per_chunk` tokens, counted with the tokenizer of a Sentence Transformers model, so that chunk sizes line up with what the embedding model actually sees.
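Token-based splitting with overlap amounts to a sliding window over the token sequence. The sketch below assumes that window model and uses whitespace-separated words as stand-in tokens; a real model tokenizer produces subword IDs instead, and `split_on_tokens` is a hypothetical helper, not a library function.

```python
def split_on_tokens(tokens, tokens_per_chunk, chunk_overlap):
    """Sliding window over a token list; consecutive chunks share chunk_overlap tokens."""
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + tokens_per_chunk, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window reached the end of the text
        start += step
    return chunks


# Whitespace "tokens" stand in for real tokenizer output (an assumption for illustration).
tokens = "the quick brown fox jumps over the lazy dog again".split()
windows = split_on_tokens(tokens, tokens_per_chunk=4, chunk_overlap=1)
for window in windows:
    print(" ".join(window))
```

Each window starts `tokens_per_chunk - chunk_overlap` tokens after the previous one, so the last token of one chunk reappears as the first token of the next.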

### **Example 1: Basic Usage**

This example demonstrates the basic usage of the `SentenceTransformersTokenTextSplitter`. It initializes the splitter with specified token parameters, splits a sample text based on tokenization, and prints each resulting chunk.

In [None]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Initialize the splitter
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=20, tokens_per_chunk=100)

# Longer sample text
text = """
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. 
It focuses on how to program computers to process and analyze large amounts of natural language data. 
The goal is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. 
NLP techniques are used in a wide range of applications, including machine translation, sentiment analysis, speech recognition, and text summarization. 
One of the key challenges in NLP is dealing with the ambiguity and complexity of human language. 
For example, the same word can have multiple meanings depending on the context, and sentences can be structured in many different ways. 
To address these challenges, NLP researchers use a variety of techniques, including statistical models, machine learning algorithms, and deep learning architectures. 
In recent years, advances in deep learning have led to significant improvements in NLP tasks, such as language modeling, text generation, and question answering. 
These advancements have been driven by the development of large-scale pre-trained language models, such as BERT, GPT, and T5, which are trained on massive amounts of text data and can be fine-tuned for specific tasks.
"""

# Split the text
chunks = splitter.split_text(text)

# Output the chunks with index
for index, chunk in enumerate(chunks):
    print(f"Chunk {index + 1}:\n{chunk}\n")

### **Example 2: Counting Tokens**

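In this example, the splitter's `count_tokens` method counts the tokens in a given text. This is useful for checking how a text is tokenized and whether it fits within the model's input limit. The count may include the tokenizer's special start and stop tokens, so it can be slightly higher than the number of content tokens.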

In [None]:
# Count tokens in the text
token_count = splitter.count_tokens(text=text)
print(f"Token count: {token_count}")
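A token count also lets you estimate how many chunks a text will produce. The sketch below assumes a sliding-window model in which each chunk starts `tokens_per_chunk - chunk_overlap` tokens after the previous one; `expected_chunks` is a hypothetical helper, and the token count of 250 is made up for illustration.

```python
import math


def expected_chunks(n_tokens, tokens_per_chunk, chunk_overlap):
    """Number of sliding windows needed to cover n_tokens tokens."""
    if n_tokens <= 0:
        return 0
    if n_tokens <= tokens_per_chunk:
        return 1
    step = tokens_per_chunk - chunk_overlap
    return math.ceil((n_tokens - tokens_per_chunk) / step) + 1


# Hypothetical token count, chosen only for illustration.
n_tokens = 250
print(expected_chunks(n_tokens, tokens_per_chunk=100, chunk_overlap=20))
```

With 250 tokens, a window of 100, and an overlap of 20, the windows cover tokens 0-100, 80-180, and 160-250, which gives three chunks.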

### **Example 3: Splitting Documents**

This example showcases how to split multiple documents using the `SentenceTransformersTokenTextSplitter`. It processes a list of sample documents, splits each based on tokenization, and prints the content of each resulting split document.

In [None]:
from langchain_core.documents import Document

# Sample documents
documents = [
    Document(page_content="This is the first document."),
    Document(page_content="This is the second document.")
]

# Split the documents
split_docs = splitter.split_documents(documents)

# Output the split documents
for doc in split_docs:
    print(doc.page_content)

### **Example 4: Using a Custom Model**

Here, the splitter is initialized with a specific Sentence Transformers model passed via `model_name`. Token counts then come from that model's own tokenizer, so chunk boundaries match what the model will see. Keep `tokens_per_chunk` within the model's maximum sequence length.

In [None]:
# Initialize the splitter with a custom Sentence Transformers model
splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/paraphrase-MiniLM-L6-v2",
    tokens_per_chunk=50,
    chunk_overlap=10
)

# Split the text
chunks = splitter.split_text(text)

# Output the chunks with index
for index, chunk in enumerate(chunks):
    print(f"Chunk {index + 1}:\n{chunk}\n")

---

## **Conclusion**

Text splitting is a foundational step in many NLP and document processing workflows. The right tool depends on the nature of your data and the requirements of your task.

- **`RecursiveCharacterTextSplitter`** is a versatile choice for general-purpose text splitting, offering flexibility in defining separators and chunk sizes.
- **`MarkdownHeaderTextSplitter`** excels at preserving the hierarchical structure of Markdown documents, making it ideal for processing structured content.
- **`SentenceTransformersTokenTextSplitter`** is tailored for token-based splitting, ensuring compatibility with Sentence Transformers models for tasks like embedding generation and semantic search.

By understanding the strengths and use cases of each class, you can make informed decisions and optimize your text-processing pipelines. Whether you’re working with plain text, Markdown, or text bound for an embedding model, these tools provide robust solutions for splitting text effectively.