# Recursive Character Splitter

LangChain’s RecursiveCharacterTextSplitter class breaks a text into smaller chunks by recursively trying a list of separators. It is particularly useful when no single separator reliably produces chunks of the desired size.

The method starts with the list of separators stored in the _separators attribute and uses the first one that occurs in the text. The text is split on that separator, and any resulting piece that is still too large is split again with the remaining separators. This repeats recursively until the pieces are of a manageable size.
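The recursion described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `recursive_split` is hypothetical, separators are dropped rather than kept, and the merging step is left out.

```python
def recursive_split(text, separators, chunk_size):
    """Split text on the first separator found in it, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]

    # Pick the first separator that occurs in the text.
    # The empty string always applies and means "split into characters".
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            remaining = separators[i + 1:]
            break
    else:
        # No separator applies: return the text unchanged, even if oversized.
        return [text]

    pieces = list(text) if sep == "" else text.split(sep)

    chunks = []
    for piece in pieces:
        if not piece:
            continue  # skip empty strings produced by adjacent separators
        if len(piece) <= chunk_size or not remaining:
            chunks.append(piece)
        else:
            # Still too large: try again with the finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", ["\n\n", " ", ""], 7))
# ['aaa bbb', 'ccc', 'ddd', 'eee']
```

The first paragraph fits within 7 characters and is kept whole, while the second is too long and falls through to the space separator.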

The separators are listed in descending order of preference, so the method tries the coarsest, most meaningful boundaries first. For example, when splitting Python source code, it tries class definitions ("\nclass ") first, then function definitions ("\ndef "), and then more generic patterns. Finer separators are used only on pieces that remain too large.
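A small sketch of how the ordering plays out on Python source. The list `PYTHON_SEPARATORS` below is illustrative: it contains the two patterns named above plus generic fallbacks, and is not claimed to be the library's exact list. Both helper functions are hypothetical names.

```python
import re

# Illustrative ordering: class boundaries, then function boundaries, then generic fallbacks.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]


def first_matching_separator(text, separators):
    """Return the first separator in the list that occurs in text."""
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    return None


def split_keeping_separator(text, sep):
    """Split before each occurrence of sep, so the separator stays with the piece it starts."""
    if sep == "":
        return list(text)
    pieces = re.split(f"(?={re.escape(sep)})", text)
    return [p for p in pieces if p]


source = "import os\nclass A:\n    pass\nclass B:\n    pass\ndef main():\n    pass\n"

sep = first_matching_separator(source, PYTHON_SEPARATORS)
pieces = split_keeping_separator(source, sep)

print(repr(sep))    # '\nclass '
print(len(pieces))  # 3
```

Because "\nclass " comes first in the list, the function definition stays attached to the second class here. It would be separated only if that piece were still too large and the next separator were tried.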

The resulting pieces are then merged into chunks and returned as a list. Two parameters defined in the parent class TextSplitter control the result: chunk_size sets the maximum chunk length (measured in characters by default), and chunk_overlap sets how much text consecutive chunks share. Because finer separators are used only when needed, the chunks tend to follow natural boundaries such as paragraphs and sentences.
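The merging step can be illustrated with a greedy sketch. This is an assumption-laden simplification, not the library's code: `merge_pieces` is a hypothetical name, lengths are counted in characters, and pieces are rejoined with a single separator.

```python
def merge_pieces(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks of at most chunk_size characters."""

    def joined_len(parts):
        return len(separator.join(parts))

    chunks, current = [], []
    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep a tail of the emitted chunk as overlap. Drop pieces from the
            # front until the tail fits in chunk_overlap and leaves room for `piece`.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


pieces = ["aaa", "bbb", "ccc", "ddd"]
print(merge_pieces(pieces, " ", chunk_size=7, chunk_overlap=0))
# ['aaa bbb', 'ccc ddd']
print(merge_pieces(pieces, " ", chunk_size=7, chunk_overlap=3))
# ['aaa bbb', 'bbb ccc', 'ccc ddd']
```

With an overlap of 3, the last piece of each chunk is repeated at the start of the next one, which helps preserve context across chunk boundaries.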

## Split Normal Text

In [1]:
# Raw string, so that "\nclass " and "\ndef " below stay literal text instead of becoming newlines
text = r"""Langchain’s RecursiveCharacterTextSplitter class is designed to break down a given text into smaller chunks by recursively attempting to split it using different separators. This class is particularly useful when a single separator may not be sufficient to identify the desired chunks.

The method starts by trying to split the text using a list of potential separators specified in the _separators attribute. It iteratively checks each separator to find the one that works for the given text. If a separator is found, the text is split, and the process is repeated recursively on the resulting chunks until the chunks are of a manageable size.

The separators are listed in descending order of preference, and the method attempts to split the text using the most specific ones first. For example, in the context of the Python language, it tries to split along class definitions ("\nclass "), function definitions ("\ndef "), and other common patterns. If a separator is found, it proceeds to split the text recursively.

The resulting chunks are then merged and returned as a list. The size of the chunks is determined by parameters like chunk_size and chunk_overlap defined in the parent class TextSplitter. This approach allows for a more flexible and adaptive way of breaking down a text into meaningful sections."""

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# The third separator is a regex lookbehind (split after ". "), so is_separator_regex=True is needed.
recursive_character_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200, chunk_overlap=0, is_separator_regex=True,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
)
chunks = recursive_character_text_splitter.split_text(text)
chunks
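
The separator `(?<=\. )` acts as a sentence boundary only when it is interpreted as a regular expression. To my understanding the splitter does so only when `is_separator_regex=True` is passed, and otherwise escapes each separator and looks for it literally. The difference can be seen with the standard library alone (the sample sentence is made up for illustration):

```python
import re

sample = "First sentence. Second sentence. Third."
pattern = r"(?<=\. )"  # zero-width lookbehind: matches the position right after ". "

# Interpreted as a regex: split after each ". ", keeping the period with its sentence.
as_regex = re.split(pattern, sample)

# Interpreted as a literal string: the characters "(?<=\. )" never occur in the
# sample, so nothing is split.
as_literal = re.split(re.escape(pattern), sample)

print(as_regex)    # ['First sentence. ', 'Second sentence. ', 'Third.']
print(as_literal)  # ['First sentence. Second sentence. Third.']
```

Because the lookbehind matches a position rather than any characters, no text is consumed by the split, and each sentence keeps its trailing period and space.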

## Load and Split PDF

In [7]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; each page becomes one Document (requires the pypdf package)
loader = PyPDFLoader("../docs/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

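# Same settings as for the plain text; split_documents keeps each page's metadata on its chunks.
recursive_character_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200, chunk_overlap=0, is_separator_regex=True,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
)
docs = recursive_character_text_splitter.split_documents(pages)
docs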