# Recursive Text Splitter

To begin, we'll explore two of the most common text splitters in LangChain: the recursive character text splitter and the character text splitter. We'll experiment with some simple examples to understand how they function. For this, we'll set a relatively small chunk size of 26 and an even smaller chunk overlap of 4, allowing us to clearly observe their behavior.

Let's initialize the two splitters as `r_splitter` and `c_splitter`, respectively, and then walk through a few inputs to see how each one behaves.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [2]:
chunk_size = 26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
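
To see what `chunk_size` and `chunk_overlap` mean in isolation, here is a minimal sketch of a plain sliding window over characters. `sliding_chunks` is a hypothetical helper written for illustration, not a LangChain function, and it ignores separators entirely. The real splitters split on separators first and then merge the pieces, so their chunk boundaries can differ.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end of the text
            break
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyzabcdefg"  # 33 characters, an illustrative string
print(sliding_chunks(sample, chunk_size=26, chunk_overlap=4))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

The second chunk begins with `wxyz`, the last four characters of the first chunk, which is the overlap.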

In [14]:
text1 = 'zyxwvutsrqponmlkjihgfedcba'

In [15]:
text2 = 'zyxwvutsrqponmlkjihgfedcbazyxwvutsr'
r_splitter.split_text(text2)

['zyxwvutsrqponmlkjihgfedcba', 'dcbazyxwvutsr']

In [18]:
text3 = "z y x w v u t s r q p o n m l k j i h g f e d c b a"
r_splitter.split_text(text3)

['z y x w v u t s r q p o n', 'o n m l k j i h g f e d c', 'd c b a']

In [19]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
c_splitter.split_text(text3)

['z y x w v u t s r q p o n', 'o n m l k j i h g f e d c', 'd c b a']
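
The chunk boundaries in the output above can be reproduced with a small greedy merge over the space-separated pieces. The sketch below is a simplified illustration, not LangChain's actual code; `merge_with_overlap` and its parameter names are hypothetical. It assumes that pieces are added to a chunk until the next one would exceed `chunk_size`, and that the overlap is carried over in whole pieces.

```python
def merge_with_overlap(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks of at most chunk_size characters."""
    chunks = []
    window = []  # pieces that make up the chunk currently being built

    def joined_length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if window and joined_length(window + [piece]) > chunk_size:
            chunks.append(separator.join(window))
            # Keep only a tail of the finished chunk as overlap: drop pieces from
            # the front until the tail fits in chunk_overlap and leaves room for
            # the new piece.
            while window and (
                joined_length(window) > chunk_overlap
                or joined_length(window + [piece]) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(separator.join(window))
    return chunks


letters = "z y x w v u t s r q p o n m l k j i h g f e d c b a"
chunks = merge_with_overlap(letters.split(" "), " ", chunk_size=26, chunk_overlap=4)
print(chunks)
# ['z y x w v u t s r q p o n', 'o n m l k j i h g f e d c', 'd c b a']
```

The overlap is `o n` (3 characters) rather than a full 4, because it is carried over in whole pieces and `p o n` would already be 5 characters.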

## Deep dive into Recursive Splitter

In [7]:
text = "As the global community confronts the urgent issue of climate change, renewable energy stands out as a promising solution. Solar \
       and wind energy, in particular, are reshaping the energy sector, providing eco-friendly alternatives to conventional fossil fuels. \
     Nations and corporations worldwide are committing to clean energy projects to curb carbon emissions and lessen environmental harm.\n\n \
     The transition to renewable sources not only tackles ecological challenges but also drives technological advancement, paving the way \
     for a more sustainable and prosperous future for future generations"

In [8]:
# CharacterTextSplitter
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)

# RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

For `RecursiveCharacterTextSplitter`, we pass a list of separators: double newline, single newline, space, and finally the empty string. This is also the splitter's default list.

This means that when splitting a piece of text, the splitter first tries double newlines. If any resulting chunk is still too large, it moves on to single newlines, and then to spaces. Finally, if nothing else works, it splits character by character.
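
The fallback order described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical helper, it has no overlap handling, it drops empty pieces, and it treats separators as literal strings.

```python
def recursive_split(text, separators, chunk_size):
    """Split text with the coarsest separator first, recursing on oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, finer, chunk_size)

    chunks = []
    current = ""  # neighbouring pieces re-joined while they still fit
    for part in text.split(sep):
        if not part:
            continue
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too large: retry it with the finer separators.
            chunks.extend(recursive_split(part, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph has two sentences. Here is the second one."
    "\n\nSecond paragraph is short."
)
result = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=40)
print(result)
# ['First paragraph has two sentences. Here', 'is the second one.', 'Second paragraph is short.']
```

The paragraph break is honoured first, and only the paragraph that exceeds 40 characters is split further, on spaces.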

Let's apply these two splitters to the text above and look at how they perform.

In [9]:
c_splitter.split_text(text)

['As the global community confronts the urgent issue of climate change, renewable energy stands out as a promising solution. Solar and wind energy, in particular, are reshaping the energy sector, providing eco-friendly alternatives to conventional fossil fuels. Nations and corporations worldwide are committing to clean energy projects to curb carbon emissions and lessen environmental harm.\n\n The transition to renewable sources not only tackles',
 'ecological challenges but also drives technological advancement, paving the way for a more sustainable and prosperous future for future generations']

In [10]:
r_splitter.split_text(text)

['As the global community confronts the urgent issue of climate change, renewable energy stands out as a promising solution. Solar        and wind energy, in particular, are reshaping the energy sector, providing eco-friendly alternatives to conventional fossil fuels.      Nations and corporations worldwide are committing to clean energy projects to curb carbon emissions and lessen environmental harm.',
 'The transition to renewable sources not only tackles ecological challenges but also drives technological advancement, paving the way      for a more sustainable and prosperous future for future generations']

Now let's split the text into even smaller chunks to get a better sense of how this works. We also add a period separator, intended to split between sentences.

In [11]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]
)
r_splitter.split_text(text)

['As the global community confronts the urgent issue of climate change, renewable energy stands out as a promising solution. Solar        and wind',
 'energy, in particular, are reshaping the energy sector, providing eco-friendly alternatives to conventional fossil fuels.      Nations and',
 'corporations worldwide are committing to clean energy projects to curb carbon emissions and lessen environmental harm.',
 'The transition to renewable sources not only tackles ecological challenges but also drives technological advancement, paving the way      for a',
 'more sustainable and prosperous future for future generations']

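The `\. ` separator is meant to split the text between sentences. On its own, though, such a pattern tends to misplace the periods: the period is part of the matched separator, so it ends up detached from the sentence it closes. A regex lookbehind, `(?<=\. )`, avoids this by matching the empty position just after the period and space, which keeps each period at the end of its sentence. Both patterns only work when the splitter treats separators as regular expressions. Recent LangChain releases escape separators and match them literally unless `is_separator_regex=True` is passed. Without that flag, neither pattern matches anything and the splitter falls back to splitting on spaces, which is why the output above, and the identical output below, break mid-sentence.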
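
The difference between the two patterns can be checked with Python's built-in `re` module alone, independent of any LangChain version. In the sketch below, the sample sentence and variable names are illustrative, and attaching each separator to the piece that follows it is an assumption about how a separator-keeping splitter might behave.

```python
import re

sentence_text = "Solar is growing. Wind is growing too. Costs keep falling."

# Plain pattern with a capture group: re.split returns each separator as its own item.
parts = re.split(r"(\. )", sentence_text)
# ['Solar is growing', '. ', 'Wind is growing too', '. ', 'Costs keep falling.']

# Assumption: a splitter that keeps separators glues each one to the piece after it,
# which leaves the periods at the start of the next sentence.
attached_to_next = [parts[0]] + [
    parts[i] + parts[i + 1] for i in range(1, len(parts) - 1, 2)
]
# ['Solar is growing', '. Wind is growing too', '. Costs keep falling.']

# Lookbehind: the match is the empty position after ". ", so nothing is removed
# and every period stays with the sentence it ends (requires Python 3.7+).
with_lookbehind = re.split(r"(?<=\. )", sentence_text)
# ['Solar is growing. ', 'Wind is growing too. ', 'Costs keep falling.']

print(attached_to_next)
print(with_lookbehind)
```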

In [13]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)
r_splitter.split_text(text)

['As the global community confronts the urgent issue of climate change, renewable energy stands out as a promising solution. Solar        and wind',
 'energy, in particular, are reshaping the energy sector, providing eco-friendly alternatives to conventional fossil fuels.      Nations and',
 'corporations worldwide are committing to clean energy projects to curb carbon emissions and lessen environmental harm.',
 'The transition to renewable sources not only tackles ecological challenges but also drives technological advancement, paving the way      for a',
 'more sustainable and prosperous future for future generations']