# Overview

The goal of this notebook is to gain an intuitive feel for a few of LangChain's text splitters, specifically:
- Character Splitters
    - `CharacterTextSplitter(TextSplitter)` [RTD](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter)  
      This is the simplest method. It splits on a single separator string (by default `"\n\n"`) and measures chunk length by number of characters.
    - `RecursiveCharacterTextSplitter(TextSplitter)` [RTD](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)  
      It starts by splitting at double newlines, then single newlines, then spaces, and finally, if necessary, character by character. This hierarchical approach aims to keep chunks manageable while retaining as much of the original structure as possible.
- Sentence Splitters
    - `NLTKTextSplitter(TextSplitter)` - uses NLTK's `sent_tokenize` to split text into sentences.
    - `SpacyTextSplitter(TextSplitter)` - uses spaCy to split text into sentences. Before you use it, run `pip install spacy` and then `python -m spacy download en_core_web_sm`.
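
To make the recursive idea concrete, here is a minimal pure-Python sketch of the separator hierarchy described above. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, and this version neither merges small pieces back up toward the chunk size nor applies any overlap, so it produces finer chunks than the real splitter would.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Illustrative only: split on the coarsest separator first and recurse
    into any piece that is still longer than chunk_size."""
    if not text:
        return []
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: fixed-width slices, character by character.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        # Pieces that are still too long are handed to the next, finer separator.
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


sample = "para one line one\npara one line two\n\nabcdefghijklmnopqrstuvwxyz"
print(recursive_split(sample, chunk_size=10))
```

In this sample the first paragraph ends up split on spaces, while the unbroken 26-letter run has no separators at all and falls through to character slicing.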

All of these classes inherit from `TextSplitter`, which sets the default chunk size and overlap size ([src](https://github.com/hwchase17/langchain/blob/dd648183fae95f5f251926e3a188d9ef9e6faeed/langchain/text_splitter.py#L38)):
```
class TextSplitter(BaseDocumentTransformer, ABC):
    """Interface for splitting text into chunks."""

    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
```
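
To get a feel for how `chunk_size` and `chunk_overlap` relate, here is a small sliding-window sketch. The helper `sliding_chunks` is hypothetical and much cruder than the real splitters, which cut on separators rather than at fixed character offsets; the defaults simply mirror the values shown above.

```python
def sliding_chunks(text, chunk_size=4000, chunk_overlap=200):
    """Illustrative only: fixed-width windows where each chunk repeats the
    last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not just repeated overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each window advances by `chunk_size - chunk_overlap` characters, so a larger overlap means more chunks for the same text.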


# Character Splitters
We'll start by exploring `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` examples.

The table below summarizes the six configurations used in our exploration: three for `CharacterTextSplitter` and three for `RecursiveCharacterTextSplitter`. Each row gives the text splitter's name, its abbreviated label, the chunk size (maximum characters per chunk), and the overlap size (characters shared between adjacent chunks). Chunk sizes range from 100 to 4000 characters, and overlap sizes are either 20 or 200 characters. The (perhaps naive) thought is that these samples will be enough to gain an intuitive feel for how the text splitters work and, ultimately, for the quality of the queries when each one is used.

```python
from IPython.display import display, Markdown

# Define the data for the table
data = [
    {'Text Splitter Name': 'CharacterTextSplitter', 'Label': 'C', 'Chunk Size': 100, 'Overlap Size': 20},
    {'Text Splitter Name': 'CharacterTextSplitter', 'Label': 'C', 'Chunk Size': 1000, 'Overlap Size': 200},
    {'Text Splitter Name': 'CharacterTextSplitter', 'Label': 'C', 'Chunk Size': 4000, 'Overlap Size': 200},
    {'Text Splitter Name': 'RecursiveCharacterTextSplitter', 'Label': 'R', 'Chunk Size': 100, 'Overlap Size': 20},
    {'Text Splitter Name': 'RecursiveCharacterTextSplitter', 'Label': 'R', 'Chunk Size': 1000, 'Overlap Size': 200},
    {'Text Splitter Name': 'RecursiveCharacterTextSplitter', 'Label': 'R', 'Chunk Size': 4000, 'Overlap Size': 200}
]

# Create the markdown table
table = "| Text Splitter Name | Label | Chunk Size | Overlap Size |\n| --- | --- | --- | --- |\n"
for row in data:
    table += f"| {row['Text Splitter Name']} | {row['Label']} | {row['Chunk Size']} | {row['Overlap Size']} |\n"

# Display the table
display(Markdown(table))
```


| Text Splitter Name | Label | Chunk Size | Overlap Size |
| --- | --- | --- | --- |
| CharacterTextSplitter | C | 100 | 20 |
| CharacterTextSplitter | C | 1000 | 200 |
| CharacterTextSplitter | C | 4000 | 200 |
| RecursiveCharacterTextSplitter | R | 100 | 20 |
| RecursiveCharacterTextSplitter | R | 1000 | 200 |
| RecursiveCharacterTextSplitter | R | 4000 | 200 |
