# 2.2 Document Splitting

Splitting documents into chunks may seem simple, but the choice of chunk size, overlap, and separators affects what context each chunk keeps, so it takes some thought and planning.


![Splitting](https://python.langchain.com/assets/images/text_splitters-7961ccc13e05e2fd7f7f58048e082f47.png)

## Setup

### Install dependencies

In [None]:
%pip install python-dotenv~=1.0 docarray~=0.40.0 pypdf~=5.1 --upgrade --quiet
%pip install langchain~=0.3.7 langchain_openai~=0.2.6 langchain_community~=0.3.5 --upgrade --quiet

# If running locally, you can do this instead:
#%pip install -r ../requirements.txt

### Load environment variables

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# If running in Google Colab, you can use this code instead:
# from google.colab import userdata
# os.environ["AZURE_OPENAI_API_KEY"] = userdata.get("AZURE_OPENAI_API_KEY")
# os.environ["AZURE_OPENAI_ENDPOINT"] = userdata.get("AZURE_OPENAI_ENDPOINT")

### Set up path to data

In [None]:
data_path = "../data"

## Basic splitting

The most intuitive strategy is to split documents based on their length. This simple yet effective approach ensures that each chunk doesn't exceed a specified size limit. 

Key benefits of length-based splitting:

- Straightforward implementation
- Consistent chunk sizes
- Easily adaptable to different model requirements
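
To make the idea concrete, here is a minimal pure-Python sketch of length-based splitting with overlap. The helper `split_by_length` is a hypothetical name used only for illustration; it is not how the LangChain splitters are implemented, since they also take separators into account.

```python
def split_by_length(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Illustrative helper only; not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = split_by_length("abcdefghijklmnopqrstuvwxyzabcdefg", chunk_size=27, chunk_overlap=4)
print(chunks)
```

Every chunk stays within `chunk_size`, and the last four characters of one chunk reappear at the start of the next, which is the purpose of `chunk_overlap`.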

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [None]:
chunk_size = 27
chunk_overlap = 4

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Let's see how the text is split using the `RecursiveCharacterTextSplitter`:

In [None]:
text1 = 'abcdefghijklmn opqrstuvwxyz' # Equal to chunk_size

r_splitter.split_text(text1)

In [None]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg' # Longer than chunk_size
#text2 = 'abcdefghijklmno pqrstuvwxyzabcdefg' # Longer than chunk_size

r_splitter.split_text(text2) # Note overlap

In [None]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

r_splitter.split_text(text3)

Now, let's see how the text is split using the `CharacterTextSplitter`:

In [None]:
# How many chunks will be created?
c_splitter.split_text(text3)

In [None]:
# Let's redefine the splitter to split on spaces
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)
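
The difference between the two `CharacterTextSplitter` runs comes from its single separator, which defaults to `"\n\n"`. The sketch below mimics that behaviour in plain Python: split on one separator, then merge neighbouring pieces while they fit. `split_on_separator` is a hypothetical helper, and overlap is left out to keep it short, so the output will not match the library's exactly.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then merge pieces up to chunk_size.

    Illustrative helper only; overlap handling is omitted.
    """
    parts = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + separator + part
        # A piece with no separator inside cannot be broken up further,
        # so it is kept even when it is longer than chunk_size.
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

one_chunk = split_on_separator(text3, "\n\n", chunk_size=27)
by_space = split_on_separator(text3, " ", chunk_size=27)
print(len(one_chunk), [len(c) for c in one_chunk])
print(by_space)
```

With `"\n\n"` there is nothing to split on, so the whole string comes back as one oversized chunk. With `" "` the pieces are single letters that can be packed into chunks within the limit.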


**Try your own examples!** 
<br/><br/><br/>

----


## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text. 
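
The idea behind recursive splitting can be sketched in a few lines of plain Python: try the coarsest separator first, and only fall back to finer ones for pieces that are still too long. This is a simplified illustration under stated assumptions (the hypothetical `recursive_split` helper ignores overlap and is not the library's actual implementation).

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text so that every chunk has at most chunk_size characters.

    Separators are tried in order; an empty separator means "cut at fixed
    character positions". Illustrative helper only; not part of LangChain.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        if len(part) > chunk_size:
            # Too long for this level: emit what we have, then go finer.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
            continue
        # Merge neighbouring parts while the result still fits.
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph here.\n\nSecond paragraph is a bit longer than the first."
result = recursive_split(sample, chunk_size=40, separators=["\n\n", "\n", " ", ""])
print(result)
```

The short first paragraph survives intact, while the longer second paragraph is broken at word boundaries. Keeping paragraphs and sentences whole wherever possible is why this strategy suits generic text.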

In [None]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader which ideas are related. For example, closely related ideas \
are in sentences. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

len(some_text)

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""] # These are the default separators
)

In [None]:
docs = c_splitter.split_text(some_text)

print(f"Splits: {[len(doc) for doc in docs]}")
print(docs)   

In [None]:
docs = r_splitter.split_text(some_text)

print(f"Splits: {[len(doc) for doc in docs]}")
print(docs)

## PDF Splitting

In [None]:
from langchain_community.document_loaders import PyPDFLoader
#loader = PyPDFLoader(f"{data_path}/MachineLearning-Lecture01.pdf")
loader = PyPDFLoader(f"{data_path}/prop_202425__11.pdf")
loaded_pages = loader.load()

content = ' '.join([page.page_content for page in loaded_pages[:20]])
print(f"Characters in the first (up to) 20 pages: {len(content)}")
print(content[:500])

In [None]:
# Try with CharacterTextSplitter
# text_splitter = CharacterTextSplitter(
#     separator="\n",
#     chunk_size=1000,
#     chunk_overlap=150,
#     length_function=len
# )
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000, 
    chunk_overlap=200
)

In [None]:
docs = text_splitter.split_documents(loaded_pages)

In [None]:
print(f"Document splits: {len(docs)}")
print(f"Loaded pages: {len(loaded_pages)}")

In [None]:
# Let's print some info about a single chunk (docs holds chunks, not pages)
chunk_index = 3
print(f"Metadata: {docs[chunk_index].metadata}")
print(f"Length: {len(docs[chunk_index].page_content)}")
print(f"Chunk content (first 100): \n{docs[chunk_index].page_content[:100]}...")

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitters often use sentences or other delimiters to keep related text together, but many documents (such as Markdown) have explicit structure (headers) that can be used directly when splitting.
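
As a rough illustration of header-aware splitting, the sketch below walks through a Markdown string line by line, tracks which headers are currently in scope, and attaches them to each section as metadata. The helper `split_markdown_by_headers` and the `"Header N"` metadata keys are assumptions made for this example, not a description of any library's internals.

```python
def split_markdown_by_headers(markdown: str, max_level: int = 3) -> list[dict]:
    """Group content lines under their enclosing headers.

    Returns a list of {"content": str, "metadata": dict} sections.
    Illustrative helper only; not part of LangChain.
    """
    sections = []
    active = {}   # header level -> title currently in scope
    buffer = []   # content lines collected for the current section

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(active.items())}
            sections.append({"content": content, "metadata": metadata})
        buffer.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        hashes = len(line) - len(line.lstrip("#"))
        is_header = 1 <= hashes <= max_level and line[hashes:hashes + 1] == " "
        if is_header:
            flush()
            # A new header closes every header at the same or a deeper level.
            active = {lvl: t for lvl, t in active.items() if lvl < hashes}
            active[hashes] = line[hashes:].strip()
        elif line:
            buffer.append(line)
    flush()
    return sections


sample_md = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
for section in split_markdown_by_headers(sample_md):
    print(section)
```

Each section carries the full header path it appeared under, so a chunk such as "Hi this is Lance" stays linked to its chapter and section even after it is separated from the rest of the document.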

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [None]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [None]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [None]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [None]:
md_header_splits[0]

In [None]:
md_header_splits[1]