# Text Splitting

In LangChain, text splitting breaks a large body of text into smaller chunks so that each piece fits comfortably within what an LLM can process at once. This is especially useful for large documents such as PDFs.

In [1]:
from langchain_community.document_loaders import PyPDFLoader 
pdf_loader = PyPDFLoader("Attention_is_ All_You_Need.pdf")
pdf_doc = pdf_loader.load()
pdf_doc

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'Attention_is_ All_You_Need.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoog

In [2]:
len(pdf_doc)

15

### Recursively split text by characters

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(pdf_doc)
chunks

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'Attention_is_ All_You_Need.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu'),
 Document(metadata

In [5]:
chunks[:1]

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'Attention_is_ All_You_Need.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu')]

In [8]:
for i, chunk in enumerate(chunks[:5]):  # Display first 5 chunks
    print(f"Chunk {i+1}:\n{chunk.page_content}\n{'-'*50}")
    

Chunk 1:
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
--------------------------------------------------
Chunk 2:
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, 

In [9]:
from langchain_community.document_loaders import TextLoader
textloader = TextLoader("sample.txt")
text_doc = textloader.load()
text_doc

[Document(metadata={'source': 'sample.txt'}, page_content="LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).  \n\nIt provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.  \n\nUsing LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks.\n")]

In [10]:
text = None
with open("sample.txt") as file:
    text = file.read()
text

"LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).  \n\nIt provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.  \n\nUsing LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks.\n"

In [14]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text_chunks = text_splitter.split_text(text)
print(text_chunks)

['LangChain is an open-source framework designed to help developers build applications powered by', 'powered by large language models (LLMs).', 'It provides tools for loading, processing, and managing different types of data sources such as', 'sources such as text files, PDFs, web pages, and databases.', "Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize", 'sources and utilize it for various AI-based tasks.']


In [15]:
# Use a separate loop variable so the text_chunks list is not overwritten
for i, chunk in enumerate(text_chunks[:5]):
    print(f"Chunk {i+1}:\n{chunk}\n{'-'*50}")

Chunk 1:
LangChain is an open-source framework designed to help developers build applications powered by
--------------------------------------------------
Chunk 2:
powered by large language models (LLMs).
--------------------------------------------------
Chunk 3:
It provides tools for loading, processing, and managing different types of data sources such as
--------------------------------------------------
Chunk 4:
sources such as text files, PDFs, web pages, and databases.
--------------------------------------------------
Chunk 5:
Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize
--------------------------------------------------


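Here we can see that chunks cut from the same paragraph overlap: the end of one chunk is repeated at the start of the next (for example, "powered by" closes Chunk 1 and opens Chunk 2). `chunk_overlap=20` is an upper bound, so the shared text can be shorter than 20 characters because the splitter breaks on separators such as spaces. Chunks that start a new paragraph (Chunk 3, Chunk 5) share nothing with the previous chunk.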
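The overlap can be measured directly. The following is a minimal sketch in plain Python; `shared_overlap` is a hypothetical helper written for this check (it is not part of LangChain), and the chunk strings are copied from the `split_text` output above.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""

# Chunks copied from the split_text output above
text_chunks = [
    "LangChain is an open-source framework designed to help developers build applications powered by",
    "powered by large language models (LLMs).",
    "It provides tools for loading, processing, and managing different types of data sources such as",
    "sources such as text files, PDFs, web pages, and databases.",
    "Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize",
    "sources and utilize it for various AI-based tasks.",
]

# Compare each chunk with the one that follows it
overlaps = [shared_overlap(a, b) for a, b in zip(text_chunks, text_chunks[1:])]
for index, overlap in enumerate(overlaps, start=1):
    print(f"Chunk {index} -> {index + 1}: {overlap!r} ({len(overlap)} characters)")
```

Every measured overlap stays at or below the configured 20 characters, and the pairs that cross a paragraph boundary share no text.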

### CharacterTextSplitter

CharacterTextSplitter is a simple way to split text based on a specified separator, like spaces, new lines, or other characters. It is useful when working with structured text where specific delimiters (e.g., sentences, paragraphs) define meaningful splits.

Key Parameters:
- separator → Defines how the text is split (default: "\n\n" for paragraphs).
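- chunk_size → Target maximum size of each chunk, in characters. A piece that contains no separator is kept whole, so a chunk can exceed this size (see the warnings in the output below).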
- chunk_overlap → Number of overlapping characters between chunks.
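To make the role of `separator` and `chunk_size` concrete, here is a simplified sketch of separator-based splitting in plain Python. It is an illustration under stated assumptions, not LangChain's implementation: `split_on_separator` is a hypothetical function, overlap handling is left out, and the sample text is invented.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily pack pieces into chunks of at most `chunk_size`.

    Simplified: there is no overlap handling, and a single piece longer
    than `chunk_size` is kept whole rather than being cut.
    """
    pieces = [piece.strip() for piece in text.split(separator) if piece.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow the chunk, so start a new one
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Invented sample text: two short lines and one line longer than chunk_size
sample = "First line.\nSecond line.\nA much longer third line that exceeds the limit."
chunks = split_on_separator(sample, separator="\n", chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The two short lines are merged into one chunk, while the third line has no separator inside it and therefore ends up as a chunk longer than `chunk_size`.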


In [19]:
from langchain_community.document_loaders import TextLoader
textloader = TextLoader("sample.txt")
text_doc = textloader.load()

In [20]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",  # Split by new lines
    chunk_size=50,   # Target max of 50 characters; longer lines are kept whole
    chunk_overlap=10 # Overlap between consecutive chunks
)
text_chunks = text_splitter.split_documents(text_doc)
print(text_chunks)

Created a chunk of size 127, which is longer than the specified 50
Created a chunk of size 141, which is longer than the specified 50


[Document(metadata={'source': 'sample.txt'}, page_content='LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).'), Document(metadata={'source': 'sample.txt'}, page_content='It provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.'), Document(metadata={'source': 'sample.txt'}, page_content="Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks.")]


In [21]:
for i, chunk in enumerate(text_chunks):
    print(f"Chunk {i+1}:\n{chunk}\n{'-'*50}")

Chunk 1:
page_content='LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).' metadata={'source': 'sample.txt'}
--------------------------------------------------
Chunk 2:
page_content='It provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.' metadata={'source': 'sample.txt'}
--------------------------------------------------
Chunk 3:
page_content='Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks.' metadata={'source': 'sample.txt'}
--------------------------------------------------


### HTMLHeaderTextSplitter
HTMLHeaderTextSplitter is a specialized text splitter in LangChain designed to handle HTML documents by splitting them based on HTML header tags (h1, h2, h3, etc.). This is useful when working with structured documents like web pages, articles, or reports where headers define logical sections.

Key Features:
- Splits text based on HTML header tags (h1, h2, h3, etc.).
- Preserves hierarchical structure to maintain context.
- Customizable header hierarchy for more control.
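The idea of carrying header context along with each block of text can be sketched with the standard library alone. This is an illustration, not how `HTMLHeaderTextSplitter` is implemented: `HeaderSectionParser` and the sample HTML are hypothetical, and unlike the LangChain splitter this sketch emits only body text, not the headers themselves as separate chunks.

```python
from html.parser import HTMLParser

class HeaderSectionParser(HTMLParser):
    """Collect text blocks and tag each with the headers currently in scope."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header1"}
        self.active_headers = {}   # header tag -> header text currently in scope
        self.current_tag = None    # header tag being read, if any
        self.sections = []         # list of (text, metadata) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag is not None:
            # A new header replaces its own level and drops every deeper level
            self.active_headers = {
                tag: title for tag, title in self.active_headers.items()
                if tag < self.current_tag
            }
            self.active_headers[self.current_tag] = text
        else:
            metadata = {self.header_names[tag]: title
                        for tag, title in self.active_headers.items()}
            self.sections.append((text, metadata))

# Invented sample HTML
sample_html = (
    "<h1>Guide</h1><p>Intro text.</p>"
    "<h2>Setup</h2><p>Install it.</p>"
    "<h2>Usage</h2><p>Run it.</p>"
)
parser = HeaderSectionParser([("h1", "Header1"), ("h2", "Header2")])
parser.feed(sample_html)
parser.close()
sections = parser.sections
for text, metadata in sections:
    print(text, metadata)
```

Each paragraph is paired with the `h1` and `h2` titles that were in effect when it appeared, which is the hierarchical context the feature list describes.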

In [23]:
# Sample HTML content
html_text = """
<h1>Introduction</h1>
<p>Transformers have revolutionized NLP.</p>

<h2>Self-Attention Mechanism</h2>
<p>Self-attention allows models to weigh words differently.</p>

<h2>Applications</h2>
<p>Used in translation, summarization, and chatbots.</p>
"""
html_text

'\n<h1>Introduction</h1>\n<p>Transformers have revolutionized NLP.</p>\n\n<h2>Self-Attention Mechanism</h2>\n<p>Self-attention allows models to weigh words differently.</p>\n\n<h2>Applications</h2>\n<p>Used in translation, summarization, and chatbots.</p>\n'

In [25]:
from langchain.text_splitter import HTMLHeaderTextSplitter

# Define the headers to split on
headers_to_split_on = [
    ("h1", "Header1"),  # Maps <h1> to "Header1"
    ("h2", "Header2"),  # Maps <h2> to "Header2"
]

# Initialize HTMLHeaderTextSplitter
text_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
text_splitter

<langchain_text_splitters.html.HTMLHeaderTextSplitter at 0x271e2db6fe0>

In [26]:
# Split the text
chunks = text_splitter.split_text(html_text)

# Print the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk.page_content}\nMetadata: {chunk.metadata}\n{'-'*50}")

Chunk 1:
Introduction
Metadata: {'Header1': 'Introduction'}
--------------------------------------------------
Chunk 2:
Transformers have revolutionized NLP.
Metadata: {'Header1': 'Introduction'}
--------------------------------------------------
Chunk 3:
Self-Attention Mechanism
Metadata: {'Header1': 'Introduction', 'Header2': 'Self-Attention Mechanism'}
--------------------------------------------------
Chunk 4:
Self-attention allows models to weigh words differently.
Metadata: {'Header1': 'Introduction', 'Header2': 'Self-Attention Mechanism'}
--------------------------------------------------
Chunk 5:
Applications
Metadata: {'Header1': 'Introduction', 'Header2': 'Applications'}
--------------------------------------------------
Chunk 6:
Used in translation, summarization, and chatbots.
Metadata: {'Header1': 'Introduction', 'Header2': 'Applications'}
--------------------------------------------------


In [27]:
from langchain.text_splitter import HTMLHeaderTextSplitter

# Define the headers to split on
headers_to_split_on = [
    ("h1", "Header1"),  # Maps <h1> to "Header1"
    ("h2", "Header2"),  # Maps <h2> to "Header2"
    ("h3", "Header3"),  # Maps <h3> to "Header3"
    ("h4", "Header4"),  # Maps <h4> to "Header4"
]

# Initialize HTMLHeaderTextSplitter
text_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# Split the text
chunks = text_splitter.split_text_from_url('https://plato.stanford.edu/entries/analysis/')

In [28]:
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk.page_content}\nMetadata: {chunk.metadata}\n{'-'*50}")

Chunk 1:
End header wrapper  
    End content  
    End footer  
End header  
End navigation  
 
  End search  
Stanford Encyclopedia of Philosophy  
Menu  
Browse  
Table of Contents  
What's New  
Random Entry  
Chronological  
Archives  
About  
Editorial Information  
About the SEP  
Editorial Board  
How to Cite the SEP  
Special Characters  
Advanced Tools  
Contact  
Support SEP  
Support the SEP  
PDFs for SEP Friends  
Make a Donation  
SEPIA for Libraries  
Begin article sidebar  
 
  End article sidebar  
  NOTE: Article content must have two wrapper divs: id="article" and id="article-content"  
    End article  
  NOTE: article banner is outside of the id="article" div.  
    End article-banner  
Entry Navigation  
Entry Contents  
Bibliography  
Academic Tools  
Friends PDF Preview  
Author and Citation Info  
Back to Top  
End article-content  
BEGIN ARTICLE HTML  
  #aueditable  DO NOT MODIFY THIS LINE AND BELOW 
  END ARTICLE HTML  
DO NOT MODIFY THIS LINE AND ABOVE
Met

### Splitting a JSON File 
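When working with JSON files, the approach to text splitting depends on the structure of the JSON data. LangChain provides `RecursiveJsonSplitter` (in `langchain_text_splitters`), which splits nested JSON into smaller dictionaries while preserving the key paths, as shown in the cells below. When only certain text fields matter, we can also process JSON files manually by: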

- Loading JSON data → Read and parse the file.
- Extracting relevant text fields → Identify which parts of JSON need splitting.
- Using RecursiveCharacterTextSplitter or CharacterTextSplitter → Break text into manageable chunks.
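The three manual steps listed above (load, extract, split) can be sketched in plain Python. This is a minimal illustration under stated assumptions: `iter_text_fields` and `split_fixed` are hypothetical helpers, the JSON string is invented, and `split_fixed` is a plain fixed-window stand-in for a LangChain text splitter, not a reimplementation of one.

```python
import json

def iter_text_fields(node, path=()):
    """Yield (path, text) for every string value in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_text_fields(value, path + (str(key),))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_text_fields(value, path + (str(index),))
    elif isinstance(node, str):
        yield "/".join(path), node

def split_fixed(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

# Step 1: load and parse (invented sample data)
raw_json = '{"title": "Demo", "items": [{"body": "abcdefghij"}]}'
data = json.loads(raw_json)

# Step 2: extract the text fields together with their location in the JSON
fields = list(iter_text_fields(data))

# Step 3: split each field into small overlapping chunks
records = [
    (path, piece)
    for path, text in fields
    for piece in split_fixed(text, chunk_size=6, chunk_overlap=2)
]
for path, piece in records:
    print(path, repr(piece))
```

Keeping the key path next to each chunk plays the same role as the `metadata` on a `Document`: it records where in the original JSON the text came from.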

In [30]:
import json
import requests

json_data = requests.get('https://api.smith.langchain.com/openapi.json').json()
json_data

{'openapi': '3.1.0',
 'info': {'title': 'LangSmith', 'version': '0.1.0'},
 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'],
    'summary': 'Read Tracer Session',
    'description': 'Get a specific session.',
    'operationId': 'read_tracer_session_api_v1_sessions__session_id__get',
    'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}],
    'parameters': [{'name': 'session_id',
      'in': 'path',
      'required': True,
      'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}},
     {'name': 'include_stats',
      'in': 'query',
      'required': False,
      'schema': {'type': 'boolean',
       'default': False,
       'title': 'Include Stats'}},
     {'name': 'accept',
      'in': 'header',
      'required': False,
      'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
       'title': 'Accept'}}],
    'responses': {'200': {'description': 'Successful Response',
      'content': {'application/json': {'sch

In [40]:
from langchain_text_splitters import RecursiveJsonSplitter
json_splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = json_splitter.split_json(json_data)

In [41]:
for i, chunk in enumerate(json_chunks[:5]):
    print(f"Chunk {i+1}: DataType = {type(chunk)}\n{chunk}\n{'-'*50}")

Chunk 1: DataType = <class 'dict'>
{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}
--------------------------------------------------
Chunk 2: DataType = <class 'dict'>
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get', 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}
--------------------------------------------------
Chunk 3: DataType = <class 'dict'>
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}, {'name': 'include_stats', 'in': 'query', 'required': False, 'schema': {'type': 'boolean', 'default': False, 'title': 'Include Stats'}}, {'name': 'accept', 'in': 'heade

In [42]:
# The splitter can also output Document objects
json_docs = json_splitter.create_documents(texts=[json_data])
for i, chunk in enumerate(json_docs[:5]):
    print(f"Chunk {i+1}: DataType = {type(chunk)}\n{chunk}\n{'-'*50}")

Chunk 1: DataType = <class 'langchain_core.documents.base.Document'>
page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}'
--------------------------------------------------
Chunk 2: DataType = <class 'langchain_core.documents.base.Document'>
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"operationId": "read_tracer_session_api_v1_sessions__session_id__get", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'
--------------------------------------------------
Chunk 3: DataType = <class 'langchain_core.documents.base.Document'>
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_sta