<center>
<img src="https://supportvectors.ai/logo-poster-transparent.png" width="400px" style="opacity:0.7">
</center>

In [1]:
%run supportvectors-common.ipynb


<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Currently, only participants in SupportVectors workshops are allowed to study these notebooks for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



#### ***This notebook is adapted from the following resources:***

**1. [https://python.langchain.com/docs/concepts/text_splitters/](https://python.langchain.com/docs/concepts/text_splitters/)**

**2. [https://www.youtube.com/watch?v=8OJC21T2SL4&t=1933s](https://www.youtube.com/watch?v=8OJC21T2SL4&t=1933s)**

#### ***The pros and cons are adapted from the following resource:***

**3. [https://www.f22labs.com/blogs/7-chunking-strategies-in-rag-you-need-to-know/](https://www.f22labs.com/blogs/7-chunking-strategies-in-rag-you-need-to-know/)**

# **Text Chunking Strategies**

Document splitting, or text chunking, is a critical preprocessing step in machine learning and NLP tasks. It involves breaking down large text documents into smaller, manageable pieces (chunks) for downstream tasks.

## **Why split documents?**

**Ensuring Consistent Processing:** Documents in real-world datasets rarely follow a uniform length. Some may be a few sentences long, while others span multiple pages. Splitting them into manageable chunks ensures that all documents are processed consistently, regardless of their original size.  

**Working Within Model Constraints:** Most language models and embedding models have a limit on the number of tokens they can process at once. If a document exceeds this limit, it needs to be truncated or split. Chunking prevents information loss by allowing long documents to be processed in full, section by section.  

**Improving Representation and Understanding:** When dealing with long texts, embeddings and model representations may become diluted, struggling to capture key details effectively. By breaking a document into smaller segments, each part can retain a clearer and more focused representation, leading to better downstream performance in tasks like retrieval and summarization.  

**Enhancing Search and Retrieval Accuracy:** In search systems, retrieving an entire document in response to a query isn’t always helpful. Instead, chunking allows for more precise matching, so that users get results at a finer granularity—directly pointing them to the most relevant passage rather than the whole document.  

**Optimizing Memory and Computation:** Processing large texts as a whole can be computationally expensive and memory-intensive. Splitting them into smaller chunks enables better parallelization, reduces memory overhead, and speeds up processing, making it more efficient for both training and inference.  

Chunking is essential for maintaining both accuracy and efficiency when working with textual data. It not only ensures smooth processing but also enhances the quality of results in applications like search engines, chatbots, and summarization systems.  


In [2]:
# To be used for chunking strategies

document = """The American Revolution (1765–1783) was an ideological and political movement in the Thirteen Colonies which peaked when colonists initiated the ultimately successful war for independence (the American Revolutionary War) against the Kingdom of Great Britain. Leaders of the American Revolution were colonial separatist leaders who originally sought more autonomy as British subjects, but later assembled to support the Revolutionary War, which ended British colonial rule over the colonies, establishing their independence as the United States of America in July 1776.

Discontent with colonial rule began shortly after the defeat of France in the French and Indian War in 1763. Although the colonies had fought and supported the war, Parliament imposed new taxes to compensate for wartime costs and turned control of the colonies' western lands over to the British officials in Montreal. Representatives from several colonies convened the Stamp Act Congress; its "Declaration of Rights and Grievances" argued that taxation without representation violated their rights as Englishmen. In 1767, tensions flared again following the British Parliament's passage of the Townshend Acts. In an effort to quell the mounting rebellion, King George III deployed troops to Boston. A local confrontation resulted in the troops killing protesters in the Boston Massacre on March 5, 1770. In 1772, anti-tax demonstrators in Rhode Island destroyed the Royal Navy customs schooner Gaspee. On December 16, 1773, activists disguised as Indians instigated the Boston Tea Party and dumped chests of tea owned by the British East India Company into Boston Harbor. London closed Boston Harbor and enacted a series of punitive laws, which effectively ended self-government in Massachusetts.

In late 1774, 12 of the Thirteen Colonies (Georgia joined in 1775) sent delegates to the First Continental Congress in Philadelphia. It began coordinating Patriot resistance through underground networks of committees. In April 1775, British forces attempted to disarm local militias around Boston and engaged them. On June 14, 1775, the Second Continental Congress responded by authorizing formation of the Continental Army and appointing George Washington as its commander-in-chief. In August, the king proclaimed Massachusetts to be in a state of open defiance and rebellion. The Continental Army surrounded Boston, and the British withdrew by sea in March 1776, leaving the Patriots in control in every colony. In July 1776, the Second Continental Congress began to take on the role of governing a new nation. It passed the Lee Resolution for national independence on July 2, and on July 4, 1776, unanimously adopted the Declaration of Independence, which embodied the political philosophies of liberalism and republicanism, rejected monarchy and aristocracy, and famously proclaimed that "all men are created equal".

The fighting, now known as the Revolutionary War, continued for five years. During this time, the kingdom of France entered as an ally of the United States. The decisive victory came in the fall of 1781, when the combined American and French armies captured an entire British army in the Siege of Yorktown. The defeat led to the collapse of King George's control of Parliament, with a majority now in favor of ending the war on American terms. On September 3, 1783, the British signed the Treaty of Paris, granting the United States nearly all the territory east of the Mississippi River and south of the Great Lakes. About 60,000 Loyalists migrated to other British territories in Canada and elsewhere, but the great majority remained in the United States. With its victory in the American Revolution, the United States became the first constitutional republic in world history founded on the consent of the governed and the rule of law.
"""

## **Different Approaches for Text Chunking**

### **1. Length-based Chunking or Fixed-Size (Character) Chunking**

This is the most intuitive strategy: segment the document into chunks of a fixed number of characters, regardless of its content or structure. Because it ignores structure, it behaves the same way across different types of text.
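To make the idea concrete, here is a minimal pure-Python sketch of fixed-size chunking with optional overlap. The function name `fixed_size_chunks` is hypothetical and is not part of LangChain. Unlike the `CharacterTextSplitter` used in the next cell, which splits on a separator and then merges the pieces, this sketch cuts at exact character offsets and may therefore break words.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so we never emit a tail that is fully contained in the previous chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(fixed_size_chunks(sample, chunk_size=10))
print(fixed_size_chunks(sample, chunk_size=10, chunk_overlap=2))
```

With an overlap of 2, each chunk repeats the last two characters of the previous one, which is the role `chunk_overlap` plays in the splitters below.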

In [3]:
from langchain.text_splitter import CharacterTextSplitter

# Initialize the CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator=" ",        # Define the separator to split on (e.g., newline, space etc.)
    chunk_size=300,       # Maximum size of each chunk
    chunk_overlap=0       # Number of overlapping characters between chunks
)

In [4]:
text_splitter.create_documents([document])

[Document(metadata={}, page_content='The American Revolution (1765–1783) was an ideological and political movement in the Thirteen Colonies which peaked when colonists initiated the ultimately successful war for independence (the American Revolutionary War) against the Kingdom of Great Britain. Leaders of the American Revolution were'),
 Document(metadata={}, page_content='colonial separatist leaders who originally sought more autonomy as British subjects, but later assembled to support the Revolutionary War, which ended British colonial rule over the colonies, establishing their independence as the United States of America in July 1776.\n\nDiscontent with colonial rule'),
 Document(metadata={}, page_content="began shortly after the defeat of France in the French and Indian War in 1763. Although the colonies had fought and supported the war, Parliament imposed new taxes to compensate for wartime costs and turned control of the colonies' western lands over to the British officials in Mo

#### **Pros:**  
- **Straightforward Implementation** – Simple to apply without requiring complex logic.  
- **High Speed** – Works efficiently, even with large datasets, enabling quick processing.  
- **Uniformity** – Ensures consistent chunk sizes across all documents.  
- **Minimal Resource Usage** – Doesn't rely on advanced models or heavy computation.  

#### **Cons:**  
- **Context Fragmentation** – May break sentences or disrupt the logical flow of information.  
- **Lack of Adaptability** – Doesn’t adjust to content structure or varying information density.  
- **Risk of Splitting Key Information** – Critical details might get divided between chunks, making interpretation harder.  
- **Less Effective for Heterogeneous Content** – Works best for uniformly structured texts but struggles with highly diverse document layouts.  


### **2. Text-Structure-Based Chunking or Recursive Chunking**

Text is naturally organized into hierarchical units such as paragraphs, sentences, and words. We can leverage this inherent structure to inform our splitting strategy, creating splits that maintain natural language flow, preserve semantic coherence within each split, and adapt to varying levels of text granularity.

LangChain's **RecursiveCharacterTextSplitter** implements this concept:

1. The RecursiveCharacterTextSplitter attempts to keep larger units (e.g., paragraphs) intact.
2. If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
3. This process continues down to the word level if necessary.
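The three steps above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` and the separator order are assumptions, and it omits overlap and separator retention.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing only into pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut at character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer and must be split into pieces."
for chunk in recursive_split(text, chunk_size=30):
    print(repr(chunk))
```

The first paragraph fits and is kept whole; only the second paragraph falls through to word-level splitting.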

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

recursive_text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=0)

In [6]:
recursive_text_splitter.create_documents([document])

[Document(metadata={}, page_content='The American Revolution (1765–1783) was an ideological and political movement in the Thirteen Colonies which peaked when colonists initiated the ultimately successful war for independence (the American Revolutionary War) against the Kingdom of Great Britain. Leaders of the American Revolution were'),
 Document(metadata={}, page_content='colonial separatist leaders who originally sought more autonomy as British subjects, but later assembled to support the Revolutionary War, which ended British colonial rule over the colonies, establishing their independence as the United States of America in July 1776.'),
 Document(metadata={}, page_content="Discontent with colonial rule began shortly after the defeat of France in the French and Indian War in 1763. Although the colonies had fought and supported the war, Parliament imposed new taxes to compensate for wartime costs and turned control of the colonies' western lands over to the British"),
 Document(metad

#### **Pros:**  
- **Maintains Logical Structure** – Splits content at natural breakpoints like paragraphs or sections, keeping context intact.  
- **Highly Adaptable** – Supports various content types, including text and code, by using multiple splitting criteria.  
- **Precise Customization** – Allows fine control over chunk size and overlap to balance readability and completeness.  
- **Effective for Complex Documents** – Well-suited for structured formats like technical manuals, research papers, and programming code.  

#### **Cons:**  
- **More Complex Implementation** – Requires careful setup compared to simple, fixed-size chunking.  
- **Increased Processing Load** – Recursive splitting and multiple checks can slow down handling of large texts.  
- **Relies on Clear Dividers** – Works best when documents have well-defined separators; otherwise, chunks may be inconsistent.  
- **Less Efficient for Massive Datasets** – Can be slower than basic chunking approaches when dealing with high volumes of data.

#### ***An enhancement for Python Code Chunking:***

In [7]:
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,       # Change language here to appropriate one
    chunk_size=50, 
    chunk_overlap=0
)

In [8]:
python_splitter.create_documents([PYTHON_CODE])

[Document(metadata={}, page_content='def hello_world():\n    print("Hello, World!")'),
 Document(metadata={}, page_content='# Call the function\nhello_world()')]

### **3. Document-Structure Based Chunking**

Some documents have an inherent structure, such as HTML, Markdown, or JSON files. In these cases, it's beneficial to split the document based on its structure, as it often naturally groups semantically related text.

***A few examples of structure-based splitting:***

- **Markdown**: Split based on headers (e.g., #, ##, ###)
- **HTML**: Split using tags
- **JSON**: Split by object or array elements
- **Code**: Split by functions, classes, or logical blocks
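To make the Markdown case concrete, here is a minimal standard-library sketch that splits on headers and records the header trail of each section as metadata. The function name `split_by_headers` and the output shape are hypothetical; the sketch also assumes no `#` lines appear inside fenced code blocks.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Return one section per header, tagged with the headers it sits under."""
    sections = []
    path = {}   # current header trail, keyed by level, e.g. {1: "Guide", 2: "Setup"}
    body = []   # lines collected since the last header

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"headers": dict(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


md = "# Guide\nIntro text.\n\n## Setup\nInstall it.\n\n# FAQ\nAsk away."
for section in split_by_headers(md):
    print(section)
```

Carrying the header trail with each chunk lets a retriever report where in the document a passage came from.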

#### ***Below, we look at LangChain's Markdown splitters with an example.***

#### **MarkdownTextSplitter**

You can view the separators it uses, such as double newlines, newlines, and spaces, [**here**](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L1175).

In [9]:
markdown_text = """
# Heading-1
This is some intro to heading

## Sub-heading-1
Machine learning is a branch of artificial intelligence that enables systems to learn and make decisions without being explicitly programmed. It involves the use of algorithms and statistical models to identify patterns in data. Common 
applications include image recognition, natural language processing, and predictive analytics.

World War II was a global conflict that lasted from 1939 to 1945. It involved most of the world's nations and was marked by significant events such as the Holocaust, the atomic bombings of Hiroshima and Nagasaki, and the D-Day invasion.
The war ended with the Allied victory and the establishment of the United Nations.
Hi this is Joe

## Sub-heading-2
Climate change refers to long-term changes in temperature, precipitation, and other atmospheric conditions on Earth. It is primarily driven by human activities, such as burning fossil fuels and deforestation. The effects of climate change include 
rising sea levels, extreme weather events, and loss of biodiversity.

# Heading-2
Python is a high-level, interpreted programming language known for its readability and simplicity. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Popular libraries include NumPy, 
pandas, and TensorFlow, making Python widely used in data science and machine learning.

"""

In [10]:
from langchain.text_splitter import MarkdownTextSplitter

splitter = MarkdownTextSplitter(chunk_size = 100, chunk_overlap=0) 

splitter.create_documents([markdown_text])

[Document(metadata={}, page_content='# Heading-1\nThis is some intro to heading'),
 Document(metadata={}, page_content='## Sub-heading-1'),
 Document(metadata={}, page_content='Machine learning is a branch of artificial intelligence that enables systems to learn and make'),
 Document(metadata={}, page_content='decisions without being explicitly programmed. It involves the use of algorithms and statistical'),
 Document(metadata={}, page_content='models to identify patterns in data. Common'),
 Document(metadata={}, page_content='applications include image recognition, natural language processing, and predictive analytics.'),
 Document(metadata={}, page_content="World War II was a global conflict that lasted from 1939 to 1945. It involved most of the world's"),
 Document(metadata={}, page_content='nations and was marked by significant events such as the Holocaust, the atomic bombings of'),
 Document(metadata={}, page_content='Hiroshima and Nagasaki, and the D-Day invasion.'),
 Document(me

#### **MarkdownHeaderTextSplitter**

In [11]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "H1"),
    ("##", "H2"),
    ("###", "H3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)

In [12]:
md_header_splits = markdown_splitter.split_text(markdown_text)
md_header_splits

[Document(metadata={'H1': 'Heading-1'}, page_content='# Heading-1\nThis is some intro to heading'),
 Document(metadata={'H1': 'Heading-1', 'H2': 'Sub-heading-1'}, page_content="## Sub-heading-1\nMachine learning is a branch of artificial intelligence that enables systems to learn and make decisions without being explicitly programmed. It involves the use of algorithms and statistical models to identify patterns in data. Common\napplications include image recognition, natural language processing, and predictive analytics.  \nWorld War II was a global conflict that lasted from 1939 to 1945. It involved most of the world's nations and was marked by significant events such as the Holocaust, the atomic bombings of Hiroshima and Nagasaki, and the D-Day invasion.\nThe war ended with the Allied victory and the establishment of the United Nations.\nHi this is Joe"),
 Document(metadata={'H1': 'Heading-1', 'H2': 'Sub-heading-2'}, page_content='## Sub-heading-2\nClimate change refers to long-term 

#### **How to constrain chunk size:**

Within each markdown group we can then apply any text splitter we want, such as RecursiveCharacterTextSplitter, which allows for further control of the chunk size.

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

recursive_text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

# Split
splits = recursive_text_splitter.split_documents(md_header_splits)
splits

[Document(metadata={'H1': 'Heading-1'}, page_content='# Heading-1\nThis is some intro to heading'),
 Document(metadata={'H1': 'Heading-1', 'H2': 'Sub-heading-1'}, page_content='## Sub-heading-1'),
 Document(metadata={'H1': 'Heading-1', 'H2': 'Sub-heading-1'}, page_content='Machine learning is a branch of artificial intelligence that enables systems to learn and make'),
 Document(metadata={'H1': 'Heading-1', 'H2': 'Sub-heading-1'}, page_content='to learn and make decisions without being explicitly programmed. It involves the use of algorithms'),
 Document(metadata={'H1': 'Heading-1', 'H2': 'Sub-heading-1'}, page_content='use of algorithms and statistical models to identify patterns in data. Common'),
 Document(metadata={'H1': 'Heading-1', 'H2': 'Sub-heading-1'}, page_content='applications include image recognition, natural language processing, and predictive analytics.'),
 Document(metadata={'H1': 'Heading-1', 'H2': 'Sub-heading-1'}, page_content="World War II was a global conflict that

#### **Pros:**  

- **Preserves Section-Level Context** – Keeps each structural unit (a section, element, or object) intact, so related text stays together.  
- **Best for Highly Structured Texts** – Well-suited for documents with strict formatting, such as legal agreements or medical records.  
- **Simple to Implement** – Splits along markup the document already has, so no complex splitting logic is needed.  

#### **Cons:**  

- **Oversized Sections** – A long section can still exceed model token limits or memory capacity unless it is split further.  
- **Uneven Chunk Sizes** – Sections vary widely in length, so large ones remain expensive to process.  
- **Lacks Precision** – Retrieving a specific detail inside a large section can be more challenging.

### **4. Semantic Chunking**

Semantic chunking breaks text into chunks based on meaning rather than fixed sizes. It ensures that each chunk contains coherent and relevant information by analyzing shifts in the text’s semantic structure. This is typically done by measuring differences in sentence embeddings, which represent the meaning of sentences mathematically. 
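The core idea can be sketched without any model: embed each sentence, measure the distance between neighbouring sentences, and start a new chunk wherever the distance exceeds a percentile threshold. Everything here is an assumption for illustration. In particular, `toy_embed` is a bag-of-words stand-in and not a real sentence embedding, and libraries add refinements on top of this basic scheme.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: word counts. A real system would use a sentence-embedding model."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def percentile_value(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ranked = sorted(values)
    position = (len(ranked) - 1) * q / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ranked[lower] + (ranked[upper] - ranked[lower]) * (position - lower)


def semantic_chunks(sentences: list[str], embed=toy_embed, percentile: float = 70) -> list[str]:
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile_value(distances, percentile)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:         # large semantic jump: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats purr when they are happy.",
    "Cats sleep most of the day.",
    "Stock prices fell sharply today.",
    "Stock markets reacted to the news.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

Lowering the percentile produces more, smaller chunks; raising it keeps only the sharpest topic shifts as breakpoints.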

In [14]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

splitter = SemanticSplitterNodeParser(
    buffer_size=3,
    breakpoint_percentile_threshold=70,
    embed_model=embed_model
)

In [15]:
documents = SimpleDirectoryReader(input_files=["./sample_text.txt"]).load_data()
nodes = splitter.get_nodes_from_documents(documents)

for i, node in enumerate(nodes, 1):
    print(f"\nCHUNK-{i}: {node.text}")


CHUNK-1: The American Revolution (1765–1783) was an ideological and political movement in the Thirteen Colonies which peaked when colonists initiated the ultimately successful war for independence (the American Revolutionary War) against the Kingdom of Great Britain. Leaders of the American Revolution were colonial separatist leaders who originally sought more autonomy as British subjects, but later assembled to support the Revolutionary War, which ended British colonial rule over the colonies, establishing their independence as the United States of America in July 1776.

Discontent with colonial rule began shortly after the defeat of France in the French and Indian War in 1763. Although the colonies had fought and supported the war, Parliament imposed new taxes to compensate for wartime costs and turned control of the colonies' western lands over to the British officials in Montreal. 

CHUNK-2: Representatives from several colonies convened the Stamp Act Congress; its "Declaration of

#### **Pros:**  

- **Maintains Contextual Integrity** – Splits content at natural breaks, ensuring each chunk remains meaningful and self-contained.  
- **Versatile Across Diverse Content Types** – Adapts well to structured documents like research papers and technical manuals, preserving logical divisions.  
- **Enhances Search Relevance** – By keeping semantic coherence, it improves accuracy in information retrieval and query matching.  

#### **Cons:**  

- **Complex Implementation** – Requires sophisticated methods to detect semantic shifts and determine optimal split points.  
- **Higher Processing Overhead** – Analyzing contextual differences can be computationally demanding, especially for long texts.  
- **Sensitive to Parameter Tuning** – The effectiveness of chunking depends on carefully adjusting thresholds, which may vary across domains.  


#### **Using the semchunk library**

In [16]:
import semchunk

chunk_size = 300
chunker = semchunk.chunkerify('gpt-4', chunk_size) 

#   Alternatives to use:
#          semchunk.chunkerify('cl100k_base', chunk_size) or \
#          semchunk.chunkerify(AutoTokenizer.from_pretrained('umarbutler/emubert'), chunk_size) or \
#          semchunk.chunkerify(tiktoken.encoding_for_model('gpt-4'), chunk_size) or \
#          semchunk.chunkerify(lambda text: len(text.split()), chunk_size)

# You can also pass an `offsets` argument to return the offsets of chunks, as well as an `overlap`
# argument to overlap chunks by a ratio (if < 1) or an absolute number of tokens (if >= 1).
chunks, offsets = chunker(document, offsets = True, overlap=0)

for i, chunk in enumerate(chunks, 1):
    print(f"\nCHUNK-{i}: {chunk}")


CHUNK-1: The American Revolution (1765–1783) was an ideological and political movement in the Thirteen Colonies which peaked when colonists initiated the ultimately successful war for independence (the American Revolutionary War) against the Kingdom of Great Britain. Leaders of the American Revolution were colonial separatist leaders who originally sought more autonomy as British subjects, but later assembled to support the Revolutionary War, which ended British colonial rule over the colonies, establishing their independence as the United States of America in July 1776.

CHUNK-2: Discontent with colonial rule began shortly after the defeat of France in the French and Indian War in 1763. Although the colonies had fought and supported the war, Parliament imposed new taxes to compensate for wartime costs and turned control of the colonies' western lands over to the British officials in Montreal. Representatives from several colonies convened the Stamp Act Congress; its "Declaration of R

#### **Using Chonkie library**

In [17]:
from chonkie import SemanticChunker

chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.5,                               # Similarity threshold (0-1) or (1-100) or "auto"
    chunk_size=300,                              # Maximum tokens per chunk
    min_sentences=1                              # Initial sentences per chunk
)

In [18]:
# Chunk the text
chunks = chunker(document)

for i, chunk in enumerate(chunks, 1):
    print(f"\nCHUNK-{i}: {chunk.text}")


CHUNK-1: The American Revolution (1765–1783) was an ideological and political movement in the Thirteen Colonies which peaked when colonists initiated the ultimately successful war for independence (the American Revolutionary War) against the Kingdom of Great Britain. Leaders of the American Revolution were colonial separatist leaders who originally sought more autonomy as British subjects, but later assembled to support the Revolutionary War, which ended British colonial rule over the colonies, establishing their independence as the United States of America in July 1776.

Discontent with colonial rule began shortly after the defeat of France in the French and Indian War in 1763. Although the colonies had fought and supported the war, Parliament imposed new taxes to compensate for wartime costs and turned control of the colonies' western lands over to the British officials in Montreal. Representatives from several colonies convened the Stamp Act Congress; its "Declaration of Rights and

### **5. Token-Based Chunking**

Token-based chunking splits text by a predefined number of tokens (words or subwords) rather than by characters or sentences. Tokens are the units that a model's tokenizer produces, and the chunk size is capped at a set token limit.

It is useful when working with language models, whose input limits are expressed in tokens.
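The windowing logic can be sketched as follows. The regex tokenizer is an assumption standing in for a real model tokenizer (which would typically produce subword tokens), and `token_chunks` is a hypothetical helper. Chunks are sliced from the original string by token offsets, so spacing and punctuation are preserved.

```python
import re

# Stand-in tokenizer: words and single punctuation marks (NOT a model tokenizer).
TOKEN = re.compile(r"\w+|[^\w\s]")


def token_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Group tokens into windows of at most chunk_size tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    spans = [m.span() for m in TOKEN.finditer(text)]  # (start, end) offsets per token
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(spans):
        window = spans[start:start + chunk_size]
        # Slice the original text from the first token's start to the last token's end.
        chunks.append(text[window[0][0]:window[-1][1]])
        if start + chunk_size >= len(spans):
            break
        start += step
    return chunks


text = "one two three four five six seven"
print(token_chunks(text, chunk_size=3))
print(token_chunks(text, chunk_size=3, chunk_overlap=1))
```

Swapping the regex for a model's own tokenizer would make the chunk sizes line up with that model's actual token limits.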

In [19]:
from langchain.text_splitter import TokenTextSplitter

token_text_splitter = TokenTextSplitter(chunk_size=300, chunk_overlap=0)

In [20]:
token_text_splitter.create_documents([document])

[Document(metadata={}, page_content='The American Revolution (1765–1783) was an ideological and political movement in the Thirteen Colonies which peaked when colonists initiated the ultimately successful war for independence (the American Revolutionary War) against the Kingdom of Great Britain. Leaders of the American Revolution were colonial separatist leaders who originally sought more autonomy as British subjects, but later assembled to support the Revolutionary War, which ended British colonial rule over the colonies, establishing their independence as the United States of America in July 1776.\n\nDiscontent with colonial rule began shortly after the defeat of France in the French and Indian War in 1763. Although the colonies had fought and supported the war, Parliament imposed new taxes to compensate for wartime costs and turned control of the colonies\' western lands over to the British officials in Montreal. Representatives from several colonies convened the Stamp Act Congress; 

#### **Pros:**  

- **Optimized for Language Models** – Keeps chunks within token limits, ensuring efficient processing for models like GPT.  
- **Fine-Grained Size Control** – Allows precise adjustment of chunk length to align with model constraints.  
- **Uniform Processing** – Maintains consistent token counts across documents, making large-scale text handling more manageable.  

#### **Cons:**  

- **Context Fragmentation** – May split sentences or paragraphs in unnatural ways, leading to fragmented information.  
- **Ignores Semantic Structure** – Prioritizes token count over meaning, potentially losing key contextual links.  
- **Less Adaptive** – Doesn't consider variations in content density or natural text divisions, affecting coherence.  


### **6. Sentence-Based Chunking**

Sentence-based chunking divides text into full sentences, ensuring that each chunk contains complete thoughts. This method helps preserve the logical flow of information, facilitating more precise text analysis and processing.
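A minimal sketch of the idea: split the text into sentences, then pack whole sentences into a chunk until the next one would exceed the size budget. The regex sentence boundary is a naive assumption (it would mis-split abbreviations such as "Dr."), the budget is counted in characters for simplicity, and `sentence_chunks` is a hypothetical name.

```python
import re


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters where possible."""
    # Naive boundary: whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)   # the next sentence does not fit: close the chunk
            current = sentence       # a single over-long sentence is kept whole
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "First idea. Second idea is longer. Third!"
print(sentence_chunks(text, max_chars=25))
print(sentence_chunks(text, max_chars=40))
```

Because sentences are never cut, a single sentence longer than the budget becomes an oversized chunk, which is the trade-off listed in the cons below.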

In [21]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader

splitter = SentenceSplitter(chunk_size=300, chunk_overlap=0,)

In [22]:
documents = SimpleDirectoryReader(input_files=["./sample_text.txt"]).load_data()

nodes = splitter.get_nodes_from_documents(documents)

for i, node in enumerate(nodes, 1):
    print(f"\nCHUNK-{i}: {node.text}")


CHUNK-1: The American Revolution (1765–1783) was an ideological and political movement in the Thirteen Colonies which peaked when colonists initiated the ultimately successful war for independence (the American Revolutionary War) against the Kingdom of Great Britain. Leaders of the American Revolution were colonial separatist leaders who originally sought more autonomy as British subjects, but later assembled to support the Revolutionary War, which ended British colonial rule over the colonies, establishing their independence as the United States of America in July 1776.

Discontent with colonial rule began shortly after the defeat of France in the French and Indian War in 1763. Although the colonies had fought and supported the war, Parliament imposed new taxes to compensate for wartime costs and turned control of the colonies' western lands over to the British officials in Montreal. Representatives from several colonies convened the Stamp Act Congress; its "Declaration of Rights and

#### **Pros:**  

- **Maintains Context** – Ensures each chunk contains complete sentences, preserving meaning and logical flow.  
- **Improved Readability** – Produces coherent and well-structured chunks, making them easier for both models and users to understand.  
- **Natural Splitting** – Divides text at appropriate points, preventing disruption of ideas or incomplete thoughts.  

#### **Cons:**  

- **Inconsistent Chunk Sizes** – Sentence length variability can lead to uneven chunk distribution, making processing less predictable.  
- **Inefficient for Long Sentences** – Long sentences may exceed token limits or carry excessive information within a single chunk.  
- **Reduced Control over Chunk Size** – Sentence-based chunking may not always align with fixed token constraints, affecting model efficiency.  
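
To make the mechanics and the trade-offs above concrete, here is a minimal pure-Python sketch of sentence-based chunking. It is not how `SentenceSplitter` is implemented: the regex splitter, the helper names `split_sentences` and `chunk_by_sentence`, and the use of word counts as a stand-in for tokens are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_by_sentence(text, max_words=40):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept whole, so it becomes
    an oversized chunk (the 'long sentence' drawback listed above).
    """
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = (
    "Discontent with colonial rule began in 1763. "
    "Parliament imposed new taxes to pay for the war. "
    "The colonies objected. "
    "Representatives convened the Stamp Act Congress."
)
for i, chunk in enumerate(chunk_by_sentence(sample, max_words=12), 1):
    print(f"CHUNK-{i} ({len(chunk.split())} words): {chunk}")
```

With a budget of 12 words, the sample yields three chunks of 7, 12, and 6 words. Every chunk ends on a sentence boundary, but the sizes are uneven, which is exactly the trade-off described in the pros and cons.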


### **7. Agentic Chunking**

Agentic chunking delegates the chunking decision to an LLM: an agent is instructed to split the document into semantically coherent sections.

The example below uses the `llama3.1:latest` model to chunk the same text as above.

In [23]:
from langchain_ollama import OllamaLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Initialize the Ollama LLM
llm = OllamaLLM(model="llama3.1:latest")

# Define the prompt for chunking
chunking_prompt = PromptTemplate(
    input_variables=["document"],
    template="""
    You are an intelligent agent tasked with chunking documents into semantic segments. 
    Break the following document into meaningful sections, ensuring each chunk represents a cohesive idea:
    
    {document}
    
    Return the chunks as a numbered list, without modifying the content.
    """
)

# Define the chain for chunking
chunking_chain = LLMChain(llm=llm, prompt=chunking_prompt)


In [26]:
# Process the document through the chain
chunks = chunking_chain.run(document=document)

# Print the resulting chunks
print("Document Chunks:")
print(chunks)

Document Chunks:
Here is the document broken down into 8 meaningful sections:

1. **The American Revolution: Context and Purpose**
The American Revolution (1765–1783) was an ideological and political movement in the Thirteen Colonies which peaked when colonists initiated the ultimately successful war for independence (the American Revolutionary War) against the Kingdom of Great Britain. Leaders of the American Revolution were colonial separatist leaders who originally sought more autonomy as British subjects, but later assembled to support the Revolutionary War, which ended British colonial rule over the colonies, establishing their independence as the United States of America in July 1776.

2. **Causes of the American Revolution**
Discontent with colonial rule began shortly after the defeat of France in the French and Indian War in 1763. Although the colonies had fought and supported the war, Parliament imposed new taxes to compensate for wartime costs and turned control of the coloni

#### **Pros:**  

- **Optimized for Specific Tasks** – Structures chunks to align with task requirements, improving AI efficiency and decision-making.  
- **Enhanced Focus on Relevant Information** – Ensures the AI processes only necessary data, leading to more precise responses and analysis.  
- **Versatile Application** – Adapts well to various tasks like answering questions, summarization, and task-based automation.  

#### **Cons:**  

- **Requires Detailed Setup** – Defining task-specific chunking rules and agent roles demands careful planning and configuration.  
- **Risk of Over-Specialization** – Breaking content into highly specific chunks may overlook broader insights or patterns.  
- **Potential Loss of Overall Context** – Chunks tailored to individual tasks might miss connections that matter for tasks requiring a view of the whole document, such as summarization.  
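
The chain above returns one string containing a numbered list, and nothing guarantees that the model left the content unmodified as the prompt requested. The sketch below shows one possible way to turn such a response into a Python list and flag chunks that are not verbatim excerpts of the source. The helper names `parse_numbered_chunks` and `find_modified_chunks` are hypothetical, and the parser assumes the response follows the `1. **Title**` layout seen in the output above; a chunk whose own text has a line starting with a number and a period would be split incorrectly.

```python
import re

def parse_numbered_chunks(llm_output):
    """Split a response formatted as '1. ...', '2. ...' into chunk strings."""
    # Each item starts on a line that begins with digits followed by a period.
    parts = re.split(r"^\s*\d+\.\s+", llm_output, flags=re.MULTILINE)
    chunks = []
    for part in parts[1:]:  # parts[0] is any preamble before item 1
        lines = [ln.strip() for ln in part.splitlines() if ln.strip()]
        # Drop a leading bold title such as '**Causes of the American Revolution**'.
        if lines and re.fullmatch(r"\*\*.+\*\*", lines[0]):
            lines = lines[1:]
        if lines:
            chunks.append(" ".join(lines))
    return chunks

def find_modified_chunks(chunks, source):
    """Return chunks that are not verbatim substrings of source (whitespace-normalized)."""
    normalized = " ".join(source.split())
    return [c for c in chunks if " ".join(c.split()) not in normalized]

# Made-up source and response, for illustration only.
source = (
    "The American Revolution was an ideological and political movement. "
    "Discontent with colonial rule began shortly after 1763."
)
response = """Here is the document broken down into 2 meaningful sections:

1. **Context**
The American Revolution was an ideological and political movement.

2. **Causes**
Discontent with colonial rule started shortly after 1763.
"""

parsed = parse_numbered_chunks(response)
print(parsed)
print("Modified by the model:", find_modified_chunks(parsed, source))
```

In this made-up response the second chunk swaps "began" for "started", so it is reported as modified. A check of this kind is cheap to run after every LLM call and catches paraphrasing before the chunks reach an index.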

### **Chunking Using Docling:**

Docling's chunking is rule-based and structure-aware. It parses structured formats like PDF, HTML, Markdown, or Word into a tree of logical elements: paragraphs, headings, lists, tables, etc.

***HierarchicalChunker*** uses the document structure information from the `DoclingDocument` to create one chunk for each detected document element.

***HybridChunker*** applies tokenization-aware refinements on top of the hierarchical chunking to control the chunk size.

***Note:*** Use Docling when you’re chunking structured documents (PDFs, Word, academic papers).
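
The difference between the two chunkers can be illustrated with a toy model. The sketch below is not Docling's implementation: it represents a document as hypothetical `(heading, text)` pairs, uses whitespace word counts as a stand-in tokenizer, and mimics the hybrid idea in two passes, first splitting oversized element chunks and then merging undersized neighbors that share a heading (the behavior that `merge_peers` toggles).

```python
def count_tokens(text):
    """Stand-in tokenizer: whitespace-separated words (an assumption for this sketch)."""
    return len(text.split())

def hierarchical_chunks(elements):
    """One chunk per document element, as (heading, text) pairs."""
    return [(heading, text) for heading, text in elements]

def hybrid_chunks(elements, max_tokens):
    """Split oversized element chunks, then merge small neighbors under the same heading."""
    # Pass 1: split any element whose text exceeds the token budget.
    pieces = []
    for heading, text in hierarchical_chunks(elements):
        words = text.split()
        for start in range(0, len(words), max_tokens):
            pieces.append((heading, " ".join(words[start:start + max_tokens])))
    # Pass 2: merge successive pieces with the same heading while they still fit.
    merged = []
    for heading, text in pieces:
        fits = (
            bool(merged)
            and merged[-1][0] == heading
            and count_tokens(merged[-1][1]) + count_tokens(text) <= max_tokens
        )
        if fits:
            merged[-1] = (heading, merged[-1][1] + " " + text)
        else:
            merged.append((heading, text))
    return merged

# Made-up elements, for illustration only.
elements = [
    ("Indus Valley", "The civilization developed advanced urban planning."),
    ("Indus Valley", "It declined after 1900 BCE."),
    ("Vedic Period", "Indo-Aryan culture spread into northern India and the Vedas "
                     "were transmitted orally for centuries."),
]

for heading, text in hybrid_chunks(elements, max_tokens=12):
    print(f"[{heading}] ({count_tokens(text)} tokens) {text}")
```

Here the two short "Indus Valley" paragraphs are merged into one chunk, while the 14-word "Vedic Period" paragraph is split so that no chunk exceeds the 12-token budget. The purely hierarchical version would instead return the three elements unchanged, regardless of their size.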

In [25]:
from transformers import AutoTokenizer
from docling.document_converter import DocumentConverter

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer

DOC_SOURCE = "./AncientIndia.pdf"
doc = DocumentConverter().convert(source=DOC_SOURCE).document

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 300  

tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=MAX_TOKENS,  # optional, by default derived from `tokenizer` for HF case
)

In [27]:
chunker = HybridChunker(
    tokenizer=tokenizer,
    merge_peers=True,  # optional, defaults to True
)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)

In [28]:
for i, chunk in enumerate(chunks):
    print(f"=== {i} ===")
    txt_tokens = tokenizer.count_tokens(chunk.text)
    print(f"chunk.text ({txt_tokens} tokens):\n{chunk.text!r}")

    ser_txt = chunker.contextualize(chunk=chunk)
    ser_tokens = tokenizer.count_tokens(ser_txt)
    print(f"chunker.contextualize(chunk) ({ser_tokens} tokens):\n{ser_txt!r}")

    print()
    if i == 5:  # show only the first six chunks
        break

=== 0 ===
chunk.text (243 tokens):
"The earliest complex society in South Asia was the Indus Valley Civilization (c. 3300-1300 BCE), a Bronze Age culture centered on the Indus River basin (sites at Harappa and Mohenjo-daro in today's Pakistan). It developed advanced urban planning, brick architecture, and trade networks; at its mature phase (c. 26001900 BCE) cities covered over 100 hectares and featured standardized pottery and seals . The civilization declined after c. 1900 BCE, giving way to smaller farming communities. 1 2\nBy about 1500 BCE, Indo-Aryan (Vedic) culture spread into northern India. The Vedas - the oldest sacred texts of Hinduism - date to this period; they were transmitted orally and only committed to writing by about 1000-500 BCE . The early Vedic society was largely pastoral and tribal; by the late Vedic period (c. 1000500 BCE) it had evolved into settled agriculture, iron use, caste distinctions, and small kingdoms (Mahājanapadas). Religious thought flourished: the

### **Additional References:**

**1. [https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)**

**2. [https://medium.com/@ayoubkirouane3/simple-chunking-strategies-for-rag-applications-part-1-d56903b167c5](https://medium.com/@ayoubkirouane3/simple-chunking-strategies-for-rag-applications-part-1-d56903b167c5)**

**3. [https://docs.chonkie.ai/chunkers/overview](https://docs.chonkie.ai/chunkers/overview)**

**4. [https://pypi.org/project/semchunk/](https://pypi.org/project/semchunk/)**

**5. [https://docling-project.github.io/docling/concepts/chunking/](https://docling-project.github.io/docling/concepts/chunking/)**