### Exploring Chunking Strategies for RAG

1. Character-Based Chunking
2. Recursive Character-Based Chunking 
3. Semantic Chunking
4. Cluster-Based Semantic Chunking
5. LLM-Based Semantic Chunking 

This notebook evaluates these chunking strategies and visualizes the number of chunks generated by each method to determine their impact on retrieval performance.


In [None]:
!pip install git+https://github.com/brandonstarxel/chunking_evaluation.git

In [10]:
!pip install pypdf






In [22]:
from langchain_community.document_loaders import PyPDFLoader
# Main Chunking Functions
from chunking_evaluation.chunking import (
    ClusterSemanticChunker,
    LLMSemanticChunker,
    FixedTokenChunker,
    RecursiveTokenChunker,
    KamradtModifiedChunker
)
# Additional Dependencies
import tiktoken
from chromadb.utils import embedding_functions
from chunking_evaluation.utils import openai_token_count
import os

In [None]:
loader = PyPDFLoader("survey.pdf")
pages = loader.load()
# pages is a list of Document objects, so slice the first page's text rather than the list
print("First 1000 Characters: ", pages[0].page_content[:1000])

First 1000 Characters:  A Survey on the Memory Mechanism of Large
Language Model based Agents
Zeyu Zhang1, Xiaohe Bo1, Chen Ma1, Rui Li1, Xu Chen1, Quanyu Dai2,
Jieming Zhu2, Zhenhua Dong2, Ji-Rong Wen1
1Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
2Huawei Noah’s Ark Lab, China
zeyuzhang@ruc.edu.cn, xu.chen@ruc.edu.cn
Abstract
Large language model (LLM) based agents have recently attracted much attention
from the research and industry communities. Compared with original LLMs, LLM-
based agents are featured in their self-evolving capability, which is the basis for
solving real-world problems that need long-term and complex agent-environment
interactions. The key component to support agent-environment interactions is the
memory of the agents. While previous studies have proposed many promising mem-
ory mechanisms, they are scattered in different 

In [42]:
# Concatenate the text of every page into a single string
document = "".join(page.page_content for page in pages)

In [57]:
def analyze_chunks(chunks, use_tokens=False):
    """Print the chunk count, two sample chunks, and the overlap between two adjacent chunks."""
    # Print the chunks of interest (zero-based indices 40 and 41)
    print("\nNumber of Chunks:", len(chunks))
    print("\n", "="*50, "Chunk 40", "="*50,"\n", chunks[40])
    print("\n", "="*50, "Chunk 41", "="*50,"\n", chunks[41])

    # NOTE: the overlap is measured between chunks 20 and 21,
    # not between the two chunks printed above
    chunk1, chunk2 = chunks[20], chunks[21]

    if use_tokens:
        encoding = tiktoken.get_encoding("cl100k_base")
        tokens1 = encoding.encode(chunk1)
        tokens2 = encoding.encode(chunk2)

        # Find the longest token suffix of chunk1 that is also a prefix of chunk2
        for i in range(min(len(tokens1), len(tokens2)), 0, -1):
            if tokens1[-i:] == tokens2[:i]:
                overlap = encoding.decode(tokens1[-i:])
                print("\n", "="*50, f"\nOverlapping text ({i} tokens):", overlap)
                return
        print("\nNo token overlap found")
    else:
        # Find the longest character suffix of chunk1 that is also a prefix of chunk2
        for i in range(min(len(chunk1), len(chunk2)), 0, -1):
            if chunk1[-i:] == chunk2[:i]:
                print("\n", "="*50, f"\nOverlapping text ({i} chars):", chunk1[-i:])
                return
        print("\nNo character overlap found")

Character Text Splitting

The simplest form of chunking counts a fixed number of characters and splits the text at that count.

![Example Image](./images/fixed.png)

In [58]:
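def chunk_text(document, chunk_size, overlap):
    """Split document into chunk_size-character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # A stride of zero or less would never advance and would loop forever
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    stride = chunk_size - overlap  # Distance between consecutive chunk starts
    current_idx = 0

    while current_idx < len(document):
        # Take up to chunk_size characters starting from current_idx
        chunks.append(document[current_idx:current_idx + chunk_size])
        current_idx += stride  # Move forward by stride

    return chunks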

In [59]:
character_chunks = chunk_text(document, chunk_size=400, overlap=0)

analyze_chunks(character_chunks)


Number of Chunks: 368

 y aim to summarize techniques that can
be leveraged to tackle fundamental problems of LLMs. Specifically, Zhang et al. [8] provide a
comprehensive survey on the methods of supervised fine-tuning, which is a key technique for better
training LLMs. Shen et al. [9], Wang et al.[10] and Liu et al. [11] present surveys on the alignment of
LLMs, which is a key requirement for LLMs to produce outputs con

 sistent with human values. Gao
et al. [12] propose a survey on the retrieval-augmented generation (RAG) capability of LLMs, which
is key to providing LLMs with factual and up-to-date knowledge and removing hallucinations. Qin
et al. [18] summarize the state-of-the-art methods on enabling LLMs to leverage external tools, which
is fundamental for LLMs to expand their capability in domains that requi

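Overlapping text (20 chars): . . . . . . . . . . 

Note: `analyze_chunks` measures overlap between chunks 20 and 21, not between the two chunks printed above. With `overlap=0` the reported match is therefore incidental: it looks like a run of dot leaders (for example, from a table of contents) that happens to end one chunk and begin the next.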


In [60]:
# Chunk size of 800 characters with a 400-character overlap
character_chunks = chunk_text(document, chunk_size=800, overlap=400)
analyze_chunks(character_chunks)



Number of Chunks: 368

 y aim to summarize techniques that can
be leveraged to tackle fundamental problems of LLMs. Specifically, Zhang et al. [8] provide a
comprehensive survey on the methods of supervised fine-tuning, which is a key technique for better
training LLMs. Shen et al. [9], Wang et al.[10] and Liu et al. [11] present surveys on the alignment of
LLMs, which is a key requirement for LLMs to produce outputs consistent with human values. Gao
et al. [12] propose a survey on the retrieval-augmented generation (RAG) capability of LLMs, which
is key to providing LLMs with factual and up-to-date knowledge and removing hallucinations. Qin
et al. [18] summarize the state-of-the-art methods on enabling LLMs to leverage external tools, which
is fundamental for LLMs to expand their capability in domains that requi

 sistent with human values. Gao
et al. [12] propose a survey on the retrieval-augmented generation (RAG) capability of LLMs, which
is key to providing LLMs with factual and 
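
Both character-based runs report 368 chunks. That follows from how `chunk_text` advances: it starts a new chunk every `chunk_size - overlap` characters, and both configurations (400 with no overlap, 800 with 400 overlap) have a stride of 400. The sketch below works through that arithmetic. The helper name `expected_chunk_count` and the document length of 147,000 characters are assumptions for illustration; the notebook does not print the real length, only a chunk count that implies a length between 146,801 and 147,200 characters.

```python
import math

def expected_chunk_count(doc_length, chunk_size, overlap):
    """Number of chunks a sliding window produces over doc_length characters."""
    stride = chunk_size - overlap  # distance between consecutive chunk starts
    return math.ceil(doc_length / stride)

doc_length = 147_000  # hypothetical length, not taken from the notebook

print(expected_chunk_count(doc_length, chunk_size=400, overlap=0))    # 368
print(expected_chunk_count(doc_length, chunk_size=800, overlap=400))  # 368
```

Overlap changes how much text each chunk carries, but the chunk count depends only on the stride.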

Token Text Splitting

But language models, which are usually the consumers of chunked text, don't operate at the character level. Instead they operate on tokens: common sequences of characters that represent frequent words, word pieces, and subwords.

This means character-based splitting isn't ideal because:

1. A 500-character chunk might contain anywhere from roughly 100 to 500 tokens, depending on the text
2. Different languages and character sets encode to very different numbers of tokens
3. We might hit token limits in our LLM without realizing it


Tokenizers like 'cl100k_base' implement Byte-Pair Encoding (BPE), a compression algorithm that builds a vocabulary by iteratively merging the most frequent pairs of bytes or characters. The '100k' refers to its vocabulary size of roughly 100,000 tokens. Vocabulary size sets the balance between compression and representation granularity.
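
To make the merging step concrete, here is a toy sketch of BPE training on characters. It is not how tiktoken is implemented (real tokenizers work on bytes, pre-split the text, and ship a fixed, pre-trained merge table); the function name `bpe_merges` and the sample text are assumptions for illustration only.

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    tokens = list(text)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so merging no longer compresses
        merges.append(left + right)
        merged, i = [], 0
        while i < len(tokens):
            # Replace each occurrence of the chosen pair with one merged symbol
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

sample = "low low low lower lowest"
tokens, merges = bpe_merges(sample, num_merges=3)
print(len(sample), "characters ->", len(tokens), "tokens")
print("learned merges:", merges)
```

Each merge adds one entry to the vocabulary and shortens the token sequence, which is the compression versus granularity trade-off mentioned above.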


In [61]:
def count_tokens(text, model="cl100k_base"):
    """Print and return the token count of text; `model` is a tiktoken encoding name."""
    encoder = tiktoken.get_encoding(model)
    num_tokens = len(encoder.encode(text))
    print(f"Number of tokens: {num_tokens}")
    return num_tokens

In [62]:
fixed_token_chunker = FixedTokenChunker(
    chunk_size=400, 
    chunk_overlap=0,
    encoding_name="cl100k_base"
)

token_chunks = fixed_token_chunker.split_text(document)

analyze_chunks(token_chunks, use_tokens=True)


Number of Chunks: 92

  in a structured database. When writing into the database, similar contents will
be stored in the same group. In SCM [98], it designs a memory controller to decide when to execute
the operations. The controller serves as a guide for the whole memory module. In MemGPT [100],
the memory writing is entirely self-directed. The agents can autonomously update the memory based
on the contexts. In MemoChat [94], the agents summarize each conversation segment by abstracting
the mainly discussed topics and storing them as keys for indexing memory pieces.
Discussion. Previous research indicates that designing the strategy of information extraction during
the memory writing operation is vital [94]. This is because the original information is commonly
lengthy and noisy. Besides, different environments may provide various forms of feedback, and how
to extract and represent the information as memory is also significant for memory writing.
5.3.2 Memory Management
For human bein

In [63]:
fixed_token_chunker = FixedTokenChunker(
    chunk_size=400, 
    chunk_overlap=200,
    encoding_name="cl100k_base"
)

token_overlap_chunks = fixed_token_chunker.split_text(document)

analyze_chunks(token_overlap_chunks, use_tokens=True)


Number of Chunks: 183

  and arrival time in [step 1] and
[step 2]. For task (B), the agent has to choose a movie for Alice at [step 3]; at this time, its memory
contains the arranged time to watch films.
3.3 Broad Definition of the Agent Memory
In a broad sense, the memory of the agent can come from much wider sources, for example,
the information across different trials and the external knowledge beyond the agent-environment
interactions. Formally, given a series of sequential tasks {T1, T2, ...,TK}, for task Tk, the memory
information at step t comes from three sources: (1) the historical information within the same
trial, that is, ξk
t = {ak
1 , ok
1 , ..., ak
t−1, ok
t−1}, where we add superscript k to label the task index.
(2) The historical information across different trials, that is, Ξk = {ξ1, ξ2, ..., ξk−1, ξk′
}, where
ξj (j ∈ {1, ..., k− 1}) represents the trials of task j1, and ξk′
denotes the previously explored trials
for task Tk. (3) External knowledge, which is repres

Recursive Character Text Splitter

But simply counting tokens or characters only gets us so far. When we write, we naturally separate text into paragraphs, sentences, and other logical units. The recursive character text splitter tries to split text intelligently by trying natural separators in order, from coarsest to finest, while respecting a maximum character length.

First, it makes a complete pass over the entire document using paragraph breaks (\n\n), creating an initial set of chunks. Then for any chunks that exceed the size limit, it recursively processes them using progressively smaller separators:

1. First tries to split on paragraph breaks (\n\n)
2. If chunks are still too big, tries line breaks (\n)
3. Then sentence boundaries (., ?, !)
4. Then words ( )
5. Finally, if no other separators work, splits on individual characters ("")

This way, the splitter preserves as much natural structure as possible - only drilling down to smaller separators when necessary to meet the size limit. A chunk that's already small enough stays intact, while larger chunks get progressively broken down until they fit.
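
The procedure described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the `RecursiveTokenChunker` implementation: the function name `recursive_split` is hypothetical, overlap is not handled, and each separator stays attached to the end of the piece it terminates.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ".", "?", "!", " ", "")):
    """Split text into pieces of at most max_len characters, coarsest separator first."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # Separator not present, so move on to the next finer one
        return recursive_split(text, max_len, finer)

    # Keep each separator attached to the part it ends, so no text is lost
    pieces = [part + sep for part in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= max_len:
            current += piece  # greedily pack pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # Piece is still too large: retry it with finer separators
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks

sample = "Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nTiny."
for chunk in recursive_split(sample, max_len=30):
    print(repr(chunk))
```

Because separators are kept, joining the chunks reproduces the input exactly, which makes the sketch easy to sanity-check.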

![Example Image](./images/recursive.png)

In [64]:
recursive_character_chunker = RecursiveTokenChunker(
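    chunk_size=800,  # Maximum chunk length, in characters because length_function=len
    chunk_overlap=0,  # No overlap between consecutive chunks
    length_function=len,  # Measure length in characters with len()
    separators=["\n\n", "\n", ".", "?", "!", " ", ""]  # Tried in order, coarsest first (chosen according to research)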
)

recursive_character_chunks = recursive_character_chunker.split_text(document)
analyze_chunks(recursive_character_chunks, use_tokens=False)


Number of Chunks: 194

 its memory contains the information about the selected attractions and arrival time in [step 1] and
[step 2]. For task (B), the agent has to choose a movie for Alice at [step 3]; at this time, its memory
contains the arranged time to watch films.
3.3 Broad Definition of the Agent Memory
In a broad sense, the memory of the agent can come from much wider sources, for example,
the information across different trials and the external knowledge beyond the agent-environment
interactions. Formally, given a series of sequential tasks {T1, T2, ...,TK}, for task Tk, the memory
information at step t comes from three sources: (1) the historical information within the same
trial, that is, ξk
t = {ak
1 , ok
1 , ..., ak
t−1, ok
t−1}, where we add superscript k to label the task index.

 (2) The historical information across different trials, that is, Ξk = {ξ1, ξ2, ..., ξk−1, ξk′
}, where
ξj (j ∈ {1, ..., k− 1}) represents the trials of task j1, and ξk′
denotes the previously 

In [65]:
recursive_character_chunker = RecursiveTokenChunker(
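    chunk_size=800,  # Maximum chunk length, in characters because length_function=len
    chunk_overlap=400,  # Up to 400 characters shared between consecutive chunks
    length_function=len,  # Measure length in characters with len()
    separators=["\n\n", "\n", ".", "?", "!", " ", ""]  # Tried in order, coarsest first (chosen according to research)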
)

recursive_character_overlap_chunks = recursive_character_chunker.split_text(document)
analyze_chunks(recursive_character_overlap_chunks, use_tokens=False)


Number of Chunks: 373

 Fundamental problems. The surveys in this category aim to summarize techniques that can
be leveraged to tackle fundamental problems of LLMs. Specifically, Zhang et al. [8] provide a
comprehensive survey on the methods of supervised fine-tuning, which is a key technique for better
training LLMs. Shen et al. [9], Wang et al.[10] and Liu et al. [11] present surveys on the alignment of
LLMs, which is a key requirement for LLMs to produce outputs consistent with human values. Gao
et al. [12] propose a survey on the retrieval-augmented generation (RAG) capability of LLMs, which
is key to providing LLMs with factual and up-to-date knowledge and removing hallucinations. Qin
et al. [18] summarize the state-of-the-art methods on enabling LLMs to leverage external tools, which

 LLMs, which is a key requirement for LLMs to produce outputs consistent with human values. Gao
et al. [12] propose a survey on the retrieval-augmented generation (RAG) capability of LLMs, which
is

Semantic Chunker

Greg Kamradt popularized what's known as the semantic chunker in his 5 Levels of Text Splitting notebook. It takes a different approach from fixed character or token chunking: instead of splitting text at predetermined positions or separators, it uses embeddings to find natural semantic boundaries in the text while maintaining consistent chunk sizes.

Chroma modified the algorithm to provide better size control through binary search. The chunker first splits text into small fixed-size pieces (around 50 tokens) using standard recursive splitting with separators. For each piece, it looks at surrounding context (3 segments before and after) to understand the local meaning - this helps maintain semantic coherence across potential split points.
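
The context step can be pictured as a sliding window over the small pieces. The sketch below is an assumption about the general idea, not the chunker's actual code; `add_context` and `buffer_size` are hypothetical names, and pieces are joined with a single space for simplicity.

```python
def add_context(pieces, buffer_size=3):
    """Join each piece with up to buffer_size neighbours on either side."""
    combined = []
    for i in range(len(pieces)):
        start = max(0, i - buffer_size)                 # clamp at the first piece
        end = min(len(pieces), i + buffer_size + 1)     # clamp at the last piece
        combined.append(" ".join(pieces[start:end]))
    return combined

pieces = ["a", "b", "c", "d"]
print(add_context(pieces, buffer_size=1))
# ['a b', 'a b c', 'b c d', 'c d']
```

Each combined string, rather than the bare piece, is what would then be embedded.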

After embedding these contextualized pieces, it calculates cosine distances between consecutive segments. Higher distances suggest natural topic transitions that make good splitting points. But rather than using Kamradt's original fixed percentile approach for choosing split points, Chroma's version uses binary search to find a distance threshold that produces chunks close to the target size.
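
As a small illustration of the distance step, the sketch below computes cosine distances between consecutive vectors in pure Python. The two-dimensional vectors are toy values standing in for real embeddings, the function names are hypothetical, and the code assumes no vector is all zeros.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def consecutive_distances(embeddings):
    """Distance between each embedding and the one that follows it."""
    return [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]

# Toy embeddings: the first two point the same way, the third is orthogonal
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(consecutive_distances(embeddings))  # [0.0, 1.0]
```

The jump from 0.0 to 1.0 is the kind of spike that marks a candidate split point.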

The binary search starts with limits of 0.0 and 1.0. At each step it takes the midpoint as the threshold and counts how many splits that threshold would create. If there are too many splits, it raises the threshold by moving the lower limit up to the midpoint; if there are too few, it lowers the threshold by moving the upper limit down to the midpoint. This continues until it finds a threshold that creates chunks of approximately the desired size.
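
A minimal sketch of that search, assuming a split is made wherever a distance exceeds the threshold and that the size target has already been converted into a desired number of splits. The names `find_split_threshold` and `target_splits`, the stopping tolerance, and the sample distances are all assumptions for illustration; this is not Chroma's code.

```python
def find_split_threshold(distances, target_splits, tol=1e-6, max_iter=50):
    """Binary-search a distance threshold that yields about target_splits splits."""
    low, high = 0.0, 1.0
    threshold = (low + high) / 2
    for _ in range(max_iter):
        threshold = (low + high) / 2
        splits = sum(d > threshold for d in distances)
        if splits > target_splits:
            low = threshold    # too many splits: raise the threshold
        elif splits < target_splits:
            high = threshold   # too few splits: lower the threshold
        else:
            break              # exact match
        if high - low < tol:
            break              # limits have converged; accept the closest threshold
    return threshold

distances = [0.1, 0.8, 0.2, 0.6, 0.15, 0.9]  # hypothetical consecutive distances
threshold = find_split_threshold(distances, target_splits=1)
split_points = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, split_points)  # 0.875 [5]
```

When several distances are tied, no threshold may hit the target exactly, which is why the loop also stops once the limits converge.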


