# Chunking Strategies for RAG: Practical Implementation with LangChain and LlamaIndex

This notebook explores text chunking strategies for building effective Retrieval-Augmented Generation (RAG) systems. We will implement and compare several methods using LangChain and LlamaIndex.


This notebook is written to run in Google Colab. Please add your OpenAI API key to Colab secrets (the code reads a secret named `OPENAI_API_KEY`) for the more advanced chunking implementations.

### 1. Setup

In [None]:
!pip install langchain langchain-openai langchain-community langchain-text-splitters langchain_experimental llama-index openai chromadb ragas datasets
# Optionally: sentence-transformers, tqdm, rich, matplotlib if needed for analysis
# !pip install sentence-transformers tqdm rich matplotlib

### 2. Load Sample Text Document

In [2]:
# Replace this with your sample text document
sample_text = """
Large language models (LLMs) are a type of artificial intelligence algorithm that uses deep learning techniques and massively large data sets to understand, summarize, generate and predict new content. The "large" in LLM refers to the vast amount of data these models are trained on — often terabytes of text and data — and the number of parameters they use to process language, which can be billions.

LLMs are built on transformer architectures, a neural network architecture introduced in 2017 that processes input data in a way that allows for parallelization, making training on massive data sets feasible. This architecture uses self-attention mechanisms to weigh the importance of different words in the input sequence when processing a specific word.

There are different types of LLMs, categorized based on their architecture and training objectives:

*   **Generative LLMs:** These models are designed to generate new content, such as text, code, or images. Examples include GPT-3, LaMDA, and Stable Diffusion.
*   **Discriminative LLMs:** These models are used for classification and analysis tasks, such as sentiment analysis or named entity recognition.
*   **Hybrid LLMs:** These models combine aspects of both generative and discriminative models.

Chunking is a crucial step in preparing text data for LLMs, especially for tasks like RAG. It involves breaking down a large text document into smaller, manageable pieces or "chunks." The size and method of chunking can significantly impact the performance of LLM applications.
"""

In [50]:
sample_text2 = """
Chunking-the process of splitting documents into smaller, manageable pieces-directly impacts the quality of information retrieval, the relevance of generated responses, and ultimately, the user experience of your AI application. Most developers choose one chunking method and stick with it, but recent research suggests that the choice of chunking strategy can significantly impact performance.

In this comprehensive analysis, we’ll explore various chunking strategies based on groundbreaking research from ChromaDB and other leading organizations in the field. We’ll examine each method’s strengths, weaknesses, and optimal use cases to help you make informed decisions for your RAG applications.

Why Chunking Matters
Before diving into specific strategies, it’s essential to understand why chunking is so crucial for RAG systems:
"""

In [17]:
def print_chunks(chunks):
    """
    Prints each chunk in a list of chunks.

    Args:
        chunks (list): A list of text chunks.
    """
    print(f"Number of chunks: {len(chunks)}")
    print("\nChunks:")
    for i, chunk in enumerate(chunks):
        print(f"--- Chunk {i+1} ---")
        print(chunk)

### 3.1 Fixed-Size Chunking (LangChain)

Fixed-size chunking is one of the most straightforward methods. It involves splitting the text into chunks of a predetermined size, often with a specified overlap between consecutive chunks. While simple, it can sometimes split sentences or paragraphs awkwardly, potentially losing context. It serves as a useful baseline to understand the basic concept of text splitting.
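
To make the role of chunk size and overlap concrete, here is a minimal pure-Python sketch of a sliding window over characters. The helper name `sliding_window_chunks` and the demo string are hypothetical, and the sketch cuts at raw character offsets, so it is only an illustration of the concept and not how any particular library splitter works internally.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return windows


# Hypothetical demo input: the alphabet makes the overlap easy to see.
demo_windows = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo_windows)
```

Each window repeats the last three characters of the previous one, which is the same trade-off the `chunk_overlap` parameter controls: more overlap preserves more context across boundaries at the cost of storing redundant text.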

We will use LangChain's `CharacterTextSplitter` for this.

In [19]:
from langchain_text_splitters import CharacterTextSplitter

# Define chunk size and overlap
chunk_size = 200
chunk_overlap = 20

# Create the splitter
text_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

# Split the text
fixed_size_chunks = text_splitter.split_text(sample_text)
print_chunks(fixed_size_chunks)

Number of chunks: 9

Chunks:
--- Chunk 1 ---
Large language models (LLMs) are a type of artificial intelligence algorithm that uses deep learning techniques and massively large data sets to understand, summarize, generate and predict new
--- Chunk 2 ---
and predict new content. The "large" in LLM refers to the vast amount of data these models are trained on — often terabytes of text and data — and the number of parameters they use to process
--- Chunk 3 ---
they use to process language, which can be billions.

LLMs are built on transformer architectures, a neural network architecture introduced in 2017 that processes input data in a way that allows for
--- Chunk 4 ---
way that allows for parallelization, making training on massive data sets feasible. This architecture uses self-attention mechanisms to weigh the importance of different words in the input sequence
--- Chunk 5 ---
the input sequence when processing a specific word.

There are different types of LLMs, categorized based on 

### 3.2 Recursive Character Splitting: A Smarter Baseline
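
Recursive character splitting tries a list of separators in order, from coarse (paragraph breaks) to fine (single characters), and only falls back to a finer separator when a piece is still larger than the chunk size. The following is a simplified sketch of that idea under stated assumptions: the helper `recursive_split` and the demo text are hypothetical, chunk overlap is omitted, and it is not LangChain's implementation.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # No natural boundary left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces_out, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            pieces_out.append(current)  # the current chunk is full, emit it
        if len(piece) > chunk_size:
            # This piece alone is too large: retry it with the finer separators.
            pieces_out.extend(recursive_split(piece, finer, chunk_size))
            current = ""
        else:
            current = piece
    if current:
        pieces_out.append(current)
    return pieces_out


demo_text = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer than the limit allows."
)
demo_recursive = recursive_split(demo_text, ["\n\n", "\n", " ", ""], chunk_size=40)
for piece in demo_recursive:
    print(repr(piece))
```

The short paragraph survives intact, while the long one is broken at word boundaries rather than mid-word, which is why this approach tends to be a better default than fixed-size splitting.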


In [20]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20,
    separators=["\n\n", "\n", " ", ""]
)

# Create chunks
chunks = text_splitter.create_documents([sample_text])
print_chunks(chunks)

Number of chunks: 14

Chunks:
--- Chunk 1 ---
page_content='Large language models (LLMs) are a type of artificial intelligence algorithm that uses deep learning techniques and massively large data sets to'
--- Chunk 2 ---
page_content='large data sets to understand, summarize, generate and predict new content. The "large" in LLM refers to the vast amount of data these models are'
--- Chunk 3 ---
page_content='these models are trained on — often terabytes of text and data — and the number of parameters they use to process language, which can be billions.'
--- Chunk 4 ---
page_content='LLMs are built on transformer architectures, a neural network architecture introduced in 2017 that processes input data in a way that allows for'
--- Chunk 5 ---
page_content='way that allows for parallelization, making training on massive data sets feasible. This architecture uses self-attention mechanisms to weigh the'
--- Chunk 6 ---
page_content='to weigh the importance of different words in the input 

### 3.3 Content-Aware Chunking: Leveraging Document Structure


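Documents with explicit structure, such as Markdown, can be chunked along their headers so that each chunk stays within a single section. LangChain's `MarkdownHeaderTextSplitter` does this: it splits a Markdown document on a specified set of headers and records the enclosing headers in each chunk's metadata.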

For example, if we want to split this markdown:


In [21]:
md = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'


In [26]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

# We can specify the headers to split on:

headers = [("#", "Header 1"),("##", "Header 2"),("###","Header 3")]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
md_header_splits = markdown_splitter.split_text(md)

md_header_splits


[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='Hi this is Jim\nHi this is Joe'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Hi this is Molly')]

By default, `MarkdownHeaderTextSplitter` strips the headers it splits on from each chunk's content. This can be disabled by setting `strip_headers=False`.

In [28]:
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers,strip_headers=False)
md_header_splits = markdown_splitter.split_text(md)
md_header_splits

[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='# Foo  \n## Bar  \nHi this is Jim\nHi this is Joe'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='## Baz  \nHi this is Molly')]
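
The header-to-metadata behaviour can be sketched in plain Python by tracking the current header path while walking the lines. This is a rough illustration only: `split_by_headers` and `demo_md` are hypothetical names, and the sketch ignores cases a real splitter has to handle, such as `#` characters inside fenced code blocks.

```python
def split_by_headers(markdown: str, headers: list[tuple[str, str]]) -> list[dict]:
    """Group lines under their enclosing headers and attach the header path as metadata."""
    sections, body = [], []
    path = {}  # header level (number of '#') -> (metadata label, header title)

    def flush():
        # Emit the text collected so far together with the current header path.
        text = "\n".join(body).strip()
        if text:
            metadata = {label: title for _, (label, title) in sorted(path.items())}
            sections.append({"metadata": metadata, "content": text})
        body.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        for marker, label in headers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes every header at the same or a deeper level.
                for closed in [lvl for lvl in path if lvl >= level]:
                    del path[closed]
                path[level] = (label, line[len(marker):].strip())
                break
        else:
            body.append(line)  # not a header: part of the current section
    flush()
    return sections


demo_md = "# Foo\n\n## Bar\n\nHi this is Jim\nHi this is Joe\n\n## Baz\n\nHi this is Molly"
demo_sections = split_by_headers(
    demo_md, [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)
for section in demo_sections:
    print(section)
```

Keeping the header path as metadata means a retrieved chunk still carries its position in the document, which can be used for filtering or for adding context to the prompt.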

### 4.1 Semantic Chunking
Running this section requires an OpenAI API key.

Both LangChain and LlamaIndex, two widely used frameworks for building LLM applications, offer implementations of semantic chunking.
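
Semantic chunkers of this kind generally embed consecutive sentences and start a new chunk where the distance between neighbours is unusually large, for example above a chosen percentile of all neighbour distances. The sketch below illustrates that percentile-breakpoint idea without calling any API. The function names, the sentences, and the two-dimensional vectors are all hypothetical stand-ins for real embeddings, and library implementations differ in details such as how sentences are grouped before embedding.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, embeddings, breakpoint_percentile=95):
    """Start a new chunk wherever adjacent sentences are unusually far apart."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    if not distances:
        return [" ".join(sentences)] if sentences else []
    threshold = percentile(distances, breakpoint_percentile)

    grouped, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large jump in meaning: close the current chunk
            grouped.append(" ".join(current))
            current = []
        current.append(sentence)
    grouped.append(" ".join(current))
    return grouped


# Toy data (assumption): two sentences about attention, two about chunking.
toy_sentences = [
    "Transformers use self-attention.",
    "Attention weighs every token.",
    "Chunking splits documents.",
    "Overlap preserves context.",
]
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_semantic = semantic_chunks(toy_sentences, toy_vectors, breakpoint_percentile=95)
print(toy_semantic)
```

Lowering the percentile produces more, smaller chunks, because more neighbour distances exceed the threshold.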


In [38]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY').strip()

In [51]:
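# Baseline for comparison: when the notebook is run top to bottom, `text_splitter`
# is still the recursive splitter from 3.2 here.
# Note: `docs` is reassigned by the semantic chunker in the next cell.
docs = text_splitter.create_documents([sample_text, sample_text2])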

### LangChain Semantic Chunking

In [52]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Initialize the splitter with an embedding model and a breakpoint threshold type.
# OPENAI_API_KEY was read from Colab secrets (and stripped of whitespace) above.
text_splitter = SemanticChunker(
    OpenAIEmbeddings(api_key=OPENAI_API_KEY),
    breakpoint_threshold_type="percentile",  # Other options: "standard_deviation", "interquartile", "gradient"
)

# Create semantically coherent document chunks from both sample texts
docs = text_splitter.create_documents([sample_text, sample_text2])
print(f"Number of semantic chunks: {len(docs)}")

Number of semantic chunks: 4


In [53]:
docs

[Document(metadata={}, page_content='\nLarge language models (LLMs) are a type of artificial intelligence algorithm that uses deep learning techniques and massively large data sets to understand, summarize, generate and predict new content. The "large" in LLM refers to the vast amount of data these models are trained on — often terabytes of text and data — and the number of parameters they use to process language, which can be billions. LLMs are built on transformer architectures, a neural network architecture introduced in 2017 that processes input data in a way that allows for parallelization, making training on massive data sets feasible. This architecture uses self-attention mechanisms to weigh the importance of different words in the input sequence when processing a specific word. There are different types of LLMs, categorized based on their architecture and training objectives:\n\n*   **Generative LLMs:** These models are designed to generate new content, such as text, code, or i

### LlamaIndex Semantic Chunking

In [60]:
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Document

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding(api_key=OPENAI_API_KEY)
)

text_list = [sample_text, sample_text2]
documents = [Document(text=t) for t in text_list]
nodes = splitter.get_nodes_from_documents(documents,show_progress=True)
print(f"Number of semantic nodes: {len(nodes)}")

Parsing nodes:   0%|          | 0/2 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/11 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/5 [00:00<?, ?it/s]

Number of semantic nodes: 4


### 4.2 Hierarchical & Relational Chunking
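
Hierarchical chunking keeps two granularities of the same text: small child chunks are indexed for precise matching, and each child points back to a larger parent chunk that is returned to the LLM for fuller context. The sketch below shows that child-to-parent lookup in plain Python. All names and documents are hypothetical, and word overlap is used as a toy relevance score in place of embedding similarity.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(documents: list[str], parent_size: int, child_size: int):
    """Return (parents, children); each child records the id of its parent chunk."""
    parents, children = {}, []
    for doc in documents:
        for parent_text in split_fixed(doc, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent_text
            for child_text in split_fixed(parent_text, child_size):
                children.append((child_text, parent_id))
    return parents, children


def retrieve_parents(query: str, parents: dict, children: list, k: int = 1) -> list[str]:
    """Score the small child chunks, then return the larger parents they belong to."""
    query_words = set(query.lower().split())

    def score(child):
        # Toy relevance: number of query words that appear in the child chunk.
        return len(query_words & set(child[0].lower().split()))

    ranked = sorted(children, key=score, reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:  # several children may share one parent
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]


toy_docs = [
    "Transformers rely on self-attention. Training is parallel across tokens.",
    "Chunking splits documents into pieces. Overlap keeps context between pieces.",
]
toy_parents, toy_children = build_index(toy_docs, parent_size=200, child_size=40)
toy_result = retrieve_parents("What keeps context between pieces?", toy_parents, toy_children)
print(toy_result)
```

The best-matching child is only a 40-character fragment, but the function returns the whole parent it came from. With the LangChain retriever built in the next cell, a call such as `retriever.invoke("What is chunking?")` should behave analogously, returning parent-sized documents, while `vectorstore.similarity_search(...)` on the same query would return the smaller child chunks.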

In [61]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# This text splitter is used to create the parent documents
# It should create larger chunks than the child splitter
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# The vector store to use to index the child chunks
vectorstore = Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings(api_key=OPENAI_API_KEY))

# The storage layer for the parent documents
store = InMemoryStore()

# Initialize the retriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents to the retriever, which handles the splitting and storage
retriever.add_documents(docs)

In [62]:
retriever

ParentDocumentRetriever(vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x78ce843b0bd0>, docstore=<langchain_core.stores.InMemoryStore object at 0x78ce87a9bc10>, search_kwargs={}, child_splitter=<langchain_text_splitters.character.RecursiveCharacterTextSplitter object at 0x78ce883a7ed0>, parent_splitter=<langchain_text_splitters.character.RecursiveCharacterTextSplitter object at 0x78ce87abfc50>)