# Chunking Strategies

It is important to chunk text data into smaller parts before converting them into vector embeddings and storing them in a vector database. Chunking lets us retrieve just the content that is relevant to the query. Irrelevant content in the context wastes tokens and can lead to irrelevant responses.

To chunk our text we will use the `langchain-text-splitters` library, which provides a variety of text splitters for breaking text into smaller chunks.

## Recursive chunking
Here, the text is split on the separators `["\n\n", "\n", " ", ""]`, tried in that order, until each chunk is small enough.

Parameters:
1. `separators`: list of strings to split the text on, tried in order. The default is `["\n\n", "\n", " ", ""]`.
2. `chunk_size`: the maximum size of a chunk, measured in characters by default.
3. `chunk_overlap`: the length of the text shared between consecutive chunks.
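
To make the recursion concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap is left out, and separators are dropped at chunk boundaries for simplicity.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch of recursive splitting (no overlap)."""
    # Stop when the text fits, or when there is nothing finer to split on.
    if len(text) <= chunk_size or not separators:
        return [text] if text else []

    sep, finer = separators[0], separators[1:]
    # The empty separator means "fall back to single characters".
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee", 7))
# ['aaa bbb', 'ccc', 'ddd eee']
```

The first paragraph is too long for a 7-character chunk, so it is split again on spaces, while the second paragraph fits and stays whole.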

In [1]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

with open("./sample.txt", "r") as f:
    text = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=10,
)

chunks = text_splitter.split_text(text)
chunks

['TXT test file\nPurpose: Provide example of this file type\nDocument file type: TXT\nVersion: 1.0',
 'Remark:',
 'Example content:',
 'The names "John Doe" for males, "Jane Doe" or "Jane Roe" for females, or "Jonnie Doe" and "Janie',
 '"Janie Doe" for children, or just "Doe" non-gender-specifically are used as placeholder names for a',
 'for a party whose true identity is unknown or must be withheld in a legal action, case, or',
 'case, or discussion. The names are also used to refer to acorpse or hospital patient whose identity',
 'identity is unknown. This practice is widely used in the United States and Canada, but is rarely',
 'is rarely used in other English-speaking countries including the United Kingdom itself, from where',
 'where the use of "John Doe" in a legal context originates. The names Joe Bloggs or John Smith are',
 'Smith are used in the UK instead, as well as in Australia and New Zealand.',
 'John Doe is sometimes used to refer to a typical male in other contexts as 
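
The words repeated at the chunk boundaries in the output above (for example `for a` and `case, or`) come from `chunk_overlap`. A rough way to picture overlap is a sliding window over the characters. The sketch below assumes plain character windows and a hypothetical function name `sliding_window_chunks`. The library applies overlap to the pieces produced by the separators, so its boundaries will differ.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window always adds new characters.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```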

## HTML based chunking
While recursive chunking is good enough for generic cases, it can fall short when the text has a semantic structure to it. With recursive chunking there is no guarantee that each chunk is "context aware" and not split in the middle of a piece of information. HTML based chunking reduces the risk of chunks holding incomplete information by using the HTML tags as the separators rather than escape sequences.
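
As a rough illustration of the idea, the sketch below groups text under the most recent `h1` to `h3` headers using only Python's standard `html.parser`. The names `HeaderChunker` and `split_html_by_headers` are hypothetical, and this is not how `HTMLHeaderTextSplitter` is implemented. It only shows how header tags can act as chunk boundaries and metadata.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Hypothetical helper: collects text under the currently active headers."""

    HEADERS = {"h1": "Header 1", "h2": "Header 2", "h3": "Header 3"}

    def __init__(self):
        super().__init__()
        self.metadata = {}     # headers that apply to the current section
        self.in_header = None  # header tag currently being read, if any
        self.buffer = []       # body text collected for the current section
        self.chunks = []       # finished (metadata, text) pairs

    def flush(self):
        text = " ".join(self.buffer).strip()
        if text:
            self.chunks.append((dict(self.metadata), text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADERS:
            self.flush()  # a header closes the previous section
            self.in_header = tag
            level = int(tag[1])
            # A new header replaces itself and every deeper header.
            for name_tag, name in self.HEADERS.items():
                if int(name_tag[1]) >= level:
                    self.metadata.pop(name, None)

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.in_header:
            name = self.HEADERS[self.in_header]
            self.metadata[name] = (self.metadata.get(name, "") + " " + data).strip()
        else:
            self.buffer.append(data)


def split_html_by_headers(html):
    parser = HeaderChunker()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the last section
    return parser.chunks


sample = "<h1>Title</h1><p>Intro.</p><h2>Part A</h2><p>First.</p><p>Second.</p>"
for metadata, text in split_html_by_headers(sample):
    print(metadata, text)
```

Each chunk carries the headers it sits under, so a retrieved chunk can be traced back to its section.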

In [2]:
from langchain_text_splitters import HTMLHeaderTextSplitter

with open("./sample.html", "r") as f:
    text = f.read()


headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = html_splitter.split_text(text)
chunks

[Document(metadata={'Header 1': 'The Evolution of Technology in the 21st Century'}, page_content='Introduction'),
 Document(metadata={'Header 1': 'The Evolution of Technology in the 21st Century', 'Header 2': 'Introduction'}, page_content='Technology has grown at an unprecedented rate in the 21st century, transforming every facet of human life. From the ways we communicate to how we work, learn, and entertain ourselves, technological advancements have redefined the modern world.'),
 Document(metadata={'Header 1': 'The Evolution of Technology in the 21st Century'}, page_content='The Rise of the Internet and Mobile Technology'),
 Document(metadata={'Header 1': 'The Evolution of Technology in the 21st Century', 'Header 2': 'The Rise of the Internet and Mobile Technology'}, page_content='The early 2000s witnessed the rise of personal computers and the internet, which revolutionized the way people accessed information. Before this period, gathering information meant visiting libraries, read

By setting `return_each_element` to `True`, the text splitter splits and returns every element in the HTML DOM with its associated headers, instead of only splitting up to the last header given.

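To also bound chunk length with `chunk_size` and `chunk_overlap`, pass the resulting documents through a `RecursiveCharacterTextSplitter` (for example via `split_documents`), since `HTMLHeaderTextSplitter` itself only takes `headers_to_split_on` and `return_each_element`.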

In [3]:
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    return_each_element=True,
)
chunks = html_splitter.split_text(text)
chunks

[Document(metadata={'Header 1': 'The Evolution of Technology in the 21st Century'}, page_content='Introduction'),
 Document(metadata={'Header 1': 'The Evolution of Technology in the 21st Century', 'Header 2': 'Introduction'}, page_content='Technology has grown at an unprecedented rate in the 21st century, transforming every facet of human life. From the ways we communicate to how we work, learn, and entertain ourselves, technological advancements have redefined the modern world.'),
 Document(metadata={'Header 1': 'The Evolution of Technology in the 21st Century'}, page_content='The Rise of the Internet and Mobile Technology'),
 Document(metadata={'Header 1': 'The Evolution of Technology in the 21st Century', 'Header 2': 'The Rise of the Internet and Mobile Technology'}, page_content='The early 2000s witnessed the rise of personal computers and the internet, which revolutionized the way people accessed information. Before this period, gathering information meant visiting libraries, read

## Character Based Chunking

This is a very simple technique where the chunks are created based on a single separator and the given `chunk_size`.
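
The sketch below illustrates the behaviour in plain Python, under the assumption that pieces are merged greedily and never subdivided. The function name `split_on_separator` is hypothetical and overlap is omitted. Because the text is only ever cut at the separator, a single paragraph longer than `chunk_size` is kept whole, which is why the run that follows warns about chunks longer than the specified 500.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece longer than chunk_size is kept whole, not cut further.
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "short one\n\n" + "x" * 30 + "\n\nshort two\n\nshort three"
print([len(c) for c in split_on_separator(text, "\n\n", chunk_size=25)])
# [9, 30, 22]
```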

In [4]:
from langchain_text_splitters import CharacterTextSplitter

with open("./sample.txt", "r") as f:
    text = f.read()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=500,
    chunk_overlap=50
)  # Splits the text wherever there is a new paragraph.

chunks = text_splitter.split_text(text)
chunks


Created a chunk of size 579, which is longer than the specified 500
Created a chunk of size 631, which is longer than the specified 500


['Technology has grown at an unprecedented rate in the 21st century, transforming every facet of human life. From the ways we communicate to how we work, learn, and entertain ourselves, technological advancements have redefined the modern world. This rapid growth, driven by the convergence of the internet, artificial intelligence (AI), and automation, has opened up new possibilities and challenges alike.',
 'The early 2000s witnessed the rise of personal computers and the internet, which revolutionized the way people accessed information. Before this period, gathering information meant visiting libraries, reading books, or consulting experts in person. The internet democratized knowledge by making it accessible to anyone with a connection. Search engines, such as Google, became essential tools for finding information instantly.',
 'As internet infrastructure improved, mobile technology followed. The introduction of smartphones in the late 2000s and early 2010s, particularly with the re

## Markdown Chunking

This is very similar to HTML chunking, except that it is for markdown content. It uses markdown heading markers to split the text into chunks.
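
A minimal sketch of the same idea in plain Python is shown here. The function name `split_markdown_by_headers` is hypothetical, only `#` to `###` headings are handled, and lines starting with `#` inside fenced code blocks are not treated specially, which a real splitter has to handle.

```python
import re

# One to three '#' characters followed by whitespace and the heading text.
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_markdown_by_headers(md_text):
    """Group body lines under the headings that are active when they appear."""
    metadata, buffer, chunks = {}, [], []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append((dict(metadata), text))
        buffer.clear()

    for line in md_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # a heading closes the previous section
            level = len(match.group(1))
            # A new heading replaces itself and every deeper heading.
            for deeper in range(level, 4):
                metadata.pop(f"Header {deeper}", None)
            metadata[f"Header {level}"] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks


sample = "# Title\n\n## Intro\nHello world.\n\n## Next\nLine one.\nLine two.\n"
for metadata, text in split_markdown_by_headers(sample):
    print(metadata, repr(text))
```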

In [5]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

with open("./sample.md", "r") as f:
    md_text = f.read()

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
chunks = markdown_splitter.split_text(md_text)
chunks

[Document(metadata={'Header 1': 'The Evolution of Technology in the 21st Century', 'Header 2': 'Introduction'}, page_content='Technology has grown at an unprecedented rate in the 21st century, transforming every facet of human life. From the ways we communicate to how we work, learn, and entertain ourselves, technological advancements have redefined the modern world.'),
 Document(metadata={'Header 1': 'The Evolution of Technology in the 21st Century', 'Header 2': 'The Rise of the Internet and Mobile Technology'}, page_content='The early 2000s witnessed the rise of personal computers and the internet, which revolutionized the way people accessed information. Before this period, gathering information meant visiting libraries, reading books, or consulting experts in person.  \nAs internet infrastructure improved, mobile technology followed. The introduction of smartphones in the late 2000s, particularly with the release of the iPhone in 2007, marked a turning point.'),
 Document(metadata=

## Semantic Chunking

HTML and markdown based chunking are enough in most cases. However, this kind of chunking assumes that the text under each heading has its own context and semantic meaning, which is not always the case. In such cases, we can use semantic chunking to split the text into chunks based on its semantic meaning.


Here's how it works:

1. The text is broken down into groups of 3 sentences.
2. If two adjacent groups are similar (based on the similarity of their vector embeddings), they are merged together.
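
The steps above can be sketched in plain Python as follows. Everything here is an assumption made for illustration: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, the groups are built as each sentence plus `buffer_size` neighbours on either side, and a fixed distance `threshold` decides where to split. It is not the `SemanticChunker` implementation.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_split(text, threshold, buffer_size=1):
    """Keep adjacent sentence groups together unless their distance is large."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    # One group per sentence: the sentence plus its neighbours.
    groups = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [toy_embed(group) for group in groups]
    chunks, start = [], 0
    for i in range(len(vectors) - 1):
        if cosine_distance(vectors[i], vectors[i + 1]) > threshold:
            chunks.append(" ".join(sentences[start: i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


text = "Cats purr softly. Cats purr loudly. Stocks fell sharply. Stocks fell again."
print(semantic_split(text, threshold=0.5, buffer_size=0))
# ['Cats purr softly. Cats purr loudly.', 'Stocks fell sharply. Stocks fell again.']
```

With `buffer_size=0` each sentence is compared on its own, so the split lands exactly where the topic changes. Larger buffers smooth the distances because neighbouring groups share sentences.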

In [6]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
from dotenv import load_dotenv

load_dotenv()

with open("./sample.txt", "r") as f:
    text = f.read()

text_splitter = SemanticChunker(OpenAIEmbeddings())
chunks = text_splitter.split_text(text)
chunks

['Technology has grown at an unprecedented rate in the 21st century, transforming every facet of human life. From the ways we communicate to how we work, learn, and entertain ourselves, technological advancements have redefined the modern world. This rapid growth, driven by the convergence of the internet, artificial intelligence (AI), and automation, has opened up new possibilities and challenges alike. The early 2000s witnessed the rise of personal computers and the internet, which revolutionized the way people accessed information. Before this period, gathering information meant visiting libraries, reading books, or consulting experts in person. The internet democratized knowledge by making it accessible to anyone with a connection. Search engines, such as Google, became essential tools for finding information instantly. As internet infrastructure improved, mobile technology followed. The introduction of smartphones in the late 2000s and early 2010s, particularly with the release of

### Breakpoints
We can decide when the text should split by setting a breakpoint. First we set `breakpoint_threshold_type` to `percentile`, `standard_deviation`, `interquartile` or `gradient`. Then we set `breakpoint_threshold_amount` to the desired value. If the embedding distance between two consecutive sentence groups is greater than the resulting threshold, the text is split at that point.
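
For the `percentile` type, one way to read this is that the threshold is the given percentile of all the distances, and every distance above it marks a split. The sketch below works through that reading on made-up distances. The function names `percentile` and `breakpoint_indices` are hypothetical, and the linear-interpolation percentile is an assumption about how the value is computed.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def breakpoint_indices(distances, pct):
    """Indices of the gaps whose distance is above the percentile threshold."""
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]


# Made-up distances between consecutive sentence groups.
distances = [0.1, 0.2, 0.9, 0.15, 0.8]
print(breakpoint_indices(distances, 75))  # [2]
print(breakpoint_indices(distances, 50))  # [2, 4]
```

A higher percentile leaves fewer distances above the threshold, so it produces fewer and larger chunks.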

In [10]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), 
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75 
)

chunks = text_splitter.split_text(text)
chunks

['Technology has grown at an unprecedented rate in the 21st century, transforming every facet of human life. From the ways we communicate to how we work, learn, and entertain ourselves, technological advancements have redefined the modern world. This rapid growth, driven by the convergence of the internet, artificial intelligence (AI), and automation, has opened up new possibilities and challenges alike. The early 2000s witnessed the rise of personal computers and the internet, which revolutionized the way people accessed information. Before this period, gathering information meant visiting libraries, reading books, or consulting experts in person. The internet democratized knowledge by making it accessible to anyone with a connection. Search engines, such as Google, became essential tools for finding information instantly.',
 'As internet infrastructure improved, mobile technology followed. The introduction of smartphones in the late 2000s and early 2010s, particularly with the releas