# Chunking Strategies

Before converting text into vector embeddings and storing them in a vector database, it is important to split the text into smaller chunks. Chunking lets us retrieve only the content that is relevant to a query; irrelevant content in the context wastes tokens and can lead to irrelevant responses.

To chunk our text we will use the LangChain text splitters library (`langchain_text_splitters`), which provides a variety of splitters for breaking text into smaller chunks.

## Recursive chunking
Here, the text is split on the separators `["\n\n", "\n", " ", ""]`, tried in that order, until each chunk is small enough.

Parameters:
1. `separators`: list of strings to split the text on, tried in order. The default is `["\n\n", "\n", " ", ""]`.
2. `chunk_size`: the maximum size of a chunk, measured in characters by default.
3. `chunk_overlap`: the length of the text shared between consecutive chunks.
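
To make the recursive idea concrete, here is a minimal pure-Python sketch of the strategy. It is an illustration only, not the library's implementation: the function name `recursive_split` is hypothetical, and overlap handling is left out to keep it short. It splits on the coarsest separator first and only falls back to a finer one for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of recursive chunking (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left, or the empty separator: cut at fixed positions.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "alpha beta\n\ngamma delta epsilon\nzeta"
print(recursive_split(demo, chunk_size=12))
```

Paragraph breaks are preferred as split points, then line breaks, then spaces, so a chunk is only cut mid-word when nothing else fits.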

In [1]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

with open("./sample.txt", "r") as f:
    text = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=10,
)

chunks = text_splitter.split_text(text)
chunks

['TXT test file\nPurpose: Provide example of this file type\nDocument file type: TXT\nVersion: 1.0',
 'Remark:',
 'Example content:',
 'The names "John Doe" for males, "Jane Doe" or "Jane Roe" for females, or "Jonnie Doe" and "Janie',
 '"Janie Doe" for children, or just "Doe" non-gender-specifically are used as placeholder names for a',
 'for a party whose true identity is unknown or must be withheld in a legal action, case, or',
 'case, or discussion. The names are also used to refer to acorpse or hospital patient whose identity',
 'identity is unknown. This practice is widely used in the United States and Canada, but is rarely',
 'is rarely used in other English-speaking countries including the United Kingdom itself, from where',
 'where the use of "John Doe" in a legal context originates. The names Joe Bloggs or John Smith are',
 'Smith are used in the UK instead, as well as in Australia and New Zealand.',
 'John Doe is sometimes used to refer to a typical male in other contexts as 

## HTML-based chunking
While recursive chunking might be good enough for generic cases, it might not be good enough when the text data has a semantic structure. With recursive chunking there is no guarantee that each chunk is "context aware" and not split in the middle of a piece of information. HTML-based chunking avoids chunks with incomplete information by using HTML tags as the separators rather than escape sequences.
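
As a rough illustration of tag-based splitting, the sketch below uses only Python's standard `html.parser` module to start a new chunk at every `h1` to `h3` header. The names `HeaderChunker` and `chunk_html_by_headers` are hypothetical and this is not the library's implementation (the library offers its own `HTMLHeaderTextSplitter` for this purpose); it only shows the idea of treating tags as chunk boundaries.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Collect the text of an HTML document into one chunk per header section."""

    HEADER_TAGS = {"h1", "h2", "h3"}  # assumption: these tags mark section starts

    def __init__(self):
        super().__init__()
        self.chunks = []         # finished chunks: {"header": ..., "text": ...}
        self._header = ""        # header text of the section being collected
        self._parts = []         # body text fragments of the current section
        self._in_header = False

    def _flush(self):
        # Join the fragments and normalise whitespace before storing the chunk.
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.chunks.append({"header": self._header.strip(), "text": text})
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADER_TAGS:
            self._flush()        # a new header closes the previous section
            self._header = ""
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.HEADER_TAGS:
            self._in_header = False

    def handle_data(self, data):
        if self._in_header:
            self._header += data
        else:
            self._parts.append(data)

    def close(self):
        super().close()
        self._flush()            # store the last section


def chunk_html_by_headers(html):
    parser = HeaderChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    "<h1>Intro</h1><p>First paragraph.</p><p>Second paragraph.</p>"
    "<h2>Details</h2><p>More text.</p>"
)
for chunk in chunk_html_by_headers(sample):
    print(chunk)
```

Each chunk keeps its header alongside the text, so a retrieved chunk carries the context of the section it came from. Sections that are still too long could be passed through a recursive splitter afterwards.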