[Chunking](https://community.databricks.com/t5/technical-blog/the-ultimate-guide-to-chunking-strategies-for-rag-applications/ba-p/113089)
- Fixed-Size Chunking (word, character, or token counts, with overlaps)
- Semantic Chunking (break at paragraphs or sentences)
- Recursive Chunking
- Adaptive Chunking
- Context-Enriched Chunking
- AI-Driven Dynamic Chunking
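
As a rough illustration of the first strategy in the list, the sketch below cuts a text into fixed windows of words that overlap by a few words. The function name `fixed_size_chunks` and its parameters are hypothetical and not part of any library; it is a minimal pure-Python sketch that assumes whitespace-separated words are an acceptable unit.

```python
def fixed_size_chunks(text: str, chunk_size: int = 5, overlap: int = 2) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = fixed_size_chunks(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - overlap` words after the previous one, so the last word of one chunk is repeated as the first word of the next.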

In [77]:
%pip install -qU langchain-text-splitters transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [78]:
document = None
with open("./datasets/dsm.md", 'r', encoding='utf-8') as f:
    document = f.read()

## Fixed-Size Chunking

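This is the simplest method. `CharacterTextSplitter` splits on a single separator string, which defaults to "\n\n", and by default measures chunk length in characters. The cell below overrides both defaults: it splits on "\n" and measures length in BERT tokens through `length_function`.

1. How the text is split: by a single separator string.
2. How the chunk size is measured: by number of characters (default), or by whatever `length_function` returns.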

To obtain the string content directly, use .split_text.
To create LangChain Document objects (e.g., for use in downstream tasks), use .create_documents.

https://python.langchain.com/docs/how_to/character_text_splitter/
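
To make the split-then-measure behaviour concrete, here is a simplified pure-Python sketch of the idea: split on one separator, then greedily merge neighbouring pieces while they still fit. It is not LangChain's implementation. The name `split_and_merge` is hypothetical, and overlap handling is left out on purpose. It also illustrates why a splitter that only cuts at one separator can emit chunks longer than `chunk_size`: a piece that contains no separator is never cut further.

```python
def split_and_merge(text, separator="\n\n", chunk_size=20, length_function=len):
    """Split on `separator`, then greedily merge pieces up to `chunk_size`."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and length_function(candidate) > chunk_size:
            # The next piece does not fit: close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Measured in characters: the 30-character piece cannot be cut, so it is
# emitted whole even though it exceeds chunk_size=12.
text = "aaaa\n\nbbbb\n\n" + "c" * 30 + "\n\ndddd"
char_chunks = split_and_merge(text, chunk_size=12)
print([len(c) for c in char_chunks])

# Measured in words: the same logic with a different length_function.
def count_words(s):
    return len(s.split())

word_chunks = split_and_merge(
    "one two\n\nthree four\n\nfive six", chunk_size=4, length_function=count_words
)
print(word_chunks)
```

Swapping `length_function` changes only the unit of `chunk_size`; the split points are still decided by the separator.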

In [None]:
from transformers import AutoTokenizer
from langchain_text_splitters import CharacterTextSplitter

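# Load the BERT tokenizer so chunk length is measured in tokens, not characters
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def count_tokens(text):
    # encode() also counts the special [CLS] and [SEP] tokens it adds
    return len(tokenizer.encode(text))

text_splitter = CharacterTextSplitter(
    separator="\n",  # split on single newlines instead of the default "\n\n"
    chunk_size=400,  # measured in tokens because of length_function below
    chunk_overlap=50,
    length_function=count_tokens,  # the default is len (characters)
)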
texts = text_splitter.create_documents([document])

Token indices sequence length is longer than the specified maximum sequence length for this model (653 > 512). Running this sequence through the model will result in indexing errors
Created a chunk of size 653, which is longer than the specified 400
Created a chunk of size 803, which is longer than the specified 400
Created a chunk of size 416, which is longer than the specified 400
Created a chunk of size 408, which is longer than the specified 400
Created a chunk of size 478, which is longer than the specified 400
Created a chunk of size 453, which is longer than the specified 400
Created a chunk of size 472, which is longer than the specified 400
Created a chunk of size 568, which is longer than the specified 400
Created a chunk of size 635, which is longer than the specified 400
Created a chunk of size 403, which is longer than the specified 400
Created a chunk of size 453, which is longer than the specified 400
Created a chunk of size 458, which is longer than the specified 400
Cr

In [80]:
import os
import json

directory = "./sections"
os.makedirs("./chunks", exist_ok=True)  # make sure the output directory exists

for filename in os.listdir(directory):
    if not filename.endswith(".json"):
        continue
    with open(os.path.join(directory, filename), encoding="utf-8") as f:
        jsn = json.load(f)

    # Chunk the section text with the splitter configured above
    texts = text_splitter.create_documents([jsn["section"]])
    # Replace newlines with spaces so words on adjacent lines stay separated
    jsn["chunks"] = [t.page_content.replace("\n", " ") for t in texts]

    with open(f'./chunks/{jsn["id"]}.json', "w", encoding="utf-8") as wf:
        json.dump(jsn, wf)


Created a chunk of size 416, which is longer than the specified 400
Created a chunk of size 401, which is longer than the specified 400
Created a chunk of size 473, which is longer than the specified 400
Created a chunk of size 453, which is longer than the specified 400
Created a chunk of size 458, which is longer than the specified 400
Created a chunk of size 408, which is longer than the specified 400
Created a chunk of size 478, which is longer than the specified 400
Created a chunk of size 437, which is longer than the specified 400
Created a chunk of size 403, which is longer than the specified 400
Created a chunk of size 453, which is longer than the specified 400


## Semantic Chunking

Splitting at paragraph and sentence boundaries usually keeps related text together better than a single-separator split, but for simplicity this notebook keeps using the `CharacterTextSplitter` configured above.

`RecursiveCharacterTextSplitter` is the splitter LangChain recommends for generic text. It is parameterized by a list of separator strings and tries them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of keeping paragraphs (then sentences, then words) together as long as possible, since those tend to be the most strongly related pieces of text.

1. How the text is split: by list of characters.
2. How the chunk size is measured: by number of characters.

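The example usage below is left commented out, since the rest of the notebook uses the `CharacterTextSplitter` from the previous section.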

To obtain the string content directly, use .split_text.

To create LangChain Document objects (e.g., for use in downstream tasks), use .create_documents.

https://python.langchain.com/docs/how_to/recursive_text_splitter/
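
The "try the separators in order" idea described above can be sketched in plain Python. This is a simplified illustration, not LangChain's code. The names `recursive_split` and `merge_pieces` are hypothetical, chunk size is measured in characters, and overlap is omitted.

```python
def merge_pieces(pieces, sep, chunk_size):
    """Greedily join small pieces with `sep` without exceeding `chunk_size`."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=20):
    """Split on the coarsest separator first; recurse into oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, small = [], []
    for part in text.split(sep):
        if not part:
            continue
        if len(part) <= chunk_size:
            small.append(part)
        else:
            # Flush the small parts collected so far, then split the big one
            chunks.extend(merge_pieces(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(part, finer, chunk_size) if finer else [part])
    chunks.extend(merge_pieces(small, sep, chunk_size))
    return chunks


text = "Short para.\n\nThis second paragraph is much longer than the limit.\n\nEnd."
chunks = recursive_split(text, chunk_size=20)
for chunk in chunks:
    print(repr(chunk))
```

Short paragraphs survive intact, while the long paragraph falls through to the space separator and is reassembled word by word up to the limit.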

In [None]:
# from langchain_text_splitters import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter(
#     # separators=["\n\n", "\n", ". ", " ", ""],
#     chunk_size=1000,
#     chunk_overlap=20,
#     length_function=len,
#     is_separator_regex=False,
# )

# texts = text_splitter.create_documents([document])
# print(len(texts))
# print(texts[0])
# print(texts[1])