# Unit 2

## Advanced Chunking Techniques for LLMs

Welcome back! In our previous lesson, we explored the basics of chunking and storing text for efficient processing with Large Language Models (LLMs). We learned that breaking down large text is crucial for effective data handling. Today, we'll dive deeper into advanced chunking techniques, focusing on **recursive character-based** and **token-based** methods. These techniques will help you optimize text processing for AI models, making your applications more efficient.

## Introduction to Overlapping Chunks

**Overlapping chunks** ensure that consecutive chunks share some common content. This shared content is crucial for maintaining continuity and context, which helps AI models understand and process text effectively.

Imagine a chatbot answering questions about an article. If chunks don't overlap, the model might lose track of key details, leading to incomplete or incorrect responses. Overlapping chunks solve this by ensuring key phrases appear in consecutive chunks. This is especially important for applications like summarization, search indexing, and document processing, where information needs to flow smoothly.
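
To make the idea concrete, here is a minimal sketch of overlapping chunks using a plain fixed-size character window. The function name `sliding_window_chunks` and the sample sentence are illustrative assumptions, not part of any library; the splitters introduced later in this lesson are more careful about where they cut.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "Overlapping chunks keep shared context between neighbours."
chunks = sliding_window_chunks(sample, chunk_size=20, overlap=5)
for first, second in zip(chunks, chunks[1:]):
    # The tail of each chunk reappears at the head of the next one.
    print(repr(first[-5:]), "->", repr(second[:5]))
```

Each printed pair shows the same five characters twice: once as the end of a chunk and once as the start of its successor.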

## Understanding Recursive Character-Based Chunking

To achieve effective overlapping, we can use **Recursive Character-Based Chunking**. This technique breaks down text into smaller pieces based on characters while preserving context. It respects natural boundaries like sentences and paragraphs, ensuring that chunks are readable and maintain their logical structure. We'll use the `RecursiveCharacterTextSplitter` from the `langchain` library to implement this.

### How it Works

1.  **Define a Maximum Chunk Size**: Set a limit for each chunk (e.g., 100 characters).
2.  **Choose Separators**: These define where text should be split, such as:
      * Paragraphs (`\n\n`)
      * Sentences (`.`)
      * Spaces (` `) as a last resort
3.  **Recursive Splitting**:
      * The text is first split using the **largest separator** (paragraphs).
      * If any chunk is still **too long**, it's further split using the **next separator** (sentences).
      * This continues down to spaces if necessary, to ensure all chunks fit within the limit.
4.  **Apply Overlap**: A defined number of characters from the end of one chunk is included at the beginning of the next to maintain context.
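
The recursive part of these steps can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual implementation: the hypothetical `recursive_split` function leaves out overlap (step 4) and drops the separator wherever it cuts, whereas the real splitter handles both.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text so every chunk has at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits, keep merging
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Chunking splits long text. Each chunk must fit a size limit.\n\n"
    "Separators are tried from coarse to fine. Spaces are the last resort."
)
for chunk in recursive_split(text, chunk_size=40, separators=["\n\n", ".", " "]):
    print(len(chunk), repr(chunk))
```

Both paragraphs are longer than 40 characters, so each one is passed down to the sentence separator, and no chunk exceeds the limit.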

### Step 1: Importing Necessary Libraries

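First, import the splitter class from `langchain`. Newer LangChain releases also ship it in the separate `langchain-text-splitters` package (`from langchain_text_splitters import RecursiveCharacterTextSplitter`), so adjust the import if the one below fails in your environment.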

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
```

### Step 2: Loading the Sample Text

Next, load the sample text from a file.

```python
with open('text.txt', 'r') as file:
    text = file.read()
```

### Step 3: Setting Up the Recursive Character-Based Splitter

Now, configure the `RecursiveCharacterTextSplitter` with your desired parameters.

```python
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20, separators=["\n\n", ".", " "])
```

  * `chunk_size=100`: The maximum size of each chunk, in characters.
  * `chunk_overlap=20`: The maximum number of characters shared between consecutive chunks.
  * `separators=["\n\n", ".", " "]`: The ordered list of separators, tried from first (coarsest) to last (finest).

### Step 4: Splitting the Text

Finally, use the splitter to break the text into chunks and print the results.

```python
recursive_chunks = recursive_splitter.split_text(text)
print("Recursive Character-Based Chunking:")
for i, chunk in enumerate(recursive_chunks):
    print(f"Chunk {i+1}: {chunk}\n")
```

-----

## Exploring Token-Based Chunking

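**Token-based chunking** uses a tokenizer to convert text into tokens, the units (words, subwords, or punctuation) that an LLM actually reads, and then groups those tokens into chunks. Because chunk size is measured in tokens, it maps directly onto a model's token limits. The process involves: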

  * **Tokenization**: The text is converted into tokens, which are meaningful units like words or subwords.
  * **Chunk Size and Overlap**: You define the maximum number of tokens per chunk (`chunk_size`) and the overlap between chunks (`chunk_overlap`) to maintain context.
  * **Splitting**: The tokenized text is split into chunks based on the defined size and overlap.
  * **Encoding**: An encoding, such as OpenAI's `cl100k_base`, is specified to guide the tokenization process.
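
The tokenize, window, and decode steps can be sketched without any external library. The `toy_tokenize` function below is a deliberately crude stand-in (an assumption for illustration only): it treats each word or punctuation mark as one token, whereas a real encoding such as `cl100k_base` uses learned subword units and will produce different counts.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: each word or punctuation mark, with its leading spaces."""
    return re.findall(r"\s*(?:\w+|[^\w\s])", text)


def token_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Group tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        # "Decode" a window by joining its tokens back into text.
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks


sentence = "AI-powered models process data in an efficient way."
for i, chunk in enumerate(token_chunks(sentence, chunk_size=5, chunk_overlap=2), start=1):
    print(f"Chunk {i}: {chunk!r}")
```

Note that a window can begin or end in the middle of a phrase, because the cut positions depend only on token counts.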

### Step 1: Importing Necessary Libraries

Import the libraries needed for token-based chunking.

```python
import tiktoken
from langchain.text_splitter import TokenTextSplitter
```

### Step 2: Understanding Tokens vs. Characters

Let's see how tokens, characters, and words differ.

```python
text_example = "AI-powered models process data in an efficient way."
enc = tiktoken.get_encoding("cl100k_base")
print(f"Character count: {len(text_example)}")
print(f"Word count: {len(text_example.split())}")
print(f"Token count: {len(enc.encode(text_example))}")
```

**Output**:

```
Character count: 51
Word count: 8
Token count: 10
```

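As you can see, the token count matches neither the character count nor the word count, because tokenizers may split words into smaller subword units and treat punctuation as separate tokens. Since an LLM's context window is measured in tokens, sizing chunks in tokens gives a more reliable fit than counting characters or words.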

### Step 3: Setting Up the Token-Based Splitter

Set up the `TokenTextSplitter` using OpenAI's tokenizer.

```python
token_splitter = TokenTextSplitter(encoding_name="cl100k_base", chunk_size=40, chunk_overlap=10)
```

  * `encoding_name="cl100k_base"`: Specifies the encoding for tokenization.
  * `chunk_size=40`: Sets the maximum number of tokens per chunk.
  * `chunk_overlap=10`: Sets the overlap between chunks to 10 tokens.
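
To get a feel for how these two numbers interact, the arithmetic below estimates how many chunks a plain sliding window needs for a given number of tokens. The helper `estimate_chunk_count` is a hypothetical name, and the result is an estimate: the exact count a library returns may differ slightly between versions.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of windows a plain sliding window needs to cover total_tokens."""
    if total_tokens <= chunk_size:
        return 1 if total_tokens > 0 else 0
    step = chunk_size - chunk_overlap  # new tokens contributed by each extra chunk
    return 1 + math.ceil((total_tokens - chunk_size) / step)


print(estimate_chunk_count(1000, chunk_size=40, chunk_overlap=10))  # 33
print(estimate_chunk_count(1000, chunk_size=40, chunk_overlap=0))   # 25
```

With an overlap of 10, each chunk after the first adds only 30 new tokens, so the same 1000-token text needs 33 chunks instead of 25. More overlap means more context per chunk, but also more chunks to store and process.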

### Step 4: Splitting the Text

Finally, split the text into token-based chunks and print the results.

```python
token_chunks = token_splitter.split_text(text)
print("Token-Based Chunking:")
for i, chunk in enumerate(token_chunks):
    print(f"Chunk {i+1}: {chunk}\n")
```

-----

## Comparing Chunking Methods

Both recursive character-based and token-based chunking have their own strengths and weaknesses. Recursive character-based chunking preserves readable structure by cutting at paragraph, sentence, or word boundaries, while token-based chunking gives precise control over chunk length as the model measures it. The right choice depends on your specific NLP task.

### Side-by-Side Comparison

| Method | Strengths | Weaknesses |
| :--- | :--- | :--- |
| **Recursive Character-Based** | Respects sentence structure, flexible | Not always aligned with token limits |
| **Token-Based** | Ensures compatibility with LLM token windows | May split at unnatural sentence points |

Understanding these trade-offs helps you choose the right method for your specific AI application.
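
The "not always aligned with token limits" weakness can be illustrated with two strings of equal character length. The `rough_token_count` function is a crude stand-in chosen for this sketch (one token per word or punctuation mark); a real tokenizer will give different numbers, but the mismatch between characters and tokens is the same in kind.

```python
import re


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


chunk_a = "internationalization"  # 20 characters, one long word
chunk_b = "a, b, c, d, e, f, g."  # 20 characters, many short pieces

for chunk in (chunk_a, chunk_b):
    print(len(chunk), "characters,", rough_token_count(chunk), "rough tokens")
```

Two chunks that look identical to a character-based splitter can therefore consume very different amounts of a model's token budget.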

-----

## Summary and Preparation for Practice

In this lesson, we explored advanced chunking techniques, focusing on **recursive character-based** and **token-based** methods. Overlapping chunks help maintain context, enhancing a model's understanding. Recursive character-based chunking respects natural text boundaries, while token-based chunking aligns with token limits.

As you prepare for practice, familiarize yourself with the `langchain` and `tiktoken` libraries, as you'll apply these concepts in the upcoming exercises.

## Exploring Separator Configurations

Nice work on understanding recursive character-based chunking! Now, let's put that knowledge into practice. Your task is to experiment with the `separators` parameter in `RecursiveCharacterTextSplitter`. For this task, imagine you have loaded a long sample document from `text.txt`.

Modify the code to try three different separator configurations:

  * The default `["\n\n", "\n", " ", ""]`: Prioritizes paragraph and line breaks.
  * `[".", ",", " ", ""]`: Focuses on sentence and clause boundaries.
  * `[" "]`: Splits only on spaces, treating each word as a boundary.

Analyze how these affect the chunking results. This exercise will help you see how different separator hierarchies impact the logical structure and readability of the chunks. Print only the first 3 chunks for each configuration to explore the effects!

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the sample text from text.txt
with open('text.txt', 'r') as file:
    text = file.read()

# Recursive Character-Based Chunking with different separators
separator_configs = [
    # TODO: Add the first separator configuration
    # TODO: Add the second separator configuration
    # TODO: Add the third separator configuration
]

for config in separator_configs:
    # Use the current config for separators
    recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20, separators=config)
    recursive_chunks = recursive_splitter.split_text(text)
    
    print(f"Recursive Character-Based Chunking with separators {config}:")
    for i, chunk in enumerate(recursive_chunks[:3]):
        print(f"Chunk {i+1}: {chunk}\n")
```

A completed version of the starter code above:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the sample text from text.txt
with open('text.txt', 'r') as file:
    text = file.read()

# Recursive Character-Based Chunking with different separators
separator_configs = [
    ["\n\n", "\n", " ", ""],
    [".", ",", " ", ""],
    [" "],
]

for config in separator_configs:
    # Use the current config for separators
    recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20, separators=config)
    recursive_chunks = recursive_splitter.split_text(text)
    
    print(f"Recursive Character-Based Chunking with separators {config}:")
    for i, chunk in enumerate(recursive_chunks[:3]):
        print(f"Chunk {i+1}: {chunk}\n")
```
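
When comparing configurations, a few summary numbers can complement reading the chunks by eye. The helper below is a small sketch (the name `summarize_chunks` and the example list are assumptions); in the exercise you would pass it the `recursive_chunks` list produced for each configuration.

```python
def summarize_chunks(chunks: list[str]) -> dict[str, float]:
    """Return the count and the min, max, and mean length (in characters) of chunks."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


example_chunks = ["First paragraph.", "Second one is longer.", "Third."]
print(summarize_chunks(example_chunks))
```

A configuration that yields many very short chunks, or chunks of wildly different lengths, is often a sign that its separators do not match the document's structure.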

## Exploring Overlap in Text Chunking

You've done well in understanding recursive character-based chunking! Now, let's explore how different `chunk_overlap` values affect context preservation.

  * Use the text from `text.txt`.
  * Adjust the `chunk_overlap` parameter to 0, 30, and 50 characters.
  * Print only the first 3 chunks for each overlap setting.
  * Analyze how each setting impacts the continuity between chunks.
  * Identify specific examples (e.g., repeated sentences or phrases) where increased overlap helps maintain the flow of information or prevents important details from being lost between chunks.

This exercise will help you find the optimal overlap value that balances efficiency and context. Dive in and see how these adjustments can enhance your text processing!

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load text from text.txt
with open('text.txt', 'r') as file:
    text = file.read()

# Function to test different chunk_overlap values
def test_chunk_overlap(overlap):
    print(f"Testing with chunk_overlap={overlap}")
    recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=overlap, separators=["\n\n", ".", " "])
    recursive_chunks = recursive_splitter.split_text(text)
    
    # TODO: Print only the first 3 chunks

# TODO: Test with different overlap values (0, 30, 50)
```

A completed version of the starter code above:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load text from text.txt
with open('text.txt', 'r') as file:
    text = file.read()

# Function to test different chunk_overlap values
def test_chunk_overlap(overlap):
    print(f"Testing with chunk_overlap={overlap}\n{'-'*30}")
    recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=overlap, separators=["\n\n", ".", " "])
    recursive_chunks = recursive_splitter.split_text(text)
    
    # Print only the first 3 chunks
    for i, chunk in enumerate(recursive_chunks[:3]):
        print(f"Chunk {i+1}:\n{chunk}\n")

# Test with different overlap values (0, 30, 50)
test_chunk_overlap(0)
test_chunk_overlap(30)
test_chunk_overlap(50)
```
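
To find the repeated text between two consecutive chunks without scanning them by eye, a small helper can measure it directly. This is a sketch with a hypothetical name, `shared_overlap`; it reports what two chunks actually share, which may be shorter than the configured `chunk_overlap` because the splitter only cuts at separators.

```python
def shared_overlap(previous: str, following: str) -> str:
    """Return the longest suffix of `previous` that is also a prefix of `following`."""
    limit = min(len(previous), len(following))
    for size in range(limit, 0, -1):
        if previous[-size:] == following[:size]:
            return previous[-size:]
    return ""


print(repr(shared_overlap("the quick brown fox", "brown fox jumps")))  # 'brown fox'
```

Applied to neighbouring entries of `recursive_chunks`, it shows how much context each overlap setting really carries from one chunk into the next.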

## Token-Based Chunking Implementation

Well done on mastering recursive character-based chunking! Now, let's shift our focus to token-based chunking. Your task is to implement token-based chunking using `TokenTextSplitter`.

  * Set up the `TokenTextSplitter` with the correct parameters.
  * Use it to split a sample text into token-based chunks.
  * Print the first 4 token-based chunks to see how the text is divided.

This exercise will help you understand how to optimize text processing for AI models. Dive in and see how token-based chunking can enhance your applications!

```python
from langchain.text_splitter import TokenTextSplitter
import tiktoken  # OpenAI's tokenizer library for precise token-based chunking

# Load the sample text from text.txt
with open('text.txt', 'r') as file:
    text = file.read()

# TODO: Implement Token-Based Chunking using OpenAI's tokenizer
# Set up the TokenTextSplitter with the following parameters:
# - encoding_name="cl100k_base" (the encoding name used for tokenization, specific to OpenAI's tokenizer)
# - chunk_size=100 (the maximum number of tokens per chunk)
# - chunk_overlap=10 (the number of tokens that overlap between consecutive chunks)
# TODO: Use the TokenTextSplitter to split the text into token-based chunks

# TODO: Print first 4 token-based chunks

```

A completed version of the starter code above:

```python
from langchain.text_splitter import TokenTextSplitter
import tiktoken  # OpenAI's tokenizer library for precise token-based chunking

# Load the sample text from text.txt
with open('text.txt', 'r') as file:
    text = file.read()

# Implement Token-Based Chunking using OpenAI's tokenizer
# Set up the TokenTextSplitter with the specified parameters
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base", 
    chunk_size=100, 
    chunk_overlap=10
)

# Use the TokenTextSplitter to split the text into token-based chunks
token_chunks = token_splitter.split_text(text)

# Print first 4 token-based chunks
print("Token-Based Chunking:")
for i, chunk in enumerate(token_chunks[:4]):
    print(f"Chunk {i+1}:\n{chunk}\n")
```