# Unit 1

## Chunking and Storing Text for Efficient LLM Processing

Welcome to the first lesson of our course on Chunking and Storing Text for Efficient LLM Processing.

In this lesson, we will explore the concept of **chunking**, which is crucial for efficient LLM processing. By the end of this lesson, you will understand how to break down large texts into manageable pieces for processing. This foundational skill will be essential as you progress through the course and tackle more complex data processing tasks.

## Understanding Text Chunking

**Text chunking** is the process of dividing a large text into smaller, more manageable pieces, or "chunks." This is particularly important for LLMs, which have limitations on the amount of text they can process at once. By chunking text, we ensure that each piece is small enough to be processed efficiently by the model while retaining coherence and meaning.

## Why Chunking is Essential for LLMs

LLMs have token limits that dictate how much text they can process at once. If the input exceeds the limit, it may be truncated or rejected, and important information is lost. Chunking avoids this by breaking text into meaningful sections that can be processed independently or recombined when necessary.

  - **GPT-4:** Can process up to **8,192 tokens** in the standard version, with an extended variant going up to **32,768 tokens**. Text must be split into chunks that fit within these limits.
  - **BERT:** Has a strict **512-token limit**, making chunking necessary when processing longer documents.
  - **T5:** Typically used with a **512-token** input length (e.g., T5-Base), although the exact limit depends on the version and configuration. Chunking ensures input remains within this limit.
  - **Claude:** Depending on the version, it can process anywhere from **100,000 to 1,000,000 tokens**, allowing for much larger text inputs but still benefiting from structured chunking.

By understanding these limits, we can implement chunking strategies that align with the capabilities of different models.
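
As a rough illustration of how a limit translates into a chunk count, the sketch below assumes you already know a document's token count and want to keep part of the context window free for the prompt and the response. The function name `chunks_needed` and the `reserved_tokens` parameter are hypothetical, introduced only for this example.

```python
import math

def chunks_needed(total_tokens, context_limit, reserved_tokens=0):
    """Estimate how many chunks a document needs for a given context limit.

    reserved_tokens is a hypothetical budget set aside for instructions
    and the model's response; it is not part of any model's API.
    """
    usable = context_limit - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens must be smaller than context_limit")
    return math.ceil(total_tokens / usable)

# A 20,000-token document against the 8,192-token limit quoted above,
# keeping 1,000 tokens free for the prompt and the answer.
print(chunks_needed(20_000, 8_192, reserved_tokens=1_000))  # 3
```

The estimate is a lower bound: strategies that respect sentence or paragraph boundaries rarely fill every chunk completely, so they usually need a few more chunks.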

## Tokenization and Chunking

Tokenization is the process of converting text into tokens, the basic units a model processes. A token may be a whole word, part of a word, or a single punctuation mark, so token counts rarely match word or character counts. Tokenization and chunking work together to ensure that text is divided into manageable pieces that respect the model's token limits. When chunking text, it's important to consider how the text will be tokenized, because a chunk measured in characters can still exceed a limit measured in tokens. By aligning chunking strategies with tokenization, we can optimize the text for efficient processing by LLMs.
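
To make the idea concrete, here is a minimal sketch of chunking by token count rather than character count. It assumes a stand-in tokenizer, `simple_tokenize`, that treats words and punctuation marks as tokens; this is a hypothetical helper for illustration, and a real model's tokenizer would produce different counts. `token_chunking` and `max_tokens` are likewise names introduced here.

```python
import re

def simple_tokenize(text):
    # Stand-in tokenizer (an assumption for this sketch): words and
    # punctuation marks become separate tokens. Real LLM tokenizers
    # usually split text into subword units and give different counts.
    return re.findall(r"\w+|[^\w\s]", text)

def token_chunking(text, max_tokens=8):
    """Group tokens into chunks of at most max_tokens tokens each."""
    tokens = simple_tokenize(text)
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

sample = "Tokenization converts text into tokens. Chunking groups them."
for chunk in token_chunking(sample, max_tokens=5):
    # Print the token count of each chunk next to its text
    print(len(simple_tokenize(chunk)), chunk)
```

Because chunks are rebuilt by joining tokens with spaces, the original spacing around punctuation is not preserved. In practice you would count tokens with the tokenizer that matches your target model.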

## Common Chunking Strategies

Different strategies can be used to split text into chunks, depending on the use case:

  - **Fixed-Length Chunking:** Dividing text into equal-sized segments based on character count or token count.
  - **Sentence-Based Chunking:** Splitting text at sentence boundaries to maintain readability.
  - **Paragraph-Based Chunking:** Keeping paragraphs intact while breaking long texts into smaller sections.

Let's implement these strategies in Python.

## Implementing Text Chunking in Python

This section walks through practical Python implementations of the three strategies above: fixed-length, sentence-based, and paragraph-based chunking. Applying them gives you hands-on experience in breaking large texts into chunks suitable for LLM processing.

### Fixed-Length Chunking

Fixed-length chunking divides text into equally sized chunks based on character count or token count. This method is simple and effective for processing large amounts of text efficiently. However, it does not consider the meaning of sentences or paragraphs, which may result in chunks being cut off at arbitrary points, potentially disrupting the context.

```python
import textwrap

def fixed_length_chunking(text, chunk_size=500):
    return textwrap.wrap(text, width=chunk_size)
```

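This method is useful when working with models that require a strict limit on input size, but it does not prioritize preserving sentence structure. Note that `textwrap.wrap` breaks lines at whitespace, so each chunk is at most `chunk_size` characters long rather than exactly that long, and it converts newlines to spaces, so paragraph breaks are lost.

**Pros:** Simple and fast.
**Cons:** Splits sentences at arbitrary points, losing coherence; `textwrap.wrap` only splits a word when it is longer than `chunk_size`.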
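
For comparison, a strict character slice produces chunks of exactly equal length, at the cost of cutting words. The sketch below is a minimal illustration; `strict_fixed_length_chunking` is a hypothetical name introduced here, not part of `textwrap`. It prints the strict slices next to the `textwrap.wrap` result for the same string.

```python
import textwrap

def strict_fixed_length_chunking(text, chunk_size=500):
    """Slice text into pieces of exactly chunk_size characters.

    Only the final piece may be shorter. Words can be cut in half.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample = "Chunking splits long documents into smaller pieces."
print(strict_fixed_length_chunking(sample, chunk_size=20))
print(textwrap.wrap(sample, width=20))
```

In this example the strict version cuts the word "smaller" in half, while `textwrap.wrap` moves the whole word to the next chunk and leaves the previous chunk shorter than 20 characters.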

### Sentence-Based Chunking

Sentence-based chunking ensures that each chunk consists of whole sentences. This method is particularly useful for models that require better contextual integrity. Instead of splitting text based on character count alone, it groups complete sentences together until the chunk reaches the predefined limit.

```python
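import nltk
from nltk.tokenize import sent_tokenize

# Recent NLTK releases load sentence-tokenizer data from 'punkt_tab';
# older releases use 'punkt' instead.
nltk.download('punkt_tab', quiet=True)

def sentence_chunking(text, chunk_size=100):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        # Account for the joining space when the chunk already has content
        separator = 1 if current_chunk else 0
        if len(current_chunk) + separator + len(sentence) <= chunk_size:
            current_chunk = f"{current_chunk} {sentence}" if current_chunk else sentence
        else:
            if current_chunk:  # avoid appending an empty first chunk
                chunks.append(current_chunk)
            current_chunk = sentence
    if current_chunk:
        chunks.append(current_chunk)
    return chunks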
```

This method ensures that sentences remain intact within each chunk, making it better suited for tasks that require natural language processing with full context.

**Pros:** Maintains sentence structure, preserving context.
**Cons:** Chunks may vary in size, leading to uneven distribution.
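
To see the size variation in practice, the following sketch packs sentences greedily and prints the length of each chunk. To stay dependency-free it uses a naive regex splitter, `naive_sentence_split`, in place of `sent_tokenize`; both helper names are hypothetical, and the regex is an assumption that ignores abbreviations and similar edge cases.

```python
import re

def naive_sentence_split(text):
    # Assumption for this sketch: a sentence ends at '.', '!' or '?'
    # followed by whitespace. NLTK's sent_tokenize handles abbreviations
    # and other edge cases that this regex does not.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def group_sentences(sentences, chunk_size=100):
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks

sample = "Short one. Another short one. This sentence is noticeably longer than the others."
chunks = group_sentences(naive_sentence_split(sample), chunk_size=40)
print([len(c) for c in chunks])  # [29, 51]
```

With a 40-character limit the two short sentences share a chunk, while the long sentence exceeds the limit on its own and is kept whole. A sentence longer than the limit therefore still needs separate handling if the limit is strict.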

### Paragraph-Based Chunking

Paragraph-based chunking keeps entire paragraphs intact, making it ideal for maintaining the original document's formatting and logical flow. This approach is beneficial when working with structured texts such as articles, reports, or books.

```python
def paragraph_chunking(text):
    return text.split("\n\n")
```

Unlike fixed-length chunking, this method avoids breaking paragraphs, ensuring that information stays grouped together in meaningful sections.

**Pros:** Preserves paragraph integrity, making it easier for the model to understand the text.
**Cons:** Some paragraphs may still be too long for LLMs with strict token limits.
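
One way to soften this drawback is to keep paragraphs whole when they fit and fall back to fixed-length wrapping only for the oversized ones. The sketch below illustrates that idea under stated assumptions: `paragraph_chunking_with_limit` is a hypothetical name, and `max_chars` is a character budget standing in for a model's token limit.

```python
import textwrap

def paragraph_chunking_with_limit(text, max_chars=500):
    """Keep paragraphs whole when they fit; wrap oversized ones.

    max_chars is a hypothetical character budget standing in for a
    model's token limit.
    """
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty pieces left by extra blank lines
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
        else:
            # Fall back to whitespace-based wrapping for long paragraphs
            chunks.extend(textwrap.wrap(paragraph, width=max_chars))
    return chunks

sample = "A short paragraph.\n\n" + "word " * 30
for chunk in paragraph_chunking_with_limit(sample, max_chars=60):
    print(len(chunk), repr(chunk))
```

The short paragraph passes through unchanged, and the long one is split into pieces of at most 60 characters. A sentence-based fallback could be substituted for `textwrap.wrap` if sentence integrity matters more than simplicity.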

## Summary

This lesson introduces the concept of text chunking, which is essential for efficient processing by large language models (LLMs) due to their token limitations. It explains the importance of chunking to prevent information loss and outlines the token limits of various models like GPT-4, BERT, T5, and Claude. The lesson covers common chunking strategies, including fixed-length, sentence-based, and paragraph-based chunking, and provides Python implementations for each method. Fixed-length chunking is simple but may disrupt context, sentence-based chunking maintains sentence integrity, and paragraph-based chunking preserves paragraph structure. Each method has its pros and cons, depending on the use case and model requirements. Additionally, the lesson highlights how tokenization and chunking work together to optimize text for LLM processing.

## Implementing Fixed Length Text Chunking

Now that you understand the concept of text chunking, let's put theory into practice! In this exercise, you'll implement your first chunking strategy: fixed-length chunking using Python's textwrap module.

After learning about the different chunking methods, it's time to see how fixed-length chunking actually works with real text. You'll complete a function that breaks text into chunks of specified character lengths and observe how different chunk sizes affect the results.

Your tasks are as follows:

  - Use the sample text provided in the `text.txt` file.
  - Complete the `fixed_length_chunking` function using `textwrap.wrap()`.
  - Run the function with three different chunk sizes (30, 50, and 100 characters).
  - Print the first three chunks for each chunk size.

This hands-on experience will help you understand when fixed-length chunking is appropriate and what trade-offs you make when choosing this method over sentence- or paragraph-based approaches.

```python
import textwrap

# Load sample text from a file
with open('text.txt', 'r') as file:
    sample_text = file.read()


# Fixed-Length Chunking function
def fixed_length_chunking(text, chunk_size=50):
    # TODO: Use textwrap.wrap() to split the text into chunks of the specified size
    pass


# Test with different chunk sizes (30, 50, 100)
# TODO: Test with multiple chunk sizes to observe how chunk size affects the output
# TODO: Print the first three chunks for each size

```

```python
import textwrap

# Load sample text from a file
with open('text.txt', 'r') as file:
    sample_text = file.read()


# Fixed-Length Chunking function
def fixed_length_chunking(text, chunk_size=50):
    # textwrap.wrap() breaks at whitespace, so each chunk is at most chunk_size characters
    return textwrap.wrap(text, width=chunk_size)


# Test with different chunk sizes (30, 50, 100) to observe how chunk size affects the output
chunk_sizes = [30, 50, 100]

for size in chunk_sizes:
    print(f"--- Chunk Size: {size} ---")
    chunks = fixed_length_chunking(sample_text, chunk_size=size)
    # Print the first three chunks for each size
    for i, chunk in enumerate(chunks[:3], start=1):
        print(f"Chunk {i}:\n{chunk}\n")

```

## Sentence Boundaries for Smarter Chunking

Excellent work with fixed-length chunking! Now let's move on to a more sophisticated approach: sentence-based chunking. While fixed-length chunking is simple, it often cuts sentences in awkward places, which can confuse language models.

In this exercise, you'll implement a sentence-based chunking function that respects natural language boundaries while still controlling chunk size. This approach helps maintain the meaning and context of your text.

Your tasks:

  - Complete the `sentence_chunking` function using NLTK's `sent_tokenize` to split text into complete sentences.
  - Group these sentences into chunks that stay under specified maximum sizes.
  - Test your function with two different maximum chunk sizes (50 and 400 characters).
  - Print the first three chunks for each size setting.

By comparing this method with the fixed-length chunking you implemented earlier, you'll gain insight into which approach works best for different scenarios. This skill will be invaluable when preparing text for more complex LLM applications.

```python
import nltk
from nltk.tokenize import sent_tokenize

# Download the punkt tokenizer if not already downloaded
nltk.download('punkt_tab', quiet=True)

# Load sample text from a file
with open('text.txt', 'r') as file:
    sample_text = file.read()

# Sentence-Based Chunking function
def sentence_chunking(text, max_chunk_size=100):
    # TODO: Split the text into sentences using sent_tokenize
    
    chunks = []
    current_chunk = ""
    
    # TODO: Loop through each sentence and add it to the current chunk
    # TODO: If adding the sentence would exceed max_chunk_size, start a new chunk
    # TODO: Make sure to add the last chunk if it's not empty
    
    return chunks

# Test with different chunk sizes
# TODO: Test the function with max_chunk_size of 50 and 400
# TODO: Print the first three chunks for each size setting
```

```python
import nltk
from nltk.tokenize import sent_tokenize

# Download the sentence tokenizer data if not already downloaded.
# Recent NLTK releases load it from 'punkt_tab'; older releases use 'punkt'.
nltk.download('punkt_tab', quiet=True)

# Load sample text from a file
with open('text.txt', 'r') as file:
    sample_text = file.read()

# Sentence-Based Chunking function
def sentence_chunking(text, max_chunk_size=100):
    # Split the text into sentences using sent_tokenize
    sentences = sent_tokenize(text)
    
    chunks = []
    current_chunk = ""
    
    # Loop through each sentence and add it to the current chunk
    # If adding the sentence would exceed max_chunk_size, start a new chunk
    for sentence in sentences:
        # Check if adding the next sentence would exceed the max size
        # Add a space for separation, if it's not the first sentence in the chunk.
        if len(current_chunk) + len(sentence) + (1 if current_chunk else 0) > max_chunk_size:
            if current_chunk: # Only append if the chunk is not empty
                chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            if current_chunk:
                current_chunk += " " + sentence
            else:
                current_chunk = sentence
    
    # Make sure to add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

# Test with different chunk sizes
chunk_sizes = [50, 400]

for size in chunk_sizes:
    print(f"--- Chunking with max_chunk_size: {size} ---")
    chunks = sentence_chunking(sample_text, max_chunk_size=size)
    # Print the first three chunks for each size setting
    for i, chunk in enumerate(chunks[:3]):
        print(f"Chunk {i+1}:\n{chunk}\n")
```

## Chunking Methods Head to Head

Now that you've implemented both fixed-length and sentence-based chunking methods, let's see how they compare in action! This exercise will help you visualize the real differences between these two approaches when processing the same text.

You'll use the chunking functions you've already built to create a side-by-side comparison that clearly shows why sentence boundaries matter for LLM processing.

Your tasks:

  - Use both chunking methods on the sample text with a chunk size of 50 characters.
  - For each method, print out the first 3 chunks side by side.
  - When comparing the chunks, look for differences in how the text is split: specifically, check if the fixed-length chunks break sentences in the middle, while the sentence-based chunks keep sentences whole.
  - Analyze the results and identify exactly where fixed-length chunking cuts sentences awkwardly, and how sentence-based chunking preserves sentence boundaries.

This visual comparison will provide you with concrete evidence of when to choose each chunking strategy. Understanding these differences is crucial when designing systems that need to process text efficiently while preserving meaning.

```python
import textwrap
import nltk
from nltk.tokenize import sent_tokenize

# Download the punkt tokenizer if not already downloaded
nltk.download('punkt_tab', quiet=True)

# Load sample text from a file
with open('text.txt', 'r') as file:
    sample_text = file.read()

# Fixed-Length Chunking function
def fixed_length_chunking(text, chunk_size=50):
    return textwrap.wrap(text, width=chunk_size)

# Sentence-Based Chunking function
def sentence_chunking(text, max_chunk_size=50):
    # Split the text into sentences
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""
    
    # Group sentences into chunks without exceeding max_chunk_size
    for sentence in sentences:
        # If adding this sentence (plus a joining space) stays within the limit, add it
        if len(current_chunk) + len(sentence) + (1 if current_chunk else 0) <= max_chunk_size:
            current_chunk += " " + sentence if current_chunk else sentence
        # Otherwise, save the current chunk and start a new one with this sentence
        else:
            if current_chunk:  # Only append non-empty chunks
                chunks.append(current_chunk.strip())
            current_chunk = sentence  # Start a new chunk
    
    # Add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

# TODO: Get chunks using both fixed_length_chunking and sentence_chunking functions

# TODO: Compare the first 3 chunks from each method

```

```python
import textwrap
import nltk
from nltk.tokenize import sent_tokenize

# Download the sentence tokenizer data if not already downloaded.
# Recent NLTK releases load it from 'punkt_tab'; older releases use 'punkt'.
nltk.download('punkt_tab', quiet=True)

# Load sample text from a file
with open('text.txt', 'r') as file:
    sample_text = file.read()

# Fixed-Length Chunking function
def fixed_length_chunking(text, chunk_size=50):
    return textwrap.wrap(text, width=chunk_size)

# Sentence-Based Chunking function
def sentence_chunking(text, max_chunk_size=50):
    # Split the text into sentences
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""
    
    # Group sentences into chunks without exceeding max_chunk_size
    for sentence in sentences:
        # If adding this sentence plus a space fits, add it
        if len(current_chunk) + len(sentence) + (1 if current_chunk else 0) <= max_chunk_size:
            current_chunk += " " + sentence if current_chunk else sentence
        # Otherwise, save the current chunk and start a new one with this sentence
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence
    
    # Add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

# Get chunks using both fixed_length_chunking and sentence_chunking functions
chunk_size = 50
fixed_chunks = fixed_length_chunking(sample_text, chunk_size)
sentence_chunks = sentence_chunking(sample_text, max_chunk_size=chunk_size)

# Compare the first 3 chunks from each method
print(f"--- Comparison of Chunking Methods (Chunk Size: {chunk_size}) ---\n")

print("Fixed-Length Chunking Chunks:")
for i, chunk in enumerate(fixed_chunks[:3]):
    print(f"Chunk {i+1}: '{chunk}'")
print("\n" + "="*50 + "\n")

print("Sentence-Based Chunking Chunks:")
for i, chunk in enumerate(sentence_chunks[:3]):
    print(f"Chunk {i+1}: '{chunk}'")

# Analyze the results
print("\n--- Analysis ---")
print("As you can see, the fixed-length chunking method fills each chunk up to the 50-character limit, often cutting off sentences mid-way.")
print("For example, the first chunk ends with 'to text' but the sentence clearly continues.")
print("\nIn contrast, the sentence-based chunking method preserves complete sentences, even if it means the chunk size is slightly smaller than the maximum.")
print("This maintains the natural flow and meaning of the text, which is crucial for LLMs that rely on context.")
```
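
The visual check can also be automated. The sketch below is a rough heuristic, not part of the exercise: it flags a chunk as broken when it does not end with sentence-final punctuation. The helper names `ends_mid_sentence` and `count_broken_chunks` and the two short example lists are assumptions made for illustration; in the exercise you would pass `fixed_chunks` and `sentence_chunks` instead.

```python
def ends_mid_sentence(chunk):
    """Return True if the chunk does not end with sentence-final punctuation.

    Heuristic for this sketch only: '.', '!' and '?' (optionally followed
    by a closing quote or bracket) count as sentence endings.
    """
    stripped = chunk.rstrip().rstrip("\"')]")
    return not stripped.endswith((".", "!", "?"))

def count_broken_chunks(chunks):
    """Count the chunks that stop in the middle of a sentence."""
    return sum(ends_mid_sentence(c) for c in chunks)

fixed_example = ["Chunking splits long", "documents into", "smaller pieces."]
sentence_example = ["Chunking splits long documents into smaller pieces."]
print(count_broken_chunks(fixed_example))     # 2
print(count_broken_chunks(sentence_example))  # 0
```

A higher count for the fixed-length chunks than for the sentence-based chunks confirms in numbers what the printed comparison shows.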

## Preserving Document Structure with Paragraph Chunking

After mastering sentence-based chunking, let's complete our chunking toolkit with the third key strategy: paragraph-based chunking. This method is perfect for preserving the logical structure of documents where paragraph breaks represent meaningful divisions.

In this exercise, you'll implement a paragraph_chunking function that respects the natural organization of text while ensuring chunks remain manageable for LLM processing.

Your tasks:

  - Complete the `paragraph_chunking` function to split text at paragraph boundaries (double newlines).
  - Test your function with the sample text by comparing the first three chunks (paragraphs) it produces: for each, print the paragraph itself and its character length. This will help you verify that your function is correctly splitting the text at paragraph boundaries and preserving the structure of each chunk.

By adding this technique to your toolkit, you'll have a complete set of chunking strategies to handle any text processing scenario. You'll be able to choose the right approach based on your specific needs, whether prioritizing consistent chunk sizes, sentence integrity, or document structure.

```python
# Load sample text from a file
with open('text.txt', 'r') as file:
    sample_text = file.read()

# Paragraph-Based Chunking function
def paragraph_chunking(text):
    # TODO: Split text at paragraph boundaries (double newlines)
    return []  # Replace this with your implementation

# TODO: Test the function with the sample text and print the first three paragraphs with their lengths

```

```python
# Load sample text from a file
with open('text.txt', 'r') as file:
    sample_text = file.read()

# Paragraph-Based Chunking function
def paragraph_chunking(text):
    # Split text at paragraph boundaries (double newlines)
    return text.split('\n\n')

# Test the function with the sample text
paragraphs = paragraph_chunking(sample_text)

# Print the first three paragraphs and their character lengths
for i, paragraph in enumerate(paragraphs[:3]):
    print(f"Paragraph {i+1}:\n{paragraph}")
    print(f"Character Length: {len(paragraph)}\n{'-'*20}\n")
```

### Explanation of the `paragraph_chunking` Function

The `paragraph_chunking` function splits a document into logical, paragraph-sized chunks. It relies on a common plain-text convention: paragraphs are separated by a blank line, which appears in the string as two consecutive newline characters (`\n\n`).

The implementation uses Python's built-in `str.split('\n\n')` method, which breaks the input string into a list of smaller strings, one per paragraph. This approach respects the natural divisions in the original document, so each chunk keeps its context. The resulting chunks are then easy to process individually, which is a common requirement in natural language processing tasks.
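
Real files do not always follow the `\n\n` convention exactly: they may use Windows line endings, several blank lines in a row, or blank lines that contain stray spaces. The following sketch shows one possible way to tolerate those cases with a regular expression. `robust_paragraph_chunking` is a hypothetical name introduced here, not part of the exercise.

```python
import re

def robust_paragraph_chunking(text):
    """Split on blank lines, tolerating Windows line endings and stray spaces."""
    # Normalize Windows line endings so every line break is a single '\n'
    normalized = text.replace("\r\n", "\n")
    # A paragraph break is a newline, optional whitespace-only lines, then a newline
    parts = re.split(r"\n\s*\n", normalized)
    # Trim each paragraph and drop empty pieces
    return [part.strip() for part in parts if part.strip()]

sample = "First paragraph.\r\n\r\nSecond paragraph.\n \n\n\nThird paragraph.\n"
print(robust_paragraph_chunking(sample))
```

On this sample the function returns the three paragraphs without empty entries or trailing whitespace, whereas a plain `split('\n\n')` would not recognize the `\r\n\r\n` separator.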