# Basic Text Chunking Strategies

Text chunking is the process of breaking down large documents into smaller, manageable pieces. This is essential for:
- **RAG (Retrieval-Augmented Generation)**: Storing and retrieving relevant context
- **Vector Databases**: Creating meaningful embeddings
- **LLM Context Windows**: Fitting text within token limits
- **Search & Information Retrieval**: Improving relevance and precision

In this notebook, we'll explore fundamental chunking strategies with practical examples.

## Sample Text for Examples

Let's create a sample document to demonstrate different chunking approaches.

In [1]:
sample_text = """
Artificial intelligence has transformed the way we interact with technology. Machine learning algorithms can now recognize patterns in vast datasets, enabling applications from image recognition to natural language processing.

Deep learning, a subset of machine learning, uses neural networks with multiple layers. These networks can learn hierarchical representations of data. The breakthrough came with increased computational power and large datasets.

Natural language processing (NLP) allows computers to understand human language. Modern NLP models like transformers have revolutionized the field. They power applications such as chatbots, translation services, and text summarization.

The future of AI involves more sophisticated models that can reason, plan, and interact with humans more naturally. Ethical considerations around AI deployment are becoming increasingly important. We must ensure AI systems are fair, transparent, and beneficial to society.
"""

print(f"Original text length: {len(sample_text)} characters")
print(f"Word count: {len(sample_text.split())} words")

Original text length: 968 characters
Word count: 129 words


## 1. Fixed-Size Chunking

The simplest approach: split text into chunks of a fixed character or word count.

**Pros:**
- Simple to implement
- Predictable chunk sizes
- Fast processing

**Cons:**
- May split sentences or ideas mid-way
- Doesn't respect document structure
- Can break semantic meaning

In [2]:
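def fixed_size_chunking(text, chunk_size=200, overlap=50):
    """
    Split text into fixed-size chunks with optional overlap.

    Args:
        text: Input text to chunk
        chunk_size: Number of characters per chunk
        overlap: Number of overlapping characters between chunks
            (must be smaller than chunk_size)
    """
    if not 0 <= overlap < chunk_size:
        # Otherwise the window never advances and the loop below never ends
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    chunks = []
    start = 0
    text = text.strip()
    step = chunk_size - overlap  # distance between consecutive chunk starts

    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step

    return chunks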

# Example usage
fixed_chunks = fixed_size_chunking(sample_text, chunk_size=200, overlap=50)

print(f"Number of chunks: {len(fixed_chunks)}\n")
for i, chunk in enumerate(fixed_chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)

Number of chunks: 7

Chunk 1 (200 chars):
Artificial intelligence has transformed the way we interact with technology. Machine learning algorithms can now recognize patterns in vast datasets, enabling applications from image recognition to na
--------------------------------------------------------------------------------
Chunk 2 (200 chars):
enabling applications from image recognition to natural language processing.

Deep learning, a subset of machine learning, uses neural networks with multiple layers. These networks can learn hierarchi
--------------------------------------------------------------------------------
Chunk 3 (200 chars):
ultiple layers. These networks can learn hierarchical representations of data. The breakthrough came with increased computational power and large datasets.

Natural language processing (NLP) allows co
--------------------------------------------------------------------------------
Chunk 4 (200 chars):
sets.

Natural language processing (NLP) allows c
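
The function above counts characters, but the same sliding window works on words. Here is a minimal sketch, assuming whitespace-separated words are a good enough unit. The names `fixed_size_word_chunking`, `words_per_chunk`, and `overlap_words` are hypothetical and not part of the notebook's code above.

```python
def fixed_size_word_chunking(text, words_per_chunk=40, overlap_words=10):
    """Split text into chunks of a fixed number of words, with overlap.

    Hypothetical helper for illustration only.
    """
    if not 0 <= overlap_words < words_per_chunk:
        raise ValueError("overlap_words must be >= 0 and smaller than words_per_chunk")

    words = text.split()  # whitespace tokenization; newlines are collapsed
    step = words_per_chunk - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break  # this window already reached the end of the text
    return chunks


demo = "one two three four five six seven eight nine ten"
for chunk in fixed_size_word_chunking(demo, words_per_chunk=4, overlap_words=1):
    print(chunk)
```

Unlike the character version, this one never cuts a word in half, although it can still split a sentence.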

## 2. Sentence-Based Chunking

Split text at sentence boundaries to preserve complete thoughts.

**Pros:**
- Preserves sentence integrity
- More semantically meaningful
- Better for understanding context

**Cons:**
- Variable chunk sizes
- Requires sentence detection
- May still break related ideas across chunks

In [3]:
import re

def sentence_chunking(text, sentences_per_chunk=3):
    """
    Split text into chunks containing N sentences.
    
    Args:
        text: Input text to chunk
        sentences_per_chunk: Number of sentences per chunk
    """
    # Simple regex sentence splitter: it also splits after abbreviations
    # such as "e.g." (spaCy or NLTK handle these cases better)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s for s in sentences if s]  # Remove empty strings
    
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ' '.join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    
    return chunks

# Example usage
sentence_chunks = sentence_chunking(sample_text, sentences_per_chunk=2)

print(f"Number of chunks: {len(sentence_chunks)}\n")
for i, chunk in enumerate(sentence_chunks, 1):
    print(f"Chunk {i}:")
    print(chunk)
    print("-" * 80)

Number of chunks: 6

Chunk 1:
Artificial intelligence has transformed the way we interact with technology. Machine learning algorithms can now recognize patterns in vast datasets, enabling applications from image recognition to natural language processing.
--------------------------------------------------------------------------------
Chunk 2:
Deep learning, a subset of machine learning, uses neural networks with multiple layers. These networks can learn hierarchical representations of data.
--------------------------------------------------------------------------------
Chunk 3:
The breakthrough came with increased computational power and large datasets. Natural language processing (NLP) allows computers to understand human language.
--------------------------------------------------------------------------------
Chunk 4:
Modern NLP models like transformers have revolutionized the field. They power applications such as chatbots, translation services, and text summarization.
---------
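
Grouping a fixed number of sentences leaves chunk sizes uneven, as noted in the cons above. One possible remedy is to pack whole sentences greedily until a character budget is reached. The sketch below assumes the same regex splitter as `sentence_chunking`. `sentence_budget_chunking` and `max_chars` are hypothetical names introduced for illustration.

```python
import re


def sentence_budget_chunking(text, max_chars=300):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper for illustration. A single sentence longer than
    max_chars is kept whole rather than split.
    """
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if not current or len(candidate) <= max_chars:
            current = candidate  # still within budget (or first sentence)
        else:
            chunks.append(current)  # budget exceeded: close the chunk
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo = ("Short one. Another short one. "
        "This sentence is a little bit longer than the others. End.")
for chunk in sentence_budget_chunking(demo, max_chars=40):
    print(len(chunk), chunk)
```

Sentences are never cut, so a chunk can exceed `max_chars` only when one sentence is longer than the budget on its own.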

## 3. Paragraph-Based Chunking

Split text at paragraph boundaries (double newlines) to preserve topical coherence.

**Pros:**
- Preserves logical document structure
- Keeps related ideas together
- Natural semantic boundaries

**Cons:**
- Highly variable chunk sizes
- Large paragraphs may exceed token limits
- Depends on document formatting

In [4]:
def paragraph_chunking(text, max_paragraphs_per_chunk=2):
    """
    Split text into chunks based on paragraphs.
    
    Args:
        text: Input text to chunk
        max_paragraphs_per_chunk: Maximum paragraphs per chunk
    """
    # Split on blank lines (two consecutive newline characters)
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    
    chunks = []
    for i in range(0, len(paragraphs), max_paragraphs_per_chunk):
        chunk = '\n\n'.join(paragraphs[i:i + max_paragraphs_per_chunk])
        chunks.append(chunk)
    
    return chunks

# Example usage
paragraph_chunks = paragraph_chunking(sample_text, max_paragraphs_per_chunk=1)

print(f"Number of chunks: {len(paragraph_chunks)}\n")
for i, chunk in enumerate(paragraph_chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)

Number of chunks: 4

Chunk 1 (226 chars):
Artificial intelligence has transformed the way we interact with technology. Machine learning algorithms can now recognize patterns in vast datasets, enabling applications from image recognition to natural language processing.
--------------------------------------------------------------------------------
Chunk 2 (227 chars):
Deep learning, a subset of machine learning, uses neural networks with multiple layers. These networks can learn hierarchical representations of data. The breakthrough came with increased computational power and large datasets.
--------------------------------------------------------------------------------
Chunk 3 (235 chars):
Natural language processing (NLP) allows computers to understand human language. Modern NLP models like transformers have revolutionized the field. They power applications such as chatbots, translation services, and text summarization.
---------------------------------------------------------------
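
Paragraph lengths vary widely in real documents: a title or a one-line note becomes a chunk of its own. One way to smooth this out is to merge short paragraphs into their neighbors. The following is only a sketch under that assumption. `merge_short_paragraphs` and `min_chars` are hypothetical names, and the function does not split oversized paragraphs.

```python
import re


def merge_short_paragraphs(text, min_chars=200):
    """Paragraph chunking that merges short paragraphs with what follows.

    Hypothetical helper for illustration only.
    """
    # Blank lines (optionally containing spaces) separate paragraphs
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]

    chunks = []
    buffer = []
    for paragraph in paragraphs:
        buffer.append(paragraph)
        if len("\n\n".join(buffer)) >= min_chars:
            chunks.append("\n\n".join(buffer))
            buffer = []

    if buffer:
        # Leftover short paragraphs: attach them to the previous chunk
        tail = "\n\n".join(buffer)
        if chunks:
            chunks[-1] = chunks[-1] + "\n\n" + tail
        else:
            chunks.append(tail)
    return chunks


demo = "Title\n\n" + "A" * 50 + "\n\nNote\n\n" + "B" * 60 + "\n\nEnd"
for chunk in merge_short_paragraphs(demo, min_chars=40):
    print(repr(chunk))
```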

## 4. Recursive Chunking

A hierarchical approach that tries separators in order, from coarsest to finest (paragraphs → lines → sentences → words → characters).

**Pros:**
- Balances semantic meaning with size constraints
- More intelligent splitting
- Adapts to document structure

**Cons:**
- More complex to implement
- Slower than simple methods
- Requires tuning parameters

In [5]:
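def recursive_chunking(text, chunk_size=300, overlap=50):
    """
    Recursively split text using hierarchical separators.

    Args:
        text: Input text to chunk
        chunk_size: Maximum size for each chunk, in characters
        overlap: Overlap between chunks; only applied by the final
            character-level fallback
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Separators in order of preference
    separators = ['\n\n', '\n', '. ', ' ', '']

    def split_text(text, separators):
        # Base case: the text already fits in one chunk
        if len(text) <= chunk_size:
            return [text] if text else []

        separator, finer = separators[0], separators[1:]
        if separator == '':
            # Last resort: character-level split
            return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]
        if separator not in text:
            return split_text(text, finer)

        chunks = []
        current_chunk = ""
        splits = text.split(separator)
        for idx, split in enumerate(splits):
            # Re-attach the separator to every piece except the last one
            piece = split if idx == len(splits) - 1 else split + separator
            if len(current_chunk) + len(piece) <= chunk_size:
                current_chunk += piece
                continue
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = ""
            if len(piece) <= chunk_size:
                current_chunk = piece
            else:
                # Piece is still too large: recurse with finer separators
                chunks.extend(split_text(piece.strip(), finer))

        if current_chunk.strip():
            chunks.append(current_chunk.strip())
        return chunks

    return split_text(text.strip(), separators)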

# Example usage
recursive_chunks = recursive_chunking(sample_text, chunk_size=300, overlap=30)

print(f"Number of chunks: {len(recursive_chunks)}\n")
for i, chunk in enumerate(recursive_chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)

Number of chunks: 4

Chunk 1 (226 chars):
Artificial intelligence has transformed the way we interact with technology. Machine learning algorithms can now recognize patterns in vast datasets, enabling applications from image recognition to natural language processing.
--------------------------------------------------------------------------------
Chunk 2 (227 chars):
Deep learning, a subset of machine learning, uses neural networks with multiple layers. These networks can learn hierarchical representations of data. The breakthrough came with increased computational power and large datasets.
--------------------------------------------------------------------------------
Chunk 3 (235 chars):
Natural language processing (NLP) allows computers to understand human language. Modern NLP models like transformers have revolutionized the field. They power applications such as chatbots, translation services, and text summarization.
---------------------------------------------------------------

## 5. Token-Based Chunking

Split text based on token count (important for LLM context windows).

**Pros:**
- Directly respects LLM token limits
- Precise control over chunk sizes
- Essential for API usage

**Cons:**
- Requires tokenizer library
- Different models have different tokenizers
- May still break semantic meaning

In [6]:
# Note: This requires tiktoken library
# pip install tiktoken

try:
    import tiktoken
    
    def token_chunking(text, max_tokens=100, model="gpt-3.5-turbo"):
        """
        Split text based on token count.
        
        Args:
            text: Input text to chunk
            max_tokens: Maximum tokens per chunk
            model: Model name for tokenizer
        """
        encoding = tiktoken.encoding_for_model(model)
        tokens = encoding.encode(text)
        
        chunks = []
        for i in range(0, len(tokens), max_tokens):
            chunk_tokens = tokens[i:i + max_tokens]
            chunk_text = encoding.decode(chunk_tokens)
            chunks.append(chunk_text)
        
        return chunks
    
    # Example usage
    token_chunks = token_chunking(sample_text, max_tokens=50)
    
    print(f"Number of chunks: {len(token_chunks)}\n")
    for i, chunk in enumerate(token_chunks, 1):
        print(f"Chunk {i}:")
        print(chunk)
        print("-" * 80)
        
except ImportError:
    print("tiktoken not installed. Install with: pip install tiktoken")
    print("\nToken-based chunking requires the tiktoken library for accurate token counting.")

tiktoken not installed. Install with: pip install tiktoken

Token-based chunking requires the tiktoken library for accurate token counting.
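
When a tokenizer library is not available, as in the run above, a rough approximation can still show how token-based chunking behaves. The sketch below assumes that a regex "token" (a run of word characters or a single punctuation mark) stands in for a model token. Real tokenizers work on subword units and will give different counts, so treat the numbers as estimates only. `approx_token_chunking` and its parameters are hypothetical names.

```python
import re


def approx_token_chunking(text, max_tokens=50, overlap_tokens=0):
    """Chunk text by an approximate token count, with no tokenizer dependency.

    Assumption: a regex token approximates a model token. This is NOT
    how tiktoken or any model tokenizer actually counts.
    """
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")

    # Each match keeps its leading whitespace, so joining matches
    # reproduces the original spacing
    tokens = re.findall(r'\s*(?:\w+|[^\w\s])', text.strip())

    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + max_tokens]).strip())
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end
    return chunks


demo = "Hello, world! This is a test."
for chunk in approx_token_chunking(demo, max_tokens=4):
    print(repr(chunk))
```

For anything that has to respect a real context limit, count tokens with the tokenizer of the model you actually call.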


## Comparison of Chunking Strategies

Let's compare all the strategies we've covered:

In [7]:
import pandas as pd

# Gather statistics
strategies = {
    'Fixed-Size': fixed_chunks,
    'Sentence-Based': sentence_chunks,
    'Paragraph-Based': paragraph_chunks,
    'Recursive': recursive_chunks
}

comparison_data = []
for name, chunks in strategies.items():
    chunk_sizes = [len(c) for c in chunks]
    comparison_data.append({
        'Strategy': name,
        'Num Chunks': len(chunks),
        'Avg Size': int(sum(chunk_sizes) / len(chunk_sizes)),
        'Min Size': min(chunk_sizes),
        'Max Size': max(chunk_sizes)
    })

df = pd.DataFrame(comparison_data)
print(df.to_string(index=False))

       Strategy  Num Chunks  Avg Size  Min Size  Max Size
     Fixed-Size           7       180        66       200
 Sentence-Based           6       159        75       226
Paragraph-Based           4       240       226       272
      Recursive           4       240       226       272


## Best Practices for Basic Chunking

1. **Choose based on your use case:**
   - RAG systems: Sentence or paragraph-based
   - Token limits: Token-based chunking
   - Simple processing: Fixed-size

2. **Use overlap:** Helps maintain context between chunks (typically 10-20% of chunk size)

3. **Consider chunk size:** 
   - Too small: Loses context
   - Too large: Reduces retrieval precision
   - Common starting point: 256-512 tokens for many RAG applications; tune it against your own retrieval results

4. **Preserve metadata:** Track source document, position, and relationships

5. **Test and iterate:** Different documents may need different strategies
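
Practices 2 and 4 can be combined in a few lines. The sketch below derives the overlap from a ratio of the chunk size and records where each chunk came from. The `Chunk` dataclass, its fields, and `chunk_with_metadata` are hypothetical names chosen for illustration, and `"notes.txt"` is a made-up source identifier.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A chunk plus the metadata needed to trace it back to its source."""
    text: str
    source: str  # hypothetical document identifier
    index: int   # position of the chunk in the sequence
    start: int   # character offset where the chunk begins
    end: int     # character offset where the chunk ends (exclusive)


def chunk_with_metadata(text, source, chunk_size=200, overlap_ratio=0.15):
    """Fixed-size chunking that keeps offsets and a source label."""
    if chunk_size < 1 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be >= 1 and 0 <= overlap_ratio < 1")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap

    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        end = min(start + chunk_size, len(text))
        chunks.append(Chunk(text[start:end], source, index, start, end))
        if end == len(text):
            break  # the last window reached the end of the text
    return chunks


demo = "abcdefghij" * 5  # 50 characters
for c in chunk_with_metadata(demo, source="notes.txt", chunk_size=20, overlap_ratio=0.25):
    print(c.index, c.start, c.end, c.text)
```

Storing `start` and `end` makes it possible to show the surrounding text of a retrieved chunk or to spot neighboring chunks that overlap.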

## Looking Ahead: Advanced Chunking Strategies

The basic strategies we've covered are rule-based and don't understand the *meaning* of the text. In upcoming lessons, we'll explore **advanced chunking techniques** that leverage AI to create more intelligent chunks:

### Semantic Chunking
- Uses embeddings to understand text similarity
- Groups sentences with related meanings together
- Creates natural semantic boundaries
- Better preserves context and improves retrieval accuracy
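
To give a feel for the idea before the dedicated lesson, here is a toy sketch. It starts a new chunk whenever the similarity between two adjacent sentences drops below a threshold. As a loudly stated assumption, `embed` is a bag-of-words count vector standing in for a real embedding model, so it only detects shared words, not shared meaning. All function names and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r'\w+', sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunking(text, threshold=0.2):
    """Group adjacent sentences while their similarity stays above threshold."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    if not sentences:
        return []

    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) >= threshold:
            groups[-1].append(current)  # same topic: extend the chunk
        else:
            groups.append([current])    # similarity dropped: new chunk
    return [" ".join(group) for group in groups]


demo = ("Cats like warm milk. Cats like sleeping all day. "
        "Stock markets fell sharply. Markets fell again today.")
for chunk in semantic_chunking(demo):
    print(chunk)
```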

### Agentic Chunking
- Uses LLMs to intelligently identify logical boundaries
- Can understand document structure and topic transitions
- Adapts to different content types automatically

### Context-Aware Chunking
- Maintains references and relationships between chunks
- Preserves hierarchical document structure
- Creates "smart" overlaps based on content

### Specialized Chunking
- **Code chunking:** Respects function/class boundaries
- **Markdown chunking:** Preserves heading hierarchy
- **Table chunking:** Keeps tables intact
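
As a small preview of Markdown chunking, the sketch below splits a document at `#`-style headings and stores the heading path with each chunk. It assumes ATX headings only and does not skip fenced code blocks, where a line starting with `#` may be a comment. `markdown_chunking` and the `path`/`text` keys are hypothetical names.

```python
import re


def markdown_chunking(markdown_text):
    """Split Markdown at ATX headings, recording each chunk's heading path."""
    chunks = []
    path = []  # stack of (level, title) for the current heading hierarchy
    body = []  # lines collected under the current heading

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"path": [title for _, title in path], "text": content})
        body.clear()

    for line in markdown_text.splitlines():
        match = re.match(r'^(#{1,6})\s+(.*\S)\s*$', line)
        if match:
            flush()  # close the section that just ended
            level = len(match.group(1))
            # Drop headings at the same or a deeper level
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


demo = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n# Appendix\nExtra."
for chunk in markdown_chunking(demo):
    print(" > ".join(chunk["path"]), "|", chunk["text"])
```

Keeping the heading path next to the text lets a retriever tell apart two sections with similar wording but different parents.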

These advanced techniques can significantly improve retrieval quality in RAG systems, but they come with trade-offs in complexity and computational cost. Stay tuned!

## Exercise: Try It Yourself!

Experiment with different chunking strategies on your own text:

In [8]:
# Add your own text here
your_text = """
Replace this with your own text to experiment with different chunking strategies.
Try different chunk sizes, overlap values, and strategies to see what works best
for your specific use case.
"""

# Try different strategies
print("Fixed-Size Chunks:")
for i, chunk in enumerate(fixed_size_chunking(your_text, 100, 20), 1):
    print(f"{i}. {chunk}\n")

print("\n" + "="*80 + "\n")

print("Sentence Chunks:")
for i, chunk in enumerate(sentence_chunking(your_text, 2), 1):
    print(f"{i}. {chunk}\n")

Fixed-Size Chunks:
1. Replace this with your own text to experiment with different chunking strategies.
Try different chun

2. .
Try different chunk sizes, overlap values, and strategies to see what works best
for your specific

3. st
for your specific use case.



Sentence Chunks:
1. Replace this with your own text to experiment with different chunking strategies. Try different chunk sizes, overlap values, and strategies to see what works best
for your specific use case.

