# Chunking

- Fixed-length: Split documents into chunks of a fixed size. This is easy to implement, but chunk boundaries rarely align with logical breaks, so a chunk can cut an important passage in half or mix in irrelevant content.
- Sentence-based: Split on sentence boundaries so that no sentence is broken, which suits fine-grained analysis. The drawback is a large number of chunks, and a single short sentence often lacks the context needed to capture a full idea.
- Paragraph-based: Split on paragraph boundaries to keep related sentences together. Paragraphs vary widely in length, though, and long ones make retrieval and processing less efficient.
- Semantic chunking: Create chunks based on meaning, such as sections or topics. This keeps each chunk topically coherent but is harder to implement because it requires analyzing the content of the text.
- Sliding window: Slide a fixed-size window over the text so that consecutive chunks overlap. This keeps information near boundaries from being lost, at the cost of more chunks to process and repeated content.
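
Sentence-based and paragraph-based chunking from the list above can be sketched with the standard library alone. This is a minimal illustration, not a production splitter: the function names (`split_paragraphs`, `split_sentences`, `group_sentences`) are hypothetical, and the regular expression is a naive assumption that will mis-split abbreviations such as "e.g." or "Dr.".

```python
import re

def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def group_sentences(sentences, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "Chunking splits text. It helps retrieval!\n\nEmbeddings come next. Do they matter? Yes."
paragraphs = split_paragraphs(sample)
sentences = split_sentences(sample)
grouped = group_sentences(sentences, max_chars=45)
print(paragraphs)
print(sentences)
print(grouped)
```

Grouping sentences up to a size limit is one way to soften the "too many tiny chunks" problem of pure sentence splitting while still never cutting a sentence in half.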

## Chunking: Fixed-Length

In [None]:
# Example: fixed-length chunking by character count
def chunk_text_fixed(text, max_chars=500):
    # Slice the text every max_chars characters; the last chunk may be shorter.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

document = "..."  # a long text document
chunks = chunk_text_fixed(document, max_chars=1000)
print(f"Created {len(chunks)} fixed-size chunks.")
print(chunks[0][:100])  # preview the first 100 characters of the first chunk

## Chunking: Recursive

In [None]:
# Using LangChain's RecursiveCharacterTextSplitter for rule-based chunking.
# The separators list tells the splitter to prefer paragraph breaks, then line breaks, then spaces,
# and to fall back to splitting between characters only as a last resort.
# chunk_overlap lets consecutive chunks share up to 50 characters, so information cut off at the
# end of one chunk can still appear at the start of the next.

# Note: newer LangChain releases provide this class in the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # maximum chunk size, measured in characters by default
    chunk_overlap=50,     # maximum overlap between consecutive chunks
    separators=["\n\n", "\n", " ", ""],
)
chunks = text_splitter.split_text(document)
print(f"Created {len(chunks)} chunks with recursive splitting.")
print(chunks[0][:100])
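
To make the recursive idea concrete without the library, here is a simplified pure-Python sketch. It is not LangChain's actual implementation: the function name `recursive_split` is hypothetical, overlap is left out for brevity, and the merging rule (greedily join pieces until the size limit is reached) is an assumption chosen to keep the example short.

```python
def recursive_split(text, chunk_size=800, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator, recursing only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)       # close the current chunk
        if len(piece) > chunk_size:
            # The piece alone is too large: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample_text = "para one is short.\n\n" + "word " * 30 + "\n\nlast"
pieces = recursive_split(sample_text, chunk_size=40)
for p in pieces:
    print(len(p), repr(p))
```

In this sample the short first paragraph stays whole, while the long middle paragraph is the only part that gets broken down at word boundaries.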

## Chunking: Semantic

- Semantic chunking splits text where the meaning or topic shifts, rather than at a fixed length or an explicit delimiter.
- One approach: slide a window through the text, compare adjacent sentences, and split when their similarity drops below a threshold, which indicates a topic shift.
- Another approach uses summarization: recursively split the document and summarize each section to decide whether it needs further splitting.
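
The similarity-threshold approach can be sketched as follows. To keep the example self-contained, `toy_embed` is a hypothetical stand-in that uses bag-of-words counts; a real system would presumably swap in a sentence-embedding model. The threshold of 0.2 is likewise an arbitrary assumption that would need tuning on real data.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk whenever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, curr) < threshold:
            chunks.append([])            # similarity dropped: treat as a topic shift
        chunks[-1].append(sentence)
    return [" ".join(c) for c in chunks]

topic_sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock prices rose sharply today.",
    "Investors expect stock prices to keep rising.",
]
topic_chunks = semantic_chunks(topic_sentences)
for chunk in topic_chunks:
    print(chunk)
```

The two cat sentences share vocabulary and stay together, while the jump to stock prices falls below the threshold and opens a new chunk.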

## Chunking: Sliding Window

- A sliding window approach creates overlapping chunks by moving a window of fixed size through the text with a certain stride.
- For example, take 500 words at a time but start each new chunk 300 words after the previous one, giving a 200-word overlap. This ensures high coverage (every part of the text appears in some chunk) and preserves context between adjacent chunks.
- Sliding window chunking is common when you absolutely need to capture local context and can’t afford to miss something at a boundary.
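
A minimal sketch of the 500-word window with a 300-word stride described above. The function name `sliding_window_chunks` is hypothetical, and splitting on whitespace is a simplifying assumption about what counts as a word.

```python
def sliding_window_chunks(words, window=500, stride=300):
    """Return overlapping word windows; overlap is window - stride words."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride larger than window would leave gaps in coverage")
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + window])
        if start + window >= len(words):
            break                        # the last window already reaches the end
    return chunks

words = [f"w{i}" for i in range(1200)]   # stand-in for text.split()
windows = sliding_window_chunks(words, window=500, stride=300)
print(len(windows))                      # number of chunks
print(windows[0][300:] == windows[1][:200])  # consecutive chunks share 200 words
```

Stopping once a window reaches the end of the text avoids emitting a trailing chunk that is entirely contained in the previous one.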

## Chunking: Gotchas and Best Practices

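- Chunk Size vs Context Window: Always design chunking with your target LLM’s context length in mind. The retrieved chunks, together with the prompt and the generated answer, must fit within that limit.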
- Overlap Trade-off: Use overlapping chunks if missing boundary information is a concern, but be mindful of the inflation in chunk count. A small overlap (10–20% of chunk length) is often enough to catch important context.
- Metadata: Maintain metadata with each chunk. For example, store the document title, section name, page number, or other identifiers alongside the chunk.
- Don’t Overchunk Extremely Short Texts: If some documents are already short (short articles or FAQs), you might not need to chunk them at all.
- Multilingual Considerations: If your data is multilingual, chunking by sentence or paragraph still works, but different languages have different average word lengths and tokenization behaviors, so the same character or word budget can map to very different token counts.
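
Several of these practices can be combined in one small helper. The sketch below is illustrative only: the `Chunk` dataclass, its field names, and `chunk_document` are hypothetical, and the defaults (800 characters with a 120-character overlap, i.e. 15%) are assumptions picked to match the 10–20% guideline above.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    title: str        # document the chunk came from
    index: int        # position of the chunk within the document
    start_char: int   # character offset of the chunk in the source text

def chunk_document(text, title, max_chars=800, overlap=120):
    """Fixed-length chunks with overlap and metadata; short documents stay whole."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    if len(text) <= max_chars:
        return [Chunk(text, title, 0, 0)]    # already short: do not chunk
    step = max_chars - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append(Chunk(text[start:start + max_chars], title, index, start))
        if start + max_chars >= len(text):
            break
    return chunks

faq = chunk_document("Q: What is RAG? A: Retrieval-augmented generation.", "FAQ")
long_text = "".join(str(i % 10) for i in range(2000))
report = chunk_document(long_text, "Report")
print(len(faq), len(report))
print([c.start_char for c in report])
```

Storing `start_char` alongside the title makes it possible to trace a retrieved chunk back to its exact position in the source document.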

# Embeddings

- Embeddings are the backbone of the RAG retrieval system’s semantic memory.