# Technique 3: Semantic Chunking

## The Problem
Fixed-size chunking (for example, a new chunk every 500 characters) **breaks semantic boundaries**:
- Splits mid-sentence
- Splits mid-paragraph
- Splits mid-topic

Result: individual chunks can lose the context they need to make sense on their own.
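
To see the problem concretely, here is a minimal sketch of naive fixed-size splitting. The helper `fixed_size_chunks` and the sample text are made up for illustration; they are not part of the notebook's utilities.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample text, not taken from the MSME dataset
sample = (
    "Register the business name first. "
    "Then apply for a tax ID. "
    "Finally, open a business bank account."
)

chunks = fixed_size_chunks(sample, 40)
for chunk in chunks:
    # repr() makes the cut points and stray spaces visible
    print(repr(chunk))
```

The first chunk stops partway through the second sentence, because the cut position depends only on the character count.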

## The Solution
**Semantic Chunking:** Split based on meaning, not size.

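It embeds consecutive sentences, measures how far each one drifts from its neighbor, and splits where the distance jumps, since a jump signals a topic shift.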
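
The idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the 2-D vectors are hand-written stand-ins for real embeddings, and the threshold is hard-coded, whereas `SemanticChunker` by default derives it from a percentile of the observed distances. The function names here are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_on_topic_shifts(sentences, vectors, threshold):
    """Group consecutive sentences, starting a new group whenever the
    distance to the previous sentence exceeds `threshold`."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, curr) > threshold:
            groups.append([])  # topic shift detected: open a new chunk
        groups[-1].append(sentence)
    return [" ".join(group) for group in groups]


# Toy data: two sentences about registration, two about loans
sentences = [
    "Register your business name.",
    "Registration requires a valid ID.",
    "Loans are available for small firms.",
    "Loan interest rates vary by lender.",
]
# Hand-written 2-D vectors standing in for real embeddings (assumption)
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

topic_chunks = split_on_topic_shifts(sentences, vectors, threshold=0.5)
for chunk in topic_chunks:
    print(chunk)
```

With these toy vectors the only large distance is between the second and third sentences, so the text splits into a registration chunk and a loans chunk.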

**Difficulty:** ⭐⭐⭐☆☆

## Step 1: Imports

In [None]:
from utils_openai import setup_openai_api, create_embeddings, create_llm, load_msme_data
from langchain_experimental.text_splitter import SemanticChunker

print('[OK] Imports done!')

## Step 2: Setup

In [None]:
api_key = setup_openai_api()
embeddings = create_embeddings(api_key)
llm = create_llm(api_key)
docs, metas, ids = load_msme_data('msme.csv')
print('[OK] Data loaded!')

## Step 3: Create Semantic Chunker

In [None]:
semantic_chunker = SemanticChunker(embeddings)
print('[OK] Semantic chunker ready!')


## Step 4: Split Documents
Combine all docs and split semantically:

In [None]:
combined_text = '\n\n'.join(docs)
semantic_chunks = semantic_chunker.create_documents([combined_text])

print(f'Original docs: {len(docs)}')
print(f'Semantic chunks: {len(semantic_chunks)}')
print(f'\nSample chunk lengths:')
for i in range(min(5, len(semantic_chunks))):
    print(f'  Chunk {i+1}: {len(semantic_chunks[i].page_content)} chars')

## Step 5: Compare with Fixed Chunking

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size is measured in characters by default (length_function=len)
fixed_splitter = RecursiveCharacterTextSplitter(chunk_size=500)
fixed_chunks = fixed_splitter.create_documents([combined_text])

print(f'Fixed chunks (500 chars): {len(fixed_chunks)}')
print(f'Semantic chunks: {len(semantic_chunks)}')
print(f'\nFixed chunk example (may break mid-sentence):')
print(fixed_chunks[0].page_content[:300])
print(f'\nSemantic chunk example (respects boundaries):')
print(semantic_chunks[0].page_content[:300])

In [None]:
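# Print both first chunks in full, labelled so they are easy to tell apart
print('--- Fixed chunk (full) ---')
print(fixed_chunks[0].page_content)
print('\n--- Semantic chunk (full) ---')
print(semantic_chunks[0].page_content)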

## ⚖️ Fixed vs Semantic Chunking Trade-offs

| Aspect | Fixed-Size Chunking | Semantic Chunking |
|--------|-------------------|-------------------|
| **Chunk Count** | More chunks (812) | Fewer chunks (83) |
| **Chunk Sizes** | Uniform (~500 chars) | Variable (79-11,065 chars) |
| **Semantic Coherence** | Often broken ❌ | Usually preserved ✅ |
| **Processing Speed** | Fast | Slower (embedding overhead) |
| **Setup Complexity** | Simple | Requires embedding model |
| **Context Quality** | May include incomplete ideas | More likely to contain complete thoughts |
| **Best For** | Speed, simplicity | Quality, coherence |
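
To fill in the chunk-count and chunk-size rows for your own data, a small helper like the one below can summarize chunk lengths. `chunk_length_stats` is a hypothetical name, and the strings in the demo are placeholders rather than real chunks.

```python
from statistics import mean, median


def chunk_length_stats(texts):
    """Summarize the character lengths of a non-empty list of chunk strings."""
    lengths = [len(t) for t in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
        "median": median(lengths),
    }


# Placeholder strings; in the notebook you would pass something like
# [c.page_content for c in semantic_chunks]
demo_stats = chunk_length_stats(["a", "bbb", "ccccc"])
print(demo_stats)
```

Running it on both `fixed_chunks` and `semantic_chunks` gives a side-by-side view of how uniform or variable each strategy is.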

### Real Impact on Retrieval

**Scenario (illustrative):** A query about "business registration requirements"

**Fixed Chunking:**
```
Retrieved Chunk: "...visit the CAC portal. Submit required documen"
                                                              ↑ CUT OFF!
Problem: User gets incomplete information
```

**Semantic Chunking:**
```
Retrieved Chunk: "To register a business, visit the CAC portal. 
                  Submit required documents including tax ID, 
                  proof of address, and business plan. The process 
                  takes 2-3 weeks..."
                                                              ↑ COMPLETE!
Result: User gets full, actionable answer
```
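
One rough way to quantify this effect is to count how many chunks end without sentence-ending punctuation. The sketch below is only a heuristic with hypothetical helper names: it will misjudge chunks that legitimately end in a heading, a list item, or a table row.

```python
# Characters treated as a valid sentence ending (an assumption; extend as needed)
TERMINATORS = ".!?"


def ends_mid_sentence(text: str) -> bool:
    """True if the chunk's last visible character is not sentence-ending."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] not in TERMINATORS


def truncation_rate(texts) -> float:
    """Fraction of chunks that appear to stop mid-sentence."""
    if not texts:
        return 0.0
    return sum(ends_mid_sentence(t) for t in texts) / len(texts)


# The two endings from the scenario above
demo_chunks = [
    "...visit the CAC portal. Submit required documen",
    "The process takes 2-3 weeks.",
]
demo_rate = truncation_rate(demo_chunks)
print(f"Truncated chunks: {demo_rate:.0%}")
```

In the notebook, you could apply `truncation_rate` to the `page_content` of `fixed_chunks` and of `semantic_chunks` and compare the two numbers.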

### When the Extra Cost is Worth It

✅ **Use Semantic Chunking When:**
- Documents contain flowing narratives (articles, guides, reports)
- Context matters (legal, medical, educational content)
- Users ask conceptual questions requiring complete explanations
- Answer quality matters more than indexing speed

❌ **Stick with Fixed Chunking When:**
- Documents are already well-structured (bullet points, tables)
- Indexing speed is critical (for example, documents are ingested in real time)
- Documents are short and uniform
- Simple lookups (definitions, FAQs)

## Exercise
1. Compare retrieval quality with semantic vs fixed chunks
2. Test with different document types
3. Measure impact on answer coherence
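
As a possible starting point for the first exercise, the sketch below scores a chunk set by hit rate: the fraction of queries whose expected phrase shows up in the top-k retrieved chunks. Everything here is an assumption for illustration. `keyword_search` is a crude word-overlap stand-in for a real embedding retriever, and the demo chunks and query are toy data.

```python
def keyword_search(query, chunks, k=3):
    """Stand-in retriever: rank chunks by how many query words they contain."""
    words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def hit_rate(chunks, labelled_queries, search=keyword_search, k=3):
    """Fraction of (query, expected_phrase) pairs for which the phrase
    appears in at least one of the top-k retrieved chunks."""
    hits = 0
    for query, expected_phrase in labelled_queries:
        retrieved = search(query, chunks, k)
        if any(expected_phrase.lower() in c.lower() for c in retrieved):
            hits += 1
    return hits / len(labelled_queries)


# Toy data: the same text, once cut mid-word and once kept whole
fixed_demo = [
    "To register a business, visit the portal. Submit required documen",
    "ts including tax ID and proof of address.",
]
semantic_demo = [
    "To register a business, visit the portal. "
    "Submit required documents including tax ID and proof of address.",
]
queries = [("register a business documents", "documents including tax ID")]

fixed_score = hit_rate(fixed_demo, queries, k=1)
semantic_score = hit_rate(semantic_demo, queries, k=1)
print(f"Fixed: {fixed_score:.0%}  Semantic: {semantic_score:.0%}")
```

To use it with the notebook's data, swap `keyword_search` for a function that queries your vector store, and write a handful of query and expected-phrase pairs by hand from the MSME documents.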


In [None]:
# Your code here

**Next:** Technique 4 - Reranking