# M1.3: Document Processing Pipeline

**Extraction → Cleaning → Chunking → Embedding → Storage**

## Purpose
Transform raw documents (PDF, TXT, Markdown) into searchable vector embeddings for production RAG systems. This notebook teaches the complete pipeline from document extraction to Pinecone storage, with hands-on examples of each stage.

## Concepts Covered
- Multi-format document extraction (PDF, TXT, Markdown)
- Text cleaning and Unicode normalization
- Three chunking strategies (fixed, semantic, paragraph-aware)
- Metadata enrichment for precise retrieval filtering
- Batch embedding generation with OpenAI
- Vector storage in Pinecone with size limits and deduplication

## After Completing This Module
You will be able to:
- Build production document processing pipelines
- Choose appropriate chunking strategies for your use case
- Handle common failures (Unicode errors, memory issues, bad chunking)
- Make informed decisions between custom pipelines vs managed services
- Process documents cost-effectively at scale (~$0.02 per million tokens with `text-embedding-3-small` at the time of writing)

## Context in Track
- **M1.1**: Vector database fundamentals
- **M1.2**: Pinecone data model and indexing
- **M1.3** (this module): Document processing pipeline
- **M1.4** (next): Query pipeline and response generation

## 1. Why Document Processing Matters (Reality Check & Trade-offs)

Document processing transforms raw text into searchable embeddings. Traditional keyword search fails when users query "machine learning algorithms" but documents say "ML techniques". Semantic search bridges this gap.

### What You Gain
- **Semantic search**: Find content by meaning, not keywords
- **Automation**: Process hundreds of docs without manual tagging
- **Scale**: Handle diverse formats (PDF, TXT, MD)

### What You Lose
- **Exact formatting**: Tables and images become text soup
- **Cross-doc reasoning**: No knowledge graph linking
- **Speed**: Embedding 1000 chunks takes ~30 seconds

### When NOT to Build This
- **<10 documents**: Manual processing may be faster
- **Scanned PDFs**: Need OCR first (Tesseract, AWS Textract)
- **Real-time updates**: Reprocessing takes time
- **<100K token docs**: Long-context embeddings preserve full context

```python
# Quick demo: raw text → cleaned text transformation
from src.m1_3_document_processing import TextCleaner

raw_text = 'He said "hello" and\nsh e   said   "goodbye".   \n\n\n\nEnd.'
cleaner = TextCleaner()
cleaned = cleaner.clean(raw_text)

print(f"Raw ({len(raw_text)} chars): {repr(raw_text)[:60]}...")
print(f"Cleaned ({len(cleaned)} chars): {repr(cleaned)[:60]}...")

# Expected:
# - Multiple spaces → single space
# - Multiple newlines → double newlines
```

## 2. Extraction (PDF/TXT/Markdown) + Metadata

The `DocumentExtractor` pulls text from multiple formats and preserves metadata:

- **PDF**: PyMuPDF extracts text page-by-page
- **TXT/MD**: Plain text readers with encoding fallback
- **Metadata**: File name, size, page count, extraction method

### Key Features
- Generates unique `doc_id` (hash of filepath)
- Handles Unicode errors gracefully (latin-1 fallback)
- Raises clear errors for unsupported file types

```python
# Extract document and show metadata
from src.m1_3_document_processing import DocumentExtractor

extractor = DocumentExtractor()
doc = extractor.extract('data/example/example_data.txt')

print(f"doc_id: {doc.doc_id}")
print(f"text length: {len(doc.text)} chars")
print(f"metadata keys: {list(doc.metadata.keys())}")

# Expected:
# - doc_id: 12-char hash
# - text length: ~1500 chars
# - metadata: file_name, file_type, page_count, etc.
```
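
The extractor's internals are not shown in this notebook. As a rough sketch of the behaviour listed under Key Features, the snippet below reads a text file with a latin-1 fallback and derives a short `doc_id` from the file path. The function names, the MD5 hash, and the metadata keys are assumptions for illustration, not the module's actual implementation.

```python
import hashlib
from pathlib import Path

# Assumption: only plain-text formats are handled here; PDF needs a separate reader.
SUPPORTED_TEXT_SUFFIXES = {".txt", ".md"}


def make_doc_id(filepath: str, length: int = 12) -> str:
    """Derive a short, stable ID from the file path (hash choice is an assumption)."""
    return hashlib.md5(filepath.encode("utf-8")).hexdigest()[:length]


def read_text_with_fallback(path: Path) -> tuple:
    """Try UTF-8 first; fall back to latin-1, which can decode any byte sequence."""
    raw = path.read_bytes()
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"


def extract_text_file(filepath: str) -> dict:
    """Return text plus basic metadata, or raise for unsupported file types."""
    path = Path(filepath)
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED_TEXT_SUFFIXES:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    text, encoding = read_text_with_fallback(path)
    return {
        "doc_id": make_doc_id(str(path)),
        "text": text,
        "metadata": {
            "file_name": path.name,
            "file_type": suffix.lstrip("."),
            "file_size": path.stat().st_size,
            "encoding": encoding,
        },
    }
```

The latin-1 fallback never raises, but it can silently produce wrong characters for files written in another encoding, so recording which encoding was used in the metadata helps when debugging odd text later.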

## 3. Cleaning & Normalization (Artifacts, Unicode, Line Breaks)

Raw extracted text contains noise that degrades embedding quality:

- **Unicode issues**: Smart quotes (`“”`), em dashes (`—`), special chars
- **Whitespace chaos**: Multiple spaces, tabs, excessive newlines
- **PDF artifacts**: Hyphenated line breaks (`exam-\nple`), page numbers
- **Format remnants**: Headers, footers, metadata strings

The `TextCleaner` normalizes text using:
- **NFKC Unicode normalization**: Standardizes characters
- **Punctuation mapping**: Smart quotes → standard quotes
- **Regex patterns**: Fix hyphenated words, collapse whitespace
- **Artifact removal**: Strip common page numbers and headers

```python
# Clean the extracted document text
from src.m1_3_document_processing import TextCleaner

cleaner = TextCleaner()
cleaned_text = cleaner.clean(doc.text)

print(f"Before: {len(doc.text)} chars")
print(f"After: {len(cleaned_text)} chars")
print(f"First 120 chars: {cleaned_text[:120]}...")

# Expected:
# - Slightly shorter (whitespace removed)
# - No excessive newlines or spaces
# - Unicode normalized
```
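
To make the four cleaning steps concrete, here is a minimal standalone sketch using only the standard library. It is not the `TextCleaner` source: the function name, the punctuation table, and the exact regular expressions are assumptions chosen to mirror the list above.

```python
import re
import unicodedata

# Assumed mapping: typographic punctuation to plain ASCII equivalents.
PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2014": "-", "\u2013": "-",   # em dash, en dash
})


def clean_text(text: str) -> str:
    # 1. NFKC folds compatibility characters (ligatures, non-breaking spaces).
    text = unicodedata.normalize("NFKC", text)
    # 2. NFKC leaves smart quotes and dashes alone, so map them explicitly.
    text = text.translate(PUNCT_MAP)
    # 3. Rejoin words hyphenated across a line break ("exam-\nple").
    #    Caveat: this also joins genuine compounds that happen to break there.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 4. Blank out lines that contain only a page number ("12" or "Page 12").
    text = re.sub(r"^[ \t]*(?:Page[ \t]+)?\d+[ \t]*$", "", text, flags=re.MULTILINE)
    # 5. Collapse runs of spaces/tabs, trim spaces around newlines,
    #    and reduce three or more newlines to a paragraph break.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The order matters: hyphenation and page-number fixes rely on the original line breaks, so they run before whitespace is collapsed.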

## 4. Chunking Strategies (Fixed, Semantic, Paragraph-Aware)

Chunking splits documents into retrievable units. The strategy affects:
- **Retrieval precision**: Small chunks = specific matches
- **Context preservation**: Large chunks = more context
- **Cost**: More chunks = more embeddings

### Three Strategies

**Fixed-Size Chunking**
- Splits at character boundaries (e.g., 512 chars)
- Fast, predictable
- May split mid-sentence

**Semantic Chunking**
- Recursive split: paragraphs → sentences → words
- Respects boundaries, preserves meaning
- Slower, variable chunk sizes

**Paragraph Chunking**
- Groups paragraphs up to max size
- Best for structured documents
- Requires clear paragraph markers
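
For intuition, the two simpler strategies can be sketched in a few lines each. These are illustrative functions with assumed names and defaults, not the `FixedSizeChunker` or `ParagraphChunker` classes used in this module.

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list:
    """Cut text every (chunk_size - overlap) characters, ignoring sentence boundaries."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def paragraph_chunks(text: str, max_chunk_size: int = 600) -> list:
    """Group whole paragraphs (split on blank lines) up to max_chunk_size characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if not current or len(candidate) <= max_chunk_size:
            # Note: a single paragraph longer than the limit is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

With `chunk_size=5` and `overlap=2`, the string `"abcdefghij"` yields `["abcde", "defgh", "ghij"]`: each chunk repeats the last two characters of the previous one, which is what lets a sentence cut at a boundary still appear intact in one of the two neighbours.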

```python
# Compare all three chunking strategies
from src.m1_3_document_processing import FixedSizeChunker, SemanticChunker, ParagraphChunker

fixed = FixedSizeChunker(chunk_size=300, overlap=30)
semantic = SemanticChunker(chunk_size=300, overlap=30)
paragraph = ParagraphChunker(max_chunk_size=600)

fixed_chunks = fixed.chunk(cleaned_text)
semantic_chunks = semantic.chunk(cleaned_text)
paragraph_chunks = paragraph.chunk(cleaned_text)

print(f"Fixed: {len(fixed_chunks)} chunks | First chunk ends: ...{fixed_chunks[0][-80:]}")
print(f"Semantic: {len(semantic_chunks)} chunks | First chunk ends: ...{semantic_chunks[0][-80:]}")
print(f"Paragraph: {len(paragraph_chunks)} chunks | First chunk ends: ...{paragraph_chunks[0][-80:]}")

# Expected:
# - Different chunk counts (fixed ≈ semantic > paragraph)
# - Semantic/paragraph end at better boundaries
```
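
The recursive idea behind semantic chunking (try paragraphs, then sentences, then words) can be sketched as follows. This is a simplified illustration with an assumed function name and separator list; it omits overlap and is not the `SemanticChunker` implementation.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple = ("\n\n", ". ", " ")) -> list:
    """Split text into pieces of at most chunk_size, trying coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # This part alone is too big: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks
```

A separator that falls exactly on a chunk boundary is dropped, so a chunk may end without its trailing period. That is usually harmless for embedding, but worth knowing if the chunks are shown to users verbatim.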

## 5. Metadata Enrichment (IDs, Hash, Counts, Semantic Flags)

Rich metadata enables powerful filtering during retrieval. Without metadata, you retrieve chunks blindly. With it, you can:
- **Filter by content type**: Skip code chunks for user docs
- **Detect duplicates**: Check `content_hash` before upserting
- **Debug retrieval**: Trace chunks back to source documents

### Metadata Fields

**Identifiers**
- `chunk_id`: Unique ID (file_chunk_index_hash)
- `content_hash`: MD5 of chunk text (deduplication)
- `doc_id`: Parent document ID

**Counts**
- `word_count`, `char_count`, `line_count`: Size metrics

**Semantic Flags**
- `contains_code`: Code snippets (def, function, class, etc.)
- `is_list`: Bulleted/numbered lists (>50% list markers)
- `has_heading`: Starts with Markdown heading or all-caps title

```python
# Extract metadata for first semantic chunk
from src.m1_3_document_processing import MetadataExtractor

metadata_extractor = MetadataExtractor()
first_chunk_metadata = metadata_extractor.extract(semantic_chunks[0], doc.metadata, chunk_index=0)

# Show subset of metadata
print({
    'chunk_id': first_chunk_metadata['chunk_id'],
    'word_count': first_chunk_metadata['word_count'],
    'contains_code': first_chunk_metadata['contains_code'],
    'is_list': first_chunk_metadata['is_list']
})

# Expected:
# - chunk_id: example_data.txt_0_<hash>
# - word_count: ~50-100
# - contains_code: False (first chunk is title/intro)
# - is_list: False
```
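
The fields described above are cheap to compute. The sketch below shows one plausible way to derive them with the standard library; the function name, the regular expressions, and the 8-character hash suffix in `chunk_id` are assumptions, not the `MetadataExtractor` source.

```python
import hashlib
import re

# Assumed heuristics: a line that starts a definition, and a line that starts a list item.
CODE_PATTERN = re.compile(r"^\s*(?:def|class|function)\s+\w+", re.MULTILINE)
LIST_PATTERN = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")


def extract_chunk_metadata(chunk_text: str, doc_metadata: dict, chunk_index: int) -> dict:
    content_hash = hashlib.md5(chunk_text.encode("utf-8")).hexdigest()
    lines = [line for line in chunk_text.splitlines() if line.strip()]
    list_lines = sum(1 for line in lines if LIST_PATTERN.match(line))
    first_line = lines[0].strip() if lines else ""
    file_name = doc_metadata.get("file_name", "doc")
    return {
        **doc_metadata,  # carry parent document fields (doc_id, file_name, ...)
        "chunk_id": f"{file_name}_{chunk_index}_{content_hash[:8]}",
        "content_hash": content_hash,
        "word_count": len(chunk_text.split()),
        "char_count": len(chunk_text),
        "line_count": len(chunk_text.splitlines()),
        "contains_code": bool(CODE_PATTERN.search(chunk_text)),
        # More than half of the non-empty lines look like list items.
        "is_list": bool(lines) and list_lines / len(lines) > 0.5,
        # Markdown heading, or a short all-caps first line.
        "has_heading": first_line.startswith("#")
        or (first_line.isupper() and len(first_line.split()) <= 10),
    }
```

Anchoring the code pattern to the start of a line keeps ordinary prose such as "this class of problems" from being flagged as code, at the cost of missing inline snippets.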

## 6. Embeddings & Storage in Pinecone (Batches, Size Limits, Failures)

The final stage converts text chunks into dense vector embeddings and stores them in Pinecone for fast retrieval.

### Embedding Generation
- **Model**: OpenAI `text-embedding-3-small` (1536 dimensions)
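- **Cost**: ~$0.02 per million tokens at the time of writing (the ~$0.13 rate applies to `text-embedding-3-large`); check the pricing page for current rates
- **Batch size**: 100 chunks per API call (fewer round trips; the per-token price is unchanged)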
- **Latency**: ~1 second per batch of 100 chunks
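
A small sketch of the batching and cost arithmetic follows. The embedding call itself is passed in as a plain function (`embed_batch`, a hypothetical name) so the example runs without an API key; in a real pipeline that function would wrap the embeddings endpoint. The price used in the usage note is a round placeholder, not a quoted rate.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int = 100) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts: Sequence[str],
                     embed_batch: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 100) -> List[List[float]]:
    """Call embed_batch once per batch and concatenate the results in order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(list(batch)))
    return vectors


def estimate_cost_usd(total_tokens: int, price_per_million_usd: float) -> float:
    """Embedding cost scales linearly with tokens, regardless of batch size."""
    return total_tokens / 1_000_000 * price_per_million_usd
```

For example, 250 chunks produce three calls (100, 100, and 50 chunks), and 2,000,000 tokens at an illustrative $0.10 per million tokens would cost about $0.20.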

### Pinecone Storage
- **Upsert limit**: 100 vectors per request
- **Metadata limit**: 40KB per vector (we trim if exceeded)
- **Namespaces**: Logical partitions within an index

### Common Failures
- **Metadata too large**: Trim oversized fields before upserting; a 30KB threshold leaves headroom under the 40KB limit
- **Rate limits**: Batch requests with retry logic
- **Duplicate IDs**: Use `content_hash` for deduplication
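
Two of these failure modes lend themselves to small helpers: a metadata size check with trimming, and a retry wrapper with exponential backoff. The sketch below uses only the standard library. Measuring metadata by its JSON-serialized size is an approximation of how the limit is enforced, and all names and defaults here are assumptions.

```python
import json
import time


def metadata_size_bytes(metadata: dict) -> int:
    """Approximate the stored size as the UTF-8 length of the JSON encoding."""
    return len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))


def trim_metadata(metadata: dict, max_bytes: int = 40_000, text_field: str = "text") -> dict:
    """Return a copy whose text_field is halved until the payload fits under max_bytes."""
    trimmed = dict(metadata)
    while metadata_size_bytes(trimmed) > max_bytes:
        value = trimmed.get(text_field)
        if not isinstance(value, str) or not value:
            raise ValueError("metadata exceeds the limit and cannot be trimmed further")
        trimmed[text_field] = value[: len(value) // 2]
    return trimmed


def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, 2x, 4x, ... and retry up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # in a real pipeline, catch only rate-limit errors
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `with_retries(lambda: upsert_batch(batch))` would wrap whatever function performs the upsert (`upsert_batch` is a placeholder here). Passing `sleep` as a parameter makes the backoff schedule easy to test without waiting.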

```python
# Prepare chunks with metadata, then embed if API keys available
from src.m1_3_document_processing import Chunk, EmbeddingPipeline
from src.m1_3_document_processing.config import get_clients

# Create Chunk objects with enriched metadata
chunks_with_metadata = []
for i, chunk_text in enumerate(semantic_chunks[:3]):  # Just first 3 for demo
    metadata = metadata_extractor.extract(chunk_text, doc.metadata, i)
    chunks_with_metadata.append(Chunk(chunk_id=metadata['chunk_id'], text=chunk_text, metadata=metadata))

# Get API clients (may be None if keys not configured)
openai_client, pinecone_client = get_clients()

# Create pipeline and attempt embedding
pipeline = EmbeddingPipeline(openai_client, pinecone_client, 'demo-index')

if openai_client:
    try:
        vectors = pipeline.embed_chunks(chunks_with_metadata)
        print(f"Generated {len(vectors)} embeddings")
        print(f"Vector shape: id={vectors[0]['id']}, values=[{len(vectors[0]['values'])} dims], metadata keys={list(vectors[0]['metadata'].keys())}")
    except Exception as e:
        print(f"⚠️ Skipping embedding (API error): {e}")
else:
    # Show payload shape without calling API
    print("⚠️ Skipping API calls (no keys found)")
    print(f"Would generate {len(chunks_with_metadata)} vectors")
    print(f"Payload shape: {{'id': 'chunk_id', 'values': [1536 floats], 'metadata': {list(chunks_with_metadata[0].metadata.keys())}}}")
```

## 7. Common Failures & Fixes + Decision Card

### Five Common Failure Modes

**1. Unicode Errors (ÔøΩ characters)**
- **Cause**: Smart quotes, em dashes, special chars not handled
- **Fix**: `TextCleaner` uses NFKC normalization + punctuation mapping
- **Prevention**: Test with diverse PDFs (academic papers, scanned docs)

**2. Memory Exhaustion**
- **Cause**: 500MB PDF loaded entirely into memory
- **Fix**: Process page-by-page, use streaming readers
- **Prevention**: Add file size checks, limit max pages

**3. Bad Chunking (Code Split Mid-Function)**
- **Cause**: Fixed-size chunker splits at arbitrary boundaries
- **Fix**: Use `contains_code` metadata flag + custom code-aware splitter
- **Prevention**: Validate chunks on sample docs before production

**4. Garbled Table Extraction**
- **Cause**: PyMuPDF converts tables to unstructured text
- **Fix**: Use `pdfplumber` or Camelot for table-specific extraction
- **Prevention**: Test on docs with tables, consider hybrid approach

**5. Duplicate Chunks**
- **Cause**: Reprocessing same document without deduplication
- **Fix**: Check `content_hash` before upserting to Pinecone
- **Prevention**: Maintain processing logs with doc_id + timestamps
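
Two of the fixes above are easy to prototype: a file size guard (failure 2) and hash-based deduplication (failure 5). The sketch below is a minimal, standalone version with assumed names and a hypothetical 100 MB default limit. It keeps the set of seen hashes in memory, whereas a real pipeline would persist it or look the hashes up in the index.

```python
import hashlib
import os


def check_file_size(path: str, max_bytes: int = 100 * 1024 * 1024) -> int:
    """Refuse files above max_bytes before any attempt to load them."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes, above the {max_bytes}-byte limit")
    return size


def content_hash(text: str) -> str:
    """MD5 of the chunk text, used as a deduplication key (not for security)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def filter_new_chunks(chunks: list, seen_hashes: set) -> list:
    """Return chunks whose hash is not in seen_hashes, recording each new hash."""
    new_chunks = []
    for text in chunks:
        digest = content_hash(text)
        if digest in seen_hashes:
            continue  # identical text was already processed
        seen_hashes.add(digest)
        new_chunks.append(text)
    return new_chunks
```

Reprocessing the same document with the same `seen_hashes` set then returns an empty list, so nothing is embedded or upserted twice.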

---

### Decision Card: When to Use Each Approach

| Scenario | Best Choice | Reason |
|----------|-------------|--------|
| **50+ consistent PDFs** (e.g., all technical papers) | Custom Pipeline | Domain-specific chunking, full control |
| **<10 documents** | Manual Processing | Faster than building automation |
| **Diverse formats** (DOCX, images, scans) | Managed Service (Unstructured.io) | OCR, table extraction, multi-format support |
| **<100K token docs** | Long-Context Embeddings (Voyage AI) | No chunking loss, entire doc context |
| **Real-time updates** | Streaming Pipeline (Kafka + Lambda) | Incremental processing, not batch |
| **Scanned PDFs** | AWS Textract → Custom Pipeline | OCR first, then process |

---

### Next Steps

1. **Run tests**: `python tests_processing.py` (no API keys required)
2. **Process example doc**: `python m1_3_document_processing.py --process example_data.txt`
3. **Try different chunkers**: Compare quality on your own documents
4. **Monitor costs**: Track OpenAI token usage (embeddings add up fast)
5. **Validate retrieval**: In M1.4, you'll query these chunks and evaluate quality

**Remember**: Perfect chunking is impossible. Start simple (semantic chunking), measure retrieval quality, then optimize only if needed.

---

## Summary & Next Steps

### What You've Learned
✅ Extract text from PDFs, TXT, and Markdown with metadata preservation  
✅ Clean text by normalizing Unicode, removing artifacts, fixing line breaks  
✅ Choose chunking strategies: fixed (fast), semantic (meaning-preserving), paragraph (structured)  
✅ Enrich chunks with metadata for filtering and deduplication  
✅ Generate embeddings in batches and store in Pinecone  
✅ Handle common failures and make informed architecture decisions  

### Key Takeaways
- **No perfect chunking**: Start simple (semantic), measure retrieval quality, optimize if needed
- **Cost vs quality**: Smaller chunks = precise retrieval but more embeddings ($$$)
- **Metadata matters**: Enables filtering by content type, deduplication, debugging
- **Graceful degradation**: Pipeline works without API keys (returns metadata only)

### Practice Exercises
1. Process your own documents with different chunking strategies
2. Add custom metadata extractors for your domain (e.g., extract dates, authors)
3. Implement deduplication logic using `content_hash`
4. Test with edge cases (large PDFs, Unicode-heavy docs, tables)

### Next Module: M1.4 Query Pipeline & Response Generation
You'll learn to:
- Query Pinecone with semantic search
- Implement hybrid search (semantic + keyword)
- Rank and filter results using metadata
- Generate responses with retrieved context
- Handle retrieval failures and edge cases

### CLI Quick Reference
```bash
# Process single document
python -m src.m1_3_document_processing.module --process data/example/example_data.txt --index production-rag

# Batch processing
python -m src.m1_3_document_processing.module --process-batch docs/ --index production-rag --chunker semantic

# Start API server
powershell -c "$env:PYTHONPATH='$PWD'; uvicorn app:app --reload"
```

### Resources
- [OpenAI Embeddings Pricing](https://openai.com/pricing)
- [Pinecone Documentation](https://docs.pinecone.io/)
- [PyMuPDF (fitz) Docs](https://pymupdf.readthedocs.io/)

**Happy processing! 🚀**