# Ingestor Configuration Guide

This notebook covers all configuration options available in the ingestor library.

## Topics Covered

- Configuration loading (environment vs programmatic)
- Input configuration (local files, blob storage)
- Azure service configuration
- Chunking configuration (modes and overlap)
- Document Intelligence settings
- Table processing
- Embedding configuration
- Artifacts configuration
- Search index configuration
- Configuration validation

In [None]:
from ingestor import PipelineConfig, create_config
from ingestor.config import (
    InputMode,
    ArtifactsMode,
    ChunkingMode,
    OverlapMode,
    TableRenderMode
)
import os
from dotenv import load_dotenv

load_dotenv()
print("✅ Imports complete")

## 1. Configuration Loading Methods

### Method 1: Load from Environment Variables

In [None]:
# Load all settings from .env file
config = PipelineConfig.from_env()

print("Configuration loaded from environment:")
print(f"  Search service: {config.search.service_name}")
print(f"  Index: {config.search.index_name}")
print(f"  Document Intelligence: {config.document_intelligence.endpoint}")
print(f"  OpenAI: {config.openai.endpoint}")
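
As a rough picture of what environment-based loading involves, the sketch below builds settings from a mapping of environment variables and fails fast when one is missing. It is not ingestor's implementation: the variable names (`EXAMPLE_SEARCH_SERVICE`, `EXAMPLE_SEARCH_INDEX`), the `ExampleSearchSettings` class, and `load_search_settings` are hypothetical and exist only for this illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical variable names, used only for this illustration.
# Check the ingestor documentation for the names it actually reads.
REQUIRED_VARS = ("EXAMPLE_SEARCH_SERVICE", "EXAMPLE_SEARCH_INDEX")


@dataclass
class ExampleSearchSettings:
    service_name: str
    index_name: str


def load_search_settings(env: Optional[Mapping[str, str]] = None) -> ExampleSearchSettings:
    """Build settings from environment variables, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return ExampleSearchSettings(
        service_name=env["EXAMPLE_SEARCH_SERVICE"],
        index_name=env["EXAMPLE_SEARCH_INDEX"],
    )


# Pass an explicit mapping here so the example does not depend on your shell.
settings = load_search_settings(
    {"EXAMPLE_SEARCH_SERVICE": "my-search-service", "EXAMPLE_SEARCH_INDEX": "my-index"}
)
print(settings)
```

Failing at load time with the full list of missing variables tends to be easier to debug than an authentication error later in the pipeline.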

### Method 2: Programmatic Configuration

In [None]:
# Create config from scratch (no environment variables)
from ingestor.config import (
    SearchConfig,
    DocumentIntelligenceConfig,
    OpenAIConfig,
    InputConfig,
    LocalInputConfig
)

config = PipelineConfig(
    search=SearchConfig(
        service_name="my-search-service",
        api_key="my-key",
        index_name="my-index"
    ),
    document_intelligence=DocumentIntelligenceConfig(
        endpoint="https://my-di.cognitiveservices.azure.com",
        api_key="my-di-key"
    ),
    openai=OpenAIConfig(
        endpoint="https://my-openai.openai.azure.com",
        api_key="my-openai-key",
        embedding_deployment="text-embedding-ada-002"
    ),
    input=InputConfig(
        mode=InputMode.LOCAL,
        local=LocalInputConfig(
            glob="documents/*.pdf"
        )
    )
)

print("✅ Programmatic configuration created")

### Method 3: Hybrid (Environment + Override)

In [None]:
# Load from environment, then override specific settings
config = PipelineConfig.from_env()

# Override specific settings
config.search.index_name = "custom-index"
config.chunking.target_chunk_size = 1500
config.chunking.chunk_overlap = 300

print("✅ Hybrid configuration (env + overrides)")
print(f"  Index: {config.search.index_name}")
print(f"  Chunk size: {config.chunking.target_chunk_size}")

## 2. Input Configuration

### Local File Input

In [None]:
config = create_config(
    input_glob="documents/*.pdf"
)

print(f"Input mode: {config.input.mode}")
print(f"Glob pattern: {config.input.local.glob}")

# Advanced glob patterns
examples = [
    "documents/*.pdf",              # All PDFs in documents/
    "documents/**/*.pdf",            # All PDFs recursively
    "documents/*.{pdf,docx}",        # PDFs and DOCX files (needs brace expansion, which Python's glob module lacks)
    "documents/2024-*.pdf",          # PDFs starting with 2024-
    "documents/*/reports/*.pdf"      # PDFs in reports subdirectories
]

print("\n📋 Glob pattern examples:")
for pattern in examples:
    print(f"  {pattern}")
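
To see what these patterns select before pointing the pipeline at real data, you can try them against a scratch directory with the standard library. The sketch below uses `pathlib.Path.glob`, which may differ from the matcher ingestor uses internally. That is an assumption worth checking, especially for the brace pattern: `pathlib` treats `{pdf,docx}` as literal text. The file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Made-up files for the illustration
    for rel in (
        "documents/a.pdf",
        "documents/2024-q1.pdf",
        "documents/b.docx",
        "documents/team/reports/c.pdf",
    ):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    def match(pattern: str) -> list[str]:
        """Return matches relative to root, sorted, with forward slashes."""
        return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))

    flat = match("documents/*.pdf")               # Top level only
    recursive = match("documents/**/*.pdf")       # Top level and all subdirectories
    braces = match("documents/*.{pdf,docx}")      # Braces are literal here, so nothing matches
    prefixed = match("documents/2024-*.pdf")
    nested = match("documents/*/reports/*.pdf")

    # Portable alternative to brace expansion: run one pattern per extension
    expanded = sorted(set(match("documents/*.pdf")) | set(match("documents/*.docx")))

for name, result in [("flat", flat), ("recursive", recursive), ("braces", braces),
                     ("prefixed", prefixed), ("nested", nested), ("expanded", expanded)]:
    print(f"{name}: {result}")
```

If the brace pattern returns nothing in your environment either, listing one pattern per extension is the safer choice.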

### Azure Blob Storage Input

In [None]:
from ingestor.config import BlobStorageConfig

config = PipelineConfig.from_env()

# Configure blob storage input
config.input.mode = InputMode.BLOB_STORAGE
config.input.blob = BlobStorageConfig(
    account_name="mystorageaccount",
    account_key="my-storage-key",
    container_name="documents",
    glob="**/*.pdf"  # All PDFs in container
)

print("✅ Blob storage input configured")
print(f"  Container: {config.input.blob.container_name}")
print(f"  Pattern: {config.input.blob.glob}")

## 3. Chunking Configuration

Control how documents are split into chunks.

In [None]:
config = PipelineConfig.from_env()

# Basic chunking settings
config.chunking.target_chunk_size = 1000      # Target size in characters
config.chunking.chunk_overlap = 200           # Overlap between chunks
config.chunking.mode = ChunkingMode.LAYOUT    # Layout-aware chunking

print("Chunking configuration:")
print(f"  Target size: {config.chunking.target_chunk_size} chars")
print(f"  Overlap: {config.chunking.chunk_overlap} chars")
print(f"  Mode: {config.chunking.mode}")

### Chunking Modes

In [None]:
print("Available chunking modes:\n")

print("1. LAYOUT (recommended)")
print("   - Respects document structure (sections, paragraphs)")
print("   - Uses Document Intelligence layout analysis")
print("   - Best for: Technical manuals, structured documents\n")

print("2. PAGE")
print("   - One chunk per page")
print("   - Simple and fast")
print("   - Best for: Presentations, forms\n")

print("3. FIXED")
print("   - Fixed-size chunks by character count")
print("   - Ignores document structure")
print("   - Best for: Plain text, simple documents")
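
FIXED mode is the easiest of the three to picture: a window of `chunk_size` characters slides over the text, advancing by `chunk_size - overlap` each step. The sketch below is a minimal stand-alone version of that idea. `chunk_fixed` is a hypothetical helper, not the library's chunker, and it ignores details such as word boundaries.

```python
def chunk_fixed(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The window already reached the end of the text
    return chunks


sample_chunks = chunk_fixed("abcdefghij", chunk_size=4, overlap=1)
print(sample_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap at work. LAYOUT mode follows the same principle but cuts at section and paragraph boundaries instead of fixed offsets.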

### Overlap Configuration

In [None]:
config = PipelineConfig.from_env()

# Overlap mode: WORDS vs CHARACTERS
config.chunking.overlap_mode = OverlapMode.WORDS
config.chunking.chunk_overlap = 50  # 50 words overlap

print(f"Overlap mode: {config.chunking.overlap_mode}")
print(f"Overlap amount: {config.chunking.chunk_overlap}")

# Why overlap matters
print("\n💡 Overlap ensures context isn't lost at chunk boundaries")
print("   Recommended: 10-20% of chunk size")
print("   Example: 1000 char chunks → 100-200 char overlap")
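
The 10-20% guideline is simple arithmetic, sketched here as two small helpers. Both function names are hypothetical. The result is in whatever unit the chunk size uses, so the percentages only make sense when overlap and chunk size are measured the same way (both in characters or both in words).

```python
def recommended_overlap(chunk_size: int, low_pct: int = 10, high_pct: int = 20) -> tuple[int, int]:
    """Return the (min, max) recommended overlap, in the same unit as chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return chunk_size * low_pct // 100, chunk_size * high_pct // 100


def overlap_percent(chunk_size: int, overlap: int) -> float:
    """Express an overlap as a percentage of the chunk size."""
    return 100 * overlap / chunk_size


for size in (1000, 1200, 1500):
    low, high = recommended_overlap(size)
    print(f"{size} -> {low}-{high}")

# The hybrid example earlier (chunk size 1500, overlap 300) sits at the top of the range
print(overlap_percent(1500, 300))  # 20.0
```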

## 4. Document Intelligence Configuration

In [None]:
config = PipelineConfig.from_env()

# Document Intelligence settings
config.document_intelligence.model = "prebuilt-layout"  # Or "prebuilt-read"
config.document_intelligence.features = [
    "OCR_HIGH_RESOLUTION",
    "FORMULAS",
    "LANGUAGES"
]

print("Document Intelligence settings:")
print(f"  Endpoint: {config.document_intelligence.endpoint}")
print(f"  Model: {config.document_intelligence.model}")
print(f"  Features: {config.document_intelligence.features}")

## 5. Table Processing Configuration

In [None]:
config = PipelineConfig.from_env()

# Table rendering mode
config.chunking.table_render_mode = TableRenderMode.MARKDOWN_DETAILED

print("Table render modes:\n")
print("1. MARKDOWN_DETAILED (recommended)")
print("   - Full markdown tables with all structure")
print("   - Best for search and LLM understanding\n")

print("2. MARKDOWN_COMPACT")
print("   - Simplified markdown")
print("   - Saves space for simple tables\n")

print("3. HTML")
print("   - HTML table format")
print("   - Use if your RAG system renders HTML\n")

print("4. TEXT")
print("   - Plain text representation")
print("   - Most compact, loses structure")
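
To make the trade-off between the markdown and text modes concrete, the sketch below renders the same small table both ways. The two functions and the sample data are hypothetical and only approximate the idea. The library's actual output for each `TableRenderMode` may look different.

```python
def to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render a table as a markdown table with a separator row."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_text(header: list[str], rows: list[list[str]]) -> str:
    """Render a table as space-separated plain text."""
    return "\n".join(" ".join(cells) for cells in [header, *rows])


header = ["Part", "Torque"]
rows = [["Bolt A", "12 Nm"]]

markdown_table = to_markdown(header, rows)
text_table = to_text(header, rows)
print(markdown_table)
print()
print(text_table)
```

In the plain-text version, `Bolt A 12 Nm` no longer shows where one cell ends and the next begins, which is the structure loss noted for TEXT mode.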

## 6. Embedding Configuration

In [None]:
config = PipelineConfig.from_env()

# OpenAI embedding settings
config.openai.embedding_deployment = "text-embedding-ada-002"
config.openai.embedding_dimensions = 1536  # Default for ada-002

# For text-embedding-3-large (higher quality)
# config.openai.embedding_deployment = "text-embedding-3-large"
# config.openai.embedding_dimensions = 1536  # or 3072 for full dimensions

print("Embedding configuration:")
print(f"  Deployment: {config.openai.embedding_deployment}")
print(f"  Dimensions: {config.openai.embedding_dimensions}")
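
A mismatch between the requested dimensions and what the model can produce is a common source of indexing errors, so it can help to check the pair up front. The sketch below encodes the native output sizes of three OpenAI embedding models. It assumes you know which model sits behind your deployment, since an Azure deployment name is arbitrary and need not match the model name. `check_embedding_dimensions` is a hypothetical helper, not part of ingestor.

```python
# Native output sizes of the OpenAI embedding models
NATIVE_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}
# Only the text-embedding-3 models accept a smaller `dimensions` value
SHORTENABLE = {"text-embedding-3-small", "text-embedding-3-large"}


def check_embedding_dimensions(model: str, dimensions: int) -> list[str]:
    """Return a list of problems; an empty list means the pair looks consistent."""
    native = NATIVE_DIMENSIONS.get(model)
    if native is None:
        return [f"Unknown model {model!r}; cannot check dimensions"]
    if dimensions <= 0:
        return ["dimensions must be positive"]
    if dimensions > native:
        return [f"{model} produces at most {native} dimensions, got {dimensions}"]
    if dimensions < native and model not in SHORTENABLE:
        return [f"{model} always produces {native} dimensions, got {dimensions}"]
    return []


print(check_embedding_dimensions("text-embedding-ada-002", 1536))   # []
print(check_embedding_dimensions("text-embedding-3-large", 1536))   # []
print(check_embedding_dimensions("text-embedding-ada-002", 768))    # one problem
```

Whatever value you choose also has to match the vector field size in the search index, so changing it for an existing index usually means re-creating the index.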

## 7. Artifacts Configuration

Save intermediate processing artifacts (JSON, markdown) for debugging.

In [None]:
config = PipelineConfig.from_env()

# Save artifacts locally
config.artifacts.mode = ArtifactsMode.LOCAL
config.artifacts.local_path = "./artifacts"

# Or save to blob storage
# config.artifacts.mode = ArtifactsMode.BLOB_STORAGE
# config.artifacts.blob_container_name = "artifacts"

print("Artifacts configuration:")
print(f"  Mode: {config.artifacts.mode}")
print(f"  Path: {config.artifacts.local_path}")
print("\n💡 Artifacts include:")
print("  - Document Intelligence JSON results")
print("  - Extracted markdown")
print("  - Chunk metadata")

## 8. Search Index Configuration

In [None]:
config = PipelineConfig.from_env()

# Search service settings
config.search.index_name = "my-index"
config.search.semantic_config_name = "my-semantic-config"

print("Search configuration:")
print(f"  Service: {config.search.service_name}")
print(f"  Index: {config.search.index_name}")
print(f"  Semantic config: {config.search.semantic_config_name}")

## 9. Complete Configuration Example

A real-world configuration for processing medical device manuals:

In [None]:
# Production-ready configuration
config = PipelineConfig.from_env()

# Input: Local PDFs
config.input.mode = InputMode.LOCAL
config.input.local.glob = "medical_manuals/**/*.pdf"

# Chunking: Layout-aware with overlap
config.chunking.mode = ChunkingMode.LAYOUT
config.chunking.target_chunk_size = 1200
config.chunking.chunk_overlap = 30  # Counted in words (overlap_mode is WORDS); about 15% of 1200 chars at ~6 chars per word
config.chunking.overlap_mode = OverlapMode.WORDS
config.chunking.table_render_mode = TableRenderMode.MARKDOWN_DETAILED

# Document Intelligence: High-res OCR with formulas
config.document_intelligence.model = "prebuilt-layout"
config.document_intelligence.features = [
    "OCR_HIGH_RESOLUTION",
    "FORMULAS",
    "LANGUAGES"
]

# Embeddings: text-embedding-3-large
config.openai.embedding_deployment = "text-embedding-3-large"
config.openai.embedding_dimensions = 1536

# Artifacts: Save to blob storage
config.artifacts.mode = ArtifactsMode.BLOB_STORAGE
config.artifacts.blob_container_name = "medical-artifacts"

# Search: Custom index
config.search.index_name = "medical-devices-index"

print("✅ Production configuration ready")
print("\nConfiguration summary:")
print(f"  Input: {config.input.local.glob}")
print(f"  Chunking: {config.chunking.mode} ({config.chunking.target_chunk_size} chars)")
print(f"  Overlap: {config.chunking.chunk_overlap} {config.chunking.overlap_mode}")
print(f"  Tables: {config.chunking.table_render_mode}")
print(f"  Embeddings: {config.openai.embedding_deployment}")
print(f"  Index: {config.search.index_name}")

## 10. Configuration Validation

Validate your configuration before running:

In [None]:
def validate_config(config: PipelineConfig) -> bool:
    """Validate configuration settings; return True if no blocking issues are found."""
    issues = []    # Blocking problems
    warnings = []  # Advisories that should not fail validation

    # Check required fields
    if not config.search.service_name:
        issues.append("Missing search service name")
    if not config.search.index_name:
        issues.append("Missing search index name")
    if not config.document_intelligence.endpoint:
        issues.append("Missing Document Intelligence endpoint")
    if not config.openai.endpoint:
        issues.append("Missing OpenAI endpoint")

    # Check chunking settings.
    # Note: these comparisons assume chunk_overlap and target_chunk_size use the
    # same unit; with OverlapMode.WORDS the overlap is counted in words.
    if config.chunking.chunk_overlap >= config.chunking.target_chunk_size:
        issues.append("Chunk overlap must be less than chunk size")
    elif config.chunking.chunk_overlap > config.chunking.target_chunk_size * 0.3:
        warnings.append("Overlap is >30% of chunk size (may cause duplication)")

    # Report warnings first; they do not affect the return value
    for warning in warnings:
        print(f"⚠️ Warning: {warning}")

    if issues:
        print("❌ Configuration issues found:\n")
        for issue in issues:
            print(f"  - {issue}")
        return False

    print("✅ Configuration is valid")
    return True

# Validate
validate_config(config)
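
A non-empty endpoint can still be malformed, for example a host name pasted without its scheme. The sketch below shows one way to catch that with the standard library. `endpoint_problems` is a hypothetical helper, and requiring `https` is an assumption that fits public Azure endpoints but may not fit every setup.

```python
from urllib.parse import urlparse


def endpoint_problems(name: str, endpoint: str) -> list[str]:
    """Return a list of problems with an endpoint URL; empty means it looks well-formed."""
    if not endpoint:
        return [f"{name}: endpoint is empty"]

    parsed = urlparse(endpoint)
    problems = []
    if parsed.scheme != "https":
        problems.append(f"{name}: expected an https:// URL, got {endpoint!r}")
    if not parsed.netloc:
        problems.append(f"{name}: no host found in {endpoint!r}")
    return problems


print(endpoint_problems("Document Intelligence", "https://my-di.cognitiveservices.azure.com"))
print(endpoint_problems("OpenAI", "my-openai.openai.azure.com"))
```

This only checks the shape of the URL. It does not confirm that the service exists or that the key is valid.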

## Summary

You've learned:

✅ Three methods to configure ingestor (env, programmatic, hybrid)  
✅ Input configuration (local files, blob storage)  
✅ Chunking modes and overlap settings  
✅ Document Intelligence and table processing  
✅ Embedding and search configuration  
✅ Artifacts management  
✅ Configuration validation  

## Next Steps

- **03_advanced_features.ipynb**: Advanced chunking strategies and custom processors
- **04_real_world_legal.ipynb**: Configuration for legal documents
- **07_performance_tuning.ipynb**: Optimize configuration for scale