# 🚀 Ingestor Quick Playbook - All Usage Methods

This notebook shows **ALL** the different ways to run the ingestor library:

## Methods Covered

1. **🐍 Python API** - Import and use programmatically
2. **💻 CLI** - Command-line interface
3. **🖥️ UI (Gradio)** - Web interface
4. **📦 Batch Processing** - Multiple documents
5. **⚙️ Configuration** - All the ways to configure

## Quick Links

- [Method 1: Python API (Programmatic)](#method-1-python-api)
- [Method 2: CLI (Command Line)](#method-2-cli)
- [Method 3: UI (Gradio Web Interface)](#method-3-ui)
- [Common Scenarios](#common-scenarios)
- [Configuration Guide](#configuration-guide)

---

# Setup: Load Environment

All methods need Azure credentials. Set them up once:

In [None]:
import os
from pathlib import Path
from dotenv import load_dotenv

# Load .env file
env_path = Path("../../.env")
if env_path.exists():
    load_dotenv(dotenv_path=env_path)
    print(f"✅ Loaded environment from: {env_path.absolute()}")
else:
    load_dotenv()
    print("✅ Loaded environment from default location")

# Verify
print(f"\nAzure Search Service: {os.getenv('AZURE_SEARCH_SERVICE', 'NOT SET')}")
print(f"Azure Search Index: {os.getenv('AZURE_SEARCH_INDEX', 'NOT SET')}")
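
The two `print` calls above only display values. A minimal sketch of a stricter check follows, assuming the four variables listed are the ones you need; the names come from the `.env` example in this notebook, but treating all of them as required is an assumption, and `missing_vars` is a hypothetical helper rather than part of the ingestor.

```python
import os
from typing import Mapping

# Names taken from this notebook's .env example. Treating every one of
# them as required is an assumption, not something the ingestor documents.
REQUIRED_VARS = (
    "AZURE_SEARCH_SERVICE",
    "AZURE_SEARCH_INDEX",
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_vars()
if missing:
    print("Missing settings: " + ", ".join(missing))
else:
    print("All required settings are present")
```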

---

# Method 1: Python API (Programmatic)

Use the ingestor as a Python library in your code.

## 1.1: Simple One-Liner (Fastest)

Process a document with one function call:

In [None]:
from ingestor import run_pipeline

# Process a single document
status = await run_pipeline(
    input_glob="../../samples/sample_pages_test.pdf"
)

print(f"✅ Processed {status.successful_documents} document(s)")
print(f"📄 Indexed {status.total_chunks_indexed} chunks")

## 1.2: With Performance Optimizations

Enable parallel processing for faster execution:

In [None]:
from ingestor import run_pipeline

# Optimized processing
status = await run_pipeline(
    input_glob="../../samples/*.pdf",
    
    # Performance settings
    performance_max_workers=4,              # Process 4 docs in parallel
    azure_openai_max_concurrency=10,        # Parallel embedding batches
    azure_di_max_concurrency=5,             # Parallel DI requests
    use_integrated_vectorization=True       # Server-side embeddings (fastest!)
)

print(f"🚀 Optimized processing complete!")
print(f"Documents: {status.successful_documents}")
print(f"Chunks: {status.total_chunks_indexed}")

## 1.3: Using create_config() Helper

Create a configuration object for more control:

In [None]:
from ingestor import create_config, Pipeline

# Create config
config = create_config(
    input_glob="../../samples/*.pdf",
    azure_search_index="my-custom-index",
    
    # Chunking settings
    chunking_max_tokens=1000,
    chunking_overlap_percent=15,
    
    # Performance
    performance_max_workers=4,
    use_integrated_vectorization=True
)

# Run pipeline
pipeline = Pipeline(config)
try:
    status = await pipeline.run()
    print(f"✅ Custom config processing complete!")
    print(f"Chunks: {status.total_chunks_indexed}")
finally:
    await pipeline.close()
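
The `try`/`finally` pattern above recurs in every cell that builds a `Pipeline`. One way to avoid repeating it is an async context manager, sketched below. `closing_pipeline` and `FakePipeline` are hypothetical names introduced for illustration; the sketch assumes only what the cell above shows, namely that the pipeline object has awaitable `run()` and `close()` methods.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def closing_pipeline(pipeline):
    """Yield the pipeline and always await its close() on exit."""
    try:
        yield pipeline
    finally:
        await pipeline.close()


class FakePipeline:
    """Hypothetical stand-in so the sketch runs without Azure access."""

    def __init__(self):
        self.closed = False

    async def run(self):
        return "done"

    async def close(self):
        self.closed = True


async def demo():
    fake = FakePipeline()
    async with closing_pipeline(fake) as pipeline:
        result = await pipeline.run()
    # close() has already been awaited by the time we get here.
    return result, fake.closed
```

In a notebook cell the same helper would be used as `async with closing_pipeline(Pipeline(config)) as pipeline:` followed by `status = await pipeline.run()`.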

## 1.4: Using ConfigBuilder (Fluent API)

Build configuration programmatically with chainable methods:

In [None]:
import os

from ingestor import ConfigBuilder, Pipeline

# Build config with fluent API
config = (
    ConfigBuilder()
    .with_search(
        service_name="your-service",
        index_name="documents-index"
    )
    .with_document_intelligence(
        endpoint=os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"),
        max_concurrency=5
    )
    .with_azure_openai(
        endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        embedding_deployment="text-embedding-ada-002",
        max_concurrency=10
    )
    .with_input(
        mode="local",
        local_glob="../../samples/*.pdf"
    )
    .with_performance(
        max_workers=4,
        use_integrated_vectorization=True
    )
    .build()
)

# Run pipeline
pipeline = Pipeline(config)
try:
    status = await pipeline.run()
    print(f"✅ ConfigBuilder processing complete!")
    print(f"Documents: {status.successful_documents}")
finally:
    await pipeline.close()

## 1.5: From Environment Variables

Load all configuration from `.env` file:

In [None]:
from ingestor import PipelineConfig, Pipeline
from dotenv import load_dotenv

# Load environment
load_dotenv()

# Create config from environment
config = PipelineConfig.from_env()

print(f"Configuration loaded from environment:")
print(f"  Index: {config.search.index_name}")
print(f"  Max workers: {config.performance.max_workers}")
print(f"  Integrated vectorization: {config.use_integrated_vectorization}")

# Run pipeline
pipeline = Pipeline(config)
try:
    status = await pipeline.run()
    print(f"\n✅ Environment-based processing complete!")
    print(f"Documents: {status.successful_documents}")
finally:
    await pipeline.close()

---

# Method 2: CLI (Command Line Interface)

Run the ingestor from the command line without writing Python code.

## 2.1: Basic CLI Usage

Process a document from the command line:

In [None]:
# Run from terminal:
# python -m ingestor.cli --input "samples/*.pdf"

# Or use the convenience wrapper:
!python ../../examples/scripts/run_cli.py --input "../../samples/sample_pages_test.pdf"

## 2.2: CLI with All Options

Full CLI command with all available options:

In [None]:
# Full CLI command example (run in terminal)
"""
python -m ingestor.cli \
  --input "documents/**/*.pdf" \
  --env-path ".env" \
  --search-index "documents-index" \
  --max-workers 4 \
  --openai-concurrency 10 \
  --di-concurrency 5 \
  --integrated-vectorization \
  --chunking-max-tokens 1000 \
  --action add
"""

print("Copy the above command to your terminal")

## 2.3: CLI Common Commands

Common CLI patterns:

In [None]:
# === Process single document ===
# python -m ingestor.cli --input "document.pdf"

# === Process all PDFs in directory ===
# python -m ingestor.cli --input "documents/*.pdf"

# === Process recursively ===
# python -m ingestor.cli --input "documents/**/*.pdf"

# === With custom index ===
# python -m ingestor.cli --input "docs/*.pdf" --search-index "my-index"

# === With performance optimizations ===
# python -m ingestor.cli --input "docs/*.pdf" --max-workers 4 --integrated-vectorization

# === Remove documents ===
# python -m ingestor.cli --input "document.pdf" --action remove

# === Remove all documents ===
# python -m ingestor.cli --action remove-all

print("CLI command examples above")
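
The commands above can also be assembled in Python, which helps when a script needs to launch the CLI with varying options. The sketch below uses only flags that appear in this notebook; `build_cli_command` is a hypothetical helper, not something the ingestor provides.

```python
import shlex


def build_cli_command(input_glob, *, python="python", env_path=None,
                      search_index=None, max_workers=None,
                      integrated_vectorization=False):
    """Build the argument list for `python -m ingestor.cli`."""
    cmd = [python, "-m", "ingestor.cli", "--input", input_glob]
    if env_path is not None:
        cmd += ["--env-path", env_path]
    if search_index is not None:
        cmd += ["--search-index", search_index]
    if max_workers is not None:
        cmd += ["--max-workers", str(max_workers)]
    if integrated_vectorization:
        cmd.append("--integrated-vectorization")
    return cmd


cmd = build_cli_command("docs/**/*.pdf", max_workers=4,
                        integrated_vectorization=True)
# shlex.join quotes the glob so it is safe to paste into a POSIX shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` runs the command without a shell, so the glob reaches the CLI unexpanded, just as the quoted patterns above do.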

## 2.4: CLI Environment Files

Use different environment files with CLI:

In [None]:
# === Use default .env ===
# python -m ingestor.cli --input "docs/*.pdf"

# === Use custom env file ===
# python -m ingestor.cli --input "docs/*.pdf" --env-path ".env.production"

# === Use development settings ===
# python -m ingestor.cli --input "docs/*.pdf" --env-path ".env.development"

# === Override specific settings ===
# python -m ingestor.cli --input "docs/*.pdf" --env-path ".env" --max-workers 8

print("Environment file examples above")

---

# Method 3: UI (Gradio Web Interface)

Use the web-based UI for interactive document processing.

## 3.1: Launch Gradio UI

Start the web interface:

In [None]:
# Launch UI from terminal:
# python -m ingestor.gradio_app

# Or with custom port:
# python -m ingestor.gradio_app --port 7860

# Or launch from Python:
# from ingestor.gradio_app import create_interface
# demo = create_interface()
# demo.launch()

print("""\nTo launch the UI:

1. Open a terminal
2. Run: python -m ingestor.gradio_app
3. Open browser to: http://localhost:7860

The UI will let you:
- Upload documents via drag-and-drop
- Configure all settings visually
- Monitor processing progress
- View results in real-time
""")

## 3.2: UI Features

What you can do in the Gradio UI:

In [None]:
# Gradio UI Features:

ui_features = """
📤 Upload Documents:
   - Drag and drop files
   - Supports: PDF, DOCX, PPTX, TXT, MD, CSV, JSON
   - Multiple file upload

⚙️ Configure Settings:
   - Search service and index
   - Document action (add/remove/remove-all)
   - Chunking parameters
   - Performance settings
   - Concurrency limits

🔄 Process Documents:
   - Real-time progress updates
   - Per-document status
   - Error messages

📊 View Results:
   - Success/failure counts
   - Chunks indexed
   - Processing times
   - Detailed logs

💾 Load/Save Configs:
   - Load from .env file
   - Save current settings
   - Quick presets
"""

print(ui_features)

## 3.3: UI with Custom Configuration

Launch UI with pre-configured settings:

In [None]:
# Launch UI with custom config (in Python script):

from ingestor.gradio_app import create_interface
from ingestor import create_config

# Create default config
config = create_config(
    azure_search_index="my-custom-index",
    performance_max_workers=4,
    use_integrated_vectorization=True
)

# Launch UI with this config as default
# demo = create_interface(default_config=config)
# demo.launch()

print("UI will launch with custom default settings")

---

# Common Scenarios

Real-world examples for common use cases.

## Scenario 1: One-Time Document Ingestion

Process a single document quickly:

In [None]:
from ingestor import run_pipeline

# Simplest way - just specify the file
status = await run_pipeline(input_glob="../../samples/sample_pages_test.pdf")

print(f"✅ Done! Indexed {status.total_chunks_indexed} chunks")

## Scenario 2: Batch Processing Multiple Documents

Process an entire directory with optimization:

In [None]:
from ingestor import run_pipeline
import time

start = time.time()

# Process all documents in parallel
status = await run_pipeline(
    input_glob="../../documents/**/*.pdf",
    performance_max_workers=4,              # 4 docs in parallel
    azure_openai_max_concurrency=10,        # Fast embeddings
    use_integrated_vectorization=True       # Server-side (fastest)
)

elapsed = time.time() - start

print(f"\n📊 Batch Results:")
print(f"Documents: {status.successful_documents} succeeded, {status.failed_documents} failed")
print(f"Chunks: {status.total_chunks_indexed}")
print(f"Time: {elapsed:.2f}s")
print(f"Throughput: {status.total_chunks_indexed / elapsed:.2f} chunks/sec")

# Show per-document times
print(f"\nPer-document breakdown:")
for result in status.results:
    icon = "✅" if result.success else "❌"
    print(f"  {icon} {result.filename}: {result.processing_time_seconds:.2f}s")
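
For large batches, the per-document list above gets long. A small sketch of summarizing it instead follows. `DocResult` is a stand-in dataclass exposing only the attributes the cell above reads (`filename`, `success`, `processing_time_seconds`); the real result type may carry more fields, and `slowest` and `success_rate` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class DocResult:
    """Stand-in with the attributes read from status.results above."""
    filename: str
    success: bool
    processing_time_seconds: float


def slowest(results, n=3):
    """Return the n results with the longest processing time."""
    ordered = sorted(results, key=lambda r: r.processing_time_seconds,
                     reverse=True)
    return ordered[:n]


def success_rate(results):
    """Fraction of results that succeeded; 0.0 for an empty batch."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.success) / len(results)


sample = [
    DocResult("a.pdf", True, 2.5),
    DocResult("b.pdf", False, 9.0),
    DocResult("c.pdf", True, 4.0),
    DocResult("d.pdf", True, 1.0),
]
for result in slowest(sample, n=2):
    print(f"{result.filename}: {result.processing_time_seconds:.2f}s")
print(f"Success rate: {success_rate(sample):.0%}")
```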

## Scenario 3: Update Existing Documents

Re-process documents that are already indexed:

In [None]:
from ingestor import run_pipeline

# The pipeline automatically deletes old chunks before adding new ones
# This ensures a clean update (no duplicates)
status = await run_pipeline(
    input_glob="../../samples/updated_document.pdf",
    document_action="add"  # Default: deletes old + adds new
)

print(f"✅ Document updated successfully")
print(f"   New chunks indexed: {status.total_chunks_indexed}")

## Scenario 4: Remove Documents from Index

Delete specific documents:

In [None]:
from ingestor import run_pipeline

# Remove specific document(s)
status = await run_pipeline(
    input_glob="../../samples/old_document.pdf",
    document_action="remove"  # Only remove, don't add
)

print(f"✅ Documents removed from index")

# Remove ALL documents (be careful!)
# status = await run_pipeline(document_action="remove_all")

## Scenario 5: Different Environments (Dev/Staging/Prod)

Use different configurations for different environments:

In [None]:
from dotenv import load_dotenv
from ingestor import PipelineConfig, Pipeline

# Select environment
environment = "development"  # or "staging", "production"

# Load appropriate .env file
env_file = f"../../.env.{environment}"
load_dotenv(dotenv_path=env_file, override=True)

# Create config from environment
config = PipelineConfig.from_env()

print(f"Configuration for: {environment}")
print(f"  Index: {config.search.index_name}")
print(f"  Max workers: {config.performance.max_workers}")

# Process with environment-specific config
pipeline = Pipeline(config)
try:
    status = await pipeline.run()
    print(f"\n✅ Processed in {environment} environment")
finally:
    await pipeline.close()
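
`load_dotenv` does not raise when the file it is pointed at is missing, so a typo in `environment` can go unnoticed and leave whatever variables were already set in place. A minimal sketch of failing fast follows; `env_file_for` is a hypothetical helper, and the three environment names are the ones mentioned in the cell above.

```python
from pathlib import Path

KNOWN_ENVIRONMENTS = ("development", "staging", "production")


def env_file_for(environment: str, root: Path) -> Path:
    """Return root/.env.<environment>, raising if the name or file is bad."""
    if environment not in KNOWN_ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {environment!r}")
    path = root / f".env.{environment}"
    if not path.is_file():
        raise FileNotFoundError(f"Missing environment file: {path}")
    return path
```

The returned path can then be passed to `load_dotenv(dotenv_path=..., override=True)` exactly as in the cell above.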

## Scenario 6: Custom Chunking Strategy

Fine-tune how documents are split into chunks:

In [None]:
from ingestor import run_pipeline

# Custom chunking for long-form content
status = await run_pipeline(
    input_glob="../../samples/long_document.pdf",
    
    # Chunking configuration
    chunking_max_tokens=1500,       # Larger chunks
    chunking_overlap_percent=20,     # More overlap for context
    chunking_cross_page_overlap=True # Preserve context across pages
)

print(f"✅ Processed with custom chunking")
print(f"   Chunks: {status.total_chunks_indexed}")
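
To get a feel for what these numbers imply, the sketch below works through the arithmetic under two assumptions that this notebook does not confirm: the overlap is a percentage of `chunking_max_tokens`, and chunks are cut as a fixed-size sliding window. A real chunker that respects page or sentence boundaries will produce different counts, so treat the result as a rough estimate.

```python
import math


def overlap_tokens(max_tokens: int, overlap_percent: int) -> int:
    """Tokens shared by two consecutive chunks (assumed: percent of max)."""
    return max_tokens * overlap_percent // 100


def estimate_chunks(total_tokens: int, max_tokens: int,
                    overlap_percent: int) -> int:
    """Estimate chunk count for a fixed-size sliding window."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= max_tokens:
        return 1
    stride = max_tokens - overlap_tokens(max_tokens, overlap_percent)
    if stride <= 0:
        raise ValueError("overlap must be smaller than the chunk size")
    # One full first chunk, then one more chunk per stride of new tokens.
    return 1 + math.ceil((total_tokens - max_tokens) / stride)


# Settings from the cell above: 1500-token chunks with 20% overlap.
print(overlap_tokens(1500, 20))           # 300
print(estimate_chunks(10_000, 1500, 20))  # 9
```

With the defaults used elsewhere in this notebook (1000 tokens, 15% overlap) the same 10,000-token document comes out at an estimated 12 chunks.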

## Scenario 7: Process Office Documents

Handle DOCX and PPTX files:

In [None]:
from ingestor import run_pipeline

# Process Office documents (DOCX, PPTX)
status = await run_pipeline(
    input_glob="../../documents/**/*.{docx,pptx}",
    
    # Office-specific settings
    office_extractor_mode="hybrid",          # Try Azure DI first, fallback to offline
    office_extractor_offline_fallback=True   # Enable fallback
)

print(f"✅ Processed Office documents")
print(f"   Documents: {status.successful_documents}")
print(f"   Chunks: {status.total_chunks_indexed}")
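
One caveat about the `{docx,pptx}` pattern above: Python's standard `glob` module does not expand brace alternatives. Whether the ingestor expands them itself is not shown in this notebook, so if the pattern matches nothing, expanding it by hand is a safe fallback. The sketch below handles non-nested groups only; `expand_braces` and `glob_with_braces` are hypothetical helpers.

```python
import re
from glob import glob

# Matches one {a,b,c} group that contains no nested braces.
_BRACES = re.compile(r"\{([^{}]*)\}")


def expand_braces(pattern: str) -> list[str]:
    """Expand {a,b} groups into one plain glob pattern per combination."""
    match = _BRACES.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def glob_with_braces(pattern: str) -> list[str]:
    """Return the sorted, de-duplicated matches of every expanded pattern."""
    files = set()
    for sub_pattern in expand_braces(pattern):
        files.update(glob(sub_pattern, recursive=True))
    return sorted(files)


print(expand_braces("../../documents/**/*.{docx,pptx}"))
```

Each expanded pattern could then be passed to `run_pipeline(input_glob=...)` in turn.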

## Scenario 8: Monitoring and Error Handling

Track progress and handle failures:

In [None]:
from ingestor import run_pipeline
import time

print("🔄 Starting batch processing...\n")
start = time.time()

try:
    status = await run_pipeline(
        input_glob="../../documents/**/*.pdf",
        performance_max_workers=4
    )
    
    elapsed = time.time() - start
    
    # Success summary
    print(f"\n✅ Processing complete in {elapsed:.2f}s")
    print(f"   Success: {status.successful_documents}")
    print(f"   Failed: {status.failed_documents}")
    print(f"   Chunks: {status.total_chunks_indexed}")
    
    # Handle failures
    if status.failed_documents > 0:
        print(f"\n⚠️ Failed documents:")
        for result in status.results:
            if not result.success:
                print(f"   ❌ {result.filename}")
                print(f"      Error: {result.error_message}")
        
        # Retry failed documents with conservative settings
        print(f"\n🔄 Retrying failed documents...")
        failed_files = [r.filename for r in status.results if not r.success]
        
        for file in failed_files:
            try:
                retry_status = await run_pipeline(
                    input_glob=file,
                    performance_max_workers=1,
                    azure_openai_max_concurrency=3
                )
                if retry_status.successful_documents > 0:
                    print(f"   ✅ Retry succeeded: {file}")
            except Exception as e:
                print(f"   ❌ Retry failed: {file}: {e}")
    
except Exception as e:
    print(f"\n❌ Pipeline error: {e}")
    import traceback
    traceback.print_exc()
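
The retry loop above makes a single second attempt per file. For transient failures such as rate limits, a common alternative is to retry several times with a growing delay. The sketch below shows that pattern in isolation; `retry_async` is a hypothetical helper, and it only retries when the awaited call raises, so it does not cover runs that return a status with `failed_documents > 0`.

```python
import asyncio


async def retry_async(operation, *, attempts=3, base_delay=1.0):
    """Await operation() up to `attempts` times, doubling the delay."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


# In a notebook cell, with run_pipeline imported as above:
# status = await retry_async(
#     lambda: run_pipeline(input_glob=file, performance_max_workers=1)
# )
```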

---

# Configuration Guide

All the ways to configure the ingestor.

## Configuration Method 1: Environment File (.env)

Store settings in a `.env` file:


In [None]:
# Example .env file

env_template = """
# === Azure Services ===
AZURE_SEARCH_SERVICE=your-service
AZURE_SEARCH_INDEX=documents-index
AZURE_SEARCH_KEY=your-key

AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-di.cognitiveservices.azure.com/
AZURE_DOCUMENT_INTELLIGENCE_KEY=your-key

AZURE_OPENAI_ENDPOINT=https://your-openai.openai.azure.com
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002
AZURE_OPENAI_KEY=your-key

# === Performance Optimizations ===
MAX_WORKERS=4
AZURE_OPENAI_MAX_CONCURRENCY=10
AZURE_DI_MAX_CONCURRENCY=5
AZURE_USE_INTEGRATED_VECTORIZATION=true

# === Input/Output ===
INPUT_MODE=local
LOCAL_GLOB=documents/**/*.pdf
ARTIFACTS_MODE=blob

# === Chunking ===
CHUNKING_MAX_TOKENS=1000
CHUNKING_OVERLAP_PERCENT=15
"""

print("Save this as .env file:")
print(env_template)
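
Every value in a `.env` file is read as a string, so numeric and boolean settings such as `MAX_WORKERS` and `AZURE_USE_INTEGRATED_VECTORIZATION` have to be converted somewhere. How the ingestor does this is not shown here; the sketch below only illustrates the idea with a deliberately simple `KEY=VALUE` parser. Both helpers are hypothetical, and the accepted boolean spellings are an assumption. For real files, `dotenv_values` from python-dotenv handles quoting and other edge cases.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Not a KEY=VALUE line: {line!r}")
        values[key.strip()] = value.strip()
    return values


def parse_bool(value: str) -> bool:
    """Convert common true/false spellings (assumed set) to a bool."""
    text = value.strip().lower()
    if text in {"1", "true", "yes", "on"}:
        return True
    if text in {"0", "false", "no", "off"}:
        return False
    raise ValueError(f"Not a boolean: {value!r}")


sample = """
# === Performance Optimizations ===
MAX_WORKERS=4
AZURE_USE_INTEGRATED_VECTORIZATION=true
"""
settings = parse_env_text(sample)
max_workers = int(settings["MAX_WORKERS"])
integrated = parse_bool(settings["AZURE_USE_INTEGRATED_VECTORIZATION"])
print(max_workers, integrated)
```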

## Configuration Method 2: Programmatic (Python)

Set configuration in code:

In [None]:
from ingestor import run_pipeline

# All settings as function parameters
status = await run_pipeline(
    # Input
    input_glob="../../documents/*.pdf",
    
    # Azure services
    azure_search_service="your-service",
    azure_search_index="documents-index",
    azure_search_key="your-key",
    
    # Performance
    performance_max_workers=4,
    azure_openai_max_concurrency=10,
    azure_di_max_concurrency=5,
    use_integrated_vectorization=True,
    
    # Chunking
    chunking_max_tokens=1000,
    chunking_overlap_percent=15,
    
    # Action
    document_action="add"
)

print(f"✅ Configured programmatically")

## Configuration Method 3: Hybrid (Env + Overrides)

Load from `.env` and override specific values:

In [None]:
from ingestor import create_config

# Load from .env and override specific settings
config = create_config(
    env_path="../../.env",               # Load from .env
    use_env=True,                         # Enable env loading
    
    # Override specific settings
    azure_search_index="custom-index",   # Override index
    performance_max_workers=8,            # Override max_workers
    chunking_max_tokens=1500              # Override chunking
)

print(f"✅ Hybrid configuration created")
print(f"   Index: {config.search.index_name}")
print(f"   Max workers: {config.performance.max_workers}")
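
The precedence the hybrid approach implies (explicit overrides first, then environment values, then defaults) can be illustrated with the standard library. This is a sketch of the idea only, not of how `create_config` is implemented; `resolve_settings` is a hypothetical helper and the default values shown are made up for the example.

```python
from collections import ChainMap


def resolve_settings(overrides, env, defaults):
    """Overrides win over env values, which win over defaults."""
    # Drop overrides left as None so they do not mask lower layers.
    explicit = {k: v for k, v in overrides.items() if v is not None}
    return dict(ChainMap(explicit, env, defaults))


# Hypothetical values for illustration only.
defaults = {"AZURE_SEARCH_INDEX": "documents-index",
            "MAX_WORKERS": "1",
            "CHUNKING_MAX_TOKENS": "1000"}
env = {"AZURE_SEARCH_INDEX": "env-index", "MAX_WORKERS": "4"}
overrides = {"MAX_WORKERS": "8", "CHUNKING_MAX_TOKENS": None}

resolved = resolve_settings(overrides, env, defaults)
for key in sorted(resolved):
    print(f"{key}={resolved[key]}")
```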

---

# Quick Reference

Cheat sheet for common operations.

## Python API Quick Reference

In [None]:
python_reference = """
# === Simple Usage ===
from ingestor import run_pipeline
status = await run_pipeline(input_glob="docs/*.pdf")

# === With Optimizations ===
status = await run_pipeline(
    input_glob="docs/*.pdf",
    performance_max_workers=4,
    azure_openai_max_concurrency=10,
    use_integrated_vectorization=True
)

# === From Environment ===
from ingestor import PipelineConfig, Pipeline
config = PipelineConfig.from_env()
pipeline = Pipeline(config)
status = await pipeline.run()
await pipeline.close()

# === ConfigBuilder ===
from ingestor import ConfigBuilder
config = ConfigBuilder().with_search(...).with_input(...).build()
"""

print(python_reference)

## CLI Quick Reference

In [None]:
# Raw string: keeps the trailing backslashes and line breaks when printed
cli_reference = r"""
# === Basic ===
python -m ingestor.cli --input "docs/*.pdf"

# === With Options ===
python -m ingestor.cli \
  --input "docs/**/*.pdf" \
  --max-workers 4 \
  --integrated-vectorization

# === Custom Env ===
python -m ingestor.cli \
  --input "docs/*.pdf" \
  --env-path ".env.production"

# === Remove ===
python -m ingestor.cli --input "doc.pdf" --action remove
python -m ingestor.cli --action remove-all
"""

print(cli_reference)

## UI Quick Reference

In [None]:
ui_reference = """
# === Launch UI ===
python -m ingestor.gradio_app

# === Custom Port ===
python -m ingestor.gradio_app --port 7860

# === From Python ===
from ingestor.gradio_app import create_interface
demo = create_interface()
demo.launch()

# Then open: http://localhost:7860
"""

print(ui_reference)

---

# Summary

You now know **ALL** the ways to use the ingestor:

## Methods

✅ **Python API** - Import and use programmatically  
✅ **CLI** - Command-line interface for scripts  
✅ **UI** - Gradio web interface for interactive use  

## Configuration

✅ **.env files** - Recommended for production  
✅ **Programmatic** - Set in Python code  
✅ **ConfigBuilder** - Fluent API for building config  
✅ **Hybrid** - Mix environment + overrides  

## Features

✅ **Parallel processing** - 75-85% faster  
✅ **Batch operations** - Multiple documents  
✅ **Error handling** - Graceful failures + retry  
✅ **Progress tracking** - Real-time monitoring  
✅ **Multiple environments** - Dev/staging/prod  

## Next Steps

- **[01_quickstart.ipynb](01_quickstart.ipynb)** - Get started quickly
- **[08_batch_processing.ipynb](08_batch_processing.ipynb)** - Batch processing deep dive
- **[07_performance_tuning.ipynb](07_performance_tuning.ipynb)** - Optimization guide
- **[QUICK_START_OPTIMIZATIONS.md](../../QUICK_START_OPTIMIZATIONS.md)** - Performance tips