# Ingestor Quick Start Guide

This notebook demonstrates the basics of using the ingestor library to process documents and index them into Azure AI Search.

## What You'll Learn

- Installing and importing ingestor
- Processing your first document
- Using convenience functions
- Viewing results

## Prerequisites

- Azure AI Search service
- Azure Document Intelligence service
- Azure OpenAI service (for embeddings)

## Step 1: Installation

If you haven't installed ingestor yet, uncomment the `pip` line below and run this cell:

In [None]:
# Install ingestor (uncomment if needed)
# !pip install -e ../..

## Step 2: Set Up Environment Variables

Create a `.env` file in the project root with your Azure credentials, or set them here:

In [None]:
import os
from pathlib import Path

# Option 1: Load from .env file (RECOMMENDED)
from dotenv import load_dotenv

# Load .env from project root or specify custom path
env_path = Path("../../.env")  # Adjust path as needed
if env_path.exists():
    load_dotenv(dotenv_path=env_path)
    print(f"✅ Loaded environment from: {env_path.absolute()}")
else:
    load_dotenv()  # Try default locations
    print("✅ Loaded environment from default location")

# Option 2: Load environment-specific .env files
# load_dotenv(dotenv_path="../../.env.production")
# load_dotenv(dotenv_path="../../.env.development")

# Verify critical variables are set
required_vars = [
    "AZURE_SEARCH_SERVICE",
    "AZURE_SEARCH_INDEX",
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT"
]

missing_vars = [var for var in required_vars if not os.getenv(var)]
if missing_vars:
    print(f"\n⚠️  Warning: Missing environment variables: {missing_vars}")
    print("   Set them in your .env file or manually below")
else:
    print("✅ All required environment variables are set")

# Option 3: Set manually (for testing only - never commit credentials!)
# os.environ["AZURE_SEARCH_SERVICE"] = "your-service"
# os.environ["AZURE_SEARCH_KEY"] = "your-key"
# os.environ["AZURE_SEARCH_INDEX"] = "documents-index"
# os.environ["AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"] = "https://your-di.cognitiveservices.azure.com/"
# os.environ["AZURE_DOCUMENT_INTELLIGENCE_KEY"] = "your-key"
# os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-openai.openai.azure.com"
# os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"] = "text-embedding-ada-002"
# os.environ["AZURE_OPENAI_KEY"] = "your-key"
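
When checking which variables were picked up, it helps to avoid echoing secrets into notebook output. The following is a minimal sketch using only the standard library; `mask_secret` and `describe_env` are hypothetical helper names and are not part of ingestor:

```python
import os

def mask_secret(value: str, visible: int = 4) -> str:
    """Return value with all but the last `visible` characters replaced by '*'."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

def describe_env(names, env=None):
    """Map each variable name to a masked value, or None when it is unset or empty."""
    env = os.environ if env is None else env
    return {name: (mask_secret(env[name]) if env.get(name) else None) for name in names}

# Example with an explicit mapping instead of the real environment (values are made up)
sample_env = {"AZURE_SEARCH_SERVICE": "my-search", "AZURE_SEARCH_KEY": "abcd1234efgh"}
names = ["AZURE_SEARCH_SERVICE", "AZURE_SEARCH_KEY", "AZURE_OPENAI_ENDPOINT"]
for name, shown in describe_env(names, sample_env).items():
    print(f"{name}: {shown if shown is not None else 'MISSING'}")
```

Calling `describe_env(required_vars)` without the second argument reads from `os.environ`, so it can be run right after `load_dotenv()`.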

## Step 3: Import Ingestor

Import the main functions you'll need:

In [None]:
from ingestor import run_pipeline, create_config
from ingestor.config import InputMode

print("✅ Ingestor imported successfully")

## Step 4: Process Your First Document

The simplest way to process documents is using the `run_pipeline` convenience function:

In [None]:
# Process the sample PDFs (the glob matches every .pdf file in the folder)
status = await run_pipeline(
    input_glob="../../sample_documents/*.pdf"
)

print("\n✅ Processing complete!")
print(f"Documents processed: {status.successful_documents}")
print(f"Chunks indexed: {status.total_chunks_indexed}")

## Step 5: View Detailed Results

Inspect per-document results:

In [None]:
import pandas as pd

# Convert results to DataFrame for easy viewing
results_data = []
for result in status.results:
    results_data.append({
        "Filename": result.filename,
        "Success": "✅" if result.success else "❌",
        "Chunks": result.chunks_indexed,
        "Duration (s)": f"{result.processing_time_seconds:.2f}",
        "Error": result.error_message or "-"
    })

df = pd.DataFrame(results_data)
df
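
If pandas is not available, the same per-document fields can be aggregated with plain Python. This is a sketch under assumptions: `DocResult` is a hypothetical stand-in that mirrors only the attributes read in the cell above, and `summarize` is an illustrative helper, not an ingestor function:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DocResult:
    """Hypothetical stand-in mirroring the fields read from status.results above."""
    filename: str
    success: bool
    chunks_indexed: int
    processing_time_seconds: float
    error_message: Optional[str] = None

def summarize(results):
    """Aggregate per-document results into totals and a list of failures."""
    failures = [(r.filename, r.error_message or "unknown error") for r in results if not r.success]
    return {
        "documents": len(results),
        "chunks": sum(r.chunks_indexed for r in results if r.success),
        "total_seconds": sum(r.processing_time_seconds for r in results),
        "failures": failures,
    }

# Made-up sample data, only to exercise the helper
sample = [
    DocResult("a.pdf", True, 12, 3.5),
    DocResult("b.pdf", False, 0, 1.0, "unsupported format"),
]
summary = summarize(sample)
print(summary["chunks"], summary["failures"])
```

Because `summarize` relies only on attribute names, it should also accept the objects in `status.results`, provided they expose the same attributes.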

## Step 6: Process Multiple Documents

Process an entire directory of documents:

In [None]:
# Process all PDFs in a directory tree with parallel processing
# Several documents are processed concurrently to increase throughput

status = await run_pipeline(
    input_glob="../../documents/**/*.pdf",  # Recursive glob
    # Performance optimizations (NEW in v4.0)
    performance_max_workers=4,              # Process 4 docs in parallel
    azure_openai_max_concurrency=10,        # Parallel embedding batches
    azure_di_max_concurrency=5,             # Parallel DI requests
    use_integrated_vectorization=True       # Server-side embeddings (fastest!)
)

print("\n📊 Batch Processing Results:")
print(f"Successful: {status.successful_documents}")
print(f"Failed: {status.failed_documents}")
print(f"Total chunks: {status.total_chunks_indexed}")

# Show per-document processing times
print(f"\nPer-document results:")
for result in status.results:
    status_icon = "✅" if result.success else "❌"
    print(f"  {status_icon} {result.filename}: {result.processing_time_seconds:.2f}s ({result.chunks_indexed} chunks)")
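
To build intuition for what a setting like `performance_max_workers=4` means, here is a generic sketch of bounded concurrency with `asyncio.Semaphore`. It is not ingestor's implementation, which is not shown in this notebook; `process_all` and `demo_worker` are hypothetical names:

```python
import asyncio

async def process_all(items, worker, max_workers=4):
    """Run worker(item) for every item with at most max_workers running concurrently."""
    semaphore = asyncio.Semaphore(max_workers)

    async def bounded(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(item) for item in items))

# Demo worker that records the peak number of simultaneous tasks
running = 0
peak = 0

async def demo_worker(name):
    global running, peak
    running += 1
    peak = max(peak, running)
    await asyncio.sleep(0.01)  # simulate I/O such as a remote analysis call
    running -= 1
    return name.upper()

async def demo():
    return await process_all([f"doc{i}.pdf" for i in range(10)], demo_worker, max_workers=4)

# In a script use asyncio.run; inside a notebook cell, `await demo()` instead
demo_results = asyncio.run(demo())
print(demo_results[:2], "peak concurrency:", peak)
```

The semaphore caps how many workers are active at once, which is the usual reason to tune such limits: higher values raise throughput until a service rate limit is reached.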

## Step 7: Using Custom Configuration

For more control, create a custom configuration:

In [None]:
# Create custom config
config = create_config(
    input_glob="../../sample_documents/*.pdf",
    azure_search_index="my-custom-index"
)

# Customize chunking settings
config.chunking.target_chunk_size = 1000
config.chunking.chunk_overlap = 200

# Run with custom config
status = await run_pipeline(config=config)

print("✅ Processed with custom settings")
print(f"Chunks indexed: {status.total_chunks_indexed}")
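
The interaction between a target chunk size and an overlap can be illustrated with a simple sliding window. This sketch counts characters and ignores document structure, which is an assumption made for clarity; ingestor may measure size in tokens and split on layout boundaries. `chunk_text` is a hypothetical function name:

```python
def chunk_text(text: str, target_chunk_size: int = 1000, chunk_overlap: int = 200):
    """Split text into windows of target_chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < target_chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than target_chunk_size")
    step = target_chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + target_chunk_size])
        if start + target_chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample_text = "x" * 2600
chunks = chunk_text(sample_text)
print([len(c) for c in chunks])
```

With these values each window advances by 800 characters, so a larger overlap produces more chunks for the same text and therefore more embeddings to compute and store.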

## Step 8: Verify in Azure Search

Query the index to verify documents were indexed:

In [None]:
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

# Create search client
search_service = os.getenv("AZURE_SEARCH_SERVICE")
search_key = os.getenv("AZURE_SEARCH_KEY")
index_name = os.getenv("AZURE_SEARCH_INDEX", "documents-index")

search_client = SearchClient(
    endpoint=f"https://{search_service}.search.windows.net",
    index_name=index_name,
    credential=AzureKeyCredential(search_key)
)

# Get document count
results = search_client.search(search_text="*", top=0, include_total_count=True)
print("\n📊 Index Statistics:")
print(f"Total documents in index: {results.get_count()}")
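
If `AZURE_SEARCH_SERVICE` or `AZURE_SEARCH_KEY` is unset, the client above is built from `None` values and the failure surfaces later. A small guard can fail early with a clearer message. This is a standard-library sketch; `search_settings` is a hypothetical helper, and accepting a full `https://` endpoint in `AZURE_SEARCH_SERVICE` is a convenience assumed here, not something the notebook specifies:

```python
import os

def search_settings(env=None):
    """Collect Azure AI Search settings, raising early if a required one is missing."""
    env = os.environ if env is None else env
    service = (env.get("AZURE_SEARCH_SERVICE") or "").strip()
    key = (env.get("AZURE_SEARCH_KEY") or "").strip()
    if not service or not key:
        raise RuntimeError("Set AZURE_SEARCH_SERVICE and AZURE_SEARCH_KEY before querying the index")
    # Assumption: allow either a bare service name or a full https:// endpoint
    if service.startswith("https://"):
        endpoint = service.rstrip("/")
    else:
        endpoint = f"https://{service}.search.windows.net"
    return {
        "endpoint": endpoint,
        "key": key,
        "index_name": env.get("AZURE_SEARCH_INDEX") or "documents-index",
    }

# Example with made-up values
print(search_settings({"AZURE_SEARCH_SERVICE": "my-search", "AZURE_SEARCH_KEY": "secret"})["endpoint"])
```

The returned values map directly onto the `endpoint`, `index_name`, and `AzureKeyCredential` arguments used when constructing `SearchClient`.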

## Step 9: Perform a Test Search

Search the indexed documents:

In [None]:
# Search for content
results = search_client.search(
    search_text="your search query here",
    top=5,
    select=["id", "filename", "title", "content", "pageNumber"]
)

print("\n🔍 Search Results:\n")
for i, result in enumerate(results, 1):
    print(f"{i}. {result['filename']} (Page {result.get('pageNumber', 'N/A')})")
    print(f"   Title: {result.get('title', 'N/A')}")
    print(f"   Content preview: {result['content'][:200]}...")
    print()
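
Slicing `result['content']` directly raises an error when a hit has no content, and embedded newlines make the preview hard to scan. A hypothetical `preview` helper (not part of ingestor or the Azure SDK) could handle both cases:

```python
def preview(text, limit=200):
    """Return a single-line preview of text, at most `limit` characters plus an ellipsis."""
    if not text:
        return "(no content)"
    collapsed = " ".join(text.split())  # collapse newlines and repeated whitespace
    return collapsed if len(collapsed) <= limit else collapsed[:limit] + "..."

print(preview("First line\n\nsecond   line"))
print(preview(None))
```

In the loop above, `preview(result.get('content'))` would replace the direct slice.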

## Summary

You've successfully:

✅ Installed and imported ingestor  
✅ Processed documents using the convenience function  
✅ Viewed processing results  
✅ Used custom configuration  
✅ Verified documents in Azure Search  
✅ Performed test searches  

## Next Steps

- **02_configuration.ipynb**: Learn about all configuration options
- **03_advanced_features.ipynb**: Explore advanced features like chunking strategies
- **06_troubleshooting.ipynb**: Debug common issues
- **07_performance_tuning.ipynb**: Optimize for large-scale processing