# Real-World Example: Legal Document Processing

This notebook demonstrates processing legal documents (contracts, case law, regulations) with ingestor.

## Use Case

Process a collection of legal documents including:
- Contracts
- Case law / Court decisions
- Regulations and statutes
- Legal briefs

## Requirements

Legal documents need:
- Precise chunking (don't split clauses/sections)
- Preservation of numbered sections
- Table of contents awareness
- Citation preservation
- Metadata extraction (document type, date, parties)
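
To make the "don't split clauses/sections" requirement concrete, here is a minimal sketch of locating numbered section headings so that chunk boundaries can line up with them. It is an illustration only, not how ingestor's layout-aware chunker works; `SECTION_HEADING` and `split_on_sections` are hypothetical names, and the regex only covers simple `1.`, `2.3`, `Section 4`, and `Article IV` styles. A wrapped body line that happens to start with a number would be misread as a heading.

```python
import re

# Hypothetical pattern: headings such as "1.", "2.3", "Section 4", "Article IV"
SECTION_HEADING = re.compile(
    r'^(?:(?:Section|Article)[ \t]+[\dIVXLC]+|\d+(?:\.\d+)*\.?)[ \t]+\S',
    re.MULTILINE,
)

def split_on_sections(text: str) -> list[str]:
    """Split text at numbered headings, keeping each heading with its body."""
    starts = [m.start() for m in SECTION_HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # Keep any preamble before the first heading
    bounds = starts + [len(text)]
    pieces = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [piece for piece in pieces if piece]

sample = """PREAMBLE
1. Definitions
"Services" means the work described in Schedule A.
2. Term
2.1 This Agreement begins on the Effective Date.
Section 3 Termination
Either party may terminate on 30 days notice."""

sections = split_on_sections(sample)
for section in sections:
    print(repr(section.splitlines()[0]))  # First line of each section
```

Each returned piece starts at a heading, so a downstream chunker could merge small neighbours up to the target size without cutting through a clause.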

In [None]:
from ingestor import Pipeline, PipelineConfig, create_config
from ingestor.config import ChunkingMode, TableRenderMode, OverlapMode
from dotenv import load_dotenv
import re
from datetime import datetime

load_dotenv()
print("✅ Setup complete")

## 1. Legal Document Configuration

Optimized settings for legal documents:

In [None]:
config = PipelineConfig.from_env()

# Input: Legal documents
config.input.local.glob = "legal_documents/**/*.pdf"

# Chunking: Layout-aware to preserve structure
config.chunking.mode = ChunkingMode.LAYOUT
config.chunking.target_chunk_size = 1500  # Larger chunks for legal context
config.chunking.chunk_overlap = 250       # More overlap for continuity
config.chunking.overlap_mode = OverlapMode.WORDS
config.chunking.preserve_section_boundaries = True

# Tables: Preserve structure for exhibits/schedules
config.chunking.table_render_mode = TableRenderMode.MARKDOWN_DETAILED
config.chunking.keep_tables_intact = True

# Document Intelligence: High-res OCR for scanned documents
config.document_intelligence.features = [
    "OCR_HIGH_RESOLUTION",
    "LANGUAGES"
]

# Search index
config.search.index_name = "legal-documents-index"

print("✅ Legal document configuration ready")
print(f"  Chunk size: {config.chunking.target_chunk_size} chars")
print(f"  Overlap: {config.chunking.chunk_overlap} words")
print(f"  Section preservation: {config.chunking.preserve_section_boundaries}")
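
The effect of the overlap setting can be sketched in plain Python. This is a simplified illustration that counts both chunk size and overlap in words; it is not ingestor's chunking code, and `chunk_words` is a hypothetical helper.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Sliding-window chunking: each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # This window already reaches the end of the text
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With an overlap of one word, the last word of each chunk reappears as the first word of the next, which is what keeps a sentence that straddles a boundary readable from either side.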

## 2. Custom Metadata Extraction for Legal Documents

In [None]:
def extract_legal_metadata(filename: str, content: str) -> dict:
    """
    Extract legal document metadata.
    
    Args:
        filename: Document filename
        content: First page content
    
    Returns:
        Dictionary of metadata fields
    """
    metadata = {}
    
    # Document type from filename
    filename_lower = filename.lower()
    if "contract" in filename_lower or "agreement" in filename_lower:
        metadata["literatureType"] = ["Contract"]
        metadata["category"] = "Contract"
    elif "brief" in filename_lower or "motion" in filename_lower:
        metadata["literatureType"] = ["Legal Brief"]
        metadata["category"] = "Brief"
    elif "regulation" in filename_lower or "statute" in filename_lower:
        metadata["literatureType"] = ["Regulation"]
        metadata["category"] = "Regulation"
    elif "case" in filename_lower or "decision" in filename_lower:
        metadata["literatureType"] = ["Case Law"]
        metadata["category"] = "Case Law"
    
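    # Extract the first date (MM/DD/YYYY, YYYY-MM-DD, or "Month D, YYYY")
    months = (
        "January|February|March|April|May|June|July|"
        "August|September|October|November|December"
    )
    date_patterns = [
        (r'\b\d{1,2}/\d{1,2}/\d{4}\b', "%m/%d/%Y"),               # MM/DD/YYYY
        (r'\b\d{4}-\d{2}-\d{2}\b', "%Y-%m-%d"),                   # YYYY-MM-DD
        (rf'\b(?:{months})\s+\d{{1,2}},?\s+\d{{4}}\b', "%B %d %Y"),  # Month D, YYYY
    ]
    
    for pattern, date_format in date_patterns:
        match = re.search(pattern, content[:2000])  # Search first 2000 chars
        if match:
            # Drop the comma and collapse whitespace: "January 15, 2024" -> "January 15 2024"
            date_str = " ".join(match.group(0).replace(",", " ").split())
            try:
                parsed = datetime.strptime(date_str, date_format)
            except ValueError:
                continue  # Not a real calendar date (e.g. 13/45/2024); try the next pattern
            # Store as an ISO 8601 UTC timestamp so date-range filters can compare it
            # (simple date parsing; expand for production)
            metadata["publishedDate"] = parsed.strftime("%Y-%m-%dT00:00:00Z")
            break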
    
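    # Extract parties (look for "between X and Y" patterns).
    # Only the keywords are case-insensitive; party names must start with a capital.
    # With re.MULTILINE, `$` also ends a party name at the end of a line.
    party_pattern = r'(?i:between)\s+([A-Z][\w\s,]+?)\s+(?i:and)\s+([A-Z][\w\s,]+?)\s*(?:\(|,|\.|$)'
    party_match = re.search(party_pattern, content[:5000], re.MULTILINE)
    if party_match:
        metadata["parties"] = [party_match.group(1).strip(), party_match.group(2).strip()]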
    
    # Extract jurisdiction/court
    court_keywords = [
        "Supreme Court", "District Court", "Court of Appeals",
        "Circuit Court", "Federal Court"
    ]
    for keyword in court_keywords:
        if keyword in content[:3000]:
            metadata["court"] = keyword
            break
    
    return metadata

# Test
sample_content = """
CONTRACT AGREEMENT
Dated: January 15, 2024

This Agreement is entered into between ABC Corporation and XYZ Holdings
on the terms and conditions set forth below...
"""

metadata = extract_legal_metadata("service_agreement_2024.pdf", sample_content)
print("\nExtracted metadata:")
for key, value in metadata.items():
    print(f"  {key}: {value}")

## 3. Process Legal Documents

In [None]:
# Process legal documents
pipeline = Pipeline(config)

try:
    print("🔨 Processing legal documents...\n")
    status = await pipeline.run()
    
    print("\n" + "="*60)
    print("📊 Processing Results:")
    print("="*60)
    print(f"Successful: {status.successful_documents}")
    print(f"Failed: {status.failed_documents}")
    print(f"Total chunks: {status.total_chunks_indexed}")
    print(f"Processing time: {status.total_processing_time_seconds:.2f}s")
    
    # Per-document details
    if status.results:
        print("\n📋 Document Details:")
        for result in status.results:
            if result.success:
                print(f"  ✅ {result.filename}")
                print(f"     Chunks: {result.chunks_indexed}")
                print(f"     Time: {result.processing_time_seconds:.2f}s")
            else:
                print(f"  ❌ {result.filename}")
                print(f"     Error: {result.error_message}")
    
finally:
    await pipeline.close()

## 4. Query Legal Documents

Search processed legal documents:

In [None]:
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
import os

# Create search client
search_client = SearchClient(
    endpoint=f"https://{os.getenv('AZURE_SEARCH_SERVICE')}.search.windows.net",
    index_name="legal-documents-index",
    credential=AzureKeyCredential(os.getenv('AZURE_SEARCH_KEY'))
)

# Example query: Confidentiality clauses
query = "confidentiality non-disclosure obligations"

results = search_client.search(
    search_text=query,
    top=5,
    scoring_profile="contentRAGProfile",
    select=["filename", "title", "content", "category", "pageNumber"]
)

print(f"\n🔍 Search Results for: '{query}'\n")
print("="*60)

for i, result in enumerate(results, 1):
    print(f"\n{i}. {result['filename']} (Page {result.get('pageNumber', 'N/A')})")
    print(f"   Category: {result.get('category', 'N/A')}")
    if result.get('title'):
        print(f"   Section: {result['title']}")
    print(f"   Content: {result['content'][:300]}...")
    print("-" * 60)

## 5. Advanced Legal Queries

### Filter by Document Type

In [None]:
# Search only contracts
results = search_client.search(
    search_text="termination clause",
    filter="category eq 'Contract'",
    top=5
)

print("🔍 Contracts with termination clauses:\n")
for result in results:
    print(f"  - {result['filename']}")
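
When the category value comes from user input rather than a literal, the filter string needs quoting. A minimal sketch, assuming a hypothetical helper named `odata_eq`: OData string literals escape a single quote by doubling it.

```python
def odata_eq(field: str, value: str) -> str:
    """Build an OData equality filter, doubling single quotes inside the value."""
    escaped = value.replace("'", "''")
    return f"{field} eq '{escaped}'"

print(odata_eq("category", "Contract"))
print(odata_eq("category", "Shareholders' Agreement"))
```

The returned string can be passed as the `filter` argument in the search call above.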

### Filter by Date Range

In [None]:
# Search documents from 2024
results = search_client.search(
    search_text="*",
    filter="publishedDate ge 2024-01-01T00:00:00Z and publishedDate lt 2025-01-01T00:00:00Z",
    top=10
)

print("📅 Documents from 2024:\n")
for result in results:
    print(f"  - {result['filename']} ({result.get('publishedDate', 'N/A')})")
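
The same date-range filter can be generated for any year instead of being hard-coded. This is a small sketch with a hypothetical helper, `year_range_filter`; it assumes the index stores `publishedDate` as a date-time field, as the filter above implies.

```python
from datetime import datetime, timezone

def year_range_filter(field: str, year: int) -> str:
    """Return an OData filter covering [Jan 1 of year, Jan 1 of year + 1) in UTC."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime(year, 1, 1, tzinfo=timezone.utc).strftime(fmt)
    end = datetime(year + 1, 1, 1, tzinfo=timezone.utc).strftime(fmt)
    return f"{field} ge {start} and {field} lt {end}"

print(year_range_filter("publishedDate", 2024))
```

Using a half-open range (`ge` start, `lt` end) avoids having to spell out the last instant of December 31.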

### Semantic Search for Legal Concepts

In [None]:
# Use semantic search for concept-based queries
results = search_client.search(
    search_text="liability limitations and indemnification",
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",
    top=3
)

print("🧠 Semantic search results:\n")
for i, result in enumerate(results, 1):
    print(f"{i}. {result['filename']}")
    print(f"   {result['content'][:200]}...\n")

## 6. Legal RAG Pattern

Use search results to answer legal questions with an LLM:

In [None]:
from openai import AzureOpenAI

# Create OpenAI client
openai_client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

def legal_rag_query(question: str, top_k: int = 5):
    """
    Answer legal question using RAG pattern.
    
    Args:
        question: Legal question
        top_k: Number of chunks to retrieve
    
    Returns:
        Answer with citations
    """
    # 1. Search for relevant chunks
    results = search_client.search(
        search_text=question,
        query_type="semantic",
        semantic_configuration_name="my-semantic-config",
        top=top_k,
        select=["content", "filename", "pageNumber"]
    )
    
    # 2. Build context from search results
    context_parts = []
    sources = []
    
    for i, result in enumerate(results, 1):
        source = f"{result['filename']} (Page {result.get('pageNumber', 'N/A')})"
        sources.append(source)
        context_parts.append(f"[Source {i}] {source}:\n{result['content']}")
    
    context = "\n\n".join(context_parts)
    
    # 3. Generate answer with LLM
    messages = [
        {
            "role": "system",
            "content": """You are a legal research assistant. Answer questions based ONLY on the provided context.
            Cite sources using [Source N] notation. If the context doesn't contain enough information, say so."""
        },
        {
            "role": "user",
            "content": f"""Question: {question}
            
Context:
{context}

Please answer the question based on the context above."""
        }
    ]
    
    response = openai_client.chat.completions.create(
        model=os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT", "gpt-4"),
        messages=messages,
        temperature=0.1  # Low temperature for factual answers
    )
    
    answer = response.choices[0].message.content
    
    return {
        "question": question,
        "answer": answer,
        "sources": sources
    }

# Example query
result = legal_rag_query(
    "What are the termination conditions in the service agreements?"
)

print(f"❓ Question: {result['question']}\n")
print(f"💡 Answer:\n{result['answer']}\n")
print("📚 Sources:")
for i, source in enumerate(result['sources'], 1):
    print(f"  [{i}] {source}")
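
Because the system prompt asks for `[Source N]` citations, it can be worth checking that the answer really cites sources that were retrieved. The following is a minimal sketch with a hypothetical helper, `check_citations`; it only checks that the cited numbers exist, not that the cited text supports the claim.

```python
import re

def check_citations(answer: str, sources: list[str]) -> dict:
    """Compare [Source N] markers in an answer against the retrieved source list."""
    cited = sorted({int(n) for n in re.findall(r'\[Source (\d+)\]', answer)})
    known = [n for n in cited if 1 <= n <= len(sources)]
    unknown = [n for n in cited if n not in known]
    return {"cited": known, "unknown": unknown, "has_citation": bool(known)}

# Illustrative data only, not real pipeline output
demo_sources = ["msa.pdf (Page 2)", "sow.pdf (Page 7)"]
demo_answer = (
    "Either party may terminate on 30 days notice [Source 1]. "
    "Accrued fees survive termination [Source 4]."
)
report = check_citations(demo_answer, demo_sources)
print(report)
```

An answer with an empty `cited` list or a non-empty `unknown` list is a candidate for regeneration or manual review.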

## Summary

Legal document processing with ingestor:

✅ Layout-aware chunking preserves document structure  
✅ Larger chunks maintain legal context  
✅ Custom metadata extraction for document type, dates, parties  
✅ Advanced filtering by category, date, jurisdiction  
✅ Semantic search for legal concepts  
✅ RAG pattern for answering legal questions  

## Best Practices for Legal Documents

1. **Chunk size**: Use 1500-2000 chars to preserve clause context
2. **Overlap**: 15-20% overlap to avoid splitting clauses
3. **Tables**: Keep intact for schedules and exhibits
4. **Metadata**: Extract document type, date, parties, jurisdiction
5. **Search**: Use semantic ranking for conceptual queries
6. **RAG**: Always cite sources in answers

## Next Steps

- **05_real_world_medical.ipynb**: Medical device manual processing
- **06_troubleshooting.ipynb**: Debug common issues
- **07_performance_tuning.ipynb**: Optimize for large document sets