# Document Processing with Azure OpenAI Multimodal Model

This notebook demonstrates how to process insurance documents using Azure OpenAI's multimodal GPT model, including both text and image analysis. The workflow includes:

1. **Upload Data to Azure Blob Storage**: Upload insurance policy documents (.md files), claim statements (.md files), and crash images (.jpg/.png files) to Azure Blob Storage containers for organized processing.

2. **Process Documents with Azure OpenAI GPT-4.1-mini**
   - **Text Processing**: Prepare markdown files containing insurance policies and claim statements for vectorization
   - **Image Analysis**: Use GPT's vision capabilities to generate detailed descriptions of crash scene images, analyzing vehicle damage, environmental conditions, and relevant details for insurance claim processing

3. **Extract Structured Information**: Utilize Azure OpenAI's structured output capabilities to extract key information from claim statements into structured JSON format, including policyholder details, incident information, vehicle data, and witness information.

4. **Store in Azure Cosmos DB**: Save the processed and structured claim information to Azure Cosmos DB for easy retrieval and analysis by insurance agents and automated systems.

5. **Retrieve Information in JSON Format**: Generate comprehensive JSON outputs containing both structured claim data and detailed image descriptions, ready for downstream processing like vectorization and RAG (Retrieval-Augmented Generation) systems.

Automating document processing is crucial for improving efficiency and accuracy in handling large volumes of data. By leveraging Azure's cloud services, organizations can streamline their workflows, reduce manual errors, and gain valuable insights from their documents. This approach not only saves time and resources but also enhances data accessibility and decision-making capabilities.



## 1. Setup and Configuration

Let's start by importing our libraries and loading the `.env` variables that we saved in the previous challenge.

In [1]:
import os
import json
import base64
from pathlib import Path
from typing import Dict, List, Optional
import pandas as pd
from tqdm import tqdm

# Azure SDK imports
from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import ResourceExistsError
from azure.cosmos import CosmosClient, PartitionKey
from azure.ai.projects import AIProjectClient
from azure.ai.projects.aio import AIProjectClient as AsyncAIProjectClient
from azure.identity import DefaultAzureCredential
from azure.ai.agents.models import MessageRole, ListSortOrder, AzureAISearchTool, AzureAISearchQueryType
# OpenAI imports
from openai import AzureOpenAI

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

print("✅ All imports successful!")

✅ All imports successful!


This cell defines a `Config` class that gathers the settings used throughout the notebook: the Blob Storage connection details (`AZURE_STORAGE_CONNECTION_STRING`), the Azure OpenAI endpoint, key, API version, and deployment name (`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_KEY`), the Cosmos DB endpoint and key, the blob container names, and the local data paths. It then checks that the three required variables are set and prints either a list of the missing ones or a summary of the loaded configuration.

In [2]:
# Configuration
class Config:
    # Storage configuration
    AZURE_STORAGE_CONNECTION_STRING = os.getenv('AZURE_STORAGE_CONNECTION_STRING')
    AZURE_STORAGE_ACCOUNT_NAME = os.getenv('AZURE_STORAGE_ACCOUNT_NAME')
    AZURE_STORAGE_ACCOUNT_KEY = os.getenv('AZURE_STORAGE_ACCOUNT_KEY')
    
    # Azure OpenAI configuration
    AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT')
    AZURE_OPENAI_API_KEY = os.getenv('AZURE_OPENAI_KEY')
    AZURE_OPENAI_API_VERSION = os.getenv('AZURE_OPENAI_API_VERSION', '2024-02-15-preview')
    AZURE_OPENAI_DEPLOYMENT_NAME = os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME', 'gpt-4.1-mini')
        
    # Cosmos DB configuration
    COSMOS_ENDPOINT = os.getenv('COSMOS_ENDPOINT')
    COSMOS_KEY = os.getenv('COSMOS_KEY')
    COSMOS_DATABASE = 'insurance_claims'
    COSMOS_CONTAINER = 'crash_reports'
    
    # Container names
    POLICIES_CONTAINER = 'policies'
    CLAIMS_CONTAINER = 'claims'
    PROCESSED_CONTAINER = 'processed-documents'
    STATEMENTS_CONTAINER = 'statements'
    
    # Local data paths
    DATA_DIR = Path('data')
    POLICIES_DIR = DATA_DIR / 'policies'
    CLAIMS_DIR = DATA_DIR / 'claims'
    STATEMENTS_DIR = DATA_DIR / 'statements'

# Validate configuration
required_vars = [
    Config.AZURE_STORAGE_CONNECTION_STRING,
    Config.AZURE_OPENAI_ENDPOINT,
    Config.AZURE_OPENAI_API_KEY
]

missing_vars = [var for var in required_vars if not var]
if missing_vars:
    print("❌ Missing environment variables. Please check your .env file.")
    print("Missing variables - please add these to your .env file:")
    if not Config.AZURE_OPENAI_ENDPOINT:
        print("  - AZURE_OPENAI_ENDPOINT")
    if not Config.AZURE_OPENAI_API_KEY:
        print("  - AZURE_OPENAI_KEY")  # name of the variable actually read by os.getenv above
    if not Config.AZURE_STORAGE_CONNECTION_STRING:
        print("  - AZURE_STORAGE_CONNECTION_STRING")
else:
    print("✅ Configuration loaded successfully!")
    print(f"📁 Policies directory: {Config.POLICIES_DIR}")
    print(f"📁 Statements directory: {Config.STATEMENTS_DIR}")
    print(f"📁 Claims directory: {Config.CLAIMS_DIR}")
    print(f"🤖 OpenAI Deployment: {Config.AZURE_OPENAI_DEPLOYMENT_NAME}")

✅ Configuration loaded successfully!
📁 Policies directory: data/policies
📁 Statements directory: data/statements
📁 Claims directory: data/claims
🤖 OpenAI Deployment: gpt-4.1-mini
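
If you want the validation to report missing variables by name, one option is to look them up by name rather than checking the already-loaded values. The sketch below is a minimal, standard-library illustration. `find_missing_env` and `REQUIRED_ENV_VARS` are hypothetical names that are not part of the notebook, and the list assumes the same three variables that the cell above treats as required.

```python
import os
from typing import Iterable, List, Mapping

# Assumption: these are the three variables the configuration cell requires.
REQUIRED_ENV_VARS = [
    "AZURE_STORAGE_CONNECTION_STRING",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
]


def find_missing_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env`, in the order given."""
    return [name for name in names if not env.get(name)]


# Report each missing variable by the exact name expected in the .env file.
for name in find_missing_env(REQUIRED_ENV_VARS):
    print(f"  - {name}")
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.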


## 2. Azure Services Setup

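The next cell initializes the Azure service clients: a `BlobServiceClient` for Azure Storage and an `AzureOpenAI` client for GPT-4.1-mini processing. The `initialize_clients()` function returns `(blob_service_client, openai_client)`, or `(None, None)` if either client cannot be constructed. Note that constructing the clients does not contact the services, so connectivity problems only surface on the first request.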


In [3]:
# Initialize Azure clients
def initialize_clients():
    """Initialize Azure service clients"""
    try:
        # Blob Storage client
        blob_service_client = BlobServiceClient.from_connection_string(
            Config.AZURE_STORAGE_CONNECTION_STRING
        )
        
        # Azure OpenAI client
        openai_client = AzureOpenAI(
            azure_endpoint=Config.AZURE_OPENAI_ENDPOINT,
            api_key=Config.AZURE_OPENAI_API_KEY,
            api_version=Config.AZURE_OPENAI_API_VERSION
        )
        
        print("✅ Azure clients initialized successfully!")
        return blob_service_client, openai_client
        
    except Exception as e:
        print(f"❌ Error initializing clients: {e}")
        return None, None

blob_service_client, openai_client = initialize_clients()

✅ Azure clients initialized successfully!


The next cell creates and tests Azure Blob Storage containers with enhanced error handling, checking connections, listing existing containers, and attempting to create the required containers (policies, claims, statements, processed-documents) while providing detailed diagnostics for any failures.

In [4]:
# Container creation with connection checks and diagnostics
def create_containers_enhanced(blob_service_client):
    """Create blob storage containers with enhanced error handling and diagnostics"""
    
    # First, test the connection
    try:
        print("🔍 Testing storage account connection...")
        account_info = blob_service_client.get_account_information()
        print(f"✅ Connected to storage account successfully")
        print(f"   Account kind: {account_info.get('account_kind', 'Unknown')}")
        print(f"   SKU name: {account_info.get('sku_name', 'Unknown')}")
    except Exception as e:
        print(f"❌ Failed to connect to storage account: {e}")
        return False
    
    # Test listing existing containers
    try:
        print("\n🔍 Checking existing containers...")
        existing_containers = []
        for container in blob_service_client.list_containers():
            existing_containers.append(container.name)
        print(f"✅ Found {len(existing_containers)} existing containers: {existing_containers}")
    except Exception as e:
        print(f"❌ Failed to list containers: {e}")
        print("   This might indicate insufficient permissions")
    
    # Try to create containers
    containers = [
        Config.POLICIES_CONTAINER,
        Config.CLAIMS_CONTAINER,
        Config.STATEMENTS_CONTAINER,  # Added statements container
        Config.PROCESSED_CONTAINER
    ]
    
    created_containers = []
    failed_containers = []
    
    for container_name in containers:
        try:
            # Check if container already exists first
            container_client = blob_service_client.get_container_client(container_name)
            
            try:
                # Try to get container properties (this will fail if it doesn't exist)
                properties = container_client.get_container_properties()
                print(f"ℹ️ Container '{container_name}' already exists")
                created_containers.append(container_name)
                continue
            except Exception:
                # Container doesn't exist, try to create it
                pass
            
            # Create the container
            print(f"🔨 Creating container '{container_name}'...")
            container_client.create_container()
            print(f"✅ Container '{container_name}' created successfully")
            created_containers.append(container_name)
            
        except Exception as e:
            print(f"❌ Error with container '{container_name}': {e}")
            failed_containers.append((container_name, str(e)))
            
            # Additional diagnostics for authorization errors
            if "AuthorizationFailure" in str(e):
                print(f"   🔍 Authorization issue detected for '{container_name}'")
                print(f"   This could be due to:")
                print(f"   - Storage account access keys disabled")
                print(f"   - Network access restrictions")
                print(f"   - Storage account permissions")
    
    print(f"\n📊 Container Creation Summary:")
    print(f"   Successful: {len(created_containers)} - {created_containers}")
    print(f"   Failed: {len(failed_containers)} - {[name for name, _ in failed_containers]}")
    
    return len(failed_containers) == 0

if blob_service_client:
    print("🚀 Running enhanced container creation...")
    success = create_containers_enhanced(blob_service_client)
    
        

🚀 Running enhanced container creation...
🔍 Testing storage account connection...
✅ Connected to storage account successfully
   Account kind: StorageV2
   SKU name: Standard_LRS

🔍 Checking existing containers...
✅ Found 4 existing containers: ['claims', 'policies', 'processed-documents', 'statements']
ℹ️ Container 'policies' already exists
ℹ️ Container 'claims' already exists
ℹ️ Container 'statements' already exists
ℹ️ Container 'processed-documents' already exists

📊 Container Creation Summary:
   Successful: 4 - ['policies', 'claims', 'statements', 'processed-documents']
   Failed: 0 - []
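
Container creation can also fail when a name breaks Azure's naming rules for blob containers: 3 to 63 characters, only lowercase letters, digits, and hyphens, starting and ending with a letter or digit, and no consecutive hyphens. The four names used here all comply, but if you rename them, a quick local check along these lines may save a round trip. `is_valid_container_name` is a hypothetical helper, not part of the notebook or the Azure SDK, and the sketch does not cover special system containers such as `$root`.

```python
import re

# One or more alphanumeric runs joined by single hyphens (lowercase only).
_CONTAINER_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_container_name(name: str) -> bool:
    """Check a name against the naming rules for ordinary blob containers."""
    return 3 <= len(name) <= 63 and _CONTAINER_NAME.fullmatch(name) is not None


for candidate in ["policies", "claims", "statements", "processed-documents"]:
    print(candidate, is_valid_container_name(candidate))
```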


## 3. Document Upload Functions

The next cell defines a `DocumentUploader` class with methods for uploading a single file or an entire directory to Azure Blob Storage and for listing the blobs in a container. Uploads overwrite existing blobs with the same name, show a progress bar, and report errors per file instead of stopping the run.


In [5]:
class DocumentUploader:
    def __init__(self, blob_service_client):
        self.blob_service_client = blob_service_client
    
    def upload_file(self, file_path: Path, container_name: str, blob_name: Optional[str] = None) -> bool:
        """Upload a single file to blob storage"""
        if blob_name is None:
            blob_name = file_path.name
            
        try:
            blob_client = self.blob_service_client.get_blob_client(
                container=container_name, 
                blob=blob_name
            )
            
            with open(file_path, 'rb') as data:
                blob_client.upload_blob(data, overwrite=True)
            
            print(f"✅ Uploaded: {file_path.name} → {container_name}/{blob_name}")
            return True
            
        except Exception as e:
            print(f"❌ Error uploading {file_path.name}: {e}")
            return False
    
    def upload_directory(self, directory_path: Path, container_name: str) -> Dict[str, bool]:
        """Upload all files from a directory to blob storage"""
        results = {}
        
        if not directory_path.exists():
            print(f"❌ Directory not found: {directory_path}")
            return results
        
        files = list(directory_path.glob('*'))
        if not files:
            print(f"ℹ️ No files found in {directory_path}")
            return results
        
        print(f"📤 Uploading {len(files)} files from {directory_path} to {container_name}...")
        
        for file_path in tqdm(files, desc="Uploading files"):
            if file_path.is_file():
                success = self.upload_file(file_path, container_name)
                results[file_path.name] = success
        
        successful_uploads = sum(results.values())
        print(f"\n📊 Upload Summary: {successful_uploads}/{len(results)} files uploaded successfully")
        
        return results
    
    def list_blobs(self, container_name: str) -> List[str]:
        """List all blobs in a container"""
        try:
            container_client = self.blob_service_client.get_container_client(container_name)
            blob_list = container_client.list_blobs()
            return [blob.name for blob in blob_list]
        except Exception as e:
            print(f"❌ Error listing blobs in {container_name}: {e}")
            return []

# Initialize uploader
if blob_service_client:
    uploader = DocumentUploader(blob_service_client)
    print("✅ Document uploader initialized!")

✅ Document uploader initialized!
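
Note that `upload_directory` uploads every file matched by `glob('*')`, which can include stray files such as `.DS_Store`. If that matters for your data folders, one possible approach is to select files by extension first. The sketch below is an illustration only; `select_files` is a hypothetical helper that is not part of the notebook.

```python
from pathlib import Path
from typing import Iterable, List


def select_files(directory: Path, allowed_suffixes: Iterable[str]) -> List[Path]:
    """Return the regular files in `directory` whose suffix is allowed (case-insensitive)."""
    allowed = {suffix.lower() for suffix in allowed_suffixes}
    return sorted(
        path
        for path in directory.glob("*")
        if path.is_file() and path.suffix.lower() in allowed
    )


# Hypothetical usage with the uploader defined above:
# for path in select_files(Config.CLAIMS_DIR, {".jpg", ".jpeg", ".png"}):
#     uploader.upload_file(path, Config.CLAIMS_CONTAINER)
```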


## 4. Upload Documents to Blob Storage

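Each local folder is uploaded to the container with the matching name: policies, claims (the crash images), and statements.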

In [6]:
# Upload policy documents
print("📄 Uploading Policy Documents...")
print("=" * 50)

policy_results = uploader.upload_directory(Config.POLICIES_DIR, Config.POLICIES_CONTAINER)

# Upload claims documents
print("\n🖼️ Uploading Claims Documents...")
print("=" * 50)

claims_results = uploader.upload_directory(Config.CLAIMS_DIR, Config.CLAIMS_CONTAINER)

# Upload statements documents
print("\n📄 Uploading Statements Documents...")
print("=" * 50)

statements_results = uploader.upload_directory(Config.STATEMENTS_DIR, Config.STATEMENTS_CONTAINER)


📄 Uploading Policy Documents...
📤 Uploading 5 files from data/policies to policies...


Uploading files:  60%|██████    | 3/5 [00:00<00:00, 27.16it/s]

✅ Uploaded: comprehensive_auto_policy.md → policies/comprehensive_auto_policy.md
✅ Uploaded: motorcycle_policy.md → policies/motorcycle_policy.md
✅ Uploaded: liability_only_policy.md → policies/liability_only_policy.md
✅ Uploaded: commercial_auto_policy.md → policies/commercial_auto_policy.md


Uploading files: 100%|██████████| 5/5 [00:00<00:00, 28.91it/s]


✅ Uploaded: high_value_vehicle_policy.md → policies/high_value_vehicle_policy.md

📊 Upload Summary: 5/5 files uploaded successfully

🖼️ Uploading Claims Documents...
📤 Uploading 5 files from data/claims to claims...


Uploading files:  40%|████      | 2/5 [00:00<00:00, 13.20it/s]

✅ Uploaded: crash3.jpg → claims/crash3.jpg
✅ Uploaded: crash5.jpg → claims/crash5.jpg
✅ Uploaded: crash2.jpg → claims/crash2.jpg
✅ Uploaded: crash1.jpg → claims/crash1.jpg


Uploading files: 100%|██████████| 5/5 [00:00<00:00, 16.05it/s]


✅ Uploaded: crash4.jpeg → claims/crash4.jpeg

📊 Upload Summary: 5/5 files uploaded successfully

📄 Uploading Statements Documents...
📤 Uploading 5 files from data/statements to statements...


Uploading files:   0%|          | 0/5 [00:00<?, ?it/s]

✅ Uploaded: crash2.md → statements/crash2.md
✅ Uploaded: crash1.md → statements/crash1.md
✅ Uploaded: crash4.md → statements/crash4.md


Uploading files: 100%|██████████| 5/5 [00:00<00:00, 31.70it/s]

✅ Uploaded: crash5.md → statements/crash5.md
✅ Uploaded: crash3.md → statements/crash3.md

📊 Upload Summary: 5/5 files uploaded successfully





## 5. Document Processing with Azure OpenAI GPT-4.1-mini

Perfect! At this point we have four containers, three of which (`policies`, `claims`, and `statements`) hold the data for our use case, while `processed-documents` will receive the results. Now it's time to process the data. We are handling `.md` text files and `.jpg`/`.jpeg` images (`.png` is supported as well). To do so, we will create a class called `DocumentProcessor` with two key methods:
- **process_markdown_for_vectorization** - returns the markdown content as plain text, together with metadata, ready for vectorization
- **generate_image_description_with_gpt** - uses the multimodal capabilities of GPT-4.1-mini to generate a description of each image. These descriptions will be important later on for fraud analysis.
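
The image method sends each picture to the model as a base64-encoded data URL. The sketch below isolates that step so it can be checked without calling the model. `image_data_url` is a hypothetical helper, not part of the notebook. Unlike the class that follows, which falls back to JPEG for unknown extensions, this version raises an error; that is a design choice of the sketch, not the notebook's behavior.

```python
import base64
from pathlib import Path

# MIME types for the image extensions handled in this notebook.
_MIME_BY_SUFFIX = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}


def image_data_url(image_bytes: bytes, blob_name: str) -> str:
    """Build a `data:` URL for an image, choosing the MIME type from the file extension."""
    suffix = Path(blob_name).suffix.lower()
    mime = _MIME_BY_SUFFIX.get(suffix)
    if mime is None:
        raise ValueError(f"Unsupported image type: {suffix or '(none)'}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Failing early on an unknown extension avoids sending the model a payload labeled with the wrong image type.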

In [7]:
class DocumentProcessor:
    def __init__(self, openai_client, blob_service_client):
        self.openai_client = openai_client
        self.blob_service_client = blob_service_client
    
    def get_blob_content(self, container_name: str, blob_name: str) -> bytes:
        """Download blob content as bytes"""
        blob_client = self.blob_service_client.get_blob_client(
            container=container_name, 
            blob=blob_name
        )
        blob_data = blob_client.download_blob()
        return blob_data.readall()
    
    def encode_image_to_base64(self, image_bytes: bytes) -> str:
        """Encode image bytes to base64 string"""
        return base64.b64encode(image_bytes).decode('utf-8')
    
    def process_markdown_for_vectorization(self, container_name: str, blob_name: str) -> Dict:
        """Process markdown file for direct vectorization (no GPT processing)"""
        try:
            print(f"📄 Preparing markdown for vectorization: {blob_name}...")
            
            # Download and decode content
            blob_content = self.get_blob_content(container_name, blob_name)
            content = blob_content.decode('utf-8')
            
            metadata = {
                "file_name": blob_name,
                "container": container_name,
                "file_type": "markdown",
                "text_length": len(content),
                "processing_date": pd.Timestamp.now().isoformat(),
                "processing_method": "direct_vectorization",
                "ready_for_embedding": True
            }
            
            return {
                "success": True,
                "text": content,  # Original markdown content for vectorization
                "metadata": metadata
            }
            
        except Exception as e:
            print(f"❌ Error processing {blob_name}: {e}")
            return {
                "success": False,
                "error": str(e),
                "metadata": {"file_name": blob_name, "container": container_name, "file_type": "markdown"}
            }

    def generate_image_description_with_gpt(self, container_name: str, blob_name: str) -> Dict:
        """Generate a detailed description of an image blob using the multimodal GPT deployment"""
        try:
            print(f"🖼️ Generating description for image: {blob_name}...")
            
            # Download image content
            image_bytes = self.get_blob_content(container_name, blob_name)
            base64_image = self.encode_image_to_base64(image_bytes)
            
            # Determine image format from file extension
            file_extension = Path(blob_name).suffix.lower()
            if file_extension == ".jpg" or file_extension == ".jpeg":
                image_format = "jpeg"
            elif file_extension == ".png":
                image_format = "png"
            else:
                image_format = "jpeg"  # default
            
            # Process with GPT-4.1-mini for description generation
            response = self.openai_client.chat.completions.create(
                model=Config.AZURE_OPENAI_DEPLOYMENT_NAME,
                messages=[
                    {
                        "role": "system",
                        "content": """You are an expert insurance claims analyst with advanced image analysis capabilities. 
                        Your task is to provide detailed, professional descriptions of insurance-related images, particularly vehicle damage and accident scenes.
                        
                        Focus on:
                        - Type of vehicle and visible damage
                        - Location and extent of damage (scratches, dents, broken parts, etc.)
                        - Environmental context (road conditions, weather signs, location type)
                        - Any visible people, other vehicles, or relevant objects
                        - Overall severity assessment
                        - Any safety concerns or hazards visible
                        
                        Provide clear, objective descriptions that would be useful for insurance claim processing and risk assessment."""
                    },
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": "Please provide a detailed description of this insurance claim image. Focus on damage assessment, environmental factors, and any relevant details for insurance processing."
                            },
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/{image_format};base64,{base64_image}"
                                }
                            }
                        ]
                    }
                ],
                max_tokens=4000,
                temperature=0.3  # Slightly higher for more descriptive language
            )
            description = response.choices[0].message.content
            
            metadata = {
                "file_name": blob_name,
                "container": container_name,
                "file_type": "image",
                "image_format": image_format,
                "image_size_bytes": len(image_bytes),
                "description_length": len(description),
                "processing_date": pd.Timestamp.now().isoformat(),
                "model_used": Config.AZURE_OPENAI_DEPLOYMENT_NAME,
                "processing_type": "image_description",
                "ready_for_embedding": True
            }
            
            return {
                "success": True,
                "description": description,  # Changed from "text" to "description"
                "metadata": metadata
            }
            
        except Exception as e:
            print(f"❌ Error processing {blob_name}: {e}")
            return {
                "success": False,
                "error": str(e),
                "metadata": {"file_name": blob_name, "container": container_name, "file_type": "image"}
            }

    def process_all_documents(self) -> Dict[str, List[Dict]]:
        """Process documents: prepare markdown for vectorization, generate descriptions for images"""
        results = {
            "policies": [],
            "claims": [],
            "statements": []  # Added statements to results
        }
        
        # Process policy documents (markdown files) - prepare for vectorization only
        print("📄 Preparing Policy Documents for Vectorization...")
        print("=" * 50)
        
        policy_blobs = uploader.list_blobs(Config.POLICIES_CONTAINER)
        for blob_name in tqdm(policy_blobs, desc="Preparing policies"):
            if blob_name.endswith(".md"):
                result = self.process_markdown_for_vectorization(Config.POLICIES_CONTAINER, blob_name)
                results["policies"].append(result)
            else:
                print(f"⚠️ Skipping non-markdown file: {blob_name}")
        
        # Process statements documents (markdown files) - prepare for vectorization only
        print("\n📄 Preparing Statements Documents for Vectorization...")
        print("=" * 50)
        
        statements_blobs = uploader.list_blobs(Config.STATEMENTS_CONTAINER)
        for blob_name in tqdm(statements_blobs, desc="Preparing statements"):
            if blob_name.endswith(".md"):
                result = self.process_markdown_for_vectorization(Config.STATEMENTS_CONTAINER, blob_name)
                results["statements"].append(result)
            else:
                print(f"⚠️ Skipping non-markdown file: {blob_name}")
        
        # Process claims documents (images) - generate descriptions with GPT-4.1-mini 
        print("\n🖼️ Generating Image Descriptions with GPT-4.1-mini ...")
        print("=" * 50)
        
        claims_blobs = uploader.list_blobs(Config.CLAIMS_CONTAINER)
        for blob_name in tqdm(claims_blobs, desc="Generating descriptions"):
            if blob_name.lower().endswith((".jpg", ".jpeg", ".png")):
                result = self.generate_image_description_with_gpt(Config.CLAIMS_CONTAINER, blob_name)
                results["claims"].append(result)
            else:
                print(f"⚠️ Skipping non-image file: {blob_name}")
        
        return results
    
    def save_processed_results(self, results: Dict, output_file: str = "processed_documents_for_vectorization.json"):
        """Save processed results to JSON file and upload to blob storage"""
        try:
            # Save locally
            with open(output_file, "w", encoding="utf-8") as f:
                json.dump(results, f, indent=2, ensure_ascii=False)
            
            print(f"💾 Results saved locally: {output_file}")
            
            # Upload to blob storage
            success = uploader.upload_file(
                Path(output_file), 
                Config.PROCESSED_CONTAINER, 
                output_file
            )
            
            if success:
                print(f"☁️ Results uploaded to blob storage: {Config.PROCESSED_CONTAINER}/{output_file}")
            
        except Exception as e:
            print(f"❌ Error saving results: {e}")

## 6. Process All Documents with GPT-4.1-mini

Now, let's sit back and watch the magic happen!

In [8]:
# Initialize processor with GPT
if openai_client and blob_service_client:
    processor = DocumentProcessor(openai_client, blob_service_client)
    print("✅ Document processor initialized with GPT!")
    
    # Process documents using GPT
    print("\n🚀 Starting document processing with GPT...")
    print("=" * 60)
    
    processing_results = processor.process_all_documents()
    
    print("\n✅ Document processing completed!")
else:
    print("❌ Cannot initialize processor - missing clients")

✅ Document processor initialized with GPT!

🚀 Starting document processing with GPT...
📄 Preparing Policy Documents for Vectorization...


Preparing policies:   0%|          | 0/5 [00:00<?, ?it/s]

📄 Preparing markdown for vectorization: commercial_auto_policy.md...
📄 Preparing markdown for vectorization: comprehensive_auto_policy.md...


Preparing policies:  60%|██████    | 3/5 [00:00<00:00, 29.04it/s]

📄 Preparing markdown for vectorization: high_value_vehicle_policy.md...
📄 Preparing markdown for vectorization: liability_only_policy.md...
📄 Preparing markdown for vectorization: motorcycle_policy.md...


Preparing policies: 100%|██████████| 5/5 [00:00<00:00, 31.53it/s]



📄 Preparing Statements Documents for Vectorization...


Preparing statements:   0%|          | 0/5 [00:00<?, ?it/s]

📄 Preparing markdown for vectorization: crash1.md...
📄 Preparing markdown for vectorization: crash2.md...
📄 Preparing markdown for vectorization: crash3.md...


Preparing statements:  80%|████████  | 4/5 [00:00<00:00, 36.52it/s]

📄 Preparing markdown for vectorization: crash4.md...
📄 Preparing markdown for vectorization: crash5.md...


Preparing statements: 100%|██████████| 5/5 [00:00<00:00, 31.79it/s]



🖼️ Generating Image Descriptions with GPT-4.1-mini ...


Generating descriptions:   0%|          | 0/5 [00:00<?, ?it/s]

🖼️ Generating description for image: crash1.jpg...


Generating descriptions:  20%|██        | 1/5 [00:08<00:34,  8.62s/it]

🖼️ Generating description for image: crash2.jpg...


Generating descriptions:  40%|████      | 2/5 [00:22<00:35, 11.97s/it]

🖼️ Generating description for image: crash3.jpg...


Generating descriptions:  60%|██████    | 3/5 [00:35<00:24, 12.15s/it]

🖼️ Generating description for image: crash4.jpeg...


Generating descriptions:  80%|████████  | 4/5 [00:55<00:15, 15.36s/it]

🖼️ Generating description for image: crash5.jpg...


Generating descriptions: 100%|██████████| 5/5 [01:09<00:00, 13.98s/it]


✅ Document processing completed!





## 7. Results Analysis and Summary

Perfect! Now let's run the following code to save the results, so that we can inspect the processed content of our files locally. Please double-check that the output makes sense. Pay special attention to the information extracted from the images, which should contain a detailed description of each crash.

In [9]:
# Save processing results
processor.save_processed_results(processing_results)

💾 Results saved locally: processed_documents_for_vectorization.json
✅ Uploaded: processed_documents_for_vectorization.json → processed-documents/processed_documents_for_vectorization.json
☁️ Results uploaded to blob storage: processed-documents/processed_documents_for_vectorization.json
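
Before opening the JSON file, it can help to get a quick count of what succeeded and what failed. The sketch below assumes the result structure produced by `process_all_documents()` (a dictionary of lists whose items carry `success` and `metadata.file_name`). `summarize_results` and `failed_files` are hypothetical helpers, and the sample data is made up for illustration.

```python
from typing import Dict, List


def summarize_results(results: Dict[str, List[dict]]) -> Dict[str, Dict[str, int]]:
    """Count total, succeeded, and failed items for each document category."""
    summary = {}
    for category, items in results.items():
        succeeded = sum(1 for item in items if item.get("success"))
        summary[category] = {
            "total": len(items),
            "succeeded": succeeded,
            "failed": len(items) - succeeded,
        }
    return summary


def failed_files(results: Dict[str, List[dict]]) -> List[str]:
    """List the file names of every item that did not process successfully."""
    return [
        item["metadata"]["file_name"]
        for items in results.values()
        for item in items
        if not item.get("success")
    ]


# Illustrative sample only; in the notebook you would pass `processing_results`.
sample = {
    "policies": [{"success": True, "metadata": {"file_name": "motorcycle_policy.md"}}],
    "claims": [{"success": False, "error": "timeout", "metadata": {"file_name": "crash1.jpg"}}],
    "statements": [],
}
print(summarize_results(sample))
print(failed_files(sample))
```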


## Let's Cosmos our data!

But first... if you inspected the output, you will have seen that we did extract the content of our files, but it is not structured at all. Let's fix that! We will use the `Claim_ID` at the top of each submission to organize the records in a database. And of course, to do the extraction we will use... Generative AI!

In [10]:
from pydantic import BaseModel
from openai import AzureOpenAI
from azure.cosmos import CosmosClient, PartitionKey

# Schema for the structured output: every field is a string, with "N/A" used when a value is missing
class ClaimInfo(BaseModel):
    claimant_id: str
    policyholder_name: str
    policyholder_address: str
    policyholder_phone: str
    policyholder_email: str
    policy_number: str
    vehicle_year_make_model: str
    vehicle_color: str
    vehicle_vin: str
    vehicle_license_plate: str
    incident_date: str
    incident_time: str
    incident_location: str
    incident_description: str
    damage_description: str
    witness_name: str
    witness_phone: str
    police_department: str
    police_report_number: str
    repair_shop_name: str
    repair_shop_address: str
    attachments: str
    claim_request: str
    signature_name: str
    signature_date: str


def extract_structured_claim_info(text_content: str, claim_id: str) -> Optional[dict]:
    """Extract structured information from claim text using Azure OpenAI structured outputs"""
    try:
        client = AzureOpenAI(
            azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
            api_key=os.getenv("AZURE_OPENAI_KEY"),
            api_version="2024-08-01-preview"
        )
        
        completion = client.beta.chat.completions.parse(
            model=Config.AZURE_OPENAI_DEPLOYMENT_NAME,  # Use your deployment name
            messages=[
                {
                    "role": "system", 
                    "content": """You are an expert insurance claims processor. Extract structured information from crash statements and insurance claims. 
                    If any field is not available in the text, use "N/A" as the value. 
                    Be thorough and accurate in extracting all available information."""
                },
                {
                    "role": "user", 
                    "content": f"Extract the structured information from this crash statement for claim {claim_id}:\n\n{text_content}"
                },
            ],
            response_format=ClaimInfo,
        )
        
        structured_data = completion.choices[0].message.parsed
        print(f"✅ Extracted structured data for claim {claim_id}")
        return structured_data.model_dump()
        
    except Exception as e:
        print(f"❌ Error extracting structured data for claim {claim_id}: {e}")
        return None
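
Because the system prompt asks the model to use "N/A" for any field it cannot find, it can be useful to check how complete an extraction is before storing it. The following is a minimal sketch and not part of the original workflow: `find_missing_fields`, `PLACEHOLDER`, and the sample values are hypothetical names introduced for illustration, and the sketch assumes the input is a plain dict shaped like the output of `ClaimInfo.model_dump()`.

```python
from typing import Any

# Assumed to match the placeholder named in the system prompt above
PLACEHOLDER = "N/A"


def find_missing_fields(claim: dict[str, Any], placeholder: str = PLACEHOLDER) -> list[str]:
    """Return the names of fields that are None, empty, or equal to the placeholder."""
    missing = []
    for name, value in claim.items():
        if value is None or str(value).strip() in ("", placeholder):
            missing.append(name)
    return missing


# Hypothetical extraction result containing only a few of the ClaimInfo fields
sample_claim = {
    "policyholder_name": "Jane Doe",
    "policy_number": "POL-0000",
    "witness_name": "N/A",
    "witness_phone": "N/A",
    "police_report_number": "",
}

missing = find_missing_fields(sample_claim)
print(f"{len(missing)} of {len(sample_claim)} fields missing: {missing}")
```

Since `extract_structured_claim_info` returns `None` when extraction fails, a caller would want to check for that before passing the result to a helper like this one.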

In [16]:
def process_crash_reports_simplified():
    """Process JSON file and create simplified crash reports with only structured info and image descriptions"""
    
    # Load the processed JSON file
    with open('processed_documents_for_vectorization.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    # Create lookup dictionaries
    statements_lookup = {item['metadata']['file_name']: item for item in data['statements'] if item['success']}
    claims_lookup = {item['metadata']['file_name']: item for item in data['claims'] if item['success']}
    
    # Define crash to claim ID mapping
    crash_to_claim_mapping = {
        'crash1': 'CL001',
        'crash2': 'CL002', 
        'crash3': 'CL003',
        'crash4': 'CL001',  # Same claim as crash1
        'crash5': 'CL004'
    }
    
    # Group by claim ID to handle multiple crashes per claim
    claims_data = {}
    
    for crash_num, claim_id in crash_to_claim_mapping.items():
        statement_file = f"{crash_num}.md"
        image_files = [f"{crash_num}.jpg", f"{crash_num}.jpeg", f"{crash_num}.png"]
        
        # Find matching image file
        image_file = None
        for img in image_files:
            if img in claims_lookup:
                image_file = img
                break
        
        if statement_file in statements_lookup and image_file:
            statement_data = statements_lookup[statement_file]
            image_data = claims_lookup[image_file]
            
            if claim_id not in claims_data:
                claims_data[claim_id] = {
                    "structured_info": [],
                    "image_descriptions": []
                }
            
            # Extract structured information from the statement text
            print(f"🔍 Extracting structured info for {crash_num} (Claim: {claim_id})...")
            structured_info = extract_structured_claim_info(statement_data['text'], claim_id)
            
            if structured_info:
                claims_data[claim_id]["structured_info"].append({
                    "crash_number": crash_num,
                    "structured_data": structured_info
                })
            
            # Add image description
            claims_data[claim_id]["image_descriptions"].append({
                "crash_number": crash_num,
                "image_file": image_file,
                "description": image_data['description']
            })
        else:
            # Make skipped crashes visible instead of dropping them silently
            print(f"⚠️ Skipping {crash_num}: missing statement or image data")
    
    # Create simplified crash reports with only structured info and image descriptions
    simplified_reports = []
    for claim_id, claim_data in claims_data.items():
        # Combine structured info from all crashes in this claim
        combined_structured_info = {}
        if claim_data["structured_info"]:
            # Use the first crash's structured info as base
            combined_structured_info = claim_data["structured_info"][0]["structured_data"].copy()
            
            # For claims with multiple crashes, attach the remaining ones as additional crashes
            if len(claim_data["structured_info"]) > 1:
                combined_structured_info["additional_crashes"] = [
                    {
                        "crash_number": extra["crash_number"],
                        "structured_data": extra["structured_data"]
                    }
                    for extra in claim_data["structured_info"][1:]
                ]
        
        simplified_report = {
            "claim_id": claim_id,
            "structured_claim_info": combined_structured_info,
            "image_descriptions": claim_data["image_descriptions"]
        }
        
        simplified_reports.append(simplified_report)
        crashes_count = len(claim_data["structured_info"])
        print(f"✅ Created simplified report for Claim ID: {claim_id} ({crashes_count} crashes)")
    
    return simplified_reports

def save_simplified_cosmos_db(simplified_reports):
    """Save simplified reports to Cosmos DB"""
    
    # Initialize Cosmos client
    client = CosmosClient(Config.COSMOS_ENDPOINT, Config.COSMOS_KEY)
    database = client.create_database_if_not_exists(id=Config.COSMOS_DATABASE)
    
    # Partition the container by claim_id so each claim's report lives in its own logical partition
    container = database.create_container_if_not_exists(
        id=Config.COSMOS_CONTAINER,
        partition_key=PartitionKey(path="/claim_id")
    )
    
    # Save simplified reports
    print("💾 Saving simplified crash reports...")
    for report in simplified_reports:
        try:
            # Add required id field for Cosmos DB
            report_with_id = report.copy()
            report_with_id["id"] = report["claim_id"]
            
            container.upsert_item(body=report_with_id)
            print(f"✅ Saved simplified Claim {report['claim_id']} to Cosmos DB")
            
            # Print sample of structured info for verification
            if report.get("structured_claim_info"):
                sample_fields = ["policyholder_name", "policy_number", "incident_date", "incident_location"]
                print("   📋 Sample structured data:")
                for field in sample_fields:
                    value = report["structured_claim_info"].get(field, "N/A")
                    print(f"      {field}: {value}")
                
        except Exception as e:
            print(f"❌ Error saving {report['claim_id']}: {e}")

# Execute the simplified processing
print("🚀 Starting simplified crash report processing...")
simplified_reports = process_crash_reports_simplified()
print(f"\n📊 Created {len(simplified_reports)} simplified crash reports")

# Display the simplified structure
print("\n📋 Simplified Report Structure:")
for report in simplified_reports:
    print(f"  Claim {report['claim_id']}:")
    print(f"    - Structured Info: {'✅' if report.get('structured_claim_info') else '❌'}")
    print(f"    - Image Descriptions: {len(report.get('image_descriptions', []))}")
    
    # Show sample structured data
    if report.get("structured_claim_info"):
        policyholder = report["structured_claim_info"].get("policyholder_name", "N/A")
        policy_num = report["structured_claim_info"].get("policy_number", "N/A")
        print(f"    - Policyholder: {policyholder}, Policy: {policy_num}")

# Save to Cosmos DB
save_simplified_cosmos_db(simplified_reports)

# Save the simplified reports locally
with open('simplified_crash_reports.json', 'w', encoding='utf-8') as f:
    json.dump(simplified_reports, f, indent=2, ensure_ascii=False)
print("💾 Simplified crash reports saved to 'simplified_crash_reports.json'")

# Show a sample of what the final JSON looks like
print("\n📄 Sample of simplified JSON structure:")
if simplified_reports:
    sample_report = simplified_reports[0]
    print(json.dumps({
        "sample_claim": {
            "claim_id": sample_report["claim_id"],
            "structured_claim_info": {
                "policyholder_name": sample_report["structured_claim_info"].get("policyholder_name", "N/A"),
                "policy_number": sample_report["structured_claim_info"].get("policy_number", "N/A"),
                "incident_date": sample_report["structured_claim_info"].get("incident_date", "N/A"),
                "...": "all other structured fields"
            },
            "image_descriptions": [
                {
                    "crash_number": sample_report["image_descriptions"][0]["crash_number"],
                    "image_file": sample_report["image_descriptions"][0]["image_file"],
                    "description": "Detailed image description..."
                }
            ]
        }
    }, indent=2))

🚀 Starting simplified crash report processing...
🔍 Extracting structured info for crash1 (Claim: CL001)...
✅ Extracted structured data for claim CL001
🔍 Extracting structured info for crash2 (Claim: CL002)...
✅ Extracted structured data for claim CL002
🔍 Extracting structured info for crash3 (Claim: CL003)...
✅ Extracted structured data for claim CL003
🔍 Extracting structured info for crash4 (Claim: CL001)...
✅ Extracted structured data for claim CL001
🔍 Extracting structured info for crash5 (Claim: CL004)...
✅ Extracted structured data for claim CL004
✅ Created simplified report for Claim ID: CL001 (2 crashes)
✅ Created simplified report for Claim ID: CL002 (1 crashes)
✅ Created simplified report for Claim ID: CL003 (1 crashes)
✅ Created simplified report for Claim ID: CL004 (1 crashes)

📊 Created 4 simplified crash reports

📋 Simplified Report Structure:
  Claim CL001:
    - Structured Info: ✅
    - Image Descriptions: 2
    - Policyholder: John Peterson, Policy: LIAB-AUTO-001
  Clai