# Document Vectorization with Azure AI Search Integrated Vectorization

This notebook demonstrates how to create a sophisticated search index using Azure AI Search's integrated vectorization capabilities for insurance document retrieval. The workflow includes:

1. **Retrieve Processed Documents from Azure Blob Storage**: Download the processed insurance documents (policies, claims, statements) that were created in the previous notebook, including structured claim data and detailed image descriptions.

2. **Create Azure AI Search Index with Integrated Vectorization**: 
   - **Index Schema Design**: Define a comprehensive search schema with fields for content, metadata, and vector embeddings
   - **Integrated Vectorization Setup**: Configure Azure AI Search to automatically generate embeddings using Azure OpenAI's text-embedding-ada-002 model
   - **Semantic Search Configuration**: Enable semantic search capabilities for natural language queries

3. **Intelligent Text Chunking**: Process large insurance documents into optimally-sized chunks with overlapping content to ensure comprehensive coverage while maintaining context for accurate retrieval.

4. **Upload Documents to Azure AI Search**: 
   - **Batch Processing**: Efficiently upload document chunks to the search index
   - **Automatic Embedding Generation**: Azure AI Search automatically creates vector embeddings for each document chunk using the configured OpenAI model
   - **Real-time Indexing**: Documents become immediately searchable upon upload

5. **Advanced Search Testing**: 
   - **Semantic Search**: Test natural language queries against insurance policies using AI-powered semantic understanding
   - **Vector Search**: Perform similarity-based searches using vector embeddings
   - **Hybrid Search**: Combine keyword and vector search for optimal results
   - **Interactive Testing**: Provide an interactive interface for real-time search testing

6. **Search Analytics and Validation**: Generate comprehensive statistics about the indexed documents, search performance, and readiness for AI agent integration.


## 1. Setup and Configuration
Let's start with handling the import of our libraries and load the `.env` variables that we have saved in the previous challenge.

In [1]:
import os
import json
import pandas as pd
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from tqdm import tqdm
import re
from datetime import datetime
import uuid

# Azure SDK imports
from azure.storage.blob import BlobServiceClient
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    VectorSearch,
    VectorSearchProfile,
    VectorSearchAlgorithmConfiguration,
    VectorSearchAlgorithmKind,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    HnswAlgorithmConfiguration,
    ExhaustiveKnnAlgorithmConfiguration
)
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import ResourceNotFoundError

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

print("✅ All imports successful!")

✅ All imports successful!


In [2]:
# Configuration
class Config:
    # Storage configuration
    AZURE_STORAGE_CONNECTION_STRING = os.getenv('AZURE_STORAGE_CONNECTION_STRING')
    
    # Azure AI Search configuration
    SEARCH_SERVICE_NAME = os.getenv('SEARCH_SERVICE_NAME')
    SEARCH_SERVICE_ENDPOINT = os.getenv('SEARCH_SERVICE_ENDPOINT')
    SEARCH_ADMIN_KEY = os.getenv('SEARCH_ADMIN_KEY')
    
    # Azure OpenAI configuration (for integrated vectorization)
    AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT')
    AZURE_OPENAI_API_KEY = os.getenv('AZURE_OPENAI_KEY')
    AZURE_OPENAI_EMBEDDING_DEPLOYMENT = os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT', 'text-embedding-ada-002')
    
    # Container names
    PROCESSED_CONTAINER = 'processed-documents'
    
    # Search index configuration
    SEARCH_INDEX_NAME = 'insurance-documents-index'
    CHUNK_SIZE = 1000  # Characters per chunk
    CHUNK_OVERLAP = 200  # Overlap between chunks

# Validate configuration
required_vars = [
    Config.AZURE_STORAGE_CONNECTION_STRING,
    Config.SEARCH_SERVICE_ENDPOINT,
    Config.SEARCH_ADMIN_KEY,
    Config.AZURE_OPENAI_ENDPOINT,
    Config.AZURE_OPENAI_API_KEY
]

missing_vars = [var for var in required_vars if not var]
if missing_vars:
    print("❌ Missing environment variables. Please check your .env file.")
    print("Missing variables:")
    if not Config.SEARCH_SERVICE_ENDPOINT:
        print("  - SEARCH_SERVICE_ENDPOINT")
    if not Config.SEARCH_ADMIN_KEY:
        print("  - SEARCH_ADMIN_KEY")
    if not Config.AZURE_OPENAI_ENDPOINT:
        print("  - AZURE_OPENAI_ENDPOINT")
    if not Config.AZURE_OPENAI_API_KEY:
        print("  - AZURE_OPENAI_API_KEY")
    if not Config.AZURE_STORAGE_CONNECTION_STRING:
        print("  - AZURE_STORAGE_CONNECTION_STRING")
else:
    print("✅ Configuration loaded successfully!")
    print(f"🔍 Search Service: {Config.SEARCH_SERVICE_NAME}")
    print(f"🔗 Search Endpoint: {Config.SEARCH_SERVICE_ENDPOINT}")
    print(f"📦 Processed Documents Container: {Config.PROCESSED_CONTAINER}")
    print(f"📇 Search Index: {Config.SEARCH_INDEX_NAME}")

✅ Configuration loaded successfully!
🔍 Search Service: msagthack-search-m3qz57ik6ngog
🔗 Search Endpoint: https://msagthack-search-m3qz57ik6ngog.search.windows.net/
📦 Processed Documents Container: processed-documents
📇 Search Index: insurance-documents-index


## 2. Initialize Azure Services

The next cell creates connections to Azure Blob Storage for document retrieval and Azure AI Search for index management, with comprehensive error handling and connection testing.

In [3]:
# Initialize Azure clients
def initialize_clients():
    """Initialize Azure service clients"""
    try:
        # Blob Storage client
        blob_service_client = BlobServiceClient.from_connection_string(
            Config.AZURE_STORAGE_CONNECTION_STRING
        )
        
        # Azure AI Search clients
        search_credential = AzureKeyCredential(Config.SEARCH_ADMIN_KEY)
        
        search_index_client = SearchIndexClient(
            endpoint=Config.SEARCH_SERVICE_ENDPOINT,
            credential=search_credential
        )
        
        search_client = SearchClient(
            endpoint=Config.SEARCH_SERVICE_ENDPOINT,
            index_name=Config.SEARCH_INDEX_NAME,
            credential=search_credential
        )
        
        # Test the connections
        containers = list(blob_service_client.list_containers())
        print(f"✅ Connected to Blob Storage - Found {len(containers)} containers")
        
        # Test search service (fixed the storage_size access)
        try:
            service_stats = search_index_client.get_service_statistics()
            storage_used = getattr(service_stats, 'storage_size', 'Unknown')
            print(f"✅ Connected to Azure AI Search - Storage used: {storage_used}")
        except Exception as e:
            print(f"✅ Connected to Azure AI Search - Service is available")
            print(f"   (Note: Could not get statistics: {e})")
        
        return blob_service_client, search_index_client, search_client
        
    except Exception as e:
        print(f"❌ Error initializing clients: {e}")
        return None, None, None

# Initialize clients
blob_service_client, search_index_client, search_client = initialize_clients()

✅ Connected to Blob Storage - Found 4 containers
✅ Connected to Azure AI Search - Storage used: Unknown


## 3. Create Azure AI Search Index with Integrated Vectorization
The next cell defines a `SearchIndexManager` class that creates a sophisticated search index with integrated vectorization, semantic search capabilities, and proper field schema for insurance documents.

In [4]:
class SearchIndexManager:
    """Class to manage Azure AI Search index with integrated vectorization"""
    
    def __init__(self, search_index_client: SearchIndexClient):
        self.search_index_client = search_index_client
        self.index_name = Config.SEARCH_INDEX_NAME

    def _format_azure_openai_endpoint(self, endpoint: str) -> str:
        """Format the Azure OpenAI endpoint for use with Azure AI Search vectorizer"""
        # Remove trailing slash if present
        endpoint = endpoint.rstrip('/')
        
        # Check if it already has the correct format
        if endpoint.endswith('.openai.azure.com'):
            return endpoint
        
        # Extract the resource name from various possible formats
        if '.cognitiveservices.azure.com' in endpoint:
            # Convert from cognitive services format to OpenAI format
            resource_name = endpoint.split('.')[0].split('//')[-1]
            return f"https://{resource_name}.openai.azure.com"
        elif '/openai/' in endpoint:
            # Extract resource name from URL with /openai/ path
            parts = endpoint.split('/')
            resource_name = parts[2].split('.')[0]
            return f"https://{resource_name}.openai.azure.com"
        else:
            # Try to extract resource name and format correctly
            if 'https://' in endpoint:
                resource_name = endpoint.split('//')[1].split('.')[0]
            else:
                resource_name = endpoint.split('.')[0]
            return f"https://{resource_name}.openai.azure.com"
    
    def create_search_index(self) -> bool:
        """Create a search index with integrated vectorization"""
        try:
            # Format the Azure OpenAI endpoint correctly
            formatted_endpoint = self._format_azure_openai_endpoint(Config.AZURE_OPENAI_ENDPOINT)
            print(f"🔗 Original endpoint: {Config.AZURE_OPENAI_ENDPOINT}")
            print(f"🔗 Formatted endpoint: {formatted_endpoint}")
            print(f"🚀 Using deployment: {Config.AZURE_OPENAI_EMBEDDING_DEPLOYMENT}")
            
            # Define the vectorizer for integrated vectorization
            vectorizer = AzureOpenAIVectorizer(
                vectorizer_name="insurance-vectorizer",
                parameters=AzureOpenAIVectorizerParameters(
                    resource_url=formatted_endpoint,  # Use formatted endpoint
                    deployment_name=Config.AZURE_OPENAI_EMBEDDING_DEPLOYMENT,  # Use config variable
                    model_name="text-embedding-ada-002",
                    api_key=Config.AZURE_OPENAI_API_KEY
                )
            )
            
            # Define vector search configuration
            vector_search = VectorSearch(
                algorithms=[
                    HnswAlgorithmConfiguration(name="insurance-algorithm", kind="hnsw"),
                    ExhaustiveKnnAlgorithmConfiguration(name="my-eknn-vector-config", kind="exhaustiveKnn")
                ],
                profiles=[
                    VectorSearchProfile(
                        name="insurance-profile",
                        algorithm_configuration_name="insurance-algorithm",
                        vectorizer_name="insurance-vectorizer"
                    )
                ],
                vectorizers=[vectorizer]
            )
            
            # Define semantic search configuration
            semantic_config = SemanticConfiguration(
                name="insurance-semantic",  # Fixed to match SearchTester
                prioritized_fields=SemanticPrioritizedFields(
                    title_field=SemanticField(field_name="title"),
                    content_fields=[SemanticField(field_name="content")],
                    keywords_fields=[
                        SemanticField(field_name="category"),
                        SemanticField(field_name="file_name")
                    ]
                )
            )
            
            semantic_search = SemanticSearch(
                configurations=[semantic_config]
            )
            
            # Define the search index schema
            fields = [
                SimpleField(name="id", type=SearchFieldDataType.String, key=True),
                SearchableField(name="title", type=SearchFieldDataType.String),
                SearchableField(name="content", type=SearchFieldDataType.String),
                SearchableField(name="category", type=SearchFieldDataType.String, filterable=True, facetable=True),
                SearchableField(name="file_name", type=SearchFieldDataType.String, filterable=True),
                SimpleField(name="file_type", type=SearchFieldDataType.String, filterable=True),
                SimpleField(name="chunk_id", type=SearchFieldDataType.Int32),
                SimpleField(name="chunk_count", type=SearchFieldDataType.Int32),
                SimpleField(name="original_length", type=SearchFieldDataType.Int32),
                SimpleField(name="chunk_length", type=SearchFieldDataType.Int32),
                SimpleField(name="processing_date", type=SearchFieldDataType.DateTimeOffset),
                
                # Vector field for integrated vectorization
                SearchField(
                    name="content_vector",
                    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                    searchable=True,
                    vector_search_dimensions=1536,  # ada-002 embedding dimension
                    vector_search_profile_name="insurance-profile"
                )
            ]
            
            # Create the search index
            index = SearchIndex(
                name=self.index_name,
                fields=fields,
                vector_search=vector_search,
                semantic_search=semantic_search
            )
            
            # Create or update the index
            result = self.search_index_client.create_or_update_index(index)
            print(f"✅ Search index '{self.index_name}' created successfully!")
            print(f"📋 Index fields: {len(result.fields)}")
            print(f"🔍 Vector search enabled: {bool(result.vector_search)}")
            print(f"🧠 Semantic search enabled: {bool(result.semantic_search)}")
            
            return True
            
        except Exception as e:
            print(f"❌ Error creating search index: {e}")
            print(f"🔍 Debug info:")
            print(f"   - Original endpoint: {Config.AZURE_OPENAI_ENDPOINT}")
            print(f"   - Deployment name: {Config.AZURE_OPENAI_EMBEDDING_DEPLOYMENT}")
            print(f"   - API key present: {bool(Config.AZURE_OPENAI_API_KEY)}")
            
            # Add more detailed error information
            import traceback
            print(f"📋 Full error details:\n{traceback.format_exc()}")
            return False
    
    def delete_index_if_exists(self) -> bool:
        """Delete the index if it exists"""
        try:
            self.search_index_client.delete_index(self.index_name)
            print(f"✅ Deleted existing index: {self.index_name}")
            return True
        except ResourceNotFoundError:
            print(f"ℹ️ Index {self.index_name} doesn't exist - will create new")
            return True
        except Exception as e:
            print(f"❌ Error deleting index: {e}")
            return False
    
    def get_index_stats(self) -> Dict:
        """Get statistics about the search index"""
        try:
            index = self.search_index_client.get_index(self.index_name)
            stats = self.search_index_client.get_index_statistics(self.index_name)
            
            # Handle both object and dictionary responses
            if hasattr(stats, 'document_count'):
                # Object response
                return {
                    "name": index.name,
                    "field_count": len(index.fields),
                    "document_count": stats.document_count,
                    "storage_size": stats.storage_size,
                    "vector_index_size": getattr(stats, 'vector_index_size', 0)
                }
            else:
                # Dictionary response
                return {
                    "name": index.name,
                    "field_count": len(index.fields),
                    "document_count": stats.get('document_count', 0),
                    "storage_size": stats.get('storage_size', 0),
                    "vector_index_size": stats.get('vector_index_size', 0)
                }
        except Exception as e:
            print(f"❌ Error getting index stats: {e}")
            return {}

# Initialize search index manager
if search_index_client:
    index_manager = SearchIndexManager(search_index_client)
    
    # Option to recreate index (uncomment if needed)
    # print("🔄 Recreating search index...")
    # index_manager.delete_index_if_exists()
    
    success = index_manager.create_search_index()
    if success:
        print("\n📊 Index created successfully!")
    else:
        print("\n❌ Failed to create search index")
else:
    print("❌ Cannot create search index - missing search client")
    index_manager = None

🔗 Original endpoint: https://msagthack-aifoundry-m3qz57ik6ngog.cognitiveservices.azure.com/
🔗 Formatted endpoint: https://msagthack-aifoundry-m3qz57ik6ngog.openai.azure.com
🚀 Using deployment: text-embedding-ada-002
✅ Search index 'insurance-documents-index' created successfully!
📋 Index fields: 12
🔍 Vector search enabled: True
🧠 Semantic search enabled: True

📊 Index created successfully!


## 4. Document Retrieval and Processing

The next cell defines two essential classes: `DocumentRetriever` handles downloading processed documents from Azure Blob Storage, while `TextChunker` intelligently splits large documents into optimally-sized chunks with overlapping content. These components prepare the insurance documents for efficient indexing and retrieval in the search system.

In [5]:
# Reuse the DocumentRetriever class from previous notebook
class DocumentRetriever:
    """Class to handle document retrieval from blob storage"""
    
    def __init__(self, blob_service_client):
        self.blob_service_client = blob_service_client
    
    def get_all_processed_documents(self) -> Dict:
        """Get all processed documents ready for vectorization"""
        try:
            container_client = self.blob_service_client.get_container_client(Config.PROCESSED_CONTAINER)
            blob_client = container_client.get_blob_client("processed_documents_for_vectorization.json")
            
            blob_data = blob_client.download_blob().readall()
            documents = json.loads(blob_data.decode('utf-8'))
            
            print(f"✅ Downloaded processed documents")
            return documents
                
        except ResourceNotFoundError:
            print(f"❌ File not found: processed_documents_for_vectorization.json")
            return {}
        except Exception as e:
            print(f"❌ Error downloading documents: {e}")
            return {}

# Text chunking class (simplified for search index)
class TextChunker:
    """Class to handle intelligent text chunking for search index"""
    
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
    
    def clean_text(self, text: str) -> str:
        """Clean and normalize text"""
        text = re.sub(r'\s+', ' ', text)
        text = re.sub(r'\n+', '\n', text)
        return text.strip()
    
    def chunk_text_for_search(self, text: str, metadata: Dict) -> List[Dict]:
        """Create chunks optimized for search index"""
        text = self.clean_text(text)
        chunks = []
        
        if len(text) <= self.chunk_size:
            return [{
                'content': text,
                'chunk_id': 0,
                'chunk_count': 1,
                'metadata': metadata.copy()
            }]
        
        # Simple sliding window chunking
        start = 0
        chunk_id = 0
        
        while start < len(text):
            end = start + self.chunk_size
            
            # Try to break at sentence boundaries
            if end < len(text):
                sentence_end = text.rfind('.', start, end)
                if sentence_end > start:
                    end = sentence_end + 1
            
            chunk_text = text[start:end].strip()
            
            if chunk_text:
                chunks.append({
                    'content': chunk_text,
                    'chunk_id': chunk_id,
                    'chunk_count': 0,  # Will be updated later
                    'metadata': metadata.copy()
                })
                chunk_id += 1
            
            # Move start position with overlap
            start = max(start + self.chunk_size - self.chunk_overlap, end)
        
        # Update chunk count
        for chunk in chunks:
            chunk['chunk_count'] = len(chunks)
        
        return chunks

# Initialize processors
if blob_service_client:
    retriever = DocumentRetriever(blob_service_client)
    chunker = TextChunker(
        chunk_size=Config.CHUNK_SIZE,
        chunk_overlap=Config.CHUNK_OVERLAP
    )
    print("✅ Document processors initialized")

✅ Document processors initialized


## 5. Retrieve and Process Documents

The next cell implements an enhanced document retrieval system that downloads processed insurance documents from Azure Blob Storage and prepares them for search indexing with detailed error handling and debugging capabilities.


In [6]:
# Enhanced document retriever with better error handling and debugging
class EnhancedDocumentRetriever:
    """Enhanced class to handle document retrieval with detailed debugging"""
    
    def __init__(self, blob_service_client):
        self.blob_service_client = blob_service_client
    
    def get_all_processed_documents(self) -> Dict:
        """Get all processed documents with enhanced error handling"""
        try:
            print(f"🔍 Attempting to retrieve from container: {Config.PROCESSED_CONTAINER}")
            
            # Get container client
            container_client = self.blob_service_client.get_container_client(Config.PROCESSED_CONTAINER)
            
            # Try to access the specific file
            blob_name = "processed_documents_for_vectorization.json"
            print(f"📥 Downloading file: {blob_name}")
            
            blob_client = container_client.get_blob_client(blob_name)
            
            # Check if blob exists first
            try:
                blob_props = blob_client.get_blob_properties()
                file_size = blob_props.size
                print(f"✅ File found - Size: {file_size / (1024*1024):.2f} MB")
            except Exception as e:
                print(f"❌ File access error: {e}")
                return {}
            
            # Download the blob
            print("📥 Downloading blob content...")
            blob_data = blob_client.download_blob().readall()
            
            # Parse JSON
            print("🔄 Parsing JSON content...")
            documents = json.loads(blob_data.decode('utf-8'))
            
            print(f"✅ Successfully downloaded and parsed processed documents")
            print(f"📊 Found categories: {list(documents.keys())}")
            
            # Show some stats
            for category, docs in documents.items():
                successful_docs = [d for d in docs if d.get('success', False)]
                print(f"   - {category}: {len(successful_docs)}/{len(docs)} successful documents")
            
            return documents
                
        except ResourceNotFoundError:
            print(f"❌ File not found: {blob_name}")
            print(f"   Container: {Config.PROCESSED_CONTAINER}")
            print("   This means the file doesn't exist in the specified container")
            return {}
        except json.JSONDecodeError as e:
            print(f"❌ JSON parsing error: {e}")
            print("   The file exists but contains invalid JSON")
            return {}
        except Exception as e:
            print(f"❌ Unexpected error downloading documents: {e}")
            print(f"   Error type: {type(e).__name__}")
            import traceback
            print(f"   Full traceback: {traceback.format_exc()}")
            return {}

# Replace the original retriever and try document retrieval
if blob_service_client:
    retriever = EnhancedDocumentRetriever(blob_service_client)
    print("✅ Enhanced document retriever initialized")
    
    # Now try to retrieve the documents
    print("\n" + "="*60)
    print("📥 RETRIEVING PROCESSED DOCUMENTS FROM BLOB STORAGE")
    print("="*60)
    
    processed_documents = retriever.get_all_processed_documents()
    
    if processed_documents:
        print(f"\n🎉 SUCCESS! Retrieved processed documents from blob storage")
        print(f"📊 Available categories: {list(processed_documents.keys())}")
        
        # Filter to only process POLICIES
        policies_only = {'policies': processed_documents.get('policies', [])}
        
        print(f"🎯 Filtering to process POLICIES only...")
        print(f"📄 Found {len(policies_only['policies'])} policy documents")
        
        # Process only policy documents into search-ready chunks
        search_documents = []
        
        for category, docs in policies_only.items():
            print(f"\n📂 Processing {category} documents...")
            
            successful_docs = [doc for doc in docs if doc.get('success', False)]
            print(f"✅ Processing {len(successful_docs)} successful {category} documents")
            
            for doc in tqdm(successful_docs, desc=f"Processing {category}"):
                # Get text content from policies (markdown files)
                text_content = doc.get('text', '')
                if not text_content:
                    print(f"⚠️ Skipping document with no text content: {doc.get('metadata', {}).get('file_name', 'Unknown')}")
                    continue
                
                # Prepare metadata
                metadata = doc.get('metadata', {}).copy()
                metadata['category'] = category
                
                # Create chunks for this document
                chunks = chunker.chunk_text_for_search(text_content, metadata)
                
                # Convert chunks to search documents
                for chunk in chunks:
                    search_doc = {
                        'id': str(uuid.uuid4()),
                        'title': f"{metadata.get('file_name', 'Unknown')} - Part {chunk['chunk_id'] + 1}",
                        'content': chunk['content'],
                        'category': category,
                        'file_name': metadata.get('file_name', 'Unknown'),
                        'file_type': metadata.get('file_type', 'markdown'),
                        'chunk_id': chunk['chunk_id'],
                        'chunk_count': chunk['chunk_count'],
                        'original_length': len(text_content),
                        'chunk_length': len(chunk['content']),
                        'processing_date': datetime.now().isoformat() + 'Z'
                    }
                    search_documents.append(search_doc)
        
        print(f"\n✅ Prepared {len(search_documents)} policy document chunks for search index")
        
        # Show detailed statistics for policies only
        if search_documents:
            total_files = len(set(doc['file_name'] for doc in search_documents))
            total_chunks = len(search_documents)
            avg_chunk_length = sum(doc['chunk_length'] for doc in search_documents) / total_chunks
            
            print(f"\n📊 POLICIES INDEXING SUMMARY:")
            print(f"   📄 Total policy files: {total_files}")
            print(f"   🗂️ Total chunks created: {total_chunks}")
            print(f"   📏 Average chunk length: {avg_chunk_length:.0f} characters")
            
            # Show file breakdown
            file_stats = {}
            for doc in search_documents:
                file_name = doc['file_name']
                if file_name not in file_stats:
                    file_stats[file_name] = 0
                file_stats[file_name] += 1
            
            print(f"\n📋 Policy files breakdown:")
            for file_name, chunk_count in file_stats.items():
                print(f"   • {file_name}: {chunk_count} chunks")
        else:
            print("❌ No policy documents were processed successfully")
        
    else:
        print("\n❌ Still unable to retrieve documents from blob storage")
        print("💡 Troubleshooting steps:")
        print("   1. Verify the file exists in blob storage using Azure Portal") 
        print("   2. Check that the container name 'processed-documents' is correct")
        print("   3. Ensure your storage connection string has the right permissions")
        print("   4. Try running the document processing notebook (1.document-processing.ipynb) first")
        search_documents = []
        
else:
    print("❌ Cannot proceed - blob service client not available")
    search_documents = []

print(f"\n🔍 Final check - search_documents variable has {len(search_documents) if 'search_documents' in locals() else 0} documents")

✅ Enhanced document retriever initialized

📥 RETRIEVING PROCESSED DOCUMENTS FROM BLOB STORAGE
🔍 Attempting to retrieve from container: processed-documents
📥 Downloading file: processed_documents_for_vectorization.json
✅ File found - Size: 0.07 MB
📥 Downloading blob content...
🔄 Parsing JSON content...
✅ Successfully downloaded and parsed processed documents
📊 Found categories: ['policies', 'claims', 'statements']
   - policies: 5/5 successful documents
   - claims: 5/5 successful documents
   - statements: 5/5 successful documents

🎉 SUCCESS! Retrieved processed documents from blob storage
📊 Available categories: ['policies', 'claims', 'statements']
🎯 Filtering to process POLICIES only...
📄 Found 5 policy documents

📂 Processing policies documents...
✅ Processing 5 successful policies documents


Processing policies: 100%|██████████| 5/5 [00:00<00:00, 1709.31it/s]


✅ Prepared 44 policy document chunks for search index

📊 POLICIES INDEXING SUMMARY:
   📄 Total policy files: 5
   🗂️ Total chunks created: 44
   📏 Average chunk length: 845 characters

📋 Policy files breakdown:
   • commercial_auto_policy.md: 9 chunks
   • comprehensive_auto_policy.md: 7 chunks
   • high_value_vehicle_policy.md: 10 chunks
   • liability_only_policy.md: 7 chunks
   • motorcycle_policy.md: 11 chunks

🔍 Final check - search_documents variable has 44 documents





## 6. Upload Documents to Azure AI Search with Integrated Vectorization

The next cell implements a `SearchIndexUploader` class that efficiently uploads the processed policy document chunks to Azure AI Search in batches, with automatic embedding generation through integrated vectorization and comprehensive error handling and progress tracking.

In [7]:
class SearchIndexUploader:
    """Class to upload documents to Azure AI Search"""
    
    def __init__(self, search_client: SearchClient):
        self.search_client = search_client
    
    def upload_documents_batch(self, documents: List[Dict], batch_size: int = 50) -> bool:
        """Upload documents to search index in batches"""
        try:
            total_docs = len(documents)
            print(f"📤 Uploading {total_docs} documents to search index...")
            
            # Upload in batches
            for i in tqdm(range(0, total_docs, batch_size), desc="Uploading batches"):
                batch = documents[i:i + batch_size]
                
                # Prepare batch for upload (Azure AI Search will handle vectorization)
                upload_batch = []
                for doc in batch:
                    # Remove any fields that shouldn't be in the search document
                    search_doc = doc.copy()
                    upload_batch.append(search_doc)
                
                # Upload batch
                result = self.search_client.upload_documents(documents=upload_batch)
                
                # Check for errors
                failed_docs = [r for r in result if not r.succeeded]
                if failed_docs:
                    print(f"⚠️ Failed to upload {len(failed_docs)} documents in batch {i//batch_size + 1}")
                    for failed in failed_docs[:3]:  # Show first 3 errors
                        print(f"   Error: {failed.error_message}")
            
            print(f"✅ Document upload completed!")
            return True
            
        except Exception as e:
            print(f"❌ Error uploading documents: {e}")
            return False
    
    def get_document_count(self) -> int:
        """Get the current document count in the index"""
        try:
            # Simple search to get document count
            results = self.search_client.search("*", include_total_count=True, top=1)
            return results.get_count()
        except Exception as e:
            print(f"❌ Error getting document count: {e}")
            return 0

# Upload documents to search index
if search_client and search_documents:
    uploader = SearchIndexUploader(search_client)
    
    print("\n🚀 Starting POLICIES upload to Azure AI Search...")
    print("=" * 60)
    print("🎯 Uploading POLICY documents only")
    print("ℹ️ Azure AI Search will automatically generate embeddings using integrated vectorization")
    
    success = uploader.upload_documents_batch(search_documents)
    
    if success:
        # Wait a moment for indexing to complete
        import time
        print("\n⏳ Waiting for indexing to complete...")
        time.sleep(10)
        
        # Get final document count
        doc_count = uploader.get_document_count()
        print(f"✅ Index now contains {doc_count} policy document chunks")
        
        # Get index statistics
        if index_manager:
            stats = index_manager.get_index_stats()
            if stats:
                print(f"📊 Index statistics:")
                print(f"   - Policy documents: {stats.get('document_count', 'N/A')}")
                print(f"   - Storage size: {stats.get('storage_size', 'N/A')} bytes")
                print(f"   - Vector index size: {stats.get('vector_index_size', 'N/A')} bytes")
        
        print(f"\n🎯 SUCCESS: Only policy documents have been indexed!")
        print(f"📄 Your Azure AI Search index now contains comprehensive policy information")
        print(f"🔍 Ready for policy-related queries and AI agent integration")
        
    else:
        print("❌ Failed to upload policy documents to search index")
else:
    print("❌ Cannot upload documents - missing search client or policy documents")


🚀 Starting POLICIES upload to Azure AI Search...
🎯 Uploading POLICY documents only
ℹ️ Azure AI Search will automatically generate embeddings using integrated vectorization
📤 Uploading 44 documents to search index...


Uploading batches: 100%|██████████| 1/1 [00:00<00:00,  2.06it/s]


✅ Document upload completed!

⏳ Waiting for indexing to complete...
✅ Index now contains 44 policy document chunks
📊 Index statistics:
   - Policy documents: 0
   - Storage size: 0 bytes
   - Vector index size: 0 bytes

🎯 SUCCESS: Only policy documents have been indexed!
📄 Your Azure AI Search index now contains comprehensive policy information
🔍 Ready for policy-related queries and AI agent integration


## 7. Test the Search Index with Semantic and Vector Search

The next cell defines a `SearchTester` class that provides comprehensive testing capabilities for the Azure AI Search index, including semantic search with reranking, hybrid search combining keyword and vector approaches, and formatted result display with relevance scores and content previews.

In [8]:
class SearchTester:
    """Class to test the search index with various query types"""
    
    def __init__(self, search_client: SearchClient):
        self.search_client = search_client
    
    def vector_search(self, query: str, top_k: int = 5, category_filter: str = None) -> List[Dict]:
        """Perform vector search using integrated vectorization"""
        try:
            # Build search parameters
            search_params = {
                "search_text": query,
                "top": top_k,
                "search_mode": "any",
                "query_type": "semantic",
                "semantic_configuration_name": "insurance-semantic",
                "select": ["id", "title", "content", "category", "file_name", "chunk_id", "chunk_count"]
            }
            
            # Add category filter if specified
            if category_filter:
                search_params["filter"] = f"category eq '{category_filter}'"
            
            # Perform search
            results = self.search_client.search(**search_params)
            
            # Convert results to list
            search_results = []
            for result in results:
                search_results.append({
                    'id': result['id'],
                    'title': result['title'],
                    'content': result['content'],
                    'category': result['category'],
                    'file_name': result['file_name'],
                    'chunk_id': result['chunk_id'],
                    'chunk_count': result['chunk_count'],
                    'score': result.get('@search.score', 0),
                    'reranker_score': result.get('@search.reranker_score', 0)
                })
            
            return search_results
            
        except Exception as e:
            print(f"❌ Error in vector search: {e}")
            return []
    
    def hybrid_search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Perform hybrid search (keyword + vector)"""
        try:
            results = self.search_client.search(
                search_text=query,
                top=top_k,
                search_mode="all",
                include_total_count=True,
                select=["id", "title", "content", "category", "file_name", "chunk_id"]
            )
            
            search_results = []
            for result in results:
                search_results.append({
                    'id': result['id'],
                    'title': result['title'],
                    'content': result['content'],
                    'category': result['category'],
                    'file_name': result['file_name'],
                    'chunk_id': result['chunk_id'],
                    'score': result.get('@search.score', 0)
                })
            
            return search_results
            
        except Exception as e:
            print(f"❌ Error in hybrid search: {e}")
            return []
    
    def display_search_results(self, query: str, results: List[Dict], search_type: str = "Search"):
        """Display search results in a formatted way"""
        print(f"\n🔍 {search_type} Results for: '{query}'")
        print("=" * 80)
        
        if not results:
            print("No results found.")
            return
        
        for i, result in enumerate(results, 1):
            score = result.get('score', 0)
            reranker_score = result.get('reranker_score', 0)
            
            print(f"\n{i}. 📄 {result['title']}")
            print(f"   📂 Category: {result['category']}")
            print(f"   📊 Score: {score:.4f}", end="")
            if reranker_score > 0:
                print(f" | Reranker: {reranker_score:.4f}")
            else:
                print()
            print(f"   📝 Chunk {result['chunk_id'] + 1}")
            
            # Show preview of content
            preview = result['content'][:300]
            if len(result['content']) > 300:
                preview += "..."
            print(f"   💬 Preview: {preview}")
            print("-" * 80)

# Initialize search tester
if search_client:
    search_tester = SearchTester(search_client)
    print("✅ Search tester initialized")
else:
    print("❌ Cannot initialize search tester - missing search client")
    search_tester = None

✅ Search tester initialized


## 8. Test with Sample Insurance Queries

The next cell executes a comprehensive test suite using predefined insurance-related queries to validate the search index functionality, demonstrating semantic search capabilities across various insurance topics like collision coverage, liability limits, and policy requirements with detailed result analysis.



In [9]:
# Test the search index with sample queries
if search_tester:
    test_queries = [
        "What is covered under collision insurance?",
        "How much does comprehensive coverage cost?", 
        "What are the liability limits for commercial vehicles?",
        "Does my policy cover theft and vandalism?",
        "What happens if I hit an uninsured driver?",
        "High value vehicle insurance requirements",
        "Motorcycle insurance coverage options"
    ]
    
    print("🧪 Testing Azure AI Search with integrated vectorization...")
    print("=" * 80)
    
    for query in test_queries:
        print(f"\n\n🔍 Testing query: '{query}'")
        
        # Test semantic search
        results = search_tester.vector_search(query, top_k=3)
        
        if results:
            print(f"✅ Found {len(results)} relevant chunks")
            search_tester.display_search_results(query, results, "Semantic Search")
        else:
            print("❌ No relevant documents found")
        
        print("-" * 40)
    
    print("\n✅ Query testing completed!")
else:
    print("❌ Cannot test queries - search tester not available")

🧪 Testing Azure AI Search with integrated vectorization...


🔍 Testing query: 'What is covered under collision insurance?'
✅ Found 3 relevant chunks

🔍 Semantic Search Results for: 'What is covered under collision insurance?'

1. 📄 comprehensive_auto_policy.md - Part 1
   📂 Category: policies
   📊 Score: 1.2905 | Reranker: 2.7784
   📝 Chunk 1
   💬 Preview: # Comprehensive Auto Insurance Policy **Policy Type:** Comprehensive Auto Insurance **Coverage Category:** Full Coverage **Policy Code:** COMP-AUTO-001 ## Section 1: Coverage Overview This comprehensive auto insurance policy provides extensive protection for your vehicle and liability coverage for d...
--------------------------------------------------------------------------------

2. 📄 motorcycle_policy.md - Part 2
   📂 Category: policies
   📊 Score: 1.6658 | Reranker: 2.6203
   📝 Chunk 2
   💬 Preview: 50cc-250cc) - Medium displacement (251cc-600cc) - Large displacement (601cc-1000cc) - High-performance (1000cc+) - Electric motorcy

## 9. Interactive Search Interface

The next cell provides an interactive search function that creates a user-friendly command-line interface for real-time testing of the Azure AI Search index, allowing users to enter natural language queries, apply category filters, and compare semantic versus hybrid search results interactively.



In [11]:
def interactive_search():
    """Interactive search interface for testing"""
    if not search_tester:
        print("❌ Search tester not available")
        return
    
    print("\n🔍 Interactive Azure AI Search Interface")
    print("=" * 50)
    print("Enter your search queries (type 'quit' to exit)")
    print("Optional commands:")
    print("  - Add 'category:policies' or 'category:claims' to filter results")
    print("  - Use natural language queries for best semantic search results")
    print()
    
    while True:
        try:
            query = input("\n🔍 Search: ").strip()
            
            if query.lower() in ['quit', 'exit', 'q']:
                break
            
            if not query:
                continue
            
            # Check for category filter
            category_filter = None
            if 'category:' in query:
                parts = query.split('category:')
                query = parts[0].strip()
                category_filter = parts[1].strip()
            
            # Perform semantic search
            print(f"\n🧠 Performing semantic search...")
            results = search_tester.vector_search(query, top_k=5, category_filter=category_filter)
            
            # Display results
            search_tester.display_search_results(query, results, "Semantic Search")
            
            # Also try hybrid search for comparison
            print(f"\n🔄 Hybrid search results:")
            hybrid_results = search_tester.hybrid_search(query, top_k=3)
            if hybrid_results:
                for i, result in enumerate(hybrid_results[:2], 1):  # Show top 2
                    print(f"{i}. {result['title']} (Score: {result['score']:.4f})")
            
        except KeyboardInterrupt:
            break
        except Exception as e:
            print(f"❌ Error: {e}")
    
    print("\n👋 Search session ended")

# Note: Uncomment the line below to start interactive search
interactive_search()


🔍 Interactive Azure AI Search Interface
Enter your search queries (type 'quit' to exit)
Optional commands:
  - Add 'category:policies' or 'category:claims' to filter results
  - Use natural language queries for best semantic search results


🧠 Performing semantic search...

🔍 Semantic Search Results for: 'Broken wheels'
No results found.

🔄 Hybrid search results:

👋 Search session ended


## 10. Summary and Next Steps

The next cell generates a comprehensive summary of the entire Azure AI Search integration process, collecting index statistics, document counts, and search capabilities to provide a detailed final report of what was accomplished and confirm the system's readiness for AI agent integration.

In [12]:
# Generate final summary
def generate_summary():
    """Generate a comprehensive summary of the Azure AI Search integration"""
    
    # Get index statistics
    index_stats = {}
    doc_count = 0
    
    if search_client and index_manager:
        try:
            doc_count = SearchIndexUploader(search_client).get_document_count()
            index_stats = index_manager.get_index_stats()
        except:
            pass
    
    summary = {
        "integration_summary": {
            "completion_date": datetime.now().isoformat(),
            "search_service": Config.SEARCH_SERVICE_NAME,
            "search_index": Config.SEARCH_INDEX_NAME,
            "total_documents_processed": len(processed_documents.get('policies', []) + processed_documents.get('claims', [])) if 'processed_documents' in globals() and processed_documents else 0,
            "total_chunks_indexed": doc_count,
            "embedding_model": Config.AZURE_OPENAI_EMBEDDING_DEPLOYMENT,
            "vectorization_method": "Azure AI Search Integrated Vectorization",
            "chunk_configuration": {
                "chunk_size": Config.CHUNK_SIZE,
                "chunk_overlap": Config.CHUNK_OVERLAP
            }
        },
        "search_capabilities": {
            "semantic_search": True,
            "vector_search": True,
            "hybrid_search": True,
            "automatic_vectorization": True,
            "real_time_indexing": True
        },
        "index_statistics": index_stats,
        "ready_for_ai_agents": bool(doc_count > 0)
    }
    
    # Add category breakdown if available
    if 'search_documents' in globals() and search_documents:
        category_stats = {}
        for doc in search_documents:
            category = doc['category']
            if category not in category_stats:
                category_stats[category] = {
                    'chunks': 0,
                    'total_characters': 0,
                    'files': set()
                }
            
            category_stats[category]['chunks'] += 1
            category_stats[category]['total_characters'] += doc['chunk_length']
            category_stats[category]['files'].add(doc['file_name'])
        
        # Convert sets to lists and add unique file counts
        for category in category_stats:
            category_stats[category]['files'] = list(category_stats[category]['files'])
            category_stats[category]['unique_files'] = len(category_stats[category]['files'])
        
        summary['categories_processed'] = category_stats
    
    return summary

# Generate and display summary
final_summary = generate_summary()

print("📋 AZURE AI SEARCH INTEGRATION SUMMARY")
print("=" * 60)
print(f"📅 Completed: {final_summary['integration_summary']['completion_date']}")
print(f"🔍 Search Service: {final_summary['integration_summary']['search_service']}")
print(f"📇 Search Index: {final_summary['integration_summary']['search_index']}")
print(f"📄 Documents Processed: {final_summary['integration_summary']['total_documents_processed']}")
print(f"🗂️ Chunks Indexed: {final_summary['integration_summary']['total_chunks_indexed']}")
print(f"🤖 Embedding Model: {final_summary['integration_summary']['embedding_model']}")
print(f"⚡ Vectorization: {final_summary['integration_summary']['vectorization_method']}")

print("\n🚀 SEARCH CAPABILITIES:")
capabilities = final_summary['search_capabilities']
for capability, enabled in capabilities.items():
    status = "✅" if enabled else "❌"
    print(f"  {status} {capability.replace('_', ' ').title()}")

if 'categories_processed' in final_summary:
    print("\n📊 BY CATEGORY:")
    for category, stats in final_summary['categories_processed'].items():
        print(f"  • {category.title()}:")
        print(f"    - Files: {stats['unique_files']}")
        print(f"    - Chunks: {stats['chunks']}")
        print(f"    - Total Characters: {stats['total_characters']:,}")

📋 AZURE AI SEARCH INTEGRATION SUMMARY
📅 Completed: 2025-09-25T18:47:40.057649
🔍 Search Service: msagthack-search-m3qz57ik6ngog
📇 Search Index: insurance-documents-index
📄 Documents Processed: 10
🗂️ Chunks Indexed: 44
🤖 Embedding Model: text-embedding-ada-002
⚡ Vectorization: Azure AI Search Integrated Vectorization

🚀 SEARCH CAPABILITIES:
  ✅ Semantic Search
  ✅ Vector Search
  ✅ Hybrid Search
  ✅ Automatic Vectorization
  ✅ Real Time Indexing

📊 BY CATEGORY:
  • Policies:
    - Files: 5
    - Chunks: 44
    - Total Characters: 37,186
