# Multimodal Retrieval Augmented Generation with Content Understanding

# Overview

Azure AI Content Understanding provides a powerful solution for extracting data from diverse content types while preserving semantic integrity and contextual relationships, which supports strong retrieval quality in Retrieval-Augmented Generation (RAG) applications.

This sample demonstrates how to leverage Azure AI's Content Understanding capabilities to extract:

- OCR and layout information from documents
- Audio transcription with speaker diarization from audio files
- Shot detection, keyframe extraction, and audio transcription from videos

This notebook illustrates how to extract content from unstructured multimodal data and apply it to Retrieval-Augmented Generation (RAG). The resulting markdown output can be used with LangChain's markdown header splitter for semantic chunking. These chunks are then embedded and indexed in Azure AI Search. When a user submits a query, Azure AI Search retrieves the relevant chunks, which are passed to a chat model to generate a context-aware response.


# Scenario

SecureHome Insurance, a leading property insurance company, faces a significant challenge following a recent natural disaster that has led to an influx of insurance claims. The data analyst at SecureHome Insurance is tasked with accurately validating ingested data from claims and invoices being processed through the system. These claims include various multimodal content types, such as policy plans (text documents), photos of property damage (images), footage of the disaster impact (videos), and recorded statements from insurance adjusters (audio files). The goal is to streamline the process and ensure analysts have all necessary information at their fingertips to maintain accuracy and compliance.

To address this challenge, SecureHome Insurance uses Azure AI Content Understanding to create a unified system that extracts and analyzes data from multimodal sources. The system processes text documents to extract key information such as policy details and invoice data, analyzes images to assess the extent of property damage, processes videos to understand the impact of the disaster, and transcribes and analyzes audio files to capture adjuster reports and statements. By preserving semantic integrity and contextual relationships, the system ensures that all relevant information is accurately mapped to defined schemas such as policy plans, invoices, and insurance adjuster reports.

In practice, when a data analyst receives a batch of insurance claims, they use the integrated platform to search for relevant information. The system employs a Retrieval-Augmented Generation (RAG) approach, where it first retrieves relevant data from text documents, images, videos, and audio files. This retrieved data is then used to generate comprehensive and contextually accurate responses to the analyst's queries.

By leveraging the RAG approach, the system ensures that the analyst has access to the most relevant and up-to-date information, enabling them to accurately validate and process the claims efficiently. This integration of Azure AI Content Understanding with RAG significantly enhances the claims processing system, leading to improved accuracy and efficiency in handling insurance claims.


# Pre-requisites
1. Follow the [README](../README.md#configure-azure-ai-service-resource) to create the essential resources used in this sample
2. Install required packages

In [None]:
%pip install -r ../requirements.txt
%pip install python-dotenv langchain langchain-community langchain-openai langchainhub openai tiktoken azure-identity azure-search-documents==11.6.0b3

# Load environment variables

In [None]:

import os
from dotenv import load_dotenv
load_dotenv()

# Load and validate Azure AI Services configs
AZURE_AI_SERVICE_ENDPOINT = os.getenv("AZURE_AI_SERVICE_ENDPOINT")
AZURE_AI_SERVICE_API_VERSION = os.getenv("AZURE_AI_SERVICE_API_VERSION") or "2024-12-01-preview"
AZURE_DOCUMENT_INTELLIGENCE_API_VERSION = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_API_VERSION") or "2024-11-30"

# Load and validate Azure OpenAI configs
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
AZURE_OPENAI_CHAT_API_VERSION = os.getenv("AZURE_OPENAI_CHAT_API_VERSION") or "2024-08-01-preview"
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
AZURE_OPENAI_EMBEDDING_API_VERSION = os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION") or "2023-05-15"

# Load and validate Azure Search Services configs
AZURE_SEARCH_ENDPOINT = os.getenv("AZURE_SEARCH_ENDPOINT")
AZURE_SEARCH_INDEX_NAME = os.getenv("AZURE_SEARCH_INDEX_NAME") or "sample-doc-index"
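
The cell above reads the settings but does not check that the required ones are present. A minimal sketch of such a check follows. The helper name `find_missing_settings` and the choice of which variables count as required are assumptions made for illustration, not part of the sample.

```python
import os

# Hypothetical list of settings without a fallback default in the cell above.
REQUIRED_SETTINGS = [
    "AZURE_AI_SERVICE_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
    "AZURE_SEARCH_ENDPOINT",
]


def find_missing_settings(names, env=None):
    """Return the names whose value is unset or blank in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]


# Report missing settings early instead of failing later inside a client call.
missing = find_missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings in .env: " + ", ".join(missing))
```

Running this right after `load_dotenv()` surfaces a misconfigured `.env` file before any Azure client is created.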

# Create custom analyzer

In [None]:
import json
import sys
import uuid
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Add the parent directory to the path to use shared modules
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))


### Create analyzers with predefined schemas

In [None]:
from pathlib import Path
from python.content_understanding_client import AzureContentUnderstandingClient
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

# Define one analyzer configuration per modality: document, image, audio, and video
analyzer_configs = [
    {
        "id": "doc-analyzer" + str(uuid.uuid4()),
        "template_path": "../analyzer_templates/content_document.json",
        "location": Path("../data/sample_layout.pdf"),
    },
    {
        "id": "image-analyzer" + str(uuid.uuid4()),
        "template_path": "../analyzer_templates/image_chart_diagram_understanding.json",
        "location": Path("../data/sample_report.pdf"),
    },
    {
        "id": "audio-analyzer" + str(uuid.uuid4()),
        "template_path": "../analyzer_templates/call_recording_analytics.json",
        "location": Path("../data/callCenterRecording.mp3"),
    },
    {
        "id": "video-analyzer" + str(uuid.uuid4()),
        "template_path": "../analyzer_templates/video_content_understanding.json",
        "location": Path("../data/FlightSimulator.mp4"),
    },
]
# Create Content Understanding client
content_understanding_client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_SERVICE_ENDPOINT,
    api_version=AZURE_AI_SERVICE_API_VERSION,
    token_provider=token_provider,
    x_ms_useragent="azure-ai-content-understanding-python/content_extraction", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.
)

# Iterate through each analyzer and create it using the content understanding client
for analyzer in analyzer_configs:
    analyzer_id = analyzer["id"]
    template_path = analyzer["template_path"]

    try:
        
        # Create the analyzer using the content understanding client
        response = content_understanding_client.begin_create_analyzer(
            analyzer_id=analyzer_id,
            analyzer_template_path=template_path
        )
        result = content_understanding_client.poll_result(response)
        print(f"Successfully created analyzer: {analyzer_id}")
        
    except Exception as e:
        print(f"Failed to create analyzer: {analyzer_id}")
        print(f"Error: {e}")

### Use created analyzers to extract multimodal content

In [None]:
# Iterate through each analyzer and analyze the sample file for its modality
analyzer_results = []
extracted_markdown = []
analyzer_content = []
for analyzer in analyzer_configs:
    analyzer_id = analyzer["id"]
    file_location = analyzer["location"]
    try:
        # Analyze content
        response = content_understanding_client.begin_analyze(analyzer_id, file_location)
        result = content_understanding_client.poll_result(response)
        contents = result["result"]["contents"]
        analyzer_results.append({"id": analyzer_id, "result": result["result"]})
        analyzer_content.append({"id": analyzer_id, "content": contents})

        # Extract markdown from the first item of the content list
        markdown = contents[0]["markdown"]
        extracted_markdown.append({"id": analyzer_id, "markdown": markdown})
        print("Markdown:", markdown)

    except Exception as e:
        print(e)
        print(f"Error analyzing content with analyzer: {analyzer_id}. Please double-check the analyzer and the input file.\nIf the analyzer was not created, rerun the previous cell before retrying.")

print("Analyzer Results:")
for analyzer_result in analyzer_results:
    print(f"Analyzer ID: {analyzer_result['id']}")
    print(json.dumps(analyzer_result["result"], indent=2))

# Delete the analyzers if they are no longer needed
# for analyzer in analyzer_configs:
#     content_understanding_client.delete_analyzer(analyzer["id"])

## Organize multimodal content extraction markdown data

### Split document content into semantic chunks

This is a simple starting point. Feel free to give your own chunking strategies a try!

In [None]:
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document
# Configure langchain text splitting settings
EMBEDDING_CHUNK_SIZE = 512
EMBEDDING_CHUNK_OVERLAP = 20

# Split the document into chunks based on markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

docs_string = extracted_markdown[0]["markdown"]  # the first item in extracted_markdown is the document analyzer's markdown output
docs_splits = text_splitter.split_text(docs_string)

print("Length of splits: " + str(len(docs_splits)))
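
The cell above defines `EMBEDDING_CHUNK_SIZE` and `EMBEDDING_CHUNK_OVERLAP` but splits only on headers, so a long section stays in one chunk. As one possible alternative strategy, the sketch below splits a string into fixed-size windows that overlap. The helper `split_with_overlap` is hypothetical, and measuring size in characters rather than tokens is an assumption.

```python
EMBEDDING_CHUNK_SIZE = 512
EMBEDDING_CHUNK_OVERLAP = 20


def split_with_overlap(text, chunk_size=EMBEDDING_CHUNK_SIZE, overlap=EMBEDDING_CHUNK_OVERLAP):
    """Split text into windows of at most chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=1))
```

A window like this could be applied to the `page_content` of any header split that is too long for the embedding model, keeping the header metadata on each resulting piece.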

### Preprocess output data

In [None]:
def convert_values_to_strings(json_obj):
    # Serialize each content item as a JSON string
    return [json.dumps(value) for value in json_obj]


# Wrap image, audio, and video content items in LangChain Documents
def process_image_audio_video_contents(image_contents, audio_contents, video_contents):
    output = []

    image_splits = [
        "This is a json string representing an image verbalization and OCR extraction ```"
        + v
        + "```"
        for v in convert_values_to_strings(image_contents)
    ]
    image = [Document(page_content=v) for v in image_splits]
    output += image

    audio_splits = [
        "This is a json string representing an audio segment with transcription ```"
        + v
        + "```"
        for v in convert_values_to_strings(audio_contents)
    ]
    audio = [Document(page_content=v) for v in audio_splits]
    output += audio

    video_splits = [
        "The following is a json string representing a video segment with scene description and transcript ```"
        + v
        + "```"
        for v in convert_values_to_strings(video_contents)
    ]
    video = [Document(page_content=v) for v in video_splits]
    output += video

    return output


# analyzer_content follows the order of analyzer_configs: document, image, audio, video.
# This assumes all four analyzers ran successfully in the extraction cell.
multimodal_docs = process_image_audio_video_contents(
    analyzer_content[1]["content"],
    analyzer_content[2]["content"],
    analyzer_content[3]["content"],
)

# Combine the markdown header splits of the document with the other modalities
docs_splits = docs_splits + multimodal_docs

print("There are " + str(len(docs_splits)) + " documents.")

for doc in docs_splits:
    print("doc content", doc.page_content)

# Embed and index the chunks

In [None]:
# Embed the split documents and insert them into the Azure Search vector store
def embed_and_index_chunks(docs):
    aoai_embeddings = AzureOpenAIEmbeddings(
        azure_deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
        openai_api_version=AZURE_OPENAI_EMBEDDING_API_VERSION,  # e.g., "2023-12-01-preview"
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        azure_ad_token_provider=token_provider
    )

    vector_store: AzureSearch = AzureSearch(
        azure_search_endpoint=AZURE_SEARCH_ENDPOINT,
        azure_search_key=None,
        index_name=AZURE_SEARCH_INDEX_NAME,
        embedding_function=aoai_embeddings.embed_query
    )
    vector_store.add_documents(documents=docs)
    return vector_store


# embed and index the docs:
vector_store = embed_and_index_chunks(docs_splits)

# Retrieve relevant chunks based on a question

## Retrieve relevant content
#### Execute a pure vector similarity search

In [None]:
# Set your query
query = "japan"

In [None]:
# Perform a similarity search
docs = vector_store.similarity_search(
    query=query,
    k=3,
    search_type="similarity",
)
for doc in docs:
    print(doc.page_content)

#### Execute hybrid search

Vector and nonvector text fields are queried in parallel, results are merged, and top matches of the unified result set are returned.

In [None]:
# Perform a hybrid search (vector + keyword) with the hybrid_search method
docs = vector_store.hybrid_search(query=query, k=3)
for doc in docs:
    print(doc.page_content)
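
To make the merging step of hybrid search concrete, the sketch below fuses two ranked lists with Reciprocal Rank Fusion (RRF), the method Azure AI Search documents for combining hybrid results. It illustrates the idea only and is not the service's implementation. The function name, the document ids, and the constant `k=60` (a commonly used value) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each id scores the sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ranking = ["a", "b", "c"]   # hypothetical ids ranked by vector similarity
keyword_ranking = ["b", "d", "a"]  # hypothetical ids ranked by keyword match
print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
```

Documents that rank well in both lists rise to the top, which is why hybrid search can surface chunks that either method alone would rank lower.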

## Q&A
We can use Azure OpenAI chat models together with Azure AI Search to conversationally search for and chat about the results. (If you are using GitHub Codespaces, there will be an input prompt near the top of the screen.)

In [None]:
# Setup rag chain
prompt_str = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:"""


def setup_rag_chain(vector_store):
    retriever = vector_store.as_retriever(search_type="similarity", k=3)

    prompt = ChatPromptTemplate.from_template(prompt_str)
    llm = AzureChatOpenAI(
        openai_api_version=AZURE_OPENAI_CHAT_API_VERSION,
        azure_deployment=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        azure_ad_token_provider=token_provider,
        temperature=0.7,
    )

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_chain


# Setup conversational search
def conversational_search(rag_chain, query):
    print(rag_chain.invoke(query))


rag_chain = setup_rag_chain(vector_store)
while True:
    query = input("Enter your query: ")
    if query=="":
        break
    conversational_search(rag_chain, query)
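
For readers who want to see what the chain assembles before the model is called, here is a plain-Python sketch of the context formatting and prompt filling steps. It mirrors `format_docs` and the prompt template above in simplified form. The shortened template text and the helper names are assumptions for illustration.

```python
# Shortened, hypothetical version of the prompt used in the chain above.
PROMPT_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def format_chunks(chunks):
    """Join retrieved chunk texts with blank lines, as format_docs does for Documents."""
    return "\n\n".join(chunks)


def build_prompt(question, chunks):
    """Fill the template with the question and the joined context."""
    return PROMPT_TEMPLATE.format(question=question, context=format_chunks(chunks))


print(build_prompt("What does the policy cover?", ["chunk one", "chunk two"]))
```

Printing the assembled prompt this way can help when debugging why an answer ignores or misreads a retrieved chunk.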