# Multimodal Retrieval Augmented Generation with Content Understanding

# Overview

Azure AI Content Understanding extracts data from diverse content types while preserving semantic integrity and contextual relationships, which makes the extracted content well suited to Retrieval Augmented Generation (RAG) applications.

This sample demonstrates how to leverage Azure AI's Content Understanding capabilities to extract:

- OCR and layout information from documents
- Image description, summarization and classification
- Audio transcription with speaker diarization from audio files
- Shot detection, keyframe extraction, and audio transcription from videos

This notebook illustrates how to extract content from unstructured multimodal data and apply it to Retrieval Augmented Generation (RAG). The extracted output can be converted to vector embeddings and indexed in Azure AI Search. When a user submits a query, Azure AI Search retrieves the relevant chunks, and a chat model uses them to generate a context-aware response.


# Load environment variables

In [31]:

import os
from dotenv import load_dotenv
load_dotenv()

# Load and validate Azure AI Services configs
AZURE_AI_SERVICE_ENDPOINT = "https://ai-swethasundar6403ai189371267970.services.ai.azure.com/"
AZURE_AI_SERVICE_API_VERSION = os.getenv("AZURE_AI_SERVICE_API_VERSION") or "2025-05-01-preview"
AZURE_DOCUMENT_INTELLIGENCE_API_VERSION = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_API_VERSION") or "2024-11-30"

# Load and validate Azure OpenAI configs
AZURE_OPENAI_ENDPOINT = "https://ai-swethasundar6403ai189371267970.openai.azure.com/"
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = "gpt-4.1"
AZURE_OPENAI_CHAT_API_VERSION = os.getenv("AZURE_OPENAI_CHAT_API_VERSION") or "2024-08-01-preview"
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = "text-embedding-ada-002"
AZURE_OPENAI_EMBEDDING_API_VERSION = os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION") or "2023-05-15"

# Load and validate Azure Search Services configs
AZURE_SEARCH_ENDPOINT = "https://agentichack-search.search.windows.net/"
AZURE_SEARCH_INDEX_NAME = "kevin-cu"
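
The comments in the cell above mention validating the configuration, but the values are only loaded. A minimal sketch of a validation helper follows. The helper name `require_setting` and the `EXAMPLE_*` variable names are hypothetical and not part of the sample.

```python
import os


def require_setting(name, default=None, environ=os.environ):
    """Return a configuration value, or raise if it is missing or blank."""
    value = environ.get(name, default)
    if value is None or not str(value).strip():
        raise ValueError(f"Missing required setting: {name}")
    return str(value).strip()


# Demonstration with an in-memory mapping instead of the real environment.
# The EXAMPLE_* names are placeholders for illustration only.
sample_env = {"EXAMPLE_ENDPOINT": "https://example.invalid/"}
endpoint = require_setting("EXAMPLE_ENDPOINT", environ=sample_env)
api_version = require_setting(
    "EXAMPLE_API_VERSION", default="2025-05-01-preview", environ=sample_env
)
print(endpoint, api_version)
```

Failing early on a missing endpoint or deployment name tends to be easier to debug than a failed request several cells later.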

# Create custom analyzer

In [21]:
from langchain import hub
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch
from langchain_core.prompts import ChatPromptTemplate
from langchain.schema import Document
import requests
import json
import sys
import uuid
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Add the parent directory to the path to use shared modules
parent_dir = Path(Path.cwd()).parent
sys.path.append(str(parent_dir))


### Create analyzers with pre-defined schemas.
Feel free to start with the provided sample data as a reference and experiment with your own data to explore its capabilities.
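
For orientation, here is a rough sketch of what an analyzer template might contain, built as a Python dictionary and serialized to JSON. The overall layout (`baseAnalyzerId`, `config`, `fieldSchema`) is an assumption about the general shape of Content Understanding templates, and the `TotalRevenue` field is purely illustrative. It is not the content of `financial_metrics.json`, so it should be checked against the template files shipped with the sample and the API reference for your API version.

```python
import json

# Hypothetical analyzer template. The field name and descriptions are
# illustrative only and do not come from the sample's template files.
analyzer_template = {
    "description": "Illustrative financial metrics analyzer",
    "baseAnalyzerId": "prebuilt-documentAnalyzer",  # assumed base analyzer
    "config": {"returnDetails": True},
    "fieldSchema": {
        "fields": {
            "TotalRevenue": {
                "type": "string",
                "method": "extract",
                "description": "Total revenue reported for the fiscal year",
            }
        }
    },
}

# Serialize the template so it could be saved as a .json template file.
template_json = json.dumps(analyzer_template, indent=2)
print(template_json)
```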

In [25]:
from pathlib import Path
from python.content_understanding_client import AzureContentUnderstandingClient
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

# Set analyzer configs
analyzer_configs = [
    {
        "id": "financial-report-analyzer" + str(uuid.uuid4()),
        "template_path": "./analyzer_templates/financial_metrics.json",
        "location": Path("./data/tesco/TESCO - FY22.pdf"),
    }
]

# Create Content Understanding client
content_understanding_client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_SERVICE_ENDPOINT,
    api_version=AZURE_AI_SERVICE_API_VERSION,
    subscription_key=os.getenv("AZURE_AI_SERVICE_KEY"),  # Assumed variable name; read the key from the environment instead of hard-coding it
    token_provider=token_provider,
    x_ms_useragent="azure-ai-content-understanding-python/content_extraction", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.
)

# Iterate through each config and create an analyzer
for analyzer in analyzer_configs:
    analyzer_id = analyzer["id"]
    template_path = analyzer["template_path"]

    try:
        
        # Create the analyzer using the content understanding client
        response = content_understanding_client.begin_create_analyzer(
            analyzer_id=analyzer_id,
            analyzer_template_path=template_path
        )
        result = content_understanding_client.poll_result(response)
        print(f"Successfully created analyzer: {analyzer_id}")
        
    except Exception as e:
        print(f"Failed to create analyzer: {analyzer_id}")
        print(f"Error: {e}")

Failed to create analyzer: financial-report-analyzer6a97c623-a507-4b9f-b1d9-9c5410a55263
Error: 400 Client Error: Bad Request for url: https://ai-swethasundar6403ai189371267970.cognitiveservices.azure.com/contentunderstanding/analyzers/financial-report-analyzer6a97c623-a507-4b9f-b1d9-9c5410a55263?api-version=2025-05-01-preview
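
The 400 message above does not say why the request was rejected. Assuming the client raises `requests.HTTPError`, which carries the server response in its `response` attribute, a small helper can print the response body alongside the message. The helper name `describe_http_error` and the fake classes below are hypothetical and exist only so the sketch runs without a network call.

```python
def describe_http_error(exc):
    """Return the exception message plus the HTTP response body, if one is attached."""
    response = getattr(exc, "response", None)
    body = getattr(response, "text", "") if response is not None else ""
    return f"{exc}\n{body}".strip()


# Stand-ins for requests.Response and requests.HTTPError (illustration only).
class FakeResponse:
    text = '{"error": {"message": "illustrative error detail"}}'


class FakeHTTPError(Exception):
    def __init__(self, message, response=None):
        super().__init__(message)
        self.response = response


error = FakeHTTPError("400 Client Error: Bad Request", FakeResponse())
details = describe_http_error(error)
print(details)
```

In the creation loop, calling `print(describe_http_error(e))` inside the `except` block would surface the service's explanation when one is returned.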


### Use created analyzers to extract multimodal content

In [27]:

# Iterate through each analyzer config and analyze the content

analyzer_results = []
extracted_markdown = []
analyzer_content = []
for analyzer in analyzer_configs:
    # Use an existing analyzer by ID instead of the one configured above
    analyzer_id = "hackathon_financial"
    template_path = analyzer["template_path"]
    file_location = Path("./data/sample_finance.png")
    try:
        # Analyze content
        response = content_understanding_client.begin_analyze(analyzer_id, file_location)
        result = content_understanding_client.poll_result(response)
        analyzer_results.append({"id": analyzer_id, "result": result["result"]})
        analyzer_content.append({"id": analyzer_id, "content": result["result"]["contents"]})
    except Exception as e:
        print(e)
        print("Error analyzing content. Please double-check your analysis settings and confirm that the analyzer exists.")

print("Analyzer Results:")
for analyzer_result in analyzer_results:
    print(f"Analyzer ID: {analyzer_result['id']}")
    print(json.dumps(analyzer_result["result"], indent=2))            

# Delete the analyzer if it is no longer needed
# content_understanding_client.delete_analyzer(analyzer_id)

Analyzer Results:
Analyzer ID: hackathon_financial
{
  "analyzerId": "hackathon_financial",
  "apiVersion": "2025-05-01-preview",
  "createdAt": "2025-07-07T17:20:53Z",
  "contents": [
    {
      "markdown": "<!-- PageHeader=\"Strategic report\" -->\n<!-- PageHeader=\"Financial review\" -->\n\n\n# Group review of performance.\n\nGroup sales increased by +3.0% at constant rates, with growth\nacross all regions on top of exceptionally strong sales last year.\nRevenue increased by +6.4% at constant rates including fuel sales\ngrowth of +48.1% as customers travelled more following the easing\nof government restrictions. While two-year like-for-like\" fuel sales\ngrowth was negative at (6.4)%, this primarily reflects lower demand\nin the first half, with fuel sales ahead of pre-pandemic levels by the\nend of the year.\n\nGroup adjusted operating profit* grew by +58.9% at constant rates,\nreflecting the strong sales performance across the retail businesses,\na reduction in COVID-19 related 

In [28]:
analyzer_content

[{'id': 'hackathon_financial',
  'content': [{'markdown': '<!-- PageHeader="Strategic report" -->\n<!-- PageHeader="Financial review" -->\n\n\n# Group review of performance.\n\nGroup sales increased by +3.0% at constant rates, with growth\nacross all regions on top of exceptionally strong sales last year.\nRevenue increased by +6.4% at constant rates including fuel sales\ngrowth of +48.1% as customers travelled more following the easing\nof government restrictions. While two-year like-for-like" fuel sales\ngrowth was negative at (6.4)%, this primarily reflects lower demand\nin the first half, with fuel sales ahead of pre-pandemic levels by the\nend of the year.\n\nGroup adjusted operating profit* grew by +58.9% at constant rates,\nreflecting the strong sales performance across the retail businesses,\na reduction in COVID-19 related costs and a return to profitability\nin Tesco Bank. These benefits were partially offset by inflationary\npressures in the cost base, particularly in distri
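
The analysis cell declares `extracted_markdown` but never fills it. A minimal sketch of collecting the `markdown` strings from a structure shaped like `analyzer_content` follows. The helper name `collect_markdown` and the sample data are hypothetical stand-ins for the real output.

```python
def collect_markdown(analyzer_content):
    """Gather the non-empty 'markdown' strings from each analyzer's content items."""
    collected = []
    for entry in analyzer_content:
        for item in entry.get("content", []):
            markdown = item.get("markdown")
            if markdown:
                collected.append({"id": entry["id"], "markdown": markdown})
    return collected


# Small stand-in for the structure printed above (placeholder text only).
sample_content = [
    {
        "id": "hackathon_financial",
        "content": [
            {"markdown": "# Group review of performance.\n\nPlaceholder body text."},
            {"kind": "item-without-markdown"},
        ],
    }
]
sample_markdown = collect_markdown(sample_content)
print(len(sample_markdown), "markdown item(s) collected")
```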

# Organize multimodal data
This is a simple starting point. Feel free to give your own chunking strategies a try!

### Preprocess JSON output data

In [30]:
def convert_values_to_strings(json_obj):
    return [str(value) for value in json_obj]

#process all content and convert to string      
def process_allJSON_content(all_content):

    # Initialize empty list to store string of all content
    output = []

    document_splits = [
        "This is a json string representing a document with text and metadata for the file located in "+str(analyzer_configs[0]["location"])+" "
        + v 
        + "```"
        for v in convert_values_to_strings(all_content[0]["content"])
    ]
    docs = [Document(page_content=v) for v in document_splits]
    output += docs  
    
    return output

all_splits = process_allJSON_content(analyzer_content)

print("There are " + str(len(all_splits)) + " documents.") 
# Print the content of all doc splits
for doc in all_splits:
    print("doc content", doc.page_content)

There are 1 documents.
doc content This is a json string representing a document with text and metadata for the file located in data/tesco/TESCO - FY22.pdf {'markdown': '<!-- PageHeader="Strategic report" -->\n<!-- PageHeader="Financial review" -->\n\n\n# Group review of performance.\n\nGroup sales increased by +3.0% at constant rates, with growth\nacross all regions on top of exceptionally strong sales last year.\nRevenue increased by +6.4% at constant rates including fuel sales\ngrowth of +48.1% as customers travelled more following the easing\nof government restrictions. While two-year like-for-like" fuel sales\ngrowth was negative at (6.4)%, this primarily reflects lower demand\nin the first half, with fuel sales ahead of pre-pandemic levels by the\nend of the year.\n\nGroup adjusted operating profit* grew by +58.9% at constant rates,\nreflecting the strong sales performance across the retail businesses,\na reduction in COVID-19 related costs and a return to profitability\nin Tesco

##### *Optional* - Split document markdown into semantic chunks


In [None]:

# Configure langchain text splitting settings
EMBEDDING_CHUNK_SIZE = 512
EMBEDDING_CHUNK_OVERLAP = 20

# Split the document into chunks based on markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

docs_string = analyzer_content[0]['content'][0]['markdown']  # Markdown output of the first content item from the first analyzer
docs_splits = text_splitter.split_text(docs_string)

print("Length of splits: " + str(len(docs_splits)))
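
The cell above defines `EMBEDDING_CHUNK_SIZE` and `EMBEDDING_CHUNK_OVERLAP` but does not pass them to the header splitter, so a long section stays as one chunk. One way to apply them is a second, size-based pass over each section. The sketch below is a plain-Python stand-in that counts characters rather than tokens; the function name `split_with_overlap` is hypothetical, and a library splitter could be used instead.

```python
def split_with_overlap(text, chunk_size=512, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny example with small numbers so the overlap is easy to see.
example_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

Each chunk repeats the last character of the previous one here, which is the role the overlap setting plays at larger sizes.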

# Embed and index the chunks

In [32]:
# Embed the split documents and insert them into the Azure Search vector store
def embed_and_index_chunks(docs):
    aoai_embeddings = AzureOpenAIEmbeddings(
        azure_deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
        openai_api_version=AZURE_OPENAI_EMBEDDING_API_VERSION,  # e.g., "2023-12-01-preview"
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        azure_ad_token_provider=token_provider
    )

    vector_store: AzureSearch = AzureSearch(
        azure_search_endpoint=AZURE_SEARCH_ENDPOINT,
        azure_search_key=None,
        index_name=AZURE_SEARCH_INDEX_NAME,
        embedding_function=aoai_embeddings.embed_query
    )
    vector_store.add_documents(documents=docs)
    return vector_store


# embed and index the docs:
vector_store = embed_and_index_chunks(all_splits)

# Retrieve relevant chunks based on a question
#### Execute a pure vector similarity search

In [33]:
# Set your query
query = "what's the total revenue"

In [37]:
# Perform a similarity search
aoai_embeddings = AzureOpenAIEmbeddings(
    azure_deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
    openai_api_version=AZURE_OPENAI_EMBEDDING_API_VERSION,  # e.g., "2023-12-01-preview"
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_ad_token_provider=token_provider
)
vector_store1 = AzureSearch(
    azure_search_endpoint=AZURE_SEARCH_ENDPOINT,
    azure_search_key=None,
    index_name="tesco_report_agent",
    embedding_function=aoai_embeddings.embed_query
)
docs = vector_store1.similarity_search(
    query=query,
    k=3,
    search_type="similarity",
)
for doc in docs:
    print(doc.page_content)

KeyboardInterrupt: 
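
Conceptually, a pure vector search embeds the query and ranks stored chunks by how close their vectors are to it. The sketch below shows the idea with exact cosine similarity over toy three-dimensional vectors. It is an illustration only: the vectors and chunk labels are made up, and a search service typically uses approximate nearest-neighbor indexes and a configurable metric rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=3):
    """Return the texts of the k entries whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (chunk label, made-up embedding).
toy_index = [
    ("revenue chunk", [0.9, 0.1, 0.0]),
    ("store count chunk", [0.1, 0.9, 0.0]),
    ("dividend chunk", [0.6, 0.3, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(results)
```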

#### Execute hybrid search

Vector and nonvector text fields are queried in parallel, the results are merged, and the top matches of the unified result set are returned.

In [None]:
# Perform a hybrid search using the vector store's hybrid_search method
docs = vector_store.hybrid_search(query=query, k=3)
for doc in docs:
    print(doc.page_content)
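
Azure AI Search documents Reciprocal Rank Fusion (RRF) as the way it merges the parallel result lists of a hybrid query. The sketch below illustrates the idea on two made-up rankings. The function name, the chunk IDs, and the constant `k=60` are assumptions for illustration, and the service's actual scoring may differ in detail.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of IDs; each appearance adds 1 / (k + rank) to an ID's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists from the vector query and the keyword query.
vector_ranking = ["chunk-a", "chunk-b", "chunk-c"]
keyword_ranking = ["chunk-b", "chunk-d", "chunk-a"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Chunks that appear in both lists rise to the top, which is why hybrid search can surface results that neither query type ranks first on its own.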

## Q&A
We can combine an Azure OpenAI chat model with Azure AI Search to search for and chat about the results conversationally. (If you are using GitHub Codespaces, an input prompt appears near the top of the screen.)

In [36]:
# Setup rag chain
prompt_str = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:"""


def setup_rag_chain(vector_store):
    retriever = vector_store.as_retriever(search_type="similarity", k=3)

    prompt = ChatPromptTemplate.from_template(prompt_str)
    llm = AzureChatOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_CHAT_API_VERSION,
        azure_deployment=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
        azure_ad_token_provider=token_provider,
        temperature=0.7,
    )

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_chain


# Setup conversational search
def conversational_search(rag_chain, query):
    print(rag_chain.invoke(query))


rag_chain = setup_rag_chain(vector_store)
while True:
    query = input("Enter your query: ")
    if query=="":
        break
    conversational_search(rag_chain, query)

The total revenue for Tesco for FY 2021/22 was £61,344 million. This figure includes fuel sales and excludes VAT. The revenue for the previous year (FY 2020/21) was £57,887 million.
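
To make the chain above less opaque, the sketch below reproduces its prompt-assembly step in plain Python: retrieved chunks are joined with blank lines, as `format_docs` does, and substituted into the template together with the question. The helper name `build_prompt` and the chunk texts are hypothetical, and the template is shortened; no model is called.

```python
# Shortened version of the notebook's prompt template.
PROMPT_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def build_prompt(question, chunks):
    """Join retrieved chunks with blank lines and fill the prompt template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(question=question, context=context)


# Placeholder chunks standing in for retriever output.
retrieved_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt_text = build_prompt("what's the total revenue", retrieved_chunks)
print(prompt_text)
```

Printing the assembled prompt like this is a quick way to check what context the model actually receives when an answer looks wrong.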
