# 5. Document content extraction using Azure Content Understanding

<img src="https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/media/overview/content-understanding-framework-2025.png#lightbox">

Azure Content Understanding in Foundry Tools is a Foundry Tool that's available as part of the Microsoft Foundry Resource in the Azure portal. It uses generative AI to process and ingest content of many types (documents, images, videos, and audio) into a user-defined output format. Content Understanding offers a streamlined process to reason over large amounts of unstructured data, accelerating time-to-value by generating an output that can be integrated into automation and analytical workflows.

Content Understanding is now a Generally Available (GA) service with the release of the 2025-11-01 API version. It's now available in a broader range of regions.

### Core Documentation
1. **[What is Azure Content Understanding in Foundry Tools?](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/overview)** - Main overview page
2. **[FAQ - Frequently Asked Questions](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/faq)** - Common questions and answers
3. **[Choosing the Right Tool: Document Intelligence vs Content Understanding](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/choosing-right-ai-tool)** - Comparison guide
4. **[Models and Deployments](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/models-deployments)** - Supported models configuration
5. **[Pricing Explainer](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/pricing-explainer)** - Pricing details and optimization

### Modality-Specific Documentation
6. **[Document Processing Overview](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/document/overview)** - Field extraction and grounding
7. **[Video Solutions (Preview)](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/video/overview)** - Video analysis capabilities
8. **[Image Solutions (Preview)](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/image/overview)** - Image extraction and analysis
9. **[Face Solutions (Preview)](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/face/overview)** - Face detection and recognition

### Additional Resources
10. **[Transparency Note](https://learn.microsoft.com/en-us/azure/ai-foundry/responsible-ai/content-understanding/transparency-note)** - Responsible AI information
11. **[Code Samples on GitHub](https://github.com/Azure-Samples/azure-ai-content-understanding-python)** - Python implementation examples
12. **[Azure Content Understanding Pricing](https://azure.microsoft.com/pricing/details/content-understanding/)** - Official pricing page

## Document Content

The `prebuilt-documentSearch` analyzer transforms unstructured documents into structured, machine-readable data optimized for retrieval-augmented generation (RAG) and automated workflows. It extracts content and layout elements while preserving document structure and semantic relationships.

Key capabilities include:
1. **Content Analysis:** Extracts text (printed and handwritten), selection marks, barcodes (12+ types), mathematical formulas (LaTeX), hyperlinks, and annotations.
2. **Figure Analysis:** Generates descriptions for images/charts/diagrams, converts charts to Chart.js syntax, and diagrams to Mermaid.js syntax.
3. **Structure Analysis:** Identifies paragraphs with contextual roles (title, section heading, page header/footer), detects tables with complex layouts (merged cells, multi-page), and maps hierarchical sections.
4. **GitHub Flavored Markdown:** Outputs richly formatted markdown that preserves document structure for LLM comprehension and AI-powered analysis.
5. **Broad Format Support:** Processes PDFs, images, Office documents (Word, Excel, PowerPoint), text files (HTML, Markdown), structured files (XML, JSON, CSV), and email formats (EML, MSG).

For detailed information about document elements and markdown representation, see [Document elements](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/document/elements) and [Document markdown](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/document/markdown).

> **Note:** Figure analysis (descriptions and chart/diagram analysis) is only supported for PDF and image file formats.

In [1]:
import json
import os
import sys

from azure.identity import DefaultAzureCredential
from datetime import datetime
from dotenv import load_dotenv
from helper.content_understanding_client import AzureContentUnderstandingClient
from helper.document_processor import DocumentProcessor
from helper.sample_helper import save_json_to_file 
from IPython.display import FileLink
from PIL import Image

In [2]:
sys.version

'3.10.18 (main, Jun  5 2025, 13:14:17) [GCC 11.2.0]'

In [3]:
print(f"Today is {datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is 02-Dec-2025 13:27:12


## 1. Azure Content Understanding client

In [4]:
load_dotenv("azure.env")

AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
API_VERSION = "2025-11-01"  # Subject to change. Check the documentation
GPT_4_1_DEPLOYMENT = "gpt-4.1"  # Name of the model deployed in Microsoft Foundry
GPT_4_1_MINI_DEPLOYMENT = "gpt-4.1-mini"  # Name of the model deployed in Microsoft Foundry
TEXT_EMBEDDING_3_LARGE_DEPLOYMENT = "text-embedding-3-large"  # Name of the model deployed in Microsoft Foundry

In [5]:
def token_provider():
    """Provides fresh Azure Cognitive Services tokens."""
    try:
        credential = DefaultAzureCredential()
        token = credential.get_token(
            "https://cognitiveservices.azure.com/.default")
        return token.token
    except Exception as e:
        print(f"❌ Token acquisition failed: {e}")
        raise


try:
    if not AZURE_AI_ENDPOINT or not API_VERSION:
        raise ValueError("AZURE_AI_ENDPOINT and API_VERSION must be set")

    print("Initializing Azure Content Understanding Client...")
    client = AzureContentUnderstandingClient(
        endpoint=AZURE_AI_ENDPOINT,
        api_version=API_VERSION,
        token_provider=token_provider,
        x_ms_useragent="azure-ai-content-understanding-python-sample-ga")
    print("✅ Done")

except ValueError as e:
    print(f"❌ Configuration error: {e}")
    raise
except Exception as e:
    print(f"❌ Client creation failed: {e}")
    raise

Initializing Azure Content Understanding Client...
✅ Done
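
The `token_provider` function above creates a new `DefaultAzureCredential` and requests a token on every call. As a minimal sketch of one way to reuse a token until shortly before it expires, the wrapper below caches the last token. It assumes only that the wrapped callable returns an object with `token` and `expires_on` (Unix seconds) attributes, as `azure.core.credentials.AccessToken` does. The names `Token` and `make_cached_token_provider` and the 300-second margin are hypothetical choices for illustration, not part of the sample helper.

```python
import time
from typing import Callable, NamedTuple, Optional


class Token(NamedTuple):
    """Stand-in with the same two fields used from AccessToken (assumption)."""
    token: str
    expires_on: int  # Unix timestamp in seconds


def make_cached_token_provider(
    fetch: Callable[[], Token],
    refresh_margin_s: int = 300,
    now: Callable[[], float] = time.time,
) -> Callable[[], str]:
    """Wrap `fetch` so a token is reused until `refresh_margin_s` before expiry."""
    cached: Optional[Token] = None

    def provider() -> str:
        nonlocal cached
        # Fetch on first use, or when the cached token is about to expire.
        if cached is None or cached.expires_on - now() <= refresh_margin_s:
            cached = fetch()
        return cached.token

    return provider
```

With the notebook's setup, this could be wired up by creating one `DefaultAzureCredential()` and passing `lambda: credential.get_token("https://cognitiveservices.azure.com/.default")` as `fetch`.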


In [6]:
missing_deployments = []

if not GPT_4_1_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_DEPLOYMENT")
if not GPT_4_1_MINI_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_MINI_DEPLOYMENT")
if not TEXT_EMBEDDING_3_LARGE_DEPLOYMENT:
    missing_deployments.append("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

if missing_deployments:
    print(f"❌ Warning: Missing required model deployment configuration(s):")
    for deployment in missing_deployments:
        print(f"   - {deployment}")
    print(
        "\n   Prebuilt analyzers require GPT-4.1, GPT-4.1-mini, and text-embedding-3-large deployments."
    )
    print("   Please:")
    print("   1. Deploy all three models in Azure AI Foundry")
    print("   2. Add the following to notebooks/.env:")
    print("      GPT_4_1_DEPLOYMENT=<your-gpt-4.1-deployment-name>")
    print("      GPT_4_1_MINI_DEPLOYMENT=<your-gpt-4.1-mini-deployment-name>")
    print(
        "      TEXT_EMBEDDING_3_LARGE_DEPLOYMENT=<your-text-embedding-3-large-deployment-name>"
    )
    print("   3. Restart the kernel and run this cell again")

else:
    print(f"📋 Configuring default model deployments...")
    print(f"   GPT-4.1 deployment: {GPT_4_1_DEPLOYMENT}")
    print(f"   GPT-4.1-mini deployment: {GPT_4_1_MINI_DEPLOYMENT}")
    print(
        f"   text-embedding-3-large deployment: {TEXT_EMBEDDING_3_LARGE_DEPLOYMENT}"
    )
    try:
        result = client.update_defaults({
            "gpt-4.1":
            GPT_4_1_DEPLOYMENT,
            "gpt-4.1-mini":
            GPT_4_1_MINI_DEPLOYMENT,
            "text-embedding-3-large":
            TEXT_EMBEDDING_3_LARGE_DEPLOYMENT
        })
        print(f"\n✅ Default model deployments configured successfully")
        print(f"   Model mappings:")
        for model, deployment in result.get("modelDeployments", {}).items():
            print(f"     {model} → {deployment}")
    except Exception as e:
        print(f"❌ Failed to configure defaults: {e}")
        print(f"   This may happen if:")
        print(
            f"   - One or more deployment names don't exist in your Azure AI Foundry project"
        )
        print(f"   - You don't have permission to update defaults")
        raise

📋 Configuring default model deployments...
   GPT-4.1 deployment: gpt-4.1
   GPT-4.1-mini deployment: gpt-4.1-mini
   text-embedding-3-large deployment: text-embedding-3-large

✅ Default model deployments configured successfully
   Model mappings:
     gpt-4.1 → gpt-4.1
     gpt-4.1-mini → gpt-4.1-mini
     text-embedding-3-large → text-embedding-3-large


In [7]:
try:
    defaults = client.get_defaults()
    print(f"✅ Retrieved default settings")

    model_deployments = defaults.get("modelDeployments", {})

    if model_deployments:
        print(f"\n✅ Model Deployments:")
        for model_name, deployment_name in model_deployments.items():
            print(f"   {model_name}: {deployment_name}")
    else:
        print("❌ No model deployments configured")

except Exception as e:
    print(f"❌ Error retrieving defaults: {e}")
    print("This is expected if no defaults have been configured yet.")

✅ Retrieved default settings

✅ Model Deployments:
   gpt-4.1: gpt-4.1
   gpt-4.1-mini: gpt-4.1-mini
   text-embedding-3-large: text-embedding-3-large
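
The mapping returned under `modelDeployments` can be checked against the three models that, according to the message in the configuration cell, the prebuilt analyzers require. The sketch below is a plain-dictionary check. `find_missing_models` is a hypothetical helper name, and it assumes the `get_defaults()` response has the shape shown in the output above.

```python
REQUIRED_MODELS = ("gpt-4.1", "gpt-4.1-mini", "text-embedding-3-large")


def find_missing_models(defaults: dict, required=REQUIRED_MODELS) -> list:
    """Return the required model names that have no non-empty deployment mapped."""
    mapped = defaults.get("modelDeployments") or {}
    return [name for name in required if not mapped.get(name)]
```

An empty list means every required model has a deployment name; anything else lists the models that still need to be mapped.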


## 2. Document processing

In [8]:
DOCS_DIR = "documents"

In [9]:
document_file = os.path.join(DOCS_DIR, "invoice.pdf")

!ls $document_file -lh

-rwxrwxrwx 1 root root 148K Dec  2 13:20 documents/invoice.pdf


In [10]:
doc_link = FileLink(path=document_file)
doc_link

In [11]:
# Analyze document from local file
analyzer_id = 'prebuilt-documentSearch'

print(f"🔍 Analyzing {document_file} with {analyzer_id}...")
response = client.begin_analyze_binary(
    analyzer_id=analyzer_id,
    file_location=document_file,
)

result = client.poll_result(response)

print("\033[1;31;34m")
print("📄 Markdown Content:")
print("=" * 50)
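# Extract markdown from the first content element.
# Fall back to an empty dict so the checks below don't raise a NameError
# when the service returns no contents.
contents = result.get("result", {}).get("contents", [])
content = contents[0] if contents else {}
markdown = content.get("markdown", "")
print(markdown)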
print("=" * 50)

# Check if this is document content to access document-specific properties
if content.get("kind") == "document":
    document_content = content
    print(f"\n📚 Document Information:")
    print(f"Start page: {document_content.get('startPageNumber')}")
    print(f"End page: {document_content.get('endPageNumber')}")
    print(
        f"Total pages: {document_content.get('endPageNumber') - document_content.get('startPageNumber') + 1}"
    )

    # Check for pages
    pages = document_content.get("pages")
    if pages is not None:
        print(f"\n📄 Pages ({len(pages)}):")
        for i, page in enumerate(pages):
            unit = document_content.get("unit", "units")
            print(
                f"  Page {page.get('pageNumber')}: {page.get('width')} x {page.get('height')} {unit}"
            )

    # Check if there are tables in the document
    tables = document_content.get("tables")
    if tables is not None:
        print(f"\n📊 Tables ({len(tables)}):")
        table_counter = 1
        for table in tables:
            row_count = table.get("rowCount")
            col_count = table.get("columnCount")
            print(
                f"  Table {table_counter}: {row_count} rows x {col_count} columns"
            )
            table_counter += 1
else:
    print("\n📚 Document Information: Not available for this content type")

# Save the result
saved_json_path = save_json_to_file(
    result, filename_prefix="content_analyzers_analyze_binary")
print(
    f"\n📋 Full analysis result saved. Review the complete JSON at: {saved_json_path}"
)

🔍 Analyzing documents/invoice.pdf with prebuilt-documentSearch...

📄 Markdown Content:
CONTOSO LTD.


# INVOICE

Contoso Headquarters
123 456th St
New York, NY, 10001

INVOICE: INV-100

INVOICE DATE: 11/15/2019

DUE DATE: 12/15/2019

CUSTOMER NAME: MICROSOFT CORPORATION

SERVICE PERIOD: 10/14/2019 - 11/14/2019

CUSTOMER ID: CID-12345

Microsoft Corp
123 Other St,
Redmond WA, 98052

BILL TO:
Microsoft Finance
123 Bill St,
Redmond WA, 98052

SHIP TO:
Microsoft Delivery
123 Ship St,
Redmond WA, 98052

SERVICE ADDRESS:
Microsoft Services
123 Service St,
Redmond WA, 98052


<table>
<tr>
<th>SALESPERSON</th>
<th>P.O. NUMBER</th>
<th>REQUISITIONER</th>
<th>SHIPPED VIA</th>
<th>F.O.B. POINT</th>
<th>TERMS</th>
</tr>
<tr>
<td></td>
<td>PO-3333</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</table>


<table>
<tr>
<th>DATE</th>
<th>ITEM CODE</th>
<th>DESCRIPTION</th>
<th>QTY</th>
<th>UM</th>
<th>PRICE</th>
<th>TAX</th>
<th>AMOUNT</th>
</tr>
<tr>
<td>3/4/2021</td>
<td>A123</td
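
The markdown output above embeds tables as HTML `<table>` markup. As a minimal sketch of turning those tables into Python lists, the standard-library `html.parser` module can walk the `<tr>`, `<th>`, and `<td>` tags. `extract_tables` is a hypothetical helper, and it assumes simple, non-nested tables like the ones in this invoice; merged cells (`rowspan`/`colspan`) are not handled.

```python
from html.parser import HTMLParser


class _TableCollector(HTMLParser):
    """Collects each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text chunks of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(markdown: str) -> list:
    """Return all HTML tables found in `markdown` as nested lists of strings."""
    parser = _TableCollector()
    parser.feed(markdown)
    parser.close()
    return parser.tables
```

Calling `extract_tables(markdown)` on the string printed by the cell above would give one nested list per table, with the header row first.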

## 3. Analyzing Documents from URLs

You can also analyze documents directly from publicly accessible URLs without downloading them first. This is useful for processing documents hosted on web servers, cloud storage, or GitHub repositories.
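
Since the document is fetched from the URL rather than uploaded, the value has to be an absolute `http` or `https` address. A small pre-check, sketched below with the standard library, can catch malformed values before a request is submitted. `is_probable_public_url` is a hypothetical helper; it checks syntax only and cannot tell whether the URL is actually reachable without authentication.

```python
from urllib.parse import urlparse


def is_probable_public_url(url: str) -> bool:
    """True if `url` is an absolute http(s) URL with a host part (syntax check only)."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A local path such as `documents/invoice.pdf` fails this check and would go through `begin_analyze_binary` instead.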

In [12]:
document_url = 'https://github.com/Azure-Samples/azure-ai-content-understanding-python/raw/refs/heads/main/data/invoice.pdf'
analyzer_id = 'prebuilt-documentSearch'

print(f"🔍 Analyzing document from URL: {document_url}")
print(f"📊 Using analyzer: {analyzer_id}\n")

response = client.begin_analyze_url(
    analyzer_id=analyzer_id,
    url=document_url,
)

result = client.poll_result(response)

print("\033[1;31;34m")
print("📄 Markdown Content:")
print("=" * 50)

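# Extract markdown from the first content element.
# Fall back to an empty dict so the checks below don't raise a NameError
# when the service returns no contents.
contents = result.get("result", {}).get("contents", [])
content = contents[0] if contents else {}
markdown = content.get("markdown", "")
print(markdown)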
print("=" * 50)

# Check if this is document content to access document-specific properties
if content.get("kind") == "document":
    document_content = content
    print(f"\n📚 Document Information:")
    print(f"Start page: {document_content.get('startPageNumber')}")
    print(f"End page: {document_content.get('endPageNumber')}")
    print(
        f"Total pages: {document_content.get('endPageNumber') - document_content.get('startPageNumber') + 1}"
    )

    # Check for pages
    pages = document_content.get("pages")
    if pages is not None:
        print(f"\n📄 Pages ({len(pages)}):")
        for i, page in enumerate(pages):
            unit = document_content.get("unit", "units")
            print(
                f"  Page {page.get('pageNumber')}: {page.get('width')} x {page.get('height')} {unit}"
            )

    # Check if there are tables in the document
    tables = document_content.get("tables")
    if tables is not None:
        print(f"\n📊 Tables ({len(tables)}):")
        table_counter = 1
        for table in tables:
            row_count = table.get("rowCount")
            col_count = table.get("columnCount")
            print(
                f"  Table {table_counter}: {row_count} rows x {col_count} columns"
            )
            table_counter += 1
else:
    print("\n📚 Document Information: Not available for this content type")

# Save the result
saved_json_path = save_json_to_file(
    result, filename_prefix="content_analyzers_url_document")
print(
    f"\n📋 Full analysis result saved. Review the complete JSON at: {saved_json_path}"
)

🔍 Analyzing document from URL: https://github.com/Azure-Samples/azure-ai-content-understanding-python/raw/refs/heads/main/data/invoice.pdf
📊 Using analyzer: prebuilt-documentSearch

📄 Markdown Content:
CONTOSO LTD.


# INVOICE

Contoso Headquarters
123 456th St
New York, NY, 10001

INVOICE: INV-100

INVOICE DATE: 11/15/2019

DUE DATE: 12/15/2019

CUSTOMER NAME: MICROSOFT CORPORATION

SERVICE PERIOD: 10/14/2019 - 11/14/2019

CUSTOMER ID: CID-12345

Microsoft Corp
123 Other St,
Redmond WA, 98052

BILL TO:
Microsoft Finance
123 Bill St,
Redmond WA, 98052

SHIP TO:
Microsoft Delivery
123 Ship St,
Redmond WA, 98052

SERVICE ADDRESS:
Microsoft Services
123 Service St,
Redmond WA, 98052


<table>
<tr>
<th>SALESPERSON</th>
<th>P.O. NUMBER</th>
<th>REQUISITIONER</th>
<th>SHIPPED VIA</th>
<th>F.O.B. POINT</th>
<th>TERMS</th>
</tr>
<tr>
<td></td>
<td>PO-3333</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</table>


<table>
<tr>
<th>DATE</th>
<th>ITEM CODE</th>
<th>DESCRIPT

In [13]:
!ls $saved_json_path -lh

-rwxrwxrwx 1 root root 136K Dec  2 13:27 results/content_analyzers_url_document_20251202_132726.json
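
The saved JSON can be reloaded later to inspect a result without re-running the analysis. The sketch below summarizes the first content element using only the keys that the cells above read (`kind`, `startPageNumber`, `endPageNumber`, `markdown`, `tables`, `rowCount`, `columnCount`). `summarize_result` and `summarize_file` are hypothetical helpers, and the file variant assumes `save_json_to_file` writes the `result` dictionary unchanged.

```python
import json


def summarize_result(result: dict) -> dict:
    """Summarize the first content element of an analysis result dictionary."""
    contents = result.get("result", {}).get("contents", [])
    content = contents[0] if contents else {}
    start = content.get("startPageNumber")
    end = content.get("endPageNumber")
    # Page count is None when either page number is missing, instead of raising.
    page_count = end - start + 1 if start is not None and end is not None else None
    tables = content.get("tables") or []
    return {
        "kind": content.get("kind"),
        "page_count": page_count,
        "table_shapes": [(t.get("rowCount"), t.get("columnCount")) for t in tables],
        "markdown_chars": len(content.get("markdown", "")),
    }


def summarize_file(path: str) -> dict:
    """Load a saved result file and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_result(json.load(f))
```

For example, `summarize_file(saved_json_path)` would summarize the file listed above.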
