# 3. Custom Field extraction using Azure Content Understanding

<img src="https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/media/overview/content-understanding-framework-2025.png#lightbox">

Azure Content Understanding in Foundry Tools is a Foundry Tool that's available as part of the Microsoft Foundry Resource in the Azure portal. It uses generative AI to process and ingest content of many types (documents, images, videos, and audio) into a user-defined output format. Content Understanding offers a streamlined process to reason over large amounts of unstructured data, accelerating time-to-value by generating an output that can be integrated into automation and analytical workflows.

Content Understanding became a Generally Available (GA) service with the release of the 2025-11-01 API version, and it is available in a broader range of regions.

### Core Documentation
1. **[What is Azure Content Understanding in Foundry Tools?](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/overview)** - Main overview page
2. **[FAQ - Frequently Asked Questions](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/faq)** - Common questions and answers
3. **[Choosing the Right Tool: Document Intelligence vs Content Understanding](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/choosing-right-ai-tool)** - Comparison guide
4. **[Models and Deployments](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/models-deployments)** - Supported models configuration
5. **[Pricing Explainer](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/pricing-explainer)** - Pricing details and optimization

### Modality-Specific Documentation
6. **[Document Processing Overview](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/document/overview)** - Field extraction and grounding
7. **[Video Solutions (Preview)](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/video/overview)** - Video analysis capabilities
8. **[Image Solutions (Preview)](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/image/overview)** - Image extraction and analysis
9. **[Face Solutions (Preview)](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/face/overview)** - Face detection and recognition

### Additional Resources
10. **[Transparency Note](https://learn.microsoft.com/en-us/azure/ai-foundry/responsible-ai/content-understanding/transparency-note)** - Responsible AI information
11. **[Code Samples on GitHub](https://github.com/Azure-Samples/azure-ai-content-understanding-python)** - Python implementation examples
12. **[Azure Content Understanding Pricing](https://azure.microsoft.com/pricing/details/content-understanding/)** - Official pricing page

# Custom Analyzers

Now let's explore creating custom analyzers to extract specific fields tailored to your needs. Custom analyzers allow you to define exactly what information you want to extract and how it should be structured.


**Key Analyzer Configuration Components:**

- **`baseAnalyzerId`**: Specifies which prebuilt analyzer to inherit from. Available base analyzers:
  - **`prebuilt-document`** - For document-based custom analyzers (PDFs, images, Office docs)
  - **`prebuilt-audio`** - For audio-based custom analyzers
  - **`prebuilt-video`** - For video-based custom analyzers
  - **`prebuilt-image`** - For image-based custom analyzers

- **`fieldSchema`**: Defines the structured data to extract from content:
  - **`fields`**: Object defining each field to extract, with field names as keys
  - Each field definition includes:
    - **`type`**: Data type (`string`, `number`, `boolean`, `date`, `object`, `array`)
    - **`description`**: Clear explanation of the field - acts as a prompt to guide extraction accuracy
    - **`method`**: Extraction method to use:
      - **`"extract"`** - Extract values as they appear in content (literal text extraction). Requires `estimateSourceAndConfidence: true`. Only supported for document analyzers.
      - **`"generate"`** - Generate values using AI based on content understanding (best for complex fields)
      - **`"classify"`** - Classify values against predefined categories (use with `enum`)
    - **`enum`**: (Optional) Fixed list of possible values for classification
    - **`items`**: (For arrays) Defines structure of array elements
    - **`properties`**: (For objects) Defines nested field structure

- **`config`**: Processing options that control analysis behavior:
  - **`returnDetails`**: Include confidence scores, bounding boxes, metadata (default: false)
  - **`enableOcr`**: Extract text from images/scans (default: true, document only)
  - **`enableLayout`**: Extract layout info like paragraphs, structure (default: true, document only)
  - **`estimateFieldSourceAndConfidence`**: Return source locations and confidence for extracted fields (document only)
  - **`locales`**: Language codes for transcription (audio/video, e.g., `["en-US"]`)
  - **`contentCategories`**: Define categories for classification and segmentation
  - **`enableSegment`**: Split content into categorized chunks (document/video)

- **`models`**: Specifies which AI models to use:
  - **`completion`**: Model for extraction/generation tasks (e.g., `"gpt-4o"`, `"gpt-4o-mini"`)
  - **`embedding`**: Model for embedding tasks when using knowledge bases

For complete details, see the [Analyzer Reference Documentation](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/analyzer-reference).
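
As a minimal sketch of how these components fit together, the dictionary below combines the three extraction methods in one schema and runs a small sanity check on it. The field names (`InvoiceNumber`, `Summary`, `DocumentCategory`), the enum values, and the helper `check_classify_fields` are hypothetical and chosen only for illustration; the keys follow the component list above. The check is a local convenience, not something the service requires you to run.

```python
import json

# Illustrative analyzer definition; field names and enum values are hypothetical.
example_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description": "Illustrative schema mixing extract, generate, and classify",
    "config": {
        "returnDetails": True,
        "estimateFieldSourceAndConfidence": True,
    },
    "fieldSchema": {
        "name": "ExampleFields",
        "fields": {
            "InvoiceNumber": {
                "type": "string",
                "method": "extract",
                "description": "Invoice identifier exactly as printed on the document",
            },
            "Summary": {
                "type": "string",
                "method": "generate",
                "description": "One-sentence summary of what the invoice covers",
            },
            "DocumentCategory": {
                "type": "string",
                "method": "classify",
                "description": "Overall category of the document",
                "enum": ["invoice", "receipt", "other"],
            },
        },
    },
    "models": {"completion": "gpt-4.1"},
}


def check_classify_fields(analyzer: dict) -> list:
    """Return the names of 'classify' fields that lack a non-empty enum list."""
    fields = analyzer.get("fieldSchema", {}).get("fields", {})
    return [
        name
        for name, spec in fields.items()
        if spec.get("method") == "classify" and not spec.get("enum")
    ]


# Map each field to its extraction method, then list any incomplete classify fields.
methods = {
    name: spec["method"]
    for name, spec in example_analyzer["fieldSchema"]["fields"].items()
}
problems = check_classify_fields(example_analyzer)
print(json.dumps(methods, indent=2))
print("Classify fields missing enum:", problems)
```

Running a check like this before creating the analyzer can catch a forgotten `enum` early, without waiting for a round trip to the service.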

In [1]:
import json
import os
import sys

from azure.identity import DefaultAzureCredential
from datetime import datetime
from dotenv import load_dotenv
from helper.content_understanding_client import AzureContentUnderstandingClient
from helper.document_processor import DocumentProcessor
from helper.sample_helper import save_json_to_file 
from IPython.display import display, Markdown
from PIL import Image

In [2]:
sys.version

'3.10.18 (main, Jun  5 2025, 13:14:17) [GCC 11.2.0]'

In [3]:
print(f"Today is {datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is 02-Dec-2025 13:25:52


## 1. Azure Content Understanding client

In [4]:
load_dotenv("azure.env")

AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
API_VERSION = "2025-11-01"  # Subject to change. Check the documentation
GPT_4_1_DEPLOYMENT = "gpt-4.1"  # Name of the model deployed in Microsoft Foundry
GPT_4_1_MINI_DEPLOYMENT = "gpt-4.1-mini"  # Name of the model deployed in Microsoft Foundry
TEXT_EMBEDDING_3_LARGE_DEPLOYMENT = "text-embedding-3-large"  # Name of the model deployed in Microsoft Foundry

In [5]:
def token_provider():
    """Provides fresh Azure Cognitive Services tokens."""
    try:
        credential = DefaultAzureCredential()
        token = credential.get_token(
            "https://cognitiveservices.azure.com/.default")
        return token.token
    except Exception as e:
        print(f"❌ Token acquisition failed: {e}")
        raise


try:
    if not AZURE_AI_ENDPOINT or not API_VERSION:
        raise ValueError("AZURE_AI_ENDPOINT and API_VERSION must be set")

    print("Initializing Azure Content Understanding Client...")
    client = AzureContentUnderstandingClient(
        endpoint=AZURE_AI_ENDPOINT,
        api_version=API_VERSION,
        token_provider=token_provider,
        x_ms_useragent="azure-ai-content-understanding-python-sample-ga")
    print("✅ Done")

except ValueError as e:
    print(f"❌ Configuration error: {e}")
    raise
except Exception as e:
    print(f"❌ Client creation failed: {e}")
    raise

Initializing Azure Content Understanding Client...
✅ Done
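
The `token_provider` function above constructs a new `DefaultAzureCredential` on every call. As a hedged sketch of one alternative, the helper below wraps any token-fetching callable and reuses the token until it is close to expiry. The names `Token`, `make_cached_token_provider`, and `refresh_margin_seconds` are hypothetical, and the fake credential stands in for Azure so the example runs offline. This may be unnecessary if the credential you use already caches tokens internally.

```python
import time
from typing import Callable, NamedTuple, Optional


class Token(NamedTuple):
    """Stand-in for an access token: the token string and its expiry (epoch seconds)."""
    token: str
    expires_on: int


def make_cached_token_provider(
    get_token: Callable[[], Token],
    refresh_margin_seconds: int = 300,
    clock: Callable[[], float] = time.time,
) -> Callable[[], str]:
    """Wrap get_token so a new token is requested only when the cached one is near expiry."""
    cached: Optional[Token] = None

    def provider() -> str:
        nonlocal cached
        if cached is None or cached.expires_on - clock() < refresh_margin_seconds:
            cached = get_token()
        return cached.token

    return provider


# Offline demonstration with a fake credential (hypothetical, for illustration only).
calls = []


def fake_get_token() -> Token:
    calls.append(1)
    return Token(token=f"token-{len(calls)}", expires_on=int(time.time()) + 3600)


provider = make_cached_token_provider(fake_get_token)
first = provider()
second = provider()
print(first, second, len(calls))
```

With `azure.identity`, the callable could be `lambda: credential.get_token("https://cognitiveservices.azure.com/.default")` for a single shared `credential`, since the returned access token exposes `token` and `expires_on`.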


In [6]:
missing_deployments = []

if not GPT_4_1_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_DEPLOYMENT")
if not GPT_4_1_MINI_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_MINI_DEPLOYMENT")
if not TEXT_EMBEDDING_3_LARGE_DEPLOYMENT:
    missing_deployments.append("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

if missing_deployments:
    print("❌ Warning: Missing required model deployment configuration(s):")
    for deployment in missing_deployments:
        print(f"   - {deployment}")
    print(
        "\n   Prebuilt analyzers require GPT-4.1, GPT-4.1-mini, and text-embedding-3-large deployments."
    )
    print("   Please:")
    print("   1. Deploy all three models in Azure AI Foundry")
    print("   2. Set the deployment names in the configuration cell above:")
    print("      GPT_4_1_DEPLOYMENT=<your-gpt-4.1-deployment-name>")
    print("      GPT_4_1_MINI_DEPLOYMENT=<your-gpt-4.1-mini-deployment-name>")
    print(
        "      TEXT_EMBEDDING_3_LARGE_DEPLOYMENT=<your-text-embedding-3-large-deployment-name>"
    )
    print("   3. Restart the kernel and run this cell again")

else:
    print("📋 Configuring default model deployments...")
    print(f"   GPT-4.1 deployment: {GPT_4_1_DEPLOYMENT}")
    print(f"   GPT-4.1-mini deployment: {GPT_4_1_MINI_DEPLOYMENT}")
    print(
        f"   text-embedding-3-large deployment: {TEXT_EMBEDDING_3_LARGE_DEPLOYMENT}"
    )
    try:
        result = client.update_defaults({
            "gpt-4.1":
            GPT_4_1_DEPLOYMENT,
            "gpt-4.1-mini":
            GPT_4_1_MINI_DEPLOYMENT,
            "text-embedding-3-large":
            TEXT_EMBEDDING_3_LARGE_DEPLOYMENT
        })
        print("\n✅ Default model deployments configured successfully")
        print("   Model mappings:")
        for model, deployment in result.get("modelDeployments", {}).items():
            print(f"     {model} → {deployment}")
    except Exception as e:
        print(f"❌ Failed to configure defaults: {e}")
        print(f"   This may happen if:")
        print(
            f"   - One or more deployment names don't exist in your Azure AI Foundry project"
        )
        print(f"   - You don't have permission to update defaults")
        raise

📋 Configuring default model deployments...
   GPT-4.1 deployment: gpt-4.1
   GPT-4.1-mini deployment: gpt-4.1-mini
   text-embedding-3-large deployment: text-embedding-3-large

✅ Default model deployments configured successfully
   Model mappings:
     gpt-4.1 → gpt-4.1
     gpt-4.1-mini → gpt-4.1-mini
     text-embedding-3-large → text-embedding-3-large


In [7]:
try:
    defaults = client.get_defaults()
    print("✅ Retrieved default settings")

    model_deployments = defaults.get("modelDeployments", {})

    if model_deployments:
        print("\n✅ Model Deployments:")
        for model_name, deployment_name in model_deployments.items():
            print(f"   {model_name}: {deployment_name}")
    else:
        print("❌ No model deployments configured")

except Exception as e:
    print(f"❌ Error retrieving defaults: {e}")
    print("This is expected if no defaults have been configured yet.")

✅ Retrieved default settings

✅ Model Deployments:
   gpt-4.1: gpt-4.1
   gpt-4.1-mini: gpt-4.1-mini
   text-embedding-3-large: text-embedding-3-large


## 2. Document directory

In [8]:
DOCS_DIR = "documents"

## 3. Custom analyzer

Let's extract fields from an invoice PDF. This analyzer identifies essential invoice elements such as vendor information, amounts, dates, and line items.

In [9]:
analyzer_id = f"custom_analyzer_{datetime.today().strftime('%d%b%Y_%H%M%S')}"

custom_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description":
    "Sample invoice analyzer that extracts vendor information, line items, and totals from commercial invoices",
    "config": {
        "returnDetails": True,
        "enableOcr": True,
        "enableLayout": True,
        "estimateFieldSourceAndConfidence": True
    },
    "fieldSchema": {
        "name": "InvoiceFields",
        "fields": {
            "VendorName": {
                "type":
                "string",
                "method":
                "extract",
                "description":
                "Name of the vendor or supplier, typically found in the header section"
            },
            "Items": {
                "type": "array",
                "method": "generate",
                "description":
                "List of items or services on the invoice, typically in a table format",
                "items": {
                    "type": "object",
                    "properties": {
                        "Description": {
                            "type": "string",
                            "description": "Item or service description"
                        },
                        "Amount": {
                            "type": "number",
                            "description": "Line total amount for this item"
                        }
                    }
                }
            }
        }
    },
    "models": {
        "completion": "gpt-4.1"
    }
}
print(f"{json.dumps(analyzer_id, indent=2)}")
# Start the analyzer creation operation
response = client.begin_create_analyzer(
    analyzer_id=analyzer_id,
    analyzer_template=custom_analyzer,
)

"custom_analyzer_02Dec2025_132553"


In [10]:
print("⏳ Waiting for analyzer creation to complete...")
client.poll_result(response)
print("✅ Done")

⏳ Waiting for analyzer creation to complete...
✅ Done
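
Analyzer creation is a long-running operation, which is why the cell above waits on `client.poll_result`. The helper's internals are not shown in this notebook, so the following is only a generic sketch of the polling pattern, not the helper's implementation. The function `poll_until_done`, its parameters, and the `"Running"` and `"Failed"` status values are assumptions; only `"Succeeded"` appears in the outputs of this notebook. A fake status source is used so the example runs without a service call.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout_seconds: float = 120.0,
    interval_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until it reports success, failure, or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        payload = fetch_status()
        status = str(payload.get("status", "")).lower()
        if status == "succeeded":
            return payload
        if status == "failed":  # Assumed failure status name.
            raise RuntimeError(f"Operation failed: {payload}")
        if clock() >= deadline:
            raise TimeoutError(f"Operation still '{status}' after {timeout_seconds} seconds")
        sleep(interval_seconds)


# Offline demonstration: a fake operation that succeeds on the third status check.
responses = iter([
    {"status": "Running"},
    {"status": "Running"},
    {"status": "Succeeded", "result": {}},
])
attempts = []


def fake_fetch() -> Dict[str, Any]:
    attempts.append(1)
    return next(responses)


final = poll_until_done(fake_fetch, sleep=lambda _seconds: None)
print(final["status"], len(attempts))
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; in practice the defaults would be used.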


In [11]:
document_file = os.path.join(DOCS_DIR, "invoice.pdf")

print(f"🔍 Starting document analysis with analyzer '{analyzer_id}'...")

analysis_response = client.begin_analyze_binary(
    analyzer_id=analyzer_id,
    file_location=document_file,
)

print("⏳ Waiting for document analysis to complete...")
analysis_result = client.poll_result(analysis_response)
print("✅ Done")

🔍 Starting document analysis with analyzer 'custom_analyzer_02Dec2025_132553'...
⏳ Waiting for document analysis to complete...
✅ Done


In [12]:
print("\033[1;31;34m")
print(json.dumps(analysis_result, indent=5))

{
     "id": "2d04bbd0-454d-4520-a580-aba42cadbb96",
     "status": "Succeeded",
     "result": {
          "analyzerId": "custom_analyzer_02Dec2025_132553",
          "apiVersion": "2025-11-01",
          "createdAt": "2025-12-02T13:25:58Z",
          "contents": [
               {
                    "path": "input1",
                    "markdown": "CONTOSO LTD.\n\n\n# INVOICE\n\nContoso Headquarters\n123 456th St\nNew York, NY, 10001\n\nINVOICE: INV-100\n\nINVOICE DATE: 11/15/2019\n\nDUE DATE: 12/15/2019\n\nCUSTOMER NAME: MICROSOFT CORPORATION\n\nSERVICE PERIOD: 10/14/2019 - 11/14/2019\n\nCUSTOMER ID: CID-12345\n\nMicrosoft Corp\n123 Other St,\nRedmond WA, 98052\n\nBILL TO:\nMicrosoft Finance\n123 Bill St,\nRedmond WA, 98052\n\nSHIP TO:\nMicrosoft Delivery\n123 Ship St,\nRedmond WA, 98052\n\nSERVICE ADDRESS:\nMicrosoft Services\n123 Service St,\nRedmond WA, 98052\n\n\n<table>\n<tr>\n<th>SALESPERSON</th>\n<th>P.O. NUMBER</th>\n<th>REQUISITIONER</th>\n<th>SHIPPED VIA</th>\

In [13]:
if analysis_result and "result" in analysis_result:
    result = analysis_result["result"]
    contents = result.get("contents", [])

    if contents:
        first_content = contents[0]
        fields = first_content.get("fields", {})
        print("📊 Extracted Fields:")
        print("-" * 80)
        if fields:
            for field_name, field_value in fields.items():
                field_type = field_value.get("type")
                if field_type == "string":
                    print(f"{field_name}: {field_value.get('valueString')}")
                elif field_type == "number":
                    print(f"{field_name}: {field_value.get('valueNumber')}")
                elif field_type == "array":
                    print(
                        f"{field_name} (array with {len(field_value.get('valueArray', []))} items):"
                    )
                    for idx, item in enumerate(
                            field_value.get('valueArray', []), 1):
                        if item.get('type') == 'object':
                            print(f"  Item {idx}:")
                            for key, val in item.get('valueObject',
                                                     {}).items():
                                if val.get('type') == 'string':
                                    print(
                                        f"    {key}: {val.get('valueString')}")
                                elif val.get('type') == 'number':
                                    print(
                                        f"    {key}: {val.get('valueNumber')}")
                elif field_type == "object":
                    print(f"{field_name}: {field_value.get('valueObject')}")
                print()
        else:
            print("No fields extracted")
        print()

        # Display content metadata
        print("📋 Content Metadata:")
        print("-" * 80)
        print(f"Kind: {first_content.get('kind')}")
        print(
            f"Pages: {first_content.get('startPageNumber')} - {first_content.get('endPageNumber')}"
        )
        print(f"Unit: {first_content.get('unit')}")
        print()

    # Save full result to file
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f"results/invoice_analysis_result_{timestamp}.json"
    os.makedirs("results", exist_ok=True)

    with open(output_file, 'w') as f:
        json.dump(analysis_result, f, indent=2)

    print(f"💾 Full analysis result saved to: {output_file}")
else:
    print("No analysis result available")

📊 Extracted Fields:
--------------------------------------------------------------------------------
VendorName: CONTOSO LTD.

Items (array with 3 items):
  Item 1:
    Description: Consulting Services
    Amount: 60
  Item 2:
    Description: Document Fee
    Amount: 30
  Item 3:
    Description: Printing Fee
    Amount: 10


📋 Content Metadata:
--------------------------------------------------------------------------------
Kind: document
Pages: 1 - 1
Unit: inch

💾 Full analysis result saved to: results/invoice_analysis_result_20251202_132606.json
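
The cell above walks the `fields` dictionary with nested `if` branches for each type. As a compact alternative, the sketch below converts a field into plain Python values recursively. The function `field_to_python` is hypothetical, and the sample data is hand-written to mirror the output printed above rather than copied from the service response. The keys `valueString`, `valueNumber`, `valueArray`, and `valueObject` come from the code above; deriving the key for other scalar types as `value` plus the capitalized type name is an assumption.

```python
from typing import Any, Dict


def field_to_python(field: Dict[str, Any]) -> Any:
    """Recursively convert a typed field into plain Python values."""
    field_type = field.get("type")
    if field_type == "array":
        return [field_to_python(item) for item in field.get("valueArray", [])]
    if field_type == "object":
        return {
            key: field_to_python(value)
            for key, value in field.get("valueObject", {}).items()
        }
    if not field_type:
        return None
    # Assumption: scalar values live under "value" + capitalized type name,
    # as with "valueString" and "valueNumber".
    return field.get("value" + field_type[0].upper() + field_type[1:])


def make_item(description: str, amount: float) -> Dict[str, Any]:
    """Build one hand-written line item in the same shape the cell above reads."""
    return {
        "type": "object",
        "valueObject": {
            "Description": {"type": "string", "valueString": description},
            "Amount": {"type": "number", "valueNumber": amount},
        },
    }


# Hand-written sample mirroring the extracted fields printed above.
sample_fields = {
    "VendorName": {"type": "string", "valueString": "CONTOSO LTD."},
    "Items": {
        "type": "array",
        "valueArray": [
            make_item("Consulting Services", 60),
            make_item("Document Fee", 30),
            make_item("Printing Fee", 10),
        ],
    },
}

plain = {name: field_to_python(field) for name, field in sample_fields.items()}
total = sum(item["Amount"] for item in plain["Items"])
print(plain["VendorName"], total)
```

Once the fields are plain dictionaries and lists, downstream steps such as summing line amounts or loading the items into a table need no knowledge of the typed wrapper.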


In [14]:
with open(output_file, 'r') as file:
    data = json.load(file)

print("\033[1;31;34m")
print(json.dumps(data, indent=5))

{
     "id": "2d04bbd0-454d-4520-a580-aba42cadbb96",
     "status": "Succeeded",
     "result": {
          "analyzerId": "custom_analyzer_02Dec2025_132553",
          "apiVersion": "2025-11-01",
          "createdAt": "2025-12-02T13:25:58Z",
          "contents": [
               {
                    "path": "input1",
                    "markdown": "CONTOSO LTD.\n\n\n# INVOICE\n\nContoso Headquarters\n123 456th St\nNew York, NY, 10001\n\nINVOICE: INV-100\n\nINVOICE DATE: 11/15/2019\n\nDUE DATE: 12/15/2019\n\nCUSTOMER NAME: MICROSOFT CORPORATION\n\nSERVICE PERIOD: 10/14/2019 - 11/14/2019\n\nCUSTOMER ID: CID-12345\n\nMicrosoft Corp\n123 Other St,\nRedmond WA, 98052\n\nBILL TO:\nMicrosoft Finance\n123 Bill St,\nRedmond WA, 98052\n\nSHIP TO:\nMicrosoft Delivery\n123 Ship St,\nRedmond WA, 98052\n\nSERVICE ADDRESS:\nMicrosoft Services\n123 Service St,\nRedmond WA, 98052\n\n\n<table>\n<tr>\n<th>SALESPERSON</th>\n<th>P.O. NUMBER</th>\n<th>REQUISITIONER</th>\n<th>SHIPPED VIA</th>\

In [15]:
print(f"🗑️ Deleting analyzer '{analyzer_id}'...")
client.delete_analyzer(analyzer_id=analyzer_id)
print("✅ Done")

🗑️ Deleting analyzer 'custom_analyzer_02Dec2025_132553'...
✅ Done
