# 4. Classifier using Azure Content Understanding

<img src="https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/media/overview/content-understanding-framework-2025.png#lightbox">

Azure Content Understanding in Foundry Tools is a Foundry Tool available as part of the Microsoft Foundry resource in the Azure portal. It uses generative AI to process and ingest many types of content (documents, images, videos, and audio) into a user-defined output format. Content Understanding offers a streamlined way to reason over large amounts of unstructured data, shortening time-to-value by producing output that can be integrated into automation and analytical workflows.

Content Understanding became a Generally Available (GA) service with the release of the 2025-11-01 API version, and it is now available in a broader range of regions.

### Core Documentation
1. **[What is Azure Content Understanding in Foundry Tools?](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/overview)** - Main overview page
2. **[FAQ - Frequently Asked Questions](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/faq)** - Common questions and answers
3. **[Choosing the Right Tool: Document Intelligence vs Content Understanding](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/choosing-right-ai-tool)** - Comparison guide
4. **[Models and Deployments](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/models-deployments)** - Supported models configuration
5. **[Pricing Explainer](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/pricing-explainer)** - Pricing details and optimization

### Modality-Specific Documentation
6. **[Document Processing Overview](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/document/overview)** - Field extraction and grounding
7. **[Video Solutions (Preview)](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/video/overview)** - Video analysis capabilities
8. **[Image Solutions (Preview)](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/image/overview)** - Image extraction and analysis
9. **[Face Solutions (Preview)](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/face/overview)** - Face detection and recognition

### Additional Resources
10. **[Transparency Note](https://learn.microsoft.com/en-us/azure/ai-foundry/responsible-ai/content-understanding/transparency-note)** - Responsible AI information
11. **[Code Samples on GitHub](https://github.com/Azure-Samples/azure-ai-content-understanding-python)** - Python implementation examples
12. **[Azure Content Understanding Pricing](https://azure.microsoft.com/pricing/details/content-understanding/)** - Official pricing page

This notebook demonstrates how to use the Azure AI Content Understanding service to:
1. Create a classifier for document categorization
2. Create a custom analyzer to extract specific fields
3. Combine the classifier and analyzers to classify, optionally split, and analyze documents within a flexible processing pipeline

For more detailed information before getting started, please refer to the official documentation:
[Understanding Classifiers in Azure AI Services](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/classifier)

## Create a Basic Classifier
Classify a document from a local file using the `begin_analyze_binary` API.

High-level steps:
1. Create a custom classifier
2. Classify a document from a local file
3. Save the classification result to a file
4. Clean up the created classifier

In Azure AI Content Understanding, classification is integrated directly into the analyzer operation rather than requiring a separate API. To create a classifier, you define **`contentCategories`** within the analyzer's configuration, specifying up to 200 category names and descriptions that the service will use to categorize your input files. 

The **`enableSegment`** parameter controls how the classifier handles multi-document files: when set to `true`, it automatically splits and classifies different document types within a single file (useful for processing combined documents like a loan application package containing multiple forms), while setting it to `false` treats the entire file as a single document. 

For more detailed information about classification capabilities, best practices, and advanced scenarios, see the [Content Understanding classification documentation](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/classifier).
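
As a minimal sketch of the configuration described above, the helper below assembles an analyzer definition with `contentCategories` and `enableSegment`. The function name `build_classifier_template` and the `MAX_CATEGORIES` constant are illustrative and not part of any SDK; the dictionary keys mirror the ones used later in this notebook, and the 200-category limit is the one quoted in the text.

```python
from typing import Dict

MAX_CATEGORIES = 200  # limit quoted in the text above


def build_classifier_template(categories: Dict[str, str],
                              enable_segment: bool = True) -> dict:
    """Return an analyzer definition whose config carries contentCategories.

    `categories` maps a category name to its description.
    """
    if not categories:
        raise ValueError("At least one category is required")
    if len(categories) > MAX_CATEGORIES:
        raise ValueError(f"At most {MAX_CATEGORIES} categories are supported")
    return {
        "baseAnalyzerId": "prebuilt-document",
        "config": {
            "returnDetails": True,
            # True: split the file into segments and classify each one.
            # False: treat the whole file as a single document.
            "enableSegment": enable_segment,
            "contentCategories": {
                name: {"description": description}
                for name, description in categories.items()
            },
        },
        "models": {"completion": "gpt-4.1"},
    }


template = build_classifier_template(
    {
        "Invoice": "Billing documents requesting payment.",
        "Bank_Statement": "Periodic summaries of account activity.",
    },
    enable_segment=False,
)
```

Validating the category count locally is only a convenience; the service remains the authority on what it accepts.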

In [1]:
import json
import os
import sys

from azure.identity import DefaultAzureCredential
from datetime import datetime
from dotenv import load_dotenv
from helper.content_understanding_client import AzureContentUnderstandingClient
from helper.document_processor import DocumentProcessor
from helper.sample_helper import save_json_to_file 
from IPython.display import FileLink

In [2]:
sys.version

'3.10.18 (main, Jun  5 2025, 13:14:17) [GCC 11.2.0]'

In [3]:
print(f"Today is {datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is 02-Dec-2025 13:26:04


## 1. Azure Content Understanding client

In [4]:
load_dotenv("azure.env")

AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
API_VERSION = "2025-11-01"  # Subject to change. Check the documentation
GPT_4_1_DEPLOYMENT = "gpt-4.1"  # Name of the model deployed in Microsoft Foundry
GPT_4_1_MINI_DEPLOYMENT = "gpt-4.1-mini"  # Name of the model deployed in Microsoft Foundry
TEXT_EMBEDDING_3_LARGE_DEPLOYMENT = "text-embedding-3-large"  # Name of the model deployed in Microsoft Foundry

In [5]:
def token_provider():
    """Provides fresh Azure Cognitive Services tokens."""
    try:
        credential = DefaultAzureCredential()
        token = credential.get_token(
            "https://cognitiveservices.azure.com/.default")
        return token.token
    except Exception as e:
        print(f"❌ Token acquisition failed: {e}")
        raise


try:
    if not AZURE_AI_ENDPOINT or not API_VERSION:
        raise ValueError("AZURE_AI_ENDPOINT and API_VERSION must be set")

    print("Initializing Azure Content Understanding Client...")
    client = AzureContentUnderstandingClient(
        endpoint=AZURE_AI_ENDPOINT,
        api_version=API_VERSION,
        token_provider=token_provider,
        x_ms_useragent="azure-ai-content-understanding-python-sample-ga")
    print("✅ Done")

except ValueError as e:
    print(f"❌ Configuration error: {e}")
    raise
except Exception as e:
    print(f"❌ Client creation failed: {e}")
    raise

Initializing Azure Content Understanding Client...
✅ Done
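
The `token_provider` above builds a new `DefaultAzureCredential` on every call. One possible refinement, sketched here under assumptions, is to create the credential once and reuse a token until shortly before it expires. The wrapper name `make_cached_token_provider` and the `FakeCredential` stand-in are hypothetical; the sketch relies only on the credential exposing `get_token(scope)` and on the returned object having `token` and `expires_on` attributes, as the `AccessToken` returned by azure-identity credentials does.

```python
import time
from collections import namedtuple
from typing import Callable


def make_cached_token_provider(credential, scope: str,
                               refresh_margin_s: int = 300,
                               clock: Callable[[], float] = time.time):
    """Wrap credential.get_token so a token is reused until close to expiry."""
    cached = {"token": None, "expires_on": 0}

    def provider() -> str:
        expiring = clock() >= cached["expires_on"] - refresh_margin_s
        if cached["token"] is None or expiring:
            access_token = credential.get_token(scope)
            cached["token"] = access_token.token
            cached["expires_on"] = access_token.expires_on
        return cached["token"]

    return provider


# Stand-in credential so the sketch runs without Azure access (hypothetical).
FakeToken = namedtuple("FakeToken", ["token", "expires_on"])


class FakeCredential:
    def __init__(self):
        self.calls = 0

    def get_token(self, scope):
        self.calls += 1
        return FakeToken(f"token-{self.calls}", time.time() + 3600)


fake = FakeCredential()
provider = make_cached_token_provider(
    fake, "https://cognitiveservices.azure.com/.default")
first, second = provider(), provider()  # the second call reuses the cached token
```

In the notebook, `DefaultAzureCredential()` would take the place of `FakeCredential()`, and the returned `provider` could be passed as `token_provider`.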


In [6]:
missing_deployments = []

if not GPT_4_1_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_DEPLOYMENT")
if not GPT_4_1_MINI_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_MINI_DEPLOYMENT")
if not TEXT_EMBEDDING_3_LARGE_DEPLOYMENT:
    missing_deployments.append("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

if missing_deployments:
    print(f"❌ Warning: Missing required model deployment configuration(s):")
    for deployment in missing_deployments:
        print(f"   - {deployment}")
    print(
        "\n   Prebuilt analyzers require GPT-4.1, GPT-4.1-mini, and text-embedding-3-large deployments."
    )
    print("   Please:")
    print("   1. Deploy all three models in Azure AI Foundry")
    print("   2. Add the following to notebooks/.env:")
    print("      GPT_4_1_DEPLOYMENT=<your-gpt-4.1-deployment-name>")
    print("      GPT_4_1_MINI_DEPLOYMENT=<your-gpt-4.1-mini-deployment-name>")
    print(
        "      TEXT_EMBEDDING_3_LARGE_DEPLOYMENT=<your-text-embedding-3-large-deployment-name>"
    )
    print("   3. Restart the kernel and run this cell again")

else:
    print(f"üìã Configuring default model deployments...")
    print(f"   GPT-4.1 deployment: {GPT_4_1_DEPLOYMENT}")
    print(f"   GPT-4.1-mini deployment: {GPT_4_1_MINI_DEPLOYMENT}")
    print(
        f"   text-embedding-3-large deployment: {TEXT_EMBEDDING_3_LARGE_DEPLOYMENT}"
    )
    try:
        result = client.update_defaults({
            "gpt-4.1":
            GPT_4_1_DEPLOYMENT,
            "gpt-4.1-mini":
            GPT_4_1_MINI_DEPLOYMENT,
            "text-embedding-3-large":
            TEXT_EMBEDDING_3_LARGE_DEPLOYMENT
        })
        print(f"\n✅ Default model deployments configured successfully")
        print(f"   Model mappings:")
        for model, deployment in result.get("modelDeployments", {}).items():
            print(f"     {model} → {deployment}")
    except Exception as e:
        print(f"❌ Failed to configure defaults: {e}")
        print(f"   This may happen if:")
        print(
            f"   - One or more deployment names don't exist in your Azure AI Foundry project"
        )
        print(f"   - You don't have permission to update defaults")
        raise

📋 Configuring default model deployments...
   GPT-4.1 deployment: gpt-4.1
   GPT-4.1-mini deployment: gpt-4.1-mini
   text-embedding-3-large deployment: text-embedding-3-large

✅ Default model deployments configured successfully
   Model mappings:
     gpt-4.1 → gpt-4.1
     gpt-4.1-mini → gpt-4.1-mini
     text-embedding-3-large → text-embedding-3-large


In [7]:
try:
    defaults = client.get_defaults()
    print(f"✅ Retrieved default settings")

    model_deployments = defaults.get("modelDeployments", {})

    if model_deployments:
        print(f"\n✅ Model Deployments:")
        for model_name, deployment_name in model_deployments.items():
            print(f"   {model_name}: {deployment_name}")
    else:
        print("❌ No model deployments configured")

except Exception as e:
    print(f"❌  Error retrieving defaults: {e}")
    print("This is expected if no defaults have been configured yet.")

✅ Retrieved default settings

✅ Model Deployments:
   gpt-4.1: gpt-4.1
   gpt-4.1-mini: gpt-4.1-mini
   text-embedding-3-large: text-embedding-3-large


## 2. Generate a unique classifier ID

In [8]:
analyzer_id = f"sample_classifier_{datetime.today().strftime('%d%b%Y_%H%M%S')}"

# Define the classifier as a dictionary
content_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description":
    f"Custom classifier for URL classification demo: {analyzer_id}",
    "config": {
        "returnDetails": True,
        "enableSegment": True,
        "contentCategories": {
            "Loan application": {
                "description":
                "Documents submitted by individuals or businesses to request funding, typically including personal or business details, financial history, loan amount, purpose, and supporting documentation."
            },
            "Invoice": {
                "description":
                "Billing documents issued by sellers or service providers to request payment for goods or services, detailing items, prices, taxes, totals, and payment terms."
            },
            "Bank_Statement": {
                "description":
                "Official statements issued by banks that summarize account activity over a period, including deposits, withdrawals, fees, and balances."
            }
        }
    },
    "models": {
        "completion": "gpt-4.1"
    },
    "tags": {
        "demo_type": "url_classification"
    }
}

# Create a custom classifier
print(f"üîß Creating custom classifier '{analyzer_id}'...")

# Start the classifier creation operation
response = client.begin_create_analyzer(
    analyzer_id=analyzer_id,
    analyzer_template=content_analyzer,
)

# Wait for the classifier to be created
print(f"‚è≥ Waiting for classifier creation to complete...")
client.poll_result(response)
print(f"‚úÖ Done")

üîß Creating custom classifier 'sample_classifier_02Dec2025_132605'...
‚è≥ Waiting for classifier creation to complete...
‚úÖ Done


## 3. Classify Your Document

Now, use the classifier to categorize your document.

In [9]:
document_file = "documents/document.pdf"

!ls $document_file -lh

-rwxrwxrwx 1 root root 260K Nov 27 09:04 documents/document.pdf


In [10]:
doc_link = FileLink(path=document_file)
doc_link

In [11]:
print(f"üìÑ Reading document file: {document_file}")

# Begin binary classification operation
print(f"üîç Starting binary classification with classifier '{analyzer_id}'...")

analysis_response = client.begin_analyze_binary(
    analyzer_id=analyzer_id,
    file_location=document_file,
)

üìÑ Reading document file: documents/document.pdf
üîç Starting binary classification with classifier 'sample_classifier_02Dec2025_132605'...


In [12]:
print(f"⏳ Waiting for document analysis to complete...")
analysis_result = client.poll_result(analysis_response)
print(f"✅ Done")

⏳ Waiting for document analysis to complete...
✅ Done
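
The notebook relies on the helper's `poll_result` to wait for the long-running operation. Its implementation is not shown here, so the following is only a generic sketch of how such a wait loop is commonly written. The function `poll_until_done` and its parameters are hypothetical, and the `Failed` status value is an assumption; `Succeeded` is the only status confirmed by the JSON output later in this notebook.

```python
import time
from typing import Callable


def poll_until_done(get_status: Callable[[], dict],
                    timeout_s: float = 120.0,
                    interval_s: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call get_status() repeatedly until the operation finishes or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        payload = get_status()
        status = str(payload.get("status", "")).lower()
        if status == "succeeded":
            return payload
        if status == "failed":  # assumed terminal status
            raise RuntimeError(f"Operation failed: {payload}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Operation still '{status}' after {timeout_s}s")
        sleep(interval_s)


# Simulated status responses so the sketch runs without a service call.
responses = iter([
    {"status": "Running"},
    {"status": "Running"},
    {"status": "Succeeded", "result": {"contents": []}},
])
final = poll_until_done(lambda: next(responses), sleep=lambda _: None)
```

Injecting `sleep` keeps the loop testable without real delays.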


In [13]:
if analysis_result and "result" in analysis_result:
    result = analysis_result["result"]
    contents = result.get("contents", [])

    if contents:
        first_content = contents[0]
        segments = first_content.get("segments", [])
        print("\033[1;31;34m")
        if segments:
            print("üìä Classification Results:")
            for idx, segment in enumerate(segments, 1):
                print(f"\nSegment {idx}:")
                print(f"   Category: {segment.get('category', 'N/A')}")
                print(
                    f"   Start Page: {segment.get('startPageNumber', 'N/A')}")
                print(f"   End Page: {segment.get('endPageNumber', 'N/A')}")
                print(f"   Segment ID: {segment.get('segmentId', 'N/A')}")
    else:
        print("No contents available in analysis result")
else:
    print("No analysis result available")

📊 Classification Results:

Segment 1:
   Category: Invoice
   Start Page: 1
   End Page: 1
   Segment ID: segment1

Segment 2:
   Category: Bank_Statement
   Start Page: 2
   End Page: 3
   Segment ID: segment2

Segment 3:
   Category: Loan application
   Start Page: 4
   End Page: 4
   Segment ID: segment3
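
For downstream processing it can help to turn the printed segments into plain Python objects. The sketch below assumes the result shape used in the cell above (`result.contents[0].segments` with `category`, `startPageNumber`, `endPageNumber`, and `segmentId`); the `Segment` tuple and both function names are illustrative. The sample data mirrors the output shown above.

```python
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    segment_id: str
    category: str
    start_page: int
    end_page: int


def extract_segments(analysis_result: dict) -> List[Segment]:
    """Read the segments of the first content item, if any."""
    contents = analysis_result.get("result", {}).get("contents", [])
    if not contents:
        return []
    return [
        Segment(
            s.get("segmentId", ""),
            s.get("category", ""),
            s.get("startPageNumber", 0),
            s.get("endPageNumber", 0),
        )
        for s in contents[0].get("segments", [])
    ]


def pages_by_category(segments: List[Segment]) -> Dict[str, int]:
    """Count how many pages were assigned to each category."""
    totals: Dict[str, int] = {}
    for seg in segments:
        page_count = seg.end_page - seg.start_page + 1
        totals[seg.category] = totals.get(seg.category, 0) + page_count
    return totals


# Sample shaped like the result printed above.
sample_result = {"result": {"contents": [{"segments": [
    {"segmentId": "segment1", "category": "Invoice",
     "startPageNumber": 1, "endPageNumber": 1},
    {"segmentId": "segment2", "category": "Bank_Statement",
     "startPageNumber": 2, "endPageNumber": 3},
    {"segmentId": "segment3", "category": "Loan application",
     "startPageNumber": 4, "endPageNumber": 4},
]}]}}

segments = extract_segments(sample_result)
page_totals = pages_by_category(segments)
```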


## 4. Saving Classification Results
The classification result is saved to a JSON file for later analysis.

In [14]:
saved_file_path = save_json_to_file(analysis_result, filename_prefix="classification_get_result")

💾 Analysis result saved to: results/classification_get_result_20251202_132613.json


In [15]:
print(f"🗑️ Deleting classifier '{analyzer_id}'...")
client.delete_analyzer(analyzer_id=analyzer_id)
print(f"✅ Done")

🗑️ Deleting classifier 'sample_classifier_02Dec2025_132605'...
✅ Done
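
`save_json_to_file` comes from the sample's helper module and its source is not reproduced in this notebook. Judging only from the file name in the output above, a comparable helper might look like the sketch below. The function name `save_json_sketch`, its `output_dir` parameter, and the indentation setting are assumptions, not the helper's actual behavior.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def save_json_sketch(data: dict, filename_prefix: str,
                     output_dir: str = "results") -> Path:
    """Write data to <output_dir>/<prefix>_<YYYYmmdd_HHMMSS>.json and return the path."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = directory / f"{filename_prefix}_{timestamp}.json"
    with path.open("w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2, ensure_ascii=False)
    return path


# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_json_sketch({"status": "Succeeded"},
                             "classification_get_result", tmp)
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
```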


## 5. Create a Custom Analyzer (Advanced)

Create a custom analyzer to extract specific fields from documents.
This example extracts common fields from loan application documents and generates document excerpts.

In [16]:
# Generate a unique analyzer ID for loan applications
loan_analyzer_id = f"sample_loan_analyzer_{datetime.today().strftime('%d%b%Y_%H%M%S')}"

# Define custom analyzer as a dictionary
custom_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description":
    "Loan application analyzer - extracts key information from loan applications",
    "config": {
        "returnDetails": True,
        "enableLayout": True,
        "enableFormula": False,
        "estimateFieldSourceAndConfidence": True
    },
    "fieldSchema": {
        "fields": {
            "ApplicationDate": {
                "type": "date",
                "method": "generate",
                "description":
                "The date when the loan application was submitted."
            },
            "ApplicantName": {
                "type": "string",
                "method": "generate",
                "description": "Full name of the loan applicant or company."
            },
            "LoanAmountRequested": {
                "type": "number",
                "method": "generate",
                "description":
                "The total loan amount requested by the applicant."
            },
            "LoanPurpose": {
                "type": "string",
                "method": "generate",
                "description": "The stated purpose or reason for the loan."
            },
            "CreditScore": {
                "type": "number",
                "method": "generate",
                "description": "Credit score of the applicant, if available."
            },
            "Summary": {
                "type":
                "string",
                "method":
                "generate",
                "description":
                "A brief summary overview of the loan application details."
            }
        }
    },
    "models": {
        "completion": "gpt-4.1"
    },
    "tags": {
        "demo": "loan-application"
    }
}

# Create the custom analyzer
print(f"üîß Creating custom analyzer '{loan_analyzer_id}'...")
response = client.begin_create_analyzer(
    analyzer_id=loan_analyzer_id,
    analyzer_template=custom_analyzer,
)
client.poll_result(response)
print(f"‚úÖ Done")

üîß Creating custom analyzer 'sample_loan_analyzer_02Dec2025_132614'...
‚úÖ Done


## 6. Create an Enhanced Classifier with Custom Analyzer

Now create a new classifier that routes loan application documents to the custom analyzer created in the previous step. Invoices and bank statements are still classified, but no analyzer is attached to them in this example.
This combines document classification with field extraction in one operation.

In [17]:
# Generate a unique enhanced classifier ID
enhanced_classifier_id = f"sample_enhanced_classifier_{datetime.today().strftime('%d%b%Y_%H%M%S')}"

# Define enhanced classifier with custom analyzer for loan applications
enhanced_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description":
    f"Enhanced classifier with custom loan analyzer: {enhanced_classifier_id}",
    "config": {
        "returnDetails": True,
        "enableSegment": True,
        "contentCategories": {
            "Loan application": {
                "description":
                "Documents submitted by individuals or businesses to request funding, typically including personal or business details, financial history, loan amount, purpose, and supporting documentation.",
                "analyzerId": loan_analyzer_id  # Use the custom loan analyzer
            },
            "Invoice": {
                "description":
                "Billing documents issued by sellers or service providers to request payment for goods or services, detailing items, prices, taxes, totals, and payment terms."
            },
            "Bank_Statement": {
                "description":
                "Official statements issued by banks that summarize account activity over a period, including deposits, withdrawals, fees, and balances."
            }
        }
    },
    "models": {
        "completion": "gpt-4.1"
    },
    "tags": {
        "demo_type": "enhanced_classification"
    }
}

# Create the enhanced classifier
print(f"üîß Creating enhanced classifier '{enhanced_classifier_id}'...")
response = client.begin_create_analyzer(
    analyzer_id=enhanced_classifier_id,
    analyzer_template=enhanced_analyzer,
)

# Wait for the classifier to be created
print(f"‚è≥ Waiting for classifier creation to complete...")
client.poll_result(response)
print(f"‚úÖ Done")

üîß Creating enhanced classifier 'sample_enhanced_classifier_02Dec2025_132617'...
‚è≥ Waiting for classifier creation to complete...
‚úÖ Done
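
Before sending a document through the enhanced classifier, it can be useful to check which categories are routed to an analyzer and which are classified only. The sketch below inspects a template with the same shape as `enhanced_analyzer`; the function name `category_routing` and the analyzer ID `my_loan_analyzer` are placeholders for illustration.

```python
from typing import Dict, Optional


def category_routing(analyzer_template: dict) -> Dict[str, Optional[str]]:
    """Map each category name to its analyzerId, or None if it has none."""
    categories = analyzer_template.get("config", {}).get("contentCategories", {})
    return {name: spec.get("analyzerId") for name, spec in categories.items()}


# Reduced template with the same structure as the enhanced classifier.
example_template = {
    "config": {
        "enableSegment": True,
        "contentCategories": {
            "Loan application": {
                "description": "Requests for funding.",
                "analyzerId": "my_loan_analyzer",  # placeholder ID
            },
            "Invoice": {"description": "Billing documents."},
            "Bank_Statement": {"description": "Account activity summaries."},
        },
    }
}

routing = category_routing(example_template)
classify_only = [name for name, target in routing.items() if target is None]
```

Categories listed in `classify_only` would be expected to come back without extracted fields, which matches the output further below.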


## 7. Process Document with Enhanced Classifier

Process the document again using the enhanced classifier.
Loan applications will now have additional fields extracted by the custom analyzer.

In [18]:
document_file = "documents/document.pdf"

!ls $document_file -lh

-rwxrwxrwx 1 root root 260K Nov 27 09:04 documents/document.pdf


In [19]:
print(f"üìÑ Reading document file: {document_file}")

# Begin binary classification operation with enhanced classifier
print(
    f"üîç Starting binary classification with enhanced classifier '{enhanced_classifier_id}'..."
)
enhanced_analysis_response = client.begin_analyze_binary(
    analyzer_id=enhanced_classifier_id,
    file_location=document_file,
)

# Wait for classification completion
print(f"‚è≥ Waiting for classification to complete...")
enhanced_analysis_result = client.poll_result(enhanced_analysis_response)
print(f"‚úÖ Done")

üìÑ Reading document file: documents/document.pdf
üîç Starting binary classification with enhanced classifier 'sample_enhanced_classifier_02Dec2025_132617'...
‚è≥ Waiting for classification to complete...
‚úÖ Done


In [20]:
# Display enhanced classification results
if enhanced_analysis_result and "result" in enhanced_analysis_result:
    result = enhanced_analysis_result["result"]
    contents = result.get("contents", [])

    if contents:
        print("\033[1;31;34m")
        print("üìä Enhanced Classification Results with Field Extraction:")
        print("=" * 80)

        for idx, content_item in enumerate(contents, 1):
            print(f"\nüîñ Segment {idx}:")
            print(f"   Category: {content_item.get('category', 'N/A')}")
            print(
                f"   Pages: {content_item.get('startPageNumber', 'N/A')} - {content_item.get('endPageNumber', 'N/A')}"
            )

            # Display extracted fields if available
            fields = content_item.get("fields", {})
            if fields:
                print(f"\n   üìã Extracted Fields:")
                for field_name, field_value in fields.items():
                    field_type = field_value.get("type")
                    if field_type == "string":
                        print(
                            f"      ‚Ä¢ {field_name}: {field_value.get('valueString')}"
                        )
                    elif field_type == "number":
                        print(
                            f"      ‚Ä¢ {field_name}: {field_value.get('valueNumber')}"
                        )
                    elif field_type == "date":
                        print(
                            f"      ‚Ä¢ {field_name}: {field_value.get('valueDate')}"
                        )
            else:
                print(f"   (No custom fields extracted for this category)")

        print("\n" + "=" * 80)

        # Display document information for the first segment
        first_content = contents[0]
        if first_content.get("kind") == "document":
            print(f"\nüìö Document Information:")
            pages = first_content.get("pages")
            if pages:
                print(f"Total pages in document: {len(pages)}")
                unit = first_content.get("unit", "units")
                print(
                    f"Page dimensions: {pages[0].get('width')} x {pages[0].get('height')} {unit}"
                )
    else:
        print("No contents available in enhanced analysis result")
else:
    print("No enhanced analysis result available")

📊 Enhanced Classification Results with Field Extraction:

🔖 Segment 1:
   Category: N/A
   Pages: 1 - 4
   (No custom fields extracted for this category)

🔖 Segment 2:
   Category: Loan application
   Pages: 4 - 4

   📋 Extracted Fields:
      • ApplicationDate: 2025-07-14
      • ApplicantName: John Smith
      • LoanAmountRequested: 25000
      • LoanPurpose: Debt Consolidation
      • CreditScore: None
      • Summary: John Smith applied for a $25,000 loan from Contoso Bank on July 14, 2025 for debt consolidation. He is a Software Engineer at Contoso Technologies with a monthly income of $6,500 and has been employed for 5 years. The application includes personal, employment, and loan details, but does not specify a credit score.


📚 Document Information:
Total pages in document: 4
Page dimensions: 8.5 x 11 inch
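
The display cell above branches on each field type by hand. A more compact alternative, sketched here, derives the value key from the type name. This relies on the `value<Type>` naming that the notebook shows for `string`, `number`, and `date`; whether every other field type follows the same pattern is an assumption. The function name `read_field_value` is illustrative, and the sample fields mirror the output above.

```python
from typing import Any, Dict, Optional


def read_field_value(field: dict) -> Optional[Any]:
    """Return the typed value of a field, e.g. 'string' -> field['valueString']."""
    field_type = field.get("type", "")
    if not field_type:
        return None
    key = "value" + field_type[0].upper() + field_type[1:]
    return field.get(key)  # None when the service returned no value


# Fields shaped like the loan application output above.
sample_fields: Dict[str, dict] = {
    "ApplicationDate": {"type": "date", "valueDate": "2025-07-14"},
    "ApplicantName": {"type": "string", "valueString": "John Smith"},
    "LoanAmountRequested": {"type": "number", "valueNumber": 25000},
    "CreditScore": {"type": "number"},  # not present in the document
}

flat_fields = {name: read_field_value(f) for name, f in sample_fields.items()}
```

A missing value, such as `CreditScore` here, comes back as `None`, matching what the notebook prints.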


## 8. Saving Classification Results
The classification result is saved to a JSON file for later analysis.

In [21]:
saved_file_path = save_json_to_file(
    enhanced_analysis_result,
    filename_prefix="enhanced_classification_get_result")

print("\033[1;31;34m")
print(json.dumps(enhanced_analysis_result, indent=5))

💾 Analysis result saved to: results/enhanced_classification_get_result_20251202_132630.json
{
     "id": "74aeb72c-e4a6-4085-a0fc-0141ed8338ce",
     "status": "Succeeded",
     "result": {
          "analyzerId": "sample_enhanced_classifier_02Dec2025_132617",
          "apiVersion": "2025-11-01",
          "createdAt": "2025-12-02T13:26:19Z",
          "contents": [
               {
                    "path": "input1",
                    "markdown": "CONTOSO LTD.\n\n\n# INVOICE\n\nContoso Headquarters\n123 456th St\nNew York, NY, 10001\n\nINVOICE: INV-100\n\nINVOICE DATE: 11/15/2019\n\nDUE DATE: 12/15/2019\n\nCUSTOMER NAME: MICROSOFT CORPORATION\n\nSERVICE PERIOD: 10/14/2019 - 11/14/2019\n\nCUSTOMER ID: CID-12345\n\nMicrosoft Corp\n123 Other St,\nRedmond WA, 98052\n\nBILL TO:\nMicrosoft Finance\n123 Bill St,\nRedmond WA, 98052\n\nSHIP TO:\nMicrosoft Delivery\n123 Ship St,\nRedmond WA, 98052\n\nSERVICE ADDRESS:\nMicrosoft Services\n123 Service St,\nRedmond WA, 98052\n\n

## 9. Deleting the two custom analyzers

In [22]:
print(f"üóëÔ∏è Deleting analyzer '{loan_analyzer_id}'...")
client.delete_analyzer(analyzer_id=loan_analyzer_id)
print(f"‚úÖ Done")

üóëÔ∏è Deleting analyzer 'sample_loan_analyzer_02Dec2025_132614'...
‚úÖ Done


In [23]:
print(f"üóëÔ∏è Deleting classifier '{enhanced_classifier_id}'...")
client.delete_analyzer(analyzer_id=enhanced_classifier_id)
print(f"‚úÖ Done")

üóëÔ∏è Deleting classifier 'sample_enhanced_classifier_02Dec2025_132617'...
‚úÖ Done
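
If an earlier cell fails, the cleanup cells above never run and the analyzers stay in the resource. One way to guard against that, sketched below, is a context manager that always deletes the analyzers on exit. The name `temporary_analyzers` and the `RecordingClient` stand-in are hypothetical; the only call assumed on the real client is `delete_analyzer(analyzer_id=...)`, as used in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def temporary_analyzers(client, *analyzer_ids):
    """Delete the given analyzers on exit, even if the body raises."""
    try:
        yield analyzer_ids
    finally:
        # Reverse creation order, so a classifier that references another
        # analyzer is removed before the analyzer it points to.
        for analyzer_id in reversed(analyzer_ids):
            try:
                client.delete_analyzer(analyzer_id=analyzer_id)
            except Exception as exc:  # keep cleaning up the remaining IDs
                print(f"Could not delete '{analyzer_id}': {exc}")


# Stand-in client so the sketch runs without Azure access (hypothetical).
class RecordingClient:
    def __init__(self):
        self.deleted = []

    def delete_analyzer(self, analyzer_id):
        self.deleted.append(analyzer_id)


recorder = RecordingClient()
try:
    with temporary_analyzers(recorder, "loan_analyzer", "enhanced_classifier"):
        raise RuntimeError("simulated failure during analysis")
except RuntimeError:
    pass  # the analyzers were still deleted
```

Deleting in reverse order is a design choice in this sketch, not a documented service requirement.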
