# Extract Custom Fields from Your File

This notebook demonstrates how to use analyzers to extract custom fields from your input files.

## Prerequisites
1. Ensure your Azure AI service resource is configured by following these [steps](../README.md#configure-azure-ai-service-resource).
2. Install the packages required to run the sample.

In [200]:
%pip install -r ../requirements.txt








## Analyzer Templates

Below is a collection of analyzer templates designed to extract fields from various input file types.

These templates are customizable: you can edit the field names, types, and descriptions to suit your needs. For additional verified templates from Microsoft, see the [analyzer templates README](../analyzer_templates/README.md).
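
As a rough illustration of what customizing a template can involve, the sketch below adds one field to an in-memory template dictionary. The `fieldSchema.fields` layout mirrors the analyzer JSON printed later in this notebook. The `add_field` helper and the `PurchaseOrder` field are hypothetical and are not part of the shipped templates.

```python
import copy
import json

# A trimmed-down template, shaped like the analyzer JSON shown later in this notebook.
base_template = {
    "description": "Sample invoice analyzer",
    "fieldSchema": {
        "fields": {
            "VendorName": {
                "type": "string",
                "method": "extract",
                "description": "Vendor issuing the invoice",
            }
        }
    },
}


def add_field(template, name, field_type, description, method="extract"):
    """Return a copy of `template` with one extra field (hypothetical helper)."""
    updated = copy.deepcopy(template)  # leave the caller's template untouched
    fields = updated.setdefault("fieldSchema", {}).setdefault("fields", {})
    if name in fields:
        raise ValueError(f"Field {name!r} already exists in the template")
    fields[name] = {"type": field_type, "method": method, "description": description}
    return updated


custom_template = add_field(
    base_template, "PurchaseOrder", "string", "Purchase order number, if present"
)
print(json.dumps(custom_template, indent=2))
```

Working on a copy keeps the original template available for comparison; to use the result with this notebook you would still need to write it to a JSON file and point `extraction_templates` at that path.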

In [201]:
# Each entry maps a template name to (analyzer template path, sample input file path).
extraction_templates = {
    "receipt": ('../analyzer_templates/receipt.json', '../data/receipt.png'),
    "invoice": ('../analyzer_templates/invoice.json', '../data/invoice.pdf'),
    # "driverguide" reuses the invoice template with a different sample file.
    "driverguide": ('../analyzer_templates/invoice.json', '../data/driverguide.pdf'),
    "chart": ('../analyzer_templates/image_chart.json', '../data/pieChart.jpg'),
    "call_recording": ('../analyzer_templates/call_recording_analytics.json', '../data/callCenterRecording.mp3'),
    "conversation_audio": ('../analyzer_templates/conversational_audio_analytics.json', '../data/callCenterRecording.mp3'),
    "marketing_video": ('../analyzer_templates/marketing_video.json', '../data/FlightSimulator.mp4'),
    "driving_video": ('../analyzer_templates/video_segment_description.json', '../data/redlight_front.mp4'),
}

Specify the analyzer templates you want to use and give each analyzer a unique name. This sample creates two analyzers: one from the `invoice` template and one from the `driving_video` template.

In [202]:
import uuid

#ANALYZER_TEMPLATE = "driverguide"
ANALYZER_TEMPLATE = "invoice"
ANALYZER_TEMPLATE2 = "driving_video"
ANALYZER_ID = "field-extraction-sample-" + str(uuid.uuid4())
ANALYZER_ID2 = "field-extraction-sample-" + str(uuid.uuid4())

(analyzer_template_path, analyzer_sample_file_path) = extraction_templates[ANALYZER_TEMPLATE]
(analyzer_template_path2, analyzer_sample_file_path2) = extraction_templates[ANALYZER_TEMPLATE2]
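
A mistyped template name would surface as a bare `KeyError` in the cell above, and a missing file would only be noticed later. The following is a minimal sketch of a lookup that reports both problems up front. The `resolve_template` helper and the `demo_templates` dictionary are hypothetical and are not part of the sample repository.

```python
from pathlib import Path


def resolve_template(name, templates, base_dir="."):
    """Look up a (template, sample) pair and list any paths that do not exist."""
    if name not in templates:
        known = ", ".join(sorted(templates))
        raise KeyError(f"Unknown template {name!r}; expected one of: {known}")
    template_path, sample_path = templates[name]
    # Paths are resolved relative to base_dir, as they would be from the notebook folder.
    missing = [
        path
        for path in (template_path, sample_path)
        if not (Path(base_dir) / path).exists()
    ]
    return template_path, sample_path, missing


demo_templates = {
    "invoice": ("../analyzer_templates/invoice.json", "../data/invoice.pdf"),
}
template_path, sample_path, missing = resolve_template("invoice", demo_templates)
print(template_path, sample_path, "missing:", missing)
```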

## Create Azure AI Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing functions to interact with the Content Understanding API. Before the official release of the Content Understanding SDK, it can be regarded as a lightweight SDK.


In [203]:
import logging
import json
import os
import sys
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2024-12-01-preview")

# Add the parent directory to the path to use shared modules
parent_dir = Path(Path.cwd()).parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=AZURE_AI_API_VERSION,
    token_provider=token_provider,
    x_ms_useragent="azure-ai-content-understanding-python/field_extraction", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.
)

INFO:azure.identity._credentials.environment:No environment configuration found.
INFO:azure.identity._credentials.managed_identity:ManagedIdentityCredential will use IMDS
INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'
Request method: 'GET'
Request headers:
    'User-Agent': 'azsdk-python-identity/1.19.0 Python/3.12.9 (Windows-11-10.0.26100-SP0)'
No body was attached to the request
INFO:azure.identity._credentials.chained:DefaultAzureCredential acquired a token from AzureCliCredential
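
The client cell reads `AZURE_AI_ENDPOINT` with `os.getenv`, which silently returns `None` when the variable is not set. A small guard, sketched below, can turn that into an early and readable error. The `require_env` helper is hypothetical; only the variable names and the default API version come from the cell above.

```python
import os


def require_env(name, default=None):
    """Return the environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name, default)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file"
        )
    return value


# The API version has a default in this notebook, so this call works either way.
api_version = require_env("AZURE_AI_API_VERSION", default="2024-12-01-preview")
print(api_version)
```

Calling `require_env("AZURE_AI_ENDPOINT")` before constructing the client would fail at that line if the endpoint is missing, rather than later during the first request.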


## Create Analyzer from the Template

In [204]:
response = client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=analyzer_template_path)
response2 = client.begin_create_analyzer(ANALYZER_ID2, analyzer_template_path=analyzer_template_path2)
result1 = client.poll_result(response)
result2 = client.poll_result(response2)

print(json.dumps(result1, indent=2))
print(json.dumps(result2, indent=2))

INFO:python.content_understanding_client:Analyzer field-extraction-sample-d2aca421-b74c-423e-bdaa-54dd07a96dce create request accepted.
INFO:python.content_understanding_client:Analyzer field-extraction-sample-8f8467ff-509d-4480-8d87-cf12306192c8 create request accepted.
INFO:python.content_understanding_client:Request 05a5e0df-9a53-4c8f-9df7-13344071af80 in progress ...
INFO:python.content_understanding_client:Request 05a5e0df-9a53-4c8f-9df7-13344071af80 in progress ...
INFO:python.content_understanding_client:Request 05a5e0df-9a53-4c8f-9df7-13344071af80 in progress ...
INFO:python.content_understanding_client:Request result is ready after 8.37 seconds.
INFO:python.content_understanding_client:Request result is ready after 0.00 seconds.


{
  "id": "05a5e0df-9a53-4c8f-9df7-13344071af80",
  "status": "Succeeded",
  "result": {
    "analyzerId": "field-extraction-sample-d2aca421-b74c-423e-bdaa-54dd07a96dce",
    "description": "Sample invoice analyzer",
    "createdAt": "2025-02-06T21:52:02Z",
    "lastModifiedAt": "2025-02-06T21:52:12Z",
    "config": {
      "returnDetails": false,
      "enableOcr": true,
      "enableLayout": true,
      "enableBarcode": false,
      "enableFormula": false
    },
    "fieldSchema": {
      "fields": {
        "VendorName": {
          "type": "string",
          "method": "extract",
          "description": "Vendor issuing the invoice"
        },
        "Items": {
          "type": "array",
          "method": "extract",
          "items": {
            "type": "object",
            "properties": {
              "Description": {
                "type": "string",
                "method": "extract",
                "description": "Description of the item"
              },
            

## Extract Fields Using the Analyzer

After the analyzer is successfully created, we can use it to analyze our input files.
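
Both analyzer creation and analysis follow the same pattern in this notebook: a `begin_*` call starts the operation and `poll_result` waits for it to finish. The sketch below shows the general polling idea with a timeout. It is not the implementation of `poll_result`; `poll_until_done`, its parameters, and the `"Running"` and `"Failed"` status strings are assumptions made for illustration.

```python
import time


def poll_until_done(get_status, timeout_seconds=120.0, interval_seconds=2.0, sleep=time.sleep):
    """Call `get_status()` until it reports a terminal state or the timeout elapses."""
    start = time.monotonic()
    while True:
        result = get_status()
        status = str(result.get("status", "")).lower()
        if status == "succeeded":
            return result
        if status == "failed":
            raise RuntimeError(f"Operation failed: {result}")
        if time.monotonic() - start > timeout_seconds:
            raise TimeoutError(
                f"Operation still {status!r} after {timeout_seconds} seconds"
            )
        sleep(interval_seconds)


# Stand-in for a real status request: two "Running" responses, then "Succeeded".
fake_responses = iter(
    [{"status": "Running"}, {"status": "Running"}, {"status": "Succeeded"}]
)
final = poll_until_done(lambda: next(fake_responses), interval_seconds=0.0)
print(final["status"])
```

Passing the status lookup in as a callable keeps the loop independent of any particular HTTP client, which also makes it easy to exercise without network access.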

In [205]:
response = client.begin_analyze(ANALYZER_ID, file_location=analyzer_sample_file_path)
response2 = client.begin_analyze(ANALYZER_ID2, file_location=analyzer_sample_file_path2)
result1 = client.poll_result(response)
result2 = client.poll_result(response2)

print(json.dumps(result1, indent=2))
#print(json.dumps(result2, indent=2))

# Save result1 to a JSON file  
with open('result_pdf.json', 'w') as json_file:  
    json.dump(result1, json_file, indent=2)  

# Save result2 to a JSON file  
with open('result_video.json', 'w') as json_file:  
    json.dump(result2, json_file, indent=2)  



INFO:python.content_understanding_client:Analyzing file ../data/invoice.pdf with analyzer: field-extraction-sample-d2aca421-b74c-423e-bdaa-54dd07a96dce
INFO:python.content_understanding_client:Analyzing file ../data/redlight_front.mp4 with analyzer: field-extraction-sample-8f8467ff-509d-4480-8d87-cf12306192c8
INFO:python.content_understanding_client:Request result is ready after 0.00 seconds.
INFO:python.content_understanding_client:Request 5659e0ff-9e48-4447-a315-c193507ece50 in progress ...
INFO:python.content_understanding_client:Request 5659e0ff-9e48-4447-a315-c193507ece50 in progress ...
INFO:python.content_understanding_client:Request 5659e0ff-9e48-4447-a315-c193507ece50 in progress ...
INFO:python.content_understanding_client:Request 5659e0ff-9e48-4447-a315-c193507ece50 in progress ...
INFO:python.content_understanding_client:Request 5659e0ff-9e48-4447-a315-c193507ece50 in progress ...
INFO:python.content_understanding_client:Request 5659e0ff-9e48-4447-a315-c193507ece50 in progr

{
  "id": "145737e6-dc9d-41f1-a9f7-3226ab8ffec0",
  "status": "Succeeded",
  "result": {
    "analyzerId": "field-extraction-sample-d2aca421-b74c-423e-bdaa-54dd07a96dce",
    "apiVersion": "2024-12-01-preview",
    "createdAt": "2025-02-06T21:52:21Z",
    "contents": [
      {
        "markdown": "CONTOSO LTD.\n\n\n# INVOICE\n\nContoso Headquarters\n123 456th St\nNew York, NY, 10001\n\nINVOICE: INV-100\n\nINVOICE DATE: 11/15/2019\n\nDUE DATE: 12/15/2019\n\nCUSTOMER NAME: MICROSOFT CORPORATION\n\nSERVICE PERIOD: 10/14/2019 - 11/14/2019\n\nCUSTOMER ID: CID-12345\n\nMicrosoft Corp\n123 Other St,\nRedmond WA, 98052\n\nBILL TO:\n\nMicrosoft Finance\n\n123 Bill St,\n\nRedmond WA, 98052\n\nSHIP TO:\n\nMicrosoft Delivery\n\n123 Ship St,\n\nRedmond WA, 98052\n\nSERVICE ADDRESS:\nMicrosoft Services\n123 Service St,\nRedmond WA, 98052\n\n\n<table>\n<tr>\n<th>SALESPERSON</th>\n<th>P.O. NUMBER</th>\n<th>REQUISITIONER</th>\n<th>SHIPPED VIA</th>\n<th>F.O.B. POINT</th>\n<th>TERMS</th>\n</tr>\n<tr>\n<t

In [206]:
# Result analysis
import json

# Load the saved invoice (PDF) result
with open('result_pdf.json', 'r') as json_file:
    pdf_results = json.load(json_file)

# Load the saved video result
with open('result_video.json', 'r') as json_file:
    video_results = json.load(json_file)

# The analyzed contents are nested under the top-level "result" key
pdf_contents = pdf_results.get('result', {}).get('contents', [])
video_contents = video_results.get('result', {}).get('contents', [])

# Collect every boolean field value found in the video result
boolean_values = []
for item in video_contents:
    for field in item.get('fields', {}).values():
        if field.get('type') == 'boolean':
            boolean_values.append(field.get('valueBoolean'))

# Analyze a single content entry from the PDF result
def analyze_pdf_result(pdf_entry, boolean_values):
    # Read the 'redlight' and 'stopsign' flags; a missing field counts as False
    fields = pdf_entry.get('fields', {})
    red_light = fields.get('redlight', {}).get('valueBoolean', False)
    stop_sign = fields.get('stopsign', {}).get('valueBoolean', False)

    # Compare the flags with the boolean values from result_video.json
    if red_light and (False in boolean_values):
        return 'Failed'
    elif stop_sign and (True in boolean_values):
        return 'Failed'
    else:
        return 'Pass'

# Analyze each content entry in the PDF result
analysis_results = []
for content in pdf_contents:
    analysis = analyze_pdf_result(content, boolean_values)
    analysis_results.append({
        'content': content,
        'analysis': analysis
    })

# Output the analysis results
for analysis in analysis_results:
    print(f"Content: {analysis['content']}, Analysis: {analysis['analysis']}")

# Optionally save the analysis results to a new JSON file
with open('analysis_results.json', 'w') as analysis_file:
    json.dump(analysis_results, analysis_file, indent=2)
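
The analysis cell reads `valueBoolean` directly, which works for boolean fields only. As a sketch of a more general approach, the helper below collects one value per field regardless of type. It assumes that each field stores its value under a key beginning with `value` (as `valueBoolean` does) and that the contents sit under the top-level `result` key, as in the output printed earlier. The `field_values` helper and the `sample` data are hypothetical, and the sample is hand-written rather than real service output.

```python
def field_values(result):
    """Collect {field name: value} for each entry under result["result"]["contents"]."""
    rows = []
    for content in result.get("result", {}).get("contents", []):
        row = {}
        for name, field in content.get("fields", {}).items():
            # Assumption: the value lives under a key such as "valueString" or "valueBoolean".
            value_keys = [key for key in field if key.startswith("value")]
            row[name] = field[value_keys[0]] if value_keys else None
        rows.append(row)
    return rows


# Hand-written sample in the same overall shape as the saved result files.
sample = {
    "status": "Succeeded",
    "result": {
        "contents": [
            {
                "fields": {
                    "VendorName": {"type": "string", "valueString": "CONTOSO LTD."},
                    "Items": {"type": "array"},
                }
            }
        ]
    },
}
print(field_values(sample))
```

Fields without any `value*` key are reported as `None`, which makes empty extractions easy to spot.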


## Clean Up
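Optionally, delete the sample analyzers from your resource. In typical usage, you would keep an analyzer and reuse it to analyze multiple files. The `delete_analyzer` call in this section removes only the analyzer identified by `ANALYZER_ID`; call `client.delete_analyzer(ANALYZER_ID2)` as well to remove the second one.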
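
When more than one analyzer was created, a small loop can remove them all and report any that could not be deleted. The sketch below assumes only that the client exposes `delete_analyzer(analyzer_id)`, as used in this notebook. The `delete_analyzers` helper and the `_StubClient` class are hypothetical; the stub exists so the example runs without Azure access.

```python
def delete_analyzers(client, analyzer_ids):
    """Try to delete each analyzer; return the IDs whose deletion raised an error."""
    failed = []
    for analyzer_id in analyzer_ids:
        try:
            client.delete_analyzer(analyzer_id)
        except Exception as error:  # keep going so one failure does not block the rest
            print(f"Could not delete {analyzer_id}: {error}")
            failed.append(analyzer_id)
    return failed


class _StubClient:
    """Stand-in for the real client so the sketch runs without Azure access."""

    def __init__(self):
        self.deleted = []

    def delete_analyzer(self, analyzer_id):
        if analyzer_id == "missing-analyzer":
            raise RuntimeError("analyzer not found")
        self.deleted.append(analyzer_id)


stub = _StubClient()
failed = delete_analyzers(stub, ["sample-a", "missing-analyzer", "sample-b"])
print(stub.deleted, failed)
```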

In [207]:
client.delete_analyzer(ANALYZER_ID)

INFO:python.content_understanding_client:Analyzer field-extraction-sample-d2aca421-b74c-423e-bdaa-54dd07a96dce deleted.


<Response [204]>