# Extract fields with a custom fields analyzer

These code snippets demonstrate how to use a custom fields analyzer to extract document content, with the steps in a suitable order. They focus on the sequential process, so that you can quickly get an overview of how document content is extracted through a custom fields analyzer.

## Prerequisites
Link to environment creation

In [None]:
%pip install -r ../requirements.txt

## Analyzer template examples

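Below is a list of analyzer template examples that can be used to extract fields from different input file types. Each entry maps a sample name to a pair: the path of the analyzer template and the path of a sample input file.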

In [2]:
extraction_samples = {
    "sample_invoice": ('../analyzer_templates/sample_invoice_analyzer.json', '../data/invoice.pdf'),
    "sample_chart": ('../analyzer_templates/sample_chart_analyzer.json', '../data/pieChart.jpg'),
    "sample_call_transcript": ('../analyzer_templates/sample_call_transcript_analyzer.json', '../data/callCenterRecording.mp3'),
    "sample_marketing_video": ('../analyzer_templates/sample_marketing_video_analyzer.json', '../data/video.mp4')
}

Set `target_sample` to the sample analyzer that you want to try.

In [3]:
target_sample = "sample_invoice"
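
Because the selection is a plain dictionary lookup, a mistyped sample name only surfaces later as a `KeyError`. The following is a minimal sketch of an early check, not part of the original notebook. The helper name `resolve_sample` is hypothetical, and the mapping is shortened to two of the entries above.

```python
from pathlib import Path

# Same shape as the notebook's mapping: name -> (analyzer template path, sample input path).
extraction_samples = {
    "sample_invoice": ("../analyzer_templates/sample_invoice_analyzer.json", "../data/invoice.pdf"),
    "sample_chart": ("../analyzer_templates/sample_chart_analyzer.json", "../data/pieChart.jpg"),
}


def resolve_sample(samples, target):
    """Hypothetical helper: return (template_path, file_path) or raise a readable error."""
    if target not in samples:
        raise KeyError(f"Unknown sample {target!r}; choose one of {sorted(samples)}")
    template, input_file = samples[target]
    return Path(template), Path(input_file)


template_path, file_path = resolve_sample(extraction_samples, "sample_invoice")

# Report missing files instead of failing, since the relative paths
# depend on the directory the notebook is started from.
for path in (template_path, file_path):
    print(path, "exists" if path.exists() else "missing")
```

Running a check like this before creating the analyzer keeps path problems separate from service errors.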

## Create the Azure Content Understanding client
>The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that contains functions for interacting with the Content Understanding service. Until the Content Understanding SDK is released, it can be regarded as a lightweight SDK. Set the constants **AZURE_CU_ENDPOINT**, **AZURE_CU_API_VERSION**, and **AZURE_CU_API_KEY** to the values from your Azure AI Content Understanding service.

In [4]:
import logging
import json
import os
import sys
from dotenv import find_dotenv, load_dotenv

# import utility package from python samples root directory
py_samples_root_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(py_samples_root_dir)
from python.content_understanding_client import AzureContentUnderstandingClient

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

client = AzureContentUnderstandingClient(
    endpoint=os.getenv("AZURE_CU_ENDPOINT"),
    api_version=os.getenv("AZURE_CU_API_VERSION", "2024-12-01-preview"),
    subscription_key=os.getenv("AZURE_CU_API_KEY"),
    api_token=os.getenv("AZURE_CU_API_TOKEN"),
)
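
Since `os.getenv` returns `None` for unset variables, a missing setting is easy to overlook until the first request fails. The following is a minimal sketch of a pre-flight check, not part of the original notebook. The helper `missing_settings` is hypothetical, and treating either the API key or the API token as sufficient is an assumption about how the client authenticates.

```python
import os


def missing_settings(env=None):
    """Hypothetical helper: list the settings that are absent from `env`."""
    env = os.environ if env is None else env
    missing = []
    if not env.get("AZURE_CU_ENDPOINT"):
        missing.append("AZURE_CU_ENDPOINT")
    # Assumption: one credential (key or token) is enough for the client.
    if not (env.get("AZURE_CU_API_KEY") or env.get("AZURE_CU_API_TOKEN")):
        missing.append("AZURE_CU_API_KEY or AZURE_CU_API_TOKEN")
    return missing


problems = missing_settings()
if problems:
    print("Missing settings:", ", ".join(problems))
else:
    print("All required settings are present.")
```

`AZURE_CU_API_VERSION` is left out of the check because the cell above already falls back to `2024-12-01-preview`.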

## Create analyzer with defined schema
Before creating the custom fields analyzer, set the constant ANALYZER_ID to a business-related name. Here a random name is generated for demo purposes.

In [5]:
import uuid
ANALYZER_ID = "extraction-sample-" + str(uuid.uuid4())

response = client.begin_create_analyzer(ANALYZER_ID, analyzer_schema_path=extraction_samples[target_sample][0])
result = client.poll_result(response)

logging.info(json.dumps(result, indent=2))

INFO:python.content_understanding_client:Analyzer extraction-sample-f226ef3d-6e68-4f0e-827c-e4befb8e747c create request accepted.
INFO:python.content_understanding_client:Request f4ea55e8-5a17-44ff-bedb-badd860cb9db in progress ...
INFO:python.content_understanding_client:Request result is ready after 2.75 seconds.
INFO:root:{
  "id": "f4ea55e8-5a17-44ff-bedb-badd860cb9db",
  "status": "Succeeded",
  "result": {
    "analyzerId": "extraction-sample-f226ef3d-6e68-4f0e-827c-e4befb8e747c",
    "description": "Sample invoice analyzer",
    "createdAt": "2024-12-06T23:26:42Z",
    "lastModifiedAt": "2024-12-06T23:26:44Z",
    "config": {
      "returnDetails": false,
      "enableOcr": true,
      "enableLayout": true,
      "enableBarcode": false,
      "enableFormula": false
    },
    "fieldSchema": {
      "fields": {
        "VendorName": {
          "type": "string",
          "method": "extract",
          "description": "Vendor issuing the invoice"
        },
        "Items": {
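
The logged result is a nested dictionary, so the field schema can be summarized without reading the full JSON. The following is a minimal sketch, not part of the original notebook. The dictionary is a trimmed copy of the keys visible in the output above, the truncated `Items` field is left out, and the helper `summarize_fields` is hypothetical.

```python
# Trimmed copy of the create-analyzer result; only keys visible in the log are used.
create_result = {
    "status": "Succeeded",
    "result": {
        "description": "Sample invoice analyzer",
        "fieldSchema": {
            "fields": {
                "VendorName": {
                    "type": "string",
                    "method": "extract",
                    "description": "Vendor issuing the invoice",
                },
            }
        },
    },
}


def summarize_fields(result):
    """Hypothetical helper: map each field name to its (type, method) pair."""
    if result.get("status") != "Succeeded":
        raise RuntimeError(f"Analyzer creation did not succeed: {result.get('status')!r}")
    fields = result.get("result", {}).get("fieldSchema", {}).get("fields", {})
    return {name: (spec.get("type"), spec.get("method")) for name, spec in fields.items()}


field_summary = summarize_fields(create_result)
print(field_summary)
```

With the real `result` from `client.poll_result(response)`, the same helper would list every field defined in the chosen template.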
    

## Use the created analyzer to extract document content


After the analyzer is successfully created, we can use it to analyze our input files.

In [6]:
response = client.begin_analyze(ANALYZER_ID, file_location=extraction_samples[target_sample][1])
result = client.poll_result(response)

logging.info(json.dumps(result, indent=2))

INFO:python.content_understanding_client:Analyzing file ../data/invoice.pdf with analyzer: extraction-sample-f226ef3d-6e68-4f0e-827c-e4befb8e747c
INFO:python.content_understanding_client:Request 1b59bf8d-ae4e-48e0-8221-94ebbf978b8a in progress ...
INFO:python.content_understanding_client:Request 1b59bf8d-ae4e-48e0-8221-94ebbf978b8a in progress ...
INFO:python.content_understanding_client:Request result is ready after 5.41 seconds.
INFO:root:{
  "id": "1b59bf8d-ae4e-48e0-8221-94ebbf978b8a",
  "status": "Succeeded",
  "result": {
    "analyzerId": "extraction-sample-f226ef3d-6e68-4f0e-827c-e4befb8e747c",
    "apiVersion": "2024-12-01-preview",
    "createdAt": "2024-12-06T23:26:47Z",
    "contents": [
      {
        "markdown": "CONTOSO LTD.\n\n\n# INVOICE\n\nContoso Headquarters\n123 456th St\nNew York, NY, 10001\n\nINVOICE: INV-100\n\nINVOICE DATE: 11/15/2019\n\nDUE DATE: 12/15/2019\n\nCUSTOMER NAME: MICROSOFT CORPORATION\n\nSERVICE PERIOD: 10/14/2019 - 11/14/2019\n\nCUSTOMER ID: CID-
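
The analyze result nests the extracted text under `result.contents[].markdown`, as the output above shows. The following is a minimal sketch of reading that markdown, not part of the original notebook. The dictionary reproduces only the beginning of the logged markdown, the rest of the response is omitted because it is truncated in the output, and the helper `markdown_of` is hypothetical.

```python
# Trimmed copy of the analyze result; the markdown is cut to its first lines.
analyze_result = {
    "status": "Succeeded",
    "result": {
        "apiVersion": "2024-12-01-preview",
        "contents": [
            {
                "markdown": "CONTOSO LTD.\n\n\n# INVOICE\n\nContoso Headquarters\n"
                            "123 456th St\nNew York, NY, 10001\n\nINVOICE: INV-100\n"
            }
        ],
    },
}


def markdown_of(result, index=0):
    """Hypothetical helper: return the markdown of one content item, or '' if absent."""
    contents = result.get("result", {}).get("contents", [])
    if index >= len(contents):
        return ""
    return contents[index].get("markdown", "")


# Keep only non-empty lines for a quick look at what was recognized.
lines = [line for line in markdown_of(analyze_result).splitlines() if line.strip()]
for line in lines:
    print(line)
```

Returning an empty string for a missing item keeps the helper safe to call on partial or failed results.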

## Delete the existing analyzer from the Azure AI Content Understanding service
This step is optional; it only prevents the test analyzer from lingering in your service. In real usage scenarios, the custom fields analyzer can stay stored in your service so that subsequent business workflows can reuse it.



In [7]:
client.delete_analyzer(ANALYZER_ID)

INFO:python.content_understanding_client:Analyzer extraction-sample-f226ef3d-6e68-4f0e-827c-e4befb8e747c deleted.


<Response [204]>
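
When the whole flow runs as a script rather than cell by cell, a failure during analysis would skip the deletion step. The following is a minimal sketch of a `try`/`finally` wrapper that always cleans up, not part of the original notebook. `FakeClient` is a stand-in that only records calls, not the real `AzureContentUnderstandingClient`, and `run_with_cleanup` is a hypothetical helper.

```python
class FakeClient:
    """Stand-in for illustration only; it records deletions instead of calling the service."""

    def __init__(self):
        self.deleted = []

    def delete_analyzer(self, analyzer_id):
        self.deleted.append(analyzer_id)


def run_with_cleanup(client, analyzer_id, work):
    """Hypothetical helper: run `work()` and delete the analyzer even if it raises."""
    try:
        return work()
    finally:
        client.delete_analyzer(analyzer_id)


fake = FakeClient()
outcome = run_with_cleanup(fake, "extraction-sample-demo", lambda: "analyzed")
print(outcome, fake.deleted)
```

In the notebook, `work` would wrap the `begin_analyze` and `poll_result` calls, and `client` would be the real client created earlier.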