# Extract custom fields from your file

This notebook demonstrates how to use analyzers to extract custom fields from your input file.

## Prerequisites
1. Follow the steps in [README](../README.md#Configure-Azure-AI-Service-resource) to create a `.env` file that configures your Azure AI Service.
1. Install the packages needed to run the sample.

In [None]:
%pip install -r ../requirements.txt

## Analyzer template examples

Below is a collection of analyzer template examples designed to extract fields from various input file types.

You can modify these templates to suit your specific needs. For additional verified templates from Microsoft, see the [analyzer templates README](../analyzer_templates/README.md).
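As a rough illustration of customizing a template, the sketch below adds one field to a template that has already been loaded into a dictionary. It assumes the template files use the same `fieldSchema.fields` layout that appears in the service response later in this notebook. The helper `add_string_field`, the field name `PurchaseOrderNumber`, and the per-field `description` key are hypothetical, and the service may require further per-field keys, so compare against the verified templates before relying on it.

```python
import copy


def add_string_field(template: dict, name: str, description: str) -> dict:
    """Return a copy of `template` with one extra string field (hypothetical helper)."""
    updated = copy.deepcopy(template)  # leave the caller's template untouched
    fields = updated.setdefault("fieldSchema", {}).setdefault("fields", {})
    # Assumption: a field is described by a "type" and a free-text "description".
    fields[name] = {"type": "string", "description": description}
    return updated


# A trimmed-down template for demonstration only, not the real invoice.json.
base_template = {
    "description": "Sample invoice analyzer",
    "fieldSchema": {"fields": {"VendorName": {"type": "string"}}},
}
custom_template = add_string_field(
    base_template, "PurchaseOrderNumber", "PO number printed on the invoice"
)
print(sorted(custom_template["fieldSchema"]["fields"]))
```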

In [13]:
extraction_samples = {
    "sample_invoice": ('../analyzer_templates/invoice.json', '../data/invoice.pdf'),
    "sample_chart": ('../analyzer_templates/image_chart.json', '../data/pieChart.jpg'),
    "sample_call_transcript": ('../analyzer_templates/call_transcript.json', '../data/callCenterRecording.mp3'),
    "sample_marketing_video": ('../analyzer_templates/marketing_video.json', '../data/video.mp4')
}

Set `target_sample` to the key of the sample analyzer that you want to try.

In [14]:
target_sample = "sample_invoice"
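Because each sample is a (template path, data file path) pair, it can help to check both paths before calling the service. The following is a minimal sketch; `resolve_sample` is a hypothetical helper and not part of the sample repository.

```python
import os


def resolve_sample(samples: dict, target: str) -> tuple:
    """Return (template_path, data_path) for `target`, failing early on bad input."""
    if target not in samples:
        raise KeyError(f"Unknown sample {target!r}; choose one of {sorted(samples)}")
    template_path, data_path = samples[target]
    # Collect every path that does not point at an existing file.
    missing = [p for p in (template_path, data_path) if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"Missing sample file(s): {missing}")
    return template_path, data_path


# In this notebook it could be called as:
# template_path, data_path = resolve_sample(extraction_samples, target_sample)
```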

## Create Azure Content Understanding client
>The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that contains the functions for interacting with the Content Understanding service. Until a Content Understanding SDK is released, it can be regarded as a lightweight SDK. Set **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** (for example, in your `.env` file) with the information from your Azure AI Service.

In [15]:
import logging
import json
import os
import sys
from dotenv import find_dotenv, load_dotenv

# import utility package from python samples root directory
py_samples_root_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(py_samples_root_dir)
from python.content_understanding_client import AzureContentUnderstandingClient

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

client = AzureContentUnderstandingClient(
    endpoint=os.getenv("AZURE_AI_ENDPOINT"),
    api_version=os.getenv("AZURE_AI_API_VERSION", "2024-12-01-preview"),
    subscription_key=os.getenv("AZURE_AI_API_KEY"),
    api_token=os.getenv("AZURE_AI_API_TOKEN"),
    x_ms_useragent="azure-ai-content-understanding-python/field_extraction",
)
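Since the client reads its settings from environment variables, a quick check can surface a missing value before any request is sent. This is a minimal sketch: `missing_settings` is a hypothetical helper, and it assumes that either `AZURE_AI_API_KEY` or `AZURE_AI_API_TOKEN` is enough for authentication, which the cell above suggests but does not confirm.

```python
import os


def missing_settings(env) -> list:
    """List the settings that are absent or empty in `env` (a dict-like mapping)."""
    missing = []
    if not env.get("AZURE_AI_ENDPOINT"):
        missing.append("AZURE_AI_ENDPOINT")
    # Assumption: one credential (key or token) is sufficient.
    if not (env.get("AZURE_AI_API_KEY") or env.get("AZURE_AI_API_TOKEN")):
        missing.append("AZURE_AI_API_KEY or AZURE_AI_API_TOKEN")
    return missing


problems = missing_settings(os.environ)
if problems:
    print("Missing settings:", ", ".join(problems))
```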

## Create analyzer with defined schema
Before creating the custom fields analyzer, set the constant ANALYZER_ID to a business-related name. Here we generate a random name for demo purposes.

In [16]:
import uuid
ANALYZER_ID = "extraction-sample-" + str(uuid.uuid4())

response = client.begin_create_analyzer(ANALYZER_ID, analyzer_schema_path=extraction_samples[target_sample][0])
result = client.poll_result(response)

logging.info(json.dumps(result, indent=2))

INFO:python.content_understanding_client:Analyzer extraction-sample-dcac80bd-cd4f-4122-aecc-756cddaf7bf0 create request accepted.
INFO:python.content_understanding_client:Request 279e861c-c510-47d7-a858-726c1500400b in progress ...
INFO:python.content_understanding_client:Request 279e861c-c510-47d7-a858-726c1500400b in progress ...
INFO:python.content_understanding_client:Request result is ready after 5.65 seconds.
INFO:root:{
  "id": "279e861c-c510-47d7-a858-726c1500400b",
  "status": "Succeeded",
  "result": {
    "analyzerId": "extraction-sample-dcac80bd-cd4f-4122-aecc-756cddaf7bf0",
    "description": "Sample invoice analyzer",
    "createdAt": "2024-12-09T18:56:45Z",
    "lastModifiedAt": "2024-12-09T18:56:51Z",
    "config": {
      "returnDetails": false,
      "enableOcr": true,
      "enableLayout": true,
      "enableBarcode": false,
      "enableFormula": false
    },
    "fieldSchema": {
      "fields": {
        "VendorName": {
          "type": "string",
          "method
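The full response is long, so a compact view of the field schema can be easier to read. The sketch below relies only on the `result.fieldSchema.fields` nesting and the per-field `type` key visible in the log output; `summarize_fields` is a hypothetical helper, and the `example` dictionary is a hand-trimmed stand-in for the real response.

```python
def summarize_fields(create_result: dict) -> dict:
    """Map each field name in a create-analyzer result to its declared type."""
    fields = create_result.get("result", {}).get("fieldSchema", {}).get("fields", {})
    return {name: spec.get("type") for name, spec in fields.items()}


# Hand-trimmed stand-in for the response logged by the cell above.
example = {
    "status": "Succeeded",
    "result": {"fieldSchema": {"fields": {"VendorName": {"type": "string"}}}},
}
print(summarize_fields(example))  # {'VendorName': 'string'}
```

In the notebook, `summarize_fields(result)` could be called right after `client.poll_result(response)`.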

## Use created analyzer to extract document content


After the analyzer is successfully created, we can use it to analyze our input files.

In [17]:
response = client.begin_analyze(ANALYZER_ID, file_location=extraction_samples[target_sample][1])
result = client.poll_result(response)

logging.info(json.dumps(result, indent=2))

INFO:python.content_understanding_client:Analyzing file ../data/invoice.pdf with analyzer: extraction-sample-dcac80bd-cd4f-4122-aecc-756cddaf7bf0
INFO:python.content_understanding_client:Request 6a7c9abb-ff59-4d0c-af08-306b8cd02dc8 in progress ...
INFO:python.content_understanding_client:Request 6a7c9abb-ff59-4d0c-af08-306b8cd02dc8 in progress ...
INFO:python.content_understanding_client:Request 6a7c9abb-ff59-4d0c-af08-306b8cd02dc8 in progress ...
INFO:python.content_understanding_client:Request result is ready after 8.42 seconds.
INFO:root:{
  "id": "6a7c9abb-ff59-4d0c-af08-306b8cd02dc8",
  "status": "Succeeded",
  "result": {
    "analyzerId": "extraction-sample-dcac80bd-cd4f-4122-aecc-756cddaf7bf0",
    "apiVersion": "2024-12-01-preview",
    "createdAt": "2024-12-09T18:56:54Z",
    "contents": [
      {
        "markdown": "CONTOSO LTD.\n\n\n# INVOICE\n\nContoso Headquarters\n123 456th St\nNew York, NY, 10001\n\nINVOICE: INV-100\n\nINVOICE DATE: 11/15/2019\n\nDUE DATE: 12/15/2019\n
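Rather than reading the whole JSON dump, you may want only the status, a short preview of the extracted Markdown, and the field values. The sketch below uses the `status`, `result.contents`, and `markdown` keys visible in the log output. The location of extracted fields is an assumption: it presumes each content item has a `fields` dictionary whose entries hold their value under a key starting with `value` (such as `valueString`). That part of the response is not shown in the output here, so verify it against your own result. `summarize_analysis` is a hypothetical helper.

```python
def summarize_analysis(analyze_result: dict, preview_chars: int = 80) -> dict:
    """Pull the status, a Markdown preview, and any field values out of a result."""
    contents = analyze_result.get("result", {}).get("contents", [])
    first = contents[0] if contents else {}
    values = {}
    # Assumption: fields live under "fields" and store their value in a
    # "value*" key (for example "valueString"); not confirmed by the log above.
    for name, field in first.get("fields", {}).items():
        value_keys = [key for key in field if key.startswith("value")]
        values[name] = field[value_keys[0]] if value_keys else None
    return {
        "status": analyze_result.get("status"),
        "markdown_preview": first.get("markdown", "")[:preview_chars],
        "fields": values,
    }


# Hand-built stand-in; the "fields" entry is hypothetical.
example = {
    "status": "Succeeded",
    "result": {"contents": [{
        "markdown": "CONTOSO LTD.\n\n\n# INVOICE\n",
        "fields": {"VendorName": {"type": "string", "valueString": "CONTOSO LTD."}},
    }]},
}
summary = summarize_analysis(example)
print(summary["status"], summary["fields"])
```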

## Delete the existing analyzer from the Content Understanding service
This step is optional; it only prevents the test analyzer from lingering in your service. In real usage scenarios, a custom fields analyzer can stay in your service so that later workloads can reuse it.



In [18]:
client.delete_analyzer(ANALYZER_ID)

INFO:python.content_understanding_client:Analyzer extraction-sample-dcac80bd-cd4f-4122-aecc-756cddaf7bf0 deleted.


<Response [204]>
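If an analysis step raises an exception, the cleanup cell above may never run and the test analyzer stays in the service. One way to guard against that is a context manager that always deletes the analyzer on exit. This is a sketch rather than part of the sample: `temporary_analyzer` is a hypothetical helper built on the three client methods used in this notebook, and `RecordingClient` is a stand-in so the example runs without a real service.

```python
from contextlib import contextmanager


@contextmanager
def temporary_analyzer(client, analyzer_id, schema_path):
    """Create an analyzer, yield its ID, and delete it when the block exits."""
    response = client.begin_create_analyzer(analyzer_id, analyzer_schema_path=schema_path)
    client.poll_result(response)
    try:
        yield analyzer_id
    finally:
        # Runs whether the body succeeded or raised.
        client.delete_analyzer(analyzer_id)


class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting the service."""

    def __init__(self):
        self.calls = []

    def begin_create_analyzer(self, analyzer_id, analyzer_schema_path):
        self.calls.append("create")
        return {"id": analyzer_id}

    def poll_result(self, response):
        self.calls.append("poll")
        return {"status": "Succeeded"}

    def delete_analyzer(self, analyzer_id):
        self.calls.append("delete")


fake = RecordingClient()
try:
    with temporary_analyzer(fake, "demo-analyzer", "template.json"):
        raise RuntimeError("analysis failed")  # simulate a failing analysis step
except RuntimeError:
    pass
print(fake.calls)  # ['create', 'poll', 'delete']
```

With the real client, the body of the `with` block would hold the `begin_analyze` and `poll_result` calls shown earlier.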