# Email Classification with Azure OpenAI and Form Recognizer
This notebook demonstrates how to use Azure Form Recognizer with Azure OpenAI and the Azure SDK for Python to classify documents.

## Prerequisites
1. To run the code, install the following packages. This notebook pins `azure-ai-formrecognizer` to version 3.3.0.


- > ! pip install azure-ai-formrecognizer==3.3.0
- > ! pip install openai

## Login to Azure Document Intelligence Service

- An admin client connection is needed to train (build) the classifier.
- A regular client connection is needed to classify user documents.

In [1]:
import fr

# Your Azure Document Intelligence Service Instance
MY_FORM_RECOGNIZER_ENDPOINT = 'https://tr-docai-form-recognizer.cognitiveservices.azure.com/'

formRecognizerCredential = fr.getFormRecognizerCredential()

formRecognizerClient = fr.getDocumentAnalysisClient(
                            endpoint=MY_FORM_RECOGNIZER_ENDPOINT,
                            credential=formRecognizerCredential
                        )
formRecognizerAdminClient = fr.getDocumentModelAdminClient(
                            endpoint=MY_FORM_RECOGNIZER_ENDPOINT,
                            credential=formRecognizerCredential
                        )


Got Azure Form Recognizer API Key from environment variable
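The output above suggests that the `fr` helper looks for an API key in an environment variable. Its source is not shown in this notebook, so the following is only a sketch of how such a helper could work. The variable name `FORM_RECOGNIZER_API_KEY`, both function names, and the managed-identity fallback are assumptions. The Azure SDK imports sit inside `make_clients` so the key lookup can be read and tested on its own.

```python
import os

# Hypothetical variable name; the real `fr` helper may use a different one.
KEY_ENV_VAR = "FORM_RECOGNIZER_API_KEY"


def pick_auth_mode(env=None):
    """Return ("key", value) if an API key is set, else ("managed_identity", None)."""
    env = os.environ if env is None else env
    key = env.get(KEY_ENV_VAR, "").strip()
    return ("key", key) if key else ("managed_identity", None)


def make_clients(endpoint, env=None):
    """Build the analysis client and the admin client from one credential."""
    # Imported lazily: needs azure-ai-formrecognizer, plus azure-identity
    # for the (assumed) managed-identity fallback.
    from azure.ai.formrecognizer import (
        DocumentAnalysisClient,
        DocumentModelAdministrationClient,
    )

    mode, key = pick_auth_mode(env)
    if mode == "key":
        from azure.core.credentials import AzureKeyCredential
        credential = AzureKeyCredential(key)
    else:
        from azure.identity import DefaultAzureCredential
        credential = DefaultAzureCredential()

    return (
        DocumentAnalysisClient(endpoint=endpoint, credential=credential),
        DocumentModelAdministrationClient(endpoint=endpoint, credential=credential),
    )
```

Sharing one credential between the two clients mirrors the cell above, where `formRecognizerCredential` is passed to both constructors.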


## Load all the Azure AI Translator API parameters

Custom document classification in Azure Document Intelligence Service currently supports English only, so we translate the input to English first.

- pip install azure-ai-translation-text==1.0.0b1
- Deploy Azure AI Translator
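The notebook calls the translator through the `aitr` helper, whose source is not shown. For orientation, here is a sketch of what a single translation request looks like at the REST level (Translator v3), assuming key-based authentication rather than the managed identity used in the run below. The function names are hypothetical, and the request is only built, not sent.

```python
import json
from urllib.parse import urlencode


def build_translate_request(endpoint, api_key, region, text, to_lang="en"):
    """Return (url, headers, body) for a Translator v3 `translate` call."""
    query = urlencode({"api-version": "3.0", "to": to_lang})
    url = f"{endpoint.rstrip('/')}/translate?{query}"
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json; charset=UTF-8",
    }
    # The body is a JSON array with one {"Text": ...} object per input string.
    body = json.dumps([{"Text": text}], ensure_ascii=False).encode("utf-8")
    return url, headers, body


def first_translation(response_json):
    """Pull the translated text out of a parsed Translator response."""
    return response_json[0]["translations"][0]["text"]
```

The source language is omitted from the query, which leaves language detection to the service.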

In [3]:
MY_TRANSLATOR_ENDPOINT = 'https://api.cognitive.microsofttranslator.com/'
MY_TRANSLATOR_REGION = 'eastus'
import aitr

translatorCredential = aitr.getAITranslatorCredential(MY_TRANSLATOR_REGION)
translatorClient = aitr.getAITranslatorClient(MY_TRANSLATOR_ENDPOINT, translatorCredential)


Could not get API key from environment variable AI_TRANSLATOR_API_KEY. Trying Managed ID

Authenticated successfully with AAD token


## Load all the AOAI API keys and model parameters

In [4]:
import aoai

MY_AOAI_ENDPOINT = 'https://tr-non-prod-gpt4.openai.azure.com/'
MY_AOAI_VERSION = '2023-07-01-preview'
MY_GPT_ENGINE = 'tr-gpt4'
MY_AOAI_EMBEDDING_ENGINE = 'tr-embedding-ada'

status = aoai.setupOpenai(aoai_endpoint=MY_AOAI_ENDPOINT, 
                 aoai_version=MY_AOAI_VERSION)
if status > 0:
    print("AOAI setup succeeded")
else:
    print("AOAI setup failed")


Could not get API key from environment variable OPENAI_API_KEY. Trying Managed ID

Authenticated successfully with AAD token
AOAI setup succeeded


#### Set the parameters

In [2]:
# TODO: Read from Blob Store
# Assuming you are running notebook from the notebook folder
MY_INPUT_DATA_FILE = r'..\..\..\data\sample-auto-insurance-emails\cleaned-emails-with-classes-for-training.json'
MY_OUTPUT_DATA_FOLDER = r'..\..\..\data\sample-auto-insurance-emails\output'

MY_BLOB_STORE_PATH = r'sample-auto-insurance-emails'
MY_BLOB_STORE_URL = r'https://trxdocaixblob.blob.core.windows.net/docai'

# The different classes
categories = ["PolicyCancellation","IncisoCancellation","PersonChange",
                "VINNumberChange","CoverageChange","SubsequenteRegister",
                "PaymentMethodChange","UseChange","DiscountChange","VehicleChange",
                "BillingChange","VehicleDataChange","Transactionoutofscope"]

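Before generating files it can help to check that every record in the input JSON carries the fields the later cells read (`FileName`, `EmailBody`, and one boolean per category) and to see how many emails each class has. The sketch below assumes that record layout, which is inferred from the code in the next cell; `summarize_labels` is a hypothetical helper name.

```python
REQUIRED_FIELDS = ("FileName", "EmailBody")


def summarize_labels(records, categories):
    """Count positive labels per category and collect malformed records."""
    counts = {category: 0 for category in categories}
    problems = []
    for index, item in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in item]
        missing += [category for category in categories if category not in item]
        if missing:
            # Skip records that the generation cell could not process.
            problems.append((index, missing))
            continue
        for category in categories:
            if item[category] is True:
                counts[category] += 1
    return counts, problems
```

A category whose count is zero would end up with no training documents, which is worth catching before the classifier is built.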
## Create the email files and list files

<font color=red>You do NOT need to run this cell if the files were already generated from  
\DocAI\data\sample-auto-insurance-emails\cleaned-emails-with-classes.json.</font>


In [None]:
import os
import json
from fpdf import FPDF

with open(MY_INPUT_DATA_FILE, 'r', encoding='utf-8') as file:
    input_data = json.load(file)

for item in input_data:    
    email_file_name = item['FileName']
    email_body = item['EmailBody']
    email_body_in_english = aitr.translate(
                                    translator=translatorClient, 
                                    content=email_body, 
                                    to_lang='en')
        
    # Write email to the pdf file
    pdf = FPDF()
    pdf.compress = False
    pdf.accept_page_break()
    pdf.set_margins(left=30.0, top=30.0, right=-1)
    pdf.add_page()
    pdf.add_font(family='arial', fname=r'c:\WINDOWS\Fonts\arial.ttf', uni=True)
    pdf.set_font(family='arial', size=10)
    pdf.write(5,email_body_in_english)
    #pdf.cell(ln=10, h=0, align='L', w=0, txt=email_body, border=0)
    # Build paths with os.path.join instead of backslashes inside f-strings
    pdf_path = os.path.join(MY_OUTPUT_DATA_FOLDER, 'emails', email_file_name)
    pdf.output(pdf_path, 'F')
    pdf.close()
    
    # Write file list in each category file
    for category in categories:
        if item[category] == True:
            file_path = f'{MY_BLOB_STORE_PATH}/output/emails/{email_file_name}'
            list_path = os.path.join(MY_OUTPUT_DATA_FOLDER, f'{category}.jsonl')
            # Append one JSON line per email; the with-block closes the file
            with open(list_path, "a", encoding="utf-8") as f:
                f.write(json.dumps({"file": file_path}) + '\n')
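Because the cell above opens each `.jsonl` file in append mode, running it a second time adds every path again. One way around that, sketched here under the same record layout, is to collect the paths per category in memory first and then write each file once. The helper names `build_file_lists` and `write_file_lists` are hypothetical.

```python
import json
import os


def build_file_lists(records, categories, blob_prefix):
    """Map each category to the blob paths of the emails labelled with it."""
    lists = {category: [] for category in categories}
    for item in records:
        path = f"{blob_prefix}/output/emails/{item['FileName']}"
        for category in categories:
            if item.get(category) is True:
                lists[category].append(path)
    return lists


def write_file_lists(lists, output_folder):
    """Write one `<category>.jsonl` per category, replacing earlier runs."""
    os.makedirs(output_folder, exist_ok=True)
    for category, paths in lists.items():
        target = os.path.join(output_folder, f"{category}.jsonl")
        # Mode "w" overwrites, so reruns do not duplicate entries.
        with open(target, "w", encoding="utf-8") as handle:
            for path in paths:
                handle.write(json.dumps({"file": path}) + "\n")
```

With the notebook's variables this would be called as `write_file_lists(build_file_lists(input_data, categories, MY_BLOB_STORE_PATH), MY_OUTPUT_DATA_FOLDER)`.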

#### Load the sample-auto-insurance-emails folder to your blob store
Azure Document Intelligence Service reads the emails and the per-class file lists from this blob store to train the classifier.  
<b>TODO:</b> Automatically upload the files to the blob store.  

For now, manually upload the <b>sample-auto-insurance-emails</b> folder to the root of your container in your blob store.
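Until the upload is automated in the notebook, a sketch along the following lines could replace the manual step. It assumes the `azure-storage-blob` package and a container SAS URL with write permission; `plan_uploads` and `upload_folder` are hypothetical names. The path mapping is kept separate from the upload so it can be checked without touching the storage account.

```python
import os


def plan_uploads(local_root, blob_prefix):
    """Yield (local_path, blob_name) pairs for every file under local_root."""
    for folder, _subdirs, files in os.walk(local_root):
        for name in sorted(files):
            local_path = os.path.join(folder, name)
            relative = os.path.relpath(local_path, local_root)
            # Blob names use forward slashes, even when built on Windows.
            blob_name = f"{blob_prefix}/{relative.replace(os.sep, '/')}"
            yield local_path, blob_name


def upload_folder(local_root, blob_prefix, container_sas_url):
    """Upload the folder to the container identified by a SAS URL."""
    # Imported lazily: requires the azure-storage-blob package.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_container_url(container_sas_url)
    for local_path, blob_name in plan_uploads(local_root, blob_prefix):
        with open(local_path, "rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)
```

Passing the local `sample-auto-insurance-emails` folder as `local_root` and `MY_BLOB_STORE_PATH` as `blob_prefix` would reproduce the layout the file lists refer to.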

## Form Recognizer Model for Classification
- Create the model
- Analyze the documents to create the OCR files
- Train the classifier
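The analysis step relies on `fr.extractResultFromOnlineDocument`, whose source is not shown. A minimal sketch of what such a helper could look like, assuming it wraps `begin_analyze_document_from_url` from `azure-ai-formrecognizer` 3.3.0 and derives the handwriting flag from the result's `styles`. The name `extract_layout` is hypothetical, and the real helper may differ.

```python
def extract_layout(client, model, url):
    """Analyze a document by URL and report basic facts about the result."""
    # `client` is expected to behave like DocumentAnalysisClient.
    poller = client.begin_analyze_document_from_url(model, document_url=url)
    result = poller.result()  # blocks until the long-running operation ends
    # `styles` can be None or empty when no style information is returned.
    is_handwritten = any(
        style.is_handwritten for style in (result.styles or [])
    )
    return result.api_version, result.model_id, is_handwritten, result
```

The return tuple follows the order the notebook unpacks: API version, model ID, handwriting flag, and the full result.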

In [6]:
## CREATE BUILD LAYOUT

import json

MY_BLOB_STORE_CONTAINER_SAS_URL = r'https://trxdocaixblob.blob.core.windows.net/docai?sp=racwdlmeop&st=2023-10-24T18:11:31Z&se=2024-10-25T02:11:31Z&spr=https&sv=2022-11-02&sr=c&sig=kVn2BJR5Yoeesm3MfQYUvUS%2FMZLAbe7fpJbELa8qHlw%3D'

with open(MY_INPUT_DATA_FILE, 'r', encoding='utf-8') as file:
    input_data = json.load(file)

for item in input_data:    
    email_file_name = item['FileName']

    URL_PATH=f'{MY_BLOB_STORE_URL}/{MY_BLOB_STORE_PATH}/output/emails/{email_file_name}'
    print(f"File: {URL_PATH}")
    fr_api_version, model_id, is_handwritten, result = fr.extractResultFromOnlineDocument(
                                                            client=formRecognizerClient,
                                                            model='prebuilt-layout',
                                                            url=URL_PATH
                                                        )

    print(f'Document Intelligence API version = {fr_api_version}\n \
            Document Extraction Model Id = {model_id}\n \
            Does document have any hand written text? {is_handwritten}\n'
         )

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/9eabeba6-1b78-4076-a8e0-19a81abd3ebd.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/b0d110ab-f103-46a1-b16b-6da80fc55e14.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/36040456-f559-4bf7-9d5f-c5c762446653.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insur

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/b4bc5140-38d8-4c2a-89f1-70165175f23b.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/fac0e189-688f-419b-bf50-e42838dd8468.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/d8c3a267-21be-4017-b3d0-a1711a76e4df.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/bde8d29f-4e2f-4f9b-a5b6-59949c68b9ec.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/64d97682-21fd-4c18-904e-87403259fc63.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/28f795eb-2077-4803-9de1-61c61c7ca3db.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/5c5da039-cfcc-44d2-8d1a-ef734d78228e.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/522f5e8d-ae73-49be-b4c6-c6499ffb57bf.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/62ffd6cf-c39e-48c1-8fe3-1b0652d76f88.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/81c8e7e2-4879-4f1d-9140-d80648148db5.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/b6ba905b-221d-4665-b431-9fd3fc2f9a27.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/8febf4b0-9225-4496-b24d-abdd2d2669fe.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/0c412846-8776-4120-a8d8-8a9745f4fa6f.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/c689f4a5-1382-4772-b28e-04d7660d9755.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/b09fc190-a956-459e-ba63-ce9a9ce70dd2.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/3ab05314-8659-47eb-ab8b-463b1fb18248.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/4f720875-be6e-47e2-aba4-467be92a5cb0.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/6e992864-7217-423e-8f82-b8a1fed91af5.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/fee83368-c02b-4bdb-8387-0f776f391822.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/9b735c11-79d6-4ab6-a4f6-0dd28040233b.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/9377a5ef-1cd7-449d-9399-e54e964f55f5.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/0e9d0418-1330-43b2-ae37-bdf5ec173781.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/fbb1c921-353a-4918-a2f2-a96cbfa74b69.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/ff8884b3-76ba-4826-840d-4de4ba651ffe.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/edf0867b-c03f-4105-8881-e14526ca25ba.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/36252318-b5d4-4413-9e4f-f7af16bd42e7.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/4c785677-37f1-44d5-b9f0-7494289fac0e.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/9e6ec581-09a1-4d2b-88b9-6470ca18ddb1.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/2a9e339c-a871-4676-a0da-a38e8f0f4284.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/c4d44962-3d30-4340-882b-4049b9b930be.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/37ecea43-a4e8-41a8-a5b3-ac7d780be687.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/b0bed9a5-db00-4d43-8a4e-a11c33ee50cc.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/eb7a9652-04ef-4aed-a000-3d326329639a.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/5f2de7fb-4b5c-4f58-8383-7609b28034e0.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/22e6e015-3515-4959-a445-2e31ebcc5614.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/b328691d-4fde-403f-ad6e-d3730963089f.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
...

In [9]:
# Create the categoryFileMap, needed by Form Recognizer for training
MY_BLOB_URL = r'https://trxdocaixblob.blob.core.windows.net/docai'
categoryFileMap = {}
for category in categories:
    categoryFileMap[category] = f'{MY_BLOB_STORE_PATH}/output/{category}.jsonl'
result = fr.trainClassifier(
                            admin_client=formRecognizerAdminClient,
                            blob_url=MY_BLOB_URL,
                            class_file_list=categoryFileMap
                           )
classifierId = result.classifier_id
print(f"Classifier ID: {classifierId}")
print(f"API version used to build the classifier model: {result.api_version}")
print(f"Classifier description: {result.description}")
print(f"Document Classifier expires on: {result.expires_on}")
print("Document classes used for training the model:")
for doc_type, details in result.doc_types.items():
    print(f"Document type: {doc_type}")
    print(f"Container source: {details.source.container_url}\n")

HttpResponseError: (InvalidArgument) Invalid argument.
Code: InvalidArgument
Message: Invalid argument.
Exception Details:	(InvalidContentSourceFormat) Invalid content source: Could not read build content.
	Code: InvalidContentSourceFormat
	Message: Invalid content source: Could not read build content.

In [None]:
print(f"Result: {result}")

In [None]:
fr.deleteClassifier(admin_client=formRecognizerAdminClient, classifier_id=classifierId)
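If training fails, as in the run above, `classifierId` may never be assigned and the cleanup cell cannot run as written. A small guard avoids that. The sketch assumes that `fr.deleteClassifier` wraps `delete_document_classifier` on the SDK's admin client, which is not confirmed by the notebook; `delete_classifier_if_built` is a hypothetical name.

```python
def delete_classifier_if_built(admin_client, classifier_id):
    """Delete the classifier when an ID is available; report what happened."""
    if not classifier_id:
        # Nothing was built, so there is nothing to clean up.
        return False
    admin_client.delete_document_classifier(classifier_id)
    return True
```

In the notebook this could be called with `globals().get("classifierId")` so that a failed training run does not raise a `NameError` during cleanup.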