# Email Classification with Azure OpenAI and Form Recognizer
This code demonstrates how to use Azure Form Recognizer with OpenAI and the Azure SDK for Python to classify documents.

## Prerequisites
1. To run the code, install the following packages. Pin `azure-ai-formrecognizer` to version 3.3.0, which supports the document classifier APIs used below.


- > ! pip install azure-ai-formrecognizer==3.3.0
- > ! pip install openai

## Login to Azure Document Intelligence Service

- An admin client connection is needed to train (build) the classifier.
- A regular client connection is needed to classify user documents.

In [1]:
import fr

# Your Azure Document Intelligence Service Instance
MY_FORM_RECOGNIZER_ENDPOINT = 'https://tr-docai-form-recognizer.cognitiveservices.azure.com/'

formRecognizerCredential = fr.getFormRecognizerCredential()

formRecognizerClient = fr.getDocumentAnalysisClient(
                            endpoint=MY_FORM_RECOGNIZER_ENDPOINT,
                            credential=formRecognizerCredential
                        )
formRecognizerAdminClient = fr.getDocumentModelAdminClient(
                            endpoint=MY_FORM_RECOGNIZER_ENDPOINT,
                            credential=formRecognizerCredential
                        )


Got Azure Form Recognizer API Key from environment variable
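
The `fr` module imported above is a local helper that this notebook does not show. Judging by the output line, it reads the API key from an environment variable. The following is a minimal sketch of how such helpers could be written against `azure-ai-formrecognizer` 3.3.0. The variable name `FORM_RECOGNIZER_API_KEY` and the function names `read_api_key` and `make_clients` are assumptions for illustration, not the actual contents of `fr`.

```python
import os


def read_api_key(env_var="FORM_RECOGNIZER_API_KEY"):
    """Return the API key stored in env_var (the variable name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def make_clients(endpoint, key):
    """Build the analysis client and the administration client for one endpoint."""
    # Imported lazily so this cell loads even before the SDK is installed.
    from azure.ai.formrecognizer import (
        DocumentAnalysisClient,
        DocumentModelAdministrationClient,
    )
    from azure.core.credentials import AzureKeyCredential

    credential = AzureKeyCredential(key)
    analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=credential)
    admin_client = DocumentModelAdministrationClient(endpoint=endpoint, credential=credential)
    return analysis_client, admin_client
```

With these in place, `make_clients(MY_FORM_RECOGNIZER_ENDPOINT, read_api_key())` would return the two clients that the rest of the notebook uses.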


## Load all the Azure AI Translator API parameters

Custom document classification in Azure Document Intelligence Service currently supports English only, so we translate the input to English first.

- pip install azure-ai-translation-text==1.0.0b1
- Deploy Azure AI Translator

In [2]:
MY_TRANSLATOR_ENDPOINT = 'https://api.cognitive.microsofttranslator.com/'
MY_TRANSLATOR_REGION = 'eastus'
import aitr

translatorCredential = aitr.getAITranslatorCredential(MY_TRANSLATOR_REGION)
translatorClient = aitr.getAITranslatorClient(MY_TRANSLATOR_ENDPOINT, translatorCredential)


Got Azure AI Translator API Key from environment variable


#### Set the parameters

In [4]:
# TODO: Read from Blob Store
# Assuming you are running notebook from the notebook folder
MY_PROJECT_ROOT = r'..\..\..\data\sample-auto-insurance-emails\\'
MY_INPUT_DATA_FILE = r'..\..\..\data\sample-auto-insurance-emails\cleaned-emails-with-classes-for-training.json'
MY_TRAINING_DATA_BASE_FOLDER = r'..\..\..\data\sample-auto-insurance-emails\training'
MY_TRAINING_DOCS_SUBFOLDER = r'emails'
MY_TRAINING_DATA_EMAIL_DOCS_FOLDER = f'{MY_TRAINING_DATA_BASE_FOLDER}\\{MY_TRAINING_DOCS_SUBFOLDER}'

MY_TEST_DATA_FILE = r'..\..\..\data\sample-auto-insurance-emails\cleaned-emails-with-classes-for-test.json'
MY_TEST_DOCUMENT_FOLDER = r'..\..\..\data\sample-auto-insurance-emails\test'

MY_BLOB_STORE_PATH = r'sample-auto-insurance-emails/training'
MY_BLOB_STORE_URL = r'https://trxdocaixblob.blob.core.windows.net/docai'

# The different classes
categories = ["PolicyCancellation","IncisoCancellation","PersonChange",
                "VINNumberChange","CoverageChange","SubsequenteRegister",
                "PaymentMethodChange","UseChange","DiscountChange","VehicleChange",
                "BillingChange","VehicleDataChange","Transactionoutofscope"]

## Create the email files and list files

<font color=red>You do NOT need to run this cell if the files were already generated from  
\DocAI\data\sample-auto-insurance-emails\cleaned-emails-with-classes.json.</font>


#### Generate files for training

In [4]:
import os
import json
from fpdf import FPDF
import docutil

docutil.generateEnglishPDFDocsFromJson(
                                        json_file=MY_INPUT_DATA_FILE,
                                        output_folder=MY_TRAINING_DATA_EMAIL_DOCS_FOLDER,
                                        translator_client=translatorClient
                                      )

docutil.generateCategoryFileList(
                                    json_file=MY_INPUT_DATA_FILE,
                                    output_folder=MY_TRAINING_DATA_BASE_FOLDER,
                                    docs_subfolder=MY_TRAINING_DOCS_SUBFOLDER,
                                    categories=categories
                                )
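
`docutil.generateCategoryFileList` is a local helper that is not shown here. A file list for classifier training is a JSONL file in which each line is a JSON object whose `file` key names one training document in the container. The sketch below writes one such file per category. It assumes, hypothetically, that each input record has an `id` field (used as the PDF file name) and a `category` field; the real schema of the input JSON may differ, and `write_category_file_lists` is an illustrative name.

```python
import json
from collections import defaultdict
from pathlib import Path


def write_category_file_lists(records, output_folder, blob_prefix, categories):
    """Write one <category>.jsonl per class, listing that class's PDF files.

    `records` is assumed to be a list of dicts with the hypothetical keys
    "id" (used as the PDF file name) and "category".
    """
    files_by_category = defaultdict(list)
    for record in records:
        if record["category"] in categories:
            blob_name = f"{blob_prefix}/{record['id']}.pdf"
            files_by_category[record["category"]].append(blob_name)

    output_folder = Path(output_folder)
    output_folder.mkdir(parents=True, exist_ok=True)
    written = {}
    for category in categories:
        list_path = output_folder / f"{category}.jsonl"
        with list_path.open("w", encoding="utf-8") as handle:
            for blob_name in files_by_category[category]:
                # One JSON object per line, each naming a single document.
                handle.write(json.dumps({"file": blob_name}) + "\n")
        written[category] = list_path
    return written
```

The `blob_prefix` argument is kept explicit because the paths in the file list must match where the PDFs end up in the container, which depends on how the folder is uploaded.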


#### Generate files for testing the classifier

In [5]:
import docutil
docutil.generateEnglishPDFDocsFromJson(MY_TEST_DATA_FILE, MY_TEST_DOCUMENT_FOLDER, translatorClient)

#### Load the sample-auto-insurance-emails folder to your blob store
Azure Document Intelligence Service reads the emails and the class file lists from this blob store to train the classifier.  
<b>TODO:</b> Automatically upload the files to the blob store.  

For now, manually upload the <b>sample-auto-insurance-emails</b> folder to the root of your container in your blob store.
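
As a starting point for the upload TODO, here is a hedged sketch that walks a local folder and uploads each file with `azure-storage-blob`. The function names `plan_uploads` and `upload_folder` are illustrative, and the `credential` argument (for example a SAS token or account key) is a placeholder you would supply; none of this is part of the notebook's helper modules.

```python
from pathlib import Path


def plan_uploads(local_root, blob_prefix):
    """Map every file under local_root to a blob name below blob_prefix."""
    local_root = Path(local_root)
    plan = []
    for path in sorted(local_root.rglob("*")):
        if path.is_file():
            # Blob names use forward slashes regardless of the local OS.
            relative = path.relative_to(local_root).as_posix()
            plan.append((path, f"{blob_prefix}/{relative}"))
    return plan


def upload_folder(container_url, credential, local_root, blob_prefix):
    """Upload the planned files; container_url and credential are placeholders."""
    # Lazy import: requires `pip install azure-storage-blob`.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_container_url(container_url, credential=credential)
    for path, blob_name in plan_uploads(local_root, blob_prefix):
        with path.open("rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)
```

Calling `plan_uploads` on its own first is a cheap way to check the resulting blob names against the paths in the `.jsonl` file lists before anything is uploaded.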

## Form Recognizer Model for Classification
- Generate the layout for the training files (that is, the `.ocr.json` files) using the Studio (doing this from Python code is planned)
- Train the classifier
- Classify using the trained model

#### Run layout to create .ocr.json files

#### Train classifier

In [5]:
#TODO: populate category based on above code that created the actual category files....
categories = ["PolicyCancellation","IncisoCancellation","PersonChange",
                "VINNumberChange","CoverageChange","SubsequenteRegister",
                "PaymentMethodChange","UseChange","DiscountChange","VehicleChange",
                "BillingChange","VehicleDataChange"]

# Create the categoryFileMap, needed by Form Recognizer for training
categoryFileMap = {}
for category in categories:
    categoryFile = f'{MY_BLOB_STORE_PATH}/{category}.jsonl'
    categoryFileMap[category] = categoryFile
result = fr.trainClassifier(
                            admin_client=formRecognizerAdminClient,
                            blob_url=MY_BLOB_STORE_URL,
                            class_file_list=categoryFileMap
                           )
MY_CLASSIFIER_ID = result.classifier_id
print(f"Classifier ID: {MY_CLASSIFIER_ID}")
print(f"API version used to build the classifier model: {result.api_version}")
print(f"Classifier description: {result.description}")
print(f"Document Classifier expires on: {result.expires_on}")
print("Document classes used for training the model:")
for doc_type, details in result.doc_types.items():
    print(f"Document type: {doc_type}")
    print(f"Container source: {details.source.container_url}\n")

{'PolicyCancellation': ClassifierDocumentTypeDetails(source_kind=azureBlobFileList, source=BlobFileListSource(container_url=https://trxdocaixblob.blob.core.windows.net/docai, file_list=sample-auto-insurance-emails/training/PolicyCancellation.jsonl)), 'IncisoCancellation': ClassifierDocumentTypeDetails(source_kind=azureBlobFileList, source=BlobFileListSource(container_url=https://trxdocaixblob.blob.core.windows.net/docai, file_list=sample-auto-insurance-emails/training/IncisoCancellation.jsonl)), 'PersonChange': ClassifierDocumentTypeDetails(source_kind=azureBlobFileList, source=BlobFileListSource(container_url=https://trxdocaixblob.blob.core.windows.net/docai, file_list=sample-auto-insurance-emails/training/PersonChange.jsonl)), 'VINNumberChange': ClassifierDocumentTypeDetails(source_kind=azureBlobFileList, source=BlobFileListSource(container_url=https://trxdocaixblob.blob.core.windows.net/docai, file_list=sample-auto-insurance-emails/training/VINNumberChange.jsonl)), 'CoverageChange':

In [6]:
print(f"Result: {result}")

Result: DocumentClassifierDetails(classifier_id=ca4a9518-0a8b-4588-8f8f-8705b1426b58, description=Auto Insurance Email Classifier, created_on=2023-10-26 13:22:56+00:00, expires_on=2025-10-25 13:22:56+00:00, api_version=2023-07-31, doc_types={'PolicyCancellation': ClassifierDocumentTypeDetails(source_kind=azureBlobFileList, source=BlobFileListSource(container_url=https://trxdocaixblob.blob.core.windows.net/docai, file_list=sample-auto-insurance-emails/training/PolicyCancellation.jsonl)), 'IncisoCancellation': ClassifierDocumentTypeDetails(source_kind=azureBlobFileList, source=BlobFileListSource(container_url=https://trxdocaixblob.blob.core.windows.net/docai, file_list=sample-auto-insurance-emails/training/IncisoCancellation.jsonl)), 'PersonChange': ClassifierDocumentTypeDetails(source_kind=azureBlobFileList, source=BlobFileListSource(container_url=https://trxdocaixblob.blob.core.windows.net/docai, file_list=sample-auto-insurance-emails/training/PersonChange.jsonl)), 'VINNumberChange': C
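
The `fr.trainClassifier` helper used above is not shown in this notebook. A minimal sketch of what it might do with `azure-ai-formrecognizer` 3.3.0 follows: wrap each category's `.jsonl` file list in a `BlobFileListSource`, start the build, and wait for the result. The names `build_category_file_map` and `train_classifier` are illustrative assumptions, not the helper's real implementation.

```python
def build_category_file_map(categories, blob_path):
    """Return {category: '<blob_path>/<category>.jsonl'} for every category."""
    return {category: f"{blob_path}/{category}.jsonl" for category in categories}


def train_classifier(admin_client, blob_url, class_file_list, description=None):
    """Build a document classifier from per-category file lists and wait for it."""
    # Imported lazily so this cell loads even before the SDK is installed.
    from azure.ai.formrecognizer import BlobFileListSource, ClassifierDocumentTypeDetails

    doc_types = {
        category: ClassifierDocumentTypeDetails(
            source=BlobFileListSource(container_url=blob_url, file_list=file_list)
        )
        for category, file_list in class_file_list.items()
    }
    poller = admin_client.begin_build_document_classifier(doc_types, description=description)
    # result() blocks until training finishes and returns DocumentClassifierDetails.
    return poller.result()
```

The returned object exposes the same attributes that the training cell prints, such as `classifier_id`, `api_version`, and `doc_types`.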

#### Test classifier

In [13]:
import os
# Get list of test pdf files
# MY_CLASSIFIER_ID = 'ca4a9518-0a8b-4588-8f8f-8705b1426b58'

for file in os.listdir(MY_TEST_DOCUMENT_FOLDER):
    if os.path.isfile(os.path.join(MY_TEST_DOCUMENT_FOLDER, file)):
        print(f'Classify file: {file}')
        result = fr.classifyDocument(
                        client=formRecognizerClient,
                        classifier_id=MY_CLASSIFIER_ID,
                        file_path=os.path.join(MY_TEST_DOCUMENT_FOLDER, file)
                )
        for doc in result.documents:
            print(
                f"\tFound document of type '{doc.doc_type or 'N/A'}' with a confidence of {doc.confidence} contained on "
                f"the following pages: {[region.page_number for region in doc.bounding_regions]}"
            )


Classify file: 6f6c1353-d3a7-4ccf-a3df-7ac45de7abbb.pdf
	Found document of type 'BillingChange' with a confidence of 0.07 contained on the following pages: [1]
Classify file: 724b4f48-5e97-48c5-b05e-2ae99eb4da34.pdf
	Found document of type 'CoverageChange' with a confidence of 0.058 contained on the following pages: [1]
Classify file: 7d8b0441-5018-40a9-852d-48f32d1acc79.pdf
	Found document of type 'DiscountChange' with a confidence of 0.054 contained on the following pages: [1]
Classify file: ab8f5654-61b2-4eea-83a7-dccacdf52022.pdf
	Found document of type 'PolicyCancellation' with a confidence of 0.056 contained on the following pages: [1]
Classify file: e7014199-7b44-49b4-9c66-49a17d2d2c81.pdf
	Found document of type 'BillingChange' with a confidence of 0.075 contained on the following pages: [1]
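
Every confidence in the output above is below 0.08, so the top label alone says little about how well the classifier works. One way to make the test more informative is to compare predictions against the expected classes and flag low-confidence results. The sketch below assumes a hypothetical `expected` mapping from file name to class (for example, built from the test JSON file) and an arbitrarily chosen threshold; `summarize_predictions` is an illustrative name.

```python
def summarize_predictions(predictions, expected, min_confidence=0.5):
    """Return (accuracy, low_confidence_files).

    predictions: {file_name: (doc_type, confidence)}
    expected:    {file_name: doc_type}  (hypothetical ground-truth mapping)
    """
    correct = 0
    low_confidence = []
    for file_name, (doc_type, confidence) in predictions.items():
        if doc_type == expected.get(file_name):
            correct += 1
        if confidence < min_confidence:
            # The threshold is an assumption; tune it for your data.
            low_confidence.append(file_name)
    accuracy = correct / len(predictions) if predictions else 0.0
    return accuracy, low_confidence
```

The `predictions` dictionary could be filled inside the classification loop with `predictions[file] = (doc.doc_type, doc.confidence)` for each returned document.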


## Delete the Classifier for cleanup

In [None]:
fr.deleteClassifier(admin_client=formRecognizerAdminClient, classifier_id=MY_CLASSIFIER_ID)