# Email Classification with Azure OpenAI and Form Recognizer
This code demonstrates how to use Azure Form Recognizer with OpenAI and the Azure Python SDK to classify documents.

## Prerequisites
1. To run the code, install the following packages. The Form Recognizer SDK is pinned to version 3.3.0: `pip install azure-ai-formrecognizer==3.3.0`.


- > ! pip install azure-ai-formrecognizer==3.3.0
- > ! pip install openai

## Login to Azure Document Intelligence Service

- An admin client connection is needed to train/build the classifier
- A regular client connection is needed to classify a user document

In [9]:
import fr

# Your Azure Document Intelligence Service Instance
MY_FORM_RECOGNIZER_ENDPOINT = 'https://tr-docai-form-recognizer.cognitiveservices.azure.com/'

formRecognizerCredential = fr.getFormRecognizerCredential()

formRecognizerClient = fr.getDocumentAnalysisClient(
                            endpoint=MY_FORM_RECOGNIZER_ENDPOINT,
                            credential=formRecognizerCredential
                        )
formRecognizerAdminClient = fr.getDocumentModelAdminClient(
                            endpoint=MY_FORM_RECOGNIZER_ENDPOINT,
                            credential=formRecognizerCredential
                        )


Got Azure Form Recognizer API Key from environment variable
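
The `fr` module used above is a local helper that is not shown in this notebook. As a rough sketch of what such helpers might do, the code below reads an API key from an environment variable and builds both clients with the Azure SDK. The environment variable name `FORM_RECOGNIZER_API_KEY` and the function names `read_api_key` and `build_clients` are assumptions for illustration, not the actual contents of `fr`.

```python
import os

# Hypothetical environment variable name; the notebook's `fr` module may use another.
API_KEY_ENV_VAR = "FORM_RECOGNIZER_API_KEY"


def read_api_key(env_var: str = API_KEY_ENV_VAR) -> str:
    """Return the API key stored in `env_var`, failing loudly if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


def build_clients(endpoint: str, api_key: str):
    """Create the analysis client and the administration client for one endpoint."""
    # Imported lazily so the key handling above works without the SDK installed.
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.formrecognizer import (
        DocumentAnalysisClient,
        DocumentModelAdministrationClient,
    )

    credential = AzureKeyCredential(api_key)
    analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=credential)
    admin_client = DocumentModelAdministrationClient(endpoint=endpoint, credential=credential)
    return analysis_client, admin_client
```

Keeping the key in the environment rather than in the notebook avoids committing credentials along with the cell contents.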


## Load all the Azure AI Translator API parameters

Custom document classification in Azure Document Intelligence currently supports English only, so the input is translated to English first.

- pip install azure-ai-translation-text==1.0.0b1
- Deploy Azure AI Translator

In [10]:
MY_TRANSLATOR_ENDPOINT = 'https://api.cognitive.microsofttranslator.com/'
MY_TRANSLATOR_REGION = 'eastus'
import aitr

translatorCredential = aitr.getAITranslatorCredential(MY_TRANSLATOR_REGION)
translatorClient = aitr.getAITranslatorClient(MY_TRANSLATOR_ENDPOINT, translatorCredential)


Got Azure AI Translator API Key from environment variable


#### Set the parameters

In [11]:
# TODO: Read from Blob Store
# Assuming you are running notebook from the notebook folder
MY_INPUT_DATA_FILE = r'..\..\..\data\sample-auto-insurance-emails\cleaned-emails-with-classes-for-training.json'
MY_OUTPUT_DATA_FOLDER = r'..\..\..\data\sample-auto-insurance-emails\output'

MY_BLOB_STORE_PATH = r'sample-auto-insurance-emails'
MY_BLOB_STORE_URL = r'https://trxdocaixblob.blob.core.windows.net/docai'

# The different classes
categories = ["PolicyCancellation","IncisoCancellation","PersonChange",
                "VINNumberChange","CoverageChange","SubsequenteRegister",
                "PaymentMethodChange","UseChange","DiscountChange","VehicleChange",
                "BillingChange","VehicleDataChange","Transactionoutofscope"]

## Create the email files and list files

<font color=red>You do NOT need to run this cell if the files were already generated from  
\DocAI\data\sample-auto-insurance-emails\cleaned-emails-with-classes.json.</font>
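
Each category list file is a JSON Lines file with one `{"file":"<blob path>"}` record per line. A minimal sketch of writing and reading such a file with the standard library follows; the function names `append_file_entry` and `read_file_entries` are illustrative and not part of the notebook.

```python
import json


def append_file_entry(list_path: str, blob_path: str) -> None:
    """Append one {"file": ...} record to a JSON Lines file list."""
    # json.dumps escapes quotes and backslashes that string concatenation would not.
    record = json.dumps({"file": blob_path}, separators=(",", ":"))
    with open(list_path, "a", encoding="utf-8") as handle:
        handle.write(record + "\n")


def read_file_entries(list_path: str) -> list[str]:
    """Return the blob paths stored in a JSON Lines file list, skipping blank lines."""
    with open(list_path, "r", encoding="utf-8") as handle:
        return [json.loads(line)["file"] for line in handle if line.strip()]
```

Because the file is opened in append mode, re-running the generation step adds duplicate entries unless the old `.jsonl` files are deleted first.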


In [12]:
import os
import json
from fpdf import FPDF

MAX_TRAINING_FILES_FOR_EACH_CATEGORY=4

with open(MY_INPUT_DATA_FILE, 'r', encoding='utf-8') as file:
    input_data = json.load(file)

category_file_count = {}
for category in categories:
    category_file_count[category] = MAX_TRAINING_FILES_FOR_EACH_CATEGORY
    
for item in input_data:
    
    # Stop once every category has its full quota of training files
    if all(count <= 0 for count in category_file_count.values()):
        break
    
    email_file_name = item['FileName']
    email_body = item['EmailBody']
    email_body_in_english = aitr.translate(
                                    translator=translatorClient, 
                                    content=email_body, 
                                    to_lang='en')
        
    # Write email to the pdf file
    pdf = FPDF()
    pdf.compress = False
    pdf.accept_page_break()
    pdf.set_margins(left=30.0, top=30.0, right=-1)
    pdf.add_page()
    pdf.add_font(family='arial', fname=r'c:\WINDOWS\Fonts\arial.ttf', uni=True)
    pdf.set_font(family='arial', size=10)
    pdf.write(5,email_body_in_english)
    # os.path.join avoids the invalid '\e' escape sequence in the original f-string
    pdf.output(os.path.join(MY_OUTPUT_DATA_FOLDER, 'emails', email_file_name), 'F')
    pdf.close()
        
    # Append this email to the file list of each category it belongs to
    for category in categories:
        if item[category] == True and category_file_count[category] > 0:
            file_path = f'{MY_BLOB_STORE_PATH}/output/emails/{email_file_name}'
            list_path = os.path.join(MY_OUTPUT_DATA_FOLDER, f'{category}.jsonl')
            # The with-block closes the file; the original `f.close` was never called
            with open(list_path, 'a', encoding='utf-8') as f:
                f.write('{"file":"' + file_path + '"}\n')
            category_file_count[category] -= 1

File: 9eabeba6-1b78-4076-a8e0-19a81abd3ebd.pdf
Body: LOGO BAJA HC43001043 DE UNIDADES FLOTILLA J E MAFER - FORZA 

Sandra Espitia Arce, Ejecutivo Productor, extiende su solicitud para que se realice una acción en línea con lo siguiente:

POLIZA HC43001043 

1. BAJA DE INCISO HC43001043 FORZA J E debido a la venta de la unidad. 

Además, se solicita ayuda para la cancelación de la siguiente política: 

POLIZA KD42900002 92 

2. BAJA DE INCISO Sentra 2007 y se pide que se comparta el endoso de cancelación. 

Quedamos atentos y agradecemos su atención a estas solicitudes. 

Saludos.


ServiceRequestError: (<urllib3.connection.HTTPSConnection object at 0x00000203A4CEE350>, 'Connection to api.cognitive.microsofttranslator.com timed out. (connect timeout=300)')
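
A timeout like the one above is often transient, so one option is to retry the call with exponential backoff. The helper below is a generic sketch using only the standard library; `call_with_retries` and its parameters are hypothetical names and are not part of `aitr` or the Azure SDK.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

In this notebook it could wrap the translation step, for example `call_with_retries(aitr.translate, translator=translatorClient, content=email_body, to_lang='en', retry_on=(ServiceRequestError,))`, with `ServiceRequestError` imported from `azure.core.exceptions`.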

#### Load the sample-auto-insurance-emails folder to your blob store
Azure Document Intelligence Service reads the emails and the class file lists from this blob store to train the classifier.  
<b>TODO:</b> Automatically upload the files to the blob store.  

For now, manually upload the <b>sample-auto-insurance-emails</b> folder to the root of your container in your blob store.
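
One possible way to automate the upload is sketched below with the `azure-storage-blob` package (`pip install azure-storage-blob`). The function names `plan_uploads` and `upload_folder` are hypothetical, and the sketch assumes a container SAS URL that grants write permission.

```python
import os


def plan_uploads(local_root: str, blob_prefix: str):
    """Yield (local_path, blob_name) pairs for every file under local_root."""
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for filename in sorted(filenames):
            local_path = os.path.join(dirpath, filename)
            relative = os.path.relpath(local_path, local_root)
            # Blob names always use forward slashes, whatever the local OS uses.
            blob_name = "/".join([blob_prefix.strip("/")] + relative.split(os.sep))
            yield local_path, blob_name


def upload_folder(container_sas_url: str, local_root: str, blob_prefix: str) -> int:
    """Upload every file under local_root to the container; return the file count."""
    # Imported lazily so plan_uploads can be used without the SDK installed.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_container_url(container_sas_url)
    count = 0
    for local_path, blob_name in plan_uploads(local_root, blob_prefix):
        with open(local_path, "rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)
        count += 1
    return count
```

Calling `plan_uploads` on its own first gives a dry run of the blob names before anything is written to the container.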

## Form Recognizer Model for Classification
- Create the model
- Analyze documents to create the OCR files
- Train the classifier

In [8]:
## CREATE BUILD LAYOUT

import json

MY_BLOB_STORE_CONTAINER_SAS_URL = r'https://trxdocaixblob.blob.core.windows.net/docai?sp=racwdlmeop&st=2023-10-24T18:11:31Z&se=2024-10-25T02:11:31Z&spr=https&sv=2022-11-02&sr=c&sig=kVn2BJR5Yoeesm3MfQYUvUS%2FMZLAbe7fpJbELa8qHlw%3D'

with open(MY_INPUT_DATA_FILE, 'r', encoding='utf-8') as file:
    input_data = json.load(file)

# Extract just the first file as a test
for item in input_data[:1]:
    email_file_name = item['FileName']

    URL_PATH=f'{MY_BLOB_STORE_URL}/{MY_BLOB_STORE_PATH}/output/emails/{email_file_name}'
    print(f"File: {URL_PATH}")
    fr_api_version, model_id, is_handwritten, result = fr.extractResultFromOnlineDocument(
                                                            client=formRecognizerClient,
                                                            model='prebuilt-layout',
                                                            url=URL_PATH
                                                        )

    print(f'Document Intelligence API version = {fr_api_version}\n \
            Document Extraction Model Id = {model_id}\n \
            Does document have any hand written text? {is_handwritten}\n'
         )

File: https://trxdocaixblob.blob.core.windows.net/docai/sample-auto-insurance-emails/output/emails/9eabeba6-1b78-4076-a8e0-19a81abd3ebd.pdf
Document Intelligence API version = 2023-07-31
             Document Extraction Model Id = prebuilt-layout
             Does document have any hand written text? False
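
The helper `fr.extractResultFromOnlineDocument` is not shown in this notebook. A plausible sketch of it, assuming it wraps `DocumentAnalysisClient.begin_analyze_document_from_url` and derives the handwriting flag from the returned styles, could look like this; the name `extract_layout` is illustrative only.

```python
def extract_layout(client, model_id: str, url: str):
    """Analyze a document at `url` and return (api_version, model_id, is_handwritten, result)."""
    # begin_analyze_document_from_url starts a long-running operation;
    # poller.result() blocks until the analysis has finished.
    poller = client.begin_analyze_document_from_url(model_id, document_url=url)
    result = poller.result()
    # A document counts as handwritten if any detected style is flagged as such.
    is_handwritten = any(style.is_handwritten for style in (result.styles or []))
    return result.api_version, result.model_id, is_handwritten, result
```

The URL must be readable by the service, so a private blob typically needs a SAS token appended to it.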



In [4]:
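# Create the categoryFileMap, needed by Form Recognizer for training.
# NOTE: the generation cell above writes each list to output/<category>.jsonl,
# while this cell looks for it under output/emails/; make sure the path here
# matches where the .jsonl files were actually uploaded.
categoryFileMap = {}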
for category in categories:
    categoryFile = f'{MY_BLOB_STORE_PATH}/output/emails/{category}.jsonl'
    print(f'Category List Filename: {categoryFile}')
    categoryFileMap[category] = categoryFile
print(f'Blob Url: {MY_BLOB_STORE_URL}')    
result = fr.trainClassifier(
                            admin_client=formRecognizerAdminClient,
                            blob_url=MY_BLOB_STORE_URL,
                            class_file_list=categoryFileMap
                           )
classifierId = result.classifier_id
print(f"Classifier ID: {classifierId}")
print(f"API version used to build the classifier model: {result.api_version}")
print(f"Classifier description: {result.description}")
print(f"Document Classifier expires on: {result.expires_on}")
print(f"Document classes used for training the model:")
for doc_type, details in result.doc_types.items():
    print(f"Document type: {doc_type}")
    print(f"Container source: {details.source.container_url}\n")

Category List Filename: sample-auto-insurance-emails/output/emails/PolicyCancellation.jsonl
Category List Filename: sample-auto-insurance-emails/output/emails/IncisoCancellation.jsonl
Category List Filename: sample-auto-insurance-emails/output/emails/PersonChange.jsonl
Category List Filename: sample-auto-insurance-emails/output/emails/VINNumberChange.jsonl
Category List Filename: sample-auto-insurance-emails/output/emails/CoverageChange.jsonl
Category List Filename: sample-auto-insurance-emails/output/emails/SubsequenteRegister.jsonl
Category List Filename: sample-auto-insurance-emails/output/emails/PaymentMethodChange.jsonl
Category List Filename: sample-auto-insurance-emails/output/emails/UseChange.jsonl
Category List Filename: sample-auto-insurance-emails/output/emails/DiscountChange.jsonl
Category List Filename: sample-auto-insurance-emails/output/emails/VehicleChange.jsonl
Category List Filename: sample-auto-insurance-emails/output/emails/BillingChange.jsonl
Category List Filename

HttpResponseError: (InvalidArgument) Invalid argument.
Code: InvalidArgument
Message: Invalid argument.
Exception Details:	(InvalidContentSourceFormat) Invalid content source: Could not read build content.
	Code: InvalidContentSourceFormat
	Message: Invalid content source: Could not read build content.
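
The helper `fr.trainClassifier` is not shown here. As a sketch of what it might do with version 3.3.0 of the SDK, the code below builds one `BlobFileListSource` per category and starts a classifier build. The function names are hypothetical. The sketch also assumes that `container_url` is a URL the service can read, such as a container SAS URL; whether a missing SAS token or a mismatched file list path is related to the error above has not been confirmed.

```python
def build_category_file_map(categories, list_folder: str) -> dict:
    """Map each category name to the blob path of its .jsonl file list."""
    folder = list_folder.rstrip("/")
    return {category: f"{folder}/{category}.jsonl" for category in categories}


def train_classifier(admin_client, container_url: str, category_file_map: dict,
                     description=None):
    """Build a document classifier from per-category file lists in one container."""
    # Imported lazily so build_category_file_map works without the SDK installed.
    from azure.ai.formrecognizer import BlobFileListSource, ClassifierDocumentTypeDetails

    doc_types = {
        category: ClassifierDocumentTypeDetails(
            source=BlobFileListSource(container_url=container_url, file_list=file_list)
        )
        for category, file_list in category_file_map.items()
    }
    poller = admin_client.begin_build_document_classifier(
        doc_types=doc_types, description=description
    )
    return poller.result()  # Blocks until the build finishes or fails.
```

The `file_list` paths are relative to the container root, so they need to match the location of the uploaded `.jsonl` files exactly.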

In [None]:
print(f"Result: {result}")

In [None]:
fr.deleteClassifier(admin_client=formRecognizerAdminClient, classifier_id=classifierId)