Assignment: 2

# Custom Form Recognizer model

* [x] Provide exported project file 

* [x] Provide code to call your model with sample data  

* [?] Optional: Provide code to create custom model 

* [x] Build and train a custom classifier - Document Intelligence (formerly Form Recognizer) - Azure AI services | Microsoft Learn: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/build-a-custom-classifier?view=doc-intel-3.1.0
    * [x] Data Source: 5 arXiv papers

    * [x] Extract: Name of the paper, author list, abstract, number of pages 

    * [x] Optional: classify content page vs reference page 

## Task1: Custom Extraction Model - trainings

* Training #1:

    project name: CustomPaperExtraction
    
    Form Recognizer on the free tier (F0) processes only the first two pages of each document during training, so the model cannot learn the total page count, which appears on the last page.
    This is a no-go.

* Training #2:

    project name: CustomPaperExtraction_S0
    
    Form Recognizer on the standard tier (S0) processes the entire document during training, so the model can learn the total page count.
    
    Project token:

    "eyJpZCI6Ii9zdWJzY3JpcHRpb25zLzdjYjA5ODk0LWNjNTgtNDczNS1iNjQ0LTVkY2FlYWVjNTI4OC9yZXNvdXJjZUdyb3Vwcy9BenVyZVN0b3JhZ2VBY2NvdW50L3Byb3ZpZGVycy9NaWNyb3NvZnQuU3RvcmFnZS9zdG9yYWdlQWNjb3VudHMveW9uZ2Fpc3RvcmFnZWFjY291bnQiLCJjb250YWluZXIiOiJwYXBlcmV4dHJhY3Rpb24iLCJwYXRoIjoiY29uZmlnLWY3NzczNTYwLWYxZGUtNGNlZS04N2RlLTE1MTRhYjlmNjhiNi05NzY0NzgxMy0zNjYzLTQyNjctYTg1OC01OGFlOWM4ZWNhZjMuanNvbiIsInR5cGUiOjB9"
    
* Azure AI | Document Intelligence Studio: https://formrecognizer.appliedai.azure.com/studio


## Task2: Code to call trained Custom Extraction model

* Custom extraction model project name: CustomPaperExtraction_S0
    
    Form Recognizer on the standard tier (S0) processes the entire document, so the trained model can extract the total page count.
    
    pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/...'


#### Testing and evaluation (Training #2 - retrained with more data)
* Model ID: 11092023
* Data Source: 5 arXiv papers in the AI area
* Extract Title, Authors, Abstract and Total pages
<br>

| Documents | Title | Authors | Abstract | Total pages (extracted) | Total pages (calculated) | Accuracy | Extraction<br>note |
| --------- | ----- | ------- | -------- | ----------------------- | ------------------------ | -------- | -------------- |
| 2311.01043.pdf | Yes and 100% accurate | Yes and 100% accurate | Yes and 100% accurate | None | Yes and 100% accurate. 13 pages. | 100% | This doc does not have page numbers. |
| 2311.01193.pdf | Yes and 100% accurate | Yes and 100% accurate | Yes and 100% accurate | Yes and 100% accurate. 36 pages. | If calculated, the total would be 38 pages. | 100% | This doc has page numbers and the last page is numbered 36, hence 36. The first two pages (abstract page and TOC page) of the original doc do not have page numbers. |
| 2311.01258.pdf | Yes and 100% accurate | 8 out of 9 authors were extracted | Yes and 100% accurate | Yes and 100% accurate | N/A | 97.22% | Authors in this doc are mixed with universities. This is a tough case that might need special treatment. |
| 2311.01460.pdf | Yes and 100% accurate | Yes and 100% accurate | Yes and 100% accurate | Yes and 100% accurate | N/A | 100% | This doc has page numbers. |
| 2310.18168.pdf | Yes and 100% accurate | Yes and 100% accurate | Yes and 100% accurate | Yes and 100% accurate | N/A | 100% | This doc has page numbers. |
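
The 97.22% figure for 2311.01258.pdf is consistent with averaging the four per-field scores, with the authors field scored as 8/9. A minimal sketch of that arithmetic follows; the function name `document_accuracy` and the equal weighting of the four fields are assumptions, since the notebook does not state how the Accuracy column was computed.

```python
def document_accuracy(field_scores):
    """Average per-field scores (each in [0, 1]) into one document-level accuracy."""
    if not field_scores:
        raise ValueError("at least one field score is required")
    return sum(field_scores.values()) / len(field_scores)


# Scores for 2311.01258.pdf as reported in the table: 8 of 9 authors extracted.
scores_2311_01258 = {
    "title": 1.0,
    "authors": 8 / 9,
    "abstract": 1.0,
    "total_pages": 1.0,
}

accuracy_pct = round(document_accuracy(scores_2311_01258) * 100, 2)
print(f"{accuracy_pct}%")
```

Under this assumption, a document with every field correct scores 100%, which matches the other four rows.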

In [17]:
from dotenv import load_dotenv
load_dotenv()

import os
import PyPDF2

Access different trained models


In [19]:
"""
This code sample shows Custom Extraction Model operations with the Azure Form Recognizer client library. 
The async versions of the samples require Python 3.6 or later.

To learn more, please visit the documentation - Quickstart: Form Recognizer Python client library SDKs
https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/quickstarts/try-v3-python-sdk
"""

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

"""
Remember to remove the key from your code when you're done, and never post it publicly. For production, use
secure methods to store and access your credentials. For more information, see 
https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-security?tabs=command-line%2Ccsharp#environment-variables-and-application-configuration
"""
# endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
# key = "YOUR_FORM_RECOGNIZER_KEY"
# Get key and region from system env
# model_id = "YOUR_CUSTOM_BUILT_MODEL_ID"

# tier = 'F0'
tier_f0 = False
if tier_f0:
    # Trained using Form Recognizer at tier F0 - only reads the first two pages of each uploaded doc
    # Pros: free
    # Cons: cannot extract the total page count
    endpoint = os.getenv("FORM_RECOGNIZER_ENDPOINT")
    key = os.getenv('FORM_RECOGNIZER_KEY')
    model_id = os.getenv('CUSTOM_BUILT_MODEL_ID_F0')
    print("Tier level: F0")
else:
    # Trained using Document Intelligence (Form Recognizer) at tier S0 - reads the entire doc
    # Pros: can extract the total page count
    # Cons: paid tier
    endpoint = os.getenv("DI_FORM_RECOGNIZER_ENDPOINT")
    key = os.getenv('DI_FORM_RECOGNIZER_KEY')
    # model_id = os.getenv('CUSTOM_BUILT_MODEL_ID_S0')
    model_id = os.getenv('CUSTOM_BUILT_MODEL_ID_CEM')
    print("Tier level: S0")

# Access testing data
# formUrl = "YOUR_DOCUMENT"
# Path to testing data - PDF files

# Test case 1: 13 pages
# pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/2311.01043.pdf'

# Test case 2: 39 pages
# pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/2311.01193.pdf'

# Test case 3: 175 pages
# pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/2311.01258.pdf'

# Test case 4: 18 pages
# pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/2311.01460.pdf'

# Test case 5: 15 pages
pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/2310.18168.pdf'

def get_pdf_total_pages():
    # Open the PDF file in binary read mode
    with open(pdf_file_path, 'rb') as pdf_file:
        # Create a PDF object to read the file
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Check the total number of pages in the PDF
        total_pages = len(pdf_reader.pages)

    # Print the total number of pages
    print(f'~~~~~~Total number of pages in the PDF: {total_pages}')

get_pdf_total_pages()

Tier level: S0
~~~~~~Total number of pages in the PDF: 15


#### Extract paper title, authors, abstract, and number of pages

In [14]:
document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

# Make sure your document's type is included in the list of document types the custom model can analyze
# poller = document_analysis_client.begin_analyze_document_from_url(model_id, formUrl)
with open(pdf_file_path, 'rb') as pdf_file:
    poller = document_analysis_client.begin_analyze_document(model_id, pdf_file)
result = poller.result()

for idx, document in enumerate(result.documents):
    print("--------Analyzing document #{}--------".format(idx + 1))
    print("Document has type {}".format(document.doc_type))
    print("Document has confidence {}".format(document.confidence))
    print("Document was analyzed by model with ID {}".format(result.model_id))
    for name, field in document.fields.items():
        print("......found field name: ", name)
        field_value = field.value if field.value else field.content
        print("......found field of type '{}' with value '{}' and with confidence {}".format(field.value_type, field_value, field.confidence))
        # Fall back to the PDF library only when the model could not find the total page number.
        if name == 'total_page' and field_value is None:
            print("~~~~~~total_page's value is None returned from Custom Extraction Model! Let's use PDF lib to find out document's total page.")
            get_pdf_total_pages()

# # iterate over tables, lines, and selection marks on each page
# for page in result.pages:
#     print("\nLines found on page {}".format(page.page_number))
#     for line in page.lines:
#         print("...Line '{}'".format(line.content.encode('utf-8')))
#     for word in page.words:
#         print(
#             "...Word '{}' has a confidence of {}".format(
#                 word.content.encode('utf-8'), word.confidence
#             )
#         )
#     for selection_mark in page.selection_marks:
#         print(
#             "...Selection mark is '{}' and has a confidence of {}".format(
#                 selection_mark.state, selection_mark.confidence
#             )
#         )

# for i, table in enumerate(result.tables):
#     print("\nTable {} can be found on page:".format(i + 1))
#     for region in table.bounding_regions:
#         print("...{}".format(region.page_number))
#     for cell in table.cells:
#         print(
#             "...Cell[{}][{}] has content '{}'".format(
#                 cell.row_index, cell.column_index, cell.content.encode('utf-8')
#             )
#         )
print("-----------------------------------")


--------Analyzing document #1--------
Document has type 11092023
Document has confidence 0.987
Document was analyzed by model with ID 11092023
......found field name:  paper_title
......found field of type 'string' with value 'PERSONAS AS A WAY TO MODEL TRUTHFULNESS IN LANGUAGE MODELS' and with confidence 0.601
......found field name:  authors
......found field of type 'string' with value 'Nitish Joshi1* Javier Rando2* Abulhair Saparov1 Najoung Kim3 He He1' and with confidence 0.624
......found field name:  abstract
......found field of type 'string' with value 'ABSTRACT Large Language Models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradicting data? Expanding on the view that LLMs can model different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that a
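
The `authors` field comes back as one string with affiliation markers attached, and `total_page` can be `None` when the paper has no printed page numbers. The sketch below shows one possible post-processing step for both cases. The helper names `split_authors` and `resolve_total_pages` are hypothetical, and the splitting rule assumes every author name is followed by a digit or symbol marker, as in the output above; it would not handle author lists that are interleaved with university names.

```python
import re


def split_authors(raw_authors):
    """Split an extracted author string on affiliation markers (digits, *, †, ‡)."""
    parts = re.split(r"[\d*†‡]+[\s,]*", raw_authors)
    return [part.strip(" ,") for part in parts if part.strip(" ,")]


def resolve_total_pages(extracted_value, counted_pages):
    """Prefer the model's total_page value; fall back to the counted PDF pages."""
    if extracted_value is None:
        return counted_pages
    match = re.search(r"\d+", str(extracted_value))
    return int(match.group()) if match else counted_pages


raw = "Nitish Joshi1* Javier Rando2* Abulhair Saparov1 Najoung Kim3 He He1"
print(split_authors(raw))
print(resolve_total_pages(None, 13))  # model returned no value, use the counted pages
print(resolve_total_pages("36", 38))  # printed page number is kept over the count
```

Keeping the printed page number when it exists mirrors the choice made for 2311.01193.pdf in the evaluation table, where the extracted value (36) differs from the physical page count (38).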

## Task3: Build and train a Custom Classification model

* [Optional]: classify content page vs reference page 

* Training #1: *This training has issues when the reference pages are not at the end of the document, so the training data set needs to be broken into more granular units.*

    project name: CustomPaperClassification_S0
    
    Training data: PDF files need to be preprocessed at page level. Documents were split into documents containing either content pages or reference pages.
    
    Project token:

    "eyJpZCI6Ii9zdWJzY3JpcHRpb25zLzdjYjA5ODk0LWNjNTgtNDczNS1iNjQ0LTVkY2FlYWVjNTI4OC9yZXNvdXJjZUdyb3Vwcy9BenVyZVN0b3JhZ2VBY2NvdW50L3Byb3ZpZGVycy9NaWNyb3NvZnQuU3RvcmFnZS9zdG9yYWdlQWNjb3VudHMveW9uZ2Fpc3RvcmFnZWFjY291bnQiLCJjb250YWluZXIiOiJwYXBlcmNsYXNzaWZpY2F0aW9uIiwicGF0aCI6ImNvbmZpZy1mNzc3MzU2MC1mMWRlLTRjZWUtODdkZS0xNTE0YWI5ZjY4YjYtN2FjZTRmM2UtYzY5ZS00YWFmLWEzYTktZDIwMmRlYmUyMjg2Lmpzb24iLCJ0eXBlIjoxfQ=="
    
* Training #2: This training provides very good classification results!

    project name: CustomDocClassification
    
    Training data: PDF files need to be preprocessed at page level. Documents were split into single-page documents, each containing one content page or one reference page.
    
    Project token:

    "eyJpZCI6Ii9zdWJzY3JpcHRpb25zLzdjYjA5ODk0LWNjNTgtNDczNS1iNjQ0LTVkY2FlYWVjNTI4OC9yZXNvdXJjZUdyb3Vwcy9BenVyZVN0b3JhZ2VBY2NvdW50L3Byb3ZpZGVycy9NaWNyb3NvZnQuU3RvcmFnZS9zdG9yYWdlQWNjb3VudHMveW9uZ2Fpc3RvcmFnZWFjY291bnQiLCJjb250YWluZXIiOiJkb2NjbGFzc2lmaWNhdGlvbiIsInBhdGgiOiJjb25maWctZjc3NzM1NjAtZjFkZS00Y2VlLTg3ZGUtMTUxNGFiOWY2OGI2LWVjNWM5ZTYyLWFiYjUtNGYyOC04M2YwLTRiM2U1MzljYzBmOS5qc29uIiwidHlwZSI6MX0="
    
* Azure AI | Document Intelligence Studio: https://formrecognizer.appliedai.azure.com/studio
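
The page-level preprocessing used for Training #2 could be scripted roughly as follows. This is a sketch, not the procedure actually used for the project: the helper names, the output folder layout (one folder per label), and the file naming are all assumptions. It uses `PyPDF2`, which the notebook already imports, and expects the reference page numbers to be supplied by hand.

```python
from pathlib import Path


def label_for_page(page_number, reference_pages):
    """Return the class label for a 1-based page number."""
    return "reference_page" if page_number in reference_pages else "content_page"


def split_pdf_by_page(pdf_path, reference_pages, output_dir):
    """Write each page of pdf_path to output_dir/<label>/<stem>_p<NNN>.pdf."""
    import PyPDF2  # imported lazily so the labeling helper works without PyPDF2

    pdf_path, output_dir = Path(pdf_path), Path(output_dir)
    reader = PyPDF2.PdfReader(str(pdf_path))
    written = []
    for page_number, page in enumerate(reader.pages, start=1):
        label = label_for_page(page_number, reference_pages)
        target = output_dir / label / f"{pdf_path.stem}_p{page_number:03d}.pdf"
        target.parent.mkdir(parents=True, exist_ok=True)
        writer = PyPDF2.PdfWriter()  # one writer per page gives single-page files
        writer.add_page(page)
        with open(target, "wb") as out_file:
            writer.write(out_file)
        written.append(target)
    return written
```

For example, a 13-page paper whose references occupy pages 8 to 13 would be split with `split_pdf_by_page(path, set(range(8, 14)), "./training_pages")`, where `./training_pages` is a hypothetical output folder.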

## Task4: Code to call trained Custom Classification model

* project name: CustomDocClassification

* file path : './data_source/testing_data/arxiv.org/ai_paper/...'

#### Testing and evaluation (training #2)
* Data Source: 5 arXiv papers in the AI area

* Classify content page vs reference page

| Documents | Total pages | Content pages | Reference pages | Classified content pages | Classified reference pages | Accuracy | Misclassified<br>note |
| --------- | ----------- | ------------- | --------------- | ------------------------ | -------------------------- | -------- | ----------------- |
| 2311.01043.pdf | 13 | 1 to 7 | 8, 9, 10, 11, 12, 13 | 1 to 7 | 8, 9, 10, 11, 12, 13 | 100% | 0 |
| 2311.01193.pdf | 38 | 1 to 29 | 30, 31, 32, 33, 34, 35, 36, 37, 38 | 1 to 31 | 32, 33, 34, 35, 36, 37, 38 | (38 - 2) / 38 = 94.74% | Two pages (30 and 31) are reference pages but were classified as content pages. Retraining with more data might resolve this. The training data ratio might also be a factor: the current ratio of content pages to reference pages is 79 : 16 and might need to be balanced. |
| 2311.01258.pdf | 175 | 1 to 154 | 155 to 175 | 1 to 154 | 155 to 175 | 100% | 0 |
| 2311.01455.pdf | 39 | 1 to 10, 17 to 39 | 11, 12, 13, 14, 15, 16 | 1 to 10, 17 to 39 | 11, 12, 13, 14, 15, 16 | 100% | 0 |
| 2311.01460.pdf | 18 | 1 to 9, 14 to 18 | 10, 11, 12, 13 | 1 to 9, 14 to 18 | 10, 11, 12, 13 | 100% | 0 |
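
The Accuracy column is the share of pages that received the correct label. The following sketch reproduces the 2311.01193.pdf row; the function name `page_accuracy` is hypothetical. With only two classes, a page is misclassified exactly when it appears in one of the two reference-page sets but not the other.

```python
def page_accuracy(total_pages, expected_reference, predicted_reference):
    """Return (accuracy, misclassified pages) for a two-class page labeling."""
    expected = set(expected_reference)
    predicted = set(predicted_reference)
    # Symmetric difference: pages labeled reference in one set but not the other.
    misclassified = sorted(expected ^ predicted)
    return (total_pages - len(misclassified)) / total_pages, misclassified


# 2311.01193.pdf: pages 30 to 38 are references, the model flagged 32 to 38.
accuracy, wrong_pages = page_accuracy(38, range(30, 39), range(32, 39))
print(f"{accuracy:.2%}", wrong_pages)
```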



In [6]:
# Custom Classification Model ID
# ccm_model_id=os.getenv('CUSTOM_BUILT_MODEL_ID_CCM')
ccm_model_id=os.getenv('CUSTOM_BUILT_MODEL_ID_CCM_PAGE')

# Access testing data
# formUrl = "YOUR_DOCUMENT"
# Path to testing data - PDF files

# Test case 1: 13 pages
# pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/2311.01043.pdf'

# Test case 2: 39 pages
# pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/2311.01193.pdf'

# Test case 3: 175 pages
# pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/2311.01258.pdf'

# Test case 4: 39 pages
pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/2311.01455.pdf'

# Test case 5:
# pdf_file_path = './data_source/testing_data/arxiv.org/ai_paper/2311.01460.pdf'

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

# Make sure your document's type is included in the list of document types the custom model can analyze
# poller = document_analysis_client.begin_analyze_document_from_url(model_id, formUrl)
with open(pdf_file_path, 'rb') as pdf_file:
    poller = document_analysis_client.begin_classify_document(ccm_model_id, pdf_file)
result = poller.result()

for idx, document in enumerate(result.documents):
    pages = ""
    print("--------Analyzing document #{}--------".format(idx + 1))
    print("Document has type {}".format(document.doc_type))
    pages += document.doc_type + ' pages: '
    if (document.doc_type == 'reference_page'):
        pages = ''
        for page in document.bounding_regions:
            # print("......found field name: ", page.page_number)
            pages += document.doc_type + ' page: ' + str(page.page_number) + ',\n'
            # pages += str(page.page_number) + ', '        
        pages += "Please note: above 'reference_page' pages were encapsulated in a sub JSON structure within a single document returned by Custom Classification model,\n"
        pages += "the document was given the confidence value as following:"
    else:
        for page in document.bounding_regions:
            pages += str(page.page_number) + ', '
    print(pages)
    # if (document.doc_type == 'reference_page'):
    #     print("Custom Classification Model returns a JSON structure which encapsulates 'reference_page' in a single document and the document was given the confidence value as following:")
    print("Document has confidence {}".format(document.confidence))
    print("Document was analyzed by model with ID {}".format(result.model_id))
   
print("-----------------------------------")


--------Analyzing document #1--------
Document has type content_page
content_page pages: 1, 
Document has confidence 0.964
Document was analyzed by model with ID CCM110623_1
--------Analyzing document #2--------
Document has type content_page
content_page pages: 2, 
Document has confidence 0.982
Document was analyzed by model with ID CCM110623_1
--------Analyzing document #3--------
Document has type content_page
content_page pages: 3, 
Document has confidence 0.974
Document was analyzed by model with ID CCM110623_1
--------Analyzing document #4--------
Document has type content_page
content_page pages: 4, 
Document has confidence 0.978
Document was analyzed by model with ID CCM110623_1
--------Analyzing document #5--------
Document has type content_page
content_page pages: 5, 
Document has confidence 0.996
Document was analyzed by model with ID CCM110623_1
--------Analyzing document #6--------
Document has type content_page
content_page pages: 6, 
Document has confidence 0.994
Documen
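
Because the classifier returns one entry per page (or per group of pages), the raw output is long. A small helper can condense it into the "1 to 10, 17 to 39" form used in the evaluation table. This is a sketch with hypothetical helper names; in the cell above, the input pairs could be built as `(document.doc_type, region.page_number)` for each `region` in `document.bounding_regions`.

```python
from collections import defaultdict


def collapse_ranges(page_numbers):
    """Turn page numbers into a compact string such as '1 to 10, 17 to 39'."""
    ranges = []
    start = prev = None
    for page in sorted(set(page_numbers)):
        if start is None:
            start = prev = page
        elif page == prev + 1:
            prev = page  # still inside the current consecutive run
        else:
            ranges.append((start, prev))
            start = prev = page
    if start is not None:
        ranges.append((start, prev))
    return ", ".join(str(a) if a == b else f"{a} to {b}" for a, b in ranges)


def summarize(classified_pages):
    """Group (doc_type, page_number) pairs by type and collapse each group."""
    by_type = defaultdict(list)
    for doc_type, page_number in classified_pages:
        by_type[doc_type].append(page_number)
    return {doc_type: collapse_ranges(pages) for doc_type, pages in by_type.items()}


# Page labels for 2311.01455.pdf as listed in the evaluation table.
pairs = [("content_page", p) for p in list(range(1, 11)) + list(range(17, 40))]
pairs += [("reference_page", p) for p in range(11, 17)]
print(summarize(pairs))
```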

## Task5: [Optional]: Provide code to create custom model 

* TODO: accessing the Azure container fails; need to figure out whether additional setup is required.

In [15]:
from azure.ai.formrecognizer import FormTrainingClient
from azure.core.credentials import AzureKeyCredential

# Replace with your Form Recognizer service endpoint and API key
# endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
# credential = AzureKeyCredential("YOUR_API_KEY")
credential=AzureKeyCredential(key)

# Create a FormTrainingClient
form_training_client = FormTrainingClient(endpoint, credential)

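# Define the blob container where your training documents are stored
# training_container = "your-training-container"
# TODO: need to figure out container access. Note that the URL below is a blob-level,
# read-only SAS (sr=b, sp=r) for a single PDF rather than a SAS URL for the whole container.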
training_container ="https://yongaitrainingdata.blob.core.windows.net/customextractionmodel/2310.17811.pdf?sp=r&st=2023-11-08T01:57:45Z&se=2023-11-08T09:57:45Z&sv=2022-11-02&sr=b&sig=v6HKu0mb9cBQnyjAODh6lPOZVKoZ6S6%2B6BlMujlpByU%3D"

# Define a name for your custom model
model_name = "custom-model-created-from-code_1"

# # Train the custom model
# poller = form_training_client.begin_training(training_container, model_name)

# # # Wait for training to complete
# model = poller.result()

# # # Get the model ID
# model_id = model.model_id

# print(f"Custom model ID: {model_id}")
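
One likely cause of the container access issue is the SAS URL itself: the one in the cell above is scoped to a single blob (`sr=b`) with read-only permission (`sp=r`), while model training generally expects a SAS URL for the whole container with at least read and list permissions. The sketch below checks a URL for those two conditions and then builds a model through the v3 client. It assumes `azure-ai-formrecognizer` 3.2 or later, where `DocumentModelAdministrationClient.begin_build_document_model` takes a `blob_container_url` keyword; `check_training_sas` and `build_custom_model` are hypothetical helper names, and the choice of template build mode is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def check_training_sas(sas_url):
    """Return a list of likely problems with a SAS URL intended for model training."""
    query = parse_qs(urlsplit(sas_url).query)
    resource = query.get("sr", [""])[0]
    permissions = query.get("sp", [""])[0]
    problems = []
    if resource != "c":
        problems.append(f"sr={resource!r}: expected a container-level SAS (sr=c)")
    for flag, name in (("r", "read"), ("l", "list")):
        if flag not in permissions:
            problems.append(f"sp={permissions!r}: missing {name} permission")
    return problems


def build_custom_model(endpoint, key, container_sas_url, model_id):
    """Build a template-mode custom model from labeled data in a blob container."""
    # Imported lazily so the SAS check above runs without the Azure SDK installed.
    from azure.ai.formrecognizer import DocumentModelAdministrationClient, ModelBuildMode
    from azure.core.credentials import AzureKeyCredential

    client = DocumentModelAdministrationClient(endpoint, AzureKeyCredential(key))
    poller = client.begin_build_document_model(
        ModelBuildMode.TEMPLATE,
        blob_container_url=container_sas_url,
        model_id=model_id,
    )
    return poller.result().model_id


# Hypothetical URL with the same sr/sp values as the SAS URL used in the cell above.
example_url = "https://example.blob.core.windows.net/container/file.pdf?sp=r&sr=b&sig=REDACTED"
for problem in check_training_sas(example_url):
    print(problem)
```

The container must also hold the label files produced during labeling, not only the PDFs, for a build from code to succeed.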
