# Extract data from a sample form filled in with handwritten text

## Prerequisites
1. To run the code, install the Azure Form Recognizer SDK for Python. This notebook pins version 3.3.0: `pip install azure-ai-formrecognizer==3.3.0`.


! pip install azure-ai-formrecognizer==3.3.0

## Load all the API keys, parameters and login credentials

In [3]:
import fr

# Your Azure Document Intelligence Service Instance
MY_FORM_RECOGNIZER_ENDPOINT = 'https://tr-docai-form-recognizer.cognitiveservices.azure.com/'
# The model id should match the custom model you have
# trained and deployed in your Azure Document Intelligence Service Instance
# with the endpoint MY_FORM_RECOGNIZER_ENDPOINT
MY_CLAIMS_MODEL_ID = 'claims-v3'

formRecognizerCredential = fr.getFormRecognizerCredential()

formRecognizerClient = fr.getDocumentAnalysisClient(
                            endpoint=MY_FORM_RECOGNIZER_ENDPOINT,
                            credential=formRecognizerCredential
                        )


Got Azure Form Recognizer API Key from environment variable
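
The `fr` module is a local helper whose source is not shown in this notebook. The message above suggests it reads the API key from an environment variable. A minimal sketch of that pattern follows, under stated assumptions: the function name `read_api_key` and the variable name `FORM_RECOGNIZER_API_KEY` are hypothetical placeholders, and the real helper may work differently.

```python
import os


def read_api_key(env_var: str = "FORM_RECOGNIZER_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise with a clear message.

    The default variable name is a placeholder (assumption); the real `fr`
    helper may read a different variable.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty")
    return key


# With the SDK installed, the key would typically be wrapped like this:
#   from azure.core.credentials import AzureKeyCredential
#   from azure.ai.formrecognizer import DocumentAnalysisClient
#   client = DocumentAnalysisClient(
#       endpoint=MY_FORM_RECOGNIZER_ENDPOINT,
#       credential=AzureKeyCredential(read_api_key()),
#   )
```

Keeping the key in the environment rather than in the notebook means the notebook can be shared without leaking credentials.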


## Document Extraction Examples

### Auto Insurance Claims form filled in by hand

- Custom Trained model
- Display label, data and confidence (document level and individual field level)
- Text, Checkbox, radio button

#### Display labeled data

In [6]:
# Assumes you are running the notebook from the notebook folder (Windows-style relative paths)
#MY_TEST_DOCUMENT = r'..\..\..\data\sample-claims-docs\testing\IC-handwritten-RobertFrost.pdf'
MY_TEST_DOCUMENT = r'..\..\..\data\sample-claims-docs\testing\IC-handwritten-WilliamWordsworth.pdf'

fr_api_version, model_id, is_handwritten, result = fr.extractResultFromLocalDocument(
                                                        client=formRecognizerClient,
                                                        model=MY_CLAIMS_MODEL_ID,
                                                        filepath=MY_TEST_DOCUMENT
                                                    )

print(f'Document Intelligence API version = {fr_api_version}\n \
        Document Extraction Model Id = {model_id}\n \
        Does document have any hand written text? {is_handwritten}\n'
     )
doc_count = len(result.documents)
print(f'Document count = {doc_count}')

for idx, document in enumerate(result.documents):
    print(f'Document {idx} ---------------')
    print(f'\tDocument extraction confidence = {document.confidence}')
    for name, field in document.fields.items():
        # Fall back to the raw content only when no typed value was parsed
        field_value = field.value if field.value is not None else field.content
        print("\t{}[type:{};conf:{}] = '{}'".format(name, field.value_type, field.confidence, field_value))


Document Intelligence API version = 2023-07-31
         Document Extraction Model Id = claims-v3
         Does document have any hand written text? True

Document count = 1
Document 0 ---------------
	Document extraction confidence = 0.987
	FormType[type:string;conf:0.92] = 'Auto Insurance Claim Document'
	Name[type:string;conf:0.969] = 'William Wordsworth'
	Address[type:string;conf:0.914] = '39 Washington Street, New York City, NY 10003'
	Phone[type:string;conf:0.957] = '+1 123 465 1637'
	Email[type:string;conf:0.98] = 'dummy3@3.com'
	PolicyNumber[type:string;conf:0.965] = 'TRI 813654329'
	IncidentDate[type:string;conf:0.964] = '5/31/2023'
	IncidentTime[type:string;conf:0.939] = '11 pm BST'
	IncidentLocation[type:string;conf:0.92] = '2 Daffodil Street, New York City, NY 1002'
	IncidentDescription[type:string;conf:0.938] = 'Another Car changed lane and hit my car on the driver side.'
	VehicleOwner[type:string;conf:0.981] = 'NA'
	VehicleMakeAndModel[type:string;conf:0.964] = '2011 Kia S
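
The per-field confidence values printed above can drive a simple review step. The sketch below flags fields whose confidence falls under a cutoff. The values are copied from the output above; the function name `fields_needing_review` and the 0.93 threshold are illustrative assumptions, not part of the original notebook or the SDK.

```python
from typing import Dict, List, Tuple

# (value, confidence) pairs copied from the extraction output above.
extracted: Dict[str, Tuple[str, float]] = {
    "FormType": ("Auto Insurance Claim Document", 0.92),
    "Name": ("William Wordsworth", 0.969),
    "Address": ("39 Washington Street, New York City, NY 10003", 0.914),
    "IncidentTime": ("11 pm BST", 0.939),
    "IncidentLocation": ("2 Daffodil Street, New York City, NY 1002", 0.92),
}


def fields_needing_review(
    fields: Dict[str, Tuple[str, float]], threshold: float = 0.93
) -> List[str]:
    """Return the names of fields whose confidence is below `threshold`.

    The 0.93 default is an arbitrary cutoff chosen for illustration.
    """
    return sorted(name for name, (_, conf) in fields.items() if conf < threshold)


print(fields_needing_review(extracted))
# ['Address', 'FormType', 'IncidentLocation']
```

Note that `IncidentTime` (0.939) is not flagged here. Confidence describes how sure the model is about what it read, not whether the handwritten value itself is correct, so a threshold check complements the post-processing step rather than replacing it.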

## Post-processing after extraction to fix errors

<b><u>Example</u></b>  
The <b>IncidentTime</b> extracted from the document is '11 pm <font color=red>BST</font>'.  
The <b>IncidentLocation</b> extracted from the document is '2 Daffodil Street, New York City, NY 1002'.  
The time zone is wrong: the incident took place in New York City, so it should be <b>EDT</b> instead of <b>BST</b>.  
Let's fix it with GPT-4 on Azure OpenAI (AOAI).  


#### Load the AOAI keys and parameters

In [8]:
import aoai

MY_AOAI_ENDPOINT = 'https://tr-non-prod-gpt4.openai.azure.com/'
MY_AOAI_VERSION = '2023-07-01-preview'
MY_GPT_ENGINE = 'tr-gpt4'
MY_AOAI_EMBEDDING_ENGINE = 'tr-embedding-ada'

status = aoai.setupOpenai(aoai_endpoint=MY_AOAI_ENDPOINT, 
                 aoai_version=MY_AOAI_VERSION)
if status > 0:
    print("AOAI setup succeeded")
else:
    print("AOAI setup failed")


Got OPENAI API Key from environment variable
AOAI setup succeeded


#### Ask GPT-4 to fix the error

In [15]:
my_location = '2 Daffodil Street, New York City, NY 1002'
my_time = '11 pm BST'
my_date = '5/31/2023'

my_task = f'Replace the timezone only in {my_time} with the timezone of the location in {my_location}, \
            given the date {my_date}'
my_prompt = [
              {
                "role": "user", 
                "content": my_task
                }
            ]      
tokens_used, finish_reason, aoai_answer = aoai.getChatCompletion(the_engine=MY_GPT_ENGINE, 
                                                                           the_messages=my_prompt)
print(f"Tokens: {tokens_used}")
print(f"Finish Reason: {finish_reason}")
print(f"Answer: {aoai_answer}")

Tokens: 54
Finish Reason: stop
Answer: 11 pm EDT
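
A model answer like this can be cross-checked deterministically. The sketch below uses Python's standard `zoneinfo` module (Python 3.9+, with time zone data available on the system) to look up the abbreviation in effect on the incident date. The `CITY_TO_ZONE` lookup table and both function names are hypothetical; a real pipeline would need geocoding or a fuller table to map an address to an IANA zone.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumption: a hand-written lookup from a place name to an IANA zone name.
CITY_TO_ZONE = {"New York City": "America/New_York"}


def expected_tz_abbreviation(location: str, date_text: str) -> str:
    """Return the zone abbreviation in effect at `location` on `date_text` (M/D/YYYY)."""
    for city, zone_name in CITY_TO_ZONE.items():
        if city in location:
            day = datetime.strptime(date_text, "%m/%d/%Y")
            # Noon avoids the ambiguous hours around a daylight-saving switch.
            return day.replace(hour=12, tzinfo=ZoneInfo(zone_name)).tzname()
    raise ValueError(f"No known time zone for location: {location!r}")


def replace_tz(time_text: str, new_tz: str) -> str:
    """Swap the trailing time zone token, leaving the clock time untouched."""
    clock, _, _old_tz = time_text.rpartition(" ")
    return f"{clock} {new_tz}"


tz = expected_tz_abbreviation("2 Daffodil Street, New York City, NY 1002", "5/31/2023")
print(replace_tz("11 pm BST", tz))
# 11 pm EDT
```

Comparing this result with the GPT-4 answer gives a cheap sanity check before the corrected value is written back.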


## Reading all the raw OCR data from the extraction

#### View the extracted raw data pages, tables...

In [7]:
for page in result.pages:
    for line_idx, line in enumerate(page.lines):
        # .encode() converts the text to bytes, so each line prints as b'...'
        print(
            "...Line # {} has text content '{}'".format(
                line_idx,
                line.content.encode("utf-8")
            )
        )

    # Checkboxes and radio buttons are reported as selection marks
    for selection_mark in page.selection_marks:
        print(
            "...Selection mark is '{}' and has a confidence of {}".format(
                selection_mark.state,
                selection_mark.confidence
            )
        )

for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )

    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content.encode("utf-8"),
            )
        )

print("----------------------------------------")

...Line # 0 has text content 'b'TR INSURED''
...Line # 1 has text content 'b'A Test P&C INSURANCE Company''
...Line # 2 has text content 'b'Auto Insurance Claim Document''
...Line # 3 has text content 'b'Customer Information''
...Line # 4 has text content 'b'Name William Wordsworth''
...Line # 5 has text content 'b'Address 39 Washington Street, New York City, NY 10003''
...Line # 6 has text content 'b'Phone Number +1 123 465 1637''
...Line # 7 has text content 'b'Email dummy3@3.com''
...Line # 8 has text content 'b'Policy Number TRI 813654329''
...Line # 9 has text content 'b'Incident Information''
...Line # 10 has text content 'b'Date of Incident 5/31/2023''
...Line # 11 has text content 'b'Time of Incident 11 pm BST''
...Line # 12 has text content 'b'Location of Incident 2 Daffodil Street, New York City, NY 1002''
...Line # 13 has text content 'b'Description of Incident Another Car changed lane and hit''
...Line # 14 has text content 'b'my car on the driver side.''
...Line # 15 has t