# Key-Value Pair to Entity Conversion

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description
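This tool takes Form Parser JSON files (the parsed output of a processor) from a GCS bucket as input, converts each key/value pair whose key matches a configured synonym into an entity, and stores the result in a GCS bucket as JSON files. The form fields and tables are removed from the pages of the output documents.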

## Prerequisites

1. Vertex AI Notebook
2. Form Parser JSON files in a GCS folder.
3. A GCS output folder to upload the updated JSON files to.

In [1]:
# configparser ships with the Python 3 standard library and needs no install
%pip install google-cloud-documentai
%pip install google-cloud-storage
%pip install tqdm

Note: you may need to restart the kernel to use updated packages.


## Step-by-Step Procedure

### 1. Config File Creation

In [2]:
import configparser

config = configparser.ConfigParser()
config.optionxform = str  # Keep entity names case-sensitive (the default lowercases them)
config_path = "config.ini"  # Enter the path of the config file

# Add the section and placeholder entries to the file we will create
config.add_section("Entities_synonyms")
config.set("Entities_synonyms", "entity1", "key_synonym1, key_synonym2, key_synonym3")
config.set("Entities_synonyms", "entity2", "key_synonym1, key_synonym2, key_synonym3")

# Write the new structure to the new file
with open(config_path, "w") as configfile:
    config.write(configfile)

### 2. Input Details

#### a. Once the config.ini file has been created by the step above, enter the inputs in the config file:

entity1 =  key_synonym1, key_synonym2, key_synonym3

<img src="./Images/key_value_entity_input_1.png" width=800 height=400 alt="Key value pair entity conversion input image">

Replace entity1 with the entity name, and replace the key_synonym placeholders with the synonyms for that entity, separated by commas (,). To add more entities, put each one with its synonyms on its own line.

<div style="background-color:#ADD8E6; border:1px solid black; padding:5px">
<i><b>Example</b></i> : <br> 
Address = AddressName, AddressName1, AddressLine<br>
InvoiceNumber = Invoice,InvoiceNo<br>
PaymentDate = SNC, SNCs, SNC1<br>
</div>
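
Before running the conversion, the entries can be checked locally. The following is a minimal sketch, assuming the example entries above sit under the `Entities_synonyms` section; `load_synonyms` and `EXAMPLE_CONFIG` are hypothetical names used only for illustration and are not part of the tool. It sets `optionxform = str`, as the sample code does, because `configparser` lowercases option names by default.

```python
import configparser

# Example entries from this document, placed under the section the tool reads.
EXAMPLE_CONFIG = """
[Entities_synonyms]
Address = AddressName, AddressName1, AddressLine
InvoiceNumber = Invoice,InvoiceNo
PaymentDate = SNC, SNCs, SNC1
"""


def load_synonyms(config_text: str) -> dict:
    """Hypothetical helper: map each entity name to its lowercased synonyms."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep entity names case-sensitive
    parser.read_string(config_text)
    return {
        entity: [s.strip().lower() for s in synonyms.split(",")]
        for entity, synonyms in parser.items("Entities_synonyms")
    }


synonyms = load_synonyms(EXAMPLE_CONFIG)
for entity, names in synonyms.items():
    print(entity, "->", names)
```

Printing the mapping this way makes it easy to spot a missing comma or an entity name that was lowercased unintentionally.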

In [5]:
#!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

#### b. Copy the code provided in this document and enter the path of the config file.
<img src="./Images/key_value_entity_input_2.png" width=800 height=400 alt="Key value pair entity conversion input image">

#### c. Update the Form Parser input path and the GCS output path for the output JSON files.
<img src="./Images/key_value_entity_input_3.png" width=800 height=400 alt="Key value pair entity conversion input image">
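
The sample code derives the bucket name and the folder prefix from each `gs://` path by splitting on `/`. The sketch below shows the same idea with basic validation added. `split_gcs_path` is a hypothetical helper, and the bucket and folder names are placeholders rather than real locations.

```python
def split_gcs_path(gcs_path: str) -> tuple:
    """Hypothetical helper: split 'gs://bucket/prefix' into (bucket, prefix)."""
    scheme = "gs://"
    if not gcs_path.startswith(scheme):
        raise ValueError(f"Expected a gs:// path, got: {gcs_path!r}")
    # Everything up to the first '/' is the bucket; the rest is the prefix.
    bucket, _, prefix = gcs_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError("Bucket name is missing")
    return bucket, prefix.strip("/")


# Placeholder bucket and folder names, for illustration only.
bucket, prefix = split_gcs_path("gs://my-bucket/form_parser/output/")
print(bucket, prefix)
```

Stripping the slashes from the prefix avoids a doubled `/` when the prefix is later joined with a file name.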

### 3. Output

The converted JSON files are written to the GCS path set in the script's `OUTPUT_PATH` variable.
<img src="./Images/key_value_entity_output_1.png" width=800 height=400 alt="Key value pair entity conversion output image">
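
One quick way to check a converted file is to count the entities per type. This is a minimal sketch that uses only the standard `json` module on a file that has already been downloaded or read into a string. `count_entity_types` is a hypothetical helper, and the sample document is a hand-written stand-in. It assumes the top-level `entities` list and the `type` key that Document AI JSON exports normally use.

```python
import json
from collections import Counter


def count_entity_types(document_json: str) -> Counter:
    """Hypothetical helper: count entities per type in a Document JSON string."""
    document = json.loads(document_json)
    return Counter(entity.get("type", "") for entity in document.get("entities", []))


# Hand-written stand-in for a converted file; real output has many more fields.
sample = json.dumps(
    {
        "entities": [
            {"type": "InvoiceNumber", "mentionText": "INV-001"},
            {"type": "Address", "mentionText": "1 Main St"},
            {"type": "Address", "mentionText": "2 Side St"},
        ]
    }
)
print(count_entity_types(sample))
```

An entity type that is missing from the counts usually means that none of its synonyms matched a key in that document.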

<table style="border: 1px solid black;padding:0px; margin:0px">
    <tr style="border: 1px solid black;padding:0px; margin:0px">
    <td style="text-align:center;border: 1px solid black;padding:0px; margin:0px"><h2>Before</h2></td>
    <td  style="text-align:center;border: 1px solid black;padding:0px; margin:0px"><h2>After</h2></td>
    </tr>
    <tr>
    <td style="border: 1px solid black;padding:0px; margin:0px"><img src="./Images/key_value_pair_output_comparison_1.png" width=600 height=800 alt="Key value pair entity comparison output image"></td>
    <td style="border: 1px solid black;padding:0px; margin:0px"><img src="./Images/key_value_pair_output_comparison_2.png" width=600 height=800 alt="Key value pair entity comparison output image"></td>
    </tr>
</table>

### 4. Sample Code

In [None]:
# Importing necessary modules
import configparser
import re
from typing import Any, Dict, Optional

from google.cloud import documentai_v1beta3 as documentai
from tqdm.notebook import tqdm
from utilities import documentai_json_proto_downloader, file_names, store_document_as_json

INPUT_PATH = "gs://xxxx/xxxxxxxx/xxxxxxxxxx"  # GCS path of the Form Parser output
OUTPUT_PATH = "gs://xxxxxxxxx/xxxxxxxxx/xxxxx"  # GCS output path for this script
config_path = "config.ini"  # Path of the config file created in step 1

# Read the entity/synonym mapping; optionxform = str keeps entity names case-sensitive
config = configparser.ConfigParser()
config.optionxform = str
config.read(config_path)



def entity_synonyms(old_entity: str) -> str:
    """
    Look up the user-defined entity name whose synonym list contains the given key.

    Args:
        old_entity: The cleaned key name from the input document.

    Returns:
        str: The matching entity name from the config file, or an empty string
        if the key matches no synonym.
    """
    entities_synonyms = config.items("Entities_synonyms")
    for entity_name, synonyms in entities_synonyms:
        synonym_list = [i.lower().strip() for i in synonyms.split(",")]

        if old_entity.lower() in synonym_list:
            return entity_name

    # No synonym matched: return an empty string so the caller skips this field.
    return ""


def entity_data(
    formField_data: documentai.Document.Page.FormField, page_number: int
) -> Optional[Dict[str, Any]]:
    """
    Build an entity dictionary from a form field, after cleaning the key name.

    Args:
        formField_data: The form field holding the key and value information.
        page_number: The zero-based index of the page containing the form field.

    Returns:
        Optional[Dict[str, Any]]: The entity fields built from the key and value,
        or None if the key matches no configured synonym.
    """
    # Cleaning the entity name: drop punctuation and spaces
    key_name = re.sub(r"[^\w\s]", "", formField_data.field_name.text_anchor.content).replace(" ", "").strip()
    # Checking for entity synonyms
    key_name = entity_synonyms(key_name)

    if key_name:
        entity_dict = {
            "confidence": formField_data.field_value.confidence,
            "mention_text": formField_data.field_value.text_anchor.content,
            "page_anchor": {
                "page_refs": [
                    {"bounding_poly": formField_data.field_value.bounding_poly, "page": page_number}
                ]
            },
            "text_anchor": formField_data.field_value.text_anchor,
            "type": key_name,
        }

        return entity_dict
    else:
        return None
          

def convert_kv_entities(document: documentai.Document) -> documentai.Document:
    """
    Convert Form Parser key/value pairs to entities.

    Args:
        document: The original document object read from GCS.

    Returns:
        documentai.Document: The converted document object.
    """

    # Initializing the entities list (any existing entities are discarded)
    document.entities = []

    for page_number, page_data in enumerate(document.pages):
        for formField_data in page_data.form_fields:
            # Build the entity and append it to the entities list
            entity_obj = entity_data(formField_data, page_number)
            if entity_obj:
                document.entities.append(entity_obj)

    # Removing the Form Parser data
    for i in range(len(document.pages)):
        if document.pages[i].form_fields:
            del document.pages[i].form_fields

        if document.pages[i].tables:
            del document.pages[i].tables

    return document
     
def main() -> None:
    """
    Convert every JSON file under INPUT_PATH and store the result under OUTPUT_PATH.
    """
    # Splitting the GCS paths into bucket names and the output folder prefix
    input_bucket_name = INPUT_PATH.split("/")[2]
    output_bucket_name = OUTPUT_PATH.split("/")[2]
    output_prefix_path = "/".join(OUTPUT_PATH.split("/")[3:]).strip("/")

    # Fetching all the JSON files under the input path
    file_name_list = [i for i in file_names(INPUT_PATH)[1].values() if i.endswith(".json")]

    for file_name in tqdm(file_name_list, desc="Status : "):
        try:
            # Converting key/value pairs to entities
            document = documentai_json_proto_downloader(input_bucket_name, file_name)
            converted_document = convert_kv_entities(document)

            # Storing the document; skip the prefix if OUTPUT_PATH is a bare bucket
            base_name = file_name.split("/")[-1]
            output_file_name = f"{output_prefix_path}/{base_name}" if output_prefix_path else base_name
            store_document_as_json(
                documentai.Document.to_json(converted_document), output_bucket_name, output_file_name
            )
            print(f"[✓] {output_bucket_name}/{output_file_name}")
        except Exception as e:
            print(f"[x] {input_bucket_name}/{file_name} || Error : {str(e)}")
    print("\nOperation completed")


# Calling the main function
if __name__ == "__main__":
    main()
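
The key cleaning and synonym lookup can be tried out without any Document AI objects. The sketch below mirrors the cleaning expression used in `entity_data` and the lookup in `entity_synonyms`, but reads from a hard-coded table instead of config.ini. `SYNONYMS`, `normalize_key`, and `lookup_entity` are hypothetical names for illustration; the table entries come from the example config earlier in this document.

```python
import re

# Assumed synonym table, taken from the example config in this document.
SYNONYMS = {
    "Address": ["addressname", "addressname1", "addressline"],
    "InvoiceNumber": ["invoice", "invoiceno"],
}


def normalize_key(raw_key: str) -> str:
    """Drop punctuation and spaces, then trim surrounding whitespace."""
    return re.sub(r"[^\w\s]", "", raw_key).replace(" ", "").strip()


def lookup_entity(raw_key: str) -> str:
    """Return the entity whose synonyms contain the cleaned key, else ''."""
    key = normalize_key(raw_key).lower()
    for entity, names in SYNONYMS.items():
        if key in names:
            return entity
    return ""


print(lookup_entity("Invoice No.:"))    # punctuation and spaces are ignored
print(lookup_entity("Address Line\n"))  # trailing newline is trimmed
print(repr(lookup_entity("Total")))     # no synonym matches
```

Because spaces are removed from the key before matching, synonyms in config.ini should be written without internal spaces (for example `InvoiceNo`, not `Invoice No`), otherwise they can never match.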
