# Reverse Annotation Tool

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Objective

The Reverse Annotation Tool annotates (labels) entities in a document by matching known values against the OCR text tokens. The notebook expects an input file that lists the entity values in tabular form; the first row is the header and names the entities to be labeled in every document. The script calls the processor to parse each input document, then annotates the parsed document wherever the input entity values appear in the OCR text. The result is an output JSON file with updated entities, exported to a storage bucket path. The resulting JSON files can be imported into a processor to verify that the annotations match the input file supplied to the script.
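The core idea, locating a known value in the OCR text and recording its character offsets, can be sketched with the standard library alone. The sample text, the entity name, and the `find_mentions` helper are hypothetical and only illustrate the matching step; the notebook builds `documentai.Document.Entity` objects from such matches rather than plain dictionaries.

```python
import re


def find_mentions(ocr_text, entity_type, value):
    """Return one dict per occurrence of `value` in `ocr_text` (hypothetical helper)."""
    mentions = []
    # re.escape keeps characters such as '.' or '/' in the value literal.
    for m in re.finditer(re.escape(value), ocr_text):
        mentions.append(
            {
                "type": entity_type,
                "mention_text": m.group(0),
                "start_index": m.start(),
                "end_index": m.end(),
            }
        )
    return mentions


# Made-up OCR text in which the same value appears twice.
ocr_text = "Invoice No: INV-001\nTotal: 120.50\nRef: INV-001"
mentions = find_mentions(ocr_text, "invoice_id", "INV-001")
for mention in mentions:
    print(mention)
```

Each occurrence yields its own entry, which mirrors the note that the tool tags every occurrence of a value.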

## Prerequisite

* Vertex AI Notebook
* Input CSV file containing the list of files to be labeled.
* Document AI Processor
* GCS bucket for processing the input documents and writing the output.


## Step by Step procedure

### 1. Importing required modules

In [None]:
!pip install configparser
!pip install fuzzywuzzy

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
import configparser
import numpy as np
import pandas as pd
import csv,json
import utilities

from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from tqdm import tqdm
from fuzzywuzzy import fuzz,process
from typing import List, Dict, Tuple, Union, Any

### 2. Input details

* **Config file creation** : The code snippet below creates the configuration file for the script.


In [None]:
### Config file creation
config = configparser.ConfigParser()
# Add the structure to the file
config.add_section('Parameters')
config.set('Parameters', 'project_id', '')
config.set('Parameters', 'processor_id', "")
config.set('Parameters', 'processor_version', '')
config.set('Parameters', 'input_bucket', 'gs://')
config.set('Parameters', 'output_bucket', 'gs://')
config.set('Parameters', 'location', 'us')
# Write the new structure to the new file
with open(r"configfile.ini", 'w') as configfile:
    config.write(configfile)

In [None]:
### Config file reading
Path= "configfile.ini" #Enter the path of config file
config = configparser.ConfigParser()
config.read(Path)

project_id = config.get('Parameters','project_id')
processor_id = config.get('Parameters','processor_id')
processor_version= config.get('Parameters','processor_version')
input_bucket = config.get('Parameters','input_bucket')
output_bucket = config.get('Parameters','output_bucket')
location = config.get('Parameters','location')

Executing the above script creates the ‘configfile.ini’ file in the same directory, with the fields shown below left for the user to fill in.
Enter the appropriate values into configfile.ini before executing the script.

<img src="./Images/config_file.png" width=800 height=400></img>
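Before running the rest of the notebook, it can help to confirm that no parameter still holds its template default. The following is a minimal sketch using only `configparser`; the `missing_parameters` helper and the sample values are hypothetical, and whether an empty `processor_version` is acceptable depends on the rest of the script, so treat the result as a reminder rather than a hard check.

```python
import configparser

REQUIRED_KEYS = [
    "project_id",
    "processor_id",
    "processor_version",
    "input_bucket",
    "output_bucket",
    "location",
]


def missing_parameters(config):
    """List keys in [Parameters] that are absent or still hold a template default."""
    missing = []
    for key in REQUIRED_KEYS:
        value = config.get("Parameters", key, fallback="").strip()
        # An empty string or a bare 'gs://' is the untouched template value.
        if value in ("", "gs://"):
            missing.append(key)
    return missing


# Hypothetical values, for illustration only.
sample = """
[Parameters]
project_id = my-project
processor_id = abc123
processor_version =
input_bucket = gs://my-bucket/input/
output_bucket = gs://
location = us
"""
config = configparser.ConfigParser()
config.read_string(sample)
unfilled = missing_parameters(config)
print(unfilled)
```

For a real run, replace `config.read_string(sample)` with `config.read("configfile.ini")`.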

* **GCS Bucket** : Copy the list of input document files into the bucket path.
* **inputData.csv file** : A schema file containing tabular data. The header row names the entities to be identified and annotated in the document, and each following row gives the values to extract for one file. The image below shows the structure of the input file.

<img src="./Images/version_4_input.png" width=800 height=400></img>

**Note** :
1. Ensure the name of this file is ‘inputData.csv’.
2. Specify the correct filenames, including the extension (.pdf), in the inputData.csv file.
3. The values must match the text present in the document. Dates may be written in a different format, but must represent the same date.
4. The second row holds the optional ‘Type’ (data type) of each entity.
5. If a value occurs multiple times in a document, the tool tags each occurrence.
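As a rough illustration of the layout described in these notes, the sketch below builds a small CSV in memory and reads it back while skipping the optional ‘Type’ row. Only the `FileNames` column name and the `Type` row label come from the notebook code; the entity names, type labels, and values are hypothetical.

```python
import csv
import io

# Hypothetical entity names and values; adjust them to your own schema.
rows = [
    ["FileNames", "invoice_id", "invoice_date", "total_amount"],
    ["Type", "text", "datetime", "money"],  # optional data-type row
    ["invoice_1.pdf", "INV-001", "2023-01-15", "120.50"],
    ["invoice_2.pdf", "INV-002", "2023-02-03", "89.00"],
]

# Serialize the rows as CSV text.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back and drop the 'Type' row, leaving one record per document.
reader = csv.DictReader(io.StringIO(csv_text))
documents = [row for row in reader if row["FileNames"] != "Type"]
```

Writing `csv_text` to a file named `inputData.csv` next to the notebook would give the script an input in this shape.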


### 3. Run the Code

Create the **configfile.ini** and **inputData.csv** files in the same directory as the script. Ensure the project details in the config file are filled in, and confirm that the values in inputData.csv match the text present in each document. Copy the documents to be processed into the GCS bucket.
The complete script can be found in the last section of the document.

The tool also supports labeling multiple line items. For the annotation ground truth, create a CSV file whose line-item columns use the `line_item/` header prefix, and place it in the same folder under the name **"inputData.csv"**.
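One way to picture the multi-line-item input is a file with several rows per document, where document-level fields are left blank after the first row. The sketch below uses only the standard library to group such rows by `FileNames` and forward-fill the blanks; the notebook itself does this with pandas (`groupby` followed by `ffill().bfill()`). The column names after the `line_item/` prefix and all values are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample: two line-item rows for one document.
csv_text = """FileNames,invoice_id,line_item/description,line_item/amount
invoice_1.pdf,INV-001,Widget,10.00
invoice_1.pdf,,Gadget,25.50
"""

reader = csv.DictReader(io.StringIO(csv_text))
headers = [h.strip() for h in reader.fieldnames]

# Child entity names are the part of the header after 'line_item/'.
line_item_fields = [h.split("/")[-1] for h in headers if h.startswith("line_item/")]

grouped = defaultdict(list)
for row in reader:
    rows_so_far = grouped[row["FileNames"]]
    previous = rows_so_far[-1] if rows_so_far else {}
    # Forward fill: a blank cell inherits the value from the previous row of the same file.
    filled = {k: (v if v != "" else previous.get(k, "")) for k, v in row.items()}
    rows_so_far.append(filled)

print(line_item_fields)
print(dict(grouped))
```

This sketch only fills forward; the backward fill used in the notebook additionally covers blanks that appear before the first populated row.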


In [None]:
def read_input_schema(readSchemaFileName : str) -> pd.DataFrame:
    """
    Reads an input schema from a CSV file.

    Args:
    - readSchemaFileName (str): Path to the CSV file containing the schema.

    Returns:
    - pd.DataFrame: DataFrame containing the schema data.
    """
    
    df_schema = pd.read_csv(readSchemaFileName, dtype=str)
    df_schema = df_schema.drop(df_schema[df_schema['FileNames'] == 'Type'].index)
    df_schema.replace('', np.nan, inplace=True)
    return df_schema

readSchemaFileName = 'inputData.csv'
df_schema = read_input_schema(readSchemaFileName)

# Group by 'FileNames' and process each group
grouped = df_schema.groupby('FileNames', as_index=False)
processed_rows = []

for name, group in grouped:
    group = group.ffill().bfill()  # Forward and backward fill to handle NaNs
    combined_row = []

    # Flatten the group into a single list
    for row in group.itertuples(index=False):
        combined_row.extend(row)

    processed_rows.append(combined_row)    
# Column headers based on the original CSV structure
headers = [header.strip() for header in pd.read_csv(readSchemaFileName, nrows=0).columns.tolist()]
prefix = "line_item/"

# Extract the part after 'line_item/' for each item that starts with the prefix
unique_entities = [item.split('/')[-1] for item in headers if item.startswith(prefix)]

processed_files = set()  # Set to keep track of processed FileNames

Desc_merge_update='Yes'  # set to 'Yes' to combine descriptions within a line item, otherwise 'No'
line_item_across_pages='Yes'  # set to 'Yes' to group line items across pages, otherwise 'No'


def get_token_range(jsonData : object) -> Dict:
    """
    Gets the token ranges from the provided JSON data.

    Args:
    - jsonData (object): JSON data containing page and token information.

    Returns:
    - dict: Dictionary containing token ranges with page number and token number information.
    """
    
    tokenRange={}
    for i in range(0, len(jsonData.pages)):
        for j in range(0,len(jsonData.pages[i].tokens)):
            pageNumber = i
            tokenNumber = j
            try:
                startIndex = int(jsonData.pages[i].tokens[j].layout.text_anchor.text_segments[0].start_index)
            except Exception:
                startIndex = 0
            endIndex = int(jsonData.pages[i].tokens[j].layout.text_anchor.text_segments[0].end_index)
            tokenRange[range(startIndex, endIndex)] = {'page_number': pageNumber, 'token_number': tokenNumber}
    return tokenRange    

def fix_page_anchor_entity(i : object, jsonData : object, tokenRange : Dict) -> object:
    """
    Fixes the page anchor entity based on the provided JSON data and token range.

    Args:
    - i (object): Entity object to be fixed.
    - jsonData (object): JSON data containing page and token information.
    - tokenRange (Dict): Dictionary containing token ranges with page number and token number information.

    Returns:
    - object: Fixed entity object.
    """
    
    start = int(i.text_anchor.text_segments[0].start_index)
    end = int(i.text_anchor.text_segments[0].end_index) - 1

    for j in tokenRange:
        if start in j:
            lowerToken = tokenRange[j]
    for j in tokenRange:
        if end in j:
            upperToken = tokenRange[j]

    lowerTokenData = jsonData.pages[int(lowerToken['page_number'])].tokens[int(lowerToken['token_number'])].layout.bounding_poly.normalized_vertices
    upperTokenData = jsonData.pages[int(upperToken['page_number'])].tokens[int(upperToken['token_number'])].layout.bounding_poly.normalized_vertices
    # for A
    xA = float(lowerTokenData[0].x)
    yA = float(lowerTokenData[0].y)
    xA_ = float(upperTokenData[0].x)
    yA_ = float(upperTokenData[0].y)
    # for B
    xB = float(lowerTokenData[1].x)
    yB = float(lowerTokenData[1].y)
    xB_ = float(upperTokenData[1].x)
    yB_ = float(upperTokenData[1].y)
    # for C
    xC = float(lowerTokenData[2].x)
    yC = float(lowerTokenData[2].y)
    xC_ = float(upperTokenData[2].x)
    yC_ = float(upperTokenData[2].y)
    # for D
    xD = float(lowerTokenData[3].x)
    yD = float(lowerTokenData[3].y)
    xD_ = float(upperTokenData[3].x)
    yD_ = float(upperTokenData[3].y)

    A = {'x': min(xA, xA_),'y': min(yA, yA_)}
    B = {'x': max(xB, xB_),'y': min(yB, yB_)}
    C = {'x': max(xC, xC_),'y': max(yC, yC_)}
    D = {'x': min(xD, xD_),'y': max(yD, yD_)}
    i.page_anchor.page_refs[0].bounding_poly.normalized_vertices = [A, B, C, D]
    i.page_anchor.page_refs[0].page=lowerToken['page_number']
    return i

def create_entity(mention_text : str, type_ : str, m : Any) -> object:
    """
    Creates a Document Entity based on the provided mention text, type, and match object.

    Args:
    - mention_text (str): The text to be mentioned in the entity.
    - type_ (str): The type of the entity.
    - m (Union[re.Match, None]): Match object representing the start and end indices of the mention text.

    Returns:
    - documentai.Document.Entity: The created Document Entity.
    """
    
    entity = documentai.Document.Entity()
    entity.mention_text = mention_text
    entity.type_ = type_
    normalizedVertices = []
    pageRefs = []
    pageRefs.append({"bounding_poly":{"normalized_vertices":normalizedVertices}})
    entity.page_anchor = {"page_refs":pageRefs}
    entity.text_anchor = {"text_segments":[{'start_index':str(m.start()),'end_index':str(m.end())}]}
    return entity

def modify_as_parent_entity(entity : object, parent_name : str) -> object:
    """
    Modifies the provided entity as a parent entity with the given parent name.

    Args:
    - entity (documentai.Document.Entity): The entity to be modified.
    - parent_name (str): The name of the parent entity.

    Returns:
    - documentai.Document.Entity: The modified parent entity.
    """
    
    parent_entity = documentai.Document.Entity()
    parent_entity.mention_text = entity.mention_text
    parent_entity.type_ = parent_name
    parent_entity.properties = [entity]
    parent_entity.page_anchor = entity.page_anchor
    parent_entity.text_anchor = entity.text_anchor
    return parent_entity

def to_camel_case(snake_str : str) -> str:
    """
    Convert a snake_case string to CamelCase.

    Args:
    - snake_str (str): The snake_case string to be converted.

    Returns:
    - str: The CamelCase representation of the input string.
    """
    
    components = snake_str.split('_')
    return components[0] + ''.join(x.title() for x in components[1:])

def convert_keys_to_camel_case(obj : Any) -> Any:
    """
    Recursively convert keys of a dictionary or a list of dictionaries to CamelCase.

    Args:
    - obj (Union[dict, list]): The input dictionary or list of dictionaries.

    Returns:
    - Union[dict, list]: The converted object with CamelCase keys.
    """
    
    if isinstance(obj, dict):
        new_obj = {}
        for key, value in obj.items():
            new_key = to_camel_case(key)
            new_obj[new_key] = convert_keys_to_camel_case(value)
        return new_obj
    elif isinstance(obj, list):
        return [convert_keys_to_camel_case(item) for item in obj]
    else:
        return obj

def line_item_check(entities,unique_entities):
    """
    Checks line items within a list of entities based on unique entity types.

    Args:
        entities: A list of entities to be checked.
        unique_entities: A list of unique entity types to be considered for checking.

    Returns:
        An integer representing the type of line item found.
        1 for single line item, 2 for multiple, and 0 for none.
    """
    
    entity_types = [
        subentity.type_
        for entity in entities
        if hasattr(entity, "properties")
        for subentity in entity.properties
    ]

    entity_counts = {unique: entity_types.count(unique) for unique in unique_entities}
    multiple_entities_count = sum(count > 1 for count in entity_counts.values())

    if any(count == 1 for count in entity_counts.values()):
        return 1
    elif multiple_entities_count and (
        len(unique_entities) >= 3 or multiple_entities_count >= 1
    ):
        return 2
    else:
        return 0

def get_normalized_vertices(normalized_vertices : List[Dict]) -> List[Dict]:
    """
    Get normalized vertices and return the final vertices.

    Args:
    - normalized_vertices (List[Dict]): List of dictionaries containing x and y coordinates.

    Returns:
    - List[Dict]: Final vertices as a list of dictionaries.
    """
    
    min_x = min(normalized_vertices, key=lambda d: d.x).x
    max_x = max(normalized_vertices, key=lambda d: d.x).x
    min_y = min(normalized_vertices, key=lambda d: d.y).y
    max_y = max(normalized_vertices, key=lambda d: d.y).y
    # Order the corners clockwise from top-left (TL, TR, BR, BL), as in fix_page_anchor_entity.
    vertices_final = [
        {"x": min_x, "y": min_y},
        {"x": max_x, "y": min_y},
        {"x": max_x, "y": max_y},
        {"x": min_x, "y": max_y},
    ]
    
    return vertices_final
    
def single_line_item_merge(entities,page):
    """
    Merges single line item entities into a unified line item.

    Args:
        entities: A list of entities to be merged.
        page: The page number as a string where these entities are found.

    Returns:
        A dictionary representing the merged line item.
    """
    
    line_item_sub_entities = [
        subentity
        for entity in entities
        if entity.type_ == "line_item"
        for subentity in entity.properties
    ]

    text_anchors = [
        item.text_anchor.text_segments[0] for item in line_item_sub_entities
    ]
    normalized_vertices = [
        vertex
        for item in line_item_sub_entities
        for vertex in item.page_anchor.page_refs[0].bounding_poly.normalized_vertices
    ]
    
    vertices_final = get_normalized_vertices(normalized_vertices)

    line_item = {
        "mention_text": " ".join(item.mention_text for item in line_item_sub_entities),
        "page_anchor": {
            "page_refs": [
                {"bounding_poly": {"normalized_vertices": vertices_final}, "page": page}
            ]
        },
        "properties": line_item_sub_entities,
        "text_anchor": {
            "text_segments": sorted(text_anchors, key=lambda x: int(x.end_index))
        },
        "type": "line_item",
    }

    return line_item

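def get_lineitems_grouped_pages_across(json_dict : object, schema : Dict = {}) -> object:
    """
    Regroups line items that are split across a page break.

    Pairs the last line item on each page with the first line item on the next page,
    compares their child entity counts with the schema, and moves child entities
    between the two line items where the counts indicate a split.

    Args:
        json_dict (object): The document containing line-item entities.
        schema (Dict): Expected count of each child type per line item. When empty,
            it is inferred from the document.

    Returns:
        object: The document with the affected line items regrouped.
    """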
    def get_line_items_temp_schema(json_dict : object) -> Dict:   
        """
        Analyzes the structure of line items in a JSON document and creates a temporary schema
        based on the observed types and their frequencies.

        Args:
            json_dict (object): The JSON document.

        Returns:
            Dict[str, Dict[str, int]]: A dictionary representing the temporary schema with
            keys as line item identifiers and values as dictionaries of child types and their frequencies.
        """
        
        line_items = [
            entity 
            for entity in json_dict.entities 
            if entity.properties and entity.type_ == 'line_item'
        ]
        
        x=1
        line_types = {}
        for i in line_items:
            line_types['line' + '_' + str(x)]={}
            for child in i.properties:
                if child.type_ in line_types['line'+'_'+str(x)].keys():
                    line_types['line' + '_' + str(x)][child.type_]=line_types['line' + '_' + str(x)][child.type_]+1
                else:
                    line_types['line' + '_' + str(x)][child.type_]=1
            x+=1          
        from collections import Counter
        all_child=[]
        for key, val in line_types.items():
            all_child.append(val)
        counts = {k: Counter([d[k] for d in all_child if k in d]) for k in set().union(*all_child)}
        temp_schema = {k: max(counts[k], key=counts[k].get) for k in counts}
        return temp_schema

    if schema=={}:
        schema=get_line_items_temp_schema(json_dict)
    line_items_1=[]
    for entity in json_dict.entities:
        if entity.properties:
            line_items_1.append(entity)
    line_item_sorted_first={}
    line_item_sorted_last={}
    line_item_across_pages_first={}
    line_item_across_pages_last={}
    for li_1 in line_items_1:
        try:
            page=li_1.page_anchor.page_refs[0].page
        except:
            page=str(0)
        max_y=max(vertex.y for vertex in li_1.page_anchor.page_refs[0].bounding_poly.normalized_vertices)
        min_y=min(vertex.y for vertex in li_1.page_anchor.page_refs[0].bounding_poly.normalized_vertices)
        if page in line_item_sorted_last.keys():
            if line_item_sorted_last[page]>=max_y:
                pass
            else:
                line_item_sorted_last[page]=max_y
                line_item_across_pages_last[page]=li_1
        else:
            line_item_sorted_last[page]=max_y
            line_item_across_pages_last[page]=li_1
        if page in line_item_sorted_first.keys():
            if line_item_sorted_first[page]<=min_y:
                pass
            else:
                line_item_sorted_first[page]=min_y
                line_item_across_pages_first[page]=li_1
        else:
            line_item_sorted_first[page]=min_y
            line_item_across_pages_first[page]=li_1

    groups_across={}
    p=0
    for page_last,ent_last in line_item_across_pages_last.items():
        try:
            groups_across[p]=[line_item_across_pages_last[page_last],line_item_across_pages_first[str(int(page_last)+1)]]
            p+=1
        except KeyError:
            pass   
    #getting schema of each line item in groups
    schema_across={}
    for group,match in groups_across.items():
        for i in range(len(match)):
            for subitem in match[i]['properties']:
                if group in schema_across.keys():
                    if i in schema_across[group].keys():
                        if subitem['type'] in schema_across[group][i].keys():
                            schema_across[group][i][subitem['type']]+=1
                        else:
                            schema_across[group][i][subitem['type']]=1
                    else:
                        schema_across[group][i]={subitem['type']:1}
                else:
                    schema_across[group]={i:{subitem['type']:1}}
    group_entites_spread={}
    for selected_group,schema_ent in schema_across.items():
        missing_ent_0={}
        for key in schema.keys():
            if key not in schema_ent[0]:
                missing_ent_0[key]=schema[key]
            else:
                missing_ent_0[key]=schema[key]-schema_ent[0][key]
        for k1,v1 in missing_ent_0.items():
            if v1>0:
                if k1 in schema_ent[1]:
                    if schema_ent[1][k1]>schema[k1] or len(schema_ent[1])<(len(schema)/2):
                        if selected_group in group_entites_spread.keys():
                            group_entites_spread[selected_group][k1]=schema_ent[1][k1]-schema[k1]
                        else:
                            if len(schema_ent[1])<(len(schema)/2):
                                group_entites_spread[selected_group]={k1:schema_ent[1][k1]}
                            else:
                                group_entites_spread[selected_group]={k1:schema_ent[1][k1]-schema[k1]}
    def get_ent_schffle(group : str, index_1 : int) -> List[Dict]:
        """
        Shuffles entities within a specified group based on their minimum Y coordinates.

        Args:
            group (str): The name of the entity group.
            index_1 (int): The index for accessing entities within the group.

        Returns:
            List[Dict]: A list of shuffled entities within the specified group.
        """
        
        ent_min_y={}
        ent_sort_ent={}
        for sube1 in groups_across[group][index_1]['properties']:
            if sube1['type'] in group_entites_spread[group].keys():
                min_y_temp=min(vertex['y'] for vertex in sube1['page_anchor']['page_refs'][0]['bounding_poly']['normalized_vertices'])
                if sube1['type'] in ent_min_y.keys():
                    ent_min_y[sube1['type']].append(min_y_temp)
                    if sube1['type'] in ent_sort_ent.keys():
                        ent_sort_ent[sube1['type']][min_y_temp]=sube1
                    else:
                        ent_sort_ent[sube1['type']]={min_y_temp:sube1}
                else:
                    ent_min_y[sube1['type']]=[min_y_temp]
                    if sube1['type'] in ent_sort_ent.keys():
                        ent_sort_ent[sube1['type']][min_y_temp]=sube1
                    else:
                        ent_sort_ent[sube1['type']]={min_y_temp:sube1}
        sorted_ent_min_y={key: sorted(values) for key, values in ent_min_y.items()}
        ent_shuffle=[]
        for en1,val1 in sorted_ent_min_y.items():
            b=0
            for num in range(group_entites_spread[group][en1]):
                for miny in range(len(sorted_ent_min_y[en1])):
                    if num>=b:
                        ent_shuffle.append(ent_sort_ent[en1][sorted_ent_min_y[en1][miny]])
                        b+=1
        return ent_shuffle

    def ent_move(group : str, index_1 : int, index_0 : str) -> List[Dict]:
        """
        Move entities within a specified group from one index to another.

        Args:
            group (str): The name of the entity group.
            index_1 (int): The source index from which entities are moved.
            index_0 (str): The destination index to which entities are moved.

        Returns:
            List[Dict]: A modified list of entities within the specified group after the move.
        """
        
        import copy
        temp_group=copy.deepcopy(groups_across[group])
        for ent_sh in ent_shuffle:
            for sub_en3 in temp_group[index_1]['properties']:
                if ent_sh['type']==sub_en3['type'] and ent_sh['text_anchor']==sub_en3['text_anchor']:
                    temp_group[index_1]['properties'].remove(ent_sh)
            temp_group[index_0]['properties'].append(ent_sh)
            for t1 in ent_sh['page_anchor']['page_refs']:
                temp_group[index_0]['page_anchor']['page_refs'].append(t1)
            temp_group[index_0]['mention_text']=temp_group[index_0]['mention_text']+' '+ent_sh['mention_text']
            for t2 in ent_sh['text_anchor']['text_segments']:
                temp_group[index_0]['text_anchor']['text_segments'].append(t2)
        return temp_group

    def correct_page_text(temp_group_1 : List[Dict], index_1 : int) -> List[Dict]:
        """
        Correct the page text for a group of entities at a specified index.

        Args:
            temp_group_1 (List[Dict]): A list of entities within a group.
            index_1 (int): The index of the entities to be corrected.

        Returns:
            List[Dict]: A modified list of entities after correcting the page text.
        """
        
        temp_x=[]
        temp_y=[]
        temp_text_anc=[]
        temp_mention_text=''
        for suben in temp_group_1[index_1]['properties']:
            for tex_an1 in suben['text_anchor']['text_segments']:
                temp_text_anc.append(tex_an1)
            for page_an1 in suben['page_anchor']['page_refs'][0]['bounding_poly']['normalized_vertices']:
                temp_x.append(page_an1['x'])
                temp_y.append(page_an1['y'])
        # Corners ordered clockwise from top-left (TL, TR, BR, BL).
        updated_ver=[{'x':min(temp_x),'y':min(temp_y)},{'x':max(temp_x),'y':min(temp_y)},{'x':max(temp_x),'y':max(temp_y)},{'x':min(temp_x),'y':max(temp_y)}]
        sorted_temp_text_anc=sorted(temp_text_anc, key=lambda x: int(x['end_index']))
        temp_group_1[index_1]['text_anchor']['text_segments']=sorted_temp_text_anc
        temp_group_1[index_1]['page_anchor']['page_refs'][0]['bounding_poly']['normalized_vertices']=updated_ver
        for t5 in sorted_temp_text_anc:
            s1=t5['start_index']
            e1=t5['end_index']
            temp_mention_text=temp_mention_text+' '+json_dict['text'][int(s1):int(e1)]
        temp_group_1[index_1]['mention_text']=temp_mention_text
        return temp_group_1

    if len(group_entites_spread)>0:
        for group, entity_move in group_entites_spread.items():
            try:
                page_1=groups_across[group][0]['page_anchor']['page_refs'][0]['page']
            except:
                page_1='0'
            try:
                page_2=groups_across[group][1]['page_anchor']['page_refs'][0]['page']
            except:
                page_2='0'
            if page_1<page_2:
                ent_shuffle=get_ent_schffle(group,1)
                temp_group_1=ent_move(group,1,0)
                if len(temp_group_1[1]['properties'])>0:
                    temp_group_updated=correct_page_text(temp_group_1,1)
                else:
                    temp_group_updated=[temp_group_1[0]]
            elif page_1>page_2:
                ent_shuffle=get_ent_schffle(group,0)
                temp_group_1=ent_move(group,0,1)
                if len(temp_group_1[0]['properties'])>0:
                    temp_group_updated=correct_page_text(temp_group_1,0)
                else:
                    temp_group_updated=[temp_group_1[1]]
            for ent_remove in groups_across[group]:
                json_dict.entities.remove(ent_remove)
            for ent_add in temp_group_updated:
                json_dict.entities.append(ent_add)
    return json_dict

def get_page_wise_entities(json_dict : object) -> Dict:
    """
    Get entities grouped by page from the provided JSON dictionary.

    Args:
        json_dict (object): The input JSON dictionary containing entities.

    Returns:
        Dict[str, List[object]]: A dictionary where keys are page numbers and values are lists of entities on each page.
    """
    
    entities_page={}
    for entity in json_dict.entities:
        page='0'
        try:
            if entity.page_anchor.page_refs[0].page:
                page=entity.page_anchor.page_refs[0].page
            if page in entities_page.keys():
                entities_page[page].append(entity)
            else:
                entities_page[page]=[entity]
        except:
            pass
    return entities_page

def multi_page_entites(entities_pagewise : object, page : int) -> Tuple[List,str]:
    """
    Process multi-page entities.

    Args:
        entities_pagewise (object): Entities on a specific page.
        page (int): Page number.

    Returns:
        Tuple[List, str]: List of line items, considered boundary entity.
    """
    
    entity_types=[]
    line_item_sub_entities=[]
    for entity in entities_pagewise:
        if entity.properties:
            if entity.type_ == 'line_item':
                for subentity in entity.properties:
                    entity_types.append(subentity.type_)
                    line_item_sub_entities.append(subentity)
        else:
            entity_types.append(entity.type_)
    line_items_multi_dict={}
    for unique in unique_entities:
        if entity_types.count(unique)>1:
            line_items_multi_dict[unique]=entity_types.count(unique)
    from collections import Counter
    value_counts = Counter(line_items_multi_dict.values())
    max_count = max(value_counts.values())
    entity_types_keys = [key for key, value in line_items_multi_dict.items() if value_counts[value] == max_count]
    dict_unique_ent = {
        entity_type: [
            subentity 
            for entity in entities_pagewise 
            if entity.properties 
            for subentity in entity.properties 
            if subentity.type_ == entity_type
        ] 
        for entity_type in entity_types_keys
    }

    region_line_items={}
    region_line_items_x={}
    region_line_items_y={}
    opt_region={}
    for ent in dict_unique_ent:
        region=[]
        dict_x_y=[]
        count_product_code=0
        min_x_1=[]
        min_y_1=[]
        for item in dict_unique_ent[ent]:
            x_y={}
            x=[]
            y=[]
            x_min=''
            y_min=''
            count_product_code+=1
            for i in item.page_anchor.page_refs[0].bounding_poly.normalized_vertices:
                y.append(i.y)
                x.append(i.x)
            x_min=min(x)
            y_min=min(y)
            diff=max(y)-min(y)
            x_y[count_product_code]=[{'x':min(x),'y':min(y)},{'x':max(x),'y':max(y)}]
            dict_x_y.append(x_y)
            min_x_1.append(x_min)
            min_y_1.append(y_min)
        region_line_items_x[ent]=min_x_1
        region_line_items_y[ent]=min_y_1
        sorted_lst_x_y = sorted(dict_x_y, key=lambda x: list(x.values())[0][0]['y'])
        region_line_items[ent]=sorted_lst_x_y
        opt_region[ent]=diff
    sorted_region_line_y={key:sorted(values) for key, values in region_line_items_y.items()}
    sorted_region_line_x={key:sorted(values) for key, values in region_line_items_x.items()}
    regions_line_y_final={}
    opt_region_ent={}
    for region_line in sorted_region_line_y:
        line_no=0
        line_range={}
        for i in range(len(sorted_region_line_y[region_line])):
            line_no+=1
            try:
                pair=(sorted_region_line_y[region_line][i],sorted_region_line_y[region_line][i+1])
            except IndexError:
                pair=(sorted_region_line_y[region_line][i])
            line_range[line_no]=pair
        if region_line in regions_line_y_final.keys():
            regions_line_y_final[region_line].append(line_range)
        else:
            regions_line_y_final[region_line]=[line_range]
        opt_region_ent[region_line]=max(sorted_region_line_y[region_line])-min(sorted_region_line_y[region_line])
    max_value = max(opt_region_ent.values())  # Find the maximum value
    selected_values = [key for key, value in opt_region_ent.items() if abs(value - max_value) < 0.005]  # Select values satisfying the condition
    if len(selected_values)>1:
        considered_boundry_ent=''
        final_y=1
        for selected_ent in selected_values:
            if final_y>min(sorted_region_line_y[selected_ent]):
                final_y=min(sorted_region_line_y[selected_ent])
                considered_boundry_ent=selected_ent
            else:
                pass
    else:
        considered_boundry_ent=selected_values[0]
    import copy
    sub_entities_list=copy.deepcopy(line_item_sub_entities)
    line_item_dict_final={}
    sub_entities_categorized=[]
    count=0
    for subentity in sub_entities_list:
        y_ent=[]
        for ver in subentity.page_anchor.page_refs[0].bounding_poly.normalized_vertices:
            y_ent.append(ver.y)
        for line,region in regions_line_y_final[considered_boundry_ent][0].items():
            try:
                if (min(y_ent)>=region[0] or max(y_ent)>=region[0])  and (max(y_ent)<region[1]):
                    if line in line_item_dict_final.keys():
                        count=count+1
                        if subentity not in sub_entities_categorized:
                            line_item_dict_final[line].append(subentity)
                            sub_entities_categorized.append(subentity)
                    else:
                        count=count+1
                        if subentity not in sub_entities_categorized:
                            line_item_dict_final[line]=[subentity]
                            sub_entities_categorized.append(subentity)
            except TypeError:
                if min(y_ent)>=region:
                    if line in line_item_dict_final.keys():
                        count=count+1
                        if subentity not in sub_entities_categorized:
                            line_item_dict_final[line].append(subentity)
                            sub_entities_categorized.append(subentity)
                    else:
                        count=count+1
                        if subentity not in sub_entities_categorized:
                            line_item_dict_final[line]=[subentity]
                            sub_entities_categorized.append(subentity)

    for item in sub_entities_list:
        if item not in sub_entities_categorized:
            y_ent=[]
            for ver in item.page_anchor.page_refs[0].bounding_poly.normalized_vertices:
                y_ent.append(ver.y)
            diff_line={}
            for line1, y1 in regions_line_y_final[considered_boundry_ent][0].items():
                try:
                    diff=abs(min(y_ent)-y1[0])
                    diff_line[line1]=diff
                except TypeError:
                    diff=abs(min(y_ent)-y1)
                    diff_line[line1]=diff

            min_dist=min(diff_line.values())
            line_item_2=[key for key, value in diff_line.items() if value == min_dist]
            if line_item_2[0] in line_item_dict_final.keys():
                line_item_dict_final[line_item_2[0]].append(item)
                sub_entities_categorized.append(item)
            else:
                line_item_dict_final[line_item_2[0]]=[item]
                sub_entities_categorized.append(item)
    temp3 = []
    for element in line_item_sub_entities:
        if element not in sub_entities_categorized:
            temp3.append(element)
    if len(sub_entities_categorized)<len(line_item_sub_entities):
        left_out=len(line_item_sub_entities)-len(sub_entities_categorized)
        print('Out of {} sub-entities, {} are not yet classified'.format(len(line_item_sub_entities),left_out))
    elif len(sub_entities_categorized)==len(line_item_sub_entities):
        print('All line items are classified')
    else:
        print('Something went wrong while classifying sub-entities')

    def create_lineitem(line_item_sub_entities : List, page : str) -> Dict:
        """
        Create a line item from a list of sub-entities.

        Args:
            line_item_sub_entities (List[object]): The list of sub-entities to be included in the line item.
            page (str): The page number.

        Returns:
            Dict[str, Union[str, Dict[str, List[object]], List[object], Dict[str, List[object]]]]: The created line item.
        """
        line_item={'mention_text':'','page_anchor': {'page_refs': [{'bounding_poly': {'normalized_vertices':[]},'page':page }]},'properties':[],'text_anchor': {'text_segments':[]},'type': 'line_item'}
        text_anchors_sub_entities=[]
        for item in line_item_sub_entities:
            text_anchors_sub_entities.append(item.text_anchor.text_segments[0])
        line_item['text_anchor']['text_segments'] = text_anchors_sub_entities
        sorted_list_text1=sorted(text_anchors_sub_entities, key=lambda x: int(x.end_index))
        line_item_mention_text=''
        line_item_properties=[]
        line_item_text_segments=[]
        line_item_normalizedvertices=[]
        subentities_classified=[]
        for index in sorted_list_text1:
            for item in line_item_sub_entities:
                if index in item.text_anchor.text_segments:
                    if item not in subentities_classified:
                        subentities_classified.append(item)
                        line_item_mention_text=line_item_mention_text+' '+ item.mention_text
                        line_item_properties.append(item)
                        line_item_text_segments.append(index)
                        for i in item.page_anchor.page_refs[0].bounding_poly.normalized_vertices:
                            line_item_normalizedvertices.append(i)              
        line_item_normalizedvertices_final = get_normalized_vertices(line_item_normalizedvertices)
        line_item['page_anchor']['page_refs'][0]['bounding_poly']['normalized_vertices']=line_item_normalizedvertices_final
        line_item['mention_text']=line_item_mention_text
        line_item['properties']=line_item_properties
        line_item['text_anchor']['text_segments']=line_item_text_segments
        return line_item

    line_items_classified=[]
    for key, line_item_1 in line_item_dict_final.items():
        line_item=create_lineitem(line_item_1,page)
        line_items_classified.append(line_item)
    return line_items_classified,considered_boundry_ent

def merge_entities(json_dict : object) -> object:
    """
    Merge entities in a page-wise manner.

    Args:
        json_dict (object): The JSON object containing entities.

    Returns:
        object: The updated JSON object after merging entities.
    """
    
    entitites_page_wise=get_page_wise_entities(json_dict)
    line_entities_classified_pagewise=[]
    for page, entities in entitites_page_wise.items():
        line_entities_temp=''
        line_item_count=line_item_check(entities,unique_entities)
        if line_item_count==1:
            line_entities_temp=single_line_item_merge(entities,page)
            line_entities_classified_pagewise.append(line_entities_temp)
        elif line_item_count>1:
            line_entities_temp,considered_boundry_ent=multi_page_entites(entities,page)
            for ent1 in line_entities_temp:
                line_entities_classified_pagewise.append(ent1)
        elif line_item_count==0:
            print('no line items')
    final_entities=[]
    if len(line_entities_classified_pagewise)==0:
        pass
    else:
        for entity in json_dict.entities:
            if entity.type_ != 'line_item':
                final_entities.append(entity)
        for ent in line_entities_classified_pagewise:
            final_entities.append(ent)
        json_dict.entities = final_entities

    return json_dict

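def desc_merge_update(json_dict : object) -> object:
    """
    Merge multiple description sub-entities of each line item into a single one.

    Args:
        json_dict (object): The document object containing entities.

    Returns:
        object: The updated document object with merged descriptions.
    """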
    def desc_merge_1(ent_desc : List[Dict]) -> Dict:
        """
        Merge description entities.

        Args:
            ent_desc (List[Dict]): List of description entities.

        Returns:
            Dict: Merged description entity.
        """
        desc_merge={'mention_text':'','page_anchor': {'page_refs':''},'text_anchor': {'text_segments':[]},'type': 'line_item/description'}
        text_anchors_desc_merge=[]
        pagerefs=''
        for item in ent_desc:
            text_anchors_desc_merge.append(item['text_anchor']['text_segments'][0])
            pagerefs=item['page_anchor']['page_refs']
        desc_merge['page_anchor']['page_refs']=pagerefs
        desc_merge['text_anchor']['text_segments']=text_anchors_desc_merge
        sorted_list_text1=sorted(text_anchors_desc_merge, key=lambda x: int(x['end_index']))
        desc_mention_text=''
        desc_text_segments=[]
        desc_normalizedvertices=[]
        subentities_classified=[]
        for index in sorted_list_text1:
            for item in ent_desc:
                if index in item['text_anchor']['text_segments']:
                    if item not in subentities_classified:
                        subentities_classified.append(item)
                        desc_mention_text=(desc_mention_text+' '+ item['mention_text'])
                        desc_text_segments.append(index)
                        for i in item['page_anchor']['page_refs'][0]['bounding_poly']['normalized_vertices']:
                            desc_normalizedvertices.append(i)
        desc_normalizedvertices_final = get_normalized_vertices(desc_normalizedvertices)
        desc_merge['page_anchor']['page_refs'][0]['bounding_poly']['normalized_vertices']=desc_normalizedvertices_final
        desc_merge['mention_text']=desc_mention_text
        desc_merge['text_anchor']['text_segments']=desc_text_segments
        return desc_merge

    for entity in json_dict.entities:
        line_en=[]
        ent_desc=[]
        desc_ent_merge={}
        if entity.type_ == 'line_item':
            line_en.append(entity.properties)
        for itm in line_en:
            for ent1 in itm:
                if ent1.type_ == 'line_item/description':
                    ent_desc.append(ent1)
        if len(ent_desc)>1:
            desc_merge=desc_merge_1(ent_desc)
            if entity.type_ == 'line_item':
                for en2 in ent_desc:
                    if en2 in entity.properties:
                        del entity.properties[entity.properties.index(en2)]
            entity.properties.append(desc_merge)
    return json_dict

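def process_and_update_json(res_dict : object) -> object:
    """
    Merge line items and, depending on the `line_item_across_pages` and
    `Desc_merge_update` settings, group line items across pages and merge
    description sub-entities.

    Args:
        res_dict (object): Document object returned by the processor.

    Returns:
        object: Updated document object.
    """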
    json_dict_updated = merge_entities(res_dict)
    if line_item_across_pages == 'Yes':
        json_dict_updated = get_lineitems_grouped_pages_across(json_dict_updated)
    if Desc_merge_update == 'Yes':
        json_dict_updated = desc_merge_update(json_dict_updated)
    return json_dict_updated
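
The row assignment used for line items can be illustrated on plain numbers. The following is a minimal sketch, not the notebook's implementation: `assign_rows` and its inputs are hypothetical names, and each sub-entity is reduced to a single top y-coordinate. The sorted y-coordinates of the boundary entity define bands, each sub-entity is placed in the band that contains it, and the last band is open-ended.

```python
from bisect import bisect_right
from typing import Dict, List


def assign_rows(boundary_ys: List[float], entity_ys: Dict[str, float]) -> Dict[int, List[str]]:
    """Group entities into rows delimited by the sorted boundary y-values.

    Hypothetical helper for illustration only. Row i spans
    [boundary[i], boundary[i + 1]); the last row has no lower limit.
    Entities above the first boundary are placed in row 0, which is
    similar in spirit to a nearest-line fallback.
    """
    boundaries = sorted(boundary_ys)
    rows: Dict[int, List[str]] = {}
    for name, y in entity_ys.items():
        # bisect_right returns how many boundaries are <= y
        row = max(bisect_right(boundaries, y) - 1, 0)
        rows.setdefault(row, []).append(name)
    return rows


rows = assign_rows(
    boundary_ys=[0.20, 0.30, 0.40],
    entity_ys={"desc_a": 0.21, "amount_a": 0.20, "desc_b": 0.31, "amount_c": 0.45},
)
print(rows)
```

Each resulting row corresponds to one candidate line item whose sub-entities would then be merged into a parent entity.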

In [None]:
import re  # regular expressions are used below to locate mention text in the OCR output

for row in tqdm(processed_rows):
    file_name = row[0]  # The first item is 'FileNames'

    if file_name not in processed_files:
        print('Processing:', file_name)
        client = storage.Client()
        bucket = client.get_bucket(input_bucket.split('/')[2])
        file_name_path=input_bucket+file_name
        file_name_path="/".join(file_name_path.split('/')[3:])
        blob = bucket.blob(file_name_path)
        content = blob.download_as_bytes()
        res = utilities.process_document_sample(project_id=project_id, location=location, processor_id=processor_id, pdf_bytes = content, processor_version=processor_version)
        res_dict = res.document
        tokenRange = get_token_range(res_dict)

        # Add the file_name to the set of processed files
        processed_files.add(file_name)

    list_of_entities = []
    list_of_entities_not_mapped = []
    processed_entities = set()  # Set to track processed entities
    # Iterate over the sets of data associated with this file_name
    for i in range(0, len(row), len(headers)):
        row_slice = row[i:i + len(headers)]
        for j in range(1, len(headers)):
            type_ = headers[j]
            mention_text = row_slice[j]                    
            if '/' in type_:
                # ADDS LINE ITEM ENTITY
                if mention_text:
                    parts = type_.split('/')
                    parent_name = parts[0]
                    type_ = parts[1] if len(parts) > 1 else None
                    # Match the value only when it is followed by a delimiter: space, pipe, comma, or newline
                    occurrences = re.finditer(re.escape(str(mention_text)) + r"[ |,\n]", res_dict.text)
                    entity_flag = False
                    for m in occurrences:
                        start, end = m.start(), m.end()
                        entity_id = (mention_text, start, end)  # Unique identifier for each entity

                        if entity_id not in processed_entities:
                            entity_flag = True
                            entity = create_entity(mention_text, type_, m)
                            try:
                                entity_modified = fix_page_anchor_entity(entity, res_dict, tokenRange)
                                entity_modified_as_parent = modify_as_parent_entity(entity_modified,parent_name)
                                camel_case_entity = convert_keys_to_camel_case(entity_modified_as_parent)
                                processed_entities.add(entity_id)  # Add the unique identifier to the set
                                list_of_entities.append(camel_case_entity)
                            except Exception:
                                print("Not able to find " + mention_text + " in the OCR")
                                continue
                    if not entity_flag:
                        list_of_entities_not_mapped.append(type_)
                    continue    
            else:
                # ADDS Normal Entity
                if mention_text:
                    # Match the value only when it is followed by a delimiter: space, pipe, comma, or newline
                    occurrences = re.finditer(re.escape(str(mention_text)) + r"[ |,\n]", res_dict.text)
                    entity_flag = False

                    for m in occurrences:
                        start, end = m.start(), m.end()
                        entity_id = (mention_text, start, end)  # Unique identifier for each entity

                        if entity_id not in processed_entities:
                            entity_flag = True
                            entity = create_entity(mention_text, type_, m)
                            try:
                                entity_modified = fix_page_anchor_entity(entity, res_dict, tokenRange)
                                camel_case_entity = convert_keys_to_camel_case(entity_modified)
                                processed_entities.add(entity_id)  # Add the unique identifier to the set
                                list_of_entities.append(camel_case_entity)
                            except Exception:
                                print("Not able to find " + mention_text + " in the OCR")
                                continue
                    if not entity_flag:
                        list_of_entities_not_mapped.append(type_)

    print("Number of entities that are mapped: ",len(list_of_entities))
    res_dict.entities = list_of_entities

    # Merge the mapped entities into line items (and, if configured, merge descriptions)
    updated_json_dict = process_and_update_json(res_dict)

    # Write the final output to GCS
    output_bucket_name = output_bucket.split('/')[2]
    output_path_within_bucket = '/'.join(output_bucket.split('/')[3:]) + file_name
    utilities.store_document_as_json(documentai.Document.to_json(updated_json_dict), output_bucket_name, output_path_within_bucket)
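
The matching step in the loop above can be tried in isolation. This is a minimal sketch under stated assumptions: `find_mentions` is a hypothetical helper, the OCR text is a hand-written string, and a lookahead is used so that the trailing delimiter stays outside the span (in the loop above the delimiter is part of the regex match). The `seen` set plays the role of `processed_entities`, so a span that was already claimed is not returned twice.

```python
import re
from typing import List, Set, Tuple


def find_mentions(text: str, mention: str, seen: Set[Tuple[str, int, int]]) -> List[Tuple[int, int]]:
    """Return (start, end) spans of `mention` that are followed by a delimiter.

    Hypothetical helper for illustration. Spans already present in `seen`
    are skipped, and new spans are added to it.
    """
    pattern = re.escape(mention) + r"(?=[ ,\n])"
    spans = []
    for m in re.finditer(pattern, text):
        key = (mention, m.start(), m.end())
        if key not in seen:  # skip spans already claimed earlier
            seen.add(key)
            spans.append((m.start(), m.end()))
    return spans


ocr_text = "Invoice 1001\nTotal 250.00, Tax 25.00\nTotal 250.00\n"
seen: Set[Tuple[str, int, int]] = set()
first = find_mentions(ocr_text, "250.00", seen)
second = find_mentions(ocr_text, "250.00", seen)
print(first, second)
```

The second call returns an empty list because both occurrences were recorded by the first call, which is the same de-duplication idea the loop relies on.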

### 4. Output

* The processed output JSON files are written to the GCS output bucket. They can then be imported into the Document AI processor to verify the annotations.

<img src="./Images/output_bucket.png" width=800 height=400></img>

* Once the files are imported into the Document AI processor, the annotated entities are visible. The number of entities that were mapped is printed at runtime.

<img src="./Images/output_2.png" width=800 height=400></img>

<img src="./Images/output_1.png" width=800 height=400></img>
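
Before importing, an output file can also be sanity-checked locally by counting entities per type. The sketch below is an illustration only: `count_entity_types` is a hypothetical helper, the sample JSON is hand-written, and it assumes the output uses the `entities`, `type`, and `properties` keys.

```python
import json
from collections import Counter
from typing import Any, Dict


def count_entity_types(document: Dict[str, Any]) -> Counter:
    """Count entities by type, including nested properties (child entities)."""
    counts: Counter = Counter()
    for entity in document.get("entities", []):
        counts[entity.get("type", "")] += 1
        for prop in entity.get("properties", []):
            counts[prop.get("type", "")] += 1
    return counts


# Hand-written sample standing in for a downloaded output file (assumed layout).
sample_json = """
{
  "entities": [
    {"type": "invoice_id", "mentionText": "1001"},
    {"type": "line_item", "properties": [
      {"type": "line_item/description", "mentionText": "Widget"},
      {"type": "line_item/amount", "mentionText": "250.00"}
    ]},
    {"type": "line_item", "properties": [
      {"type": "line_item/description", "mentionText": "Gadget"}
    ]}
  ]
}
"""
counts = count_entity_types(json.loads(sample_json))
print(dict(counts))
```

Comparing these counts with the rows of the input CSV gives a quick indication of whether any expected entity was left unmapped.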
