# Finding Missing Child Items and Improving Line Item Grouping


* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied. 


## Objective

The tool finds child items that are missing from line items and groups the correct child items under their parent line items.
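
As a rough illustration of what a "missing child item" means here, the sketch below compares the child types present in each line item against the set of types seen across all line items. The plain dictionaries and the field names (`date`, `description`, `amount`) are hypothetical stand-ins for Document AI entities, not the data structures the tool itself uses.

```python
from typing import Dict, List


def find_missing_child_types(line_items: List[Dict[str, str]]) -> List[List[str]]:
    """For each line item, list the child types seen elsewhere but absent here."""
    # Union of every child type observed in any line item.
    expected = set()
    for item in line_items:
        expected.update(item.keys())
    # Types that are expected but not present, per line item (sorted for stable output).
    return [sorted(expected - set(item.keys())) for item in line_items]


# Hypothetical line items: child type -> mention text.
items = [
    {"date": "01/02", "description": "Coffee", "amount": "3.50"},
    {"date": "01/03", "amount": "12.00"},
]
missing = find_missing_child_types(items)
print(missing)  # [[], ['description']]
```

In this toy input the second line item lacks a `description`, which is the kind of gap the tool tries to fill from nearby page tokens.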

## Prerequisites

* Vertex AI Notebook or Colab (if using Colab, authenticate first)
* Storage bucket for storing input and output JSON files
* Permissions for Google Cloud Storage and Vertex AI Notebook
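
For the Colab case, authentication could look like the sketch below. It assumes the standard `google.colab.auth` helper that is available inside Colab runtimes; the wrapper function `authenticate_if_colab` is a hypothetical name introduced here so the same cell also runs unchanged in a Vertex AI Notebook, where the import is not available.

```python
def authenticate_if_colab() -> bool:
    """Authenticate the user when running in Colab; return True if it ran."""
    try:
        # This module is only importable inside a Colab runtime.
        from google.colab import auth
    except ImportError:
        # Not running in Colab (for example, Vertex AI Notebook): nothing to do.
        return False
    auth.authenticate_user()
    return True


authenticated = authenticate_if_colab()
print("Colab authentication performed:", authenticated)
```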



## Step by Step procedure

### 1. Importing Required Modules

In [1]:
!pip install pandas numpy google-cloud-storage google-cloud-documentai==2.16.0 
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

--2024-01-30 10:18:21--  https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29700 (29K) [text/plain]
Saving to: ‘utilities.py’


2024-01-30 10:18:21 (17.4 MB/s) - ‘utilities.py’ saved [29700/29700]



In [1]:
#import libraries
from tqdm import tqdm
from google.cloud import documentai_v1beta3 as documentai
from pathlib import Path
from google.cloud import storage
from collections import Counter
from typing import Dict, List, Any,Tuple
from utilities import *

### 2. Input and Output Paths

In [None]:
# Path to the raw parsed JSON files. The path must end with a forward slash ('/').
Gcs_input_path = "gs://xxxxx/xxxxxxxxxxxx/xx/"
# Your Google Cloud project ID.
project_id = 'xxx-xxxx-xxxx'
# Path for saving the processed output files. Do not include a trailing forward slash ('/').
Gcs_output_path = "gs://xxxxx/xxxxxxxxxxxx/xx"
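# Entity type of the parent line items to check and regroup.
parent_type='table_item'
# Case-sensitive string: only the exact value 'True' enables the missing-items step.
Missing_items_flag='True'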

* ``Gcs_input_path``: GCS input path. It should contain the JSON files produced by Document AI processing, and it must end with a forward slash.
* ``Gcs_output_path``: GCS output path where the updated JSON files are saved. Do not include a trailing forward slash.
* ``project_id``: The ID of your Google Cloud project.
* ``parent_type``: The entity type of the parent line items whose child items are checked and regrouped.
* ``Missing_items_flag``: Set to ``'True'`` (a case-sensitive string) to search for missing child items. Any other value skips the missing-items step.
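
The trailing-slash rules matter because the code splits these paths on `/` and then appends the file name to the output prefix with its own `/`. The sketch below shows that splitting on made-up paths; `split_gcs_uri`, the bucket name, and the file name are hypothetical and only illustrate the idea.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"Not a GCS URI: {uri}")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    return bucket, prefix


# Hypothetical paths following the conventions described above.
in_bucket, in_prefix = split_gcs_uri("gs://my-bucket/parsed/jsons/")
out_bucket, out_prefix = split_gcs_uri("gs://my-bucket/processed")

# The output blob name is built as prefix + '/' + file name.
good_blob = out_prefix + "/" + "doc-0.json"
print(good_blob)  # processed/doc-0.json

# With a trailing slash on the output path, a double slash appears in the blob name.
_, bad_prefix = split_gcs_uri("gs://my-bucket/processed/")
bad_blob = bad_prefix + "/" + "doc-0.json"
print(bad_blob)  # processed//doc-0.json
```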


### 3. Run the Code

### Note
* When running the missing-items step, if line items are spaced closely together, increase or decrease `x_allowance` and `y_allowance` in the `get_token_data` function.
* Human review is recommended after using this tool.
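
To see how the allowances change which tokens are picked up, the sketch below mirrors the containment comparison used in `get_token_data` on a single made-up token. The helper name `token_in_region` and all coordinates are hypothetical; the default allowances match the values set in the code.

```python
from typing import Tuple

# (min_x, min_y, max_x, max_y) in normalized page coordinates.
Box = Tuple[float, float, float, float]


def token_in_region(
    token: Box,
    region: Box,
    x_allowance: float = 0.02,
    y_allowance: float = 0.01,
) -> bool:
    """Return True if the token lies inside the region, within the allowances."""
    t_min_x, t_min_y, t_max_x, t_max_y = token
    r_min_x, r_min_y, r_max_x, r_max_y = region
    return (
        r_min_y <= t_min_y + y_allowance
        and r_max_y >= t_max_y - y_allowance
        and r_min_x <= t_min_x + x_allowance
        and r_max_x >= t_max_x - x_allowance
    )


region = (0.10, 0.50, 0.40, 0.53)
token = (0.12, 0.495, 0.20, 0.52)  # starts slightly above the region

print(token_in_region(token, region))                     # True with y_allowance=0.01
print(token_in_region(token, region, y_allowance=0.002))  # False with a tighter allowance
```

A smaller `y_allowance` helps when rows sit close together and tokens from a neighboring row are being captured; a larger one helps when tokens of the intended row are being dropped.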

In [3]:
def get_page_bbox(entity):
    bound_poly = entity.page_anchor.page_refs
    norm_ver = bound_poly[0].bounding_poly.normalized_vertices
    x_values = [vertex.x for vertex in norm_ver]
    y_values = [vertex.y for vertex in norm_ver]
    bbox = [min(x_values), min(y_values), max(x_values), max(y_values)]

    return bbox

def get_page_wise_entities(json_dict):
    """Group the document's entities by page.

    Args:
        json_dict: The loaded Document AI document.

    Returns:
        A dictionary of the form {page: [entities]}.
    """
    
    entities_page={}
    for entity in json_dict.entities:
        page=entity.page_anchor.page_refs[0].page
        if page in entities_page.keys():
            entities_page[page].append(entity)
        else:
            entities_page[page]=[entity]
            
    return entities_page

def get_line_items_schema(line_items):
    
    # line_items = [entity for entity in json_dict.entities if entity.properties]
    line_item_schema = []
    schema_xy = []
    for line_item in line_items:
        temp_schema = {}
        temp_xy = {}
        for item in line_item.properties:
            temp_schema[item.type] = temp_schema.get(item.type, 0) + 1
            bbox = get_page_bbox(item)
            if item.type in temp_xy:
                temp_xy[item.type].append(bbox)
            else:
                temp_xy[item.type] = [bbox]

        # Append once per line item (not once per child) to avoid duplicate entries.
        line_item_schema.append(temp_schema)
        schema_xy.append(temp_xy)

    flat_list = [(key, value) for item in line_item_schema for key, value in item.items()]

    counter = Counter(dict(flat_list))
    temp_schema_dict = dict(counter)
    consolidated_positions_ent={}
    x=[]
    for k3,v3 in temp_schema_dict.items():
        for l3 in schema_xy:
            for k4,v4 in l3.items():
                if k3==k4:
                    for x12 in v4:
                        if k3 in consolidated_positions_ent.keys():
                            consolidated_positions_ent[k3].append(x12)
                        else:
                            consolidated_positions_ent[k3]=[x12]
    final_ent_x12={}
    final_ent_y12={}
    for ent_typ,va1 in consolidated_positions_ent.items():
        sorted_data = sorted(va1, key=lambda x: x[0])
        groups = []
        current_group = [sorted_data[0]]
        difference_threshold = 0.02
        for i in range(1, len(sorted_data)):
            if abs(sorted_data[i][0] - current_group[-1][0]) <= difference_threshold:
                current_group.append(sorted_data[i])
            else:
                groups.append(current_group)
                current_group = [sorted_data[i]]
        groups.append(current_group)
        for va3 in  groups:
            if len(va3)>=1:
                if ent_typ in final_ent_x12.keys():
                    final_ent_x12[ent_typ].append([min(item[0] for item in va3), max(item[2] for item in va3)])
                    final_ent_y12[ent_typ].append([min(item[1] for item in va3), max(item[3] for item in va3)])
                else:
                    final_ent_x12[ent_typ]=[[min(item[0] for item in va3), max(item[2] for item in va3)]]
                    final_ent_y12[ent_typ]=[[min(item[1] for item in va3), max(item[3] for item in va3)]]

    return temp_schema_dict,final_ent_x12,final_ent_y12

def get_token_xy(token: Any) -> Tuple[float, float, float, float]:
    
    """
    Extracts the normalized bounding box coordinates (min_x, min_y, max_x, max_y) of a token.

    Args:
    - token (Any): A token object with layout information.

    Returns:
    - Tuple[float, float, float, float]: The normalized bounding box coordinates.
    
    """
    vertices = token.layout.bounding_poly.normalized_vertices
    minx_token, miny_token = min(point.x for point in vertices), min(point.y for point in vertices)
    maxx_token, maxy_token = max(point.x for point in vertices), max(point.y for point in vertices)

    return minx_token,miny_token,maxx_token,maxy_token

def get_token_data(json_dict,min_x,max_x,min_y,max_y,page_num):
    text_anc_temp=[]
    text_anc=[]
    page_anc_temp={'x':[],'y':[]}
    y_allowance=0.01 # adjust this if line items are close together and the results are not as desired
    x_allowance=0.02
    for page in json_dict.pages:
        if page_num==page.page_number-1:
            for token in page.tokens:
                minx_token,miny_token,maxx_token,maxy_token=get_token_xy(token)
                if min_y<=miny_token+y_allowance and max_y>=maxy_token-y_allowance and min_x<=minx_token+x_allowance and max_x>=maxx_token-x_allowance:
                    temp_anc=token.layout.text_anchor.text_segments[0]
                    text_anc.append(temp_anc)
                    page_anc_temp['x'].extend([minx_token,maxx_token])
                    page_anc_temp['y'].extend([miny_token,maxy_token])
                    for seg in token.layout.text_anchor.text_segments:
                        text_anc_temp.append([seg.start_index,seg.end_index])
    if page_anc_temp!={'x':[],'y':[]}:    
        page_anc=[{'x':min(page_anc_temp['x']),'y':min(page_anc_temp['y'])},{'x':max(page_anc_temp['x']),'y':min(page_anc_temp['y'])},
                  {'x':min(page_anc_temp['x']),'y':max(page_anc_temp['y'])},{'x':max(page_anc_temp['x']),'y':max(page_anc_temp['y'])}]
    if text_anc_temp!=[]:
        sorted_data = sorted(text_anc_temp, key=lambda x: x[0])
        mention_text = ""
        for start_index, end_index in sorted_data:
            mention_text += json_dict.text[start_index:end_index] 

        return mention_text,text_anc,page_anc

                

def get_missing_fields(json_dict,line_items,temp_schema_dict,final_ent_x12,ent_x_region,line_item_y_region):
    
    for line_item in line_items:
        import copy
        page_num=0
        temp_types=[]
        mis_type=[]
        text_anc_line=[]
        text_anc_mt=[]
        page_anc_line={'x':[],'y':[]}
        deep_copy_temp_schema = copy.deepcopy(temp_schema_dict)

        for child in line_item.properties:
            temp_types.append(child.type)
            for seg in child.text_anchor.text_segments:
                text_anc_line.append(seg)
                text_anc_mt.append([seg.start_index,seg.end_index])
            for anc4 in child.page_anchor.page_refs:
                # page_n=anc4.page
                for xy2 in anc4.bounding_poly.normalized_vertices:
                    page_anc_line['x'].append(xy2.x)
                    page_anc_line['y'].append(xy2.y)
        # Only relevant for bank statement parser output: a line item holds either
        # a deposit or a withdrawal, so the opposite type is not expected.
        # Reset the default for every line item so no value leaks from the previous one.
        modified_schema = deep_copy_temp_schema
        for k2 in temp_types:
            if 'deposit' in k2:
                modified_schema = {key: value for key, value in deep_copy_temp_schema.items() if 'withdrawal' not in key}
                break
            elif 'withdrawal' in k2:
                modified_schema = {key: value for key, value in deep_copy_temp_schema.items() if 'deposit' not in key}
                break

        for t1,v1 in modified_schema.items():
            if t1 in temp_types:
                pass
            else:
                mis_type.append(t1)

        if len(mis_type)>0:
            for typ in mis_type:

                for ent_pos in line_item.page_anchor.page_refs:
                    page_num=ent_pos.page
                    try:
                        min_x=ent_x_region[typ][0]
                    except:
                        min_x=min(ver.x for ver in ent_pos.bounding_poly.normalized_vertices)
                    min_y=min(ver.y for ver in ent_pos.bounding_poly.normalized_vertices)
                    try:
                        max_x=ent_x_region[typ][1]-0.02
                    except:
                        max_x=max(ver.x for ver in ent_pos.bounding_poly.normalized_vertices)-0.02
                        
                    if 'description' in typ:

                        try:
                            # differences = [line_item_y_region[i+1] - line_item_y_region[i] for i in range(len(line_item_y_region)-1)]
                            # avg_dif=sum(differences) / len(differences)
                            closest_index_y = min(range(len(line_item_y_region)), key=lambda i: abs(line_item_y_region[i] - min_y))
#                             if abs(min_y-closest_index_y)>0.02:
#                                 max_y=min_y+avg_dif
#                                 print(max_y)
#                             else:
                                
                            max_y=line_item_y_region[closest_index_y+1]
                        except :
                            pass
                    else:
                        max_y=max(ver.y for ver in ent_pos.bounding_poly.normalized_vertices)
                        
        
                    try:
                    
                        mention_text,text_anc,page_anc=get_token_data(json_dict,min_x,max_x,min_y,max_y,page_num)
                        for an3 in text_anc:
                            text_anc_line.append(an3)
                            text_anc_mt.append([an3.start_index,an3.end_index])
                        for xy3 in page_anc:
                            page_anc_line['x'].append(xy3['x'])
                            page_anc_line['y'].append(xy3['y'])
                        entity_new={'mention_text': mention_text,
                                     'page_anchor': {'page_refs': [{'bounding_poly': {'normalized_vertices': page_anc},
                                        'page': str(page_num)}]},
                                     'text_anchor': {'content': mention_text,
                                      'text_segments': text_anc},
                                     'type': typ}
                        line_item.properties.append(entity_new)
                    except Exception as e:
                        # print(e)
                        # print('YES')
                        pass
        page_anc_final=[{'x':min(page_anc_line['x']),'y':min(page_anc_line['y'])},{'x':max(page_anc_line['x']),'y':min(page_anc_line['y'])},
                      {'x':min(page_anc_line['x']),'y':max(page_anc_line['y'])},{'x':max(page_anc_line['x']),'y':max(page_anc_line['y'])}]
        sorted_data_1 = sorted(text_anc_mt, key=lambda x: x[0])
        mention_text_final = ""
        for start_index_1, end_index_1 in sorted_data_1:
            mention_text_final = mention_text_final+' '+json_dict.text[start_index_1:end_index_1] 

        line_item.mention_text=mention_text_final
        for anc6 in line_item.page_anchor.page_refs:
            anc6.bounding_poly.normalized_vertices=page_anc_final
        line_item.text_anchor.text_segments=text_anc_line

    new_ent=[]

    for l1 in line_items:
        new_ent.append(l1)
    
    return new_ent

def get_schema_with_bbox(line_items):
    line_item_schema = []
    schema_xy = []
    for line_item in line_items:
        temp_schema = {}
        temp_xy = {}
        for item in line_item.properties:
            temp_schema[item.type] = temp_schema.get(item.type, 0) + 1
            bbox = get_page_bbox(item)
            if item.type in temp_xy:
                temp_xy[item.type].append(bbox)
            else:
                temp_xy[item.type] = [bbox]

        line_item_schema.append(temp_schema)
        schema_xy.append(temp_xy)
    
    return line_item_schema,schema_xy

def get_anchor_entity(schema_xy,line_item_schema):
    ent_y2 = {}
    for sc1 in schema_xy:
        for e2, bbox in sc1.items():
            if len(bbox)==1:
                for b2 in bbox:
                    ent_y2.setdefault(e2, []).extend([b2[1], b2[3]])
    #get the min and max y of entities
    entity_min_max_y = {}
    for en3, val3 in ent_y2.items():
        min_y_3 = min(val3)
        max_y_3 = max(val3)
        entity_min_max_y[en3] = [min_y_3, max_y_3]
    
    #counting times the entity appeared uniquely in all the line items
    entity_count = {}
    for entry in line_item_schema:
        for entity, value in entry.items():
            if value == 1:
                if entity in entity_count:
                    entity_count[entity] += 1
                else:
                    entity_count[entity] = 1

    value_counts = {}
    for value in entity_count.values():
        value_counts[value] = value_counts.get(value, 0) + 1
    # Find the maximum value
    max_value = max(value_counts.values())

    # Find keys with the maximum value
    keys_with_max_value = [key for key, value in value_counts.items() if value == max_value]

    # Find the key with the maximum value (in case of ties, choose the maximum key)
    max_key = max(keys_with_max_value)

    repeated_key = [key for key, value in entity_count.items() if value == max_key]

    filtered_entities = {key: entity_min_max_y[key] for key in repeated_key if key in entity_min_max_y}
    # print(filtered_entities)
    # print(max_key)
    if len(filtered_entities) > 1 :
        anchor_entity = min(filtered_entities, key=lambda k: filtered_entities[k][0])
    else:
        anchor_entity=list(filtered_entities.keys())[0]
        
    return anchor_entity
    
def entity_region_x(schema_xy):
    def get_margin(min_y_bin,min_values="YES"):
        # Sort the list in ascending order
        min_y_bin.sort()

        bins = []
        current_bin = [min_y_bin[0]]
        # Iterate through the values to create bins
        for i in range(1, len(min_y_bin)):
            if min_y_bin[i] - current_bin[-1] < 0.05:
                current_bin.append(min_y_bin[i])
            else:
                bins.append(current_bin.copy())
                current_bin = [min_y_bin[i]]
                
        # Add the last bin
        bins.append(current_bin)
        final_bins=[]
        for bin_1 in bins:
            if len(bin_1)>=2:
                final_bins.append(bin_1)
        if final_bins==[]:
            for bin_1 in bins:
                if len(bin_1)>=1:
                    final_bins.append(bin_1)
        if min_values=='YES':
            return min(min(inner_list) for inner_list in final_bins)
        else:
            return max(max(inner_list) for inner_list in final_bins)
        
    ent_full_boundries={}
    for line_1 in schema_xy:
        for typ_1,bbox_1 in line_1.items():
            if len(bbox_1)==1:
                if typ_1 in ent_full_boundries.keys():
                    ent_full_boundries[typ_1].append(bbox_1[0])
                else:
                    # Copy the box into a new list so later appends do not mutate schema_xy.
                    ent_full_boundries[typ_1]=[bbox_1[0]]
    ent_margins={}
    for ent_typ_1,values_1 in ent_full_boundries.items():
        min_x_bin=[]
        min_y_bin=[]
        max_x_bin=[]
        max_y_bin=[]
        min_check=len(values_1)
        for bbox in values_1:
            min_x_bin.append(bbox[0])
            min_y_bin.append(bbox[1])
            max_x_bin.append(bbox[2])
            max_y_bin.append(bbox[3])
        min_x=get_margin(min_x_bin,min_values="YES")
        min_y=get_margin(min_y_bin,min_values="YES")
        max_x=get_margin(max_x_bin,min_values="NO")
        max_y=get_margin(max_y_bin,min_values="NO")

        ent_margins[ent_typ_1]=[min_x,min_y,max_x,max_y]
    
    ent_margin_withdrawal={}
    ent_margin_deposit={}
    for ent_3,bbox_3 in ent_margins.items():
        if 'withdrawal' in ent_3:
            ent_margin_withdrawal[ent_3]=bbox_3
        elif 'deposit' in ent_3:
            ent_margin_deposit[ent_3]=bbox_3
        else:
            ent_margin_withdrawal[ent_3]=bbox_3
            ent_margin_deposit[ent_3]=bbox_3

    def get_x_region(ent_margin_withdrawal):
        sorted_ent_margin_withdrawal = sorted_data = dict(sorted(ent_margin_withdrawal.items(), key=lambda x: x[1][0]))
        ent_x_regions={}
        keys_sorted=list(sorted_ent_margin_withdrawal.keys())
        for n_1 in range(len(keys_sorted)):
            if n_1<len(keys_sorted)-1:
                if sorted_ent_margin_withdrawal[keys_sorted[n_1]][2]>sorted_ent_margin_withdrawal[keys_sorted[n_1+1]][0]:  
                    ent_x_regions[keys_sorted[n_1]]=[sorted_ent_margin_withdrawal[keys_sorted[n_1]][0],sorted_ent_margin_withdrawal[keys_sorted[n_1]][2]]
                else:
                    ent_x_regions[keys_sorted[n_1]]=[sorted_ent_margin_withdrawal[keys_sorted[n_1]][0],sorted_ent_margin_withdrawal[keys_sorted[n_1+1]][0]]
            else:
                ent_x_regions[keys_sorted[n_1]]=[sorted_ent_margin_withdrawal[keys_sorted[n_1]][0],sorted_ent_margin_withdrawal[keys_sorted[n_1]][2]]

        return ent_x_regions

    withdrawal_x_region=get_x_region(ent_margin_withdrawal)
    deposit_x_region=get_x_region(ent_margin_deposit)
    ent_x_region = {**deposit_x_region, **withdrawal_x_region}

    
    return ent_x_region

def get_line_item_y_region(line_items):
    line_item_y_region=[]
    max_y_line_item=[]
    for tab_item in line_items:
        y_1=[]
        for line_details in tab_item.page_anchor.page_refs:
            page=line_details.page
            for xy_1 in line_details.bounding_poly.normalized_vertices:
                y_1.append(xy_1.y)
        line_item_y_region.append(min(y_1))
        max_y_line_item.append(max(y_1))

    line_item_y_region.append(max(max_y_line_item))
    sorted_line_item_y_region=sorted(line_item_y_region)
    
    return sorted_line_item_y_region

def get_line_item_y_region_by_anchor(line_items,anchor_entity):
    y_max_anchor=[]
    y_min_anchor=[]

    for tab1_item in line_items:
        for child in tab1_item.properties:
            if child.type_==anchor_entity:
                y_2=[]
                for child_details in child.page_anchor.page_refs:
                    for xy_2 in child_details.bounding_poly.normalized_vertices:
                        y_2.append(xy_2.y)
                y_max_anchor.append(max(y_2))
                y_min_anchor.append(min(y_2))
    sorted_y_max_anchor=sorted(y_max_anchor)
    sorted_y_min_anchor=sorted(y_min_anchor)
    sorted_y_max_anchor.append(sorted_y_min_anchor[0])
    sorted_y_anchor=sorted(sorted_y_max_anchor)

    return sorted_y_anchor

def get_line_item_region(schema_xy, anchor_entity,line_items):
    region_y=[]
    for reg in schema_xy:
        for e4,v4 in reg.items():
            if 'date' in e4:# if e4==anchor_entity:
                region_y.append(v4[0][1])
    #Get line item total region and getting all child items into single list
    bbox_line_y=[]
    bbox_line_x=[]
    child_items=[]
    for line_item in line_items:
        bbox_line=get_page_bbox(line_item)
        bbox_line_y.extend([bbox_line[1],bbox_line[3]])
        bbox_line_x.extend([bbox_line[0],bbox_line[2]])
        for child in line_item.properties:
            child_items.append(child)
    line_item_start_y=min(bbox_line_y)
    line_item_end_y=max(bbox_line_y)
    
    # Get the boundary of each line item
    line_item_region=[]
    region_y=sorted(region_y)
    for r1 in range(len(region_y)):
        if r1==0:
            line_item_region.append([line_item_start_y,region_y[r1+1]])
        elif r1==len(region_y)-1:
            line_item_region.append([region_y[r1],line_item_end_y])
        else:
            line_item_region.append([region_y[r1],region_y[r1+1]])

    return line_item_region,child_items


def group_line_items(parent_type,child_items,page,line_item_region,json_dict):

    grouped_line_items=[]
    
    for boundry in line_item_region:
        line_item_temp={'mention_text':'','page_anchor': {'page_refs': [{'bounding_poly': {'normalized_vertices':[]},'page':page }]},'properties':[],'text_anchor': {'text_segments':[]},'type': parent_type}
        text_anc_temp=[]
        page_anc_temp={'x':[],'y':[]}
        mt_temp=''
        for child_1 in child_items:
            bbox_temp=get_page_bbox(child_1)
            if bbox_temp[1]>=boundry[0]-0.005 and bbox_temp[3]<=boundry[1]+0.005:
                # print('entered')
                line_item_temp['properties'].append(child_1)
                page_anc_temp['x'].extend([bbox_temp[0],bbox_temp[2]])
                page_anc_temp['y'].extend([bbox_temp[1],bbox_temp[3]])
                seg_temp=child_1.text_anchor.text_segments
                for seg in seg_temp:
                    text_anc_temp.append({'start_index':str(seg.start_index),'end_index':str(seg.end_index)})
        sorted_data = sorted(text_anc_temp, key=lambda x: int(x['end_index']))
        for sort_text in sorted_data:
            mt_temp=mt_temp+' '+json_dict.text[int(sort_text['start_index']):int(sort_text['end_index'])]
        line_item_temp['text_anchor']['text_segments']=sorted_data
        line_item_temp['mention_text']=mt_temp
        # print(mt_temp)
        line_item_temp['page_anchor']['page_refs'][0]['bounding_poly']['normalized_vertices']=[{'x':min(page_anc_temp['x']),'y':min(page_anc_temp['y'])},
                                                                                               {'x':max(page_anc_temp['x']),'y':min(page_anc_temp['y'])},
                                                                                               {'x':max(page_anc_temp['x']),'y':max(page_anc_temp['y'])},
                                                                                               {'x':min(page_anc_temp['x']),'y':max(page_anc_temp['y'])}]
        grouped_line_items.append(line_item_temp)
        
    return grouped_line_items

def get_updated_grouped_line_items(json_dict,parent_type):
    final_line_items=[]
    page_wise_ent=get_page_wise_entities(json_dict)
    entities_ungrouped=[]
    for page_num, ent in page_wise_ent.items():
        try:
            line_items = [entity for entity in ent if entity.properties and entity.type==parent_type]
            try:
                line_item_schema,schema_xy=get_schema_with_bbox(line_items)
                anchor_entity=get_anchor_entity(schema_xy,line_item_schema)
                line_item_region,child_items=get_line_item_region(schema_xy, anchor_entity,line_items)
                grouped_line_items=group_line_items(parent_type,child_items,page_num,line_item_region,json_dict)
                for item in grouped_line_items:
                    final_line_items.append(item)
            except:
                entities_ungrouped.append(line_items)
                continue
        except:
            continue
    final_entities=[]
    for en3 in json_dict.entities:
        if en3.type!=parent_type:
            final_entities.append(en3)
    for lin_it in final_line_items:
        final_entities.append(lin_it)
    if len(entities_ungrouped)>0:
        for item_1 in entities_ungrouped:
            for item_2 in item_1:
                final_entities.append(item_2)
                
    
    json_dict.entities=final_entities

    return json_dict

def get_missing_data(json_dict,parent_type):
    page_wise_ent=get_page_wise_entities(json_dict)
    new_added_entities=[]
    other_entities=[]
    json_dict=get_updated_grouped_line_items(json_dict,parent_type)
    for page_num, ent in page_wise_ent.items():
        line_items = [entity for entity in ent if entity.properties and entity.type==parent_type]
        line_items_other= [entity for entity in ent if entity.properties and entity.type!=parent_type]
        for other_ent in line_items_other:
            other_entities.append(other_ent)
        # print(page_num)
        if len(line_items)>2:
            line_item_schema,schema_xy=get_schema_with_bbox(line_items)
            ent_x_region=entity_region_x(schema_xy)
            # print(ent_x_region)
            anchor_entity=get_anchor_entity(schema_xy,line_item_schema)
            # print(anchor_entity)
            line_item_y_region= get_line_item_y_region(line_items)
            temp_schema_dict,final_ent_x12,final_ent_y12=get_line_items_schema(line_items)
            new_ent=get_missing_fields(json_dict,line_items,temp_schema_dict,final_ent_x12,ent_x_region,line_item_y_region)
            for item in new_ent:
                new_added_entities.append(item)
        else:
            for lin_it1 in line_items:
                other_entities.append(lin_it1)
    final_entities=[]
    for en3 in json_dict.entities:
        if en3.type!=parent_type:
            final_entities.append(en3)
    for lin_it in new_added_entities:
        final_entities.append(lin_it)
    for lin_it2 in other_entities:
        final_entities.append(lin_it2)
    json_dict.entities=final_entities
    
    return json_dict

def main():
    file_name_list,file_path_dict=file_names(Gcs_input_path)
    for i in range(len(file_name_list)):
        file_path='gs://'+Gcs_input_path.split('/')[2]+'/'+file_path_dict[file_name_list[i]]
        print(file_path)
        json_data=documentai_json_proto_downloader(file_path.split('/')[2],('/').join(file_path.split('/')[3:]))
        if Missing_items_flag=='True':
            json_data=get_missing_data(json_data,parent_type)
        json_data= get_updated_grouped_line_items(json_data,parent_type)
        store_document_as_json(documentai.Document.to_json(json_data), Gcs_output_path.split('/')[2], ('/').join(Gcs_output_path.split('/')[3:])+'/'+file_name_list[i])

main()


gs://test_vb1/1_Testing_2024/Test_gateless/FEB1/parsed_jsons/6422400092316209525/0/02a6179e-9762-4f0f-b7b3-fcb0da787e54_BankStatement_gateless-redacted-0.json
0
2
gs://test_vb1/1_Testing_2024/Test_gateless/FEB1/parsed_jsons/6422400092316209525/1/02b28d2f-4026-4f2c-94da-fc53bd374517_BankStatement_gateless-redacted-0.json
0
[0.47736263275146484, 0.5094505548477173, 0.5406593680381775, 0.5705494284629822, 0.5942857265472412]
2
[0.47780218720436096, 0.508571445941925, 0.5406593680381775, 0.5701099038124084, 0.6000000238418579, 0.6307692527770996, 0.6593406796455383]
3
[0.04879120737314224, 0.08087912201881409, 0.11120878905057907, 0.14241757988929749, 0.16923077404499054]
gs://test_vb1/1_Testing_2024/Test_gateless/FEB1/parsed_jsons/6422400092316209525/10/04410ed1-8857-42b3-8373-1ea957812171_BankStatement_gateless-redacted-0.json
1
[0.1428571492433548, 0.15736263990402222, 0.17186813056468964, 0.1850549429655075, 0.1986813247203827, 0.21450549364089966, 0.22901098430156708, 0.24307692050933

KeyboardInterrupt: 