# DocAI Renaming Normalized Currency

* Author: docai-incubator@google.com

# Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective

This tool post-processes parsed Document AI JSON files. It takes a list of entities whose normalized date format has to be changed (if the list is empty, all date entities are changed) and rewrites their normalized date text in the format given as input. It also changes the normalized value of the currency entity from USD to SGD when the supplier is located in Singapore, since the currency of such documents is sometimes predicted as USD.

# Prerequisites
* Vertex AI Notebook
* Parsed json files in GCS Folder
* Output folder to upload the updated json files

# Step By Step Procedure

## 1. Import Modules/Packages

In [None]:
!pip install google-cloud-documentai --quiet

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
from typing import List, Tuple, Union

from google.cloud import documentai_v1beta3 as documentai

import utilities

## 2. Input Variables Description

**`project_id`** : Your GCP project ID  
**`entities_normalize`** : List of entity types whose date format has to be changed. If it is empty, all date-related entities are updated  
    * The tool makes the following changes:  
        1. The default Google normalized value of the currency entity is changed from USD to SGD  
        2. The default Google normalized date format of the entities in `entities_normalize` is changed to the format given in `normalized_date_format` (the date format changes only in the JSON; the change is not reflected in the UI)  
**`input_files_path`** : GCS folder URI where the parsed JSON files are stored  
**`output_files_path`** : GCS folder URI where the updated JSON files have to be stored  
**`normalized_date_format`** : Date format that replaces the default normalized date format, written with the words `day`, `month`, and `year` (for example, `day/month/year`)  
**`switch`** : Either `ON` or `OFF`, to control whether the JSON files are updated  
    * *ON* : The date and currency normalization post-processing takes place  
    * *OFF* : JSON files are copied from *input_files_path* to *output_files_path* (i.e., without post-processing)
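
The `normalized_date_format` value works as a template: the words `year`, `month`, and `day` are replaced with the numbers from the entity's normalized date. The following is a minimal sketch of that substitution. The helper name `render_date_template` is hypothetical and is not part of the notebook, and the numbers are inserted without zero padding.

```python
def render_date_template(template: str, year: int, month: int, day: int) -> str:
    """Fill a template such as 'day/month/year' with date parts.

    Hypothetical helper, for illustration only.
    """
    return (
        template.replace("year", str(year))
        .replace("month", str(month))
        .replace("day", str(day))
    )


print(render_date_template("day/month/year", 2023, 7, 4))  # 4/7/2023
print(render_date_template("month/year/day", 2023, 7, 4))  # 7/2023/4
```

Abbreviated placeholders such as `YY` or `MM` are left untouched by this kind of substitution, which is why the full words are needed.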

In [None]:
project_id = "xx-xx-xx"
entities_normalize = []
input_files_path = "gs://bucket/path_to/input/"
output_files_path = "gs://bucket/path_to/post/output/"
normalized_date_format = "month/year/day"
switch = "ON"

## 3. Run Below Code-Cell

Implementation Approach:  
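The currency is updated only when both of the following conditions hold:
* `supplier_city` or `supplier_address` has "singapore" in it, or neither the `supplier_city` nor the `supplier_address` entity is available in the parsed JSON files (input path)
* The normalized value of the `currency` entity is USD or empty, or no `currency` entity is available in the JSON (in which case a new `currency` entity with the value SGD is added)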
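
The rule above can be sketched as a small pure function, which may help when reasoning about individual cases. This is an illustration under stated assumptions, not the notebook's implementation: the name `should_set_sgd` is hypothetical, `None` stands for an entity that is absent, and `supplier_address` is consulted only when `supplier_city` is absent.

```python
from typing import Optional


def should_set_sgd(
    supplier_city: Optional[str],
    supplier_address: Optional[str],
    currency: Optional[str],
) -> bool:
    """Return True when the currency should be set to SGD.

    Hypothetical helper: None means the entity is absent from the document.
    """
    if supplier_city is not None:
        location_ok = "singapore" in supplier_city.lower()
    elif supplier_address is not None:
        location_ok = "singapore" in supplier_address.lower()
    else:
        # Neither location entity was predicted
        location_ok = True

    # An absent, empty, or USD currency is eligible for replacement
    currency_ok = currency is None or currency.strip().lower() in ("usd", "")
    return location_ok and currency_ok


print(should_set_sgd("Singapore", None, "USD"))  # True
print(should_set_sgd("London", None, "USD"))  # False
print(should_set_sgd("Singapore", None, "EUR"))  # False
```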

In [None]:
def all_entities_update(
    doc: documentai.Document, entities_normalize: Union[list, List[str]]
) -> List[str]:
    """To get the entity types whose dates have to be normalized

    Args:
        doc (documentai.Document): DocumentAI document-proto object
        entities_normalize (Union[list, List[str]]): list of entities to be changed for date, if it is empty all the entity types in the document are used

    Returns:
        List[str]: list of entities to be changed for date
    """

    if entities_normalize not in ([], [""]):
        return entities_normalize

    # Build a new list instead of appending to the caller's list; otherwise
    # the entity types of the first file would be reused for every later file
    all_entity_types = []
    for entity in doc.entities:
        if entity.properties:
            for subentity in entity.properties:
                all_entity_types.append(subentity.type_)
            continue
        all_entity_types.append(entity.type_)

    return all_entity_types


def normalize_date(
    doc: documentai.Document, normalized_date_format: str, entities_normalize: List[str]
) -> documentai.Document:
    """To normalize the date in the required format

    Args:
        doc (documentai.Document): DocumentAI document-proto object
        normalized_date_format (str): The required date format, written with the words year, month and day,
                                      e.g. year/month/day or month/day/year (abbreviations like Y or YY won't work)
        entities_normalize (List[str]): list of entities to be changed for date, considers only required entities

    Returns:
        documentai.Document: Updated document-proto object with normalized date format
    """

    for entity in doc.entities:
        normalized_date_text = normalized_date_format
        if entity.type_ in entities_normalize and entity.normalized_value.date_value:
            year = str(entity.normalized_value.date_value.year)
            month = str(entity.normalized_value.date_value.month)
            day = str(entity.normalized_value.date_value.day)
            normalized_date_text = (
                normalized_date_text.replace("year", year)
                .replace("month", month)
                .replace("day", day)
            )
            if normalized_date_text != "0/0/0":
                entity.normalized_value.text = normalized_date_text

        if not entity.properties:
            continue
        for subentity in entity.properties:
            if (
                subentity.type_ in entities_normalize
                and subentity.normalized_value.date_value
            ):
                normalized_date_text = normalized_date_format
                year = str(subentity.normalized_value.date_value.year)
                month = str(subentity.normalized_value.date_value.month)
                day = str(subentity.normalized_value.date_value.day)
                normalized_date_text = (
                    normalized_date_text.replace("year", year)
                    .replace("month", month)
                    .replace("day", day)
                )
                if normalized_date_text != "0/0/0":
                    subentity.normalized_value.text = normalized_date_text
    return doc


def currency_normalize_sgd(doc: documentai.Document) -> Tuple[documentai.Document, int]:
    """The currency is updated only if supplier_city or supplier_address has singapore in it,
       or the supplier_city and supplier_address entities are not available in the parsed JSON (input path),
       and the currency entity is either USD or not available in the JSON

    Args:
        doc (documentai.Document): DocumentAI document-proto object

    Returns:
        Tuple[documentai.Document, int]: The updated document-proto object and the currency_update flag
                                         (0: no change, 1: currency entity updated to SGD, 2: currency entity newly added)
    """

    entity_types = []
    k = 0
    currency_update = 0
    for entity in doc.entities:
        if entity.properties:
            for subentity in entity.properties:
                entity_types.append(subentity.type_)
            continue
        entity_types.append(entity.type_)

    if "supplier_city" in entity_types:
        for entity in doc.entities:
            # Substring check, so values such as "Singapore 048616" also match
            if entity.type_ == "supplier_city" and (
                "singapore" in entity.mention_text.lower()
                or "singapore" in entity.normalized_value.text.lower()
            ):
                k = 1
    elif "supplier_address" in entity_types:
        for entity in doc.entities:
            if (
                entity.type_ == "supplier_address"
                and "singapore" in entity.mention_text.lower()
            ):
                k = 2
    else:
        k = 3

    if not k:
        return doc, currency_update

    if "currency" in entity_types:
        for entity in doc.entities:
            if entity.type_ == "currency" and entity.normalized_value.text.lower() in (
                "usd",
                "",
            ):
                entity.normalized_value.text = "SGD"
                currency_update = 1
        return doc, currency_update

    ent = documentai.Document.Entity()
    ent.normalized_value.text = "SGD"
    ent.type_ = "currency"
    ent.page_anchor.page_refs = [documentai.Document.PageAnchor.PageRef()]
    doc.entities.append(ent)
    currency_update = 2

    return doc, currency_update


print("Process Started")
list_files_updated = []
_, input_files_dict = utilities.file_names(input_files_path)
input_bucket = input_files_path.split("/")[2]
input_gcs_path = input_files_path.replace(f"gs://{input_bucket}/", "")
output_bucket = output_files_path.split("/")[2]
output_gcs_path = output_files_path.replace(f"gs://{output_bucket}/", "")
count = 0
for filename, filepath in input_files_dict.items():
    print(f"\tfilename: {filename}")
    doc = utilities.documentai_json_proto_downloader(input_bucket, filepath)
    file_name = {}
    output_uri = f"{output_gcs_path.rstrip('/')}/{filename}"
    if switch == "ON":
        count += 1
        entities_normalize_1 = all_entities_update(doc, entities_normalize)
        doc = normalize_date(doc, normalized_date_format, entities_normalize_1)
        json_updated, currency_update = currency_normalize_sgd(doc)
        if currency_update == 1:
            file_name[filename] = "currency entity updated to SGD"
            list_files_updated.append(file_name)
        elif currency_update == 2:
            file_name[filename] = "currency entity not found, so added now"
            list_files_updated.append(file_name)
        str_data = documentai.Document.to_json(
            json_updated, including_default_value_fields=False
        )
        utilities.store_document_as_json(str_data, output_bucket, output_uri)
    else:
        str_data = documentai.Document.to_json(
            doc, including_default_value_fields=False
        )
        utilities.store_document_as_json(str_data, output_bucket, output_uri)

print(f"Total number of files updated: {count}")
print("Process Completed")
# print(list_files_updated)

## 4. Output Details

* Changes the normalized value from USD to SGD as shown below and saves the updated jsons in the output GCS folder.

![](./images/currency.png)

* Updates the normalized date text to the format given in the `normalized_date_format` variable.

<table>
<tr>
<td> Pre-processing</td>
<td> Post-processing</td>
</tr>
<tr>
<td><img src="./images/pre_process.png" width=400 height=800></td>
<td><img src="./images/post_process.png" width=400 height=800></td>
</tr>
</table>

**NOTE**: This date change is made only in the JSON and is not visible in the UI.
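
Because the UI does not show the new date text, one way to confirm the change is to read the updated JSON directly. The following is a minimal sketch, assuming the usual Document AI JSON layout (`entities`, `properties`, `normalizedValue.text`). The sample document and the helper `normalized_texts` are made up for illustration.

```python
import json

# Made-up sample in the shape of a post-processed Document AI JSON file
sample_json = """
{
  "entities": [
    {"type": "invoice_date", "mentionText": "4 July 2023",
     "normalizedValue": {"text": "7/2023/4",
                         "dateValue": {"year": 2023, "month": 7, "day": 4}}},
    {"type": "line_item", "mentionText": "Widget 10.00",
     "properties": [
       {"type": "line_item/amount", "mentionText": "10.00",
        "normalizedValue": {"text": "10"}}
     ]},
    {"type": "currency", "mentionText": "$", "normalizedValue": {"text": "SGD"}}
  ]
}
"""


def normalized_texts(doc_dict):
    """Collect (entity type, normalized text) pairs, including child entities."""
    rows = []
    for entity in doc_dict.get("entities", []):
        for item in [entity, *entity.get("properties", [])]:
            normalized = item.get("normalizedValue")
            if normalized:
                rows.append((item["type"], normalized.get("text", "")))
    return rows


for entity_type, text in normalized_texts(json.loads(sample_json)):
    print(f"{entity_type}: {text}")
```

For a real file, the same function can be applied to the result of `json.load` on a JSON file downloaded from the output folder.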

## 5. Testing Postprocessing Results Script

In [None]:
import pandas as pd


def currency_test(input_files_path: str, output_files_path: str) -> pd.DataFrame:
    """It creates a dataframe to compare the pre-processed and post-processed normalized text of the currency entity

    Args:
        input_files_path (str): GCS folder path containing pre-processed results
        output_files_path (str): GCS folder path containing post-processed results

    Returns:
        pd.DataFrame: Dataframe which helps to observe the difference between pre- and post-processed results for the currency entity
    """

    input_bucket = input_files_path.split("/")[2]
    output_bucket = output_files_path.split("/")[2]
    parsed_jsons_files, parsed_jsons_dict = utilities.file_names(input_files_path)
    post_processed_jsons_files, post_processed_jsons_dict = utilities.file_names(
        output_files_path
    )
    print("Preprocessed files count", len(parsed_jsons_files))
    print("Postprocessed files count", len(post_processed_jsons_files))

    dict_test_parsed = {}
    dict_test_post_processed = {}
    for fn, fp in parsed_jsons_dict.items():
        print(f"filename: {fn}")
        parsed_json = utilities.documentai_json_proto_downloader(input_bucket, fp)
        post_processed_json = utilities.documentai_json_proto_downloader(
            output_bucket, post_processed_jsons_dict[fn]
        )
        dict_file_parsed = {}
        dict_file_post_processed = {}
        for entity in parsed_json.entities:
            if entity.type_ == "supplier_city":
                if entity.mention_text:
                    dict_file_parsed["supplier_city"] = entity.mention_text
                elif entity.normalized_value:
                    if entity.normalized_value.text:
                        dict_file_parsed["supplier_city"] = entity.normalized_value.text

            if entity.type_ == "supplier_address" and entity.mention_text:
                dict_file_parsed["supplier_address"] = entity.mention_text

            if entity.type_ == "currency" and entity.normalized_value:
                dict_file_parsed["currency_before"] = entity.normalized_value.text

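        # Set once per file (outside the entity loop) so the column exists
        # even when a file has no entities
        dict_file_parsed["singapore_text"] = (
            "yes" if "singapore" in parsed_json.text.lower() else "no"
        )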

        if "supplier_city" not in dict_file_parsed.keys():
            dict_file_parsed["supplier_city"] = "not predicted"
        if "supplier_address" not in dict_file_parsed.keys():
            dict_file_parsed["supplier_address"] = "not predicted"

        dict_test_parsed[fn] = dict_file_parsed
        for entity in post_processed_json.entities:
            if entity.type_ == "currency" and entity.normalized_value:
                dict_file_post_processed[
                    "currency_after"
                ] = entity.normalized_value.text
        if dict_file_post_processed == {}:
            dict_file_post_processed[
                "currency_after"
            ] = "no currency and singapore not in address and city"
        dict_test_post_processed[fn] = dict_file_post_processed

    df_parsed = pd.DataFrame.from_dict(dict_test_parsed, orient="index")
    df_parsed = df_parsed.reset_index()
    df_post_processed = pd.DataFrame.from_dict(dict_test_post_processed, orient="index")
    df_post_processed = df_post_processed.reset_index()
    df_test = pd.merge(df_parsed, df_post_processed, on="index")
    df_test = df_test[
        [
            "index",
            "supplier_city",
            "supplier_address",
            "singapore_text",
            "currency_before",
            "currency_after",
        ]
    ]
    df_test.rename(
        columns={
            "index": "Receipt Name",
            "singapore_text": "Is “Singapore” anywhere in receipt",
        },
        inplace=True,
    )
    print("Writing dataframe results to 'test_currency.csv'")
    df_test.to_csv("test_currency.csv")

    return df_test


print("Currency Test Process Started")
currency_test(input_files_path, output_files_path)

The output of the above testing script is a CSV file that compares the currency entity before and after post-processing, file by file, as shown below.

![](./images/test_result.png)
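
To list only the files whose currency changed, the CSV can be filtered on the `currency_before` and `currency_after` columns. The following is a minimal sketch using the standard library. The rows are made up, only a subset of the columns is shown, and the helper `changed_receipts` is hypothetical.

```python
import csv
import io

# Made-up rows with a subset of the columns written to test_currency.csv
sample_csv = """,Receipt Name,supplier_city,currency_before,currency_after
0,receipt_1.json,Singapore,USD,SGD
1,receipt_2.json,London,USD,USD
2,receipt_3.json,not predicted,,SGD
"""


def changed_receipts(csv_file):
    """Return the receipt names whose currency differs before and after."""
    reader = csv.DictReader(csv_file)
    return [
        row["Receipt Name"]
        for row in reader
        if row["currency_before"] != row["currency_after"]
    ]


print(changed_receipts(io.StringIO(sample_csv)))
```

For the real file, pass an open file handle instead, for example `open("test_currency.csv", newline="", encoding="utf-8")`.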