# Challenge 03: Data Modelling: From Retrieval to Upload

In this step, we will structure the data retrieved from Azure Document Intelligence (ADI) into the right format to be read by our systems in subsequent steps. 

ADI outputs the data as a JSON file, and it is our role to process and organize it. Some of the data will be structured into tables, while other data will be formatted as text. This step ensures that the extracted information is organized in a meaningful way for further analysis and usage.

As stated before, we need to make sure that our Function will know how to process:
- **Loan Forms:** Extract relevant details such as borrower information, loan amounts, and terms.
- **Loan Contract:** Identify and parse key contract elements like clauses, signatures, and dates.
- **Pay Stubs:** Retrieve data such as employee details, earnings, deductions, and net pay.

Not all customers will have provided every type of document, and during this challenge we will only be processing one file. In the next challenge we will add a trigger, which will likewise process a single document at a time.

Due to the nature of this challenge, we will split it into the 3 different types of documents.

## Pay Stub 

As part of some loan applications, the pay stub is a required document. The pay stub outlines the details of an employee’s income: the wages earned, applicable deductions, total gross pay, and net pay for the pay period. A pay stub provides Contoso bank with crucial information about a person’s income and employment stability, which helps assess their ability to repay the loan. It also verifies the applicant’s financial credibility and ensures that their reported income matches their actual earnings.

When processing a Pay Stub, we will face challenges similar to the ones we had with the Loan Forms. These documents combine text and, contrary to the previous use case, more than one table. Once again, ADI allows you to extract these two types of entities separately.
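
To make the text/table split concrete, here is a minimal sketch of the shape the code in this challenge reads. The `sample_result` dictionary is hand-written for illustration and is not real ADI output; only its key names (`pages`, `lines`, `text`, `tables`, `cells`, `row_index`, `column_index`, `content`) are taken from the code in this challenge. The helper names `collect_text` and `cells_to_grid` are hypothetical.

```python
# Hand-written sample that mimics the keys read in this challenge (not real ADI output).
sample_result = {
    "pages": [
        {"lines": [{"text": "Pay Stub for: Jane Doe"}, {"text": "Pay Date: 2024-01-31"}]}
    ],
    "tables": [
        {
            "row_count": 2,
            "column_count": 3,
            "cells": [
                {"row_index": 0, "column_index": 0, "content": "Description"},
                {"row_index": 0, "column_index": 1, "content": "Current"},
                {"row_index": 0, "column_index": 2, "content": "Year-to-Date"},
                {"row_index": 1, "column_index": 0, "content": "Federal Tax"},
                {"row_index": 1, "column_index": 1, "content": "$200.00"},
                {"row_index": 1, "column_index": 2, "content": "$2,400.00"},
            ],
        }
    ],
}


def collect_text(result):
    """Join the text of every line on every page into one string."""
    return " ".join(
        line.get("text", "")
        for page in result.get("pages", [])
        for line in page.get("lines", [])
    )


def cells_to_grid(table):
    """Rebuild a row-by-column grid from the flat list of cells."""
    grid = [["" for _ in range(table["column_count"])] for _ in range(table["row_count"])]
    for cell in table["cells"]:
        grid[cell["row_index"]][cell["column_index"]] = cell["content"]
    return grid


text = collect_text(sample_result)
grid = cells_to_grid(sample_result["tables"][0])
print(text)
print(grid[1])
```

The free text ends up as one flat string suitable for pattern matching, while each table keeps its row and column positions.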

As we've previously created the function that loads the documents inside a designated folder, all we have to do now is retrieve the information inside the paystub folder. We will retrieve a single pay stub for us to analyse.


In [1]:
import os
import json
import pandas as pd
from azure.storage.blob import BlobServiceClient
from dotenv import load_dotenv
import re
# Load environment variables from .env file
load_dotenv()

def read_json_files_from_blob(folder_path):
    # Retrieve the connection string from the environment variables
    connection_string = os.getenv('STORAGE_CONNECTION_STRING')

    # Ensure the connection string is not None
    if connection_string is None:
        raise ValueError("The connection string environment variable is not set.")

    # Create a BlobServiceClient
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    # Get the container client
    container_client = blob_service_client.get_container_client("data")

    # List all blobs in the specified folder
    blob_list = container_client.list_blobs(name_starts_with=folder_path)

    # Return the parsed contents of the first JSON file found
    for blob in blob_list:
        if blob.name.endswith('.json'):
            blob_client = container_client.get_blob_client(blob.name)
            blob_data = blob_client.download_blob().readall()
            data = json.loads(blob_data)
            return data 
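
One detail worth noting: `read_json_files_from_blob` returns the first `.json` blob it finds, and implicitly returns `None` when the folder holds none. Below is a minimal sketch of the same lookup with the container client passed in as an argument, so it can be exercised without a storage account and fails loudly when nothing is found. The function name `first_json_from_container` and the `Fake*` classes are hypothetical, introduced only for this illustration.

```python
import json
from types import SimpleNamespace


def first_json_from_container(container_client, folder_path):
    """Return the parsed content of the first .json blob under folder_path."""
    for blob in container_client.list_blobs(name_starts_with=folder_path):
        if blob.name.endswith(".json"):
            raw = container_client.get_blob_client(blob.name).download_blob().readall()
            return json.loads(raw)
    raise FileNotFoundError(f"No .json blob found under '{folder_path}'")


# --- Hypothetical in-memory stand-ins for the storage client (illustration only) ---
class FakeDownload:
    def __init__(self, payload):
        self._payload = payload

    def readall(self):
        return self._payload


class FakeBlobClient:
    def __init__(self, payload):
        self._payload = payload

    def download_blob(self):
        return FakeDownload(self._payload)


class FakeContainer:
    def __init__(self, blobs):
        self._blobs = blobs  # maps blob name -> raw bytes

    def list_blobs(self, name_starts_with=""):
        return [SimpleNamespace(name=n) for n in self._blobs if n.startswith(name_starts_with)]

    def get_blob_client(self, name):
        return FakeBlobClient(self._blobs[name])


container = FakeContainer({
    "paystubs/notes.txt": b"not json",
    "paystubs/paystub_1.json": b'{"pages": [], "tables": []}',
})
paystub_sample = first_json_from_container(container, "paystubs")
print(sorted(paystub_sample.keys()))
```

Raising an error instead of returning `None` is a design choice for this sketch; it surfaces an empty folder immediately rather than later, when the parsing code tries to read from the result.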

In [2]:
paystub = read_json_files_from_blob("paystubs")  # NOTE: remove this so participants work out what they are doing

In [3]:
def clean_form_recognizer_result(data):
    text_content = []
    
    for page in data.get("pages", []):
        for line in page.get("lines", []):
            # Skip lines that mention "table": their keys are left untouched
            if "table" in line.get("text", "").lower():
                continue  # The line is neither stripped nor added to the plain text
            # Keep only the "text" key
            line_keys = list(line.keys())
            for key in line_keys:
                if key != "text":
                    del line[key]
            # Collect the text content
            text_content.append(line.get("text", ""))
    
    # Create structured tables
    structured_tables = create_structured_tables(data.get("tables", []))
    
    # Concatenate all text content into a single string
    plain_text_content = " ".join(text_content)
    
    data["structured_tables"] = structured_tables
    data["plain_text_content"] = plain_text_content

    return data

def parse_pay_stub(pay_stub_text):
    # Dictionary to store parsed data
    parsed_data = {}

    # Regular expressions to match the required fields
    pay_stub_patterns = {
        'id': r'Customer ID: (\d+)',
        'Company Name': r'^(.+?) Pay Stub for:',
        'Employee Name': r'Pay Stub for: (.+?) Pay Period:',
        'Pay Period': r'Pay Period: (.+?) Pay Date:',
        'Pay Date': r'Pay Date: (.+?) Employee ID:',
        'Employee ID': r'Employee ID: (.+?) Employee Information:',
        'Employee Address': r'Address: (.+?), Social Security',
        'Social_Security': r'Social Security Number: (XXX-XX-\d{4})'
    }

    # Apply regex patterns and store matches in the dictionary;
    # fields whose pattern does not match are left out
    for key, pattern in pay_stub_patterns.items():
        match = re.search(pattern, pay_stub_text)
        if match:
            parsed_data[key] = match.group(1)
    return parsed_data

def create_structured_tables(tables):
    structured_tables = []
    for table in tables:
        row_count = table.get("row_count", 0)
        column_count = table.get("column_count", 0)
        cells = table.get("cells", [])
        
        # Initialize an empty table
        structured_table = [["" for _ in range(column_count)] for _ in range(row_count)]
        
        # Populate the table with cell content
        for cell in cells:
            row_index = cell.get("row_index", 0)
            column_index = cell.get("column_index", 0)
            content = cell.get("content", "")
            structured_table[row_index][column_index] = content
        
        structured_tables.append(structured_table)
    
    return structured_tables

def tables_to_dataframes(structured_tables):
    dataframes = []
    for table in structured_tables:
        df = pd.DataFrame(table)
        dataframes.append(df)
    return dataframes



cleaned_data = clean_form_recognizer_result(paystub)
dataframes = tables_to_dataframes(cleaned_data["structured_tables"])

structured_data = {
    "pay stub details": parse_pay_stub(cleaned_data["plain_text_content"]),
}


df_list = []

def process_dataframe(df):
    result = {}
    columns = df.columns[1:]  # Ignore the first column
    for i in range(1, len(df)):  # Ignore the first row
        row_name = df.iloc[i, 0]
        result[row_name] = {}
        for col in columns:
            result[row_name][col] = f"{row_name} {col}: {df.at[i, col]}"
    return result

def rename_json_attributes(json_obj, attribute_titles):
    """
    Rename the keys of a JSON object based on the provided attribute titles.

    Parameters:
    json_obj (dict): The JSON object to rename.
    attribute_titles (dict): A dictionary where keys are the current attribute names and values are the new attribute names.

    Returns:
    dict: The updated JSON object with renamed keys.
    """
    updated_json = {}
    for old_key, new_key in attribute_titles.items():
        if old_key in json_obj:
            updated_json[new_key] = json_obj[old_key]
        else:
            updated_json[old_key] = json_obj.get(old_key, None)
    return updated_json

attribute_titles_earnings = {
    "1": "Hours Worked",
    "2": "Rate",
    "3": "Current Earnings",
    "4": "Year-to-Date Earnings"
}

attribute_titles_deductions = {
    "1": "Current Amount",
    "2": "Year-to-Date Amount"
}

# Process the earnings and deductions DataFrames
earnings_dict = process_dataframe(dataframes[0])
deductions_dict = process_dataframe(dataframes[1])
# Append the processed DataFrames to the JSON structure
structured_data["earnings"] = earnings_dict
structured_data["deductions"] = deductions_dict

def clean_pay_stub_section(data):
    # Check for 'deductions' and 'earnings' in the data
    for section in ['deductions', 'earnings']:
        if section in data:
            for key, values in data[section].items():
                # For each entry, clean up the values by removing everything before the colon
                for subkey in values:
                    # Split on the first colon only, so values that contain a colon stay intact
                    values[subkey] = values[subkey].split(":", 1)[1].strip()
    return data

structured_data = clean_pay_stub_section(structured_data)


def update_attribute_keys(data, section, key_mapping):
    # Ensure the section exists in the data (either "earnings" or "deductions")
    if section in data:
        # Iterate over each type within the earnings or deductions section
        for entry_type, attributes in data[section].items():
            # Create a new dictionary to store the updated attributes
            updated_attributes = {}
            
            # Loop through each attribute in that entry (e.g. 1, 2, 3)
            for old_key, value in attributes.items():
                # Map the old key (which is an integer) to the new descriptive key using key_mapping
                if str(old_key) in key_mapping:  # Convert old_key to string to match the mapping
                    new_key = key_mapping[str(old_key)]
                else:
                    new_key = old_key  # If no mapping is found, retain the old key
                
                # Update the dictionary with the new key
                updated_attributes[new_key] = value

            # Replace the old attributes with the updated attributes in the data
            data[section][entry_type] = updated_attributes

    return data


paystub_final = update_attribute_keys(structured_data, "earnings", attribute_titles_earnings)
paystub_final = update_attribute_keys(paystub_final, "deductions", attribute_titles_deductions)

customer_id = paystub_final["pay stub details"]["id"]

print(customer_id)
# Serialize the updated structure to a JSON string (nothing is written to disk here)
json_data = json.dumps(paystub_final, indent=4)


print("JSON file updated successfully.")

100002
JSON file updated successfully.
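
The patterns in `parse_pay_stub` capture each field by anchoring on the label that follows it in the flattened text, so they depend on the lines arriving in a fixed order. The sketch below shows that behaviour on a hand-written sample string, which is hypothetical and not real ADI output. It uses a subset of the patterns from the cell above, and `extract_fields` is a hypothetical name for the same loop.

```python
import re

# Subset of the patterns used in parse_pay_stub
patterns = {
    "id": r"Customer ID: (\d+)",
    "Employee Name": r"Pay Stub for: (.+?) Pay Period:",
    "Pay Period": r"Pay Period: (.+?) Pay Date:",
    "Pay Date": r"Pay Date: (.+?) Employee ID:",
}


def extract_fields(text, field_patterns):
    """Return only the fields whose pattern matched; unmatched fields are omitted."""
    fields = {}
    for name, pattern in field_patterns.items():
        match = re.search(pattern, text)
        if match:
            fields[name] = match.group(1)
    return fields


# Hypothetical flattened text; note that it has no "Customer ID:" label.
sample_text = (
    "Pay Stub for: Jane Doe Pay Period: 01/01/2024 - 01/15/2024 "
    "Pay Date: 01/20/2024 Employee ID: E-12345"
)

fields = extract_fields(sample_text, patterns)
print(fields)
print(fields.get("id", "<missing>"))
```

Because unmatched fields are dropped, a lookup such as `paystub_final["pay stub details"]["id"]` raises a `KeyError` when the `Customer ID:` label is absent from the text. Using `.get` is one way to detect that case before uploading.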


### Code

In [4]:
from azure.cosmos import CosmosClient, exceptions, PartitionKey
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Cosmos DB connection details from environment variables
endpoint = os.getenv("COSMOS_ENDPOINT")
key = os.getenv("COSMOS_KEY")

def upload_text_to_cosmos_db(text_content, container_name):
    # Check if the text is empty
    if not text_content:
        print("The text content is empty. No data to upload.")
        return
    
    # Initialize the Cosmos client
    client = CosmosClient(endpoint, key)
    
    try:
        # Create or get the database
        database = client.create_database_if_not_exists(id="ContosoDB")
        
        # Create or get the container
        container = database.create_container_if_not_exists(
            id=container_name,
            partition_key=PartitionKey(path="/id"),
            offer_throughput=400
        )
    except exceptions.CosmosHttpResponseError as e:
        print(f"An error occurred while creating the database or container: {e.message}")
        return
    
    # Create a document with the text content and partition key
    document = {
        'id': str(customer_id),  # Use the customer ID parsed from the pay stub as the document ID
        'content': text_content,  # Store the plain text as 'content'
    }
    
    # Upload the document to the container
    try:
        container.create_item(body=document)
        print(f"Text content uploaded successfully with ID '{document['id']}' in Cosmos DB.")
    except exceptions.CosmosHttpResponseError as e:
        print(f"An error occurred while uploading the document: {e.message}")
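
`upload_text_to_cosmos_db` reads `customer_id` from the notebook's global scope, so it only works after the parsing cell has run. Below is a minimal sketch that builds the same document shape from explicit arguments instead. `build_cosmos_document` is a hypothetical helper name; it is not part of the challenge code or of the Azure SDK.

```python
def build_cosmos_document(customer_id, content):
    """Return a document whose 'id' (also the '/id' partition key value) is the customer ID."""
    if not content:
        raise ValueError("The content is empty. No data to upload.")
    if customer_id is None or str(customer_id).strip() == "":
        raise ValueError("A customer ID is required to use as the document 'id'.")
    return {"id": str(customer_id), "content": content}


# Hypothetical values, shaped like the output of the parsing cell
document = build_cosmos_document(100002, {"pay stub details": {"id": "100002"}})
print(document["id"])
```

The resulting dictionary can be passed as `body` to `container.create_item`. Because the ID is the customer ID rather than a generated value, uploading the same pay stub twice collides with the existing `id`. If overwriting is acceptable, `container.upsert_item(body=document)` from `azure-cosmos` is one option.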

### Upload Pay Stubs

In [5]:
upload_text_to_cosmos_db(paystub_final, "PayStubs")

Text content uploaded successfully with ID '100002' in Cosmos DB.
