# Mortgage & Lending Use Case

Amazon Bedrock Data Automation (BDA) is a fully managed capability of Amazon Bedrock that streamlines the generation of valuable insights from unstructured, multimodal content such as documents, images, audio, and videos. With Amazon Bedrock Data Automation, you can build automated intelligent document processing (IDP), media analysis, and Retrieval-Augmented Generation (RAG) workflows quickly and cost-effectively.

This workbook focuses on using BDA to extract insights from unstructured documents. The use case is processing a loan application: we will process a packet of loan-related documents, including ID cards, bank statements, W2 tax forms, pay stubs, and checks.


This notebook is based on the solution 'Guidance for Multimodal Data Processing Using Amazon Bedrock Data Automation', published [here](https://aws.amazon.com/solutions/guidance/multimodal-data-processing-using-amazon-bedrock-data-automation/).

In this workbook, we will explore the main parts of this workflow: creating blueprints, processing sample documents, and classifying pages. We will process these documents:

1. ID Card
2. Bank Statements
3. W2 Tax forms
4. Pay Stubs
5. Check
6. Homeowner Insurance Application

We will then process a single PDF document with a 'loan application package', i.e. all 6 documents in one PDF file. 

This workbook follows these steps:

1. Step 1: Setup packages and create boto3 clients
2. Step 2: Create blueprint and process a Homeowner Insurance Form
3. Step 3: Create a Bedrock Data Automation Project for processing Lending Packages
4. Step 4: Process a Multi-Page Document Lending Package
5. Step 5: Display the results
6. Step 6: Cleanup Resources

## Prerequisite

Before starting the workshop, you will need to create an Amazon SageMaker Studio notebook instance (see https://docs.aws.amazon.com/sagemaker/latest/dg/howitworks-create-ws.html). For the IAM role, choose either an existing IAM role in your account or create a new one. The role must have the necessary permissions to invoke the BDA, SageMaker, and S3 APIs.

These IAM policies can be assigned to the role: AmazonBedrockFullAccess, AmazonS3FullAccess, AmazonSageMakerFullAccess, and IAMReadOnlyAccess.

Note: The AdministratorAccess IAM policy can be used, if allowed by security policies at your organization. 

## Notes

It is important to run the cells below in order. If you need to restart the workbook and have not successfully run Step 6 to clean up resources, you will need to log in to the AWS Console and delete the project and blueprints created in this workbook.

If you run cells out of order and get unexpected results, you can choose 'Restart Kernel' from the SageMaker Studio Kernel menu.

## Step 1: Setup packages and create boto3 clients

In this step, we will import the libraries used throughout this notebook.
To use Amazon Bedrock Data Automation (BDA) with boto3, you need a recent version of the AWS SDK for Python (boto3). Boto3 version 1.35.96 or later is required; the install cell below requests 1.37.6 or later.

We also have a nifty utility in utils/helpers.py that will display our document images and the results returned from the Bedrock service.
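
As a quick sanity check on the SDK version requirement, a comparison along these lines could be used. This is a minimal sketch: `version_tuple` and `meets_minimum` are hypothetical helpers, not part of boto3, and the default minimum is simply the version stated in this step.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn '1.37.6' into (1, 37, 6), stopping at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.35.96") -> bool:
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("1.37.6"))  # True
print(meets_minimum("1.34.0"))  # False
```

In the notebook itself, the installed version would come from `boto3.__version__` once boto3 has been imported.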

In [None]:
%pip install "boto3>=1.37.6" pypdfium2==4.30.1 --upgrade -q

In [None]:
import boto3
import json
from time import sleep
from IPython.display import JSON, IFrame
import sagemaker
import pypdfium2 as pdfium
import ipywidgets as widgets
from utils.helpers import get_s3_to_dict, display_image_jsons


print(boto3.__version__)

region_name = boto3.session.Session().region_name

sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']

s3 = boto3.client('s3')
client = boto3.client('bedrock-data-automation')
run_client = boto3.client('bedrock-data-automation-runtime')

We will give our project and blueprint unique names.
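
If several people share one AWS account, fixed names can collide. One possible approach, sketched here under that assumption, is to append a short random suffix. `unique_name` is a hypothetical helper and is not used elsewhere in this workbook.

```python
import uuid


def unique_name(prefix: str, suffix_length: int = 8) -> str:
    """Append a short random hex suffix to a resource name prefix."""
    suffix = uuid.uuid4().hex[:suffix_length]
    return f"{prefix}-{suffix}"


print(unique_name("my-bda-lending-workbook"))
```

The trade-off is that a later run will not find resources by name, so the ARNs returned at creation time would need to be kept for cleanup.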

In [None]:
project_name = 'my-bda-lending-workbook-v1'
blueprint_name = 'my-insurance-blueprint-v1'
bucket_name = sagemaker.Session().default_bucket()
print(f"Bucket_name: {bucket_name}")


## Step 2: Create blueprint and process a Homeowner Insurance Form

Amazon Bedrock Data Automation (BDA) includes several sample blueprints to help you get started with custom output for documents and images. 

For this workshop, there is no existing blueprint for a Homeowner Insurance Form, so we will create our own. This is a common document in a residential loan application. We need just four fields from this document to process the loan application:

1. The insured's name
2. The insurance company name
3. The address of the insured property
4. The primary email address
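
The four fields above map directly onto the `properties` of the blueprint schema. As a rough sketch, the schema string could be generated from a plain dictionary of field names and descriptions. `FIELDS` and `build_schema` are hypothetical names; the schema keys mirror the ones used in this workbook's `create_blueprint` cell and should be checked against the current BDA documentation.

```python
import json

# Field name -> description, taken from the list of required fields.
FIELDS = {
    "Insured Name": "Insured's Name",
    "Insurance Company": "insurance company name",
    "Insured Address": "the address of the insured property",
    "Email Address": "the primary email address",
}


def build_schema(description: str, fields: dict) -> str:
    """Build a blueprint schema string with one string property per field."""
    properties = {
        name: {"type": "string", "inferenceType": "extractive", "description": text}
        for name, text in fields.items()
    }
    return json.dumps({
        "$schema": "http://json-schema.org/draft-07/schema#",
        "description": description,
        "documentClass": "default",
        "type": "object",
        "properties": properties,
    })


schema = build_schema(
    "This blueprint will process a homeowners insurance application form", FIELDS
)
print(json.dumps(json.loads(schema), indent=2))
```

Keeping the fields in one dictionary makes it easier to add or rename a field without editing nested JSON by hand.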

In [None]:
# Display the Homeowner Insurance Application Form

file_name = 'documents/homeowner_insurance_application_sample.pdf'
object_name = f'data_automation/input/{file_name}'
output_name = 'data_automation/output'
s3.upload_file(file_name, bucket_name, object_name)

IFrame("documents/homeowner_insurance_application_sample.pdf", width=1000, height=500)

In [None]:
# delete project if it already exists
projects_existing = [project for project in client.list_data_automation_projects()["projects"] if project["projectName"] == project_name]
if len(projects_existing) >0:
    print(f"Deleting existing project: {projects_existing[0]}")
    client.delete_data_automation_project(projectArn=projects_existing[0]["projectArn"])
    
# delete blueprint if it already exists
blueprints_existing = [blueprint for blueprint in client.list_blueprints()["blueprints"] if blueprint["blueprintName"] == blueprint_name]
if len(blueprints_existing) >0:
    print(f"Deleting existing blueprint: {blueprints_existing[0]}")
    client.delete_blueprint(blueprintArn=blueprints_existing[0]["blueprintArn"])

This next call will create the blueprint. Note the configuration for the four fields to be extracted.

In [None]:
response = client.create_blueprint(
    blueprintName=blueprint_name,
    type='DOCUMENT',
    blueprintStage='LIVE',
    schema=json.dumps({
        "$schema": "http://json-schema.org/draft-07/schema#",
        "description": "This blueprint will process a homeowners insurance application form",
        "documentClass": "default",
        "type": "object",
        "properties": {
            "Insured Name": {
                "type": "string",
                "inferenceType": "extractive",
                "description": "Insured's Name",
            },
            "Insurance Company": {
                "type": "string",
                "inferenceType": "extractive",
                "description": "insurance company name",
            },
            "Insured Address": {
                "type": "string",
                "inferenceType": "extractive",
                "description": "the address of the insured property",
            },
            "Email Address": {
                "type": "string",
                "inferenceType": "extractive",
                "description": "the primary email address",
            },
        },
    })
)
blueprint_arn = response['blueprint']['blueprintArn']
JSON(response, expanded=False)

Next, we will use that custom blueprint to process a Homeowner Insurance Form.

In [None]:
# Upload a new Homeowner Insurance Application Form

file_name = 'documents/homeowner_insurance_application_sample.pdf'
object_name = f'data_automation/input/{file_name}'
output_name = 'data_automation/output'
s3.upload_file(file_name, bucket_name, object_name)

IFrame("documents/homeowner_insurance_application_sample.pdf", width=1000, height=500)

In [None]:

# Construct the profile ARN (account-scoped) and the public default project ARN.
# The project ARN is not used by the call below, which passes the blueprint directly.
dataAutomationProfileArn = f'arn:aws:bedrock:{region_name}:{account_id}:data-automation-profile/us.data-automation-v1'
dataAutomationProjectArn = f'arn:aws:bedrock:{region_name}:aws:data-automation-project/public-default'

response = run_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri':  f"s3://{bucket_name}/{object_name}"},
    outputConfiguration={'s3Uri': f"s3://{bucket_name}/{output_name}"},
    blueprints=[{'blueprintArn': blueprint_arn, 'stage': 'LIVE'}],
    dataAutomationProfileArn = dataAutomationProfileArn)
response

invoke_arn = response['invocationArn']

In [None]:
# Poll until the job leaves the 'Created' / 'InProgress' states
while True:
    progress = run_client.get_data_automation_status(invocationArn=invoke_arn)
    if progress['status'] not in ('Created', 'InProgress'):
        break
    print(progress['status'])
    sleep(5)

print(progress['status'])
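
The polling pattern in the cell above can be sketched as a reusable helper with an upper bound on the number of polls. This is a sketch under stated assumptions: `wait_for_invocation` is a hypothetical helper, the status function is passed in so the example runs without AWS access, and it assumes `Created` and `InProgress` are the only statuses that mean the job is still running.

```python
from time import sleep as default_sleep

# Statuses assumed to mean the job has not finished yet.
PENDING = {"Created", "InProgress"}


def wait_for_invocation(get_status, invocation_arn, poll_seconds=5,
                        max_polls=120, sleep=default_sleep):
    """Poll until the job leaves a pending state, then return the last response."""
    for _ in range(max_polls):
        progress = get_status(invocationArn=invocation_arn)
        if progress["status"] not in PENDING:
            return progress
        sleep(poll_seconds)
    raise TimeoutError(f"{invocation_arn} still pending after {max_polls} polls")


# Stand-in for run_client.get_data_automation_status so the sketch runs offline.
_fake_statuses = iter(["Created", "InProgress", "Success"])


def fake_get_status(invocationArn):
    return {"status": next(_fake_statuses)}


final = wait_for_invocation(fake_get_status, "arn:example", sleep=lambda seconds: None)
print(final["status"])  # Success
```

In the notebook, `run_client.get_data_automation_status` and `invoke_arn` would take the place of the stand-in function and the example ARN.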

Display the custom blueprint results.

Note that the four fields we requested in the blueprint have been returned.
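
To check programmatically that every requested field came back with a value, something like the following could be applied to the inference result. This is only a sketch: `missing_fields` is a hypothetical helper, it assumes the inference result is a flat dictionary keyed by the blueprint field names, and the sample values are made up for illustration.

```python
EXPECTED_FIELDS = ["Insured Name", "Insurance Company", "Insured Address", "Email Address"]


def missing_fields(inference_result: dict, expected=EXPECTED_FIELDS) -> list:
    """Return the expected field names that are absent or empty."""
    return [
        name for name in expected
        if not str(inference_result.get(name) or "").strip()
    ]


# Made-up sample values, for illustration only.
sample = {
    "Insured Name": "Jane Doe",
    "Insurance Company": "Example Insurance Co.",
    "Insured Address": "123 Any Street, Anytown, USA",
    "Email Address": "",
}
print(missing_fields(sample))  # ['Email Address']
```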

In [None]:

doc = pdfium.PdfDocument(file_name)
pages_pil = [page.render(scale=1.53).to_pil() for page in doc]

job_json_obj = get_s3_to_dict(s3,progress['outputConfiguration']['s3Uri'])
results_meta = job_json_obj["output_metadata"][0]["segment_metadata"]

results_all = []

for result in results_meta:
#    standard_output_obj = get_s3_to_dict(s3,result["standard_output_path"])
    custom_output_obj = get_s3_to_dict(s3,result["custom_output_path"])
    pages = custom_output_obj["split_document"]["page_indices"]
    w = display_image_jsons(pages_pil[pages[0]], [custom_output_obj['matched_blueprint'],custom_output_obj['inference_result']],["Matched Blueprint", "Inference Result"])
    results_all.append(w)

widgets.VBox(results_all)

## Step 3: Create a Bedrock Data Automation Project for processing Lending Packages

Next, we create a data automation project for the lending flow.

To process a lending package, we need to support multiple document types.
We add our custom blueprint along with several existing standard blueprints:

1. Homeowner Insurance Application (custom)
2. Driver's License ID Card
3. Bank Statements
4. W2 Tax form
5. Pay Stubs
6. A Check
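
The blueprint list for the project can also be assembled from the public blueprint names rather than written out entry by entry. The sketch below assumes the ARN pattern used in this workbook's project creation cell; `PUBLIC_BLUEPRINTS` and `blueprint_entries` are hypothetical names, and the custom ARN in the example is a placeholder.

```python
# Public blueprint names as they appear in this workbook's project creation cell.
PUBLIC_BLUEPRINTS = [
    "bedrock-data-automation-public-w2-form",
    "bedrock-data-automation-public-us-driver-license",
    "bedrock-data-automation-public-us-bank-check",
    "bedrock-data-automation-public-payslip",
    "bedrock-data-automation-public-bank-statement",
]


def blueprint_entries(region, custom_arns, public_names=PUBLIC_BLUEPRINTS, stage="LIVE"):
    """Return blueprint entries for custom ARNs followed by public blueprints."""
    public_arns = [
        f"arn:aws:bedrock:{region}:aws:blueprint/{name}" for name in public_names
    ]
    return [
        {"blueprintArn": arn, "blueprintStage": stage}
        for arn in list(custom_arns) + public_arns
    ]


entries = blueprint_entries(
    "us-west-2", ["arn:aws:bedrock:us-west-2:123456789012:blueprint/example"]
)
print(len(entries))  # 6
```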


Let's define the format of the standard output using the BDA standard output configuration. The settings cover:

1. Response Granularity
2. Output Settings
3. Text Format
4. Bounding Boxes and Generative Fields

The output settings are described in the documentation [here](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-output-documents.html).
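
For a document-only workflow, the `document` portion of the configuration could be produced by a small function so the four settings are explicit parameters. This is a sketch: `document_output_config` is a hypothetical helper, the key names follow the configuration used in this workbook, and the accepted values should be confirmed against the documentation linked above.

```python
def document_output_config(granularity, text_formats,
                           bounding_boxes=True, generative_fields=True):
    """Build the 'document' section of a standard output configuration."""

    def state(enabled):
        return {"state": "ENABLED" if enabled else "DISABLED"}

    return {
        "document": {
            "extraction": {
                "granularity": {"types": list(granularity)},
                "boundingBox": state(bounding_boxes),
            },
            "generativeField": state(generative_fields),
            "outputFormat": {
                "textFormat": {"types": list(text_formats)},
                "additionalFileFormat": state(False),
            },
        }
    }


config = document_output_config(["PAGE", "ELEMENT"], ["PLAIN_TEXT", "MARKDOWN"])
print(config["document"]["extraction"]["boundingBox"])  # {'state': 'ENABLED'}
```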


In [None]:
output_config = {
  "document": {
    "extraction": {
      "granularity": {"types": ["PAGE", "ELEMENT"]},
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {"state": "ENABLED"},
    "outputFormat": {
      "textFormat": {"types": ["PLAIN_TEXT", "MARKDOWN", "HTML", "CSV"]},
      "additionalFileFormat": {"state": "DISABLED"}
    }
  },
  "image": {
    "extraction": {
      "category": {"state": "ENABLED", "types": ["TEXT_DETECTION"]},
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {"state": "ENABLED", "types": ["IMAGE_SUMMARY"]}
  },
  "video": {
    "extraction": {
      "category": {"state": "ENABLED", "types": ["TEXT_DETECTION"]},
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {"state": "ENABLED", "types": ["VIDEO_SUMMARY", "SCENE_SUMMARY"]}
  },
  "audio": {
    "extraction": {
      "category": {"state": "ENABLED", "types": ["TRANSCRIPT"]}
    },
    "generativeField": {"state": "ENABLED", "types": ["IAB"]}
  }
}

JSON(output_config)

In [None]:
response = client.create_data_automation_project(
    projectName=project_name,
    projectDescription="Workbook to process Lending Applications",
    projectStage='LIVE',
    standardOutputConfiguration=output_config,
    customOutputConfiguration={
        'blueprints': [
            {
                'blueprintArn': blueprint_arn,
                'blueprintStage': 'LIVE'
            },
            {
                'blueprintArn': f'arn:aws:bedrock:{region_name}:aws:blueprint/bedrock-data-automation-public-w2-form',
                'blueprintStage': 'LIVE'
            },
            {
                'blueprintArn': f'arn:aws:bedrock:{region_name}:aws:blueprint/bedrock-data-automation-public-us-driver-license',
                'blueprintStage': 'LIVE'
            },
            {
                'blueprintArn': f'arn:aws:bedrock:{region_name}:aws:blueprint/bedrock-data-automation-public-us-bank-check',
                'blueprintStage': 'LIVE'
            },
            {
                'blueprintArn': f'arn:aws:bedrock:{region_name}:aws:blueprint/bedrock-data-automation-public-payslip',
                'blueprintStage': 'LIVE'
            },
            {
                'blueprintArn': f'arn:aws:bedrock:{region_name}:aws:blueprint/bedrock-data-automation-public-bank-statement',
                'blueprintStage': 'LIVE'
            },
        ]
    },
    # Enable the splitter so a multi-document PDF is divided into separate documents
    overrideConfiguration={'document': {'splitter': {'state': 'ENABLED'}}}
)

project_arn = response['projectArn']
JSON(response, expanded=False)

## Step 4: Process a Multi-Page Document Lending Package

A lending package is a single PDF file that contains multiple documents needed to apply for a loan. 

In [None]:
# Upload a package of documents to S3
file_name = 'documents/lending_package.pdf'
object_name = f'data_automation/input/{file_name}'
output_name = 'data_automation/output'
s3.upload_file(file_name, bucket_name, object_name)

IFrame("documents/lending_package.pdf", width=1000, height=500)

In [None]:
# Process the document package
response = run_client.invoke_data_automation_async(
    dataAutomationConfiguration = { "dataAutomationProjectArn" : project_arn,"stage" : 'LIVE'},
    inputConfiguration={'s3Uri':  f"s3://{bucket_name}/{object_name}"},
    outputConfiguration={'s3Uri': f"s3://{bucket_name}/{output_name}"},
    dataAutomationProfileArn = dataAutomationProfileArn
)

response


invoke_arn = response['invocationArn']
invoke_arn


In [None]:
# Poll until the job leaves the 'Created' / 'InProgress' states
while True:
    progress = run_client.get_data_automation_status(invocationArn=invoke_arn)
    if progress['status'] not in ('Created', 'InProgress'):
        break
    print(progress['status'])
    sleep(10)

print(progress['status'])

## Step 5: Display the results

BDA will automatically split the package into its individual documents and return the matched blueprint, along with the requested structured output, for each one.
Let's visualize these results by showing the first page of each detected document next to its inference results.
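
Alongside the visual display, a compact text summary of the split can be handy. The sketch below assumes each custom output carries a `matched_blueprint` entry with a `name` key and a `split_document` entry with zero-based `page_indices`; `summarize_segments` is a hypothetical helper and the sample data is made up.

```python
def summarize_segments(custom_outputs):
    """Return one summary row per detected document in the package."""
    rows = []
    for output in custom_outputs:
        pages = output["split_document"]["page_indices"]
        rows.append({
            "blueprint": output["matched_blueprint"]["name"],
            "first_page": pages[0] + 1,  # convert zero-based index to page number
            "page_count": len(pages),
        })
    return rows


# Made-up sample outputs, for illustration only.
sample_outputs = [
    {"matched_blueprint": {"name": "US-Driver-License"},
     "split_document": {"page_indices": [0]}},
    {"matched_blueprint": {"name": "Bank-Statement"},
     "split_document": {"page_indices": [1, 2]}},
]
for row in summarize_segments(sample_outputs):
    print(row)
```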

In [None]:

doc = pdfium.PdfDocument(file_name)
pages_pil = [page.render(scale=1.53).to_pil() for page in doc]

# get the job_metadata
job_json_obj = get_s3_to_dict(s3,progress['outputConfiguration']['s3Uri'])
results_meta = job_json_obj["output_metadata"][0]["segment_metadata"]

# put the results together and show with first page side by side
results_all = []
for result in results_meta:
    standard_output_obj = get_s3_to_dict(s3,result["standard_output_path"])
    custom_output_obj = get_s3_to_dict(s3,result["custom_output_path"])
    pages = custom_output_obj["split_document"]["page_indices"]
    w = display_image_jsons(pages_pil[pages[0]], [custom_output_obj['matched_blueprint'],custom_output_obj['inference_result']],["Matched Blueprint", "Inference Result"])
    results_all.append(w)    

widgets.VBox(results_all)


## Conclusion

We learned how to use BDA to extract structured outputs from complex documents by:
* creating a custom blueprint with a JSON schema and matching it against a specific document.
* creating a project with multiple blueprints that automatically splits and classifies the pages of a package and extracts the information requested by each blueprint.


## Step 6: Cleanup Resources

This step is needed before we run through the workbook a second time. 

In [None]:
# Delete the project
response = client.delete_data_automation_project(projectArn=project_arn)

# Delete the blueprint
response = client.delete_blueprint(blueprintArn=blueprint_arn)