# KB Integration

## Introduction

Amazon Bedrock Data Automation (BDA) lets you configure output based on your processing needs for a specific data type: images, documents, audio or video. BDA can generate standard output or custom output.

You can use standard outputs for all four modalities: images, documents, audio, and videos. BDA always provides a standard output response even if it's alongside a custom output response.

Standard outputs are modality-specific default structured insights, such as video summaries that capture key moments, visual and audible toxic content, explanations of document charts, graph figure data, and more. 

In this notebook we will explore the standard output for documents.

### Prerequisites

In [None]:
%pip install "boto3>=1.37.6" itables==2.2.4 PyPDF2==3.0.1 --upgrade -q

In [None]:
%load_ext autoreload
%autoreload 2

### Setup

Before we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [5]:
# Imports needed by this cell
import logging
import random

import boto3
import sagemaker

# Clients
suffix = random.randrange(200, 900)

session = sagemaker.Session()
bucket_name = session.default_bucket()

sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]

bucket_name_kb = f'bedrock-kb-{suffix}-1' # replace it with your first bucket name.
region_name = "us-west-2" 
region = region_name

s3_client = boto3.client('s3', region_name=region_name)

bda_client = boto3.client('bedrock-data-automation', region_name=region_name)
bda_runtime_client = boto3.client('bedrock-data-automation-runtime', region_name=region_name)

bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime') 

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [None]:
import boto3
import json
from IPython.display import JSON, display, IFrame, Markdown
import sagemaker
import pandas as pd
from itables import show
import time


session = sagemaker.Session()
default_bucket = session.default_bucket()
current_region = boto3.session.Session().region_name

sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']

# Initialize Bedrock Data Automation client
bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
s3_client = boto3.client('s3')

bda_s3_input_location = f's3://{default_bucket}/bda/input'
bda_s3_output_location = f's3://{default_bucket}/bda/output'

# Prepare sample document
For this lab, we use a `Monthly Treasury Statement for the United States Government` for Fiscal Year 2025 through November 30, 2024. The document is prepared by the Bureau of the Fiscal Service, Department of the Treasury and provides detailed information on the government's financial activities. We will extract a subset of pages from the `PDF` document and use BDA to extract and analyse the document content.

### Download and store sample document
We use the document URL to download the document and store it in an S3 location.

Note - We will configure BDA to use the sample input from this S3 location, so we need to ensure that BDA has `s3:GetObject` access to this S3 location. If you are running the notebook in your own AWS Account, ensure that the SageMaker Execution role configured for this JupyterLab app has the right IAM permissions.

In [None]:
from utils.helper_functions import (
    wait_for_job_to_complete,
    read_s3_object,
    download_document,
    get_bucket_and_key,
    create_image_html_column,
)

from pathlib import Path

# Download the document
document_url = "https://fiscaldata.treasury.gov/static-data/published-reports/mts/MonthlyTreasuryStatement_202411.pdf"

local_file_name = "data/documents/MonthlyTreasuryStatement_202411.pdf"
file_path_local = download_document(document_url, output_file_path=local_file_name)

# Upload the document to S3
file_name = Path(file_path_local).name
document_s3_uri = f'{bda_s3_input_location}/{file_name}'

target_s3_bucket, target_s3_key = get_bucket_and_key(document_s3_uri)
s3_client.upload_file(local_file_name, target_s3_bucket, target_s3_key)

print(f"Downloaded file to: {file_path_local}")
print(f"Uploaded file to S3: {target_s3_key}")
print(f"document_s3_uri: {document_s3_uri}")
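The `get_bucket_and_key` helper used above comes from the notebook's `utils` package, and its implementation is not shown here. As a rough sketch of what such a helper might do (the function name `split_s3_uri` and the example URI are assumptions for illustration, not the actual helper), an S3 URI can be split into bucket and key with the standard library:

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key').

    Hypothetical stand-in for the notebook's get_bucket_and_key helper.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The path component starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Example with a made-up bucket name
bucket, key = split_s3_uri("s3://example-bucket/bda/input/report.pdf")
print(bucket, key)
```

Validating the scheme up front makes a mistyped URI fail early, before any upload is attempted.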

- **[Response Granularity](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-output-documents.html#document-granularity)**
This setting indicates to BDA the kind of response you want to receive from document text extraction. Each level of granularity gives you more and more separated responses. The available granularity levels are:
  - **Page** - provides each page of the document in the text output (enabled by default)
  - **Element** - Provides the text of the document in the output format of your choice, separated into different elements such as figures, tables, or paragraphs (enabled by default)
  - **Word** - Provides you with each word and its location on the page


- **Output settings**
Output settings determine the structure of the results produced by BDA. The options for output settings are:
    - **JSON** - The result would be a JSON output file with the information from your configuration settings. This is the **default** for document analysis.    
    - **JSON+files** - The result would include a JSON output along with files that correspond with different outputs. For example, this setting gives you a text file for the overall text extraction, a markdown file for the text with structural markdown, and CSV files for each table that's found in the text.


- **Text Format**
Text format determines the different kinds of texts that will be provided via various extraction operations. You can select any number of the following options for your text format.

   - **Plaintext** – This setting provides a text-only output with no formatting or other markdown elements noted.
   - **Text with markdown** – The **default** output setting for standard output. Provides text with markdown elements integrated.    
   - **Text with HTML** – Provides text with HTML elements integrated in the response.    
   - **CSV** – Provides a CSV structured output for tables within the document. This will only give a response for tables, and not other elements of the document

- **Bounding Boxes**
    - With the Bounding Boxes option enabled, BDA outputs `Bounding Boxes` for elements in the document in the form of the coordinates of the four corners of each box. This helps in creating a visual outline of the element in the document.


- **Generative Fields**
   - When `Generative Fields` are enabled, BDA generates a 10-word summary and a 250-word description of the document in the output. Additionally, with Response Granularity enabled at the element level, BDA also generates a descriptive caption for each figure detected in the document. Figures include things like charts, graphs, and images.


Both Bounding Boxes and Generative Fields are **disabled by default**.
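To make the defaults described above concrete, here is a minimal sketch of a builder for the `document` part of a standard output configuration. The function name `build_document_config` is hypothetical. The defaults mirror the text above (page and element granularity, markdown text, JSON-only output, bounding boxes and generative fields disabled). The mapping of the "JSON+files" setting to `additionalFileFormat` and the use of `DISABLED` as the counterpart of `ENABLED` are assumptions based on the configuration keys used later in this notebook.

```python
def build_document_config(
    granularity=("PAGE", "ELEMENT"),
    text_formats=("MARKDOWN",),
    bounding_boxes=False,
    generative_fields=False,
    additional_files=False,
):
    """Build the 'document' part of a standard output configuration (sketch)."""

    def state(enabled):
        # Assumption: the API expects the strings "ENABLED" / "DISABLED".
        return {"state": "ENABLED" if enabled else "DISABLED"}

    return {
        "document": {
            "extraction": {
                "granularity": {"types": list(granularity)},
                "boundingBox": state(bounding_boxes),
            },
            "generativeField": state(generative_fields),
            "outputFormat": {
                "textFormat": {"types": list(text_formats)},
                # Assumption: "JSON+files" corresponds to additionalFileFormat.
                "additionalFileFormat": state(additional_files),
            },
        }
    }


default_like_config = build_document_config()
print(default_like_config["document"]["extraction"]["granularity"])
```

Calling the builder with `bounding_boxes=True` or `generative_fields=True` switches on a single option, which makes it easier to compare outputs one setting at a time.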


Now that we have looked at the default options, let's create a config that activates all the different types, so that we can see what the output looks like. We include the image, audio, and video types for illustrative purposes only.

In [None]:
standard_output_config =  {
  "document": {
    "extraction": {
      "granularity": {"types": ["DOCUMENT","PAGE", "ELEMENT","LINE","WORD"]},
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {"state": "ENABLED"},
    "outputFormat": {
      "textFormat": {"types": ["PLAIN_TEXT", "MARKDOWN", "HTML", "CSV"]},
      "additionalFileFormat": {"state": "ENABLED"}
    }
  },
  "image": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": ["CONTENT_MODERATION","TEXT_DETECTION"]
      },
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ["IMAGE_SUMMARY","IAB"]
    }
  },
  "video": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": ["CONTENT_MODERATION","TEXT_DETECTION", "TRANSCRIPT"]
      },
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ["VIDEO_SUMMARY", "CHAPTER_SUMMARY","IAB"]
    }
  },
  "audio": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": ['AUDIO_CONTENT_MODERATION', 'TOPIC_CONTENT_MODERATION', 'TRANSCRIPT']
      }
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ['AUDIO_SUMMARY', 'TOPIC_SUMMARY', 'IAB']
    }
  }
}

# JSON(standard_output_config["document"], expanded=True)
JSON(standard_output_config, expanded=False)

# Create project with standard output config

To use a standard output configuration, we create a project and pass it the standard output config defined above. For an overview of all the available parameters for project creation, see the [create project documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-data-automation/client/create_data_automation_project.html).

In [None]:
project_name= "my_bda_project"

# delete project if it already exists
projects_existing = [project for project in bda_client.list_data_automation_projects()["projects"] if project["projectName"] == project_name]
if len(projects_existing) > 0:
    print(f"Deleting existing project: {projects_existing[0]}")
    bda_client.delete_data_automation_project(projectArn=projects_existing[0]["projectArn"])
    time.sleep(1) # nosemgrep
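The lookup above only inspects the first page of results returned by `list_data_automation_projects`. If the account holds many projects, a paginated lookup may be needed. The following is a sketch under the assumption that the response carries a `nextToken` field when more pages exist and that the call accepts it back as `nextToken`; the function name `find_project_arns` is hypothetical.

```python
def find_project_arns(bda_client, project_name):
    """Return the ARNs of all projects with the given name, across all pages."""
    arns = []
    kwargs = {}
    while True:
        page = bda_client.list_data_automation_projects(**kwargs)
        arns.extend(
            project["projectArn"]
            for project in page.get("projects", [])
            if project.get("projectName") == project_name
        )
        next_token = page.get("nextToken")
        if not next_token:
            # No further pages: the search is complete.
            return arns
        kwargs["nextToken"] = next_token
```

With the client from the setup cell this could be called as `find_project_arns(bda_client, project_name)`; an empty list means there is nothing to delete.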

In [None]:
response = bda_client.create_data_automation_project(
    projectName=project_name,
    projectDescription="project to get our extended standard output",
    projectStage='LIVE',
    standardOutputConfiguration=standard_output_config    
)
project_arn = response["projectArn"]
time.sleep(1) # nosemgrep
JSON(response)

# Invoke data automation async

In [None]:
print(f"Invoking bda - input: {document_s3_uri}")
print(f"Invoking bda - output: {bda_s3_output_location}")

response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': document_s3_uri
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    dataAutomationConfiguration={
        'dataAutomationProjectArn': project_arn,
        'stage': 'LIVE'
    },
    dataAutomationProfileArn = f'arn:aws:bedrock:{current_region}:{account_id}:data-automation-profile/us.data-automation-v1'
)

invocationArn = response['invocationArn']

### Get data automation job status

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` API. This API takes the invocation ARN returned by the `InvokeDataAutomationAsync` operation above.

The invocation job status moves from `Created` to `InProgress` and finally to `Success` when the job completes successfully; the response then includes the S3 location of the results. If the job encounters an error, the final status is either `ServiceError` or `ClientError`, with error details.
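The notebook relies on the `wait_for_job_to_complete` helper, whose implementation is not shown. As a hedged sketch of the polling pattern described above (the function name `wait_for_status`, its parameters, and the timeout behaviour are assumptions, not the helper's actual code), a loop can keep requesting the status until the job leaves the `Created` and `InProgress` states:

```python
import time


def wait_for_status(get_status, invocation_arn, delay_seconds=10,
                    max_attempts=60, sleep=time.sleep):
    """Poll a status function until the job reaches a final state.

    get_status is any callable accepting invocationArn=..., for example
    bda_runtime_client.get_data_automation_status.
    """
    for _ in range(max_attempts):
        response = get_status(invocationArn=invocation_arn)
        if response["status"] not in ("Created", "InProgress"):
            # Success, ServiceError, or ClientError: hand the response back.
            return response
        sleep(delay_seconds)
    raise TimeoutError(
        f"Job {invocation_arn} still running after {max_attempts} polls"
    )
```

Passing the status function and the `sleep` function as arguments keeps the loop testable without AWS access; in the notebook it could be called as `wait_for_status(bda_runtime_client.get_data_automation_status, invocationArn)`.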

In [None]:
status_response = wait_for_job_to_complete(invocationArn=invocationArn)
if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
else:
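    # get_data_automation_status reports failures under 'errorType' / 'errorMessage';
    # .get() avoids a KeyError if the helper returns a differently shaped response.
    raise Exception(f"Invocation Job Error, error_type={status_response.get('errorType')}, error_message={status_response.get('errorMessage')}")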

### Retrieve job metadata

Let's retrieve and explore the job metadata response.
It will contain a field `standard_output_path` where the results have been saved.

In [None]:
job_metadata = json.loads(read_s3_object(job_metadata_s3_location))
JSON(job_metadata,root='job_metadata',expanded=True)

# Explore standard output results

We can now explore the standard output received from processing documents using Data Automation. 

Based on the standard output configuration we used above, the result can contain the following fields:
* **metadata**
* **document**
* **pages**
* **elements**
* **text_lines**
* **text_words**

We will review each of these fields in the sections below.

First, let's download and parse the standard output JSON file, whose location we received in the job metadata.

In [None]:
asset_id=0
standard_output_path = next(item["segment_metadata"][0]["standard_output_path"] 
                                for item in job_metadata["output_metadata"] 
                                if item['asset_id'] == asset_id)
standard_output = json.loads(read_s3_object(standard_output_path))
JSON(standard_output)
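The cell above picks the first segment of a single asset. As a small sketch, the same fields (`output_metadata`, `asset_id`, `segment_metadata`, `standard_output_path`) can be walked to list every standard output path in the job metadata. The function name and the sample metadata below are made up for illustration and only mimic the shape accessed above.

```python
def list_standard_output_paths(job_metadata):
    """Return (asset_id, segment_index, standard_output_path) for every segment."""
    results = []
    for asset in job_metadata.get("output_metadata", []):
        for index, segment in enumerate(asset.get("segment_metadata", [])):
            path = segment.get("standard_output_path")
            if path:  # skip segments without a standard output
                results.append((asset.get("asset_id"), index, path))
    return results


# Hypothetical metadata with the same shape as the fields accessed above.
sample_metadata = {
    "output_metadata": [
        {
            "asset_id": 0,
            "segment_metadata": [
                {"standard_output_path": "s3://example-bucket/bda/output/job/0/result.json"}
            ],
        }
    ]
}
print(list_standard_output_paths(sample_metadata))
```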

### metadata
The metadata section in the response provides an overview of the metadata associated with the document. This includes the S3 bucket and key for the input document. The metadata also contains the modality that was selected for your response, the number of pages processed, and the start and end page indices.

In [None]:
JSON(standard_output['metadata'],root='metadata',expanded=True)

### document
The document section of the standard output provides document-level granularity information, that is, an analysis of the document as a whole that surfaces key pieces of information.

By default, the document-level granularity includes statistics on the actual content of the document, such as the number of semantic elements, figures, words, and lines. Because our project enables generative fields, the document section should also include the generated summary and description described earlier.

In [None]:
df_document = pd.json_normalize(standard_output["document"])

df = df_document.T
pd.set_option('display.max_colwidth', 200)
df

### pages
With page-level granularity (enabled by default), the text of each page is consolidated and listed in the pages section, with one item per page. Each page entity in the standard output includes the page index. The individual page entities also include statistics on the actual content of the page, such as the number of semantic elements, figures, words, and lines. The asset metadata represents the page bounds using the coordinates of the four corners.

Below, we look at a snippet of the output pertaining to a specific page.

In [None]:
df_pages = pd.json_normalize(standard_output["pages"])
pd.reset_option('display.max_colwidth')  
df_pages.loc[3].T
# show(df_pages.loc[3].T, classes="compact")

In [None]:
JSON(standard_output['pages'][8],root='pages[8]',expanded=False)

In [None]:
# Retrieve the markdown formatted text
pages_md = [page["representation"]["markdown"] for page in standard_output['pages']]
display(Markdown(pages_md[4]))
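The page bounds mentioned above are given as the coordinates of four corners. A minimal sketch of turning four corner points into a left/top/width/height rectangle follows. The exact key names and coordinate order in the BDA output are not shown in this notebook, so the input format (a list of four `(x, y)` pairs), the output keys, and the sample values are assumptions.

```python
def corners_to_box(corners):
    """Convert four (x, y) corner points into left/top/width/height."""
    if len(corners) != 4:
        raise ValueError("expected exactly four corner points")
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    left, top = min(xs), min(ys)
    return {
        "left": left,
        "top": top,
        "width": max(xs) - left,
        "height": max(ys) - top,
    }


# Made-up corner coordinates for a page measured in pixels
page_corners = [(0, 0), (1700, 0), (1700, 2200), (0, 2200)]
print(corners_to_box(page_corners))
```

Using `min` and `max` rather than fixed positions means the result does not depend on the order in which the corners are listed.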

### elements
The elements section contains the various semantic elements extracted from the document, including text content, tables, and figures. The text and figure entities are further sub-classified, for example TITLE/SECTION_TITLE for text or Chart for figures.
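Before filtering by type, it can help to get a quick tally of what was extracted. The sketch below counts elements per type and sub-type with `collections.Counter`. The `sub_type` key name and the sample elements are assumptions for illustration; the `type` key matches the filters used in the cells that follow.

```python
from collections import Counter


def count_element_types(elements):
    """Count elements per (type, sub_type) pair; sub_type may be missing."""
    return Counter((el.get("type"), el.get("sub_type")) for el in elements)


# Made-up elements that only mimic the classification described above
sample_elements = [
    {"type": "TEXT", "sub_type": "TITLE"},
    {"type": "TEXT", "sub_type": "SECTION_TITLE"},
    {"type": "TEXT", "sub_type": "SECTION_TITLE"},
    {"type": "FIGURE", "sub_type": "CHART"},
    {"type": "TABLE"},
]
for (element_type, sub_type), count in count_element_types(sample_elements).items():
    print(element_type, sub_type, count)
```

In the notebook the same function could be applied to `standard_output["elements"]`.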

#### TEXT elements

In [None]:
# Filter dataframe for text elements
df_elements = pd.json_normalize(standard_output["elements"])
df_text = df_elements[df_elements["type"] == "TEXT"]

# Display formatted dataframe
show(
    df_text.iloc[:50, 2:8],
    columnDefs=[
        {"width": "280px", "targets": [4, 5]},
        {"width": "150px", "targets": [3]},
        {"className": "dt-left", "targets": "_all"}
    ],
    style="width:1200px",
    autoWidth=False,
    classes = "compact",
    showIndex=False
)

In [None]:
JSON(standard_output['elements'][5],root='elements[5]')

#### FIGURE elements

In [None]:
# Filter dataframe for figure elements
df_elements = pd.json_normalize(standard_output["elements"])
# .copy() gives an independent frame, so insert() below cannot raise a SettingWithCopyWarning
df_figure = df_elements[df_elements["type"] == "FIGURE"].copy()

embedded_images = df_figure.apply(lambda row: create_image_html_column(row, "crop_images", "200px"), axis=1)
df_figure.insert(6, 'image', embedded_images)

# Display formatted dataframe
show(
    df_figure.iloc[:, 2:9],
    columnDefs=[                
        {"width": "120px", "targets": [0,1,3]},          
        {"width": "220px", "targets": [2,4]},
        {"width": "280px", "targets": [5]},        
        {"width": "480px", "targets": [6]},        
        {"className": "dt-left", "targets": "_all"}
    ],
    style="width:1200px",
    autoWidth=False,
    classes="compact",
    showIndex=False,
    # column_filters="header"
)

In [None]:
time.sleep(2) # nosemgrep
JSON([el for el in standard_output["elements"] if el["type"] == "FIGURE"])

#### TABLE elements

In [None]:
# Filter dataframe for table elements
df_elements = pd.json_normalize(standard_output["elements"])
# .copy() gives an independent frame, so insert() below cannot raise a SettingWithCopyWarning
df_table = df_elements[df_elements["type"] == "TABLE"].copy()

embedded_images = df_table.apply(lambda row: create_image_html_column(row, "crop_images", "500px"), axis=1)
df_table.insert(6, 'image', embedded_images)
cols = ["type","locations","image", 
        #'representation.text', 'representation.markdown', 
        'representation.html','title', 'summary', 'footers', 'headers', 'csv_s3_uri',
       'representation.csv']
# Display formatted dataframe
show(
    df_table[cols],
    columnDefs=[                
        {"width": "120px", "targets": [0,1]},   
        {"width": "340px", "targets": [2]},  
        {"width": "380px", "targets": [3]},
        {"width": "150px", "targets": [5,6,7,8]},        
        {"className": "dt-left", "targets": "_all"}
    ],
    # style="width:1200px",
    # autoWidth=True,
    classes="compact",
    showIndex=False,
    scrollY="400"    
)

In [None]:
JSON([el for el in standard_output["elements"] if el["type"] == "TABLE"][2], root="sample_table")
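Because the CSV text format was enabled, table elements carry a CSV representation (the `representation.csv` column shown above). A short sketch of parsing that string into rows with the standard `csv` module follows; the function name and the sample table content are made up for illustration.

```python
import csv
import io


def table_csv_to_rows(table_element):
    """Parse a table element's CSV representation into a list of row lists."""
    csv_text = table_element.get("representation", {}).get("csv", "")
    # csv.reader handles quoted cells that contain commas.
    return list(csv.reader(io.StringIO(csv_text)))


# Made-up table element; real content depends on the processed document
sample_table = {
    "type": "TABLE",
    "representation": {"csv": 'Category,Amount\n"Receipts, total",100\nOutlays,250\n'},
}
for row in table_csv_to_rows(sample_table):
    print(row)
```

Using `csv.reader` instead of splitting on commas keeps cells such as `"Receipts, total"` intact.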

### text_lines elements

In [None]:
JSON(standard_output["text_lines"][:10], root="text_lines")

In [None]:
df = pd.json_normalize(standard_output["text_lines"])
show(df, classes="compact")

### text_words elements

In [None]:
JSON(standard_output["text_words"][3:4], root="text_words[3:4]", expanded=True)

## Conclusion

We explored the standard output of BDA for documents, which is configurable and gives us detailed insights into a document and its structure, such as headers, sections, paragraphs, tables, figures, and charts.

BDA not only detects these elements but also interprets them, e.g. by describing a figure or by extracting the values depicted in a chart into a structured table. This structured output is very powerful.

This allows

## Clean Up
When you are done, uncomment the lines of code in the following cell and execute it to remove the sample file(s) and the BDA output.

In [None]:
#import os
#from pathlib import Path

## Delete S3 File
#s3_client.delete_object(Bucket=target_s3_bucket, Key=target_s3_key)

## Delete local file
#if os.path.exists(local_file_name):
#    os.remove(local_file_name)

## Delete bda job output
#bda_s3_job_location = str(Path(job_metadata_s3_location).parent).replace("s3:/","s3://")
#!aws s3 rm {bda_s3_job_location} --recursive
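In the commented clean-up code, `Path(...).parent` collapses `s3://` to `s3:/`, which is why the `replace` call is needed. As an alternative sketch that avoids the round trip through `pathlib` (the function name `parent_s3_uri` and the example URI are hypothetical), the parent prefix can be derived with plain string handling:

```python
def parent_s3_uri(s3_uri):
    """Return the S3 'folder' URI that contains the given object URI."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    bucket_and_key = s3_uri[len(prefix):].rstrip("/")
    if "/" not in bucket_and_key:
        raise ValueError("URI has no key component to strip")
    # Drop the last path segment (the object name) and restore the scheme.
    return prefix + bucket_and_key.rsplit("/", 1)[0]


# Example with a made-up job location
print(parent_s3_uri("s3://example-bucket/bda/output/job-id/job_metadata.json"))
```

The result could then be used in place of `bda_s3_job_location` in the commented `aws s3 rm` command.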