# How Bedrock Data Automation works

Bedrock Data Automation (BDA) lets you configure output based on your processing needs for a specific data type: documents, images, video or audio. BDA can generate standard output or custom output. Below are some key concepts for understanding how BDA works. If you're a new user, start with the information about standard output.

* **Standard output** – Sending a file to BDA with no other information returns the default standard output, which consists of commonly required information based on the data type. Examples include audio transcriptions, scene summaries for video, and document summaries. You can tune these outputs to your use case by modifying them through projects. For more information, see for example [Standard output for documents in Bedrock Data Automation](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-output-documents.html).

* **Custom output** – For documents and images only. Choose custom output to define exactly what information you want to extract using a blueprint. A blueprint consists of a list of expected fields that you want retrieved from a document or image. Each field represents a piece of information that needs to be extracted to meet your specific use case. You can create your own blueprints, or select predefined blueprints from the BDA blueprint catalog. For more information, see [Custom output and blueprints](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-custom-output-idp.html).

* **Projects** – A project is a BDA resource that allows you to modify and organize output configurations. Each project can contain standard output configurations for documents, images, video, and audio, as well as custom output blueprints for documents and images. Projects are referenced in the `InvokeDataAutomationAsync` API call to instruct BDA on how to process the files. For more information about projects and their use cases, see [Bedrock Data Automation projects](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-projects.html).

In this notebook, we will see how to get started with the BDA API for document processing use cases. Amazon Bedrock Data Automation (BDA) provides a streamlined API workflow for processing your data. For all modalities, this workflow consists of three main steps: creating a project, invoking the analysis, and retrieving the results. To retrieve custom output for your processed data, you provide the blueprint ARN when you invoke the analysis operation.

## Prerequisites

### Configure IAM Permissions

The features being explored in the workshop require multiple IAM Policies for the role being used. If you're running this notebook within SageMaker Studio in your own Account, update the default execution role for the SageMaker user profile to include the IAM policies described in [README.md](../README.md).

### Install Required Libraries

In [None]:
%pip install --no-warn-conflicts "boto3>=1.37.6" itables==2.2.4 PyPDF2==3.0.1 --upgrade -q

In [None]:
from utils.helper_functions import restart_kernel
restart_kernel()

In [None]:
%load_ext autoreload
%autoreload 2

### Setup

Before we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [None]:
import boto3
import json
from IPython.display import JSON, IFrame
import sagemaker
from utils.helper_functions import read_s3_object, wait_for_job_to_complete, get_bucket_and_key
from pathlib import Path
import os

session = sagemaker.Session()
default_bucket = session.default_bucket()
current_region = boto3.session.Session().region_name

sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']

# Initialize Bedrock Data Automation client
bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
s3_client = boto3.client('s3')

bda_s3_input_location = f's3://{default_bucket}/bda/input'
bda_s3_output_location = f's3://{default_bucket}/bda/output'

print(f"My BDA output s3 URI: {bda_s3_output_location}")

## Prepare sample document
For this lab, we use a sample `Bank Statement` for Fiscal Year 2025 through November 30, 2024. The document is prepared by the Bureau of the Fiscal Service, Department of the Treasury and provides detailed information on the government's financial activities. We will extract a subset of pages from the `PDF` document and use BDA to extract and analyse the document content.

### Download and store sample document
We use the document URL to download the document and store it in an S3 location.

Note - We will configure BDA to use the sample input from this S3 location, so we need to ensure that BDA has `s3:GetObject` access to this S3 location. If you are running the notebook in your own AWS Account, ensure that the SageMaker Execution role configured for this JupyterLab app has the right IAM permissions.

In [None]:
local_download_path = "data/documents/"
local_file_name = "BankStatement.jpg"
# os.path.join avoids the doubled slash caused by the trailing "/" in local_download_path
file_path_local = os.path.join(local_download_path, local_file_name)
os.makedirs(local_download_path, exist_ok=True)

# Download Sample file
#(bucket, key) = get_bucket_and_key(document_url)
#response = s3_client.download_file(bucket, key, file_path_local)

# Upload the document to S3
document_s3_uri = f'{bda_s3_input_location}/{local_file_name}'

target_s3_bucket, target_s3_key = get_bucket_and_key(document_s3_uri)
s3_client.upload_file(file_path_local, target_s3_bucket, target_s3_key)

print(f"Local file: {file_path_local}")
print(f"Uploaded file to S3: {target_s3_key}")
print(f"document_s3_uri: {document_s3_uri}")
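
The `get_bucket_and_key` helper used above is imported from the workshop's `utils.helper_functions` module, and its body isn't shown in this notebook. As a rough sketch, assuming it simply splits an `s3://bucket/key` URI into its two parts, it might look like the following. The name `split_s3_uri` is hypothetical and chosen to avoid implying this is the workshop's actual implementation.

```python
def split_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # Everything up to the first "/" is the bucket; the remainder is the object key.
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket name: {s3_uri!r}")
    return bucket, key


bucket, key = split_s3_uri("s3://my-bucket/bda/input/BankStatement.jpg")
print(bucket, key)  # my-bucket bda/input/BankStatement.jpg
```

The resulting pair is what `s3_client.upload_file` expects as its `Bucket` and `Key` arguments.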

### View Sample Document

In [None]:
IFrame(file_path_local, width=600, height=400)

## Using BDA for standard output

Sending a file (for example, a document) to BDA with no other information through the [`InvokeDataAutomationAsync` API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-data-automation-runtime/client/invoke_data_automation_async.html) looks as follows.

BDA processes the file provided in `inputConfiguration` and writes the output to the S3 URI given in `outputConfiguration`.

```python
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': 's3://bedrock-data-automation-prod-assets-us-west-2/demo-assets/Document/BankStatement.jpg'
    },
    outputConfiguration={
        's3Uri': 's3://my_output'
    },
)
```

### Invoking BDA for standard output

In [None]:
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri': document_s3_uri},
    outputConfiguration={'s3Uri': bda_s3_output_location},
    # Data automation profile in the caller's own account and region
    dataAutomationProfileArn=f'arn:aws:bedrock:{current_region}:{account_id}:data-automation-profile/us.data-automation-v1',
    dataAutomationConfiguration={
        # The public default project produces the default standard output
        'dataAutomationProjectArn': f'arn:aws:bedrock:{current_region}:aws:data-automation-project/public-default',
    },
)
JSON(response)
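
The two ARNs in the call above follow a fixed pattern: the profile ARN embeds your own account ID, while the public default project lives under the `aws` account. A small sketch that assembles the same keyword arguments from a region and account ID can make this easier to reuse. The function name `build_standard_output_request` and the sample account ID are hypothetical; the ARN formats are copied from the cell above.

```python
def build_standard_output_request(region, account_id, input_s3_uri, output_s3_uri):
    """Assemble keyword arguments for invoke_data_automation_async (standard output)."""
    return {
        "inputConfiguration": {"s3Uri": input_s3_uri},
        "outputConfiguration": {"s3Uri": output_s3_uri},
        # Profile ARN: scoped to the caller's account
        "dataAutomationProfileArn": (
            f"arn:aws:bedrock:{region}:{account_id}:"
            "data-automation-profile/us.data-automation-v1"
        ),
        "dataAutomationConfiguration": {
            # Public default project: owned by the "aws" account
            "dataAutomationProjectArn": (
                f"arn:aws:bedrock:{region}:aws:data-automation-project/public-default"
            ),
        },
    }


request = build_standard_output_request(
    "us-west-2",
    "123456789012",  # placeholder account ID
    "s3://my-bucket/bda/input/BankStatement.jpg",
    "s3://my-bucket/bda/output",
)
print(request["dataAutomationProfileArn"])
```

With a real runtime client, the dictionary can be passed as `bda_runtime_client.invoke_data_automation_async(**request)`.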

### Get data automation job status

In [None]:
status_response = wait_for_job_to_complete(invocationArn=response["invocationArn"])
JSON(status_response)
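
`wait_for_job_to_complete` also comes from `utils.helper_functions`, so its body isn't visible here. A minimal sketch of what such a helper might do, assuming it polls `get_data_automation_status` on the runtime client until the job leaves the `Created` and `InProgress` states, is shown below. The function name `wait_for_invocation`, the polling interval, and the attempt limit are illustrative choices, not the workshop's implementation.

```python
import time

# Statuses after which the job will not change any more
TERMINAL_STATUSES = {"Success", "ServiceError", "ClientError"}


def wait_for_invocation(runtime_client, invocation_arn, poll_seconds=5.0, max_attempts=60):
    """Poll the job status until it reaches a terminal state, then return the response."""
    for _ in range(max_attempts):
        response = runtime_client.get_data_automation_status(invocationArn=invocation_arn)
        if response["status"] in TERMINAL_STATUSES:
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {invocation_arn} did not finish after {max_attempts} polls")
```

Passing the client in as an argument keeps the function easy to exercise with a stub; in the notebook you would call it as `wait_for_invocation(bda_runtime_client, response["invocationArn"])`.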

### Retrieve job metadata

In [None]:
job_metadata_s3 = status_response["outputConfiguration"]["s3Uri"]
print(f"Retrieving job metadata: {job_metadata_s3}")
job_metadata = json.loads(read_s3_object(job_metadata_s3))

JSON(job_metadata,root='job_metadata',expanded=True)
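
The `read_s3_object` helper is likewise defined outside this notebook. Assuming it fetches an object by its S3 URI and returns the body as text, a sketch could look like this. The name `read_s3_text` and the explicit `s3_client` parameter are hypothetical; `get_object` and the streaming `Body` are standard boto3 S3 behaviour.

```python
def read_s3_text(s3_client, s3_uri, encoding="utf-8"):
    """Download the object behind an s3:// URI and return its content as a string."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    bucket, _, key = s3_uri[len("s3://"):].partition("/")
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # "Body" is a streaming object; read() returns the raw bytes
    return response["Body"].read().decode(encoding)
```

In the notebook this would be used as `json.loads(read_s3_text(s3_client, job_metadata_s3))`.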

### Get job results for standard output

The standard output will contain the following fields:

* metadata: Contains simple document metadata such as location and number of pages
* document: Contains document statistics on the number of elements, tables, and figures
* pages: Contains a markdown version of each page
* elements: Contains details and references to text blocks, figures, tables, charts, etc.

Note that the standard output can be configured to contain much more information about the document structure, or descriptions of figures, charts, etc. We will explore this in the next notebook.

In [None]:
standard_output_path = job_metadata["output_metadata"][0]["segment_metadata"][0]["standard_output_path"]
print(f"Retrieving the job results from: {standard_output_path}")
standard_output = json.loads(read_s3_object(standard_output_path))
JSON(standard_output, root="standard_output")
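
To get a quick feel for the result without expanding the JSON tree, you can summarise the `pages` and `elements` fields. The sketch below assumes that each element carries a `type` key and that each page stores its markdown under `representation.markdown`; these key names are assumptions about the output schema, so check them against your own `standard_output` before relying on them. The function name is hypothetical.

```python
from collections import Counter


def summarize_standard_output(standard_output):
    """Return page count, element counts per type, and the markdown of each page."""
    pages = standard_output.get("pages", [])
    elements = standard_output.get("elements", [])
    return {
        "page_count": len(pages),
        # Assumed key: each element has a "type" such as TEXT, TABLE or FIGURE
        "element_types": dict(Counter(e.get("type", "UNKNOWN") for e in elements)),
        # Assumed key path: page["representation"]["markdown"]
        "markdown_by_page": [p.get("representation", {}).get("markdown", "") for p in pages],
    }


sample = {
    "pages": [{"representation": {"markdown": "# Statement"}}],
    "elements": [{"type": "TEXT"}, {"type": "TABLE"}, {"type": "TEXT"}],
}
print(summarize_standard_output(sample))
```

Missing keys fall back to empty values, so the function does not raise if the output was configured without markdown.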

## Using BDA for custom outputs with blueprints

We can also provide a list of blueprints to be used when invoking BDA through the `InvokeDataAutomationAsync` API.
BDA will match the document against the blueprints and extract or derive structured insights based on the blueprint definitions.

We will see how this works in more detail in the follow-up notebooks. Here we provide just a high-level overview of how it can be used, for example in the `us-east-1` region.

```python
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': 's3://bedrock-data-automation-prod-assets-us-east-1/demo-assets/Document/BankStatement.jpg'
    },
    outputConfiguration={
        's3Uri': 's3://my_output'
    },
    dataAutomationProfileArn = f'arn:aws:bedrock:{current_region}:{account_id}:data-automation-profile/us.data-automation-v1',
    blueprints=[
        {
            'blueprintArn': 'arn:aws:bedrock:us-east-1:aws:blueprint/bedrock-data-automation-public-bank-statement',
        },
    ],
)
```
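
Once such a job finishes, the job metadata is the place to look for the result location, just as `standard_output_path` was read from `segment_metadata` earlier. The sketch below walks every asset and segment and collects the paths stored under a given key. The key name `custom_output_path` is an assumption made by analogy with `standard_output_path`; verify it against the job metadata of a real blueprint run. The function name is hypothetical.

```python
def collect_output_paths(job_metadata, key="custom_output_path"):
    """Collect the S3 paths stored under `key` across all assets and segments."""
    paths = []
    for asset in job_metadata.get("output_metadata", []):
        for segment in asset.get("segment_metadata", []):
            # Segments without a matching blueprint may not carry the key at all
            if key in segment:
                paths.append(segment[key])
    return paths


sample_metadata = {
    "output_metadata": [
        {
            "segment_metadata": [
                {
                    "standard_output_path": "s3://my_output/standard/result.json",
                    "custom_output_path": "s3://my_output/custom/result.json",
                }
            ]
        }
    ]
}
print(collect_output_paths(sample_metadata))
print(collect_output_paths(sample_metadata, key="standard_output_path"))
```

Iterating over all segments matters once document splitting is enabled, because a single input file can then yield several segments.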

## Using projects with custom output and standard output

A data automation project lets you bundle multiple configurations together so that they can be consumed as a single unit.
In particular, it lets you:

* extend the standard output by defining the granularity and types of insights using `standardOutputConfiguration`
* define a list of blueprints using `customOutputConfiguration`
* activate document splitting using `overrideConfiguration`


### Creating a data automation project

The following preview shows how we can create a data automation project using the boto3 client.

```python
import boto3

# Projects are created with the 'bedrock-data-automation' client, not the runtime client
client = boto3.client('bedrock-data-automation')
response = client.create_data_automation_project(
    projectName='my-project-name',
    projectDescription='my description',
    projectStage='LIVE',
    standardOutputConfiguration={
        "document": {
            "extraction": {
                "granularity": {"types": ["DOCUMENT", "PAGE", "ELEMENT", "LINE", "WORD"]},
                "boundingBox": {"state": "ENABLED"}
            },
            "generativeField": {"state": "ENABLED"},
            "outputFormat": {
                "textFormat": {"types": ["PLAIN_TEXT", "MARKDOWN", "HTML", "CSV"]},
                "additionalFileFormat": {"state": "ENABLED"}
            }
        },
        "image": {...},
        "video": {...},
        "audio": {...}
    },
    customOutputConfiguration={
        'blueprints': [
            {
                'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-bank-statement'                
            },
        ]
    },
    overrideConfiguration={
        'document': {
            'splitter': {
                'state': 'ENABLED'
            }
        }
    },
)
```
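
The response of `create_data_automation_project` includes a `projectArn` and a `status`, and a new project may not be usable immediately. A minimal sketch for waiting until the project is ready, assuming `get_data_automation_project` reports `IN_PROGRESS` while the project is still being set up, could look like this. The function name `wait_for_project`, the polling interval, and the attempt limit are illustrative.

```python
import time


def wait_for_project(bda_client, project_arn, poll_seconds=2.0, max_attempts=30):
    """Poll a data automation project until it is no longer IN_PROGRESS."""
    for _ in range(max_attempts):
        project = bda_client.get_data_automation_project(projectArn=project_arn)["project"]
        if project["status"] != "IN_PROGRESS":
            # Either COMPLETED or FAILED; the caller decides how to react
            return project
        time.sleep(poll_seconds)
    raise TimeoutError(f"Project {project_arn} still in progress after {max_attempts} polls")
```

A typical call would be `wait_for_project(client, response["projectArn"])`, followed by a check that the returned `status` is `COMPLETED` before invoking the project.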

### Invoking a data automation project

We can now invoke a data automation project with an input file using the `InvokeDataAutomationAsync` API and by providing the previously created project ARN.

```python
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': 's3://bedrock-data-automation-prod-assets-us-west-2/demo-assets/Document/BankStatement.jpg'
    },
    outputConfiguration={
        's3Uri': 's3://my_output'
    },
    dataAutomationConfiguration={
        'dataAutomationProjectArn': 'arn:aws:bedrock:us-west-2:123456789101:data-automation-project/0644799db368',
    }
)
```

In the next modules we will explore these approaches in more detail.