# Batch Registration Notebook

> The notebook can be run in a cloud environment (e.g. an EC2 instance, a SageMaker notebook, or a Studio notebook) with IAM permissions to operate on Bedrock, S3, DynamoDB, CloudFormation, and STS. For more details, please refer to the Reference section.

The purpose of this notebook is to generate synthetic data, upload it to S3, and create the batch registration records in DynamoDB.

***Hold on***, what problem are we trying to solve? If you are not sure, please have a look at [The Problem](../README.md#the-problem).

## Before running the notebook

Please refer to the [README.md](../solution/README.md) to set up the AWS environment and deploy the stack.


## Setup

If you are running in a local environment, please set up your AWS credentials first, e.g.:

```bash
aws configure
```


In [None]:
%env AWS_DEFAULT_REGION=us-east-1

# uncomment the following line if you are running in a local environment, and set the correct profile name
# %env AWS_PROFILE=default

In [2]:
import boto3
import json

from pprint import pprint

In [3]:
bedrock = boto3.client('bedrock')

bedrock_runtime = boto3.client('bedrock-runtime')

model_id = 'amazon.titan-embed-text-v2:0'

Sample text content is sourced from [What can I do with Amazon Bedrock?](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html#servicename-feature-overview)

In [4]:
# sample text used as the word pool for synthetic data generation.
free_text = """
What can I do with Amazon Bedrock?

You can use Amazon Bedrock to do the following:

Experiment with prompts and configurations – Submit prompts and generate responses with model inference by sending prompts using different configurations and foundation models to generate responses. You can use the API or the text, image, and chat playgrounds in the console to experiment in a graphical interface. When you're ready, set up your application to make requests to the InvokeModel APIs.

Augment response generation with information from your data sources – Create knowledge bases by uploading data sources to be queried in order to augment a foundation model's generation of responses.

Create applications that reason through how to help a customer – Build agents that use foundation models, make API calls, and (optionally) query knowledge bases in order to reason through and carry out tasks for your customers.

Adapt models to specific tasks and domains with training data – Customize an Amazon Bedrock foundation model by providing training data for fine-tuning or continued-pretraining in order to adjust a model's parameters and improve its performance on specific tasks or in certain domains.

Improve your FM-based application's efficiency and output – Purchase Provisioned Throughput for a foundation model in order to run inference on models more efficiently and at discounted rates.

Determine the best model for your use case – Evaluate outputs of different models with built-in or custom prompt datasets to determine the model that is best suited for your application.

Prevent inappropriate or unwanted content – Use guardrails to implement safeguards for your generative AI applications.

Optimize your FM's latency – Get faster response times and improved responsiveness for AI applications with Latency-optimized inference for foundation models.


"""

### Synthesize data


In [None]:
# define the sample text length
sample_text_length = 150

# split the free text into words
words = free_text.split()
print(f"words count: {len(words)}")

# assert the words count is greater than the sample text length
assert len(words) > sample_text_length


In [None]:
import random

# define the record count
record_count = 2500

# initialize the records list
records = []

# generate the records
for i in range(record_count):
    random_words = random.sample(words, sample_text_length)
    random_text = ' '.join(random_words)
    records.append(random_text)

print(f"records count for a data file: {len(records)}")
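
`random.sample` draws without replacement and raises `ValueError` when asked for more items than the population holds, which is why the word count is asserted before sampling. The following is a minimal sketch of the same sampling step with a fixed seed, so that repeated runs produce identical data. The helper name `make_records` and its `seed` parameter are illustrative assumptions, not part of the notebook.

```python
import random


def make_records(words, sample_size, count, seed=None):
    """Build `count` texts, each joining `sample_size` words drawn without replacement."""
    # A dedicated Random instance keeps the seed local to this (hypothetical) helper.
    rng = random.Random(seed)
    return [" ".join(rng.sample(words, sample_size)) for _ in range(count)]


# A tiny 11-word vocabulary stands in for the words split from free_text.
vocabulary = "you can use amazon bedrock to experiment with prompts and models".split()
demo_records = make_records(vocabulary, sample_size=5, count=3, seed=42)
print(len(demo_records), len(demo_records[0].split()))
```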


Write the records to a `jsonl` file for the batch inference job.

For more information about the JSON Lines record format, please refer to the inference parameters and responses of the target FM: [Amazon Titan Embeddings Text](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-titan-embed-text.html)

In [8]:
output_file = 'synthetic-data.jsonl'

# write the records to the output file
with open(output_file, 'w') as f:
    for i, record in enumerate(records):

        output = {
            "recordId": str(i), 
            "modelInput": {
                "inputText": record,
                "dimensions": 256 # 256, 512, or 1024 (default)
            }
        }
        f.write(json.dumps(output) + '\n')
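
Each line of the file is a standalone JSON object with a `recordId` string and a `modelInput` object. The sketch below checks only the fields this notebook writes; it is not the service's own validation, and the helper name `validate_jsonl_line` is a hypothetical one introduced for illustration.

```python
import json


def validate_jsonl_line(line):
    """Return the parsed record if it has the fields used in this notebook, else raise ValueError."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    if not isinstance(record.get("recordId"), str):
        raise ValueError("recordId must be a string")
    model_input = record.get("modelInput")
    if not isinstance(model_input, dict) or "inputText" not in model_input:
        raise ValueError("modelInput.inputText is required")
    return record


# One line in the same shape as the records written above.
sample_line = json.dumps(
    {"recordId": "0", "modelInput": {"inputText": "hello world", "dimensions": 256}}
)
print(validate_jsonl_line(sample_line)["modelInput"]["dimensions"])
```

Running such a check over every line of the file before uploading can surface malformed records early, before a batch job is submitted.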



Get the target S3 bucket name from the stack output. (The naming pattern is `embedding-batch-job-{account_id}-{region_code}` per our CDK implementation.)
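
As a cross-check on the value read from the stack, the naming pattern can be sketched as a small formatting helper. The function name `expected_bucket_name` and the account ID used here are placeholders for illustration only.

```python
def expected_bucket_name(account_id, region_code):
    """Format a bucket name following the pattern described above."""
    return f"embedding-batch-job-{account_id}-{region_code}"


# Placeholder values; in a live session the account ID could come from
# boto3.client("sts").get_caller_identity()["Account"].
print(expected_bucket_name("123456789012", "us-east-1"))
```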

In [14]:
stack_name = "SolutionStack"

# read output values from the deployed CloudFormation stack
cfn_client = boto3.client('cloudformation')

def get_stack_output(stack_name, output_key):
    response = cfn_client.describe_stacks(StackName=stack_name)
    outputs = response['Stacks'][0]['Outputs']

    # Find the requested output value among the stack outputs
    output_value = None
    for output in outputs:
        if output['OutputKey'] == output_key:  # keys are defined by the CDK stack outputs
            output_value = output['OutputValue']
            break

    return output_value

In [15]:
bucket_name = get_stack_output(stack_name, "DataS3BucketName")
data_uri_prefix = f"s3://{bucket_name}/input"

In [16]:
# total batches to be processed
BATCH_COUNT = 25

In [None]:
import uuid

# initialize the data file uris list
data_file_uris = []

# copy the synthetic data file to the target S3 bucket once per batch
# for a real use case, prepare and upload your own data files instead.
for i in range(BATCH_COUNT):
    batch_id = str(uuid.uuid4())[:12]
    input_file_uri = f"{data_uri_prefix}/{batch_id}/data.jsonl"
    data_file_uris.append(input_file_uri)
    !aws s3 cp ./$output_file $input_file_uri

Create the batch registry records in the DynamoDB table.

In [None]:

dynamodb_table_name = get_stack_output(stack_name, "BatchRegistryTableName")
dynamodb_table_name

In [None]:
from datetime import datetime
from datetime import timezone

dynamodb = boto3.client('dynamodb')

# Get current timestamp
current_time = datetime.now(timezone.utc).isoformat()

# Store each data file URI in DynamoDB
for uri in data_file_uris:
    item = {
        'data_s3_uri': {'S': uri},
        'created_dt': {'S': current_time},
        'status': {'S': 'Pending'},
        'id': {'S': uri.split('/')[-2]}  # Extract batch_id from URI
    }
    
    dynamodb.put_item(
        TableName=dynamodb_table_name,
        Item=item
    )

print(f"Stored {len(data_file_uris)} records in DynamoDB")

Query the pending records.

In [None]:
from boto3.dynamodb.conditions import Key

# query the pending records
table = boto3.resource('dynamodb').Table(dynamodb_table_name)
response = table.query(
    IndexName='status-index',
    KeyConditionExpression=Key('status').eq('Pending')
)

response['Items']
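
A single DynamoDB `Query` call returns at most 1 MB of data and sets `LastEvaluatedKey` when more results remain. The 25 records created here fit in one page, but a larger registry would need pagination. The following is a minimal sketch of that loop; `query_all_items` and `FakeTable` are hypothetical names, and the fake table only exists so the example runs without AWS access.

```python
def query_all_items(table, **query_kwargs):
    """Collect items from every page of a DynamoDB Table.query call."""
    items = []
    while True:
        response = table.query(**query_kwargs)
        items.extend(response.get("Items", []))
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Resume the next page where the previous one stopped.
        query_kwargs["ExclusiveStartKey"] = last_key


class FakeTable:
    """In-memory stand-in (hypothetical) that serves two pages."""

    def query(self, **kwargs):
        if "ExclusiveStartKey" not in kwargs:
            return {"Items": [{"id": "a"}], "LastEvaluatedKey": {"id": "a"}}
        return {"Items": [{"id": "b"}]}


print(query_all_items(FakeTable(), IndexName="status-index"))
```

With the real table, the same keyword arguments used in the cell above (`IndexName` and `KeyConditionExpression`) would be passed through unchanged.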


### Reference

> **Note**: You don't need to run the code cells below. They are for reference only.

#### Create Batch Inference Job

In [21]:
role_arn = get_stack_output(stack_name, "BatchInferenceRoleArn")

In [None]:
input_file_uri = data_file_uris[0]

output_data_uri = f"s3://{bucket_name}/output/"

inputDataConfig=({
    "s3InputDataConfig": {
        "s3Uri": input_file_uri
    }
})

outputDataConfig=({
    "s3OutputDataConfig": {
        "s3Uri": output_data_uri
    }
})

response=bedrock.create_model_invocation_job(
    roleArn=role_arn,
    modelId=model_id,
    jobName="sample-batch-job-" + str(uuid.uuid4())[:12],
    inputDataConfig=inputDataConfig,
    outputDataConfig=outputDataConfig
)

print(response)
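
A batch inference job runs asynchronously, so the response above only confirms submission. A minimal polling sketch follows, assuming the job ARN is read from `response["jobArn"]` and the status is checked with `get_model_invocation_job`. The set of terminal statuses is an assumption to verify against the API reference, and `wait_for_job` and `FakeBedrock` are hypothetical names; the fake client lets the example run without AWS access.

```python
import time

# Statuses treated as terminal here (an assumption; check the API reference).
TERMINAL_STATUSES = {"Completed", "PartiallyCompleted", "Failed", "Stopped", "Expired"}


def wait_for_job(client, job_arn, poll_seconds=60, sleep=time.sleep):
    """Poll get_model_invocation_job until the job reaches a terminal status."""
    while True:
        status = client.get_model_invocation_job(jobIdentifier=job_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


class FakeBedrock:
    """Stand-in client (hypothetical) that walks through three statuses."""

    def __init__(self):
        self._statuses = iter(["Submitted", "InProgress", "Completed"])

    def get_model_invocation_job(self, jobIdentifier):
        return {"status": next(self._statuses)}


# The sleep function is replaced with a no-op so the demo finishes immediately.
print(wait_for_job(FakeBedrock(), "arn:aws:bedrock:example", sleep=lambda _: None))
```

With the real client, the call would look like `wait_for_job(bedrock, response["jobArn"])`.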

#### Real-time Inference

In [None]:
sample_input = {
    "inputText": free_text,
    "dimensions": 256,
    "normalize": True
}
body = json.dumps(sample_input)
response = bedrock_runtime.invoke_model(
    modelId=model_id,
    body=body,
    accept='application/json',
    contentType='application/json'
)

response_body = json.loads(response.get('body').read())
embeddings = response_body['embedding']
print(f"embedding length: {len(embeddings)}")
print(f"embedding: {embeddings}")
print("response body:")
pprint(response_body)
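
With `"normalize": True` the returned embedding is expected to have unit length, in which case the cosine similarity of two embeddings reduces to their dot product. The sketch below computes the general form in plain Python. The vectors are toy values standing in for real embeddings, and `cosine_similarity` is a helper introduced for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy unit vectors (illustrative values only, not model output).
u = [0.6, 0.8]
v = [1.0, 0.0]
print(cosine_similarity(u, v))
```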