# Amazon Bedrock Embeddings for Ads - Building with Bedrock Embeddings

This demo notebook shows how to use the AWS SDK for Python (boto3) with Amazon Bedrock to generate embeddings.

1. [Set Up](#1.-Set-Up)
2. [Embeddings Generation](#2.-Embeddings-Generation)


Note: This notebook was tested in Amazon SageMaker Studio with the Python 3 (Data Science 3.0) kernel. We suggest an instance with 2 vCPUs and 8 GiB of memory.


# 1. Set Up

---
Before running the notebook for the first time, run this cell to upgrade boto3 and botocore to versions that include Amazon Bedrock support.

---

In [None]:
%pip install --upgrade pip
%pip install boto3 --upgrade
%pip install botocore --upgrade

In [7]:
import boto3
import botocore

# Get the Boto3 version
boto3_version = boto3.__version__

# Get the Botocore version
botocore_version = botocore.__version__


# Print the Boto3 version
print("Current Boto3 Version:", boto3_version)

# Print the Botocore version
print("Current Botocore Version:", botocore_version)

Current Boto3 Version: 1.28.63
Current Botocore Version: 1.31.63


Let's initialize the boto3 client for Bedrock.

In [8]:
import boto3
import botocore
import json

# Control-plane client, used for management calls such as listing models
bedrock = boto3.client(
    service_name='bedrock',
    region_name='us-east-1',
    endpoint_url='https://bedrock.us-east-1.amazonaws.com'
)

Let's test the endpoint by listing the available foundation models.

In [9]:
bedrock.list_foundation_models()

{'ResponseMetadata': {'RequestId': '5ff415c3-7b48-4906-b9fa-f424410d362d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Fri, 13 Oct 2023 04:34:17 GMT',
   'content-type': 'application/json',
   'content-length': '5729',
   'connection': 'keep-alive',
   'x-amzn-requestid': '5ff415c3-7b48-4906-b9fa-f424410d362d'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-tg1-large',
   'modelId': 'amazon.titan-tg1-large',
   'modelName': 'Titan Text Large',
   'providerName': 'Amazon',
   'inputModalities': ['TEXT'],
   'outputModalities': ['TEXT'],
   'responseStreamingSupported': True,
   'customizationsSupported': ['FINE_TUNING'],
   'inferenceTypesSupported': ['ON_DEMAND']},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-e1t-medium',
   'modelId': 'amazon.titan-e1t-medium',
   'modelName': 'Titan Text Embeddings',
   'providerName': 'Amazon',
   'inputModalities': ['TEXT'],
   'outputModalities'

# 2. Embeddings Generation

Embeddings are a key concept in generative AI and in machine learning in general. An embedding is a representation of an object (such as a word, image, or video) as a vector in a vector space. Typically, semantically similar objects have embeddings that are close together in that space. This makes embeddings very useful for use cases like semantic search, recommendations, and classification.
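
To make "close together in the vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common closeness measure. The three-dimensional vectors and the names `shoe_a`, `shoe_b`, and `radio` are made-up toy values for illustration, not output from any Bedrock model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # A zero vector has no direction; treat it as unrelated to everything
        return 0.0
    return float(np.dot(a, b) / denom)

# Hypothetical toy embeddings: two similar items and one unrelated item
shoe_a = [0.9, 0.1, 0.0]
shoe_b = [0.8, 0.2, 0.1]
radio = [0.0, 0.2, 0.9]

print("shoe_a vs shoe_b:", cosine_similarity(shoe_a, shoe_b))
print("shoe_a vs radio: ", cosine_similarity(shoe_a, radio))
```

The two shoe vectors point in nearly the same direction and score close to 1, while the radio vector scores close to 0. The same function can be applied to real embeddings once they have been generated.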

## We will be using the Titan Embeddings Model to generate our Embeddings.

In [28]:
import json

# Path to the JSON file; this is a small dataset of only 5,000 ad records
json_file_path = 'amazon_combined_scrapped_data_5k.json'

# Load the JSON data from the file
with open(json_file_path, 'r') as json_file:
    ads_data = json.load(json_file)

# Now, ads_data contains the data from the JSON file as a Python object (here, a list of dictionaries)


In [30]:
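import json
import boto3

# Runtime client, used for model invocation (region comes from the default session)
bedrock_client = boto3.client('bedrock-runtime')

def get_embedding(body, modelId, accept, contentType):
    """Invoke the model with a JSON request body and return the embedding vector."""
    response = bedrock_client.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    # The response body is a stream; read it and parse the JSON payload
    response_body = json.loads(response.get('body').read())
    embedding = response_body.get('embedding')
    return embedding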

In [34]:
import json

# Assuming you've loaded the JSON data into the 'ads_data' variable

# Display the structure of the first few records with proper indentation
num_records_to_show = 5  # You can adjust this number
formatted_data = json.dumps(ads_data[:num_records_to_show], indent=4)
print(formatted_data)



[
    {
        "ad": "MTNG Women's 69439 Trainers, (Soft Metallic Light Pink C45918), 3.5 3 UK",
        "np_array": 0,
        "ad_id": "2023-1875908"
    },
    {
        "ad": "Fremont Die NCAA Florida State Seminoles Boat Flag, Small, Red",
        "np_array": 0,
        "ad_id": "2023-2093788"
    },
    {
        "ad": "10 Label/card Holder Nickel Plated 5/8x2 1/2 W/screws",
        "np_array": 0,
        "ad_id": "2023-2237808"
    },
    {
        "ad": "Digital AM FM Portable Pocket Radio with Alarm Clock- Best Reception and Longest Lasting. AM FM Compact Radio Player Operated by 2 AAA Battery, Stereo Headphone Socket (Black), by Vondior",
        "np_array": 0,
        "ad_id": "2023-1190215"
    },
    {
        "ad": "Lcolyoli Tragus Earrings 16G Surgical Steel & Diamond CZ Cartilage Hoop Rings Labret Monroe Medusa Lip Ring Retainer Rook Helix Earring Stud Barbell Piercing Jewelry for Women Men Girls 8mm",
        "np_array": 0,
        "ad_id": "2023-1394847"
    }
]


In [None]:
# Initialize a list to store the results
results = []

# Request settings are the same for every record, so define them once
modelId = 'amazon.titan-embed-g1-text-02'
accept = 'application/json'
contentType = 'application/json'

# Loop through each record in the data
for record in ads_data:
    # Extract relevant fields from the record
    ad = record['ad']
    np_array = record['np_array']
    ad_id = record['ad_id']

    # Build the request payload and fetch the embedding for this ad
    body = json.dumps({"inputText": ad})
    embedding = get_embedding(body, modelId, accept, contentType)

    # Create a result dictionary with the embedding
    result = {
        'ad': ad,
        'ad_id': ad_id,
        'np_array': np_array,
        'embedding': embedding
    }
    
    # Append the result to the list of results
    results.append(result)
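
The loop above makes one request per record, so a single transient failure (for example, throttling) would stop a 5,000-record run partway through. One possible mitigation is a small retry helper with exponential backoff, sketched below. The helper `call_with_retry` and the stub `flaky_embedding` are hypothetical names introduced only for this illustration; the stub stands in for `get_embedding` so that the example runs without AWS access.

```python
import time

def call_with_retry(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Assumption: every error is treated as retryable in this sketch;
            # real code would likely retry only on throttling-style errors
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical stub that fails twice before succeeding
calls = {"n": 0}

def flaky_embedding(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated throttling")
    return [0.1, 0.2, 0.3]

# Record the delays instead of actually sleeping, to keep the demo fast
delays = []
vector = call_with_retry(flaky_embedding, "sample ad", base_delay=0.5, sleep=delays.append)
print("embedding:", vector)
print("delays that would have been waited:", delays)
```

In the loop, the call could then become `call_with_retry(get_embedding, body, modelId, accept, contentType)`. When retrying Bedrock calls specifically, narrowing the `except` clause to `botocore.exceptions.ClientError` is probably a safer choice than catching every exception.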


In [35]:
# Save the results to 'ads_vectors.json'
with open('ads_vectors.json', 'w', encoding='utf-8') as output_file:
    json.dump(results, output_file, indent=4)

print('Embedding vectors have been saved to ads_vectors.json')

Embedding vectors have been saved to ads_vectors.json


## Convert to NumPy array

In [36]:
import json
import numpy as np

# Define the path to your JSON file
json_file_path = 'ads_vectors.json'

# Load the JSON data from the file
with open(json_file_path, 'r') as json_file:
    data = json.load(json_file)

# Loop through each record in the JSON data
for record in data:
    embedding = record.get('embedding')  # Get the 'embedding' value
    if embedding:
        # Convert the 'embedding' value to a NumPy array
        embedding_np = np.array(embedding)

        # Store the values under 'np_array'; JSON cannot serialize a NumPy array,
        # so convert it back to a plain list first
        record['np_array'] = embedding_np.tolist()

# Write the updated JSON data back to 'ads_vectors.json' with proper indentation
with open('ads_vectors.json', 'w', encoding='utf-8') as output_file:
    json.dump(data, output_file, indent=4)
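
Because the vectors are stored in JSON as plain lists, a typical next step is to stack them into a single 2-D NumPy array after loading, so that similarity between ads can be computed in one vectorized operation. The sketch below assumes records shaped like the ones above. The helpers `build_matrix` and `top_k_similar` are hypothetical names, and the two-dimensional vectors are toy values rather than real Titan embeddings.

```python
import numpy as np

def build_matrix(records, key="np_array"):
    """Stack per-record vectors into one 2-D array, skipping records without a vector."""
    # Records that were never embedded still hold the placeholder value 0
    kept = [r for r in records if isinstance(r.get(key), list) and r[key]]
    matrix = np.array([r[key] for r in kept], dtype=np.float32)
    return kept, matrix

def top_k_similar(matrix, query_index, k=3):
    """Return (row_index, cosine_score) pairs for the k rows most similar to the query row."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.clip(norms, 1e-12, None)  # normalize rows, avoiding division by zero
    scores = unit @ unit[query_index]            # cosine similarity against every row
    scores[query_index] = -np.inf                # exclude the query itself
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy records for illustration only
records = [
    {"ad_id": "a", "np_array": [1.0, 0.0]},
    {"ad_id": "b", "np_array": [0.9, 0.1]},
    {"ad_id": "c", "np_array": [0.0, 1.0]},
    {"ad_id": "d", "np_array": 0},  # placeholder, no embedding yet
]

kept, matrix = build_matrix(records)
print("matrix shape:", matrix.shape)
for idx, score in top_k_similar(matrix, query_index=0, k=2):
    print(kept[idx]["ad_id"], round(score, 3))
```

With the real file, `records` would be the list loaded from `ads_vectors.json`. Keep `k` smaller than the number of rows so that the excluded query row never appears in the results.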


In [37]:
# Specify the filename of the JSON file you want to load
json_file_path = 'ads_vectors.json'

# Load the JSON data from the file
with open(json_file_path, 'r') as json_file:
    np_array_data = json.load(json_file)

# Display the first record (dictionary)
first_record = np_array_data[0]
print(first_record)

{'ad': "MTNG Women's 69439 Trainers, (Soft Metallic Light Pink C45918), 3.5 3 UK", 'ad_id': '2023-1875908', 'np_array': [-0.21484375, 0.59375, -0.027709961, -0.09082031, -0.29101562, -0.100097656, 0.122558594, 9.10759e-05, -0.49414062, -0.34179688, -0.55078125, 0.15039062, 0.14941406, -0.071777344, 0.10498047, -0.17382812, 0.40234375, -0.28320312, 0.08691406, 0.0046691895, 0.3125, 0.15136719, -0.040527344, 0.18066406, -0.029785156, 0.30078125, -0.20996094, 0.17675781, -0.140625, -0.33984375, 0.08496094, 0.54296875, -0.036865234, -0.14453125, 0.22460938, -0.072265625, -0.002166748, 0.18554688, 0.34375, 0.36328125, -0.016845703, 0.19433594, 0.14941406, -1.046875, 0.21972656, -0.68359375, 0.79296875, -0.18847656, 0.828125, -0.12890625, -0.010375977, 0.27734375, 0.25585938, -0.22558594, 0.87109375, 0.18945312, -0.05419922, -0.24023438, -0.5234375, 0.047851562, 0.17382812, 0.44726562, -0.37890625, 0.27539062, -0.06542969, 0.026245117, -0.3984375, -0.14160156, 0.23925781, -0.0016555786, -0.2