# Amazon Bedrock embedding for news/content/articles - Building with Bedrock Embeddings

In this demo notebook, we demonstrate how to use the Bedrock Python SDK for Embeddings Generation.

1. [Set Up](#1.-Set-Up)
2. [Embeddings Generation](#2.-Embeddings-Generation)

Note: This notebook was tested in Amazon SageMaker Studio with the Python 3 (Data Science 3.0) kernel. An instance with 2 vCPUs and 8 GiB of memory is suggested.


### 1. Set Up

---
Before executing the notebook for the first time, execute this cell to upgrade the Python boto3 SDK so that it includes Amazon Bedrock support.

---

In [None]:
%pip install --upgrade pip
%pip install boto3 --upgrade
%pip install botocore --upgrade

In [2]:
import boto3
import botocore

# Get the Boto3 version
boto3_version = boto3.__version__

# Get the Botocore version
botocore_version = botocore.__version__


# Print the Boto3 version
print("Current Boto3 Version:", boto3_version)

# Print the Botocore version
print("Current Botocore Version:", botocore_version)

Current Boto3 Version: 1.28.63
Current Botocore Version: 1.31.63


Let's initialize the boto3 client to use Bedrock.

In [3]:
import boto3
import json
bedrock = boto3.client(
    service_name='bedrock',
    region_name='us-east-1',
    endpoint_url='https://bedrock.us-east-1.amazonaws.com'
)


Let's test the endpoint to see which models are available.

In [3]:
bedrock.list_foundation_models()

{'ResponseMetadata': {'RequestId': '73711208-ac8a-45bb-b5b2-9e0bd2e2ae5e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Sun, 24 Sep 2023 04:52:35 GMT',
   'content-type': 'application/json',
   'content-length': '3596',
   'connection': 'keep-alive',
   'x-amzn-requestid': '73711208-ac8a-45bb-b5b2-9e0bd2e2ae5e'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-tg1-large',
   'modelId': 'amazon.titan-tg1-large'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-e1t-medium',
   'modelId': 'amazon.titan-e1t-medium'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-g1-text-02',
   'modelId': 'amazon.titan-embed-g1-text-02'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/stability.stable-diffusion-xl',
   'modelId': 'stability.stable-diffusion-xl'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/ai21.j2-grande-instruct',
   'modelId': 'ai
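
The response above mixes text, image, and embedding models. As a minimal sketch, the helper below picks out model IDs from a response-shaped dictionary by looking for the substring `embed`. The helper name `embedding_model_ids` and the substring rule are assumptions made for illustration; an ID such as `amazon.titan-e1t-medium` would not match this rule, so treat it as a rough filter rather than a reliable classification.

```python
def embedding_model_ids(response):
    """Return model IDs containing 'embed' from a list_foundation_models-style dict."""
    return [
        summary["modelId"]
        for summary in response.get("modelSummaries", [])
        if "embed" in summary["modelId"]  # naming heuristic, an assumption
    ]


# Hand-written sample that mimics the shape of the response shown above
sample_response = {
    "modelSummaries": [
        {"modelId": "amazon.titan-tg1-large"},
        {"modelId": "amazon.titan-embed-g1-text-02"},
        {"modelId": "stability.stable-diffusion-xl"},
    ]
}

embedding_ids = embedding_model_ids(sample_response)
print(embedding_ids)
```

In the notebook, the same helper could be called as `embedding_model_ids(bedrock.list_foundation_models())`.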

### 2. Embeddings Generation

Embeddings are a key concept in generative AI and in machine learning in general. An embedding is a representation of an object (such as a word, image, or video) as a vector in a vector space. Typically, semantically similar objects have embeddings that are close together in that space. This makes embeddings very powerful for use cases like semantic search, recommendations, and classification.
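
To make "close together in the vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common closeness measure. The three-dimensional vectors are invented for illustration and are not real model output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example only
veterans_fraud = [0.9, 0.1, 0.0]
college_scams = [0.8, 0.2, 0.1]
crossword = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(veterans_fraud, college_scams)
sim_unrelated = cosine_similarity(veterans_fraud, crossword)
print(sim_related > sim_unrelated)
```

A score near 1 means the vectors point in nearly the same direction, and a score near 0 means they are almost unrelated.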

We will be using the Titan Embeddings model to generate our embeddings.

In [19]:
import json

# Define the path to your JSON file
json_file_path = 'nyt_articles_2020.json'

# Load the JSON data from the file
with open(json_file_path, 'r') as json_file:
    nyt_articles_data = json.load(json_file)

# Now, nyt_articles_data contains the data from the JSON file as a Python object (likely a list of dictionaries)


In [20]:
# Display the first record (dictionary)
first_record = nyt_articles_data[0]
print(first_record)


{'newsdesk': 'Editorial', 'section': 'Opinion', 'subsection': 'Unknown', 'material': 'Editorial', 'headline': 'Protect Veterans From Fraud', 'abstract': 'Congress could do much more to protect Americans who have served their country from predatory for-profit colleges.', 'keywords': "['Veterans', 'For-Profit Schools', 'Financial Aid (Education)', 'Frauds and Swindling', 'Colleges and Universities', 'Veterans Affairs Department', 'Federal Trade Commission', 'University of Phoenix', 'Career Education Corporation']", 'word_count': 680, 'pub_date': '2020-01-01 00:18:54+00:00', 'n_comments': 186, 'uniqueID': 'nyt://article/69a7090b-9f36-569e-b5ab-b0ba5bb3ccbd', 'np_array': 0}


## Define the get_embedding function

In [21]:
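import json

import boto3

# The 'bedrock-runtime' client invokes models; the 'bedrock' client above handles control-plane calls such as listing models
bedrock_client = boto3.client('bedrock-runtime')

def get_embedding(body, modelId, accept, contentType):
    """Invoke the model and return the 'embedding' list from its JSON response."""
    response = bedrock_client.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    # The response body is a stream: read it, then parse the JSON payload
    response_body = json.loads(response.get('body').read())
    embedding = response_body.get('embedding')
    return embedding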
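
The request and response handling in `get_embedding` can be tried without AWS credentials by passing the client in as an argument. The sketch below assumes a hypothetical `embed_text` helper and a hypothetical `FakeBedrockRuntime` stand-in. Neither is part of boto3, and the fake returns a made-up two-element vector rather than a real embedding.

```python
import io
import json


def embed_text(client, text, model_id="amazon.titan-embed-g1-text-02"):
    """Build the JSON request, call invoke_model, and return the 'embedding' list."""
    response = client.invoke_model(
        body=json.dumps({"inputText": text}),
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read()).get("embedding")


class FakeBedrockRuntime:
    """Hypothetical offline stand-in for the bedrock-runtime client."""

    def invoke_model(self, body, modelId, accept, contentType):
        text = json.loads(body)["inputText"]
        # Made-up "embedding": the text length followed by a zero
        payload = json.dumps({"embedding": [float(len(text)), 0.0]})
        return {"body": io.BytesIO(payload.encode("utf-8"))}


vector = embed_text(FakeBedrockRuntime(), "hello")
print(vector)
```

With real credentials, `embed_text(bedrock_client, "some text")` would go through the same code path as `get_embedding`.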

In [23]:
# Specify the filename of the JSON file you want to load
input_filename = 'nyt_articles_2020.json'

# Load the JSON data from the file into a Python variable
with open(input_filename, 'r', encoding='utf-8') as input_file:
    data = json.load(input_file)

In [24]:
import json

# Assuming you've loaded the JSON data into the 'data' variable

# Display the structure of the first few records with proper indentation
num_records_to_show = 5  # You can adjust this number
formatted_data = json.dumps(data[:num_records_to_show], indent=4)
print(formatted_data)



[
    {
        "newsdesk": "Editorial",
        "section": "Opinion",
        "subsection": "Unknown",
        "material": "Editorial",
        "headline": "Protect Veterans From Fraud",
        "abstract": "Congress could do much more to protect Americans who have served their country from predatory for-profit colleges.",
        "keywords": "['Veterans', 'For-Profit Schools', 'Financial Aid (Education)', 'Frauds and Swindling', 'Colleges and Universities', 'Veterans Affairs Department', 'Federal Trade Commission', 'University of Phoenix', 'Career Education Corporation']",
        "word_count": 680,
        "pub_date": "2020-01-01 00:18:54+00:00",
        "n_comments": 186,
        "uniqueID": "nyt://article/69a7090b-9f36-569e-b5ab-b0ba5bb3ccbd",
        "np_array": 0
    },
    {
        "newsdesk": "Games",
        "section": "Crosswords & Games",
        "subsection": "Unknown",
        "material": "News",
        "headline": "\u2018It\u2019s Green and Slimy\u2019",
        "abstr

In [25]:
# Initialize a list to store the results
results = []

# Loop through each record in the data
for record in data:
    # Extract relevant fields from the record
    newsdesk = record['newsdesk']
    section = record['section']
    subsection = record['subsection']
    material = record['material']
    headline = record['headline']
    abstract = record['abstract']
    keywords = record['keywords']
    word_count = record['word_count']
    pub_date = record['pub_date']
    n_comments = record['n_comments']
    uniqueID = record['uniqueID']
    np_array = record['np_array']
    
    body = json.dumps({"inputText": abstract})
    
    modelId = 'amazon.titan-embed-g1-text-02'
    accept = 'application/json'
    contentType = 'application/json'
    embedding = get_embedding(body, modelId, accept, contentType)

    # Create a result dictionary with embedding

    result = {
        'newsdesk': newsdesk,
        'section': section,
        'subsection': subsection,
        'material': material,
        'headline': headline,
        'abstract': abstract,
        'keywords': keywords,
        'word_count': word_count,
        'pub_date': pub_date,
        'n_comments': n_comments,
        'uniqueID': uniqueID,
        'np_array': np_array,
        'embedding': embedding
        }
    
    # Append the result to the list of results
    results.append(result)
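
The loop above can also be written more compactly by copying each record and adding the `embedding` key. The sketch below is one possible variant, not the notebook's method: `embed_records` and `fake_embed` are hypothetical names, and skipping blank abstracts is an added assumption, on the guess that an empty `inputText` may be rejected by the model.

```python
def embed_records(records, embed_fn, text_field="abstract"):
    """Return (enriched, skipped): record copies with 'embedding' added, plus IDs of skipped records."""
    enriched, skipped = [], []
    for record in records:
        text = (record.get(text_field) or "").strip()
        if not text:
            # Nothing to embed; remember which record was skipped
            skipped.append(record.get("uniqueID"))
            continue
        enriched.append({**record, "embedding": embed_fn(text)})
    return enriched, skipped


def fake_embed(text):
    """Stand-in for a real embedding call; returns a made-up one-element vector."""
    return [float(len(text))]


sample_records = [
    {"uniqueID": "id-1", "abstract": "hello"},
    {"uniqueID": "id-2", "abstract": "   "},
]
enriched, skipped = embed_records(sample_records, fake_embed)
print(len(enriched), skipped)
```

In the notebook, `embed_fn` could wrap `get_embedding` with the model ID and content types fixed, so those constants are set once rather than on every iteration.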


In [26]:
# Save the results to 'content_vectors.json'
with open('content_vectors.json', 'w', encoding='utf-8') as output_file:
    json.dump(results, output_file, indent=4)

print('Embedding vectors have been saved to content_vectors.json')

Embedding vectors have been saved to content_vectors.json


## Convert to np array

In [27]:
import json
import numpy as np

# Define the path to your JSON file
json_file_path = 'content_vectors.json'

# Load the JSON data from the file
with open(json_file_path, 'r') as json_file:
    data = json.load(json_file)

# Loop through each record in the JSON data
for record in data:
    embedding = record.get('embedding')  # Get the 'embedding' value
    if embedding:
        # Convert the 'embedding' value to a NumPy array
        embedding_np = np.array(embedding)

        # Update the 'np_array' key with the NumPy array
        record['np_array'] = embedding_np.tolist()  # Convert NumPy array back to a list

# Write the updated JSON data back to 'content_vectors.json' with proper indentation
with open('content_vectors.json', 'w', encoding='utf-8') as output_file:
    json.dump(data, output_file, indent=4)
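
Once every record carries a list in `np_array`, the vectors can be stacked into a single NumPy matrix and searched in one pass. The following is a minimal sketch, assuming a hypothetical `top_k_similar` helper and toy three-dimensional vectors. It also assumes every record has a non-zero vector, so records still holding the placeholder `0` would need to be filtered out first.

```python
import numpy as np


def top_k_similar(query_vec, records, k=3):
    """Rank records by cosine similarity between their 'np_array' and query_vec."""
    matrix = np.array([r["np_array"] for r in records], dtype=float)  # shape (n, d)
    query = np.asarray(query_vec, dtype=float)
    # Normalise the rows and the query so a dot product equals cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    best = np.argsort(scores)[::-1][:k]  # indices of the highest scores first
    return [(records[i]["headline"], float(scores[i])) for i in best]


# Toy records with invented vectors, for illustration only
toy_records = [
    {"headline": "A", "np_array": [1.0, 0.0, 0.0]},
    {"headline": "B", "np_array": [0.0, 1.0, 0.0]},
    {"headline": "C", "np_array": [0.9, 0.1, 0.0]},
]
ranked = top_k_similar([1.0, 0.0, 0.0], toy_records, k=2)
print([headline for headline, _ in ranked])
```

For a semantic search over the articles, the query vector would come from embedding the search text with the same model used for the abstracts.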


In [17]:
# Specify the filename of the JSON file you want to load
json_file_path = 'content_vectors.json'

# Load the JSON data from the file
with open(json_file_path, 'r') as json_file:
    np_array_data = json.load(json_file)

# Display the first record (dictionary)
first_record = np_array_data[0]
print(first_record)

{'newsdesk': 'Editorial', 'section': 'Opinion', 'subsection': 'Unknown', 'material': 'Editorial', 'headline': 'Protect Veterans From Fraud', 'abstract': 'Congress could do much more to protect Americans who have served their country from predatory for-profit colleges.', 'keywords': "['Veterans', 'For-Profit Schools', 'Financial Aid (Education)', 'Frauds and Swindling', 'Colleges and Universities', 'Veterans Affairs Department', 'Federal Trade Commission', 'University of Phoenix', 'Career Education Corporation']", 'word_count': 680, 'pub_date': '2020-01-01 00:18:54+00:00', 'n_comments': 186, 'uniqueID': 'nyt://article/69a7090b-9f36-569e-b5ab-b0ba5bb3ccbd', 'np_array': [-1.0, -0.084472656, -0.56640625, -0.19824219, 0.54296875, 0.1484375, -0.07763672, -0.00024604797, 0.60546875, 0.4609375, -0.55859375, 0.024536133, -0.038085938, -0.58984375, -0.40820312, 0.625, -0.296875, 0.17773438, 0.09326172, 0.1328125, 0.41210938, 0.29492188, 0.44335938, -0.33984375, -0.66015625, -1.4375, -0.37695312,