# Outline

## Workshop Purpose
This tutorial demonstrates how to create a synthetic retail dataset using generative AI and implement a search system using language models and vector stores.

## Key Components
1. Data generation
   - Product information generation using Claude AI
   - Review and rating generation using Claude AI
2. Data processing and storage
   - Saving generated data to CSV files
3. Vector store creation
   - Using LangChain and ChromaDB
   - Embedding generation with Amazon Bedrock
4. Search functionality implementation
   - Similarity search using the vector store

## Expected Outcomes
After practicing with this code, users will:
1. Understand how to use generative AI to create synthetic datasets
2. Learn to process and structure data for use in a vector store
3. Gain experience with LangChain, ChromaDB, and Amazon Bedrock for creating embeddings and vector stores
4. Implement a basic similarity search function for product and review data
5. Have a framework for building more complex retail analytics and recommendation systems

The tutorial provides a hands-on approach to creating a full pipeline from data generation to search functionality, allowing users to experiment with AI-powered retail data analysis and retrieval systems.

In [7]:
print("Hello world!")
x = range(10)
my_age = "18"


Hello world!


# Workshop

In [8]:
%pip install --quiet boto3 langchain_community langchain chromadb
# boto3: AWS SDK for Python
# langchain, langchain_community: framework for building LLM applications, plus community integrations
# chromadb: open-source database for vector embeddings


In [10]:
# Import necessary libraries
import os  # operating system utilities
import pandas as pd  # open-source data analysis library
import numpy as np  # NumPy (Numerical Python), a numerical computing library
# scikit-learn, a machine learning library
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt #visualisation library
import boto3
import json
from tqdm import tqdm #shows progress bars

from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import BedrockEmbeddings
from langchain.schema import Document
from langchain_community.vectorstores.utils import filter_complex_metadata
#from langchain_anthropic import ChatAnthropic

Set up environment variables

In [14]:
from google.colab import userdata
#print(userdata.get("test"))

os.environ['AWS_ACCESS_KEY_ID'] = userdata.get('AWS_ACCESS_KEY_ID')
os.environ['AWS_SECRET_ACCESS_KEY'] = userdata.get('AWS_SECRET_ACCESS_KEY')
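
The cell above relies on `google.colab`, so it only works inside Colab. As a minimal sketch, assuming the credentials are supplied through environment variables in other setups, a small pre-flight check can report which variables are missing before any Bedrock call is made. `missing_aws_credentials` is a hypothetical helper name, not part of boto3.

```python
import os

# Variables this notebook sets before creating the Bedrock client
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_aws_credentials(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]

# Example with an explicit mapping instead of the real environment
example_env = {"AWS_ACCESS_KEY_ID": "example-key-id", "AWS_SECRET_ACCESS_KEY": ""}
print(missing_aws_credentials(example_env))
```

Calling `missing_aws_credentials()` with no argument inspects the real environment, which makes it easy to fail early with a readable message.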

Generate dummy data

In [16]:
# 1. Data Collection (simulated for this example)
print("1. Data Collection")
# For this example, we'll use a sample retail dataset
data = {
    'product_id': range(1, 101),
    'product_name': [f'Product {i}' for i in range(1, 101)],
    'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books'], 100),
    'price': np.random.uniform(10, 1000, 100).round(2),
    'rating': np.random.uniform(1, 5, 100).round(1),
    'review': [f'This is a review for product {i}' for i in range(1, 101)]
}
df = pd.DataFrame(data)

print(df.head()) #returns the first 5 rows of the DataFrame
#print(type(df))

1. Data Collection
    product_id product_name     category   price  rating  \
0            1    Product 1        Books  279.08     4.3   
1            2    Product 2        Books  247.44     3.4   
2            3    Product 3         Home  586.35     1.5   
3            4    Product 4        Books  583.60     3.5   
4            5    Product 5  Electronics  790.52     3.4   
..         ...          ...          ...     ...     ...   
95          96   Product 96     Clothing  308.61     4.8   
96          97   Product 97  Electronics  528.40     1.1   
97          98   Product 98         Home  851.99     1.3   
98          99   Product 99        Books  662.23     4.0   
99         100  Product 100        Books  674.60     2.4   

                              review  
0     This is a review for product 1  
1     This is a review for product 2  
2     This is a review for product 3  
3     This is a review for product 4  
4     This is a review for product 5  
..                        

Generate product description


In [20]:
# Set up Amazon Bedrock client
bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')


def generate_product_data(product_id):
    prompt = f"""Generate realistic product data for a retail item with the following ID: {product_id}

    Provide the response in the following JSON format:
    {{
        "product_name": "A creative and realistic product name",
        "category": "One of: Electronics, Clothing, Home, Books",
        "description": "A brief product description, 15-30 words long",
        "price": A realistic price as a float between 10 and 1000
    }}
    """

    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.7,
        "top_p": 0.9,
    })

    response = bedrock.invoke_model(
        body=body,
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        accept='application/json',
        contentType='application/json'
    )

    response_body = json.loads(response.get('body').read())
    generated_text = response_body['content'][0]['text']

    try:
        result = json.loads(generated_text)
        return result
    except json.JSONDecodeError:
        return {
            "product_name": f"Product {product_id}",
            "category": np.random.choice(['Electronics', 'Clothing', 'Home', 'Books']),
            "description": "Error generating description",
            "price": np.random.uniform(10, 1000)
        }
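
`generate_product_data` falls back to placeholder values whenever `json.loads` fails on the raw reply. A model may wrap the JSON object in extra sentences, which would trigger that fallback even though usable data is present. The sketch below is one possible mitigation, not the notebook's method: `extract_json_object` is a hypothetical helper that parses only the span from the first `{` to the last `}`.

```python
import json

def extract_json_object(text):
    """Parse the span from the first '{' to the last '}'; return None on failure."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

# Illustrative reply with text around the JSON object (made-up content)
reply = 'Here is the product data: {"product_name": "Desk Lamp", "price": 24.5} Let me know if you need more.'
print(extract_json_object(reply))
```

Inside `generate_product_data`, the result of `extract_json_object(generated_text)` could be returned when it is not `None`, with the existing placeholder dictionary used otherwise.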

Generate product data by calling the AI function

In [21]:
print("Generating product data...")
products = []
for product_id in tqdm(range(1, 11)):
    product = generate_product_data(product_id)
    product['product_id'] = product_id
    products.append(product)
    #time.sleep(2) # Wait for 2 seconds before the next API call

df = pd.DataFrame(products)

Generating product data...


100%|██████████| 10/10 [00:29<00:00,  2.98s/it]
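
The commented-out `time.sleep(2)` in the loop hints at rate limiting. `generate_product_data` only catches `json.JSONDecodeError`, so an error raised by the API call itself would stop the loop. A minimal sketch of retrying with exponential backoff follows; `call_with_retries` and its parameters are assumptions for illustration, and the failing function is a stand-in rather than a real Bedrock call.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after an exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))

# Stand-in function that fails twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated throttling")
    return "ok"

delays = []
result = call_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

In the loop above, the call could be wrapped as `call_with_retries(lambda: generate_product_data(product_id))`. In practice it is usually better to catch only the specific throttling exception rather than every `Exception`.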


In [22]:
df_products = pd.DataFrame(products)


| | product_name | category | description | price | product_id |
|---|---|---|---|---|---|
| 0 | Wireless Noise-Canceling Headphones | Electronics | Experience crystal-clear audio with these slee... | 199.99 | 1 |
| 1 | Wireless Noise-Cancelling Headphones | Electronics | Experience crystal-clear sound with these prem... | 149.99 | 2 |
| 2 | Wireless Noise-Cancelling Headphones | Electronics | Experience immersive sound with these premium ... | 249.99 | 3 |
| 3 | Wireless Noise-Cancelling Headphones | Electronics | Experience superior sound quality with these s... | 199.99 | 4 |
| 4 | EnergyBoost Portable Power Bank | Electronics | Compact and powerful 20,000mAh power bank with... | 39.99 | 5 |
| 5 | Wireless Noise-Cancelling Headphones | Electronics | Experience crystal-clear sound and immersive a... | 249.99 | 6 |
| 6 | Wireless Noise-Cancelling Headphones | Electronics | Experience immersive sound with these premium ... | 199.99 | 7 |
| 7 | Wireless Noise-Cancelling Headphones | Electronics | Experience superior sound quality with these s... | 249.99 | 8 |
| 8 | PowerGrip Cordless Drill Set | Electronics | A powerful and versatile cordless drill set wi... | 89.99 | 9 |
| 9 | Smart Thermostat with Voice Control | Electronics | Control your home's temperature with voice com... | 149.99 | 10 |
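
In the output above, most products are near-identical headphones and every category is Electronics. A likely cause is that each request is independent and differs only in the product ID, so the model has no reason to vary its answer. One way to encourage variety, sketched here under that assumption, is to fix the category per request and list names that were already used. `build_product_prompt` is a hypothetical helper, and the prompt wording is illustrative.

```python
CATEGORIES = ["Electronics", "Clothing", "Home", "Books"]

def build_product_prompt(product_id, used_names):
    """Build a prompt that fixes the category and lists product names to avoid."""
    category = CATEGORIES[(product_id - 1) % len(CATEGORIES)]  # rotate through categories
    avoid = ", ".join(used_names) if used_names else "none yet"
    return (
        f"Generate realistic product data for a retail item with ID {product_id}.\n"
        f"The category must be: {category}.\n"
        f"Do not reuse any of these product names: {avoid}.\n"
        "Respond with a single JSON object with the keys "
        "product_name, category, description, price."
    )

print(build_product_prompt(2, ["Wireless Noise-Cancelling Headphones"]))
```

The generation loop could pass `[p["product_name"] for p in products]` as `used_names` so that each new request sees the names generated so far.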


Generate review and rating

In [23]:
def generate_review_and_rating(product_name, category, description):
    prompt = f"""Generate a realistic product review and rating for the following product:
    Product Name: {product_name}
    Category: {category}
    Description: {description}

    Provide the response in the following JSON format:
    {{
        "review": "The generated review text",
        "rating": A number between 1 and 5, with one decimal place
    }}

    Ensure the review is between 20 and 50 words long and the rating reflects the sentiment of the review.
    """

    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.7,
        "top_p": 0.9,
    })

    response = bedrock.invoke_model(
        body=body,
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        accept='application/json',
        contentType='application/json'
    )

    response_body = json.loads(response.get('body').read())
    generated_text = response_body['content'][0]['text']

    try:
        result = json.loads(generated_text)
        return result['review'], result['rating']
    except json.JSONDecodeError:
        return "Error generating review", 3.0
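
`generate_review_and_rating` handles invalid JSON, but a reply that parses correctly and lacks a `review` or `rating` key would raise `KeyError`, and nothing checks that the rating is within 1 to 5. The following is a minimal sketch of stricter parsing; `parse_review_response` is a hypothetical helper and the clamping rule is an assumption, not something the notebook specifies.

```python
import json

def parse_review_response(generated_text, default_rating=3.0):
    """Return (review, rating); the rating is clamped to 1-5 and rounded to one decimal."""
    try:
        result = json.loads(generated_text)
        review = str(result["review"])
        rating = float(result["rating"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Same fallback values as the original function
        return "Error generating review", default_rating
    rating = round(min(5.0, max(1.0, rating)), 1)
    return review, rating

print(parse_review_response('{"review": "Great sound.", "rating": "4.76"}'))
print(parse_review_response('{"review": "No rating given"}'))
```

The function could replace the `try`/`except` block at the end of `generate_review_and_rating`, receiving `generated_text` as its argument.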

In [24]:
# Generate three reviews and ratings for each product
print("\nGenerating reviews and ratings...")
reviews_and_ratings = []
df_products = df

for _, row in tqdm(df_products.iterrows(), total=len(df_products)):
    for _ in range(3):  # Generate 3 reviews per product
        review, rating = generate_review_and_rating(row['product_name'], row['category'], row['description'])
        reviews_and_ratings.append({
            'product_id': row['product_id'],
            'review': review,
            'rating': rating
        })

print(df.head())
df_reviews = pd.DataFrame(reviews_and_ratings)


Generating reviews and ratings...


100%|██████████| 10/10 [02:12<00:00, 13.23s/it]

                           product_name     category  \
0   Wireless Noise-Canceling Headphones  Electronics   
1  Wireless Noise-Cancelling Headphones  Electronics   
2  Wireless Noise-Cancelling Headphones  Electronics   
3  Wireless Noise-Cancelling Headphones  Electronics   
4       EnergyBoost Portable Power Bank  Electronics   

                                         description   price  product_id  
0  Experience crystal-clear audio with these slee...  199.99           1  
1  Experience crystal-clear sound with these prem...  149.99           2  
2  Experience immersive sound with these premium ...  249.99           3  
3  Experience superior sound quality with these s...  199.99           4  
4  Compact and powerful 20,000mAh power bank with...   39.99           5  





In [25]:
df_reviews

| | product_id | review | rating |
|---|---|---|---|
| 0 | 1 | These wireless noise-canceling headphones are ... | 4.8 |
| 1 | 1 | These wireless noise-canceling headphones are ... | 4.7 |
| 2 | 1 | These wireless noise-canceling headphones are ... | 4.8 |
| 3 | 2 | These wireless noise-cancelling headphones are... | 4.7 |
| 4 | 2 | These wireless noise-cancelling headphones are... | 4.8 |
| 5 | 2 | These wireless noise-cancelling headphones are... | 4.8 |
| 6 | 3 | These noise-cancelling headphones are a game-c... | 4.8 |
| 7 | 3 | These wireless noise-cancelling headphones are... | 4.8 |
| 8 | 3 | These wireless noise-cancelling headphones are... | 4.7 |
| 9 | 4 | These wireless noise-cancelling headphones are... | 4.8 |
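
Because each product receives three reviews, a per-product average is a natural first analytic. The sketch below works on a list of dictionaries shaped like `reviews_and_ratings`; `average_rating_by_product` is a hypothetical helper, and the sample records use placeholder review text.

```python
from collections import defaultdict

def average_rating_by_product(reviews):
    """Map each product_id to the mean of its ratings, rounded to two decimals."""
    ratings = defaultdict(list)
    for entry in reviews:
        ratings[entry["product_id"]].append(float(entry["rating"]))
    return {pid: round(sum(vals) / len(vals), 2) for pid, vals in ratings.items()}

# Illustrative records shaped like reviews_and_ratings; review texts are placeholders
sample_reviews = [
    {"product_id": 1, "review": "placeholder", "rating": 4.8},
    {"product_id": 1, "review": "placeholder", "rating": 4.7},
    {"product_id": 1, "review": "placeholder", "rating": 4.8},
    {"product_id": 2, "review": "placeholder", "rating": 4.7},
]
print(average_rating_by_product(sample_reviews))
```

With pandas, the same aggregation can be written as `df_reviews.groupby('product_id')['rating'].mean()`.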


Save generated data

In [26]:
# Create 'data' folder if it doesn't exist
data_folder = 'data'
if not os.path.exists(data_folder):
    os.makedirs(data_folder)

# Save files to the 'data' folder
df_products.to_csv(os.path.join(data_folder, 'synthetic_product_data.csv'), index=False)
df_reviews.to_csv(os.path.join(data_folder, 'synthetic_review_data.csv'), index=False)
print(f"Data saved to '{data_folder}/synthetic_product_data.csv' and '{data_folder}/synthetic_review_data.csv'")

Data saved to 'data/synthetic_product_data.csv' and 'data/synthetic_review_data.csv'


Create Chroma DB and ingest data

In [27]:
# Function to load and process a CSV file
def load_and_process_csv(file_path):
    df = pd.read_csv(file_path)
    processed_documents = []

    if 'synthetic_product_data' in file_path:
        for _, row in df.iterrows():
            metadata = {
                'product_id': row['product_id'],
                'product_name': row['product_name'],
                'category': row['category'],
                'price': float(row['price'])
            }
            page_content = f"{row['product_name']}\n{row['category']}\n{row['description']}"
            processed_documents.append(Document(page_content=page_content, metadata=metadata))

    elif 'synthetic_review_data' in file_path:
        for _, row in df.iterrows():
            metadata = {
                'product_id': row['product_id'],
                'rating': float(row['rating']),
                'review': row['review']
            }
            page_content = row['review']
            processed_documents.append(Document(page_content=page_content, metadata=metadata))

    else:
        print(f"Skipping unknown file type: {file_path}")

    return processed_documents

# Load and process both CSV files
product_data = load_and_process_csv(os.path.join(data_folder, 'synthetic_product_data.csv'))
review_data = load_and_process_csv(os.path.join(data_folder, 'synthetic_review_data.csv'))
# Combine all documents
chroma_documents = product_data + review_data
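
`load_and_process_csv` decides how to process a file by looking for a substring in its path, so renaming a file silently changes the behavior. An alternative, sketched here as an assumption rather than a required change, is to classify the file by its column names. `detect_csv_kind` is a hypothetical helper.

```python
PRODUCT_COLUMNS = {"product_id", "product_name", "category", "description", "price"}
REVIEW_COLUMNS = {"product_id", "review", "rating"}

def detect_csv_kind(columns):
    """Classify a CSV as 'product', 'review', or 'unknown' from its header names."""
    cols = set(columns)
    if PRODUCT_COLUMNS <= cols:
        return "product"
    if REVIEW_COLUMNS <= cols:
        return "review"
    return "unknown"

print(detect_csv_kind(["product_name", "category", "description", "price", "product_id"]))
print(detect_csv_kind(["product_id", "review", "rating"]))
```

Inside `load_and_process_csv`, the branch conditions could then test `detect_csv_kind(df.columns)` instead of the file path.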

In [28]:
# Make sure you have the AWS CLI configured with the proper credentials and region
bedrock = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'  # Replace with your preferred AWS region
)
# Initialize the Bedrock embedding function
embeddings = BedrockEmbeddings(
    client=bedrock,
    model_id="amazon.titan-embed-text-v1"
)



In [29]:
# Create the ChromaDB vector store
vector_store = Chroma.from_documents(
    documents=chroma_documents,
    embedding=embeddings,
    persist_directory=os.path.join(data_folder, 'chroma_db')
)

print(f"ChromaDB index created and stored in {os.path.join(data_folder, 'chroma_db')}")


ChromaDB index created and stored in data/chroma_db


Search in DB

In [31]:
def search_products_chroma(query, top_n=5):
    results = vector_store.similarity_search(query, k=top_n)
    return results
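
Conceptually, a similarity search embeds the query and ranks stored documents by how close their embedding vectors are to it. The sketch below illustrates the idea with cosine similarity on tiny made-up vectors. It is not Chroma's implementation, and the metric Chroma actually applies depends on how the collection is configured. `cosine_sim` and `top_k` are hypothetical names.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k document keys most similar to query_vec, best first."""
    ranked = sorted(doc_vecs, key=lambda name: cosine_sim(query_vec, doc_vecs[name]), reverse=True)
    return ranked[:k]

# Made-up 3-dimensional "embeddings"; real Bedrock embeddings are much longer
toy_vectors = {
    "power bank": [0.9, 0.1, 0.0],
    "headphones": [0.1, 0.9, 0.1],
    "drill set": [0.6, 0.2, 0.7],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))
```

Here the query vector points along the first axis, so the entries with the largest first component relative to their length rank highest.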

In [35]:
search_res = search_products_chroma(query="20,000mAh ")
from google.colab import data_table
data_table.enable_dataframe_formatter()
s_res = pd.DataFrame(search_res)

s_res

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | (id, None) | (metadata, {'product_id': 5, 'rating': 4.7, 'r... | (page_content, The EnergyBoost Portable Power ... | (type, Document) |
| 1 | (id, None) | (metadata, {'category': 'Electronics', 'price'... | (page_content, EnergyBoost Portable Power Bank... | (type, Document) |
| 2 | (id, None) | (metadata, {'product_id': 5, 'rating': 4.7, 'r... | (page_content, The EnergyBoost Portable Power ... | (type, Document) |
| 3 | (id, None) | (metadata, {'product_id': 5, 'rating': 4.7, 'r... | (page_content, The EnergyBoost Portable Power ... | (type, Document) |
| 4 | (id, None) | (metadata, {'product_id': 9, 'rating': 4.7, 'r... | (page_content, The PowerGrip Cordless Drill Se... | (type, Document) |
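
Passing the result list directly to `pd.DataFrame` produces columns of `(field, value)` tuples, as the table above shows. A more readable table can be built by flattening each result into a plain dictionary first. The sketch below assumes only that each result exposes `page_content` and `metadata` attributes; `FakeDocument` is a stand-in used to keep the example self-contained, and `results_to_rows` is a hypothetical helper.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in with the two attributes used below; not the LangChain class."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def results_to_rows(results, max_chars=60):
    """Flatten documents into dicts: metadata keys plus a shortened 'text' column."""
    rows = []
    for doc in results:
        row = dict(doc.metadata)  # copy so the original metadata is untouched
        row["text"] = doc.page_content[:max_chars]
        rows.append(row)
    return rows

# Placeholder results; contents are illustrative only
sample_results = [
    FakeDocument("placeholder product text", {"product_id": 5, "category": "Electronics"}),
    FakeDocument("placeholder review text", {"product_id": 5, "rating": 4.7}),
]
for row in results_to_rows(sample_results):
    print(row)
```

In the notebook, `pd.DataFrame(results_to_rows(search_res))` would then give one column per metadata key plus a `text` column, with missing values for keys that a given document does not have.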
