# Outline

## Purpose
This tutorial shows how to generate a synthetic retail dataset with a large language model and how to build a similarity search over that data using embeddings and a vector store.

## Key Components
1. Data generation
   - Product information generation using Claude (via Amazon Bedrock) or OpenAI GPT-4o
   - Review and rating generation using Claude (via Amazon Bedrock) or OpenAI GPT-4o
2. Data processing and storage
   - Saving generated data to CSV files
3. Vector store creation
   - Using LangChain and ChromaDB
   - Embedding generation with Amazon Bedrock or OpenAI
4. Search functionality implementation
   - Similarity search using the vector store

## Expected Outcomes
After practicing with this code, users will:
1. Understand how to use generative AI to create synthetic datasets
2. Learn to process and structure data for use in a vector store
3. Gain experience with LangChain, ChromaDB, and Amazon Bedrock or OpenAI embeddings for creating vector stores
4. Implement a basic similarity search function for product and review data
5. Have a framework for building more complex retail analytics and recommendation systems

The tutorial provides a hands-on approach to creating a full pipeline from data generation to search functionality, allowing users to experiment with AI-powered retail data analysis and retrieval systems.

# Install Dependencies and Environment Variables

In [67]:
%pip install --quiet boto3 openai langchain_community langchain langchain-openai chromadb

In [54]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import boto3
import json
from tqdm import tqdm

from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import BedrockEmbeddings
from langchain.schema import Document
from langchain_community.vectorstores.utils import filter_complex_metadata
#from langchain_anthropic import ChatAnthropic

In [55]:
from google.colab import userdata
os.environ['AWS_ACCESS_KEY_ID'] = userdata.get('AWS_ACCESS_KEY_ID')
os.environ['AWS_SECRET_ACCESS_KEY'] = userdata.get('AWS_SECRET_ACCESS_KEY')

# Simple Synthetic Retail Data Generator

In [56]:
# 1. Data Collection (simulated for this example)
print("1. Data Collection")
# For this example, we'll use a sample retail dataset
data = {
    'product_id': range(1, 101),
    'product_name': [f'Product {i}' for i in range(1, 101)],
    'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books'], 100),
    'price': np.random.uniform(10, 1000, 100).round(2),
    'rating': np.random.uniform(1, 5, 100).round(1),
    'review': [f'This is a review for product {i}' for i in range(1, 101)]
}
df = pd.DataFrame(data)
print(df.head())

1. Data Collection
   product_id product_name  category   price  rating  \
0           1    Product 1  Clothing  810.03     4.3   
1           2    Product 2      Home  521.23     4.4   
2           3    Product 3      Home  677.71     2.5   
3           4    Product 4     Books  955.76     3.0   
4           5    Product 5  Clothing  868.38     3.8   

                           review  
0  This is a review for product 1  
1  This is a review for product 2  
2  This is a review for product 3  
3  This is a review for product 4  
4  This is a review for product 5  
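
The NumPy draws in the cell above are unseeded, so categories, prices, and ratings change on every run. The sketch below shows one way to make the placeholder data reproducible. It is an illustration only: it uses the standard-library `random` module instead of NumPy to stay dependency-free, and `make_sample_rows` and its `seed` parameter are hypothetical names, not part of the original notebook. With NumPy, a seeded generator from `np.random.default_rng` would play the same role.

```python
import random

CATEGORIES = ["Electronics", "Clothing", "Home", "Books"]

def make_sample_rows(n, seed=42):
    """Build n placeholder product rows from a seeded generator (hypothetical helper)."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    rows = []
    for i in range(1, n + 1):
        rows.append({
            "product_id": i,
            "product_name": f"Product {i}",
            "category": rng.choice(CATEGORIES),
            "price": round(rng.uniform(10, 1000), 2),
            "rating": round(rng.uniform(1, 5), 1),
            "review": f"This is a review for product {i}",
        })
    return rows

rows = make_sample_rows(100)
print(rows[0])
```

Calling `make_sample_rows(100)` twice with the same seed yields identical rows, which makes later steps easier to compare across runs.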


# Generative AI Synthetic Retail Data Generator

## Product Description Generation

In [None]:
# Set up Amazon Bedrock client
bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')
#llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0.7)

def generate_product_data(product_id):
    prompt = f"""Generate realistic product data for a retail item with the following ID: {product_id}

    Provide the response in the following JSON format:
    {{
        "product_name": "A creative and realistic product name",
        "category": "One of: Electronics, Clothing, Home, Books",
        "description": "A brief product description, 15-30 words long",
        "price": A realistic price as a float between 10 and 1000
    }}
    """

    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.7,
        "top_p": 0.9,
    })

    response = bedrock.invoke_model(
        body=body,
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        accept='application/json',
        contentType='application/json'
    )

    response_body = json.loads(response.get('body').read())
    generated_text = response_body['content'][0]['text']

    try:
        result = json.loads(generated_text)
        return result
    except json.JSONDecodeError:
        return {
            "product_name": f"Product {product_id}",
            "category": np.random.choice(['Electronics', 'Clothing', 'Home', 'Books']),
            "description": "Error generating description",
            "price": np.random.uniform(10, 1000)
        }
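
The function above returns whatever JSON the model produces, so a record can parse cleanly and still miss a field or use a category outside the four requested ones. A minimal validation sketch follows, assuming the field names and the 10 to 1000 price range stated in the prompt. `validate_product` is a hypothetical helper introduced here for illustration.

```python
ALLOWED_CATEGORIES = ("Electronics", "Clothing", "Home", "Books")
REQUIRED_FIELDS = ("product_name", "category", "description", "price")

def validate_product(record):
    """Return a list of problems found in one generated product record (hypothetical helper)."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in record]
    if "category" in record and record["category"] not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {record['category']!r}")
    if "price" in record:
        price = record["price"]
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            problems.append("price is not a number")
        elif not 10 <= price <= 1000:
            problems.append("price outside the 10-1000 range")
    return problems

sample = {"product_name": "X", "category": "Toys", "price": "cheap"}
print(validate_product(sample))
```

An empty list means the record matches the requested shape; a non-empty list could be used to trigger a regeneration or the fallback record.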

In [57]:
#OpenAI version
from openai import OpenAI
from openai.types.chat import ChatCompletionMessageParam
import json
import pandas as pd
import numpy as np
from tqdm import tqdm
import re

# Initialize OpenAI client
openai_api_key = userdata.get('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)

def clean_json_from_model_output(text):
    """Strip markdown code block formatting like ```json ... ```."""
    text = text.strip()
    # Remove code block markdown formatting
    if text.startswith("```"):
        text = re.sub(r"^```(?:json)?\s*", "", text)
        text = re.sub(r"\s*```$", "", text)
    return text

def generate_product_data(product_id):
    prompt = f"""You are an API that returns ONLY raw JSON. Generate realistic product data for a retail item with the following ID: {product_id}

Provide the response in the following JSON format:
{{
    "product_name": "A creative and realistic product name",
    "category": "One of: Electronics, Clothing, Home, Books",
    "description": "A brief product description, 15-30 words long",
    "price": A realistic price as a float between 10 and 1000
}}"""

    try:
        messages: list[ChatCompletionMessageParam] = [
            {"role": "user", "content": prompt}
        ]

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.7,
            max_tokens=300,
            top_p=0.9
        )

        generated_text = response.choices[0].message.content.strip()
        # Clean and parse the JSON
        cleaned_text = clean_json_from_model_output(generated_text)
        result = json.loads(cleaned_text)

        return result

    except Exception as e:
        print(f"Error generating data for product {product_id}: {e}")
        return {
            "product_name": f"Product {product_id}",
            "category": np.random.choice(['Electronics', 'Clothing', 'Home', 'Books']),
            "description": "Error generating description",
            "price": float(np.random.uniform(10, 1000))
        }
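
`clean_json_from_model_output` handles output wrapped in a Markdown code fence, but a model may also add a sentence before or after the JSON. The sketch below shows one possible way to cope with that, using the standard library's `json.JSONDecoder.raw_decode` to parse the first JSON object found anywhere in the text. `extract_json_object` is a hypothetical name, and this is an alternative to the regex cleanup above, not the notebook's own method.

```python
import json

def extract_json_object(text):
    """Parse the first JSON object found in text, ignoring surrounding prose or fences."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This brace did not start valid JSON; try the next one
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")

chatty = 'Sure! Here is the data: {"product_name": "EchoWave", "price": 79.99} Hope that helps.'
print(extract_json_object(chatty))
```

Because the function raises `ValueError` when nothing parses, it fits inside the existing `try`/`except Exception` block without further changes.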


## Generate product data

In [58]:
print("Generating product data...")
products = []
for product_id in tqdm(range(1, 3)):
    product = generate_product_data(product_id)
    product['product_id'] = product_id
    products.append(product)

df = pd.DataFrame(products)

Generating product data...


100%|██████████| 2/2 [00:03<00:00,  1.69s/it]


In [59]:
df_products = pd.DataFrame(products)
print(df_products.head())

                    product_name     category  \
0     EchoWave Bluetooth Speaker  Electronics   
1  UltraSoft Wireless Headphones  Electronics   

                                         description   price  product_id  
0  Compact and powerful, this speaker delivers cr...   79.99           1  
1  Experience unparalleled sound quality with our...  149.99           2  


## Product Review and Rating Generation

In [None]:
#Bedrock Sonnet version
def generate_review_and_rating(product_name, category, description):
    prompt = f"""Generate a realistic product review and rating for the following product:
    Product Name: {product_name}
    Category: {category}
    Description: {description}

    Provide the response in the following JSON format:
    {{
        "review": "The generated review text",
        "rating": A number between 1 and 5, with one decimal place
    }}

    Ensure the review is between 20 and 50 words long and the rating reflects the sentiment of the review.
    """

    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.7,
        "top_p": 0.9,
    })

    response = bedrock.invoke_model(
        body=body,
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        accept='application/json',
        contentType='application/json'
    )

    response_body = json.loads(response.get('body').read())
    generated_text = response_body['content'][0]['text']

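    try:
        result = json.loads(generated_text)
        return result['review'], result['rating']
    except (json.JSONDecodeError, KeyError):
        # Fall back to a neutral placeholder if the output is not valid JSON
        # or lacks the expected keys
        return "Error generating review", 3.0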
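
The prompt asks for a rating between 1 and 5 with one decimal place, but nothing enforces that on the returned value, which could arrive as a string or fall outside the range. The following sketch coerces a rating into that range. `normalize_rating` is a hypothetical helper, and its default of 3.0 simply mirrors the fallback rating used above.

```python
import math

def normalize_rating(value, default=3.0):
    """Coerce a model-supplied rating to a float in [1.0, 5.0] with one decimal place."""
    try:
        rating = float(value)
    except (TypeError, ValueError):
        return default  # not numeric at all, e.g. None or "great"
    if math.isnan(rating):
        return default
    # Clamp to the range requested in the prompt, then round to one decimal
    return round(min(5.0, max(1.0, rating)), 1)

print(normalize_rating("4.7"), normalize_rating(7), normalize_rating("great"))
```

Applying it to `result['rating']` before returning would keep the later `float(row['rating'])` conversion from failing on unexpected values.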



In [61]:
#OpenAI version

# Helper to clean markdown-like code blocks from LLM output
def clean_json(text):
    text = text.strip()
    if text.startswith("```"):
        text = re.sub(r"^```(?:json)?\s*", "", text)
        text = re.sub(r"\s*```$", "", text)
    return text

# Function to generate review and rating
def generate_review_and_rating_OpenAI(product_name, category, description):
    prompt = f"""You are a JSON-only API. Generate a realistic product review and rating for the following product:

Product Name: {product_name}
Category: {category}
Description: {description}

Respond strictly in the following JSON format:
{{
    "review": "The generated review text",
    "rating": A number between 1 and 5, with one decimal place
}}

Make sure the review is 20-50 words long and the rating reflects the sentiment.
"""

    try:
        messages: list[ChatCompletionMessageParam] = [
            {"role": "user", "content": prompt}
        ]

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.7,
            max_tokens=300,
            top_p=0.9
        )

        raw_output = response.choices[0].message.content.strip()
        print("### Raw Output:\n", raw_output)

        cleaned_output = clean_json(raw_output)
        result = json.loads(cleaned_output)

        return result["review"], result["rating"]

    except Exception as e:
        print(f"Error generating review and rating: {e}")
        return "Error generating review", 3.0
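
Both generation functions fall back to placeholder text on the first failure, so a transient API error leaves an "Error generating review" row in the dataset. One option is to retry with exponential backoff before giving up. The sketch below is a generic wrapper; `call_with_retries` and its parameters are hypothetical names. It only helps if the wrapped callable raises, so it would need to wrap the API call and parsing step directly, not the functions above, which catch exceptions internally.

```python
import time

def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func, retrying with exponential backoff whenever it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide on a fallback
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Demo with a hypothetical flaky callable and a no-op sleep so it runs instantly.
attempts_seen = []

def flaky_call():
    attempts_seen.append(1)
    if len(attempts_seen) < 2:
        raise RuntimeError("transient failure")
    return {"review": "ok", "rating": 4.0}

print(call_with_retries(flaky_call, sleep=lambda seconds: None))
```

The injectable `sleep` argument exists only to make the wrapper easy to exercise without waiting; in the notebook the default `time.sleep` would be used.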




## Generate reviews and ratings

In [62]:
print("\nGenerating reviews and ratings...")
reviews_and_ratings = []
df_products = df

for _, row in tqdm(df_products.iterrows(), total=len(df_products)):
    for _ in range(3):  # Generate 3 reviews per product
        # OpenAI version
        review, rating = generate_review_and_rating_OpenAI(row['product_name'], row['category'], row['description'])
        # Bedrock Sonnet version
        #review, rating = generate_review_and_rating(row['product_name'], row['category'], row['description'])
        reviews_and_ratings.append({
            'product_id': row['product_id'],
            'review': review,
            'rating': rating
        })

print(df.head())
df_reviews = pd.DataFrame(reviews_and_ratings)



Generating reviews and ratings...


  0%|          | 0/2 [00:00<?, ?it/s]

### Raw Output:
 {
    "review": "The EchoWave Bluetooth Speaker surpasses expectations with its robust sound and deep bass. Its compact design makes it perfect for travel. Highly recommend for any music enthusiast.",
    "rating": 4.7
}
### Raw Output:
 {
    "review": "The EchoWave Bluetooth Speaker is a fantastic choice for music enthusiasts. Its compact design doesn't compromise on sound quality, delivering rich bass and clear audio. Perfect for any setting!",
    "rating": 4.7
}


 50%|█████     | 1/2 [00:04<00:04,  4.05s/it]

### Raw Output:
 ```json
{
    "review": "The EchoWave Bluetooth Speaker is a fantastic addition to my audio setup. Its compact size doesn't compromise on sound quality, delivering clear audio and deep bass. Highly recommend for music lovers!",
    "rating": 4.7
}
```
### Raw Output:
 ```json
{
    "review": "The UltraSoft Wireless Headphones offer incredible sound quality and are extremely comfortable for long listening sessions. The battery life is impressive, lasting through an entire day of use without needing a recharge.",
    "rating": 4.8
}
```
### Raw Output:
 ```json
{
    "review": "The UltraSoft Wireless Headphones deliver impressive sound quality and comfort. The battery life is exceptional, lasting through long listening sessions. A bit pricey, but worth it for the quality.",
    "rating": 4.5
}
```


100%|██████████| 2/2 [00:11<00:00,  5.59s/it]

### Raw Output:
 {
    "review": "The UltraSoft Wireless Headphones offer superb sound quality and exceptional comfort. The battery life is impressive, easily lasting through long listening sessions. Highly recommended for audiophiles seeking both quality and comfort.",
    "rating": 4.8
}
                    product_name     category  \
0     EchoWave Bluetooth Speaker  Electronics   
1  UltraSoft Wireless Headphones  Electronics   

                                         description   price  product_id  
0  Compact and powerful, this speaker delivers cr...   79.99           1  
1  Experience unparalleled sound quality with our...  149.99           2  





# Saving generated data

In [63]:
print("\n8. Saving generated data")
# Create the 'data' folder if it doesn't exist
data_folder = 'data'
os.makedirs(data_folder, exist_ok=True)

# Save files to the 'data' folder
df_products.to_csv(os.path.join(data_folder, 'synthetic_product_data.csv'), index=False)
df_reviews.to_csv(os.path.join(data_folder, 'synthetic_review_data.csv'), index=False)
print(f"Data saved to '{data_folder}/synthetic_product_data.csv' and '{data_folder}/synthetic_review_data.csv'")


8. Saving generated data
Data saved to 'data/synthetic_product_data.csv' and 'data/synthetic_review_data.csv'
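
Products and reviews are saved to separate files linked only by `product_id`, so it can be worth checking that every review points at a product that exists. The sketch below does this with the standard-library `csv` module on small in-memory stand-ins. The sample rows are invented for illustration and `orphan_review_ids` is a hypothetical helper. Note that `csv` reads every value as a string.

```python
import csv
import io

def orphan_review_ids(products_file, reviews_file):
    """Return product_ids that appear in the reviews file but not in the products file."""
    product_ids = {row["product_id"] for row in csv.DictReader(products_file)}
    review_ids = {row["product_id"] for row in csv.DictReader(reviews_file)}
    return sorted(review_ids - product_ids)

# Tiny in-memory stand-ins for the two saved files (hypothetical rows)
products_file = io.StringIO(
    "product_name,category,description,price,product_id\n"
    "Sample Lamp,Home,A small lamp,10.0,1\n"
)
reviews_file = io.StringIO(
    "product_id,review,rating\n"
    "1,Works fine,4.0\n"
    "2,Refers to a missing product,3.0\n"
)
print(orphan_review_ids(products_file, reviews_file))  # ['2']
```

With the real files, the two arguments would be handles opened on `synthetic_product_data.csv` and `synthetic_review_data.csv` in `data_folder`, ideally with `newline=""` as the `csv` documentation recommends.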


# Create LangChain index and ChromaDB store

In [64]:
# Function to load and process a CSV file
def load_and_process_csv(file_path):
    df = pd.read_csv(file_path)
    processed_documents = []

    if 'synthetic_product_data' in file_path:
        for _, row in df.iterrows():
            metadata = {
                'product_id': row['product_id'],
                'product_name': row['product_name'],
                'category': row['category'],
                'price': float(row['price'])
            }
            page_content = f"{row['product_name']}\n{row['category']}\n{row['description']}"
            processed_documents.append(Document(page_content=page_content, metadata=metadata))

    elif 'synthetic_review_data' in file_path:
        for _, row in df.iterrows():
            metadata = {
                'product_id': row['product_id'],
                'rating': float(row['rating']),
                'review': row['review']
            }
            page_content = row['review']
            processed_documents.append(Document(page_content=page_content, metadata=metadata))

    else:
        print(f"Skipping unknown file type: {file_path}")

    return processed_documents

# Load and process both CSV files
product_data = load_and_process_csv(os.path.join(data_folder, 'synthetic_product_data.csv'))
review_data = load_and_process_csv(os.path.join(data_folder, 'synthetic_review_data.csv'))
# Combine all documents
chroma_documents = product_data + review_data

In [None]:
# Make sure you have the AWS CLI configured with the proper credentials and region
bedrock = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'  # Replace with your preferred AWS region
)
# Initialize the Bedrock embedding function
embeddings = BedrockEmbeddings(
    client=bedrock,
    model_id="amazon.titan-embed-text-v1"
)


In [None]:
# Create the ChromaDB vector store
vector_store = Chroma.from_documents(
    documents=chroma_documents,
    embedding=embeddings,
    persist_directory=os.path.join(data_folder, 'chroma_db')
)

print(f"ChromaDB index created and stored in {os.path.join(data_folder, 'chroma_db')}")


ChromaDB index created and stored in data/chroma_db


In [69]:
# OpenAI Version
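# OpenAIEmbeddings is provided by the langchain-openai package installed above;
# the langchain.embeddings import path is deprecated
from langchain_openai import OpenAIEmbeddings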

# Initialize OpenAI embedding function
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# Create ChromaDB vector store from documents
vector_store = Chroma.from_documents(
    documents=chroma_documents,
    embedding=embeddings,
    persist_directory=os.path.join(data_folder, 'chroma_db')
)

print(f"ChromaDB index created and stored in {os.path.join(data_folder, 'chroma_db')}")

ChromaDB index created and stored in data/chroma_db
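
Because `persist_directory` is reused, running a `Chroma.from_documents` cell more than once can add the same documents to the store again, which would explain repeated entries in search results. One possible remedy is to give each document a deterministic ID. The sketch below derives such an ID with `hashlib`; `stable_doc_id` is a hypothetical helper, and the tuples stand in for `Document` objects. Passing the resulting list through the `ids` argument of `Chroma.from_documents` is an assumption worth confirming against the installed LangChain version.

```python
import hashlib
import json

def stable_doc_id(page_content, metadata):
    """Derive a deterministic ID from a document's text and metadata."""
    # sort_keys makes the ID independent of metadata key order
    payload = json.dumps(
        {"content": page_content, "metadata": metadata},
        sort_keys=True,
        default=str,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# (page_content, metadata) tuples standing in for Document objects
docs = [
    ("EchoWave Bluetooth Speaker\nElectronics", {"product_id": 1, "price": 79.99}),
    ("EchoWave Bluetooth Speaker\nElectronics", {"price": 79.99, "product_id": 1}),
    ("UltraSoft Wireless Headphones\nElectronics", {"product_id": 2, "price": 149.99}),
]

# Keep the first document seen for each ID so the batch has no duplicate IDs
unique = {}
for content, metadata in docs:
    unique.setdefault(stable_doc_id(content, metadata), (content, metadata))

print(len(unique))  # 2
```

Identical review texts for the same product would also collapse to one ID under this scheme, which may or may not be desirable for a synthetic dataset.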


In [70]:
def search_products_chroma(query, top_n=5):
    results = vector_store.similarity_search(query, k=top_n)
    return results
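
`similarity_search` embeds the query and returns the `k` stored documents whose vectors are closest to it. The toy sketch below illustrates that ranking idea in plain Python with hand-written three-dimensional vectors. The vectors and the `cosine` and `top_k` helpers are invented for illustration. Cosine similarity is used here as an example measure; the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, items, k=5):
    """Rank (label, vector) pairs by similarity to query_vec and keep the best k labels."""
    scored = [(cosine(query_vec, vec), label) for label, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _score, label in scored[:k]]

items = [
    ("speaker", [0.9, 0.1, 0.0]),
    ("headphones", [0.8, 0.3, 0.1]),
    ("novel", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], items, k=2))  # ['speaker', 'headphones']
```

Real embeddings have hundreds or thousands of dimensions, and the vector store uses an index to avoid comparing the query against every stored vector, but the ordering principle is the same.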

In [71]:
search_res = search_products_chroma(query="electronics review")
from google.colab import data_table
data_table.enable_dataframe_formatter()
search_res


[Document(metadata={'product_name': 'EchoWave Bluetooth Speaker', 'product_id': 1, 'price': 79.99, 'category': 'Electronics'}, page_content='EchoWave Bluetooth Speaker\nElectronics\nCompact and powerful, this speaker delivers crystal-clear sound with impressive bass for all music lovers.'),
 Document(metadata={'product_name': 'EchoWave Bluetooth Speaker', 'price': 79.99, 'category': 'Electronics', 'product_id': 1}, page_content='EchoWave Bluetooth Speaker\nElectronics\nCompact and powerful, this speaker delivers crystal-clear sound with impressive bass for all music lovers.'),
 Document(metadata={'product_name': 'EchoWave Bluetooth Speaker', 'price': 79.99, 'product_id': 1, 'category': 'Electronics'}, page_content='EchoWave Bluetooth Speaker\nElectronics\nCompact and powerful, this speaker delivers crystal-clear sound with impressive bass for all music lovers.'),
 Document(metadata={'product_id': 1, 'category': 'Electronics', 'price': 79.99, 'product_name': 'EchoWave Bluetooth Speaker'

In [72]:
# Prettified version for Colab
search_res = search_products_chroma(query="electronics review")

# Convert results to a DataFrame
def format_search_results(results):
    rows = []
    for doc in results:
        # Each `doc` is typically a Document object with .page_content and .metadata
        rows.append({
            "Content": doc.page_content,
            **doc.metadata  # Unpack any metadata (like category, product_name, etc.)
        })
    return pd.DataFrame(rows)

# Create formatted table
df_results = format_search_results(search_res)

# Display using Google Colab data table
from google.colab import data_table
data_table.enable_dataframe_formatter()

# Show interactive table
df_results


|   | Content | product_id | category | price | product_name | rating | review |
|---|---------|------------|----------|-------|--------------|--------|--------|
| 0 | EchoWave Bluetooth Speaker\nElectronics\nCompa... | 1 | Electronics | 79.99 | EchoWave Bluetooth Speaker |  |  |
| 1 | EchoWave Bluetooth Speaker\nElectronics\nCompa... | 1 | Electronics | 79.99 | EchoWave Bluetooth Speaker |  |  |
| 2 | EchoWave Bluetooth Speaker\nElectronics\nCompa... | 1 | Electronics | 79.99 | EchoWave Bluetooth Speaker |  |  |
| 3 | EchoWave Bluetooth Speaker\nElectronics\nCompa... | 1 | Electronics | 79.99 | EchoWave Bluetooth Speaker |  |  |
| 4 | The EchoWave Bluetooth Speaker surpasses expec... | 1 |  |  |  | 4.7 | The EchoWave Bluetooth Speaker surpasses expec... |
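
The first four rows of the result are the same product document. When the store already contains duplicates, one lightweight workaround is to filter repeated content out of the search results before displaying them. The sketch below assumes each result exposes a `page_content` attribute, as LangChain `Document` objects do. `dedupe_by_content` is a hypothetical helper, and `SimpleNamespace` objects stand in for real documents.

```python
from types import SimpleNamespace

def dedupe_by_content(results):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in results:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

# Stand-ins for Document objects returned by a similarity search
results = [
    SimpleNamespace(page_content="A"),
    SimpleNamespace(page_content="A"),
    SimpleNamespace(page_content="B"),
]
print([doc.page_content for doc in dedupe_by_content(results)])  # ['A', 'B']
```

Since filtering happens after retrieval, fewer than `top_n` rows may remain; requesting a larger `k` and trimming afterwards is one way to compensate.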
