# The Big Project begins!!

## The Product Pricer

A model that can estimate how much something costs, from its description.

## Data Curation Part 1

Today we'll begin scrubbing and curating our dataset by focusing on a subset of the data: Home Appliances.

The dataset is here:  
https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023

And the folder with all the product datasets is here:  
https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories

In [1]:
# imports

import os
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
import matplotlib.pyplot as plt

In [2]:
# environment

load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv(
    key="OPENAI_API_KEY", default="your-key-if-not-using-env"
)
os.environ["ANTHROPIC_API_KEY"] = os.getenv(
    key="ANTHROPIC_API_KEY", default="your-key-if-not-using-env"
)
os.environ["HF_TOKEN"] = os.getenv(key="HF_TOKEN", default="your-key-if-not-using-env")

In [3]:
# Log in to HuggingFace

hf_token: str = os.environ["HF_TOKEN"]
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [4]:
# One more import - the Item class
# If you get an error that you need to agree to Meta's terms when you run this, then follow the link it provides you and follow their instructions
# You should get approved by Meta within minutes
# Any problems - message me or email me!
# With thanks to student Dr John S. for pointing out that this import needs to come after signing in to HF

from items import Item

In [5]:
%matplotlib inline

In [6]:
# Load in our dataset

dataset = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_meta_Appliances",
    split="full",
    trust_remote_code=True,
)

In [7]:
print(dataset)
print(f"Number of samples: {len(dataset)}")

Dataset({
    features: ['main_category', 'title', 'average_rating', 'rating_number', 'features', 'description', 'price', 'images', 'videos', 'store', 'categories', 'details', 'parent_asin', 'bought_together', 'subtitle', 'author'],
    num_rows: 94327
})
Number of samples: 94327


In [8]:
print(dataset.features)

{'main_category': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'average_rating': Value(dtype='float64', id=None), 'rating_number': Value(dtype='int64', id=None), 'features': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'description': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'price': Value(dtype='string', id=None), 'images': Sequence(feature={'hi_res': Value(dtype='string', id=None), 'large': Value(dtype='string', id=None), 'thumb': Value(dtype='string', id=None), 'variant': Value(dtype='string', id=None)}, length=-1, id=None), 'videos': Sequence(feature={'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'user_id': Value(dtype='string', id=None)}, length=-1, id=None), 'store': Value(dtype='string', id=None), 'categories': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'details': Value(dtype='string', id=None), 'parent_asin': Value(dtype='string', id=None

In [9]:
print(f"Number of Appliances: {len(dataset):,}")

Number of Appliances: 94,327


In [10]:
print(dataset.shuffle().select(range(5)))

Dataset({
    features: ['main_category', 'title', 'average_rating', 'rating_number', 'features', 'description', 'price', 'images', 'videos', 'store', 'categories', 'details', 'parent_asin', 'bought_together', 'subtitle', 'author'],
    num_rows: 5
})


In [11]:
import pandas as pd

# Convert to pandas DataFrame for easier analysis
df = dataset.to_pandas()

# Display basic statistics for numerical columns
print(df.describe())

       average_rating  rating_number
count    94327.000000   94327.000000
mean         4.118859     136.367901
std          0.864040     977.516100
min          1.000000       1.000000
25%          3.800000       3.000000
50%          4.300000      13.000000
75%          4.700000      53.000000
max          5.000000   90203.000000


In [12]:
# Count how many items fall under each main category
print(df["main_category"].value_counts())

main_category
Tools & Home Improvement        42694
Appliances                      25572
Amazon Home                     13915
Industrial & Scientific          5521
Automotive                        568
Health & Personal Care            268
All Electronics                   168
Sports & Outdoors                 124
Grocery                            95
Cell Phones & Accessories          85
Musical Instruments                83
Baby                               82
AMAZON FASHION                     71
Office Products                    68
All Beauty                         68
Computers                          57
Pet Supplies                       49
Camera & Photo                     39
Toys & Games                       38
Arts, Crafts & Sewing              33
Home Audio & Theater               31
Portable Audio & Accessories        5
Books                               5
Car Electronics                     4
Premium Beauty                      2
Digital Music                       

In [None]:
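# Each entry in the categories column is an array, which is unhashable,
# so convert each one to a tuple before counting
print(df["categories"].apply(tuple).value_counts())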
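
Because `categories` holds a list of labels per product, it can also help to count the individual labels rather than whole lists. A minimal sketch using only the standard library; `sample_rows` and `count_category_labels` are hypothetical names, and the rows are made up rather than taken from the dataset:

```python
from collections import Counter

# Made-up rows standing in for dataset records; each has a list of category labels
sample_rows = [
    {"categories": ["Appliances", "Parts & Accessories"]},
    {"categories": ["Appliances", "Refrigerators"]},
    {"categories": []},
]


def count_category_labels(rows):
    """Count how often each individual category label appears across rows."""
    counts = Counter()
    for row in rows:
        counts.update(row["categories"])
    return counts


label_counts = count_category_labels(sample_rows)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

The same function should work on the real data by passing `dataset` in place of `sample_rows`, assuming each record exposes `categories` as a list of strings.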

In [None]:
# Investigate a particular datapoint
datapoint = dataset[2]

In [None]:
# Investigate

print(datapoint["title"])
print(datapoint["description"])
print(datapoint["features"])
print(datapoint["details"])
print(datapoint["price"])

In [None]:
# How many have prices?

prices = 0
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            prices += 1
    except ValueError:
        pass

print(f"There are {prices:,} with prices which is {prices/len(dataset)*100:,.1f}%")
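
The price check is repeated in several cells, so it could be wrapped in a small helper. This is only a sketch: `parse_price` is a hypothetical name, and the sample values are made up to show the kinds of input it tolerates.

```python
from typing import Optional


def parse_price(raw) -> Optional[float]:
    """Return the price as a positive float, or None if it is missing or invalid."""
    try:
        price = float(raw)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers strings such as "None" or ""
        return None
    return price if price > 0 else None


# Made-up raw values, not taken from the dataset
samples = ["19.99", "None", "", None, "0", "249"]
parsed = [parse_price(s) for s in samples]
valid = [p for p in parsed if p is not None]
print(f"{len(valid)} of {len(samples)} samples have a usable price")
```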

In [None]:
# For those with prices, gather the price and the length

prices = []
lengths = []
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            prices.append(price)
            contents = (
                datapoint["title"]
                + str(datapoint["description"])
                + str(datapoint["features"])
                + str(datapoint["details"])
            )
            lengths.append(len(contents))
    except ValueError:
        pass

In [None]:
# Plot the distribution of lengths

plt.figure(figsize=(15, 6))
plt.title(
    f"Lengths: Avg {sum(lengths)/len(lengths):,.0f} and highest {max(lengths):,}\n"
)
plt.xlabel("Length (chars)")
plt.ylabel("Count")
plt.hist(lengths, rwidth=0.7, color="lightblue", bins=range(0, 6000, 100))
plt.show()

In [None]:
# Plot the distribution of prices

plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.2f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="orange", bins=range(0, 1000, 10))
plt.show()

In [None]:
# So what is this item??

for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 21000:
            print(datapoint['title'])
    except ValueError:
        pass

This is the closest I can find - looks like it's going at a bargain price!!

https://www.amazon.com/TurboChef-Electric-Countertop-Microwave-Convection/dp/B01D05U9NO/
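
Rather than hard-coding a threshold, another way to find outliers is to list the few highest-priced items. A sketch with made-up records; `top_priced` is a hypothetical helper, and the titles and prices below are illustrative only:

```python
import heapq

# Made-up records for illustration; real rows come from the dataset
sample_rows = [
    {"title": "Ice maker", "price": "129.99"},
    {"title": "Commercial oven", "price": "21000.50"},
    {"title": "Filter", "price": "None"},
    {"title": "Range hood", "price": "349.00"},
]


def top_priced(rows, n=2):
    """Return the n (price, title) pairs with the highest parseable prices."""
    priced = []
    for row in rows:
        try:
            priced.append((float(row["price"]), row["title"]))
        except (TypeError, ValueError):
            continue  # skip rows without a parseable price
    return heapq.nlargest(n, priced)


for price, title in top_priced(sample_rows):
    print(f"${price:,.2f}  {title}")
```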

## Now it's time to curate our dataset

We select items that cost between 1 and 999 USD.

We will create Item instances, which truncate the text to fit within 180 tokens using the right tokenizer.

Each Item also creates a prompt to be used during training.

Items will be rejected if they don't have sufficient characters.
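
The curation rules above can be sketched roughly as follows. This is not the real `Item` class, which lives in `items.py` and uses a proper tokenizer. Here `SketchItem`, `curate` and `MIN_CHARS` are hypothetical, and whitespace splitting is a crude stand-in for tokenization.

```python
from dataclasses import dataclass
from typing import Optional

MIN_PRICE, MAX_PRICE = 1, 999  # price range from the text, in USD
MAX_TOKENS = 180               # token budget from the text
MIN_CHARS = 300                # hypothetical threshold; the real value is set in items.py


@dataclass
class SketchItem:
    title: str
    price: float
    text: str


def curate(title: str, body: str, price: float) -> Optional[SketchItem]:
    """Apply the three curation rules; return None if the item is rejected."""
    if not (MIN_PRICE <= price <= MAX_PRICE):
        return None
    if len(body) < MIN_CHARS:
        return None
    # Whitespace splitting is a crude stand-in for a real tokenizer
    tokens = body.split()[:MAX_TOKENS]
    return SketchItem(title=title, price=price, text=" ".join(tokens))


long_body = "durable stainless steel replacement part " * 60
kept = curate("Example part", long_body, 24.99)
print(kept is not None)                                 # kept
print(curate("Too short", "tiny description", 24.99))   # rejected: too few characters
print(curate("Too expensive", long_body, 21000.0))      # rejected: outside price range
```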

In [None]:
# Create an Item object for each datapoint priced between 1 and 999 USD

items = []
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if 1 <= price <= 999:
            item = Item(datapoint, price)
            if item.include:
                items.append(item)
    except ValueError:
        pass

print(f"There are {len(items):,} items")

In [None]:
# Look at an example item (index 1 is the second item in the list)

items[1]

In [None]:
# Investigate the prompt that will be used during training - the model learns to complete this

print(items[100].prompt)

In [None]:
# Investigate the prompt that will be used during testing - the model has to complete this

print(items[100].test_prompt())

In [None]:
# Plot the distribution of token counts

tokens = [item.token_count for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")
plt.xlabel('Length (tokens)')
plt.ylabel('Count')
plt.hist(tokens, rwidth=0.7, color="green", bins=range(0, 300, 10))
plt.show()

In [None]:
# Plot the distribution of prices

prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="purple", bins=range(0, 1000, 10))
plt.show()

## Sidenote

If you like the variety of colors that matplotlib can use in its charts, you should bookmark this:

https://matplotlib.org/stable/gallery/color/named_colors.html

## Todos for you:

- Review the Item class and check you're comfortable with it
- Examine some Item objects, look at the training prompt with `item.prompt` and test prompt with `item.test_prompt()`
- Make some more histograms to better understand the data
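
For the extra histograms, one option is to bucket values yourself so the counts can be inspected as plain numbers before plotting. A sketch using only the standard library; `bucket_counts` is a hypothetical helper and the prices are made up:

```python
from collections import Counter


def bucket_counts(values, width):
    """Group non-negative values into fixed-width buckets keyed by each bucket's lower edge."""
    counts = Counter((int(v) // width) * width for v in values)
    return dict(sorted(counts.items()))


# Made-up prices for illustration only
sample_prices = [4.99, 12.5, 18.0, 25.0, 99.0, 105.0, 480.0]
counts = bucket_counts(sample_prices, width=50)
for lower, count in counts.items():
    print(f"{lower:>4}-{lower + 50:<4} {'#' * count}")
```

The resulting dictionary can be handed to matplotlib, for example with `plt.bar(list(counts.keys()), list(counts.values()), width=50, align="edge")`.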

## Next time we will combine with many other types of product

Like Electronics and Automotive. This will give us a massive dataset, and we can then be picky about choosing a subset that will be most suitable for training.