# 💸 Price Prediction Dataset Preparation (Electronics)

This notebook prepares a high-quality dataset for fine-tuning and evaluating language models for product **price prediction**, using the **Electronics** category from the [Amazon Reviews 2023 dataset](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023).


Key steps include:
- Loading and processing product data into structured prompts
- Filtering examples based on token length and content quality
- Visualizing token and price distributions
- Splitting into training and test sets
- Creating a `DatasetDict` compatible with Hugging Face 🤗
- Saving to disk with `pickle` for fast reuse

> ℹ️ **Note**: The full dataset is available [here](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories). For faster experimentation, we focus here on **Electronics**, but the same pipeline applies to other categories, such as Home Appliances, or to a combination of them.

> ⚠️ **Note:** This notebook is designed to run on **CPU only** — no GPU or TPU is required.



In [None]:
!pip install -q datasets

In [None]:
# 🔧 System & Environment
import os
import sys
import random
import pickle  # For saving/loading processed data objects
sys.path.append('/content/sample_data')  # Add custom code directory to Python path (if needed)

# 📊 Data Manipulation & Visualization
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter, defaultdict  # For counting and grouping items

# 🤗 Hugging Face Datasets & Hub
from datasets import load_dataset, Dataset, DatasetDict  # Load and structure datasets
from huggingface_hub import login  # Login for pushing datasets/models to the HF Hub

# 🔐 Colab Integration
from google.colab import userdata  # Securely access environment variables (e.g., HF token)


### 🔐 Authenticate with Hugging Face

We use a secure token from Colab's `userdata` to authenticate with Hugging Face.  
This allows us to access private models or push datasets to the Hub.

> 💡 Make sure you've uploaded your token to Colab Secrets first:
>
> Go to `Secrets` → `+ Add new secret` → Key: `HF_TOKEN`, Value: *your token*


In [None]:
hf_token = userdata.get('HF_TOKEN')  # Hugging Face token stored in Colab Secrets
login(hf_token, add_to_git_credential=True)

### 🧹 Product Data Preprocessing Class for Price Prediction

This **Item** class takes in raw product data and cleans, curates, and formats it into a prompt suitable for training or testing a language model to predict product prices.
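
To make the cleaning step concrete before reading the full class, here is a minimal standalone sketch of the same regex-based scrubbing logic. The function name `scrub_text` and the sample string are illustrative assumptions; the regex and the word filter mirror the `scrub` method defined in the class.

```python
import re

def scrub_text(text: str) -> str:
    """Collapse brackets, quotes, colons and whitespace, then drop long words containing digits."""
    # Replace runs of structural characters and whitespace with a single space
    text = re.sub(r'[:\[\]"{}【】\s]+', ' ', text).strip()
    # Tidy up commas left behind by the replacement above
    text = text.replace(" ,", ",").replace(",,,", ",").replace(",,", ",")
    words = text.split(' ')
    # Words of 7+ characters that contain a digit are likely product codes
    kept = [w for w in words if len(w) < 7 or not any(c.isdigit() for c in w)]
    return " ".join(kept)

# Hypothetical raw details string, for illustration only
raw = '{"Brand": "Acme", "Model": "XJ-4500B2"}  USB-C   cable'
print(scrub_text(raw))
```

The model number is removed because it is long and contains digits, while short tokens such as `USB-C` survive.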

In [None]:
from typing import Optional
from transformers import AutoTokenizer
import re

# 🔧 Configuration constants
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"  # Tokenizer source
MIN_TOKENS = 150   # Minimum number of tokens required to include an item
MAX_TOKENS = 160   # Max token cutoff before truncating
MIN_CHARS = 300    # Minimum character length for raw product content
CEILING_CHARS = MAX_TOKENS * 7  # Hard character ceiling for truncation

class Item:
    """
    An Item is a cleaned, curated datapoint of a Product with a Price.
    It handles text cleaning, token length control, and prompt construction.
    """

    # Tokenizer used for token counting and truncation
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)

    # Prompt structure
    PREFIX = "Price is $"
    QUESTION = "How much does this cost to the nearest dollar?"

    # Common noisy metadata strings to be removed from the details
    REMOVALS = [
        '"Batteries Included?": "No"', '"Batteries Included?": "Yes"',
        '"Batteries Required?": "No"', '"Batteries Required?": "Yes"',
        "By Manufacturer", "Item", "Date First", "Package", ":", "Number of",
        "Best Sellers", "Number", "Product "
    ]

    # Expected instance variables
    title: str
    price: float
    category: str
    token_count: int = 0
    details: Optional[str]
    prompt: Optional[str] = None
    include = False  # Marks if the item is usable based on filters

    def __init__(self, data, price):
        self.title = data['title']
        self.price = price
        self.parse(data)  # Begin processing and filtering

    def scrub_details(self):
        """
        Clean up the product 'details' string by removing irrelevant metadata fields.
        """
        details = self.details
        for remove in self.REMOVALS:
            details = details.replace(remove, "")
        return details

    def scrub(self, stuff):
        """
        Clean up text:
        - Normalize whitespace and symbols
        - Remove long alphanumeric codes (likely product codes)
        """
        stuff = re.sub(r'[:\[\]"{}【】\s]+', ' ', stuff).strip()
        stuff = stuff.replace(" ,", ",").replace(",,,", ",").replace(",,", ",")
        words = stuff.split(' ')
        select = [word for word in words if len(word) < 7 or not any(char.isdigit() for char in word)]
        return " ".join(select)

    def parse(self, data):
        """
        Compose the full product text from its fields.
        Filter based on character length and token count.
        If it qualifies, generate the training prompt and mark it for inclusion.
        """
        contents = '\n'.join(data['description'])
        if contents:
            contents += '\n'
        features = '\n'.join(data['features'])
        if features:
            contents += features + '\n'
        self.details = data['details']
        if self.details:
            contents += self.scrub_details() + '\n'

        if len(contents) > MIN_CHARS:
            contents = contents[:CEILING_CHARS]
            text = f"{self.scrub(self.title)}\n{self.scrub(contents)}"
            tokens = self.tokenizer.encode(text, add_special_tokens=False)
            if len(tokens) > MIN_TOKENS:
                tokens = tokens[:MAX_TOKENS]
                text = self.tokenizer.decode(tokens)
                self.make_prompt(text)
                self.include = True

    def make_prompt(self, text):
        """
        Create the final training prompt (question + product content + price answer).
        """
        self.prompt = f"{self.QUESTION}\n\n{text}\n\n"
        self.prompt += f"{self.PREFIX}{str(round(self.price))}.00"
        self.token_count = len(self.tokenizer.encode(self.prompt, add_special_tokens=False))

    def test_prompt(self):
        """
        Return a test version of the prompt with the price answer removed (for prediction).
        """
        return self.prompt.split(self.PREFIX)[0] + self.PREFIX

    def __repr__(self):
        """
        Developer-friendly string representation of the object.
        """
        return f"<{self.title} = ${self.price}>"


### 🚚 ItemLoader Class for Efficient Parallel Dataset Processing

This class is responsible for:
- Loading a large category-specific dataset from Hugging Face
- Filtering and cleaning datapoints to produce usable `Item` objects
- Parallelizing the process using multiple CPU workers to improve speed
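
The chunking idea behind the parallel step can be sketched in isolation. This is an illustrative helper, not part of the notebook's classes: the name `chunk_ranges` is an assumption, and it yields index pairs rather than dataset slices so it can run without downloading anything.

```python
import math

def chunk_ranges(size: int, chunk_size: int):
    """Yield (start, stop) index pairs that cover range(size) in fixed-size chunks."""
    for start in range(0, size, chunk_size):
        yield start, min(start + chunk_size, size)

# 2,500 rows split into chunks of 1,000: the last chunk is shorter
ranges = list(chunk_ranges(2500, 1000))
print(ranges)

# The number of chunks is the ceiling of size / chunk_size
print(math.ceil(2500 / 1000))
```

Each chunk is independent of the others, which is what lets a process pool hand them to separate workers and collect the results in order.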

In [None]:
from datetime import datetime
from tqdm import tqdm
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# 📦 Number of datapoints per chunk handed to each worker
CHUNK_SIZE = 1000

# 💰 Price filtering thresholds
MIN_PRICE = 0.5
MAX_PRICE = 999.49

class ItemLoader:
    """
    Load and preprocess product data from the Hugging Face Amazon dataset.
    Converts raw datapoints into clean `Item` objects using multiprocessing.
    """

    def __init__(self, name):
        self.name = name  # Category name (e.g., "Electronics")
        self.dataset = None  # Will hold the raw HF dataset

    def from_datapoint(self, datapoint):
        """
        Try to create an Item from this datapoint.
        Returns a valid Item object if it passes filtering, or None otherwise.
        """
        try:
            price_str = datapoint['price']
            if price_str:
                price = float(price_str)
                if MIN_PRICE <= price <= MAX_PRICE:
                    item = Item(datapoint, price)
                    return item if item.include else None
        except ValueError:
            return None  # Skip if price is missing or malformed

    def from_chunk(self, chunk):
        """
        Create a list of cleaned Items from a chunk of datapoints.
        Used in parallel execution.
        """
        batch = []
        for datapoint in chunk:
            result = self.from_datapoint(datapoint)
            if result:
                batch.append(result)
        return batch

    def chunk_generator(self):
        """
        Generator to yield fixed-size chunks of the dataset.
        Enables parallel processing by dividing the dataset.
        """
        size = len(self.dataset)
        for i in range(0, size, CHUNK_SIZE):
            yield self.dataset.select(range(i, min(i + CHUNK_SIZE, size)))

    def load_in_parallel(self, workers):
        """
        Parallelize the item conversion process using ProcessPoolExecutor.
        This dramatically improves performance for large datasets.
        """
        results = []
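        # Ceiling division, so the progress bar total is exact even when the size is a multiple of CHUNK_SIZE
        chunk_count = (len(self.dataset) + CHUNK_SIZE - 1) // CHUNK_SIZE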
        with ProcessPoolExecutor(max_workers=workers) as pool:
            for batch in tqdm(pool.map(self.from_chunk, self.chunk_generator()), total=chunk_count):
                results.extend(batch)

        # Annotate each item with the dataset category
        for result in results:
            result.category = self.name

        return results

    def load(self, workers=8):
        """
        Load and process the dataset from Hugging Face.
        Uses `load_in_parallel()` to apply cleaning and filtering in parallel.
        """
        start = datetime.now()
        print(f"Loading dataset {self.name}", flush=True)

        # Load raw category dataset (e.g., "raw_meta_Electronics")
        self.dataset = load_dataset(
            "McAuley-Lab/Amazon-Reviews-2023",
            f"raw_meta_{self.name}",
            split="full",
            trust_remote_code=True
        )

        # Clean and convert items in parallel
        results = self.load_in_parallel(workers)

        finish = datetime.now()
        print(f"Completed {self.name} with {len(results):,} datapoints in {(finish-start).total_seconds()/60:.1f} mins", flush=True)

        return results


In [None]:
%matplotlib inline

### 📘 Dataset Strategy

We have access to multiple product categories from the `McAuley-Lab/Amazon-Reviews-2023` dataset, such as:

- Automotive  
- Office Products  
- Tools and Home Improvement  
- Cell Phones and Accessories  
- Toys and Games  
- Appliances  
- Musical Instruments  
- **Electronics** (our primary focus)

To start, we use the **Appliances** category for initial exploration because it's relatively small.

Later, we shift to the **Electronics** category as our **main dataset**, which has the same structure but offers a larger and more diverse set of examples.



In [None]:
df = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Appliances", split="full", trust_remote_code=True).to_pandas()

In [None]:
print("🔍 Non-missing values per column:")
print(df.count())

Print the first row of the dataset to inspect the structure and available fields:


In [None]:
print(df.iloc[0])

## Now use the main Dataset

#### Load and process the "Electronics" category using the custom ItemLoader class with 2 parallel workers


In [None]:
items = ItemLoader("Electronics").load(workers=2)

In [None]:
print(f"A grand total of {len(items):,} items")

#### Print the training prompt of the 1st item to inspect the formatted structure:


In [None]:
print(items[0].prompt)

#### Print the test prompt of the first item (with the price removed) to simulate inference-time input:


In [None]:
print(items[0].test_prompt())
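
As a rough illustration of how the training and test prompts relate, the sketch below rebuilds both from plain strings. The function names `build_prompt` and `strip_price` and the sample product text are assumptions for illustration; the question, prefix, and formatting follow the `Item` class.

```python
QUESTION = "How much does this cost to the nearest dollar?"
PREFIX = "Price is $"

def build_prompt(text: str, price: float) -> str:
    """Training prompt: question, product text, then the rounded price as the answer."""
    return f"{QUESTION}\n\n{text}\n\n{PREFIX}{round(price)}.00"

def strip_price(prompt: str) -> str:
    """Test prompt: everything up to and including the prefix, with the answer removed."""
    return prompt.split(PREFIX)[0] + PREFIX

# Hypothetical product text and price
prompt = build_prompt("Acme USB-C cable, 2 m", 34.99)
print(prompt)
print("---")
print(strip_price(prompt))
```

Because the split happens at the first occurrence of the prefix, a product description that itself contained `Price is $` would be cut short in the test prompt.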

#### Visualize the distribution of token counts for the generated prompts. This helps understand how long the prompts are and whether they stay within expected token limits:

In [None]:
tokens = [item.token_count for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")
plt.xlabel('Length (tokens)')
plt.ylabel('Count')
plt.hist(tokens, rwidth=0.7, color="skyblue", bins=range(0, 300, 10))
plt.show()

The filters keep the product text between `MIN_TOKENS` (150) and `MAX_TOKENS` (160) tokens before the question and price are added, so prompts are informative enough for learning yet short enough to keep training efficient and fast.
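
The length control can be sketched without the real tokenizer. In the example below, the helper name `clip_tokens` is hypothetical and plain integers stand in for token IDs; the thresholds and the comparison match the constants and logic used in `Item.parse`.

```python
from typing import List, Optional

MIN_TOKENS = 150  # Texts with this many tokens or fewer are dropped
MAX_TOKENS = 160  # Longer texts are truncated to this many tokens

def clip_tokens(tokens: List[int]) -> Optional[List[int]]:
    """Return the truncated token list, or None if the text is too short to keep."""
    if len(tokens) <= MIN_TOKENS:
        return None
    return tokens[:MAX_TOKENS]

# Integers stand in for token IDs here
too_short = clip_tokens(list(range(150)))
kept_as_is = clip_tokens(list(range(155)))
truncated = clip_tokens(list(range(400)))
print(too_short, len(kept_as_is), len(truncated))
```

The final prompt is somewhat longer than `MAX_TOKENS`, since the question and the price answer are appended after truncation.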

#### Plot a histogram showing the distribution of product prices across all items:

In [None]:
prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="blueviolet", bins=range(0, 1000, 10))
plt.show()

## Objective

Craft a dataset that is more balanced in terms of price: less heavily skewed toward cheap items, with an average higher than $72.3.
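
A toy version of the balancing idea: group prices into whole-dollar buckets, cap the crowded cheap buckets, and keep everything above a threshold. The function name `balanced_sample`, its parameters, and the tiny price list are illustrative assumptions, and unlike the notebook's loop this sketch iterates only over buckets that actually exist.

```python
import random
from collections import defaultdict

def balanced_sample(prices, cap, uncapped_from, seed=42):
    """Keep at most `cap` items per rounded-price bucket below `uncapped_from`."""
    rng = random.Random(seed)
    slots = defaultdict(list)
    for price in prices:
        slots[round(price)].append(price)
    sample = []
    for bucket in sorted(slots):
        slot = slots[bucket]
        if bucket >= uncapped_from or len(slot) <= cap:
            sample.extend(slot)  # Rare or expensive buckets are kept whole
        else:
            sample.extend(rng.sample(slot, cap))  # Crowded cheap buckets are downsampled
    return sample

# Toy data: many cheap items, few expensive ones
prices = [5.0] * 10 + [20.0] * 6 + [300.0] * 2
sample = balanced_sample(prices, cap=3, uncapped_from=240)

print(f"Before: {len(prices)} items, mean ${sum(prices) / len(prices):.2f}")
print(f"After:  {len(sample)} items, mean ${sum(sample) / len(sample):.2f}")
```

Capping the cheap buckets shrinks the dataset but raises the average price, which is the effect the objective asks for.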

In [None]:
# Group items by rounded price from $1 to $999
# This creates a dictionary where each key is a rounded price,
# and the value is a list of items that have that price

slots = defaultdict(list)
for item in items:
    slots[round(item.price)].append(item)

In [None]:
# 🎲 Sample items across price buckets ($1–$999) with balanced representation
np.random.seed(42)
random.seed(42)

sample = []

for i in range(1, 1000):
    slot = slots[i]
    if i >= 240 or len(slot) <= 500:
        sample.extend(slot)
    else:
        sample.extend(random.sample(slot, 500))

print(f"There are {len(sample):,} items in the sample")

In [None]:
# Plot the distribution of prices in sample

prices = [float(item.price) for item in sample]
plt.figure(figsize=(15, 10))
plt.title(f"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="darkblue", bins=range(0, 1000, 10))
plt.show()


\* LLaMA's tokenizer maps numbers 1–999 to single tokens, unlike Qwen2, Gemma, and Phi-3, which split digits. This is a helpful (but not critical) advantage for the project.


## Finally

It's time to split our data into a training set and a test set.

It's typical to use 5%-10% of your data for testing purposes, but actually we have far more than we need at this point. We'll take 100,000 points for training, and we'll reserve 2,000 for testing, although we won't use all of them.
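
The split itself is a shuffle followed by two slices. The sketch below shows the pattern on toy data with a size check added; the helper name `split_sample` and the guard are assumptions for illustration, not part of the notebook.

```python
import random

def split_sample(items, n_train, n_test, seed=40):
    """Shuffle a copy of `items` and slice it into non-overlapping train and test lists."""
    if len(items) < n_train + n_test:
        raise ValueError("Not enough items for the requested split sizes")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

# Toy data: integers stand in for Item objects
train_toy, test_toy = split_sample(range(1000), n_train=800, n_test=100)
print(len(train_toy), len(test_toy))
print("Overlap:", len(set(train_toy) & set(test_toy)))
```

Slicing adjacent, non-overlapping ranges of one shuffled list is what keeps the same item from appearing in both sets.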


In [None]:
random.seed(40)
random.shuffle(sample)
train = sample[:100_000]
test = sample[100_000:102_000]
print(f"Divided into a training set of {len(train):,} items and test set of {len(test):,} items")

In [None]:
print(train[0].prompt)

In [None]:
print(test[0].test_prompt())

In [None]:
# Plot the distribution of prices in the first 250 test points

prices = [float(item.price) for item in test[:250]]
plt.figure(figsize=(15, 6))
plt.title(f"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="darkblue", bins=range(0, 1000, 10))
plt.show()

## Finally - upload your brand new dataset

Convert the items to prompts and upload them to the Hugging Face Hub.

In [None]:
train_prompts = [item.prompt for item in train]
train_prices = [item.price for item in train]
test_prompts = [item.test_prompt() for item in test]
test_prices = [item.price for item in test]

In [None]:
# Create a Dataset from the lists

train_dataset = Dataset.from_dict({"text": train_prompts, "price": train_prices})
test_dataset = Dataset.from_dict({"text": test_prompts, "price": test_prices})
dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})

In [None]:
# Uncomment these lines if you're ready to push to the hub, and replace my name with your HF username

# HF_USER = "vassilis19"
# DATASET_NAME = f"{HF_USER}/pricer-electronics-data"
# dataset.push_to_hub(DATASET_NAME, private=True)

In [None]:
# One more thing!
# Let's pickle the training and test dataset so we don't have to execute all this code next time!

with open('train.pkl', 'wb') as file:
    pickle.dump(train, file)

with open('test.pkl', 'wb') as file:
    pickle.dump(test, file)
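
For completeness, here is a minimal sketch of reading pickled data back in a later session. The helper names `save_pickle` and `load_pickle` are assumptions, and plain dictionaries stand in for `Item` objects so the example runs on its own.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Serialize `obj` to `path` in binary mode."""
    with open(path, "wb") as file:
        pickle.dump(obj, file)

def load_pickle(path):
    """Read back an object previously written with pickle.dump."""
    with open(path, "rb") as file:
        return pickle.load(file)

# Plain dicts stand in for Item objects in this round trip
records = [{"text": "Acme USB-C cable, 2 m", "price": 34.99}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.pkl")
    save_pickle(records, path)
    restored = load_pickle(path)

print(restored == records)
```

When loading the real `train.pkl` and `test.pkl`, the `Item` class must already be defined or importable in the session, because pickle stores a reference to the class rather than its code. Only unpickle files from sources you trust.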