A comprehensive benchmark for evaluating Large Language Models (LLMs) on pricing estimation tasks, inspired by "The Price is Right" game show. This project tests how well different LLMs can estimate the retail value of consumer products.
PriceIsRightLLM implements a simplified version of The Price is Right's Showcase where LLM contestants bid on collections of prizes. The benchmark evaluates models on their ability to:
- Estimate retail prices of consumer products
- Make strategic bids without going over the actual price
- Learn from example pricing data
- Provide reasoning for their estimates
- Pre-generated Showcases: 100+ deterministic showcases using random seeds for reproducible results
- Multiple LLM Backends: Support for OpenAI, Anthropic, Google, Groq, xAI and Fireworks models
- Comprehensive Analysis: Detailed statistics and performance metrics
- Cost Optimization: Prevents duplicate API calls and supports dry-run testing
- Model Management: Automated model discovery, testing, and backend mapping
- Status Change Tracking: Compare model availability and functionality across test runs
- Featured Models: Curated selection of current models from each backend for focused benchmarking
- Modern CLI: Clean command-line interface using Typer with helpful commands and options
- Python 3.8+
- Conda (recommended) or pip
- Clone the repository:
  git clone git@github.com:pymc-labs/PriceIsRightLLM.git
  cd PriceIsRightLLM
- Create and activate the conda environment:
  make create_environment
  conda activate PriceIsRightLLM
- Install dependencies:
  make requirements
- Install development dependencies (optional):
  pip install -r requirements-dev.txt
Set up API keys for the LLM providers you want to use:
# OpenAI
export OPENAI_API_KEY="your-openai-api-key"
# Anthropic
export ANTHROPIC_API_KEY="your-anthropic-api-key"
# Google (Gemini)
export GOOGLE_API_KEY="your-google-api-key"
# Groq
export GROQ_API_KEY="your-groq-api-key"
# xAI
export XAI_API_KEY="your-xai-api-key"
# Fireworks
export FIREWORKS_API_KEY="your-fireworks-api-key"
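If you want to confirm which keys are actually visible to Python before running anything, here is a minimal sketch using the variable names from the exports above:

```python
# Minimal sketch: report which provider API keys are set in the
# current environment (variable names taken from the exports above).
import os

KEYS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "GROQ_API_KEY",
    "XAI_API_KEY",
    "FIREWORKS_API_KEY",
]

for key in KEYS:
    print(f"{key}: {'set' if os.environ.get(key) else 'missing'}")
```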
First, discover available models and create the backend mapping:
cd src
python list_models.py create-map
This command automatically:
- Discovers available models from all configured backends
- Creates/updates the model-to-backend mapping
- Saves the model list to a historical database for tracking changes over time
- Shows a summary of model counts by backend
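If you want to double-check that summary outside the CLI, a minimal pandas sketch (the column names "model" and "backend" are assumptions; check the actual CSV header):

```python
# Minimal sketch: summarize the generated mapping with pandas.
# The "backend" column name is an assumption, not the confirmed schema.
import pandas as pd

mapping = pd.read_csv("data/model_backend_map.csv")
print(mapping["backend"].value_counts())  # model counts per backend
print(f"{len(mapping)} models total")
```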
You can also add models manually by specifying their name and backend type:
python list_models.py add-model gpt-4o openai
The system tracks model availability changes over time. You can:
View historical data:
python list_models.py show-history
See changes between runs:
python list_models.py show-changes
Export historical trends to CSV:
python list_models.py export-history
Skip historical tracking (if desired):
python list_models.py create-map --no-save-history
The historical database helps you:
- Track when new models become available
- Monitor model deprecations and removals
- Understand provider ecosystem changes
Files:
- Reads: data/model_backend_map.csv (if exists)
- Writes: data/model_backend_map.csv, data/model_history.db
- Exports: data/model_history.csv (optional)
Test the models to see which ones work and respond to text queries:
python showcase.py test-models
This command:
- Tests all models from the backend mapping
- Compares results with previous test runs (if contestants.csv exists)
- Categorizes models into status change categories (see the sketch below):
  - NEW MODELS - WORKING: new models that work
  - NEW MODELS - BROKEN: new models that don't work
  - FIXED - PREVIOUSLY BROKEN, NOW WORKING: models that were fixed since the last test
  - BROKEN - PREVIOUSLY WORKING, NOW BROKEN: models that broke since the last test
  - STILL WORKING: models that continue to work
  - STILL BROKEN: models that remain broken
- Updates contestants.csv with the current model status
Files:
- Reads: data/model_backend_map.csv, data/contestants.csv (if exists), data/full_price_is_right_products.csv
- Writes: data/contestants.csv
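The status-change comparison is conceptually simple. Here is a minimal sketch of the categorization, assuming each test run reduces to a mapping of model name to working/broken; the actual implementation in showcase.py may differ:

```python
# Sketch of the status-change categorization between two test runs.
# `previous` and `current` map model name -> True (working) / False (broken).
previous = {"gpt-4o": True, "legacy-model": True, "flaky-model": False}
current = {"gpt-4o": True, "flaky-model": True, "brand-new-model": False}

new_models = current.keys() - previous.keys()   # seen for the first time
known = current.keys() & previous.keys()        # present in both runs

categories = {
    "NEW MODELS - WORKING": [m for m in new_models if current[m]],
    "NEW MODELS - BROKEN": [m for m in new_models if not current[m]],
    "FIXED - PREVIOUSLY BROKEN, NOW WORKING": [m for m in known if current[m] and not previous[m]],
    "BROKEN - PREVIOUSLY WORKING, NOW BROKEN": [m for m in known if not current[m] and previous[m]],
    "STILL WORKING": [m for m in known if current[m] and previous[m]],
    "STILL BROKEN": [m for m in known if not current[m] and not previous[m]],
}
for label, models in categories.items():
    print(f"{label}: {sorted(models)}")
```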
Create the showcase database with pre-generated pricing scenarios:
python showcase.py make-showcases --n-seeds 100
This creates 100 showcases (seeds 0-99) stored in data/showcases.db.
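Because showcases.db is a standard Python shelve database, you can peek into it directly. A minimal sketch (the internal record layout is not assumed here; adjust the path to your working directory, and note that unpickling records may require the project's modules to be importable):

```python
# Minimal sketch: inspect the shelve database of pre-generated showcases.
import shelve

with shelve.open("data/showcases.db") as db:
    print(f"{len(db)} showcases stored")
    key = next(iter(db))  # grab one key without assuming its format
    print("example key:", key, "-> value type:", type(db[key]).__name__)
```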
Run the curated list of featured models:
python showcase.py run --n-seeds 1
Files:
- Reads: data/model_backend_map.csv, data/showcases.db, data/contestants.csv
- Writes: data/responses.db
Test a single model on multiple showcases:
python showcase.py run --model gpt-4o --n-seeds 10
Generate performance statistics, simulate head-to-head competitions, and compute the leaderboard:
python showcase.py stats
Files:
- Reads: data/responses.db, data/contestants.csv, data/model_backend_map.csv
- Writes: data/model_stats.csv, data/leaderboard.csv, data/leaderboard.json
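A quick way to look at the outputs, as a minimal sketch (leaderboard.csv is already sorted by Elo score; no particular column names are assumed):

```python
# Minimal sketch: load the stats outputs and show the current leaders.
import pandas as pd

leaderboard = pd.read_csv("data/leaderboard.csv")
print(leaderboard.head(10))
```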
Both scripts use Typer for a modern command-line interface. You can get help for any command:
# Get a list of commands
python showcase.py --help
python list_models.py --help
# Get help for specific commands
python showcase.py run --help
python showcase.py test-models --help
python showcase.py make-showcases --help
python list_models.py create-map --help
# Test all models and track status changes
python showcase.py test-models
# Run on specific number of showcases
python showcase.py run --model gpt-4o --n-seeds 5
# Dry run (no API calls)
python showcase.py run --model gpt-4o --dry-run
# Force re-run (ignore existing responses)
python showcase.py run --model gpt-4o --force
# Run all models in FEATURED_MODELS
python showcase.py run --n-seeds 5
# Remove model responses from database
python showcase.py remove-model --model gpt-4o
# Show how many showcases are prepared in the database
python showcase.py how-many-showcases
PriceIsRightLLM/
├── src/
│   ├── showcase.py                       # Main showcase experiment script
│   ├── list_models.py                    # Model discovery and management
│   ├── llm_backends.py                   # LLM backend implementations
│   ├── data_loader.py                    # Data loading utilities
│   ├── constants.py                      # Project constants
│   └── parse_price_is_right_prices.py    # Price data parsing
├── data/
│   ├── full_price_is_right_products.csv  # Product pricing data
│   ├── contestants.csv                   # Model test results and status
│   ├── model_backend_map.csv             # Model-to-backend mapping (CSV format)
│   ├── showcases.db                      # Pre-generated showcases (shelve database)
│   ├── responses.db                      # Model responses (shelve database)
│   ├── model_history.db                  # Historical model availability tracking
│   ├── competitions.csv                  # Head-to-head competition results
│   └── model_stats.csv                   # Performance statistics
├── requirements.txt                      # Python dependencies
├── requirements-dev.txt                  # Development dependencies
├── Makefile                              # Build and environment management
└── README.md                             # This file
Prices for the showcase items are from Fandom.com, where viewers have compiled prices of items on the show. Because this dataset is available on the internet, it could be in the training corpus of some LLMs, or they might access it before generating responses. However, so far none of the models are so accurate that it seems likely they are "cheating" in this way. Another limitation of this dataset is that different prices were compiled at different times, which adds an extraneous element of time to the challenge. To address these limitations, we are exploring alternative datasets for a future version of the benchmark.
- full_price_is_right_products.csv: Product pricing data with names, descriptions, and retail prices
- contestants.csv: Model test results with success/failure status and outcomes
- model_backend_map.csv: Mapping of model names to their backend providers (OpenAI, Anthropic, etc.)
- showcases.db: Shelve database containing pre-generated showcases (seeds 0-99)
- responses.db: Shelve database storing all model responses and bids
- model_history.db: Shelve database tracking historical model availability changes over time
- competitions.csv: Head-to-head competition results between models (no longer used)
- model_stats.csv: Performance statistics and metrics for all models
- leaderboard.csv: Leaderboard sorted by Elo score, CSV format
- leaderboard.json: Leaderboard sorted by Elo score, JSON format
Each showcase contains:
- Training items: 10 example products with prices (for context)
- Test items: 3 products to bid on
- Seed: Deterministic identifier for reproducibility
- Total value: Actual retail price of the test items
Model responses include:
- Bid: Estimated total value
- Rationale: Reasoning for the estimate
- Performance: Difference from actual value, over-bid status
- Metadata: Timestamp, model, showcase seed
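Concretely, a hypothetical sketch of the two record types described above (field names are illustrative only, not the exact schema stored in the shelve databases):

```python
# Illustrative sketch of a showcase record and a model response record.
# Field names are hypothetical; inspect the databases for the real layout.
showcase = {
    "seed": 42,
    "training_items": [  # 10 example products with prices, shown for context
        {"name": "Espresso machine", "description": "...", "price": 499.00},
        # ... 9 more items
    ],
    "test_items": [  # 3 products the model must bid on
        {"name": "Patio furniture set", "description": "..."},
        # ... 2 more items
    ],
    "total_value": 6150.00,  # actual retail price of the test items
}

response = {
    "model": "gpt-4o",
    "seed": 42,
    "bid": 5800.00,          # estimated total value
    "rationale": "Summed typical retail prices for each item.",
    "difference": -350.00,   # bid minus actual value
    "over_bid": False,       # True if the bid exceeded the actual value
    "timestamp": "2025-01-01T00:00:00Z",
}
```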
The system includes a curated set of featured models from major providers:
- OpenAI: gpt-4o, gpt-4-turbo, gpt-5, o1, o3, gpt-3.5-turbo
- Anthropic: claude-3-5-sonnet, claude-sonnet-4, claude-opus-4, claude-3-5-haiku
- Google: gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash
- Fireworks: qwen3-30b-a3b, llama-v3p3-70b-instruct, deepseek-v3, glm-4p5
The benchmark evaluates models on several metrics:
- Accuracy: How close bids are to actual values (MAPE)
- Over-bid Rate: Percentage of bids that exceed actual value
- Win Rate: Success rate in head-to-head competitions
- Elo Rating: Inferred rating based on head-to-head competitions
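As a rough illustration of the first two metrics, here is a minimal sketch computing MAPE and over-bid rate from (bid, actual value) pairs; the stats command derives these from data/responses.db, and its exact formulas may differ:

```python
# Sketch of the core metrics from (bid, actual) pairs per model.
def mape(pairs):
    """Mean absolute percentage error of bids vs. actual values."""
    return 100 * sum(abs(bid - actual) / actual for bid, actual in pairs) / len(pairs)

def over_bid_rate(pairs):
    """Fraction of bids that exceed the actual showcase value."""
    return sum(bid > actual for bid, actual in pairs) / len(pairs)

pairs = [(5800, 6150), (7200, 6900), (4100, 4500)]  # toy example data
print(f"MAPE: {mape(pairs):.1f}%  over-bid rate: {over_bid_rate(pairs):.0%}")
```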
This project is licensed under the Apache License - see the LICENSE file for details.
The price data, from Fandom.com, is under a CC-BY-SA license.