A comprehensive benchmark for evaluating Large Language Models (LLMs) on pricing estimation tasks, inspired by "The Price is Right" game show. This project tests how well different LLMs can estimate the retail value of consumer products.
PriceIsRightLLM implements a simplified version of The Price is Right's Showcase where LLM contestants bid on collections of prizes. The benchmark evaluates models on their ability to:
- Estimate retail prices of consumer products
- Make strategic bids without going over the actual price
- Learn from example pricing data
- Provide reasoning for their estimates
- Pre-generated Showcases: 100+ deterministic showcases using random seeds for reproducible results
- Multiple LLM Backends: Support for OpenAI, Anthropic, Google, Groq, xAI and Fireworks models
- Comprehensive Analysis: Detailed statistics and performance metrics
- Cost Optimization: Prevents duplicate API calls and supports dry-run testing
- Model Management: Automated model discovery, testing, and backend mapping
- Status Change Tracking: Compare model availability and functionality across test runs
- Featured Models: Curated selection of current models from each backend for focused benchmarking
- Modern CLI: Clean command-line interface using Typer with helpful commands and options
- Python 3.8+
- Conda (recommended) or pip
- Clone the repository:
  git clone git@github.com:pymc-labs/PriceIsRightLLM.git
  cd PriceIsRightLLM
- Create and activate the conda environment:
  make create_environment
  conda activate PriceIsRightLLM
- Install dependencies:
  make requirements
- Install development dependencies (optional):
  pip install -r requirements-dev.txt
Set up API keys for the LLM providers you want to use:
# OpenAI
export OPENAI_API_KEY="your-openai-api-key"
# Anthropic
export ANTHROPIC_API_KEY="your-anthropic-api-key"
# Google (Gemini)
export GOOGLE_API_KEY="your-google-api-key"
# Groq
export GROQ_API_KEY="your-groq-api-key"
# xAI
export XAI_API_KEY="your-xai-api-key"
# Fireworks
export FIREWORKS_API_KEY="your-fireworks-api-key"
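If you want to confirm which keys are actually visible to Python before running anything, here is a minimal sketch using the variable names from the exports above:

```python
# Minimal sketch: report which provider API keys are set in the
# current environment (variable names taken from the exports above).
import os

KEYS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "GROQ_API_KEY",
    "XAI_API_KEY",
    "FIREWORKS_API_KEY",
]

for key in KEYS:
    print(f"{key}: {'set' if os.environ.get(key) else 'missing'}")
```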
First, discover available models and create the backend mapping:
cd src
python list_models.py create-map
This command automatically:
- Discovers available models from all configured backends
- Creates/updates the model-to-backend mapping
- Saves the model list to a historical database for tracking changes over time
- Shows a summary of model counts by backend
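If you want to double-check that summary outside the CLI, a minimal pandas sketch (the column names "model" and "backend" are assumptions; check the actual CSV header):

```python
# Minimal sketch: summarize the generated mapping with pandas.
# The "backend" column name is an assumption, not the confirmed schema.
import pandas as pd

mapping = pd.read_csv("data/model_backend_map.csv")
print(mapping["backend"].value_counts())  # model counts per backend
print(f"{len(mapping)} models total")
```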
You can also add models manually by specifying their name and backend type:
python list_models.py add-model gpt-4o openai
The system tracks model availability changes over time. You can:
View historical data:
python list_models.py show-history
See changes between runs:
python list_models.py show-changes
Export historical trends to CSV:
python list_models.py export-history
Skip historical tracking (if desired):
python list_models.py create-map --no-save-history
The historical database helps you:
- Track when new models become available
- Monitor model deprecations and removals
- Understand provider ecosystem changes
Files:
- Reads: data/model_backend_map.csv (if exists)
- Writes: data/model_backend_map.csv, data/model_history.db
- Exports: data/model_history.csv (optional)
Test the models to see which ones work and respond to text queries:
python showcase.py test-models
This command:
- Tests all models from the backend mapping
- Compares results with previous test runs (if contestants.csv exists)
- Categorizes models into status change categories (see the sketch below):
  - NEW MODELS - WORKING: new models that work
  - NEW MODELS - BROKEN: new models that don't work
  - FIXED - PREVIOUSLY BROKEN, NOW WORKING: models that were fixed since the last test
  - BROKEN - PREVIOUSLY WORKING, NOW BROKEN: models that broke since the last test
  - STILL WORKING: models that continue to work
  - STILL BROKEN: models that remain broken
- Updates contestants.csv with the current model status
Files:
- Reads: data/model_backend_map.csv, data/contestants.csv (if exists), data/full_price_is_right_products.csv
- Writes: data/contestants.csv
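The status-change comparison is conceptually simple. Here is a minimal sketch of the categorization, assuming each test run reduces to a mapping of model name to working/broken; the actual implementation in showcase.py may differ:

```python
# Sketch of the status-change categorization between two test runs.
# `previous` and `current` map model name -> True (working) / False (broken).
previous = {"gpt-4o": True, "legacy-model": True, "flaky-model": False}
current = {"gpt-4o": True, "flaky-model": True, "brand-new-model": False}

new_models = current.keys() - previous.keys()   # seen for the first time
known = current.keys() & previous.keys()        # present in both runs

categories = {
    "NEW MODELS - WORKING": [m for m in new_models if current[m]],
    "NEW MODELS - BROKEN": [m for m in new_models if not current[m]],
    "FIXED - PREVIOUSLY BROKEN, NOW WORKING": [m for m in known if current[m] and not previous[m]],
    "BROKEN - PREVIOUSLY WORKING, NOW BROKEN": [m for m in known if not current[m] and previous[m]],
    "STILL WORKING": [m for m in known if current[m] and previous[m]],
    "STILL BROKEN": [m for m in known if not current[m] and not previous[m]],
}
for label, models in categories.items():
    print(f"{label}: {sorted(models)}")
```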
Create the showcase database with pre-generated pricing scenarios:
python showcase.py make-showcases --n-seeds 100
This creates 100 showcases (seeds 0-99) stored in data/showcases.db.
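Because showcases.db is a standard Python shelve database, you can peek into it directly. A minimal sketch (the internal record layout is not assumed here; adjust the path to your working directory, and note that unpickling records may require the project's modules to be importable):

```python
# Minimal sketch: inspect the shelve database of pre-generated showcases.
import shelve

with shelve.open("data/showcases.db") as db:
    print(f"{len(db)} showcases stored")
    key = next(iter(db))  # grab one key without assuming its format
    print("example key:", key, "-> value type:", type(db[key]).__name__)
```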
Run the curated list of featured models:
python showcase.py run --n-seeds 1
Files:
- Reads: data/model_backend_map.csv, data/showcases.db, data/contestants.csv
- Writes: data/responses.db
Test a single model on multiple showcases:
python showcase.py run --model gpt-4o --n-seeds 10
Generate performance statistics, simulate head-to-head competitions, and compute the leaderboard:
python showcase.py stats
Files:
- Reads: data/responses.db, data/contestants.csv, data/model_backend_map.csv
- Writes: data/model_stats.csv, data/leaderboard.csv, data/leaderboard.json
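A quick way to look at the outputs, as a minimal sketch (leaderboard.csv is already sorted by Elo score; no particular column names are assumed):

```python
# Minimal sketch: load the stats outputs and show the current leaders.
import pandas as pd

leaderboard = pd.read_csv("data/leaderboard.csv")
print(leaderboard.head(10))
```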
Both scripts use Typer for a modern command-line interface. You can get help for any command:
# Get a list of commands
python showcase.py --help
python list_models.py --help
# Get help for specific commands
python showcase.py run --help
python showcase.py test-models --help
python showcase.py make-showcases --help
python list_models.py create-map --help
# Test all models and track status changes
python showcase.py test-models
# Run on specific number of showcases
python showcase.py run --model gpt-4o --n-seeds 5
# Dry run (no API calls)
python showcase.py run --model gpt-4o --dry-run
# Force re-run (ignore existing responses)
python showcase.py run --model gpt-4o --force
# Run all models in FEATURED_MODELS
python showcase.py run --n-seeds 5
# Remove model responses from database
python showcase.py remove-model --model gpt-4o
# Show how many showcases are prepared in the database
python showcase.py how-many-showcases
PriceIsRightLLM/
├── src/
│   ├── showcase.py                       # Main showcase experiment script
│   ├── list_models.py                    # Model discovery and management
│   ├── llm_backends.py                   # LLM backend implementations
│   ├── data_loader.py                    # Data loading utilities
│   ├── constants.py                      # Project constants
│   └── parse_price_is_right_prices.py    # Price data parsing
├── data/
│   ├── full_price_is_right_products.csv  # Product pricing data
│   ├── contestants.csv                   # Model test results and status
│   ├── model_backend_map.csv             # Model-to-backend mapping (CSV format)
│   ├── showcases.db                      # Pre-generated showcases (shelve database)
│   ├── responses.db                      # Model responses (shelve database)
│   ├── model_history.db                  # Historical model availability tracking
│   ├── competitions.csv                  # Head-to-head competition results
│   └── model_stats.csv                   # Performance statistics
├── requirements.txt                      # Python dependencies
├── requirements-dev.txt                  # Development dependencies
├── Makefile                              # Build and environment management
└── README.md                             # This file
Prices for the showcase items are from Fandom.com, where viewers have compiled prices of items on the show. Because this dataset is available on the internet, it could be in the training corpus of some LLMs, or they might access it before generating responses. However, so far none of the models are so accurate that it seems likely they are "cheating" in this way. Another limitation of this dataset is that different prices were compiled at different times, which adds an extraneous element of time to the challenge. To address these limitations, we are exploring alternative datasets for a future version of the benchmark.
- full_price_is_right_products.csv: Product pricing data with names, descriptions, and retail prices
- contestants.csv: Model test results with success/failure status and outcomes
- model_backend_map.csv: Mapping of model names to their backend providers (OpenAI, Anthropic, etc.)
- showcases.db: Shelve database containing pre-generated showcases (seeds 0-99)
- responses.db: Shelve database storing all model responses and bids
- model_history.db: Shelve database tracking historical model availability changes over time
- competitions.csv: Head-to-head competition results between models (no longer used)
- model_stats.csv: Performance statistics and metrics for all models
- leaderboard.csv: Leaderboard sorted by Elo score, CSV format
- leaderboard.json: Leaderboard sorted by Elo score, JSON format
Each showcase contains:
- Training items: 10 example products with prices (for context)
- Test items: 3 products to bid on
- Seed: Deterministic identifier for reproducibility
- Total value: Actual retail price of the test items
Model responses include:
- Bid: Estimated total value
- Rationale: Reasoning for the estimate
- Performance: Difference from actual value, over-bid status
- Metadata: Timestamp, model, showcase seed
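Concretely, a hypothetical sketch of the two record types described above (field names are illustrative only, not the exact schema stored in the shelve databases):

```python
# Illustrative sketch of a showcase record and a model response record.
# Field names are hypothetical; inspect the databases for the real layout.
showcase = {
    "seed": 42,
    "training_items": [  # 10 example products with prices, shown for context
        {"name": "Espresso machine", "description": "...", "price": 499.00},
        # ... 9 more items
    ],
    "test_items": [  # 3 products the model must bid on
        {"name": "Patio furniture set", "description": "..."},
        # ... 2 more items
    ],
    "total_value": 6150.00,  # actual retail price of the test items
}

response = {
    "model": "gpt-4o",
    "seed": 42,
    "bid": 5800.00,          # estimated total value
    "rationale": "Summed typical retail prices for each item.",
    "difference": -350.00,   # bid minus actual value
    "over_bid": False,       # True if the bid exceeded the actual value
    "timestamp": "2025-01-01T00:00:00Z",
}
```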
The system includes a curated set of featured models from major providers:
- OpenAI: gpt-4o, gpt-4-turbo, gpt-5, o1, o3, gpt-3.5-turbo
- Anthropic: claude-3-5-sonnet, claude-sonnet-4, claude-opus-4, claude-3-5-haiku
- Google: gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash
- Fireworks: qwen3-30b-a3b, llama-v3p3-70b-instruct, deepseek-v3, glm-4p5
The benchmark evaluates models on several metrics:
- Accuracy: How close bids are to actual values (MAPE)
- Over-bid Rate: Percentage of bids that exceed actual value
- Win Rate: Success rate in head-to-head competitions
- Elo Rating: Inferred rating based on head-to-head competitions
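As a rough illustration of the first two metrics, here is a minimal sketch computing MAPE and over-bid rate from (bid, actual value) pairs; the stats command derives these from data/responses.db, and its exact formulas may differ:

```python
# Sketch of the core metrics from (bid, actual) pairs per model.
def mape(pairs):
    """Mean absolute percentage error of bids vs. actual values."""
    return 100 * sum(abs(bid - actual) / actual for bid, actual in pairs) / len(pairs)

def over_bid_rate(pairs):
    """Fraction of bids that exceed the actual showcase value."""
    return sum(bid > actual for bid, actual in pairs) / len(pairs)

pairs = [(5800, 6150), (7200, 6900), (4100, 4500)]  # toy example data
print(f"MAPE: {mape(pairs):.1f}%  over-bid rate: {over_bid_rate(pairs):.0%}")
```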
This project is licensed under the Apache License - see the LICENSE file for details.
The price data, from Fandom.com, is under a CC-BY-SA license.