Can LLMs Play The Price Is Right?

A comprehensive benchmark for evaluating Large Language Models (LLMs) on pricing estimation tasks, inspired by "The Price is Right" game show. This project tests how well different LLMs can estimate the retail value of consumer products.

Overview

PriceIsRightLLM implements a simplified version of The Price is Right's Showcase where LLM contestants bid on collections of prizes. The benchmark evaluates models on their ability to:

  • Estimate retail prices of consumer products
  • Make strategic bids without going over the actual price
  • Learn from example pricing data
  • Provide reasoning for their estimates

Features

  • Pre-generated Showcases: 100+ deterministic showcases using random seeds for reproducible results
  • Multiple LLM Backends: Support for OpenAI, Anthropic, Google, Groq, xAI and Fireworks models
  • Comprehensive Analysis: Detailed statistics and performance metrics
  • Cost Optimization: Prevents duplicate API calls and supports dry-run testing
  • Model Management: Automated model discovery, testing, and backend mapping
  • Status Change Tracking: Compare model availability and functionality across test runs
  • Featured Models: Curated selection of current models from each backend for focused benchmarking
  • Modern CLI: Clean command-line interface using Typer with helpful commands and options

Installation

Prerequisites

  • Python 3.8+
  • Conda (recommended) or pip

Setup

  1. Clone the repository:

    git clone git@github.com:pymc-labs/PriceIsRightLLM.git
    cd PriceIsRightLLM
  2. Create and activate conda environment:

    make create_environment
    conda activate PriceIsRightLLM
  3. Install dependencies:

    make requirements
  4. Install development dependencies (optional):

    pip install -r requirements-dev.txt

Environment Variables

Set up API keys for the LLM providers you want to use:

# OpenAI
export OPENAI_API_KEY="your-openai-api-key"

# Anthropic
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# Google (Gemini)
export GOOGLE_API_KEY="your-google-api-key"

# Groq
export GROQ_API_KEY="your-groq-api-key"

# xAI
export XAI_API_KEY="your-xai-api-key"

# Fireworks
export FIREWORKS_API_KEY="your-fireworks-api-key"
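
A quick way to confirm which keys your shell is actually exporting is to check the environment from Python. This is just a convenience sketch, not part of the project's scripts; the variable names match the exports above.

import os

PROVIDER_KEYS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "GROQ_API_KEY",
    "XAI_API_KEY",
    "FIREWORKS_API_KEY",
]

# Report which provider keys are set without printing their values.
for name in PROVIDER_KEYS:
    status = "set" if os.environ.get(name) else "missing"
    print(f"{name}: {status}")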

Quick Start

1. Set Up Model Backend Mapping

First, discover available models and create the backend mapping:

cd src
python list_models.py create-map

This command automatically:

  • Discovers available models from all configured backends
  • Creates/updates the model-to-backend mapping
  • Saves the model list to a historical database for tracking changes over time
  • Shows a summary of model counts by backend

You can also add models manually by specifying their name and backend type:

python list_models.py add-model gpt-4o openai
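
The mapping file itself is a plain CSV, so it is easy to inspect outside the CLI. A minimal sketch (the column names "model" and "backend" are assumptions; check the file's header for the real ones):

import csv
from collections import Counter

# Count models per backend from the mapping written by `create-map`.
with open("data/model_backend_map.csv", newline="") as f:
    rows = list(csv.DictReader(f))

counts = Counter(row["backend"] for row in rows)   # column name assumed
for backend, n in sorted(counts.items()):
    print(f"{backend}: {n} models")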

2. Historical Model Tracking

The system tracks model availability changes over time. You can:

View historical data:

python list_models.py show-history

See changes between runs:

python list_models.py show-changes

Export historical trends to CSV:

python list_models.py export-history

Skip historical tracking (if desired):

python list_models.py create-map --no-save-history

The historical database helps you:

  • Track when new models become available
  • Monitor model deprecations and removals
  • Understand provider ecosystem changes

Files:

  • Reads: data/model_backend_map.csv (if exists)
  • Writes: data/model_backend_map.csv, data/model_history.db
  • Exports: data/model_history.csv (optional)

3. Test Models

Test the models to see which ones work and respond to text queries:

python showcase.py test-models

This command:

  • Tests all models from the backend mapping
  • Compares results with previous test runs (if contestants.csv exists)
  • Groups models into status-change categories:
    • πŸ†• NEW MODELS - WORKING: New models that work
    • πŸ†• NEW MODELS - BROKEN: New models that don't work
    • πŸ”§ FIXED - PREVIOUSLY BROKEN, NOW WORKING: Models that were fixed
    • πŸ’₯ BROKEN - PREVIOUSLY WORKING, NOW BROKEN: Models that broke since last test
    • βœ… STILL WORKING: Models that continue to work
    • ❌ STILL BROKEN: Models that remain broken
  • Updates contestants.csv with current model status
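
The status-change bookkeeping boils down to comparing each model's previous result with the current one. A hedged sketch of that decision table (illustrative only, not the project's actual code):

from typing import Optional

def categorize(previous: Optional[bool], current: bool) -> str:
    # `previous` is None when the model was not in contestants.csv before.
    if previous is None:
        return "NEW MODELS - WORKING" if current else "NEW MODELS - BROKEN"
    if not previous and current:
        return "FIXED - PREVIOUSLY BROKEN, NOW WORKING"
    if previous and not current:
        return "BROKEN - PREVIOUSLY WORKING, NOW BROKEN"
    return "STILL WORKING" if current else "STILL BROKEN"

print(categorize(None, True))    # NEW MODELS - WORKING
print(categorize(True, False))   # BROKEN - PREVIOUSLY WORKING, NOW BROKEN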

Files:

  • Reads: data/model_backend_map.csv, data/contestants.csv (if exists), data/full_price_is_right_products.csv
  • Writes: data/contestants.csv

4. Generate Showcases

Create the showcase database with pre-generated pricing scenarios:

python showcase.py make-showcases --n-seeds 100

This creates 100 showcases (seeds 0-99) stored in data/showcases.db.
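
The seeds are what make the showcases reproducible: the same seed always produces the same item draw. The following is only a sketch of that idea, not the project's actual generator in showcase.py (item counts follow the data format described below; column handling is assumed):

import csv
import random

def sketch_showcase(seed, products):
    # Seeded sampling: rerunning with the same seed gives the same showcase.
    rng = random.Random(seed)
    picks = rng.sample(products, 13)               # 10 training items + 3 test items
    return {"seed": seed, "training": picks[:10], "test": picks[10:]}

with open("data/full_price_is_right_products.csv", newline="") as f:
    products = list(csv.DictReader(f))

print(sketch_showcase(0, products)["test"])        # identical output on every run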

5. Run Featured Models

Run the curated list of featured models:

python showcase.py run --n-seeds 1

Files:

  • Reads: data/model_backend_map.csv, data/showcases.db, data/contestants.csv
  • Writes: data/responses.db

6. Run a Specific Model

Test a single model on multiple showcases:

python showcase.py run --model gpt-4o --n-seeds 10

7. Generate Statistics

Generate performance statistics, simulate head-to-head competitions, and compute the leaderboard:

python showcase.py stats

Files:

  • Reads: data/responses.db, data/contestants.csv, data/model_backend_map.csv
  • Writes: data/model_stats.csv, data/leaderboard.csv, data/leaderboard.json

Command Reference

Both scripts use Typer for a modern command-line interface. You can get help for any command:

# Get a list of commands
python showcase.py --help
python list_models.py --help

# Get help for specific commands
python showcase.py run --help
python showcase.py test-models --help
python showcase.py make-showcases --help
python list_models.py create-map --help

Usage Examples

Showcase Commands

# Test all models and track status changes
python showcase.py test-models

# Run on specific number of showcases
python showcase.py run --model gpt-4o --n-seeds 5

# Dry run (no API calls)
python showcase.py run --model gpt-4o --dry-run

# Force re-run (ignore existing responses)
python showcase.py run --model gpt-4o --force

# Run all models in FEATURED_MODELS
python showcase.py run --n-seeds 5

# Remove model responses from database
python showcase.py remove-model --model gpt-4o

# Show how many showcases are prepared in the database
python showcase.py how-many-showcases

Project Structure

PriceIsRightLLM/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ showcase.py              # Main showcase experiment script
β”‚   β”œβ”€β”€ list_models.py           # Model discovery and management
β”‚   β”œβ”€β”€ llm_backends.py          # LLM backend implementations
β”‚   β”œβ”€β”€ data_loader.py           # Data loading utilities
β”‚   β”œβ”€β”€ constants.py             # Project constants
β”‚   └── parse_price_is_right_prices.py  # Price data parsing
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ full_price_is_right_products.csv  # Product pricing data
β”‚   β”œβ”€β”€ contestants.csv          # Model test results and status
β”‚   β”œβ”€β”€ model_backend_map.csv    # Model-to-backend mapping (CSV format)
β”‚   β”œβ”€β”€ showcases.db             # Pre-generated showcases (shelve database)
β”‚   β”œβ”€β”€ responses.db             # Model responses (shelve database)
β”‚   β”œβ”€β”€ model_history.db         # Historical model availability tracking
β”‚   β”œβ”€β”€ competitions.csv         # Head-to-head competition results
β”‚   └── model_stats.csv          # Performance statistics
β”œβ”€β”€ requirements.txt             # Python dependencies
β”œβ”€β”€ requirements-dev.txt         # Development dependencies
β”œβ”€β”€ Makefile                     # Build and environment management
└── README.md                   # This file

Data Files

Prices for the showcase items are from Fandom.com, where viewers have compiled prices of items on the show. Because this dataset is available on the internet, it could be in the training corpus of some LLMs, or they might access it before generating responses. However, so far none of the models are so accurate that it seems likely they are "cheating" in this way. Another limitation of this dataset is that different prices were compiled at different times, which adds an extraneous element of time to the challenge. To address these limitations, we are exploring alternative datasets for a future version of the benchmark.

Core Data Files

  • full_price_is_right_products.csv: Product pricing data with names, descriptions, and retail prices
  • contestants.csv: Model test results with success/failure status and outcomes
  • model_backend_map.csv: Mapping of model names to their backend providers (OpenAI, Anthropic, etc.)

Database Files

  • showcases.db: Shelve database containing pre-generated showcases (seeds 0-99)
  • responses.db: Shelve database storing all model responses and bids
  • model_history.db: Shelve database tracking historical model availability changes over time
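
Because these are standard Python shelve files, they can be opened directly for ad-hoc inspection. A minimal, exploratory sketch (the key and value layout is an implementation detail of the scripts, and depending on the dbm backend shelve.open may expect the path with or without the .db suffix):

import shelve

# Peek at the pre-generated showcases without going through showcase.py.
with shelve.open("data/showcases") as db:    # try "data/showcases.db" if this fails
    print(f"{len(db)} showcases stored")
    for key in sorted(db.keys())[:3]:
        print(key, type(db[key]))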

Results Files

  • competitions.csv: Head-to-head competition results between models (no longer used)
  • model_stats.csv: Performance statistics and metrics for all models
  • leaderboard.csv: Leaderboard sorted by Elo score, CSV format
  • leaderboard.json: Leaderboard sorted by Elo score, JSON format

Data Format

Showcases

Each showcase contains:

  • Training items: 10 example products with prices (for context)
  • Test items: 3 products to bid on
  • Seed: Deterministic identifier for reproducibility
  • Total value: Actual retail price of the test items
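
In code, a single showcase can be pictured as a small record along these lines (an illustrative shape only; the field names and exact schema stored in showcases.db are assumptions):

showcase = {
    "seed": 0,                                   # deterministic identifier
    "training_items": [                          # 10 priced examples given as context
        {"name": "Example blender", "price": 89.99},
        # ...nine more priced items
    ],
    "test_items": [                              # 3 items the model must bid on
        {"name": "Example patio set"},
        {"name": "Example e-bike"},
        {"name": "Example espresso machine"},
    ],
    "total_value": 4315.00,                      # actual combined retail price of the test items
}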

Responses

Model responses include:

  • Bid: Estimated total value
  • Rationale: Reasoning for the estimate
  • Performance: Difference from actual value, over-bid status
  • Metadata: Timestamp, model, showcase seed
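
A stored response can be pictured the same way (again, field names are assumptions keyed to the bullets above). The performance fields follow directly from the bid and the showcase's actual total:

response = {
    "model": "gpt-4o",
    "seed": 0,
    "bid": 3800.00,
    "rationale": "Summed rough estimates for the three items.",
    "timestamp": "2025-01-15T12:00:00Z",
}

actual_total = 4315.00                                    # from the matching showcase
response["difference"] = response["bid"] - actual_total   # negative means under-bid
response["over_bid"] = response["bid"] > actual_total     # over-bids go bust, as on the show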

Featured Models

The system includes 19 curated featured models from major providers:

  • OpenAI: gpt-4o, gpt-4-turbo, gpt-5, o1, o3, gpt-3.5-turbo
  • Anthropic: claude-3-5-sonnet, claude-sonnet-4, claude-opus-4, claude-3-5-haiku
  • Google: gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash
  • Fireworks: qwen3-30b-a3b, llama-v3p3-70b-instruct, deepseek-v3, glm-4p5

Benchmark Results

The benchmark evaluates models on several metrics:

  • Accuracy: How close bids are to actual values (MAPE)
  • Over-bid Rate: Percentage of bids that exceed actual value
  • Win Rate: Success rate in head-to-head competitions
  • Elo Rating: Inferred rating based on head-to-head competitions
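
Under some assumptions about field names, the first two metrics reduce to a few lines of arithmetic over a model's stored responses. A hedged sketch (illustrative data, not the project's stats code):

results = [                       # each entry pairs a bid with the showcase's actual value
    {"bid": 3800.0, "actual": 4315.0},
    {"bid": 5200.0, "actual": 4990.0},
    {"bid": 1450.0, "actual": 1610.0},
]

mape = sum(abs(r["bid"] - r["actual"]) / r["actual"] for r in results) / len(results)
over_bid_rate = sum(r["bid"] > r["actual"] for r in results) / len(results)

print(f"MAPE: {mape:.1%}")                    # mean absolute percentage error of the bids
print(f"Over-bid rate: {over_bid_rate:.0%}")  # share of bids exceeding the actual value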

License

This project is licensed under the Apache License - see the LICENSE file for details.

The price data, from Fandom.com, is under a CC-BY-SA license.
