BenchmarkCAT

Computerized Adaptive Testing for LLM Benchmarks

Python 3.8+ · License: MIT

A comprehensive framework for evaluating Large Language Models using Item Response Theory (IRT) and Computerized Adaptive Testing (CAT)

📖 Documentation · 🚀 Quick Start · 🌐 Web UI · 📊 Examples


Overview

BenchmarkCAT applies Item Response Theory (IRT) and Computerized Adaptive Testing (CAT) to LLM evaluation, enabling:

  • Efficient Testing: Adaptive item selection reduces test length while maintaining measurement precision
  • Precise Ability Estimation: IRT-based ability estimates (θ) with confidence intervals
  • Fair Comparison: Compare different LLMs on the same ability scale
  • Item-Level Analysis: Understand which items discriminate well between models

Features

| Feature | Description |
|---|---|
| 🔧 Full catsim Integration | 12 selectors, 7 estimators, and 4 stopping criteria |
| 🤖 Multi-LLM Support | OpenAI, Claude, Gemini, LMStudio, Ollama |
| 🎯 Answer Extraction | Multi-strategy extraction with confidence scoring |
| 🌐 Streamlit Web UI | Interactive 6-page dashboard for testing |
| ⚡ Async API | Efficient concurrent testing |
| 📊 Visualization | Built-in plotting with Plotly |
| 🔄 Batch Testing | Parallel execution with aggregate analysis |
| 📁 Multiple Formats | CSV, JSON, YAML configuration support |

Installation

```bash
# Core library
pip install -r requirements.txt

# With Web UI support
pip install -r requirements.txt
pip install -r requirements-ui.txt
```

Or install from source:

```bash
git clone https://github.com/zjiang4/LLM-CAT.git
cd LLM-CAT
pip install -e .
```

Quick Start

Basic Usage

```python
import asyncio
from benchmarkcat import CATEngine, ItemBank, CATConfig, LLMConfig

async def main():
    # Load item bank
    item_bank = ItemBank.from_csv("item_bank.csv", encoding="utf-8")

    # Configure CAT
    cat_config = CATConfig(
        selector={"type": "MaxInfoSelector"},
        estimator={"method": "bounded"},
        stopper={"type": "MaxItemStopper", "max_items": 20}
    )

    # Configure LLM
    llm_config = LLMConfig(
        provider="openai",
        model="gpt-4o-mini",
        api_key="your-api-key"  # or use OPENAI_API_KEY env var
    )

    # Run test
    engine = CATEngine(cat_config, llm_config, item_bank)
    result = await engine.run_test()

    print(result.summary())

asyncio.run(main())
```

Web UI

Launch the interactive Streamlit dashboard:

```bash
streamlit run ui/app.py
```

The UI provides 6 pages:

  1. LLM Connection - Configure and test LLM providers
  2. Item Pool - Upload and validate item banks
  3. CAT Configuration - Select selectors, estimators, stoppers
  4. Run Test - Execute tests with real-time progress
  5. Results - Interactive visualizations and export
  6. Batch Testing - Parallel multi-test execution

Configuration Options

Selectors (12 types)

| Selector | Description | Key Parameters |
|---|---|---|
| MaxInfoSelector | Maximum information at current θ | r_max |
| RandomSelector | Random selection | - |
| LinearSelector | Predefined order | indexes |
| RandomesqueSelector | Random from top-n | bin_size |
| IntervalInfoSelector | Maximize interval info | interval |
| UrrySelector | Urry's method | - |
| The54321Selector | 5-4-3-2-1 method | - |
| ClusterSelector | Cluster-based | clusters, method |
| AStratSelector | α-stratified | test_size |
| AStratBBlockSelector | α-strat + b-blocking | test_size |
| MaxInfoStratSelector | Max info stratification | test_size |
| MaxInfoBBlockSelector | MIS + b-blocking | test_size |
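For intuition, maximum-information selection scores every remaining item by its Fisher information at the current θ estimate and administers the argmax. Below is a minimal self-contained sketch for 4PL items; it is an illustration of the idea, not catsim's implementation:

```python
import math

def p_4pl(theta, a, b, c, d):
    """4PL response probability: c + (d - c) / (1 + exp(-a * (theta - b)))."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c, d):
    """Item information I(theta) = P'(theta)^2 / (P * (1 - P))."""
    p = p_4pl(theta, a, b, c, d)
    dp = a * (p - c) * (d - p) / (d - c)  # derivative of the 4PL curve
    return dp * dp / (p * (1.0 - p))

def max_info_select(theta, items, administered):
    """Return the index of the most informative unadministered item.

    `items` is a list of (a, b, c, d) tuples; `administered` is a set
    of indices already used.
    """
    best, best_info = None, -1.0
    for i, (a, b, c, d) in enumerate(items):
        if i in administered:
            continue
        info = fisher_info(theta, a, b, c, d)
        if info > best_info:
            best, best_info = i, info
    return best
```

For example, at θ = 0 an item with b = 0 is more informative than an easier item with b = -1, so `max_info_select` prefers it.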

Estimators (7 methods)

| Method | Description | Speed | Accuracy |
|---|---|---|---|
| bounded | scipy bounded optimization | ★★★★★ | ★★★★★ |
| brent | Brent's method | ★★★★★ | ★★★★★ |
| golden | Golden-section search | ★★★★☆ | ★★★★☆ |
| golden2 | Improved golden-section | ★★★★☆ | ★★★★☆ |
| ternary | Ternary search | ★★★☆☆ | ★★★★☆ |
| dichotomous | Binary search | ★★★☆☆ | ★★★★☆ |
| fibonacci | Fibonacci search | ★★★★☆ | ★★★★☆ |
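All of these methods are one-dimensional searches for the θ that maximizes the response likelihood over a bounded interval. A pure-Python sketch of the ternary-search variant under a 4PL model (illustrative only; the actual estimators come from catsim):

```python
import math

def p_4pl(theta, a, b, c, d):
    """4PL probability of a correct response."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

def neg_log_likelihood(theta, items, responses):
    """Negative log-likelihood of 0/1 responses given (a, b, c, d) items."""
    nll = 0.0
    for (a, b, c, d), r in zip(items, responses):
        p = p_4pl(theta, a, b, c, d)
        nll -= math.log(p) if r else math.log(1.0 - p)
    return nll

def estimate_theta(items, responses, lo=-4.0, hi=4.0, tol=1e-6):
    """Ternary search for the bounded MLE of theta.

    Valid because the 4PL log-likelihood is unimodal in theta, so
    shrinking the bracket by thirds converges to the maximum.
    """
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if neg_log_likelihood(m1, items, responses) < neg_log_likelihood(m2, items, responses):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```

With two identical items (b = 0) and one right, one wrong response, the search lands on θ ≈ 0, as expected. The bounds matter: an all-correct response pattern has no interior maximum, which is why bounded methods are the default.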

Stopping Criteria (4 types)

| Stopper | Description | Key Parameters |
|---|---|---|
| MaxItemStopper | Fixed test length | max_items |
| MinErrorStopper | Target precision | min_error |
| TestLengthStopper | Min/max bounds | min_items, max_items |
| ConfidenceIntervalStopper | CI-based stopping | confidence, interval_bounds |
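Precision-based stopping keys off the standard error of measurement, SEM = 1/√I(θ), which shrinks as item information accumulates. A sketch combining length bounds with a precision target (the function and parameter names here are illustrative, not the library's API):

```python
import math

def standard_error(total_info):
    """SEM from accumulated test information: 1 / sqrt(I(theta))."""
    return 1.0 / math.sqrt(total_info)

def should_stop(total_info, n_items, min_error=0.3,
                min_items=5, max_items=30):
    """TestLengthStopper-style bounds combined with a
    MinErrorStopper-style precision target."""
    if n_items < min_items:
        return False        # never stop before the minimum length
    if n_items >= max_items:
        return True         # hard budget reached
    return standard_error(total_info) <= min_error
```

For example, total information of 20 gives SEM ≈ 0.22, which satisfies a 0.3 precision target, so the test stops once the minimum length is met.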

Using Different LLM Providers

```python
# OpenAI
llm_config = LLMConfig(
    provider="openai",
    model="gpt-4o",
    api_key="sk-..."
)

# Claude (Anthropic)
llm_config = LLMConfig(
    provider="claude",
    model="claude-3-5-sonnet-20241022",
    api_key="sk-ant-..."
)

# Gemini (Google)
llm_config = LLMConfig(
    provider="gemini",
    model="gemini-pro",
    api_key="..."
)

# LMStudio (Local)
llm_config = LLMConfig(
    provider="lmstudio",
    base_url="http://localhost:1234/v1"
)

# Ollama (Local)
llm_config = LLMConfig(
    provider="ollama",
    model="llama2",
    base_url="http://localhost:11434"
)
```

Item Bank Format

CSV Format

```csv
Question,Key,a,b,c,d
"What is 2+2?",A,1.0,-1.0,0.0,1.0
"Capital of France?",B,1.2,0.0,0.0,1.0
```

Where:

  • Question: The test question text
  • Key: Correct answer (A, B, C, D, E)
  • a: Discrimination parameter (>0, typically 0.5-2.5)
  • b: Difficulty parameter (typically -3 to 3)
  • c: Guessing parameter (0-0.5, default 0.25 for 4 options)
  • d: Upper asymptote (0.9-1.0, default 1.0)
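Together these parameters define the four-parameter logistic (4PL) response function. As a sanity check, the first sample row above gives an average-ability model (θ = 0) roughly a 73% chance of a correct response:

```python
import math

def p_4pl(theta, a, b, c, d):
    """4PL model: P(correct) = c + (d - c) / (1 + exp(-a * (theta - b)))."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# First sample row: a=1.0, b=-1.0, c=0.0, d=1.0, evaluated at theta = 0
print(round(p_4pl(0.0, 1.0, -1.0, 0.0, 1.0), 3))  # 0.731
```

Setting c = 0.25 (the default guessing floor for 4 options) raises the probability floor accordingly, since even a very low-θ model guesses correctly a quarter of the time.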

JSON Format

```json
{
  "items": [
    {
      "question": "What is 2+2?",
      "answer": "A",
      "a": 1.0,
      "b": -1.0,
      "c": 0.0,
      "d": 1.0
    }
  ]
}
```

Answer Extraction

BenchmarkCAT includes a multi-strategy answer extractor:

```python
from benchmarkcat.core.answer_extraction import AnswerExtractor, create_extractor

# Create extractor
extractor = create_extractor(strategy="multi_stage", confidence_threshold=0.7)

# Extract answer from LLM response
result = extractor.extract('{"Answer": "B"}')
print(f"Answer: {result.answer}, Confidence: {result.confidence}")
```

Extraction strategies (tried in order):

  1. JSON Exact - Parse valid JSON
  2. JSON Fuzzy - Handle malformed JSON
  3. Regex Patterns - Match common patterns
  4. Letter Extraction - Standalone letters
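A simplified sketch of such a fallback chain is below. It is illustrative only, not the library's extractor, and the confidence values are arbitrary placeholders; each stage handles progressively messier LLM output:

```python
import json
import re

def extract_answer(text):
    """Try extraction strategies in order; return (answer, confidence)."""
    # 1. JSON exact: parse well-formed JSON with an "Answer" field
    try:
        obj = json.loads(text)
        ans = str(obj.get("Answer", obj.get("answer", ""))).strip().upper()
        if ans in {"A", "B", "C", "D", "E"}:
            return ans, 1.0
    except (ValueError, AttributeError):
        pass
    # 2. JSON fuzzy: tolerate single quotes, stray text around the field
    m = re.search(r'["\']?answer["\']?\s*[:=]\s*["\']?([A-Ea-e])', text,
                  re.IGNORECASE)
    if m:
        return m.group(1).upper(), 0.8
    # 3. Regex patterns: "The answer is B", etc.
    m = re.search(r'answer\s+is\s+\(?([A-Ea-e])\)?', text, re.IGNORECASE)
    if m:
        return m.group(1).upper(), 0.6
    # 4. Letter extraction: a standalone option letter
    m = re.search(r'\b([A-E])\b', text)
    if m:
        return m.group(1), 0.4
    return None, 0.0
```

Ordering matters: the strict parser runs first so that clean structured output earns the highest confidence, and the permissive letter match acts only as a last resort.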

Visualization

```python
from benchmarkcat import visualize_result

# Generate combined visualization
fig = visualize_result(result, plot_type="combined", save_path="results.png")

# Or individual plots
fig = visualize_result(result, plot_type="theta")     # θ progression
fig = visualize_result(result, plot_type="sem")       # SEM over time
fig = visualize_result(result, plot_type="responses") # Response pattern
```

API Reference

CATEngine

```python
engine = CATEngine(cat_config, llm_config, item_bank)

# Run single test
result = await engine.run_test()

# Run batch tests
results = await engine.run_tests_batch(n_tests=10)

# Create session for manual control
session = engine.create_session()
```

CATResult

```python
result.final_theta         # Final ability estimate
result.final_sem           # Final standard error
result.ci_lower            # 95% CI lower bound
result.ci_upper            # 95% CI upper bound
result.num_items           # Items administered
result.accuracy            # Response accuracy
result.test_duration       # Test duration (seconds)
result.theta_estimates     # History of theta estimates
result.sem_history         # History of SEM values
result.administered_items  # Indices of administered items
result.responses           # Binary responses (0/1)
result.detailed_log        # Step-by-step log
result.stop_reason         # Why test stopped
```
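The confidence interval follows directly from the final θ and SEM; a 95% interval is θ ± 1.96 × SEM. A one-liner reproducing it (the values below are made-up example numbers, not real output):

```python
def confidence_interval(theta, sem, z=1.96):
    """CI around the ability estimate: theta +/- z * SEM (z = 1.96 for 95%)."""
    return theta - z * sem, theta + z * sem

lo, hi = confidence_interval(0.42, 0.30)
print(f"[{lo:.3f}, {hi:.3f}]")  # [-0.168, 1.008]
```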

Development

Setup

```bash
git clone https://github.com/zjiang4/LLM-CAT.git
cd LLM-CAT
pip install -e ".[dev]"
```

Run Tests

```bash
pytest tests/ -v
```

Type Checking

```bash
mypy src/benchmarkcat
```

Project Structure

```text
LLM-CAT/
├── src/benchmarkcat/       # Core library
│   ├── config/             # Configuration (CAT, LLM)
│   ├── core/               # CAT engine, item bank, results
│   ├── llm/                # LLM providers
│   ├── utils/              # IRT utilities
│   └── visualization.py    # Plotting
├── ui/                     # Streamlit Web UI
│   ├── app.py              # Main application
│   ├── pages/              # 6 UI pages
│   ├── core/               # Session, executor, batch runner
│   └── data/               # Sample data
├── examples/               # Usage examples
├── tests/                  # Test suite
└── requirements*.txt       # Dependencies
```

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use BenchmarkCAT in your research, please cite:

```bibtex
@software{benchmarkcat2024,
  title = {BenchmarkCAT: Computerized Adaptive Testing for LLM Benchmarks},
  author = {Jiang, Zhehan},
  year = {2024},
  url = {https://github.com/zjiang4/LLM-CAT}
}
```

Acknowledgments


Made with ❤️ by zjiang4

About

Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking
