# Computerized Adaptive Testing for LLM Benchmarks

A comprehensive framework for evaluating Large Language Models using Item Response Theory (IRT) and Computerized Adaptive Testing (CAT).
## Overview

BenchmarkCAT applies Item Response Theory (IRT) and Computerized Adaptive Testing (CAT) to LLM evaluation, enabling:

- **Efficient Testing**: Adaptive item selection reduces test length while maintaining measurement precision
- **Precise Ability Estimation**: IRT-based ability estimates (θ) with confidence intervals
- **Fair Comparison**: Compare different LLMs on the same ability scale
- **Item-Level Analysis**: Understand which items discriminate well between models
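The item parameters `a`, `b`, `c`, `d` used throughout this README come from the four-parameter logistic (4PL) IRT model. As a point of reference, here is a minimal plain-Python sketch of that model — an illustration, not part of the library's API:

```python
import math

def prob_correct(theta: float, a: float, b: float, c: float, d: float = 1.0) -> float:
    """4PL IRT model: probability that an examinee with ability theta answers
    correctly an item with discrimination a, difficulty b, guessing floor c,
    and upper asymptote d."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# An average-ability model (theta = 0) on an easy item (b = -1)
# with a 4-option guessing floor of 0.25:
p = prob_correct(0.0, a=1.0, b=-1.0, c=0.25)  # ≈ 0.798
```

Higher θ always means a higher chance of a correct response; `c` keeps the floor above chance level, and `d` caps the ceiling.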
## Features

| Feature | Description |
|---|---|
| 🔧 Full catsim Integration | All 14 selectors, 7 estimators, and 4 stopping criteria |
| 🤖 Multi-LLM Support | OpenAI, Claude, Gemini, LMStudio, Ollama |
| 🎯 Answer Extraction | Multi-strategy extraction with confidence scoring |
| 🌐 Streamlit Web UI | Interactive 6-page dashboard for testing |
| ⚡ Async API | Efficient concurrent testing |
| 📊 Visualization | Built-in plotting with Plotly |
| 🔄 Batch Testing | Parallel execution with aggregate analysis |
| 📁 Multiple Formats | CSV, JSON, YAML configuration support |
## Installation

```bash
# Core library
pip install -r requirements.txt

# With Web UI support
pip install -r requirements.txt
pip install -r requirements-ui.txt
```

Or install from source:

```bash
git clone https://github.com/zjiang4/LLM-CAT.git
cd LLM-CAT
pip install -e .
```

## Quick Start

```python
import asyncio

from benchmarkcat import CATEngine, ItemBank, CATConfig, LLMConfig

async def main():
    # Load item bank
    item_bank = ItemBank.from_csv("item_bank.csv", encoding="utf-8")

    # Configure CAT
    cat_config = CATConfig(
        selector={"type": "MaxInfoSelector"},
        estimator={"method": "bounded"},
        stopper={"type": "MaxItemStopper", "max_items": 20}
    )

    # Configure LLM
    llm_config = LLMConfig(
        provider="openai",
        model="gpt-4o-mini",
        api_key="your-api-key"  # or use OPENAI_API_KEY env var
    )

    # Run test
    engine = CATEngine(cat_config, llm_config, item_bank)
    result = await engine.run_test()
    print(result.summary())

asyncio.run(main())
```

## Web UI

Launch the interactive Streamlit dashboard:

```bash
streamlit run ui/app.py
```

The UI provides 6 pages:
- LLM Connection - Configure and test LLM providers
- Item Pool - Upload and validate item banks
- CAT Configuration - Select selectors, estimators, stoppers
- Run Test - Execute tests with real-time progress
- Results - Interactive visualizations and export
- Batch Testing - Parallel multi-test execution
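Batch testing relies on asyncio concurrency: each test is a coroutine and all of them progress in parallel. A self-contained sketch of that pattern, using a stub coroutine in place of the real engine (the `run_one_test` name and fixed θ are illustrative, not the library's API):

```python
import asyncio

async def run_one_test(model_name: str) -> dict:
    # Stand-in for a single CAT run; the real engine awaits an LLM call per item
    await asyncio.sleep(0.01)
    return {"model": model_name, "theta": 0.0}  # placeholder ability estimate

async def run_batch(model_names: list[str]) -> list[dict]:
    # asyncio.gather runs all tests concurrently and preserves input order
    return await asyncio.gather(*(run_one_test(m) for m in model_names))

results = asyncio.run(run_batch(["model-a", "model-b", "model-c"]))
```

Because the per-item latency is dominated by the LLM round-trip, concurrent tests finish in roughly the time of the slowest single test rather than the sum of all of them.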
## Selectors

| Selector | Description | Key Parameters |
|---|---|---|
| `MaxInfoSelector` | Maximum information at current θ | `r_max` |
| `RandomSelector` | Random selection | - |
| `LinearSelector` | Predefined order | `indexes` |
| `RandomesqueSelector` | Random from top-n | `bin_size` |
| `IntervalInfoSelector` | Maximize interval info | `interval` |
| `UrrySelector` | Urry's method | - |
| `The54321Selector` | 5-4-3-2-1 method | - |
| `ClusterSelector` | Cluster-based | `clusters`, `method` |
| `AStratSelector` | α-stratified | `test_size` |
| `AStratBBlockSelector` | α-stratified + b-blocking | `test_size` |
| `MaxInfoStratSelector` | Max info stratification | `test_size` |
| `MaxInfoBBlockSelector` | MIS + b-blocking | `test_size` |
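To make the most common strategy concrete, here is a stand-alone sketch of the idea behind maximum-information selection — the 3PL Fisher information function and an argmax over items not yet administered. This is an illustration of the principle; the actual `MaxInfoSelector` implementation lives in catsim:

```python
import math

def p3pl(theta, a, b, c):
    # 3PL probability of a correct response
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def information(theta, a, b, c):
    # Fisher information of a 3PL item at ability theta
    p = p3pl(theta, a, b, c)
    q = 1 - p
    return a * a * (q / p) * ((p - c) / (1 - c)) ** 2

def select_max_info(theta, items, administered):
    # Pick the not-yet-administered item most informative at the current theta
    best, best_info = None, -1.0
    for idx, (a, b, c) in enumerate(items):
        if idx in administered:
            continue
        info = information(theta, a, b, c)
        if info > best_info:
            best, best_info = idx, info
    return best

# Three hypothetical items as (a, b, c); item 0 was already administered
items = [(1.0, -2.0, 0.25), (1.5, 0.0, 0.25), (0.8, 2.0, 0.25)]
next_item = select_max_info(0.0, items, administered={0})  # item 1: b matches theta
```

Items whose difficulty sits near the current θ estimate (and with high discrimination) carry the most information, which is why adaptive tests converge faster than fixed-form tests.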
## Estimators

| Method | Description | Speed | Accuracy |
|---|---|---|---|
| `bounded` | scipy bounded optimization | ★★★★★ | ★★★★★ |
| `brent` | Brent's method | ★★★★★ | ★★★★★ |
| `golden` | Golden-section search | ★★★★☆ | ★★★★☆ |
| `golden2` | Improved golden-section search | ★★★★☆ | ★★★★☆ |
| `ternary` | Ternary search | ★★★☆☆ | ★★★★☆ |
| `dichotomous` | Binary search | ★★★☆☆ | ★★★★☆ |
| `fibonacci` | Fibonacci search | ★★★★☆ | ★★★★☆ |
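All of these methods search a bounded θ interval for the value that maximizes the likelihood of the observed response pattern. A self-contained sketch using ternary search on a 2PL log-likelihood (assumed unimodal here, which holds for the 2PL; the library itself delegates to catsim/scipy optimizers):

```python
import math

def log_likelihood(theta, responses):
    # responses: list of (correct, a, b, c) tuples for administered items
    ll = 0.0
    for correct, a, b, c in responses:
        p = c + (1 - c) / (1 + math.exp(-a * (theta - b)))
        ll += math.log(p) if correct else math.log(1 - p)
    return ll

def estimate_theta(responses, lo=-4.0, hi=4.0, tol=1e-6):
    # Ternary search: shrink the bracket toward the likelihood maximum
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, responses) < log_likelihood(m2, responses):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Correct on an easy and a medium item, wrong on a hard one (c = 0, i.e. 2PL)
responses = [(1, 1.0, -1.0, 0.0), (1, 1.2, 0.0, 0.0), (0, 1.0, 1.5, 0.0)]
theta_hat = estimate_theta(responses)
```

The bounded interval matters: with very short tests or all-correct patterns the unconstrained MLE diverges, which is why `bounded` is the recommended default.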
## Stopping Criteria

| Stopper | Description | Key Parameters |
|---|---|---|
| `MaxItemStopper` | Fixed test length | `max_items` |
| `MinErrorStopper` | Target precision | `min_error` |
| `TestLengthStopper` | Min/max bounds | `min_items`, `max_items` |
| `ConfidenceIntervalStopper` | CI-based stopping | `confidence`, `interval_bounds` |
## LLM Providers

```python
# OpenAI
llm_config = LLMConfig(
    provider="openai",
    model="gpt-4o",
    api_key="sk-..."
)

# Claude (Anthropic)
llm_config = LLMConfig(
    provider="claude",
    model="claude-3-5-sonnet-20241022",
    api_key="sk-ant-..."
)

# Gemini (Google)
llm_config = LLMConfig(
    provider="gemini",
    model="gemini-pro",
    api_key="..."
)

# LMStudio (Local)
llm_config = LLMConfig(
    provider="lmstudio",
    base_url="http://localhost:1234/v1"
)

# Ollama (Local)
llm_config = LLMConfig(
    provider="ollama",
    model="llama2",
    base_url="http://localhost:11434"
)
```

## Item Bank Format

### CSV

```csv
Question,Key,a,b,c,d
"What is 2+2?",A,1.0,-1.0,0.0,1.0
"Capital of France?",B,1.2,0.0,0.0,1.0
```

Where:

- `Question`: The test question text
- `Key`: Correct answer (A, B, C, D, E)
- `a`: Discrimination parameter (>0, typically 0.5-2.5)
- `b`: Difficulty parameter (typically -3 to 3)
- `c`: Guessing parameter (0-0.5, default 0.25 for 4 options)
- `d`: Upper asymptote (0.9-1.0, default 1.0)
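The field rules above imply some parsing and range-checking. A stand-alone sketch of that validation using only the standard library — independent of the library's actual `ItemBank.from_csv` implementation, with an inline sample in place of a file:

```python
import csv
import io

SAMPLE = '''Question,Key,a,b,c,d
"What is 2+2?",A,1.0,-1.0,0.0,1.0
"Capital of France?",B,1.2,0.0,0.0,1.0
'''

def load_items(text):
    # Parse rows and enforce the parameter ranges documented above
    items = []
    for row in csv.DictReader(io.StringIO(text)):
        a, b = float(row["a"]), float(row["b"])
        c, d = float(row["c"]), float(row["d"])
        assert a > 0, "discrimination a must be positive"
        assert 0.0 <= c <= 0.5, "guessing parameter c out of range"
        assert 0.9 <= d <= 1.0, "upper asymptote d out of range"
        assert row["Key"] in {"A", "B", "C", "D", "E"}, "invalid answer key"
        items.append({"question": row["Question"], "key": row["Key"],
                      "a": a, "b": b, "c": c, "d": d})
    return items

items = load_items(SAMPLE)
```

Validating the bank up front is cheap insurance: a single out-of-range parameter silently distorts every information and likelihood computation downstream.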
### JSON

```json
{
  "items": [
    {
      "question": "What is 2+2?",
      "answer": "A",
      "a": 1.0,
      "b": -1.0,
      "c": 0.0,
      "d": 1.0
    }
  ]
}
```

## Answer Extraction

BenchmarkCAT includes a multi-strategy answer extractor:
```python
from benchmarkcat.core.answer_extraction import AnswerExtractor, create_extractor

# Create extractor
extractor = create_extractor(strategy="multi_stage", confidence_threshold=0.7)

# Extract answer from LLM response
result = extractor.extract('{"Answer": "B"}')
print(f"Answer: {result.answer}, Confidence: {result.confidence}")
```

Extraction strategies (tried in order):
- JSON Exact - Parse valid JSON
- JSON Fuzzy - Handle malformed JSON
- Regex Patterns - Match common patterns
- Letter Extraction - Standalone letters
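The four stages above can be sketched as a single fall-through function, strictest strategy first, with confidence decreasing at each stage. This is a simplified stand-in for the library's extractor (the specific regexes and confidence values here are illustrative):

```python
import json
import re

def extract_answer(text):
    """Try extraction strategies from strictest to loosest; return (answer, confidence)."""
    # 1. JSON exact: parse well-formed JSON with an "Answer" field
    try:
        data = json.loads(text)
        if isinstance(data, dict) and "Answer" in data:
            return str(data["Answer"]).strip().upper(), 1.0
    except (json.JSONDecodeError, ValueError):
        pass
    # 2. JSON fuzzy: an "Answer" field inside malformed JSON
    m = re.search(r'"Answer"\s*:\s*"?([A-Ea-e])"?', text)
    if m:
        return m.group(1).upper(), 0.9
    # 3. Regex patterns: common phrasings like "the answer is B"
    m = re.search(r'answer\s+is\s*:?\s*\(?([A-Ea-e])\)?', text, re.IGNORECASE)
    if m:
        return m.group(1).upper(), 0.7
    # 4. Letter extraction: last standalone capital letter A-E
    letters = re.findall(r'\b([A-E])\b', text)
    if letters:
        return letters[-1], 0.4
    return None, 0.0
```

Ordering matters: a response that is valid JSON should never fall through to the loose letter heuristic, and the attached confidence lets callers reject weak extractions via `confidence_threshold`.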
## Visualization

```python
from benchmarkcat import visualize_result

# Generate combined visualization
fig = visualize_result(result, plot_type="combined", save_path="results.png")

# Or individual plots
fig = visualize_result(result, plot_type="theta")      # θ progression
fig = visualize_result(result, plot_type="sem")        # SEM over time
fig = visualize_result(result, plot_type="responses")  # Response pattern
```

## CAT Engine API

```python
engine = CATEngine(cat_config, llm_config, item_bank)

# Run single test
result = await engine.run_test()

# Run batch tests
results = await engine.run_tests_batch(n_tests=10)

# Create session for manual control
session = engine.create_session()
```

## Result Object

```python
result.final_theta        # Final ability estimate
result.final_sem          # Final standard error
result.ci_lower           # 95% CI lower bound
result.ci_upper           # 95% CI upper bound
result.num_items          # Items administered
result.accuracy           # Response accuracy
result.test_duration      # Test duration (seconds)
result.theta_estimates    # History of theta estimates
result.sem_history        # History of SEM values
result.administered_items # Indices of administered items
result.responses          # Binary responses (0/1)
result.detailed_log       # Step-by-step log
result.stop_reason        # Why test stopped
```

## Development

```bash
git clone https://github.com/zjiang4/LLM-CAT.git
cd LLM-CAT
pip install -e ".[dev]"
```

Run tests:

```bash
pytest tests/ -v
```

Type checking:

```bash
mypy src/benchmarkcat
```

## Project Structure

```
LLM-CAT/
├── src/benchmarkcat/      # Core library
│   ├── config/            # Configuration (CAT, LLM)
│   ├── core/              # CAT engine, item bank, results
│   ├── llm/               # LLM providers
│   ├── utils/             # IRT utilities
│   └── visualization.py   # Plotting
├── ui/                    # Streamlit Web UI
│   ├── app.py             # Main application
│   ├── pages/             # 6 UI pages
│   ├── core/              # Session, executor, batch runner
│   └── data/              # Sample data
├── examples/              # Usage examples
├── tests/                 # Test suite
└── requirements*.txt      # Dependencies
```
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Citation

If you use BenchmarkCAT in your research, please cite:

```bibtex
@software{benchmarkcat2024,
  title = {BenchmarkCAT: Computerized Adaptive Testing for LLM Benchmarks},
  author = {Jiang, Zhehan},
  year = {2024},
  url = {https://github.com/zjiang4/LLM-CAT}
}
```

## Acknowledgements

- catsim - CAT simulation library
- pydantic - Data validation
- Streamlit - Web UI framework
- Plotly - Interactive visualizations