# Computerized Adaptive Testing for LLM Benchmarks

A comprehensive framework for evaluating Large Language Models using Item Response Theory (IRT) and Computerized Adaptive Testing (CAT).
## Overview

BenchmarkCAT applies Item Response Theory (IRT) and Computerized Adaptive Testing (CAT) to LLM evaluation, enabling:

- **Efficient Testing**: Adaptive item selection reduces test length while maintaining measurement precision
- **Precise Ability Estimation**: IRT-based ability estimates (θ) with confidence intervals
- **Fair Comparison**: Compare different LLMs on the same ability scale
- **Item-Level Analysis**: Understand which items discriminate well between models
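The item parameters `a`, `b`, `c`, `d` used throughout this README come from the four-parameter logistic (4PL) IRT model. As a point of reference, here is a minimal plain-Python sketch of that model — an illustration, not part of the library's API:

```python
import math

def prob_correct(theta: float, a: float, b: float, c: float, d: float = 1.0) -> float:
    """4PL IRT model: probability that an examinee with ability theta answers
    correctly an item with discrimination a, difficulty b, guessing floor c,
    and upper asymptote d."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# An average-ability model (theta = 0) on an easy item (b = -1)
# with a 4-option guessing floor of 0.25:
p = prob_correct(0.0, a=1.0, b=-1.0, c=0.25)  # ≈ 0.798
```

Higher θ always means a higher chance of a correct response; `c` keeps the floor above chance level, and `d` caps the ceiling.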
## Features

| Feature | Description |
|---|---|
| 🔧 Full catsim Integration | All 14 selectors, 7 estimators, and 4 stopping criteria |
| 🤖 Multi-LLM Support | OpenAI, Claude, Gemini, LMStudio, Ollama |
| 🎯 Answer Extraction | Multi-strategy extraction with confidence scoring |
| 🌐 Streamlit Web UI | Interactive 6-page dashboard for testing |
| ⚡ Async API | Efficient concurrent testing |
| 📊 Visualization | Built-in plotting with Plotly |
| 🔄 Batch Testing | Parallel execution with aggregate analysis |
| 📁 Multiple Formats | CSV, JSON, YAML configuration support |
## Installation

```bash
# Core library
pip install -r requirements.txt

# With Web UI support
pip install -r requirements.txt
pip install -r requirements-ui.txt
```

Or install from source:

```bash
git clone https://github.com/zjiang4/LLM-CAT.git
cd LLM-CAT
pip install -e .
```

## Quick Start

```python
import asyncio

from benchmarkcat import CATEngine, ItemBank, CATConfig, LLMConfig

async def main():
    # Load item bank
    item_bank = ItemBank.from_csv("item_bank.csv", encoding="utf-8")

    # Configure CAT
    cat_config = CATConfig(
        selector={"type": "MaxInfoSelector"},
        estimator={"method": "bounded"},
        stopper={"type": "MaxItemStopper", "max_items": 20}
    )

    # Configure LLM
    llm_config = LLMConfig(
        provider="openai",
        model="gpt-4o-mini",
        api_key="your-api-key"  # or use OPENAI_API_KEY env var
    )

    # Run test
    engine = CATEngine(cat_config, llm_config, item_bank)
    result = await engine.run_test()
    print(result.summary())

asyncio.run(main())
```

## Web UI

Launch the interactive Streamlit dashboard:

```bash
streamlit run ui/app.py
```

The UI provides 6 pages:
- LLM Connection - Configure and test LLM providers
- Item Pool - Upload and validate item banks
- CAT Configuration - Select selectors, estimators, stoppers
- Run Test - Execute tests with real-time progress
- Results - Interactive visualizations and export
- Batch Testing - Parallel multi-test execution
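Batch testing relies on asyncio concurrency: each test is a coroutine and all of them progress in parallel. A self-contained sketch of that pattern, using a stub coroutine in place of the real engine (the `run_one_test` name and fixed θ are illustrative, not the library's API):

```python
import asyncio

async def run_one_test(model_name: str) -> dict:
    # Stand-in for a single CAT run; the real engine awaits an LLM call per item
    await asyncio.sleep(0.01)
    return {"model": model_name, "theta": 0.0}  # placeholder ability estimate

async def run_batch(model_names: list[str]) -> list[dict]:
    # asyncio.gather runs all tests concurrently and preserves input order
    return await asyncio.gather(*(run_one_test(m) for m in model_names))

results = asyncio.run(run_batch(["model-a", "model-b", "model-c"]))
```

Because the per-item latency is dominated by the LLM round-trip, concurrent tests finish in roughly the time of the slowest single test rather than the sum of all of them.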
## Selectors

| Selector | Description | Key Parameters |
|---|---|---|
| `MaxInfoSelector` | Maximum information at current θ | `r_max` |
| `RandomSelector` | Random selection | - |
| `LinearSelector` | Predefined order | `indexes` |
| `RandomesqueSelector` | Random from top-n | `bin_size` |
| `IntervalInfoSelector` | Maximize interval info | `interval` |
| `UrrySelector` | Urry's method | - |
| `The54321Selector` | 5-4-3-2-1 method | - |
| `ClusterSelector` | Cluster-based | `clusters`, `method` |
| `AStratSelector` | α-stratified | `test_size` |
| `AStratBBlockSelector` | α-stratified + b-blocking | `test_size` |
| `MaxInfoStratSelector` | Max info stratification | `test_size` |
| `MaxInfoBBlockSelector` | MIS + b-blocking | `test_size` |
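To make the most common strategy concrete, here is a stand-alone sketch of the idea behind maximum-information selection — the 3PL Fisher information function and an argmax over items not yet administered. This is an illustration of the principle; the actual `MaxInfoSelector` implementation lives in catsim:

```python
import math

def p3pl(theta, a, b, c):
    # 3PL probability of a correct response
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def information(theta, a, b, c):
    # Fisher information of a 3PL item at ability theta
    p = p3pl(theta, a, b, c)
    q = 1 - p
    return a * a * (q / p) * ((p - c) / (1 - c)) ** 2

def select_max_info(theta, items, administered):
    # Pick the not-yet-administered item most informative at the current theta
    best, best_info = None, -1.0
    for idx, (a, b, c) in enumerate(items):
        if idx in administered:
            continue
        info = information(theta, a, b, c)
        if info > best_info:
            best, best_info = idx, info
    return best

# Three hypothetical items as (a, b, c); item 0 was already administered
items = [(1.0, -2.0, 0.25), (1.5, 0.0, 0.25), (0.8, 2.0, 0.25)]
next_item = select_max_info(0.0, items, administered={0})  # item 1: b matches theta
```

Items whose difficulty sits near the current θ estimate (and with high discrimination) carry the most information, which is why adaptive tests converge faster than fixed-form tests.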
## Estimators

| Method | Description | Speed | Accuracy |
|---|---|---|---|
| `bounded` | scipy bounded optimization | ★★★★★ | ★★★★★ |
| `brent` | Brent's method | ★★★★★ | ★★★★★ |
| `golden` | Golden-section search | ★★★★☆ | ★★★★☆ |
| `golden2` | Improved golden-section search | ★★★★☆ | ★★★★☆ |
| `ternary` | Ternary search | ★★★☆☆ | ★★★★☆ |
| `dichotomous` | Binary search | ★★★☆☆ | ★★★★☆ |
| `fibonacci` | Fibonacci search | ★★★★☆ | ★★★★☆ |
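All of these methods search a bounded θ interval for the value that maximizes the likelihood of the observed response pattern. A self-contained sketch using ternary search on a 2PL log-likelihood (assumed unimodal here, which holds for the 2PL; the library itself delegates to catsim/scipy optimizers):

```python
import math

def log_likelihood(theta, responses):
    # responses: list of (correct, a, b, c) tuples for administered items
    ll = 0.0
    for correct, a, b, c in responses:
        p = c + (1 - c) / (1 + math.exp(-a * (theta - b)))
        ll += math.log(p) if correct else math.log(1 - p)
    return ll

def estimate_theta(responses, lo=-4.0, hi=4.0, tol=1e-6):
    # Ternary search: shrink the bracket toward the likelihood maximum
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, responses) < log_likelihood(m2, responses):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Correct on an easy and a medium item, wrong on a hard one (c = 0, i.e. 2PL)
responses = [(1, 1.0, -1.0, 0.0), (1, 1.2, 0.0, 0.0), (0, 1.0, 1.5, 0.0)]
theta_hat = estimate_theta(responses)
```

The bounded interval matters: with very short tests or all-correct patterns the unconstrained MLE diverges, which is why `bounded` is the recommended default.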
## Stopping Criteria

| Stopper | Description | Key Parameters |
|---|---|---|
| `MaxItemStopper` | Fixed test length | `max_items` |
| `MinErrorStopper` | Target precision | `min_error` |
| `TestLengthStopper` | Min/max bounds | `min_items`, `max_items` |
| `ConfidenceIntervalStopper` | CI-based stopping | `confidence`, `interval_bounds` |
## LLM Providers

```python
# OpenAI
llm_config = LLMConfig(
    provider="openai",
    model="gpt-4o",
    api_key="sk-..."
)

# Claude (Anthropic)
llm_config = LLMConfig(
    provider="claude",
    model="claude-3-5-sonnet-20241022",
    api_key="sk-ant-..."
)

# Gemini (Google)
llm_config = LLMConfig(
    provider="gemini",
    model="gemini-pro",
    api_key="..."
)

# LMStudio (Local)
llm_config = LLMConfig(
    provider="lmstudio",
    base_url="http://localhost:1234/v1"
)

# Ollama (Local)
llm_config = LLMConfig(
    provider="ollama",
    model="llama2",
    base_url="http://localhost:11434"
)
```

## Item Bank Format

### CSV

```csv
Question,Key,a,b,c,d
"What is 2+2?",A,1.0,-1.0,0.0,1.0
"Capital of France?",B,1.2,0.0,0.0,1.0
```

Where:

- `Question`: The test question text
- `Key`: Correct answer (A, B, C, D, E)
- `a`: Discrimination parameter (>0, typically 0.5-2.5)
- `b`: Difficulty parameter (typically -3 to 3)
- `c`: Guessing parameter (0-0.5, default 0.25 for 4 options)
- `d`: Upper asymptote (0.9-1.0, default 1.0)
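The field rules above imply some parsing and range-checking. A stand-alone sketch of that validation using only the standard library — independent of the library's actual `ItemBank.from_csv` implementation, with an inline sample in place of a file:

```python
import csv
import io

SAMPLE = '''Question,Key,a,b,c,d
"What is 2+2?",A,1.0,-1.0,0.0,1.0
"Capital of France?",B,1.2,0.0,0.0,1.0
'''

def load_items(text):
    # Parse rows and enforce the parameter ranges documented above
    items = []
    for row in csv.DictReader(io.StringIO(text)):
        a, b = float(row["a"]), float(row["b"])
        c, d = float(row["c"]), float(row["d"])
        assert a > 0, "discrimination a must be positive"
        assert 0.0 <= c <= 0.5, "guessing parameter c out of range"
        assert 0.9 <= d <= 1.0, "upper asymptote d out of range"
        assert row["Key"] in {"A", "B", "C", "D", "E"}, "invalid answer key"
        items.append({"question": row["Question"], "key": row["Key"],
                      "a": a, "b": b, "c": c, "d": d})
    return items

items = load_items(SAMPLE)
```

Validating the bank up front is cheap insurance: a single out-of-range parameter silently distorts every information and likelihood computation downstream.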
### JSON

```json
{
  "items": [
    {
      "question": "What is 2+2?",
      "answer": "A",
      "a": 1.0,
      "b": -1.0,
      "c": 0.0,
      "d": 1.0
    }
  ]
}
```

## Answer Extraction

BenchmarkCAT includes a multi-strategy answer extractor:
```python
from benchmarkcat.core.answer_extraction import AnswerExtractor, create_extractor

# Create extractor
extractor = create_extractor(strategy="multi_stage", confidence_threshold=0.7)

# Extract answer from LLM response
result = extractor.extract('{"Answer": "B"}')
print(f"Answer: {result.answer}, Confidence: {result.confidence}")
```

Extraction strategies (tried in order):
- JSON Exact - Parse valid JSON
- JSON Fuzzy - Handle malformed JSON
- Regex Patterns - Match common patterns
- Letter Extraction - Standalone letters
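The four stages above can be sketched as a single fall-through function, strictest strategy first, with confidence decreasing at each stage. This is a simplified stand-in for the library's extractor (the specific regexes and confidence values here are illustrative):

```python
import json
import re

def extract_answer(text):
    """Try extraction strategies from strictest to loosest; return (answer, confidence)."""
    # 1. JSON exact: parse well-formed JSON with an "Answer" field
    try:
        data = json.loads(text)
        if isinstance(data, dict) and "Answer" in data:
            return str(data["Answer"]).strip().upper(), 1.0
    except (json.JSONDecodeError, ValueError):
        pass
    # 2. JSON fuzzy: an "Answer" field inside malformed JSON
    m = re.search(r'"Answer"\s*:\s*"?([A-Ea-e])"?', text)
    if m:
        return m.group(1).upper(), 0.9
    # 3. Regex patterns: common phrasings like "the answer is B"
    m = re.search(r'answer\s+is\s*:?\s*\(?([A-Ea-e])\)?', text, re.IGNORECASE)
    if m:
        return m.group(1).upper(), 0.7
    # 4. Letter extraction: last standalone capital letter A-E
    letters = re.findall(r'\b([A-E])\b', text)
    if letters:
        return letters[-1], 0.4
    return None, 0.0
```

Ordering matters: a response that is valid JSON should never fall through to the loose letter heuristic, and the attached confidence lets callers reject weak extractions via `confidence_threshold`.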
## Visualization

```python
from benchmarkcat import visualize_result

# Generate combined visualization
fig = visualize_result(result, plot_type="combined", save_path="results.png")

# Or individual plots
fig = visualize_result(result, plot_type="theta")      # θ progression
fig = visualize_result(result, plot_type="sem")        # SEM over time
fig = visualize_result(result, plot_type="responses")  # Response pattern
```

## CAT Engine API

```python
engine = CATEngine(cat_config, llm_config, item_bank)

# Run single test
result = await engine.run_test()

# Run batch tests
results = await engine.run_tests_batch(n_tests=10)

# Create session for manual control
session = engine.create_session()
```

## Result Object

```python
result.final_theta        # Final ability estimate
result.final_sem          # Final standard error
result.ci_lower           # 95% CI lower bound
result.ci_upper           # 95% CI upper bound
result.num_items          # Items administered
result.accuracy           # Response accuracy
result.test_duration      # Test duration (seconds)
result.theta_estimates    # History of theta estimates
result.sem_history        # History of SEM values
result.administered_items # Indices of administered items
result.responses          # Binary responses (0/1)
result.detailed_log       # Step-by-step log
result.stop_reason        # Why test stopped
```

## Development

```bash
git clone https://github.com/zjiang4/LLM-CAT.git
cd LLM-CAT
pip install -e ".[dev]"
```

Run tests:

```bash
pytest tests/ -v
```

Type checking:

```bash
mypy src/benchmarkcat
```

## Project Structure

```
LLM-CAT/
├── src/benchmarkcat/      # Core library
│   ├── config/            # Configuration (CAT, LLM)
│   ├── core/              # CAT engine, item bank, results
│   ├── llm/               # LLM providers
│   ├── utils/             # IRT utilities
│   └── visualization.py   # Plotting
├── ui/                    # Streamlit Web UI
│   ├── app.py             # Main application
│   ├── pages/             # 6 UI pages
│   ├── core/              # Session, executor, batch runner
│   └── data/              # Sample data
├── examples/              # Usage examples
├── tests/                 # Test suite
└── requirements*.txt      # Dependencies
```
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Citation

If you use BenchmarkCAT in your research, please cite:

```bibtex
@software{benchmarkcat2024,
  title = {BenchmarkCAT: Computerized Adaptive Testing for LLM Benchmarks},
  author = {Jiang, Zhehan},
  year = {2024},
  url = {https://github.com/zjiang4/LLM-CAT}
}
```

## Acknowledgements

- catsim - CAT simulation library
- pydantic - Data validation
- Streamlit - Web UI framework
- Plotly - Interactive visualizations