Open-source benchmarking framework for evaluating context compression on LLM workloads. Measure accuracy, latency, and cost when using ScaleDown compression versus baseline LLMs across a variety of QA, retrieval, and summarization tasks.
ScaleBench runs structured experiments comparing a baseline (full context sent directly to the LLM) against a ScaleDown mode (context compressed before being sent), reporting:
- Accuracy — Exact Match, Token F1, BERTScore, ROUGE, Set metrics, Retrieval metrics
- Latency — End-to-end latency per example, compression overhead
- Cost — Token usage and USD cost per example and in aggregate
Supported datasets:
| Dataset | Task type |
|---|---|
| HotpotQA | Multi-hop QA |
| LongBench (4 tasks) | Long-context QA |
| ETHIC | Long-document list extraction |
| MS-MARCO v2.1 | Passage retrieval + QA |
| FinanceBench | Financial document QA |
| QMSum | Query-based meeting summarization |
Supported models: OpenAI (`gpt-4o`, `gpt-4.1`, `gpt-5` variants), Gemini (`2.5-flash`, `2.5-pro`, `2.0-flash`), Claude, plus ScaleDown compression and summarization endpoints
Supported metrics: `exact_match`, `f1`, `bertscore`, `rouge1`, `rouge2`, `rougeL`, `set_exact_match`, `set_precision`, `set_recall`, `set_f1`, `retrieval_precision`, `retrieval_recall`, `retrieval_f1`
Using uv (recommended):

```bash
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
```

Or standard venv:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Copy .env.example to .env and fill in your values:

```bash
cp .env.example .env
```

Required keys depend on which experiments you run:
```env
# Baseline experiments
OPENAI_API_KEY=your_openai_key
GEMINI_API_KEY=your_gemini_key

# ScaleDown compression mode
SCALEDOWN_API_KEY=your_scaledown_key

# ScaleDown summarization mode
SCALEDOWN_SUMMARIZE_URL=https://your-endpoint/summarization/abstractive

# Optional: LangSmith tracing
LANGSMITH_API_KEY=your_langsmith_key
```

All commands use `uv run` to ensure the correct environment is used:
```bash
# Dry-run to validate config without executing
uv run python scripts/run_benchmarks.py --dry-run

# Run a single experiment by ID
uv run python scripts/run_benchmarks.py --experiment-id hotpotqa_baseline_openai

# Run all experiments with a tag
uv run python scripts/run_benchmarks.py --tag qmsum

# Run all experiments in config (serial)
uv run python scripts/run_benchmarks.py

# Run with verbose logging
uv run python scripts/run_benchmarks.py --verbose

# Run multiple experiments concurrently (one thread per experiment)
uv run python scripts/run_benchmarks.py --parallel-experiments

# Combine with tag filtering
uv run python scripts/run_benchmarks.py --tag qmsum --parallel-experiments

# Force serial execution (overrides config default)
uv run python scripts/run_benchmarks.py --serial-experiments
```

| Flag | Description |
|---|---|
| `--experiment-id <id>` | Run only the specified experiment |
| `--tag <tag>` | Run all experiments with the given tag |
| `--parallel-experiments` | Run experiments concurrently (one thread per experiment) |
| `--serial-experiments` | Run experiments one after another (default) |
| `--dry-run` | Validate config without running anything |
| `--verbose` | Enable debug logging |
| `--config <path>` | Path to config YAML (default: `config/experiments.yaml`) |
| `--mlflow` | Enable MLflow tracing (overrides config) |
| `--no-mlflow` | Disable MLflow tracing (overrides config) |
| `--mlflow-uri <uri>` | MLflow tracking URI (default: `http://localhost:5000`) |
| `--langsmith` | Enable LangSmith tracing (overrides config) |
| `--no-langsmith` | Disable LangSmith tracing (overrides config) |
| `--langsmith-project <name>` | LangSmith project name (default: `scalebench`) |
| `--upload-hf` | Upload results to HuggingFace after completion |
| `--no-upload-hf` | Disable HF upload (overrides config) |
| `--hf-token <token>` | HuggingFace token (falls back to `HF_TOKEN` env var) |
Edit `config/experiments.yaml` to define experiments. Example:
```yaml
- id: my_experiment
  model:
    name: openai:gpt-4o
    mode: scaledown  # or 'baseline'
    pricing:         # Optional: override default pricing
      input_per_1m_tokens: 2.50
      output_per_1m_tokens: 10.00
  scaledown:
    rate: auto       # or 0.5 for 50% compression
    compression_model: gemini-2.5-flash  # Optional: model for compression
  dataset:
    name: hotpotqa
    split: validation
    num_examples: 50  # omit or set to null to use the full split
    # Note: seed is only used when num_examples is smaller than the dataset size
  task:
    name: rag_qa
    metrics:
      - exact_match
      - f1
```

When using `mode: scaledown`, ScaleDown uses a separate model for context compression, specified via `scaledown.compression_model`:
Supported compression models:

- Gemini: `gemini-2.5-flash` (default), `gemini-2.5-pro`, `gemini-2.5-flash-lite`, `gemini-2.0-flash`
- OpenAI: `gpt-4o`, `gpt-4o-mini`
The compression model is separate from the downstream generation model. For example, compress with `gemini-2.5-flash` and generate with `gpt-4.1`:
```yaml
- id: hotpotqa_scaledown_gpt41
  model:
    name: openai:gpt-4.1
    mode: scaledown
  scaledown:
    rate: auto
    compression_model: gemini-2.5-flash
```

The framework includes default pricing (per 1M tokens) for common models:
- `gpt-4o`: $2.50 input / $10.00 output
- `gpt-4o-mini`: $0.15 input / $0.60 output
- `gpt-4.1`: $2.00 input / $8.00 output
- `gpt-4.1-mini`: $0.40 input / $1.60 output
- `gpt-5-mini`: $0.25 input / $2.00 output
- `gemini-2.5-flash`: $0.30 input / $2.50 output
- `gemini-2.5-flash-lite`: $0.10 input / $0.40 output
- `gemini-2.5-pro`: $1.25 input / $10.00 output
To use custom pricing, add the pricing field to your model config as shown above.
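Per-example cost is simply the token counts multiplied by these per-1M-token prices. A minimal sketch of the arithmetic (the repo's own logic lives in `src/utils/cost.py`; the function below is illustrative):

```python
def example_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_per_1m_tokens: float,
    output_per_1m_tokens: float,
) -> float:
    """Cost of a single example given per-1M-token prices."""
    return (
        input_tokens / 1_000_000 * input_per_1m_tokens
        + output_tokens / 1_000_000 * output_per_1m_tokens
    )

# 1,000 input + 100 output tokens at gpt-4o prices:
print(example_cost_usd(1_000, 100, 2.50, 10.00))  # ~0.0035
```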
Per-example JSONL files with timestamp: `{experiment_id}_{timestamp}.jsonl`
Each line contains:
{
"id": "example_id",
"model": "openai:gpt-4o",
"input_question": "What is the capital of France?",
"input_context": "France is a country in Europe...",
"pred": "Paris",
"gold": ["Paris"],
"input_tokens": 150,
"output_tokens": 5,
"latency_ms": 450.5,
"cost_usd": 0.000375,
"metrics": {"exact_match": 1.0, "f1": 1.0}
}input_context contains the compressed context in ScaleDown mode, or the full context in baseline mode. cost_usd is null if pricing is not configured.
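Because each line is a flat JSON record, raw files are easy to inspect ad hoc, for example with pandas (the `results/raw/` path is an assumption; point it at wherever your run wrote the JSONL):

```python
import pandas as pd

# Path is illustrative; use your actual output file.
df = pd.read_json(
    "results/raw/hotpotqa_baseline_openai_20251211_150028.jsonl", lines=True
)

print(df[["id", "pred", "latency_ms", "cost_usd"]].head())
print("mean input tokens:", df["input_tokens"].mean())
```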
Two files per experiment:

- CSV (`{experiment_id}_{timestamp}_summary.csv`): flat format for easy comparison across runs.
- JSON (`{experiment_id}_{timestamp}_summary.json`): nested format with full statistics:
```json
{
  "experiment_id": "hotpotqa_baseline_openai_20251211_150028",
  "model": "openai:gpt-4o",
  "n_examples": 10,
  "metrics": {
    "exact_match": {"mean": 0.4, "std": 0.52, "p50": 0.0, "p95": 1.0},
    "f1": {"mean": 0.59, "std": 0.42, "p50": 0.6, "p95": 1.0}
  },
  "latency_ms": {"mean": 1135.0, "std": 495.0, "p50": 1199.0, "p95": 1859.5},
  "tokens": {"avg_input_tokens": 1327.6, "avg_output_tokens": 8.4},
  "cost": {
    "total_usd": 0.03356,
    "mean_usd": 0.003356,
    "std_usd": 0.000124,
    "p50_usd": 0.003312,
    "p95_usd": 0.003598
  }
}
```

- `exact_match`: 1.0 if the prediction exactly matches an answer after normalization, else 0.0
- `f1`: Token-level F1 score (0.0–1.0), measuring token overlap between prediction and answer (see the sketch after this list)
- `avg_input_tokens`: Average input tokens sent to the LLM (should be ~80–90% lower for ScaleDown)
- `avg_output_tokens`: Average tokens in model responses
- `avg_latency_ms`: Average end-to-end latency (includes compression time for ScaleDown)
- `avg_cost_usd`: Average cost per example in USD
- `total_cost_usd`: Total cost for all examples in the experiment
- `compression_ratio`: For ScaleDown only — compressed/original token ratio
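A minimal sketch of the token-level F1 above (the repo's implementation lives in `src/metrics/classification.py` and may normalize text differently):

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between a prediction and one gold answer."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the capital is Paris", "Paris"))  # 0.4
```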
When comparing a baseline run against its ScaleDown counterpart, look for:
- Token reduction: ScaleDown input tokens should be 80–90% lower
- Accuracy preservation: Exact match and F1 should be similar (±5–10%)
- Latency trade-off: ScaleDown adds compression overhead but may reduce generation time
- Cost savings: ScaleDown should significantly reduce costs due to lower input token usage
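To check these numbers programmatically, you can diff two summary JSON files using the schema shown earlier (file names below are illustrative):

```python
import json

def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

base = load("results/summaries/hotpotqa_baseline_openai_20251211_150028_summary.json")
sdwn = load("results/summaries/hotpotqa_scaledown_openai_20251211_153000_summary.json")

token_reduction = 1 - sdwn["tokens"]["avg_input_tokens"] / base["tokens"]["avg_input_tokens"]
f1_delta = sdwn["metrics"]["f1"]["mean"] - base["metrics"]["f1"]["mean"]
cost_saving = 1 - sdwn["cost"]["total_usd"] / base["cost"]["total_usd"]

print(f"input-token reduction: {token_reduction:.1%}")  # expect ~80-90%
print(f"F1 delta: {f1_delta:+.3f}")                     # expect roughly within ±0.05-0.10
print(f"cost saving: {cost_saving:.1%}")
```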
LongBench is a benchmark for long-context evaluation with documents up to 40k+ tokens. ScaleBench supports 4 English QA tasks:
- narrativeqa: Question answering over long narratives (stories, novels)
- qasper: Question answering over scientific papers
- multifieldqa_en: Multi-domain question answering
- 2wikimqa: Multi-hop questions requiring information from multiple sources
```yaml
- id: longbench_narrativeqa_baseline_openai
  model:
    name: openai:gpt-4o
    mode: baseline
  dataset:
    name: longbench
    split: test
    task: narrativeqa  # One of: narrativeqa, qasper, multifieldqa_en, 2wikimqa
    num_examples: 20
    seed: 42
  task:
    name: rag_qa
    metrics:
      - f1
      - bertscore
```

Evaluate all 4 QA tasks in a single experiment by setting `task: null`:
```yaml
- id: longbench_all_baseline_openai
  model:
    name: openai:gpt-4o
    mode: baseline
  dataset:
    name: longbench
    split: test
    task: null        # Runs all 4 QA tasks automatically
    num_examples: 20  # Per task (80 total examples)
    seed: 42
  task:
    name: rag_qa
    metrics:
      - f1
      - bertscore
```

Multi-task results structure:
- Separate raw JSONL files for each task: `{exp_id}_narrativeqa_{timestamp}.jsonl`, etc.
- Summary CSV includes an overall row plus a per-task breakdown
- Summary JSON includes a nested `per_task` structure with individual metrics
BERTScore evaluates semantic similarity using contextual embeddings, providing more nuanced evaluation than token-level metrics.
Returns three scores:

- `bertscore_precision`: how much of the prediction is semantically relevant
- `bertscore_recall`: how much of the reference is covered by the prediction
- `bertscore_f1`: harmonic mean of precision and recall
Model download: On first use, BERTScore downloads `distilbert-base-uncased` (~268MB), cached for future runs.
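As a standalone illustration (not the repo's exact code, which lives in `src/metrics/generation.py`), the `bert-score` package reproduces the three scores:

```python
from bert_score import score

preds = ["Paris is the capital of France."]
refs = ["The capital of France is Paris."]

# Same checkpoint as mentioned above; downloads on first use.
P, R, F1 = score(preds, refs, model_type="distilbert-base-uncased")
print({
    "bertscore_precision": P.item(),
    "bertscore_recall": R.item(),
    "bertscore_f1": F1.item(),
})
```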
Optional MLflow integration provides experiment tracking and tracing.
- Start the MLflow server (in a separate terminal):

  ```bash
  mlflow server --host 127.0.0.1 --port 5000 --allowed-hosts "*"
  ```

  Note: MLflow 3.5.0+ validates Host headers; use `--allowed-hosts "*"` for local development.

- Run benchmarks with MLflow enabled:

  ```bash
  uv run python scripts/run_benchmarks.py --mlflow
  ```

- View results at http://127.0.0.1:5000
Enable via `config/experiments.yaml`:

```yaml
mlflow:
  enabled: true
  tracking_uri: "http://127.0.0.1:5000"
  experiment_name: "scalebench"
```

- Traces: Per-example execution with full prompts/contexts, auto-traced OpenAI and Gemini calls, ScaleDown compression spans
- Parameters: Experiment ID, dataset, model, mode, compression rate
- Metrics: Aggregated F1, exact match, latency, token counts, cost, compression ratio
- Artifacts: Raw JSONL result files, summary JSON/CSV
403 Error — MLflow security middleware (added in v3.5.0+) is blocking the request. Restart with:

```bash
mlflow server --host 127.0.0.1 --port 5000 --allowed-hosts "localhost,127.0.0.1"
```

404 Error: Experiment not found — The experiment name in your config doesn't exist yet; the code creates it automatically on the first run.
Optional LangSmith integration provides per-example tracing with full prompt visibility.
- Get a LangSmith API key at smith.langchain.com
- Add it to `.env`: `LANGSMITH_API_KEY=your_key`
- Run with LangSmith enabled:

  ```bash
  uv run python scripts/run_benchmarks.py --langsmith --langsmith-project MyProject
  ```
Each benchmark run produces a trace hierarchy:
```
experiment run
└─ example_<id>
   ├─ llm_<model> (raw model call with full prompts)
   └─ feedback scores (rouge1, f1, exact_match, etc.)
```
Per-example data captured: system/user prompts, context length, prediction, gold answer, token counts, latency, cost, metric scores.
```
scalebench/
├── config/
│ └── experiments.yaml # Experiment configurations
├── scripts/
│ └── run_benchmarks.py # CLI entrypoint
├── src/
│ ├── core/
│ │ ├── config.py # Config dataclasses and YAML parsing
│ │ └── registry.py # Component factories (models/datasets/tasks/metrics)
│ ├── dataset/
│ │ ├── base.py # Dataset ABC
│ │ ├── hotpotqa.py # HotpotQA loader
│ │ ├── longbench.py # LongBench loader (4 QA tasks)
│ │ ├── ethic.py # ETHIC loader (list extraction)
│ │ ├── msmarco.py # MS-MARCO v2.1 loader
│ │ ├── financebench.py # FinanceBench loader
│ │ └── qmsum.py # QMSum loader
│ ├── tasks/
│ │ ├── base.py # Task ABC
│ │ ├── rag_task.py # RAG QA task
│ │ ├── retrieval_task.py # Retrieval task
│ │ └── summarization_task.py # Summarization task
│ ├── models/
│ │ ├── base.py # ModelClient interface
│ │ ├── openai_client.py # OpenAI wrapper
│ │ ├── gemini_client.py # Gemini wrapper
│ │ ├── claude_client.py # Claude wrapper
│ │ ├── scaledown_client.py # ScaleDown compression wrapper
│ │ └── scaledown_summarize_client.py # ScaleDown summarization client
│ ├── metrics/
│ │ ├── classification.py # Exact match and F1
│ │ ├── generation.py # BERTScore
│ │ ├── rouge_metrics.py # ROUGE metrics
│ │ ├── set_metrics.py # Set-based metrics (list outputs)
│ │ └── retrieval_metrics.py # Retrieval metrics
│ ├── evaluation/
│ │ ├── runner.py # Experiment execution engine
│ │ └── aggregation.py # Results statistics computation
│ ├── utils/
│ │ ├── mlflow_utils.py # MLflow integration
│ │ ├── langsmith_utils.py # LangSmith integration
│ │ ├── parallel.py # Concurrent API request processing
│ │ └── cost.py # Cost calculation
│ ├── huggingface/
│ │ └── uploader.py # HuggingFace dataset upload
│ ├── hf_space/
│ │ └── app.py # Gradio leaderboard UI
│ ├── load_testing/ # Locust-based load testing
│ ├── prompts/ # Prompt templates
│ └── settings.py # Environment configuration
├── tests/ # Test suite
├── .env.example # Environment variable template
├── pyproject.toml
└── requirements.txt
```
- Create `src/dataset/my_dataset.py` implementing the `Dataset` ABC, as in the sketch below (see `hotpotqa.py` for a real example)
- Register it in `src/core/registry.py`: `DATASET_REGISTRY["my_dataset"] = MyDataset`
- Add experiments to `config/experiments.yaml`
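A hypothetical skeleton, assuming the ABC exposes a load-style method and per-example fields like those in the raw-results schema (mirror `hotpotqa.py` and `src/dataset/base.py` for the real interface):

```python
# src/dataset/my_dataset.py (sketch; method and field names are assumptions)
from dataclasses import dataclass

@dataclass
class Example:
    id: str
    question: str
    context: str
    answers: list[str]

class MyDataset:
    """Loader registered as DATASET_REGISTRY["my_dataset"]."""

    def __init__(self, split: str = "validation",
                 num_examples: int | None = None, seed: int = 42):
        self.split = split
        self.num_examples = num_examples  # None -> use the full split
        self.seed = seed                  # only used when subsampling

    def load(self) -> list[Example]:
        # Fetch and parse raw data here, then subsample to num_examples.
        raise NotImplementedError
```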
- Create a function in `src/metrics/` returning `{"metric_name": float}`, as in the sketch below
- Register it in `src/core/registry.py`: `METRIC_REGISTRY["my_metric"] = my_metric_fn`
- Add it to the experiment config `metrics` list
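A minimal sketch, assuming a metric receives the prediction and the gold answers and returns the name-to-score dict described above (the exact signature may differ; see `src/metrics/` for real examples):

```python
# Hypothetical metric: 1.0 if any gold answer appears verbatim in the
# normalized prediction. The (pred, gold) signature is an assumption.
def containment(pred: str, gold: list[str]) -> dict[str, float]:
    pred_norm = pred.strip().lower()
    hit = any(g.strip().lower() in pred_norm for g in gold)
    return {"containment": 1.0 if hit else 0.0}

# In src/core/registry.py:
# METRIC_REGISTRY["containment"] = containment
```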
- Create a client in `src/models/` implementing the `ModelClient` interface, as in the sketch below
- Register it in `src/core/registry.py`: `MODEL_REGISTRY["provider"] = MyClient`
- Use the format `provider:model_name` in the experiment config
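A hypothetical skeleton (the method name and return fields are assumptions; `src/models/base.py` defines the real `ModelClient` interface):

```python
# src/models/my_client.py (sketch)
class MyClient:
    def __init__(self, model_name: str):
        # "provider:model_name" in config -> model_name arrives here
        self.model_name = model_name

    def generate(self, system_prompt: str, user_prompt: str) -> dict:
        # Call the provider's API, then return what the runner records
        # per example (cf. the raw-results schema above).
        return {"text": "...", "input_tokens": 0, "output_tokens": 0}

# In src/core/registry.py:
# MODEL_REGISTRY["provider"] = MyClient
```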
The framework includes a load testing module for evaluating ScaleDown's compression API under concurrent load.
```bash
# Run with web UI
locust -f src/load_testing/locustfile.py

# Run headless
locust -f src/load_testing/locustfile.py --headless --users=50 --spawn-rate=5 --run-time=5m
```

- Realistic load testing using actual dataset examples (HotpotQA, etc.)
- Multiple compression configs (different rates and models)
- Custom metrics: compression ratios, token savings, latency
- Comprehensive outputs: Locust built-ins + custom JSON/CSV exports
Web UI: http://localhost:8089 (when running without `--headless`)
See `load_testing/README.md` for full documentation.
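For orientation, a minimal Locust user in the spirit of that module (the endpoint path and payload are hypothetical; the real request shape lives in `src/load_testing/locustfile.py`):

```python
from locust import HttpUser, task, between

class CompressionUser(HttpUser):
    wait_time = between(1, 3)  # pause 1-3s between tasks per simulated user

    @task
    def compress(self):
        # Hypothetical endpoint and payload; substitute the real ones.
        self.client.post(
            "/compress",
            json={"context": "France is a country in Europe...", "rate": "auto"},
        )
```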
Upload and visualize benchmark results on HuggingFace.
```bash
# Upload results during a benchmark run
uv run python scripts/run_benchmarks.py --upload-hf

# Upload existing results
python src/huggingface/upload_to_hf.py --all
python src/huggingface/upload_to_hf.py --file results/summaries/*hotpotqa*.json
```

```
Local Benchmarks ──▶ HF Dataset ──▶ HF Space (Gradio UI)
└── src/huggingface/  └── Private     └── src/hf_space/
    uploader.py           Storage         app.py
```
Contributions are welcome! Please open an issue or pull request. When adding a new dataset, model, or metric, follow the patterns in the existing implementations and add corresponding tests.
MIT — © 2025 ScaleDown