A scalable benchmark framework for comparing Radon language performance against Python and Go. It supports regression testing, version comparison, trend tracking, and interactive HTML reports.
- Features
- Quick Start
- Installation
- Usage
- Configuration
- Output Formats
- Adding New Scenarios
- Architecture
- Examples
- Makefile Reference
- CLI Reference

Features:
- Multi-Runtime Support - Compare Radon, Python, and Go side-by-side
- Regression Testing - Detect performance regressions across versions
- Version Comparison - Track Radon's improvement over time
- Trend Analysis - Visualize how the gap to Python closes
- Beautiful Reports - Interactive HTML reports with Chart.js visualizations
- CI Ready - Exit codes for regression detection in pipelines
- Configurable Thresholds - Customize what counts as a regression
- Multiple Profiles - Smoke, standard, and deep benchmark modes
cd benchmarks
# Run a quick smoke test (fastest)
python runner/main.py --profile smoke
# Run standard benchmarks with all runtimes
python runner/main.py --profile standard --runtimes radon python go
# Save as baseline and view HTML report
python runner/main.py --profile standard --baseline radon-v0.1.0
# Open results/latest/report.html in your browser

Or use Make:
make smoke # Quick sanity check
make standard # Development runs
make deep # Release benchmarks

Requirements:
- Python 3.10+ (for running the benchmark harness)
- Radon (the language being benchmarked)
- Go 1.20+ (optional, for Go comparisons)
pip install psutil matplotlib

cd benchmarks
python runner/main.py --profile smoke --scenarios simple_sum

# Run all scenarios with all runtimes
python runner/main.py --profile standard
# Run specific runtimes
python runner/main.py --profile standard --runtimes radon python
# Run specific scenarios
python runner/main.py --profile smoke --scenarios simple_sum fib_20
# Combine options
python runner/main.py --profile standard --runtimes radon python --scenarios simple_sum

| Profile | Warmups | Repeats | Use Case |
|---|---|---|---|
| smoke | 1 | 3 | Quick sanity check (~30s) |
| standard | 2 | 10 | Development runs (~2-5min) |
| deep | 5 | 30 | Release benchmarks (~10-15min) |
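
The warmup and repeat counts in the profiles above can be read as a simple timing loop. The following is a minimal sketch of that idea, not the harness's actual code: `time_command` is a hypothetical helper, and it assumes each run is a separate subprocess timed with a wall clock.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, warmups, repeats, timeout_s=300):
    """Run cmd `warmups` times untimed, then `repeats` times timed.

    Returns the wall-clock duration of each timed run in milliseconds.
    (Hypothetical helper; the real harness may measure differently.)
    """
    for _ in range(warmups):
        subprocess.run(cmd, capture_output=True, timeout=timeout_s, check=True)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, capture_output=True, timeout=timeout_s, check=True)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Smoke-profile settings from the table: 1 warmup, 3 repeats.
samples = time_command(
    [sys.executable, "-c", "print(sum(range(1, 1001)))"], warmups=1, repeats=3
)
print(f"median: {statistics.median(samples):.2f} ms over {len(samples)} runs")
```

More repeats give a more stable median at the cost of a longer run, which is the trade-off the three profiles expose.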
python runner/main.py --profile smoke # Fast
python runner/main.py --profile standard # Balanced
python runner/main.py --profile deep # Thorough

Save benchmark results as a named baseline for future comparison:
# Save after a benchmark run
python runner/main.py --profile standard --baseline radon-v0.1.0
# List all saved baselines
python runner/main.py --list-baselines
# Delete a baseline
python runner/main.py --delete-baseline radon-v0.0.1

Baseline files are stored in baselines/ as JSON and include:
- Runtime versions (Radon, Python, Go)
- Git commit hash
- Per-scenario timing data
- Cross-language ratios
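
As an illustration of the cross-language ratios mentioned above, the sketch below divides each runtime's median time by the median of the baseline runtime. The function name, the use of the median, and the timing values are assumptions for illustration; the actual baseline file layout is not shown here.

```python
import statistics


def ratio_to_baseline(timings_ms, baseline_runtime="python"):
    """Return each runtime's median time divided by the baseline runtime's median.

    A ratio of 4.2 means "4.2x slower than the baseline runtime".
    (Hypothetical helper; the harness may aggregate differently.)
    """
    medians = {rt: statistics.median(values) for rt, values in timings_ms.items()}
    base = medians[baseline_runtime]
    return {rt: median / base for rt, median in medians.items()}


# Made-up timings for one scenario, in milliseconds.
timings = {"radon": [42.0, 40.0, 44.0], "python": [10.0, 9.0, 11.0]}
ratios = ratio_to_baseline(timings)
print(ratios)
```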
Run new benchmarks and compare against a saved baseline:
python runner/main.py --profile standard --compare radon-v0.0.1

Output:
📊 Version Comparison: 0.0.1 → 0.1.0
==================================================
✅ Overall: 8.5% faster
Regressions: 0
Warnings: 1
Improvements: 3
Stable: 1
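
A summary like the one above can be derived from per-scenario percentage changes. This is a minimal sketch under stated assumptions: the timings are made up, `percent_change` is a hypothetical helper, and the overall figure is taken as the plain mean of the per-scenario changes, which may not be how the comparator aggregates.

```python
def percent_change(old_ms, new_ms):
    """Percentage change from old to new; negative means the new run is faster."""
    return (new_ms - old_ms) * 100.0 / old_ms


# Made-up median timings (ms) for a saved baseline and a new run.
old = {"simple_sum": 100.0, "fib_20": 200.0}
new = {"simple_sum": 90.0, "fib_20": 190.0}

# Only scenarios present in both runs can be compared.
changes = {sid: percent_change(old[sid], new[sid]) for sid in old if sid in new}
overall = sum(changes.values()) / len(changes)

direction = "faster" if overall < 0 else "slower"
print(f"Overall: {abs(overall):.1f}% {direction}")
```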
Compare two saved baselines without running new benchmarks:
python runner/main.py --diff radon-v0.0.1 radon-v0.1.0

Compare multiple versions at once:
python runner/main.py --matrix radon-v0.0.1 radon-v0.1.0 radon-v0.2.0

View performance trends across all saved baselines:
python runner/main.py --trends

Output:
📈 Performance Trends (3 versions)
============================================================
RADON vs python:
First: 6.0x slower
Latest: 4.2x slower
✅ Gap closed by 30%!
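
The "gap closed" figure in the sample output is consistent with the relative drop in the slowdown ratio between the first and latest baselines. The sketch below reproduces the 30% from the sample numbers; the formula is an assumption inferred from that output, and `gap_closed_percent` is a hypothetical name.

```python
def gap_closed_percent(first_ratio, latest_ratio):
    """Relative reduction of the slowdown ratio, in percent."""
    return (first_ratio - latest_ratio) / first_ratio * 100.0


# Ratios from the sample output: 6.0x slower at first, 4.2x slower now.
closed = gap_closed_percent(6.0, 4.2)
print(f"Gap closed by {closed:.0f}%")
```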
Export trend data:
python runner/main.py --trends --export trends.json

Check if the latest run has regressions (for CI pipelines):
# Exit code 0 = no regressions, 1 = regressions found
python runner/main.py --check-regressions --against radon-v0.0.1
# Custom threshold (default: 15%)
python runner/main.py --check-regressions --threshold 10

Example GitHub Actions workflow:

name: Performance Regression Check
on:
  push:
    tags: ['v*']
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install psutil matplotlib
      - name: Run benchmarks
        run: |
          cd benchmarks
          python runner/main.py --profile standard --baseline current
      - name: Check for regressions
        run: |
          cd benchmarks
          python runner/main.py --check-regressions --against previous-release --threshold 15
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: benchmarks/results/latest/

Edit config/profiles.json:
{
  "smoke": {
    "description": "Quick sanity check",
    "warmups": 1,
    "repeats": 3,
    "timeout_ms": 300000
  },
  "standard": {
    "description": "Default developer run",
    "warmups": 2,
    "repeats": 10,
    "timeout_ms": 300000
  },
  "deep": {
    "description": "Release-quality with more samples",
    "warmups": 5,
    "repeats": 30,
    "timeout_ms": 600000
  }
}

Edit config/scenarios.json:
{
  "scenarios": [
    {
      "id": "simple_sum",
      "family": "algorithm",
      "description": "Sum 1 to 1000",
      "expected_output": "500500",
      "comparable": true,
      "tags": ["loop", "arithmetic"]
    }
  ],
  "runtimes": ["radon", "python", "go"],
  "baseline_runtime": "python"
}

Edit config/thresholds.json:
{
  "improved_threshold": -10.0,
  "stable_threshold": 5.0,
  "warning_threshold": 15.0
}

| Change | Status | CI Action |
|---|---|---|
| ≤ -10% | ✅ Improved | Pass |
| -10% to +5% | ⚡ Stable | Pass |
| +5% to +15% | ⚠️ Warning | Pass + Alert |
| > +15% | 🔴 Regression | Fail CI |
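
The threshold table can be expressed as a small classifier. This is a sketch rather than the comparator's actual logic: the threshold keys and values come from config/thresholds.json above, while the function names and the handling of exact boundary values (treated as inclusive upper bounds) are assumptions.

```python
THRESHOLDS = {
    "improved_threshold": -10.0,
    "stable_threshold": 5.0,
    "warning_threshold": 15.0,
}


def classify(change_pct, thresholds=THRESHOLDS):
    """Map a percentage change in run time to a status (positive = slower)."""
    if change_pct <= thresholds["improved_threshold"]:
        return "improved"
    if change_pct <= thresholds["stable_threshold"]:
        return "stable"
    if change_pct <= thresholds["warning_threshold"]:
        return "warning"
    return "regression"


def ci_exit_code(changes_pct):
    """Return 1 if any scenario is a regression, else 0 (warnings still pass)."""
    return 1 if any(classify(c) == "regression" for c in changes_pct) else 0


statuses = [classify(c) for c in (-12.0, 0.0, 10.0, 20.0)]
print(statuses, ci_exit_code([-12.0, 0.0, 10.0]), ci_exit_code([20.0]))
```

Only the last band fails the build, so a warning surfaces in the report without blocking a pipeline.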
Results are saved to results/latest/:
| File | Description |
|---|---|
| results.json | Raw benchmark data (machine-readable) |
| summary.md | Markdown report (human-readable) |
| report.html | Interactive HTML report with charts |
| chart_execution_time.png | Bar chart of execution times |
| chart_performance_ratio.png | Ratio chart vs Python |
| chart_overview.png | Overview doughnut chart |

The HTML report includes:
- 🌙 Dark/Light theme toggle
- 📊 Interactive Chart.js visualizations
- 💻 Detailed system information
- 📋 Per-scenario results table
- 📈 Version comparison section (when using --compare)

The benchmark suite includes a landing page (index.html) for browsing results on GitHub Pages.
Features:
- Modern UI with Space Grotesk font
- Latest benchmark run summary with runtime versions
- History list auto-populated from results/history/index.json
- Dark/Light theme toggle
- Direct links to each run's HTML report
How History Works:
The runner/main.py script automatically:
- Saves each run to results/history/<timestamp>/
- Updates results/history/index.json with metadata
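
The two steps above could look roughly like the following. This is a minimal sketch, not the code in runner/main.py: `record_run` is a hypothetical name, the timestamp format is an assumption, and only the `entries`, `id`, and `label` keys are taken from the index layout used elsewhere in this document.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_run(history_dir, results, run_id=None):
    """Save results under history_dir/<run_id>/ and register the run in index.json."""
    history_dir = Path(history_dir)
    # Assumed timestamp format; the real harness may name folders differently.
    run_id = run_id or datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    run_dir = history_dir / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "results.json").write_text(json.dumps(results, indent=2))

    index_path = history_dir / "index.json"
    index = json.loads(index_path.read_text()) if index_path.exists() else {"entries": []}
    # Replace any existing entry with the same id so reruns do not duplicate it.
    index["entries"] = [e for e in index["entries"] if e["id"] != run_id]
    index["entries"].append(
        {"id": run_id, "label": f"{results['profile'].capitalize()} benchmark run"}
    )
    index_path.write_text(json.dumps(index, indent=2))
    return run_id


with tempfile.TemporaryDirectory() as tmp:
    rid = record_run(tmp, {"profile": "smoke", "cases": []}, run_id="example-run")
    entries = json.loads((Path(tmp) / "index.json").read_text())["entries"]

print(entries)
```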
To enable GitHub Pages:
- Go to repo Settings → Pages
- Set source to "Deploy from a branch" → main → / (root)
- Visit https://radon-project.github.io/benchmark/
Manual History Rebuild:
If history entries are missing from index.json:
python -c "
import json
from pathlib import Path
history = Path('results/history')
entries = []
for folder in sorted(history.iterdir()):
    if not folder.is_dir(): continue
    rf = folder / 'results.json'
    if rf.exists():
        data = json.loads(rf.read_text())
        entries.append({
            'id': folder.name,
            'label': f\"{data['profile'].capitalize()} benchmark run\",
            'radon_version': data['host'].get('radon_version', '-'),
            'python_version': '.'.join(data['host'].get('python_version', '').split('.')[:2]),
            'go_version': data['host'].get('go_version') or '-',
            'scenarios': len(set(c['scenario_id'] for c in data.get('cases', [])))
        })
(history / 'index.json').write_text(json.dumps({'entries': entries}, indent=2))
print(f'Indexed {len(entries)} entries')
"

Add to config/scenarios.json:
{
  "id": "my_scenario",
  "family": "algorithm",
  "description": "Description of what it tests",
  "expected_output": "expected stdout",
  "comparable": true,
  "tags": ["cpu_bound", "loop"]
}

Create matching implementations in each runtime:
fixtures/radon/my_scenario.rn
# My benchmark scenario - Radon
# Expected output: 12345
var result = 0
# ... benchmark code ...
print(result)
fixtures/python/my_scenario.py
# My benchmark scenario - Python
# Expected output: 12345
result = 0
# ... benchmark code ...
print(result)

fixtures/go/my_scenario.go
// My benchmark scenario - Go
// Expected output: 12345
package main

import "fmt"

func main() {
	result := 0
	// ... benchmark code ...
	fmt.Println(result)
}

Then run the new scenario:
python runner/main.py --profile smoke --scenarios my_scenario

benchmarks/
├── index.html              # GitHub Pages landing page
├── runner/                 # Benchmark harness
│   ├── main.py             # CLI entry point
│   ├── orchestrator.py     # Runs benchmarks across runtimes
│   ├── reporter.py         # Generates reports (JSON, MD, HTML, PNG)
│   └── comparator.py       # Version comparison engine
├── adapters/               # Runtime adapters
│   ├── base_adapter.py     # Abstract base class
│   ├── radon_adapter.py    # Radon runtime
│   ├── python_adapter.py   # Python runtime
│   └── go_adapter.py       # Go runtime
├── config/                 # Configuration
│   ├── profiles.json       # Benchmark profiles
│   ├── scenarios.json      # Scenario definitions
│   └── thresholds.json     # Regression thresholds
├── fixtures/               # Benchmark code
│   ├── radon/              # Radon implementations
│   ├── python/             # Python implementations
│   └── go/                 # Go implementations
├── baselines/              # Saved baselines for comparison
├── trends/                 # Aggregated trend data
├── results/                # Benchmark output
│   ├── latest/             # Most recent run
│   └── history/            # Historical runs
│       └── index.json      # History registry for GitHub Pages
└── Makefile                # Convenience targets

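
To illustrate the adapter layer in the tree above, here is a minimal sketch of what an abstract base class and one concrete adapter might look like. The class and method names are hypothetical and are not taken from adapters/base_adapter.py; the sketch assumes an adapter only needs to turn a fixture path into a command and return its stdout.

```python
import subprocess
import sys
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class BaseAdapter(ABC):
    """Hypothetical adapter interface: one subclass per runtime."""

    name = "base"

    @abstractmethod
    def command(self, fixture: Path) -> list[str]:
        """Return the argv that executes the fixture on this runtime."""

    def run(self, fixture: Path, timeout_s: float = 300.0) -> str:
        """Execute the fixture and return its stdout, stripped of surrounding whitespace."""
        proc = subprocess.run(
            self.command(fixture),
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
        return proc.stdout.strip()


class PythonAdapter(BaseAdapter):
    name = "python"

    def command(self, fixture: Path) -> list[str]:
        # Use the interpreter running this script so the example is self-contained.
        return [sys.executable, str(fixture)]


# Run a throwaway fixture and compare it with the expected_output of simple_sum.
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "simple_sum.py"
    fixture.write_text("print(sum(range(1, 1001)))\n")
    output = PythonAdapter().run(fixture)

matches_expected = output == "500500"
print(matches_expected)
```

Keeping the runtime-specific part down to a single `command` method means a new runtime would only need to supply its own way of launching a fixture.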
# Before release v0.1.0
python runner/main.py --profile standard --baseline radon-v0.0.1
# ... make improvements to Radon ...
# After release v0.1.0
python runner/main.py --profile standard --baseline radon-v0.1.0 --compare radon-v0.0.1
# View trends
python runner/main.py --trends

# Fast smoke test before committing
python runner/main.py --profile smoke --runtimes radon python --scenarios simple_sum fib_20

# Deep benchmark with all runtimes, save as release baseline
python runner/main.py --profile deep --runtimes radon python go --baseline radon-v1.0.0
# Generate comparison against previous release
python runner/main.py --diff radon-v0.9.0 radon-v1.0.0

# Run benchmarks
python runner/main.py --profile standard --baseline current-run
# Check for regressions (fails if >10% slower)
python runner/main.py --check-regressions --against previous-release --threshold 10
echo "Exit code: $?" # 0 = pass, 1 = regression detected

The Makefile provides convenient shortcuts for common operations.
make smoke # Quick sanity check (1 warmup, 3 repeats)
make standard # Default run (2 warmups, 10 repeats)
make deep # Release-quality (5 warmups, 30 repeats)
# Run with specific scenarios
make smoke SCENARIOS='simple_sum fib_20'
# Save result as baseline
make standard NAME=radon-v0.1.0

make radon-only # Benchmark only Radon
make python-only # Benchmark only Python
make compare-all # Benchmark all runtimes (radon, python, go)
# Runtime targets also support NAME=
make radon-only NAME=radon-only-v0.1.0

# Save a baseline
make baseline NAME=radon-v0.1.0
# List all saved baselines
make list-baselines
# Run benchmarks and compare against baseline
make compare BASELINE=radon-v0.0.1
# Diff two baselines (no new benchmark run)
make diff BASE1=radon-v0.0.1 BASE2=radon-v0.1.0
# View trends across all baselines
make trends
# CI regression check
make check-regressions BASELINE=radon-v0.0.1
make check-regressions BASELINE=radon-v0.0.1 THRESHOLD=10

make clean # Remove results/latest/*
make clean-go # Remove compiled Go binaries
make clean-baselines # Remove all saved baselines
make help # Show all available targets

| Variable | Description | Example |
|---|---|---|
| NAME | Baseline name for saving | NAME=radon-v0.1.0 |
| BASELINE | Baseline to compare against | BASELINE=radon-v0.0.1 |
| BASE1, BASE2 | Baselines for diff | BASE1=v0.0.1 BASE2=v0.1.0 |
| SCENARIOS | Scenarios to run | SCENARIOS='simple_sum fib_20' |
| THRESHOLD | Regression threshold (%) | THRESHOLD=10 |

python runner/main.py [OPTIONS]

Benchmark Execution:
  --profile, -p {smoke,standard,deep}   Benchmark profile (default: standard)
  --runtimes, -r RUNTIME [...]          Runtimes to benchmark (default: all)
  --scenarios, -s SCENARIO [...]        Scenarios to run (default: all)
  --output-dir, -o DIR                  Output directory (default: results/latest)
  --no-html                             Skip HTML report generation

Baseline Management:
  --baseline, -b NAME                   Save current run as named baseline
  --list-baselines                      List all saved baselines
  --delete-baseline NAME                Delete a saved baseline

Comparison:
  --compare, -c BASELINE                Compare against a saved baseline
  --diff BASELINE1 BASELINE2            Diff two baselines (no benchmark run)
  --matrix BASELINE [...]               Multi-version matrix comparison

Trend Analysis:
  --trends                              Show performance trends
  --export FILE                         Export trend data to JSON

CI Integration:
  --check-regressions                   Check for regressions (exit code 1 if found)
  --threshold PERCENT                   Regression threshold (default: 15)
  --against BASELINE                    Baseline to compare against

Part of the Radon Programming Language project.