A command-line tool to benchmark local (Ollama) and remote (Google Gemini) Large Language Models (LLMs). It evaluates models against a configurable set of tasks, monitors system resources, generates a detailed HTML report with performance visualizations, and supports exporting results.
- Multi-Provider Support: Benchmark models served locally via Ollama (host configurable) and remotely via the Google Gemini API.
- Flexible Task Definition: Define benchmark tasks in a simple JSON format (`benchmark_tasks.json`), organized by category. Includes tasks with varied prompts (e.g., different instructions, personas). A sketch of the task format appears after this list.
- Configuration File: Manage default settings (models, paths, API keys/URLs, weights, features, timeouts) via a `config.yaml` file. CLI arguments and environment variables override file settings.
- Diverse Evaluation Methods:
  - Keyword matching (strict 'all' or flexible 'any')
  - Weighted keyword scoring for nuanced evaluation
  - JSON and YAML structure validation and comparison
  - Regex-based information extraction with optional validation rules
  - Python code execution and testing against defined test cases
  - Classification with confidence score checking
  - Semantic similarity comparison (optional, requires `sentence-transformers`)
- Resource Monitoring (Optional):
  - Track CPU RAM usage delta for Ollama models (requires `psutil`).
  - Track NVIDIA GPU memory usage delta (GPU 0) for Ollama models (requires `pynvml`).
- Performance Metrics: Measure API response time and tokens/second (Ollama only).
- Scoring: Calculates overall accuracy, average scores for partial credit tasks, an "Ollama Performance Score", and a category-weighted "Overall Score".
- Reporting:
  - Generates a comprehensive HTML report with summary tables, performance plots (rankings for scores, accuracy, tokens/sec, resource usage, comparison by prompt stage), and detailed per-task results.
  - Optional export of summary results to CSV (`--export-summary-csv`).
  - Optional export of detailed task results to JSON (`--export-details-json`).
- Caching: Caches results to speed up subsequent runs (configurable TTL).
- Utilities: Includes a `--check-dependencies` flag to verify installation and basic functionality of optional libraries.
- Configurable: Control models, tasks, retries, paths, optional features, scoring weights, API endpoints, and more via `config.yaml`, environment variables, and command-line arguments.
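The exact task schema is defined by the repository's default `benchmark_tasks.json`; the snippet below is only a minimal sketch of the idea (a category key containing task objects with a prompt and a keyword-based evaluation). Every field name in it is an assumption for illustration, not the tool's confirmed format.

```jsonc
// Illustrative sketch only — field names are assumptions; see the bundled benchmark_tasks.json for the real schema.
{
  "nlp": [
    {
      "name": "Sentiment - Complex Complaint",
      "prompt": "Classify the sentiment of the following customer complaint: ...",
      "evaluation_method": "keyword_match",
      "keywords": ["negative"],
      "match_mode": "any"
    }
  ]
}
```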
- Clone the repository:

  ```bash
  git clone https://github.com/colonelpanik/llm-bench.git
  cd llm-bench
  ```

- Recommended: Create and Activate a Virtual Environment:

  ```bash
  python -m venv .venv
  # On Linux/macOS:
  source .venv/bin/activate
  # On Windows:
  # .\.venv\Scripts\activate
  ```

- Install Core Dependencies: The core requirements are `requests` and `PyYAML`.

  ```bash
  pip install requests PyYAML
  ```

- Install Optional Dependencies (As Needed): Install libraries for features you intend to use. See `requirements.txt` for details.
  - RAM Monitoring (`--ram-monitor enable`): `pip install psutil`
  - GPU Monitoring (`--gpu-monitor enable`): `pip install pynvml` (requires NVIDIA drivers/CUDA toolkit correctly installed)
  - Report Plots (`--visualizations enable`): `pip install matplotlib`
  - Semantic Evaluation (`--semantic-eval enable`): `pip install sentence-transformers` (downloads model files on first use)

  You can check the status of optional dependencies using:

  ```bash
  python -m benchmark_cli --check-dependencies
  ```
Settings are determined in the following order (later steps override earlier ones):
- Base Defaults: Hardcoded minimal defaults in `config.py`.
- `config.yaml`: Settings loaded from the YAML configuration file (default: `config.yaml`, path configurable via `--config-file`). This is the primary place to set your defaults.
- Environment Variables: `GEMINI_API_KEY` overrides `api.gemini_api_key` from `config.yaml`; `OLLAMA_HOST` overrides `api.ollama_host_url` from `config.yaml` (e.g., `OLLAMA_HOST=http://some-other-ip:11434`).
- Command-Line Arguments: Any arguments provided on the command line override all previous settings (e.g., `--test-model`, `--tasks-file`, `--gemini-key`, `--ram-monitor disable`).
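For example, a run like the following combines all three layers (the host address and model name are placeholders): `config.yaml` supplies the defaults, the environment variable overrides the Ollama host, and the CLI flag overrides the model list.

```bash
# config.yaml supplies the defaults (models, paths, weights, timeouts, ...).

# Environment variable: overrides api.ollama_host_url from config.yaml.
export OLLAMA_HOST="http://192.168.1.50:11434"

# CLI flag: overrides the default model list (highest precedence).
python -m benchmark_cli --test-model llama3:8b
```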
The main entry point is `benchmark_cli.py`.

Show Help:

```bash
python -m benchmark_cli --help
```

Basic Run (uses defaults from `config.yaml` or the base defaults):

```bash
python -m benchmark_cli
```

Run specific models, overriding the configured defaults, and clear the cache:

```bash
python -m benchmark_cli --test-model llama3:8b --test-model gemini-1.5-flash-latest --clear-cache -v
```

Run only the 'nlp' task category and open the report:

```bash
python -m benchmark_cli --task-set nlp --open-report
```

Run only specific tasks by name:

```bash
python -m benchmark_cli --task-name "Sentiment - Complex Complaint" --task-name "Code - Python Factorial"
```

Set the Gemini API key via the CLI (highest precedence):

```bash
python -m benchmark_cli --gemini-key "YOUR_API_KEY" --test-model gemini-1.5-pro-latest
```

(Alternatively, set the `GEMINI_API_KEY` environment variable or define the key in `config.yaml`.)

Use a custom configuration file:

```bash
python -m benchmark_cli --config-file my_settings.yaml
```

Run and export summary results to CSV:

```bash
python -m benchmark_cli --export-summary-csv
```

Check whether optional dependencies are installed and working:

```bash
python -m benchmark_cli --check-dependencies
```
- `config.yaml` (Default): Define default models, API endpoints (`ollama_host_url`, `gemini_api_key`), paths, weights, timeouts, feature toggles, etc. See the default file for structure and comments.
- `benchmark_tasks.json` (Default): Define your benchmark tasks here. Path configurable in `config.yaml` or via `--tasks-file`.
- `report_template.html` (Default): Customize the HTML report template. Path configurable in `config.yaml` or via `--template-file`.
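The shipped `config.yaml` is the authoritative reference for key names; the snippet below is only a minimal sketch. The two keys under `api` are documented above, while the commented-out sections merely illustrate the kinds of settings mentioned (models, paths, weights, timeouts, feature toggles) and are assumptions, not confirmed key names.

```yaml
# Minimal sketch only — see the default config.yaml in the repository for the real structure.
api:
  ollama_host_url: http://localhost:11434   # documented key; OLLAMA_HOST overrides it
  gemini_api_key: ""                        # documented key; GEMINI_API_KEY overrides it

# Placeholders for the other setting groups described above (names are assumptions):
# models:
#   - llama3:8b
#   - gemini-1.5-flash-latest
# paths: {}
# weights: {}
# timeouts: {}
# features: {}
```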
- HTML Report (`benchmark_report/report.html`): Detailed report. Path configurable.
- Plots (`benchmark_report/images/*.png`): PNG images embedded in the report. Path configurable.
- Cache Files (`benchmark_cache/cache_*.json`): Stored results. Path configurable.
- Export Files (Optional):
  - `benchmark_report/summary_*.csv`: Summary CSV file if `--export-summary-csv` is used.
  - `benchmark_report/details_*.json`: Detailed JSON file if `--export-details-json` is used.
- Console Output: Progress, summaries, warnings, errors. Use `-v` for more detail.
This project uses GitHub Actions for automated testing. Unit tests run automatically on pushes and pull requests to the `main` branch against multiple Python versions.
The tests mock external services (Ollama, Gemini) and do not require live instances or API keys to run in the CI environment.
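A workflow along these lines would produce that behavior; the file name, Python version matrix, and test command below are illustrative assumptions rather than the repository's actual workflow.

```yaml
# .github/workflows/tests.yml — illustrative sketch, not the project's actual workflow file.
name: tests
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]   # assumed version matrix
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -r requirements.txt
      - run: python -m unittest discover           # assumed test runner
```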
A `Dockerfile` is provided for running the benchmark tool in a containerized environment with all dependencies included.
1. Build the Docker Image:
From the project root directory:
```bash
docker build -t llm-bench .
```
2. Run the Benchmark:
Running the benchmark requires connecting the container to your running Ollama instance. The method depends on your operating system.
- On Linux: Use `--network=host` to share the host's network stack. Ollama running on `localhost:11434` on the host will be accessible via the same address inside the container. Mount volumes for configuration, tasks, reports, and cache.

  ```bash
  docker run --rm -it --network=host \
    -v ./config.yaml:/app/config.yaml \
    -v ./benchmark_tasks.json:/app/benchmark_tasks.json \
    -v ./benchmark_report:/app/benchmark_report \
    -v ./benchmark_cache:/app/benchmark_cache \
    llm-bench \
    --test-model llama3:8b --open-report
  ```

  (Note: `--open-report` might not work reliably from within Docker unless you have a browser configured.)
- On macOS or Windows (Docker Desktop): `--network=host` is not typically supported. Instead, Docker provides a special DNS name, `host.docker.internal`, which resolves to the host machine. You need to tell `llm-bench` to use this address for Ollama.

  - Option A (Recommended): Using an Environment Variable: Set the `OLLAMA_HOST` environment variable when running the container.

    ```bash
    docker run --rm -it \
      -v ./config.yaml:/app/config.yaml \
      -v ./benchmark_tasks.json:/app/benchmark_tasks.json \
      -v ./benchmark_report:/app/benchmark_report \
      -v ./benchmark_cache:/app/benchmark_cache \
      -e OLLAMA_HOST="http://host.docker.internal:11434" \
      llm-bench \
      --test-model llama3:8b
    ```
  - Option B: Modifying `config.yaml`: Add or modify the `api.ollama_host_url` setting in your `config.yaml` (which you mount into the container) to point to `http://host.docker.internal:11434`.

    Example `config.yaml` snippet:

    ```yaml
    api:
      # ... other keys ...
      ollama_host_url: http://host.docker.internal:11434
    ```

    Then run without the `-e OLLAMA_HOST` flag:

    ```bash
    docker run --rm -it \
      -v ./config.yaml:/app/config.yaml \
      # ... other volumes ...
      llm-bench \
      --test-model llama3:8b
    ```
- Running Specific Commands: You can pass any `llm-bench` command-line arguments after the image name:

  ```bash
  # Check dependencies inside the container
  docker run --rm -it llm-bench --check-dependencies

  # Run specific models with verbose output
  docker run --rm -it --network=host -v $(pwd):/app llm-bench --test-model mistral:7b -v
  ```

  (Adjust `--network=host` or add `-e OLLAMA_HOST` based on your OS for commands requiring Ollama.)
MIT License.
Contributions welcome! Please open an issue or submit a pull request.