Goose Benchmark Scripts

This directory contains scripts for running and analyzing Goose benchmarks.

run-benchmarks.sh

This script runs Goose benchmarks across multiple provider:model pairs and analyzes the results.

Prerequisites

  • Goose CLI must be built or installed
  • jq command-line tool for JSON processing (optional, but recommended for result analysis)
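
A quick pre-flight check might look like the following. This is a sketch that assumes a standard Cargo-based build of the Goose CLI, which the list above does not spell out:

# Build the release binary (assumes a standard Cargo workspace)
cargo build --release

# Warn if jq is missing; the benchmarks still run, but analysis is limited
command -v jq >/dev/null 2>&1 || echo "warning: jq not found; install it for result analysis" >&2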

Usage

./scripts/run-benchmarks.sh [options]

Options

  • -p, --provider-models: Comma-separated list of provider:model pairs (e.g., 'openai:gpt-4o,anthropic:claude-3-5-sonnet')
  • -s, --suites: Comma-separated list of benchmark suites to run (e.g., 'core,small_models')
  • -o, --output-dir: Directory to store benchmark results (default: './benchmark-results')
  • -d, --debug: Use debug build instead of release build
  • -h, --help: Show help message

Examples

# Run with release build (default)
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o,anthropic:claude-3-5-sonnet' --suites 'core,small_models'

# Run with debug build
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core' --debug

How It Works

The script (see the sketch after this list):

  1. Parses the provider:model pairs and benchmark suites
  2. Determines whether to use the debug or release binary
  3. For each provider:model pair:
    • Sets the GOOSE_PROVIDER and GOOSE_MODEL environment variables
    • Runs the benchmark with the specified suites
    • Analyzes the results for failures
  4. Generates a summary of all benchmark runs
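
A hypothetical condensed sketch of the per-pair loop is shown below. The variable names (BINARY, OUTPUT_DIR, SUITES, FAILED) and the exact goose benchmark invocation are assumptions for illustration, not the script's literal source:

# Hypothetical outline of steps 2-4; check run-benchmarks.sh for the real logic
BINARY="./target/release/goose"   # assumed path to the built CLI
OUTPUT_DIR="./benchmark-results"
SUITES="core,small_models"
FAILED=0

for pair in openai:gpt-4o anthropic:claude-3-5-sonnet; do
  provider="${pair%%:*}"   # text before the first ':'
  model="${pair#*:}"       # text after the first ':'
  export GOOSE_PROVIDER="$provider"
  export GOOSE_MODEL="$model"

  result="$OUTPUT_DIR/${provider}-${model}.json"
  "$BINARY" bench --suites "$SUITES" > "$result"   # assumed invocation; see the script for the real flags

  # Analyze the run; a non-zero exit marks the overall run as failed
  ./scripts/parse-benchmark-results.sh "$result" \
    > "$OUTPUT_DIR/${provider}-${model}-analysis.txt" || FAILED=1
done

exit "$FAILED"   # 0 if every pair passed, 1 otherwise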

Output

The script creates the following files in the output directory:

  • summary.md: A summary of all benchmark results
  • {provider}-{model}.json: Raw JSON output from each benchmark run
  • {provider}-{model}-analysis.txt: Analysis of each benchmark run
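
With jq installed, the raw JSON files can be inspected directly. The top-level field names depend on Goose's output format, so the example below lists the keys rather than assuming a schema:

# List the top-level fields of one raw result file
jq 'keys' ./benchmark-results/openai-gpt-4o.json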

Exit Codes

  • 0: All benchmarks completed successfully
  • 1: One or more benchmarks failed
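
Because the script signals failure through its exit code, it slots directly into CI. A minimal example:

# Fail the CI job if any benchmark run reported failures
if ! ./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core'; then
  echo "benchmark failures detected; see benchmark-results/summary.md" >&2
  exit 1
fi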

parse-benchmark-results.sh

This script analyzes a single benchmark JSON result file and identifies any failures.

Usage

./scripts/parse-benchmark-results.sh path/to/benchmark-results.json

Output

The script outputs an analysis of the benchmark results to stdout, including:

  • Basic information about the benchmark run
  • Results for each evaluation in each suite
  • Summary of passed and failed metrics

Exit Codes

  • 0: All metrics passed successfully
  • 1: One or more metrics failed
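
The exit code makes it easy to scan a directory of results and flag only the failing runs, for example:

# Report which raw result files contain failing metrics
for f in ./benchmark-results/*.json; do
  ./scripts/parse-benchmark-results.sh "$f" > /dev/null || echo "failures in $f"
done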