# Aider coding benchmarks

https://github.com/Aider-AI/aider/tree/main/benchmark

https://github.com/Aider-AI/polyglot-benchmark

## 1. Setup for benchmarking

Open a new terminal and execute the commands below:

```bash
cd wordslab-benchmarks/
./build-aider-benchmarks-container.sh
docker images
```

You should get the following result:

```
/home/workspace/wordslab-benchmarks# docker images
REPOSITORY        TAG       IMAGE ID       CREATED              SIZE
aider-benchmark   latest    a3ead77fb6e4   About a minute ago   4.96GB
```

## 2. Run the benchmark

Make sure that Ollama is started with a context length of 8192 tokens:

```bash
OLLAMA_HOST=0.0.0.0 OLLAMA_CONTEXT_LENGTH=8192 OLLAMA_LOAD_TIMEOUT=-1 ollama serve
```
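Before launching the container, it can help to confirm that the server actually answers. The sketch below assumes Ollama listens on its default port 11434 and uses the `/api/tags` endpoint, which lists local models; `check_ollama` is a hypothetical helper name, not part of Ollama or Aider.

```bash
# Hypothetical helper: succeeds only if an Ollama server answers at the given base URL.
check_ollama() {
  local base_url="${1:-http://localhost:11434}"
  # --fail turns HTTP errors into a non-zero exit code; --max-time avoids hanging.
  curl --silent --fail --max-time 5 "${base_url}/api/tags" > /dev/null
}

# Usage (on the host):
#   check_ollama && echo "Ollama is reachable"
```

Inside the container, the same check can be pointed at `http://host.docker.internal:11434`.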

Open a new terminal and execute the commands below:

```bash
cd wordslab-benchmarks
./aider/benchmark/docker.sh
```

You should now be logged into the Docker container:

```
/home/workspace/wordslab-benchmarks/aider/benchmark# ./docker.sh
root@9adfdca74a91:/aider#
```

Inside the container, execute the following commands:

```bash
cd aider
pip install -e .[dev]
export AIDER_BENCHMARK_DIR="/aider/aider/tmp.benchmarks"
export OLLAMA_API_BASE="http://host.docker.internal:11434"

./benchmark/benchmark.py gemma3-4b-polyglot-run1 --model ollama_chat/gemma3:4b --edit-format whole --threads 1 --num-tests 10 --exercises-dir polyglot-benchmark
./benchmark/benchmark.py gemma3-4b-python-run1 --model ollama_chat/gemma3:4b --edit-format whole --threads 1 --num-tests 10 --exercises-dir python-benchmark
```
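To queue the same benchmark for several models, the command line can be generated in a loop. This is a dry-run sketch that only prints the commands: the model list and the run-name pattern (`<model>-<size>-python-run1`) are illustrative, and `--num-tests` is left out on the assumption that omitting it runs the whole exercise set.

```bash
# Ollama model tags to benchmark (illustrative selection).
models=(gemma3:4b gemma3:12b gemma3:27b)

cmds=()
for tag in "${models[@]}"; do
  # Derive a run name from the tag, e.g. gemma3:12b -> gemma3-12b-python-run1.
  run_name="${tag//:/-}-python-run1"
  cmds+=("./benchmark/benchmark.py ${run_name} --model ollama_chat/${tag} --edit-format whole --threads 1 --exercises-dir python-benchmark")
done

# Print the commands for review; paste them into the container to run them.
printf '%s\n' "${cmds[@]}"
```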

IMPORTANT: add `--cont` to the command to resume a stalled run.
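As an illustration, resuming the Python run from above would reuse the same arguments plus `--cont`. The sketch assumes that the bare run name is enough for the script to find the single matching results directory; it prints the command instead of running it.

```bash
# Same arguments as the original run, with --cont added.
resume_args=(
  gemma3-4b-python-run1
  --cont
  --model ollama_chat/gemma3:4b
  --edit-format whole
  --threads 1
  --num-tests 10
  --exercises-dir python-benchmark
)

# Dry run: drop the leading `echo` to execute the command inside the container.
echo ./benchmark/benchmark.py "${resume_args[@]}"
```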

Adjust the Ollama mistral-small3.1 model so that it runs on a GPU with 24 GB of VRAM:

```
ollama show --modelfile mistral-small3.1:24b > Modelfile
vi Modelfile

# Start with:      FROM mistral-small3.1:24b
# ...
# Add the line:    PARAMETER num_gpu 100

ollama create mistral-small3.1:24b-gpu24GB -f Modelfile

# eval rate:            18.08 tokens/s
ollama run --verbose mistral-small3.1:24b

# eval rate:            55.44 tokens/s
ollama run --verbose mistral-small3.1:24b-gpu24GB
```
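A non-interactive alternative to editing the dumped Modelfile in `vi` is to write a two-line Modelfile that derives from the base model. This is a sketch under the assumption that a model created with `FROM mistral-small3.1:24b` inherits the base model's template and parameters, so only `num_gpu` has to be stated. The temporary file path is illustrative, and the `ollama create` call is left commented out. The last lines compute the speed-up implied by the two eval rates quoted above.

```bash
# Write a minimal Modelfile to a temporary file (illustrative path).
modelfile="$(mktemp)"
printf '%s\n' 'FROM mistral-small3.1:24b' 'PARAMETER num_gpu 100' > "$modelfile"
cat "$modelfile"

# Uncomment on the machine running Ollama:
# ollama create mistral-small3.1:24b-gpu24GB -f "$modelfile"

# Speed-up implied by the measured eval rates (tokens/s).
speedup="$(awk 'BEGIN { printf "%.2f", 55.44 / 18.08 }')"
echo "speed-up: ${speedup}x"
```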

## 3. Collect the results

```
- dirname: 2025-05-08-13-59-14--gemma3-4b-python-run6
  test_cases: 129
  model: ollama_chat/gemma3:4b
  edit_format: whole
  commit_hash: 8956eef-dirty
  pass_rate_1: 21.7
  pass_rate_2: 24.0
  pass_num_1: 28
  pass_num_2: 31
  percent_cases_well_formed: 96.1
  error_outputs: 19
  num_malformed_responses: 19
  num_with_malformed_responses: 5
  user_asks: 49
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 0
  total_tests: 133
  command: aider --model ollama_chat/gemma3:4b
  date: 2025-05-08
  versions: 0.82.4.dev
  seconds_per_case: 20.3
  total_cost: 0.0000

- dirname: 2025-05-08-21-06-37--gemma3-12b-python-run1
  test_cases: 133
  model: ollama_chat/gemma3:12b
  edit_format: whole
  commit_hash: 8956eef-dirty
  pass_rate_1: 34.6
  pass_rate_2: 42.1
  pass_num_1: 46
  pass_num_2: 56
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 4
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  total_tests: 133
  command: aider --model ollama_chat/gemma3:12b
  date: 2025-05-08
  versions: 0.82.4.dev
  seconds_per_case: 23.9
  total_cost: 0.0000

- dirname: 2025-05-09-04-24-18--gemma3-27b-python-run1
  test_cases: 133
  model: ollama_chat/gemma3:27b
  edit_format: whole
  commit_hash: 8956eef-dirty
  pass_rate_1: 39.1
  pass_rate_2: 48.9
  pass_num_1: 52
  pass_num_2: 65
  percent_cases_well_formed: 98.5
  error_outputs: 6
  num_malformed_responses: 3
  num_with_malformed_responses: 2
  user_asks: 14
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 0
  total_tests: 133
  command: aider --model ollama_chat/gemma3:27b
  date: 2025-05-09
  versions: 0.82.4.dev
  seconds_per_case: 217.3
  total_cost: 0.0000

- dirname: 2025-05-10-06-26-52--qwen2.5-coder-7b-python-run1
  test_cases: 133
  model: ollama_chat/qwen2.5-coder:7b
  edit_format: whole
  commit_hash: 8956eef-dirty
  pass_rate_1: 44.4
  pass_rate_2: 51.1
  pass_num_1: 59
  pass_num_2: 68
  percent_cases_well_formed: 100.0
  error_outputs: 1
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 9
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 2
  total_tests: 133
  command: aider --model ollama_chat/qwen2.5-coder:7b
  date: 2025-05-10
  versions: 0.82.4.dev
  seconds_per_case: 21.7
  total_cost: 0.0000

- dirname: 2025-05-09-21-41-07--qwen2.5-coder-14b-python-run1
  test_cases: 133
  model: ollama_chat/qwen2.5-coder:14b
  edit_format: whole
  commit_hash: 8956eef-dirty
  pass_rate_1: 54.1
  pass_rate_2: 66.9
  pass_num_1: 72
  pass_num_2: 89
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 13
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  total_tests: 133
  command: aider --model ollama_chat/qwen2.5-coder:14b
  date: 2025-05-09
  versions: 0.82.4.dev
  seconds_per_case: 27.0
  total_cost: 0.0000

- dirname: 2025-05-10-06-50-30--qwen2.5-coder-32b-python-run1
  test_cases: 133
  model: ollama_chat/qwen2.5-coder:32b
  edit_format: whole
  commit_hash: 8956eef-dirty
  pass_rate_1: 56.4
  pass_rate_2: 72.2
  pass_num_1: 75
  pass_num_2: 96
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 5
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  total_tests: 133
  command: aider --model ollama_chat/qwen2.5-coder:32b
  date: 2025-05-10
  versions: 0.82.4.dev
  seconds_per_case: 50.0
  total_cost: 0.0000

costs: $0.0000/test-case, $0.00 total, $0.00 projected
```
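To compare runs at a glance, the report can be reduced to one line per model. The sketch below copies two entries from the report above (trimmed to the fields it uses) into a temporary file and recomputes the second-attempt pass rate as `pass_num_2 / test_cases`. That formula is inferred from the numbers rather than documented here: it reproduces 24.0 for the gemma3:4b run only when dividing by `test_cases` (129), not by `total_tests` (133).

```bash
# Two entries copied from the report above, trimmed to the fields used here.
results_file="$(mktemp)"
cat > "$results_file" <<'EOF'
- dirname: 2025-05-08-13-59-14--gemma3-4b-python-run6
  test_cases: 129
  model: ollama_chat/gemma3:4b
  pass_rate_2: 24.0
  pass_num_2: 31
- dirname: 2025-05-10-06-50-30--qwen2.5-coder-32b-python-run1
  test_cases: 133
  model: ollama_chat/qwen2.5-coder:32b
  pass_rate_2: 72.2
  pass_num_2: 96
EOF

# Remember test_cases and model per entry, then print the recomputed pass rate.
summary="$(awk '
  $1 == "test_cases:" { cases = $2 }
  $1 == "model:"      { model = $2 }
  $1 == "pass_num_2:" { printf "%s %.1f\n", model, 100 * $2 / cases }
' "$results_file")"
echo "$summary"
```

The recomputed values match the `pass_rate_2` fields of both entries.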