Commit 4281a0f

thehwang and cursoragent committed
Add benchmark harness for model + context comparison

scripts/benchmark_models.sh runs Ollama against the same transcript across any set of installed models and a configurable num_ctx, recording wall clock, tokens/second, and prompt/eval token counts. Designed to make the "Ollama defaults to 2048" finding reproducible: run once with NUM_CTX=2048 and once with NUM_CTX=32768 to see the difference firsthand.

The prompt template is kept in sync with SummaryService.swift:buildPrompt() (instructions embedded in prompt, no system field) so the benchmark matches production behavior, and avoids tripping Gemma 4 into a thinking-mode pattern that consumes the num_predict budget without emitting visible output.

benchmarks/synthetic-transcript.md is a 60-minute fictional Atlas Robotics all-hands transcript. All names, projects, customers, and numbers are invented. Real meeting recordings must never be committed to this directory; .gitignore enforces that the only tracked content is the synthetic fixture, the README, and the findings report.

benchmarks/findings.md documents the qualitative outputs for the default ctx=2048 vs ctx=32768 runs with both Qwen 2.5 3B and Gemma 4 E2B, including the interesting result that Gemma 4 reports the truncation back to the user.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent c211678 commit 4281a0f

5 files changed

Lines changed: 495 additions & 0 deletions

.gitignore

Lines changed: 8 additions & 0 deletions
@@ -23,6 +23,14 @@ launch-plan.html
 # Test data
 TestData/

+# Benchmark outputs: never commit anything except the synthetic transcript
+# and the published findings report.
+benchmarks/*
+!benchmarks/.gitkeep
+!benchmarks/synthetic-transcript.md
+!benchmarks/README.md
+!benchmarks/findings.md
+
 # Whisper vendor sources (built locally)
 vendor/
 Sources/CWhisper/lib/*.a

benchmarks/README.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# Benchmarks

This directory holds the **synthetic transcript** used by `scripts/benchmark_models.sh`
to measure summary quality and speed across local LLMs.

## Why a synthetic transcript

Public benchmarking with real meeting recordings is a privacy minefield:
even anonymized transcripts can leak company names, product details,
employee identities, or strategy discussions.

`synthetic-transcript.md` is a fictional Q2 all-hands meeting for a made-up
robotics company (Atlas Robotics). All names, projects, numbers, and decisions
are invented. The transcript imitates the structure and pacing of a real
60-minute corporate meeting (cross-functional updates, multiple speakers,
decisions, action items) so summary outputs are meaningful to compare.

## What gets committed

| File / Pattern | Tracked by git? |
|---|---|
| `synthetic-transcript.md` | ✅ yes (the fixture) |
| `README.md` | ✅ yes |
| `findings.md` | ✅ yes (the published findings report) |
| `.gitkeep` | ✅ yes |
| `<run-id>/` (benchmark output) | ❌ no (gitignored) |
| Any other file | ❌ no (gitignored) |

**Never** put a real meeting transcript in this directory. The
`.gitignore` is the last line of defense; the primary defense is you.

## Running

```bash
# Baseline: Ollama's default 2K context (reproduces the bug)
NUM_CTX=2048 LABEL=ctx2k bash scripts/benchmark_models.sh benchmarks/synthetic-transcript.md

# Fix: 32K context (the full window for Qwen 2.5)
NUM_CTX=32768 LABEL=ctx32k bash scripts/benchmark_models.sh benchmarks/synthetic-transcript.md

# Gemma 4 advantage: 128K context
NUM_CTX=131072 LABEL=ctx128k bash scripts/benchmark_models.sh benchmarks/synthetic-transcript.md
```
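
The script itself is not shown here, so the following is only a sketch of the kind of request body a single run might send. It assumes Ollama's `/api/generate` endpoint, which accepts `num_ctx` under `options`, and follows the commit message in embedding the instructions in `prompt` with no `system` field. The function names `json_escape` and `build_request` and the sample prompt text are hypothetical, and the escaping covers only the most common characters.

```bash
# json_escape STRING
# Minimal JSON string escaping (backslash, quote, newline, tab, CR only).
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\t'/\\t}
  s=${s//$'\r'/\\r}
  printf '%s' "$s"
}

# build_request MODEL NUM_CTX PROMPT
# Prints a non-streaming request body with an explicit context window.
build_request() {
  local model=$1 num_ctx=$2 prompt=$3
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d}}\n' \
    "$(json_escape "$model")" "$(json_escape "$prompt")" "$num_ctx"
}

# Hypothetical prompt text; the real template lives in SummaryService.swift.
build_request "qwen2.5:3b" 32768 "Summarize this meeting transcript."

# To send it to a local Ollama instance on its default port:
#   build_request ... | curl -s http://localhost:11434/api/generate -d @-
```

Without the `options.num_ctx` field, the same request would fall back to Ollama's default context size, which is the behavior the 2K baseline reproduces.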

benchmarks/findings.md

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
# Scripta + Gemma 4: benchmark findings

Run date: 2026-05-16
Hardware: Mac (see local benchmarks/*.json for full prompt_tokens, eval_duration_ns)
Transcript: synthetic-transcript.md, 19,357 chars / 3,135 words / ~5K tokens
Workflow: `scripts/benchmark_models.sh` invoked once per (model, num_ctx) pair.

## Quantitative results

| Model      | num_ctx | Wall    | tok/s | Output tokens |
|------------|---------|---------|-------|---------------|
| qwen2.5:3b | 2048    | 15.2s   | 47.9  | 59            |
| gemma4:e2b | 2048    | 106.9s¹ | 41.7  | 267           |
| qwen2.5:3b | 32768   | 25.7s   | 39.3  | 222           |
| gemma4:e2b | 32768   | 49.2s   | 27.1  | 752           |

¹ For this run, Gemma 4's time to first token includes a cold model load (~80s on
this hardware), which inflates the wall-clock figure. Subsequent warm-cached runs
take roughly half that wall-clock time.
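
As a rough illustration of where a tok/s figure like the ones above can come from: Ollama's non-streaming response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below assumes the benchmark divides one by the other; the function name `tokens_per_second` is hypothetical and the script's actual calculation is not shown in this commit view.

```bash
# tokens_per_second EVAL_COUNT EVAL_DURATION_NS
# Prints generation throughput with one decimal place, or "n/a" when the
# duration is missing or zero.
tokens_per_second() {
  local eval_count=$1 eval_duration_ns=$2
  awk -v n="$eval_count" -v ns="$eval_duration_ns" \
    'BEGIN { if (ns <= 0) { print "n/a"; exit } printf "%.1f\n", n / (ns / 1e9) }'
}

tokens_per_second 100 2000000000   # prints 50.0
```

Because this throughput excludes model load and prompt processing, it can stay high even when the wall-clock time is dominated by a cold start, as in the footnoted row.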

## Qualitative observations

### qwen2.5:3b @ 2K: the "broken Scripta" baseline

Summarizes only the Q&A section at the end of the transcript. Lists three items
as the meeting's "key points": hybrid work policy, intern conversion path,
pricing impact on pipeline. These are minor Q&A topics, not the meeting's
main content.

**Misses entirely**: Q2 ARR ($4.2M), headcount growth (47), Marcus Reyes joining
as VP Engineering, Project Lighthouse (the whole reason this meeting happened),
the 3x perception perf improvement, all five new engineer names, every tech debt
item, three new enterprise logos, Voice Control roadmap, Series B prep, the
Engineer of the Quarter award.

### gemma4:e2b @ 2K: the most interesting result

Like Qwen, it sees only the tail of the transcript. But unlike Qwen, **Gemma 4
explicitly told the user that the transcript did not look like a complete,
coherent conversation**:

> "The provided transcript seems to be a mix of several unrelated topics, making
> it difficult to extract a single, coherent summary based on the provided text
> alone. ... If you are looking for a summary of the *actual* conversation
> content, please provide the relevant transcript."

This is the model recognizing that the context it received doesn't match a
plausible meeting structure. That is reasoning behavior from a 4B-parameter
model that runs on a laptop.

It also duplicated one action item ("Prepare for the upcoming office move"
appears twice), suggesting Gemma 4 is filling output budget when content is thin.

### qwen2.5:3b @ 32K: Qwen with the bug fixed

Solid coverage of the meeting. Names the headline ARR, the new VP, the pricing
move. Lists 8 accurate action items.

Still misses some specifics that matter for a corporate summary: the three new
enterprise logos by name, Project Lighthouse specifically, Series B prep,
Engineer of the Quarter award.

### gemma4:e2b @ 32K: best result

Most comprehensive of all four. Mentions:

* $4.2M ARR exceeding plan
* Headcount expansion to 47
* **Three new logos by name**: Boeing, Amazon, FedEx
* **Project Lighthouse** with July 15 launch date
* Voice Control (Q3) and multi-robot coordination (Q4)
* **Series B prep** (strategic info from segment 5)
* Renewal rate 94%, NPS 67
* 9 accurate action items including the Lighthouse readiness review

The summary is roughly 3x longer than Qwen's and more useful as a meeting
artifact.

## What this means for Scripta

Before this work, Scripta had two compounding limits:

1. `SummaryService.swift:buildPrompt()` truncated the transcript to **3,000 chars**
   before sending it to Ollama (a leftover guard from the early prototype).
2. The Ollama request had no `num_ctx` override, so Ollama applied its default
   of **2,048 tokens** regardless of the model's actual capability.

Either limit alone would have been bad; the combination meant **every Scripta
summary used at most ~750 tokens of context** (the prompt scaffolding plus the tail
of the transcript). For a transcript the size of the synthetic fixture, 3,000 of
19,357 characters is roughly the last 15%, so a 60-minute meeting was compressed
to about its final 9 minutes.
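
The numbers above can be checked with a little integer arithmetic. This sketch assumes a rough average of 4 characters per token for English text, which is an approximation and not a measured tokenizer ratio; the helper names are hypothetical.

```bash
CHARS_PER_TOKEN=4   # assumption: rough average for English text

# estimate_tokens CHARS
estimate_tokens() { echo $(( $1 / CHARS_PER_TOKEN )); }

# percent_kept KEPT_CHARS TOTAL_CHARS
percent_kept() { echo $(( $1 * 100 / $2 )); }

estimate_tokens 3000       # prints 750   (the old 3,000-char guard)
estimate_tokens 19357      # prints 4839  (the full synthetic transcript, ~5K)
percent_kept 3000 19357    # prints 15    (share of the transcript that survived)
```

The 3,000-char guard is the tighter of the two limits: its ~750 tokens fit comfortably inside the 2,048-token default, so fixing only `num_ctx` would not have helped.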

The core of the fix is a single request option:

```swift
"num_ctx": SummaryModelManager.contextWindow(for: modelName),
```

…plus dropping the 3,000-char truncation in favor of a context-aware tail
truncation that uses `model.contextTokens - 1200` (template + output reservation).
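
To make the tail-truncation idea concrete, here is a minimal shell sketch. It is not the Swift implementation: it only takes the 1,200-token reservation from the text above and converts the remaining budget to characters with an assumed 4 characters per token. The function name `tail_truncate` is hypothetical, and `tail -c` counts bytes, so multi-byte characters at the cut point are not handled.

```bash
RESERVED_TOKENS=1200   # from the findings: template + output reservation
CHARS_PER_TOKEN=4      # assumption: rough chars-per-token estimate

# tail_truncate FILE CONTEXT_TOKENS
# Prints the last part of FILE that fits the remaining context budget.
# Returns 1 when the reservation leaves no room for transcript text.
tail_truncate() {
  local file=$1 context_tokens=$2
  local budget_tokens=$(( context_tokens - RESERVED_TOKENS ))
  if (( budget_tokens <= 0 )); then
    return 1
  fi
  tail -c "$(( budget_tokens * CHARS_PER_TOKEN ))" "$file"
}

# Example (not run here):
#   tail_truncate benchmarks/synthetic-transcript.md 32768
```

Keeping the tail rather than the head matches the behavior described in these findings, where truncated runs saw only the end of the meeting; with a 32K window the whole fixture fits and nothing is cut.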

Adding Gemma 4 was the second piece. With 128K context, Gemma 4 lets Scripta
handle multi-hour meetings without any chunking. With Qwen 2.5 (32K context),
we cover most one-hour meetings; longer meetings need chunking or Gemma.
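
A per-model lookup along these lines could supply the `num_ctx` value. This is a shell sketch of the idea only, not the body of `SummaryModelManager.contextWindow(for:)`, which is not shown. The two window sizes come from the paragraph above; the name patterns and the fallback to 2,048 for unknown models are assumptions.

```bash
# context_window MODEL_NAME
# Prints the context window, in tokens, to request for a model.
context_window() {
  case $1 in
    gemma4:*)  echo 131072 ;;  # 128K, per the findings
    qwen2.5:*) echo 32768 ;;   # 32K, per the findings
    *)         echo 2048 ;;    # assumption: fall back to Ollama's default
  esac
}

context_window gemma4:e2b   # prints 131072
context_window qwen2.5:3b   # prints 32768
```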

## Reproduce

```bash
NUM_CTX=2048 LABEL=ctx2k bash scripts/benchmark_models.sh benchmarks/synthetic-transcript.md
NUM_CTX=32768 LABEL=ctx32k bash scripts/benchmark_models.sh benchmarks/synthetic-transcript.md
```
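
For sweeping several context sizes, the commands above can be generated in a loop. The sketch below is a dry run that only prints each command line. It relies on the `NUM_CTX` and `LABEL` variables documented above and derives the label from the context size, which is an assumption about the naming convention; `benchmark_cmd` is a hypothetical helper.

```bash
# benchmark_cmd NUM_CTX
# Prints the benchmark invocation for one context size (does not run it).
benchmark_cmd() {
  local num_ctx=$1
  local label="ctx$(( num_ctx / 1024 ))k"
  printf 'NUM_CTX=%s LABEL=%s bash scripts/benchmark_models.sh benchmarks/synthetic-transcript.md\n' \
    "$num_ctx" "$label"
}

for num_ctx in 2048 32768 131072; do
  benchmark_cmd "$num_ctx"
done
```

Piping the output to `bash` would execute the runs in sequence; printing first makes it easy to review the labels before spending several minutes on inference.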
