fix(benchmark): assign unique color per model in scatter plots #231

Merged
ttlequals0 merged 2 commits into main from fix/benchmark-distinct-colors on May 16, 2026

Conversation

@ttlequals0
Owner

Summary

  • Three benchmark scatter renderers in benchmarks/llm/src/benchmark/report.py (_render_pareto, _render_precision_recall_chart, _render_token_efficiency_chart) assigned colors with cmap = plt.get_cmap("tab20") followed by cmap(i % 20). With more than 20 models in a sweep, the colors cycled, so multiple models shared the same dot color. The current report has 28 models, so the Cost-vs-F1 chart had several visibly duplicated colors.
  • Added _distinct_colors(n), which concatenates tab20 + tab20b + tab20c (60 categorical colors with good perceptual contrast) and falls back to evenly-spaced hsv beyond that, so every model gets a unique color both for sweeps of up to 60 models and for larger ones. The 60-color cutoff is derived from len(palette), not hardcoded.
  • All three renderers now build colors = _distinct_colors(len(points)) and index colors[i].
  • Regenerated the three affected SVGs (pareto.svg, precision_recall.svg, token_efficiency.svg). Other charts color by value semantics (threshold bands, heatmaps, p50/p90/p99 categories), not model identity, so they were not affected and were not regenerated to keep this diff focused.
  • Benchmark-only change. The benchmark tree is dockerignored; no runtime image impact, no version bump, no openapi change.

Test plan

  • Verified _distinct_colors(n) returns n unique RGBA tuples for n in {5, 30, 60, 80, 200} (n up to 60 is served from the categorical palette, n above 60 from hsv).
  • Regenerated report locally with benchmark report; SVGs render and the three target scatters show distinct colors per model.
  • /simplify and /code-review ran clean against the diff (no findings above the 80-confidence threshold).
  • CI green on this branch.
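
A uniqueness check of the kind listed in the test plan can be sketched as below. The helper names `duplicate_color_groups` and `old_modulo_colors` are hypothetical and do not appear in the repository; the example runs the check against the previous `cmap(i % 20)` scheme to show what it catches.

```python
from collections import defaultdict

import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs without a display
import matplotlib.pyplot as plt


def duplicate_color_groups(colors):
    """Map each repeated color to the indices that share it (hypothetical helper)."""
    seen = defaultdict(list)
    for i, color in enumerate(colors):
        seen[tuple(color)].append(i)
    return {color: idx for color, idx in seen.items() if len(idx) > 1}


def old_modulo_colors(n):
    """Reproduce the previous assignment: tab20 indexed with i % 20."""
    cmap = plt.get_cmap("tab20")
    return [cmap(i % 20) for i in range(n)]


# With 28 models, indices 20..27 wrap around onto the colors of indices 0..7.
dupes = duplicate_color_groups(old_modulo_colors(28))
print(len(dupes))               # 8 colors are each used by two models
print(sorted(dupes.values()))   # [[0, 20], [1, 21], ..., [7, 27]]
```

Running the same check on the output of a palette helper should yield an empty dict for every size in the test plan.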

Three benchmark scatter renderers (Pareto, precision/recall, token
efficiency) drew from tab20 with `i % 20`, so any run with more than 20
models silently reused colors. With 28 models in the current report, the
legend had multiple identically colored dots.

Add `_distinct_colors(n)` that concatenates tab20 + tab20b + tab20c (60
categorical colors) and falls back to evenly-spaced hsv past that. All
three renderers now call it and index `colors[i]`. The three regenerated
SVGs in this commit reflect the fix.

Benchmark-only; not shipped in the runtime image; no version bump.

Re-renders report.md plus every SVG in results/report_assets/ from the
current calls.jsonl + corpus, so the published artifacts match the rest
of the repo's snapshot pattern. The substantive change is still only the
three scatter plots (pareto, precision_recall, token_efficiency) where
each model now gets a unique color; the other SVGs and report.md only
differ in matplotlib-generated element IDs, embedded timestamps, and
PASS-row ordering for ties.
ttlequals0 merged commit 9ce089c into main on May 16, 2026
13 checks passed
ttlequals0 deleted the fix/benchmark-distinct-colors branch on May 16, 2026 02:58