fix(benchmark): assign unique color per model in scatter plots #231
Merged
Three benchmark scatter renderers (Pareto, precision/recall, token efficiency) drew from tab20 with `i % 20`, so any run with more than 20 models silently reused colors. With 28 models in the current report, the legend had several identically colored dots. Add `_distinct_colors(n)`, which concatenates tab20 + tab20b + tab20c (60 categorical colors) and falls back to evenly-spaced hsv past that. All three renderers now call it and index `colors[i]`. The three regenerated SVGs in this commit reflect the fix. Benchmark-only; not shipped in the runtime image; no version bump.
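To make the size of the collision concrete, the snippet below counts how many palette slots get reused when 28 models are indexed with `i % 20`. It is a standalone illustration, not code from `report.py`; the helper name `shared_color_slots` is hypothetical.

```python
from collections import Counter


def shared_color_slots(n_models: int, palette_size: int = 20) -> dict[int, int]:
    """Return {palette slot: number of models} for slots used more than once.

    Hypothetical helper for illustration: mirrors the old `i % 20` indexing.
    """
    counts = Counter(i % palette_size for i in range(n_models))
    return {slot: count for slot, count in counts.items() if count > 1}


collisions = shared_color_slots(28)
print(len(collisions))           # 8 palette slots are shared
print(sum(collisions.values()))  # 16 models sit on a shared color
```

Models 20 through 27 wrap around onto slots 0 through 7, so eight colors each appear twice in the legend.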
Re-renders report.md plus every SVG in results/report_assets/ from the current calls.jsonl + corpus, so the published artifacts match the rest of the repo's snapshot pattern. The substantive change is still only the three scatter plots (pareto, precision_recall, token_efficiency) where each model now gets a unique color; the other SVGs and report.md only differ in matplotlib-generated element IDs, embedded timestamps, and PASS-row ordering for ties.
Summary
- `benchmarks/llm/src/benchmark/report.py` (`_render_pareto`, `_render_precision_recall_chart`, `_render_token_efficiency_chart`) was assigning colors with `cmap = plt.get_cmap("tab20")` then `cmap(i % 20)`. With more than 20 models in a sweep, the legend cycled colors, so multiple models shared the same dot color. The current report has 28 models, so the Cost-vs-F1 chart had several visibly duplicate colors.
- Add `_distinct_colors(n)`, which concatenates `tab20 + tab20b + tab20c` (60 categorical colors with good perceptual contrast) and falls back to evenly-spaced `hsv` past that, so every model gets a unique color both up to and beyond 60. The 60-color cutoff is derived from `len(palette)`, not hardcoded.
- All three renderers now use `colors = _distinct_colors(len(points))` and `colors[i]`.
- Regenerated the three affected SVGs (`pareto.svg`, `precision_recall.svg`, `token_efficiency.svg`). Other charts color by value semantics (threshold bands, heatmaps, p50/p90/p99 categories), not model identity, so they were not affected and were not regenerated to keep this diff focused.

Test plan
- `_distinct_colors(n)` returns `n` unique RGBA tuples for `n in {5, 30, 60, 80, 200}` (60 from categorical, >60 from hsv).
- `benchmark report`; SVGs render and the three target scatters show distinct colors per model.
- `/simplify` and `/code-review` ran clean against the diff (no findings above the 80-confidence threshold).