@remotion/skills-evals: Add batched eval runs #7396
TL;DR — Adds a --runs flag (1-4) so bun run eval run and bun run eval compare can batch parallel executions of a scenario, with the studio UI gaining a run-count selector, per-run progress cards, and scroll-stable log panels.
Key changes
- Per-run batching across CLI, server, and UI: A new run-count.ts validator (1..maxParallelSkillEvalRuns) is threaded through runSkillEvalComparison, startRun, startComparison, the /api/run and /api/compare routes, and the CLI argument parsers.
- Manifest schema gains runs array: SkillEvalComparison now carries an optional runs: SkillEvalComparisonRunPair[] plus runCount. The legacy top-level before/after fields are still populated from runs[0] so existing comparison.json files keep loading.
- Scoped concurrency helper: Each consumer gets its own runWithConcurrency that caps in-flight workers at maxParallelSkillEvalRuns and (in compare.ts / jobs.ts) bails out on the first failure.
- Per-run UI cards with stable scroll: scenario.tsx swaps the global before/after pair for a list of RunProgressGroups, and the polling script updates each pre[data-run-log] in place so the log doesn't jump while polling.
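To illustrate the manifest change, here is a minimal sketch of how a reader could accept both old and new comparison.json files. Only the names SkillEvalComparison, SkillEvalComparisonRunPair, runs, runCount, before, and after come from the PR; the SkillEvalSnapshot shape and the getRunPairs / toManifest helpers are hypothetical and may not match the real code.

```typescript
// Hypothetical snapshot shape; the real fields are not shown in the PR.
type SkillEvalSnapshot = {label: string; outputDir: string};

type SkillEvalComparisonRunPair = {
  before: SkillEvalSnapshot;
  after: SkillEvalSnapshot;
};

type SkillEvalComparison = {
  // Legacy top-level fields, kept so older manifests still load.
  before: SkillEvalSnapshot;
  after: SkillEvalSnapshot;
  // New optional fields for batched runs.
  runs?: SkillEvalComparisonRunPair[];
  runCount?: number;
};

// Hypothetical reader: returns run pairs whether or not the manifest has `runs`.
const getRunPairs = (
  manifest: SkillEvalComparison,
): SkillEvalComparisonRunPair[] => {
  if (manifest.runs && manifest.runs.length > 0) {
    return manifest.runs;
  }
  // Older manifest: treat the top-level pair as a single run.
  return [{before: manifest.before, after: manifest.after}];
};

// Hypothetical writer: mirrors runs[0] into the legacy fields.
const toManifest = (
  runs: SkillEvalComparisonRunPair[],
): SkillEvalComparison => {
  const [first] = runs;
  if (!first) {
    throw new Error('At least one run is required');
  }
  return {before: first.before, after: first.after, runs, runCount: runs.length};
};
```

With a reader like this, UI code can iterate over run pairs without branching on the manifest version.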
Summary | 11 files | 1 commit | base: main ← cursor/skills-evals-batched-runs-834b
Concurrency helper duplicated across three modules
Before: one shared Promise.allSettled pair coordinated by firstSnapshotError and per-label AbortControllers.
After: three near-identical local runWithConcurrency helpers in cli.ts, compare.ts, and app/jobs.ts, only two of which fail fast on the first error.
The helper handles the same problem (bounded parallelism, single-error propagation) and the three copies have already diverged: the cli.ts variant uses Promise.all without a firstError guard, so a failure in one run lets the remaining queued runs keep starting before the rejection surfaces. The other two abort the queue as soon as firstError is set. Extracting this into a shared module (next to run-count.ts) would keep the semantics consistent and shrink the diff.
packages/skills-evals/src/cli.ts · packages/skills-evals/src/compare.ts · packages/skills-evals/src/app/jobs.ts
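A shared helper along the lines suggested above could look like the sketch below. The {inputs, limit, worker} call shape mirrors the bun -e smoke test in the PR description; the body is an assumption about one way to get bounded parallelism with fail-fast queueing, not the actual implementation.

```typescript
type RunWithConcurrencyOptions<T, R> = {
  inputs: readonly T[];
  limit: number;
  worker: (input: T, index: number) => Promise<R>;
};

const runWithConcurrency = async <T, R>({
  inputs,
  limit,
  worker,
}: RunWithConcurrencyOptions<T, R>): Promise<R[]> => {
  const results = new Array<R>(inputs.length);
  // Holder object so the first failure is visible to every lane.
  const failure: {hit: boolean; error: unknown} = {hit: false, error: undefined};
  let nextIndex = 0;

  // Each lane pulls the next queued input until the queue is empty
  // or another lane has already recorded a failure.
  const lane = async (): Promise<void> => {
    while (!failure.hit && nextIndex < inputs.length) {
      const index = nextIndex++;
      try {
        results[index] = await worker(inputs[index] as T, index);
      } catch (error) {
        if (!failure.hit) {
          failure.hit = true;
          failure.error = error;
        }
      }
    }
  };

  const laneCount = Math.max(1, Math.min(limit, inputs.length));
  await Promise.all(Array.from({length: laneCount}, () => lane()));

  if (failure.hit) {
    throw failure.error;
  }
  return results;
};
```

In this sketch, workers that are already in flight run to completion; only queued inputs stop being started after the first error. Aborting in-flight work would need something like an AbortSignal passed to the worker, which is left out here.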
Studio resultUrl only points at the first batched run
Before: single-run scenarios redirected to the one produced manifest.
After: multi-run scenarios still redirect to results[0], so runs #2..#N are only reachable via the scenario page's Runs list.
Functionally fine because every manifest is persisted under runsRoot and surfaced by loadPlainRuns, but worth confirming the UX is intentional — the user clicks "Run" with --runs 4, watches four cards finish, and then lands on run #1 rather than staying on the scenario page where all four are visible.
packages/skills-evals/src/app/jobs.ts · packages/skills-evals/src/app/scenario.tsx
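If the redirect to run #1 turns out not to be intentional, one possible alternative is sketched below. Everything here is hypothetical: pickResultUrl, FinishedRun, manifestUrl, and scenarioUrl are illustrative names, not identifiers from jobs.ts.

```typescript
// Hypothetical result entry; the real job result shape may differ.
type FinishedRun = {manifestUrl: string};

// Hypothetical helper: single runs go to their manifest,
// batched runs stay on the scenario page where every run is listed.
const pickResultUrl = ({
  scenarioUrl,
  results,
}: {
  scenarioUrl: string;
  results: readonly FinishedRun[];
}): string | null => {
  const [first] = results;
  if (!first) {
    return null; // Nothing finished yet, so there is nowhere to redirect.
  }
  return results.length === 1 ? first.manifestUrl : scenarioUrl;
};
```

This keeps the single-run behavior unchanged while avoiding a jump away from the page that shows all cards of a batch.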
Claude Opus

Summary
- --runs support for skills eval run and compare, capped at 4 parallel eval executions.
- [...] main.
Testing
- bun run lint in packages/skills-evals
- bun run formatting in packages/skills-evals
- bun -e "import {runWithConcurrency} from './src/run-with-concurrency'; const started = []; try { await runWithConcurrency({inputs: [1,2,3,4], limit: 1, worker: async (value) => { started.push(value); if (value === 2) throw new Error('boom'); return value; }}); process.exit(1); } catch (error) { console.log(error instanceof Error ? error.message : String(error)); console.log(started.join(',')); }" in packages/skills-evals
- bun run eval run --all --runs 4 in packages/skills-evals
- POST /api/run/ui-smoke?runs=4 with a temporary local scenario confirmed four run groups and no initial resultUrl
- bun run build at repo root
- bun run stylecheck at repo root (expected failure: @remotion/lambda-go requires Go >= 1.23.0, VM has Go 1.22.2)
Walkthrough
skills_evals_pullfrog_fixes.mp4