`@remotion/skills-evals`: Add batched eval runs by samohovets · Pull Request #7396 · remotion-dev/remotion

samohovets · 2026-05-15T10:26:02Z

Summary

Add --runs support for skills eval run and compare, capped at 4 parallel eval executions.
Persist batched comparison pairs in manifests while preserving the existing single-run fields.
Add dev UI run-count selection alongside the comparison base-ref control from main.
Split active progress into per-run cards, with each run owning its own before/after logs.
Preserve log panel scroll position across polling updates by updating run cards in place.
Extract bounded concurrency into a shared helper and keep multi-run plain jobs on the scenario page instead of redirecting to run number 1.
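The scroll-preservation item above can be illustrated with a small sketch. It is not the code from this PR: `LogPanel` and `updateLogInPlace` are hypothetical names, and following the tail when the reader is already at the bottom is an assumed behavior. The idea is to mutate the existing element's text rather than re-render the card, then restore the previous scroll offset.

```typescript
// Minimal structural type so the sketch runs without a DOM.
// A real <pre data-run-log> element (HTMLPreElement) has these properties too.
type LogPanel = {
  textContent: string | null;
  scrollTop: number;
  readonly scrollHeight: number;
  readonly clientHeight: number;
};

// Hypothetical helper: update the log text in place and keep the scroll offset.
const updateLogInPlace = (panel: LogPanel, nextText: string): void => {
  if (panel.textContent === nextText) {
    return; // nothing changed, so do not touch the element at all
  }

  // Assumption: treat "within 4px of the end" as "reader is following the tail".
  const distanceFromBottom =
    panel.scrollHeight - panel.scrollTop - panel.clientHeight;
  const wasAtBottom = distanceFromBottom <= 4;
  const previousScrollTop = panel.scrollTop;

  panel.textContent = nextText;

  // Either keep following the tail or restore the offset the reader had.
  panel.scrollTop = wasAtBottom ? panel.scrollHeight : previousScrollTop;
};
```

In a browser, a polling loop could call this for each element matched by `pre[data-run-log]`, passing the latest log text for that run.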

Testing

bun run lint in packages/skills-evals
bun run formatting in packages/skills-evals
bun -e "import {runWithConcurrency} from './src/run-with-concurrency'; const started = []; try { await runWithConcurrency({inputs: [1,2,3,4], limit: 1, worker: async (value) => { started.push(value); if (value === 2) throw new Error('boom'); return value; }}); process.exit(1); } catch (error) { console.log(error instanceof Error ? error.message : String(error)); console.log(started.join(',')); }" in packages/skills-evals
bun run eval run --all --runs 4 in packages/skills-evals
POST /api/run/ui-smoke?runs=4 with a temporary local scenario confirmed four run groups and no initial resultUrl
Browser walkthrough confirmed the page stays on the scenario route while showing all four active run cards
bun run build at repo root
bun run stylecheck at repo root (expected failure: @remotion/lambda-go requires Go >= 1.23.0, VM has Go 1.22.2)

Walkthrough

skills_evals_pullfrog_fixes.mp4

vercel · 2026-05-15T10:26:08Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
bugs	Ready	Preview, Comment	May 15, 2026 2:01pm

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
remotion	Skipped		May 15, 2026 2:01pm

pullfrog

TL;DR — Adds a --runs flag (1-4) so bun run eval run and bun run eval compare can batch parallel executions of a scenario, with the studio UI gaining a run-count selector, per-run progress cards, and scroll-stable log panels.
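A run-count check for the 1 to 4 range might look roughly like the sketch below. This is an illustration only: `parseRunCount` is a hypothetical name, the default of 1 when the flag is omitted is an assumption, and the actual `run-count.ts` in the PR may be structured differently.

```typescript
// Cap taken from the PR description ("capped at 4 parallel eval executions").
const maxParallelSkillEvalRuns = 4;

// Hypothetical parser for the raw value of a --runs flag or ?runs= query param.
const parseRunCount = (raw: string | undefined): number => {
  if (raw === undefined) {
    return 1; // assumption: a missing flag means a single run
  }

  // Accept plain decimal integers only (no signs, decimals, or whitespace).
  if (!/^\d+$/.test(raw)) {
    throw new Error(`--runs must be an integer, got "${raw}"`);
  }

  const value = Number(raw);
  if (value < 1 || value > maxParallelSkillEvalRuns) {
    throw new Error(
      `--runs must be between 1 and ${maxParallelSkillEvalRuns}, got ${value}`,
    );
  }

  return value;
};
```

Sharing one validator between the CLI parsers and the HTTP routes keeps the accepted range identical on every entry point.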

Key changes

Per-run batching across CLI, server, and UI — A new run-count.ts validator (1..maxParallelSkillEvalRuns) is threaded through runSkillEvalComparison, startRun, startComparison, the /api/run and /api/compare routes, and the CLI argument parsers.
Manifest schema gains runs array — SkillEvalComparison now carries an optional runs: SkillEvalComparisonRunPair[] plus runCount. The legacy top-level before/after fields are still populated from runs[0] so existing comparison.json files keep loading.
Scoped concurrency helper — Each consumer gets its own runWithConcurrency that caps in-flight workers at maxParallelSkillEvalRuns and (in compare.ts / jobs.ts) bails out on the first failure.
Per-run UI cards with stable scroll — scenario.tsx swaps the global before/after pair for a list of RunProgressGroups, and the polling script updates each pre[data-run-log] in place so the log doesn't jump while polling.
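The backward-compatible manifest shape described in the list above could be sketched as follows. The type names `SkillEvalComparison` and `SkillEvalComparisonRunPair` come from the review; everything else is assumed for illustration: `SkillEvalRunRef`, `buildComparison`, and `getRunPairs` are hypothetical, and whether `runCount` is optional is a guess.

```typescript
// Hypothetical stand-in for whatever a single eval run is recorded as.
type SkillEvalRunRef = {manifestPath: string};

type SkillEvalComparisonRunPair = {
  before: SkillEvalRunRef;
  after: SkillEvalRunRef;
};

type SkillEvalComparison = {
  // Legacy single-run fields, still written for older readers.
  before: SkillEvalRunRef;
  after: SkillEvalRunRef;
  // Batched fields; absent in comparison.json files written before this change.
  runs?: SkillEvalComparisonRunPair[];
  runCount?: number;
};

// Writer side: mirror runs[0] into the legacy top-level fields.
const buildComparison = (
  pairs: SkillEvalComparisonRunPair[],
): SkillEvalComparison => {
  const [first] = pairs;
  if (!first) {
    throw new Error('A comparison needs at least one before/after pair');
  }

  return {
    before: first.before,
    after: first.after,
    runs: pairs,
    runCount: pairs.length,
  };
};

// Reader side: treat an old manifest as a batch of exactly one pair.
const getRunPairs = (
  comparison: SkillEvalComparison,
): SkillEvalComparisonRunPair[] =>
  comparison.runs ?? [{before: comparison.before, after: comparison.after}];
```

With this shape, code that only knows the legacy fields keeps working, while newer code can always iterate over `getRunPairs(...)` regardless of which version wrote the file.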

Summary | 11 files | 1 commit | base: main ← cursor/skills-evals-batched-runs-834b

Concurrency helper duplicated across three modules

Before: one shared Promise.allSettled pair coordinated by firstSnapshotError and per-label AbortControllers.
After: three near-identical local runWithConcurrency helpers in cli.ts, compare.ts, and app/jobs.ts, only two of which fail fast on the first error.

The helper handles the same problem (bounded parallelism, single-error propagation) and the three copies have already diverged: the cli.ts variant uses Promise.all without a firstError guard, so a failure in one run lets the remaining queued runs keep starting before the rejection surfaces. The other two abort the queue as soon as firstError is set. Extracting this into a shared module (next to run-count.ts) would keep the semantics consistent and shrink the diff.

packages/skills-evals/src/cli.ts · packages/skills-evals/src/compare.ts · packages/skills-evals/src/app/jobs.ts
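A shared helper with the fail-fast semantics discussed in this comment might look like the sketch below. The `{inputs, limit, worker}` option names follow the `bun -e` check in the Testing section; the body is an assumption about one reasonable implementation, not the code that was merged.

```typescript
type RunWithConcurrencyOptions<T, R> = {
  inputs: readonly T[];
  limit: number;
  worker: (input: T, index: number) => Promise<R>;
};

// Runs `worker` over `inputs` with at most `limit` calls in flight.
// After the first failure no further inputs are started, and that
// error is what the returned promise rejects with.
const runWithConcurrency = async <T, R>({
  inputs,
  limit,
  worker,
}: RunWithConcurrencyOptions<T, R>): Promise<R[]> => {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error('limit must be a positive integer');
  }

  const results = new Array<R>(inputs.length);
  let nextIndex = 0;
  let failed = false;

  // Each lane pulls the next unclaimed index until the queue is empty
  // or some lane has recorded a failure.
  const lane = async (): Promise<void> => {
    while (!failed && nextIndex < inputs.length) {
      const index = nextIndex++;
      try {
        results[index] = await worker(inputs[index] as T, index);
      } catch (error) {
        failed = true; // stop every lane from claiming more work
        throw error;
      }
    }
  };

  const laneCount = Math.min(limit, inputs.length);
  await Promise.all(Array.from({length: laneCount}, () => lane()));
  return results;
};
```

With `limit: 1` and a worker that throws on the second of four inputs, only the first two inputs are ever started. Workers that are already in flight when the failure occurs are not cancelled in this sketch; cancelling them would need something like an `AbortSignal` passed to the worker.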

Studio `resultUrl` only points at the first batched run

Before: single-run scenarios redirected to the one produced manifest.
After: multi-run scenarios still redirect to results[0], so runs #2..#N are only reachable via the scenario page's Runs list.

Functionally fine because every manifest is persisted under runsRoot and surfaced by loadPlainRuns, but worth confirming the UX is intentional — the user clicks "Run" with --runs 4, watches four cards finish, and then lands on run #1 rather than staying on the scenario page where all four are visible.

packages/skills-evals/src/app/jobs.ts · packages/skills-evals/src/app/scenario.tsx
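The PR summary states that multi-run plain jobs were ultimately kept on the scenario page instead of redirecting to run number 1. One way to express that rule is sketched below; `getResultUrl`, `FinishedRun`, and the URL shape are all hypothetical and are not taken from `jobs.ts`.

```typescript
// Hypothetical record for a completed run.
type FinishedRun = {id: string};

// Returns a URL to redirect to, or null to stay on the scenario page.
const getResultUrl = (
  scenarioId: string,
  runs: readonly FinishedRun[],
): string | null => {
  const [onlyRun] = runs;

  // Batched jobs (or jobs with no results) stay on the scenario page,
  // where every run card remains visible.
  if (runs.length !== 1 || !onlyRun) {
    return null;
  }

  // Assumed route layout, for illustration only.
  return `/scenarios/${encodeURIComponent(scenarioId)}/runs/${encodeURIComponent(onlyRun.id)}`;
};
```

Returning `null` rather than the first run's URL makes the "no redirect" case explicit for the client polling code.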

Add batched skills eval runs

779ed08

vercel Bot deployed to Preview – bugs May 15, 2026 10:26 View deployment

Merge main into batched skills eval runs

e3e3294

vercel Bot temporarily deployed to Preview – remotion May 15, 2026 10:36 Inactive

vercel Bot deployed to Preview – bugs May 15, 2026 10:36 View deployment

Split batched eval run progress

f59bd89

vercel Bot temporarily deployed to Preview – remotion May 15, 2026 11:01 Inactive

vercel Bot deployed to Preview – bugs May 15, 2026 11:01 View deployment

Preserve skills eval log scroll position

4acfc58

vercel Bot temporarily deployed to Preview – remotion May 15, 2026 11:13 Inactive

vercel Bot deployed to Preview – bugs May 15, 2026 11:14 View deployment

samohovets marked this pull request as ready for review May 15, 2026 11:15

pullfrog Bot reviewed May 15, 2026

Comment thread packages/skills-evals/src/cli.ts Outdated

Comment thread packages/skills-evals/src/app/jobs.ts Outdated

Comment thread packages/skills-evals/src/run-skill-eval.ts

Address batched eval review comments

1d143cc

vercel Bot temporarily deployed to Preview – remotion May 15, 2026 14:01 Inactive

vercel Bot deployed to Preview – bugs May 15, 2026 14:01 View deployment

samohovets merged commit a07d5a7 into main May 15, 2026
16 checks passed

samohovets deleted the cursor/skills-evals-batched-runs-834b branch May 15, 2026 14:28

samohovets merged 5 commits into main from cursor/skills-evals-batched-runs-834b
