@remotion/skills-evals: Add batched eval runs#7396

Merged: samohovets merged 5 commits into main from cursor/skills-evals-batched-runs-834b on May 15, 2026
Conversation

@samohovets (Member) commented May 15, 2026

Summary

  • Add --runs support for skills eval run and compare, capped at 4 parallel eval executions.
  • Persist batched comparison pairs in manifests while preserving the existing single-run fields.
  • Add dev UI run-count selection alongside the comparison base-ref control from main.
  • Split active progress into per-run cards, with each run owning its own before/after logs.
  • Preserve log panel scroll position across polling updates by updating run cards in place.
  • Extract bounded concurrency into a shared helper and keep multi-run plain jobs on the scenario page instead of redirecting to run number 1.

Testing

  • bun run lint in packages/skills-evals
  • bun run formatting in packages/skills-evals
  • bun -e "import {runWithConcurrency} from './src/run-with-concurrency'; const started = []; try { await runWithConcurrency({inputs: [1,2,3,4], limit: 1, worker: async (value) => { started.push(value); if (value === 2) throw new Error('boom'); return value; }}); process.exit(1); } catch (error) { console.log(error instanceof Error ? error.message : String(error)); console.log(started.join(',')); }" in packages/skills-evals
  • bun run eval run --all --runs 4 in packages/skills-evals
  • POST /api/run/ui-smoke?runs=4 with a temporary local scenario confirmed four run groups and no initial resultUrl
  • Browser walkthrough confirmed the page stays on the scenario route while showing all four active run cards
  • bun run build at repo root
  • bun run stylecheck at repo root (expected failure: @remotion/lambda-go requires Go >= 1.23.0, VM has Go 1.22.2)

Walkthrough

skills_evals_pullfrog_fixes.mp4


@vercel vercel Bot (Contributor) commented May 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
bugs | Ready | Preview, Comment | May 15, 2026 2:01pm

1 Skipped Deployment

Project | Deployment | Actions | Updated (UTC)
remotion | Skipped | | May 15, 2026 2:01pm

@vercel vercel Bot temporarily deployed to Preview – remotion May 15, 2026 11:13 Inactive
@samohovets samohovets marked this pull request as ready for review May 15, 2026 11:15
@pullfrog pullfrog Bot (Contributor) left a comment

TL;DR — Adds a --runs flag (1-4) so bun run eval run and bun run eval compare can batch parallel executions of a scenario, with the studio UI gaining a run-count selector, per-run progress cards, and scroll-stable log panels.
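A minimal sketch of what a `--runs` validator could look like, under stated assumptions. The review only names `run-count.ts` and `maxParallelSkillEvalRuns`; the function name `parseRunCount`, the default of 1 when the flag is omitted, and the error wording are hypothetical and not taken from the PR.

```typescript
// Cap named in the review; the value 4 comes from the PR summary.
const maxParallelSkillEvalRuns = 4;

// Hypothetical helper: parses the raw --runs value into a bounded integer.
function parseRunCount(raw: string | undefined): number {
  // Assumption: omitting the flag keeps the existing single-run behavior.
  if (raw === undefined) {
    return 1;
  }

  // Reject values such as "", "2.5", or "4x" before converting.
  if (!/^\d+$/.test(raw)) {
    throw new Error(`--runs must be an integer, got "${raw}"`);
  }

  const value = Number(raw);
  if (value < 1 || value > maxParallelSkillEvalRuns) {
    throw new Error(
      `--runs must be between 1 and ${maxParallelSkillEvalRuns}, got ${value}`,
    );
  }

  return value;
}
```

Sharing one validator between the CLI parsers and the HTTP routes would keep the accepted range identical everywhere the run count enters the system.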

Key changes

  • Per-run batching across CLI, server, and UI: A new run-count.ts validator (1..maxParallelSkillEvalRuns) is threaded through runSkillEvalComparison, startRun, startComparison, the /api/run and /api/compare routes, and the CLI argument parsers.
  • Manifest schema gains runs array: SkillEvalComparison now carries an optional runs: SkillEvalComparisonRunPair[] plus runCount. The legacy top-level before/after fields are still populated from runs[0] so existing comparison.json files keep loading.
  • Scoped concurrency helper: Each consumer gets its own runWithConcurrency that caps in-flight workers at maxParallelSkillEvalRuns and (in compare.ts / jobs.ts) bails out on the first failure.
  • Per-run UI cards with stable scroll: scenario.tsx swaps the global before/after pair for a list of RunProgressGroups, and the polling script updates each pre[data-run-log] in place so the log doesn't jump while polling.
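The backward-compatible manifest shape described in the list above can be sketched roughly as follows. Only the names `SkillEvalComparison`, `SkillEvalComparisonRunPair`, `runs`, `runCount`, `before`, and `after` come from the review; the `RunResult` placeholder and the two helper functions are hypothetical illustrations, not the PR's code.

```typescript
// Hypothetical placeholder for whatever a single before/after result holds.
type RunResult = {manifestPath: string};

type SkillEvalComparisonRunPair = {
  before: RunResult;
  after: RunResult;
};

type SkillEvalComparison = {
  // Legacy single-run fields, still written for older readers.
  before: RunResult;
  after: RunResult;
  // New optional fields for batched comparisons.
  runs?: SkillEvalComparisonRunPair[];
  runCount?: number;
};

// Hypothetical writer: mirrors runs[0] into the legacy top-level fields.
function buildComparison(
  pairs: SkillEvalComparisonRunPair[],
): SkillEvalComparison {
  const first = pairs[0];
  if (first === undefined) {
    throw new Error('at least one run pair is required');
  }

  return {
    before: first.before,
    after: first.after,
    runs: pairs,
    runCount: pairs.length,
  };
}

// Hypothetical reader: treats a manifest without runs as a single pair.
function getRunPairs(
  comparison: SkillEvalComparison,
): SkillEvalComparisonRunPair[] {
  return (
    comparison.runs ?? [{before: comparison.before, after: comparison.after}]
  );
}
```

Keeping both new fields optional means a reader can treat old and new manifests uniformly by going through a single accessor such as `getRunPairs`.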

Summary | 11 files | 1 commit | base: main ← cursor/skills-evals-batched-runs-834b


Concurrency helper duplicated across three modules

Before: one shared Promise.allSettled pair coordinated by firstSnapshotError and per-label AbortControllers.
After: three near-identical local runWithConcurrency helpers in cli.ts, compare.ts, and app/jobs.ts, only two of which fail fast on the first error.

The helper handles the same problem (bounded parallelism, single-error propagation) and the three copies have already diverged: the cli.ts variant uses Promise.all without a firstError guard, so a failure in one run lets the remaining queued runs keep starting before the rejection surfaces. The other two abort the queue as soon as firstError is set. Extracting this into a shared module (next to run-count.ts) would keep the semantics consistent and shrink the diff.

packages/skills-evals/src/cli.ts · packages/skills-evals/src/compare.ts · packages/skills-evals/src/app/jobs.ts
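As an illustration of the shared helper the comment suggests, here is a minimal sketch of bounded concurrency with fail-fast semantics. The `{inputs, limit, worker}` option shape matches the one-liner in the PR's Testing section; the internals below are an assumption and may differ from what was merged in `run-with-concurrency`.

```typescript
type RunWithConcurrencyOptions<T, R> = {
  inputs: readonly T[];
  limit: number;
  worker: (input: T, index: number) => Promise<R>;
};

// Sketch only: runs at most `limit` workers at a time and stops handing out
// queued inputs once any worker has failed.
async function runWithConcurrency<T, R>({
  inputs,
  limit,
  worker,
}: RunWithConcurrencyOptions<T, R>): Promise<R[]> {
  const results = new Array<R>(inputs.length);
  const errors: unknown[] = [];
  let nextIndex = 0;

  // Each lane pulls the next unclaimed index until the queue is empty
  // or an error has been recorded by any lane.
  const lane = async (): Promise<void> => {
    while (errors.length === 0 && nextIndex < inputs.length) {
      const index = nextIndex++;
      try {
        results[index] = await worker(inputs[index] as T, index);
      } catch (error) {
        errors.push(error);
      }
    }
  };

  const laneCount = Math.max(1, Math.min(limit, inputs.length));
  await Promise.all(Array.from({length: laneCount}, () => lane()));

  // Workers already in flight are allowed to settle; only the first
  // recorded error is surfaced to the caller.
  if (errors.length > 0) {
    throw errors[0];
  }

  return results;
}
```

With this sketch, inputs `[1, 2, 3, 4]`, `limit: 1`, and a worker that throws on `2` reject with that error after starting only inputs 1 and 2, which is the behavior the Testing one-liner probes. A plain `Promise.all` over lanes without the `errors.length === 0` guard would keep starting queued inputs, which is the divergence the comment points out in the cli.ts variant.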


Studio resultUrl only points at the first batched run

Before: single-run scenarios redirected to the one produced manifest.
After: multi-run scenarios still redirect to results[0], so runs #2..#N are only reachable via the scenario page's Runs list.

Functionally fine because every manifest is persisted under runsRoot and surfaced by loadPlainRuns, but worth confirming the UX is intentional — the user clicks "Run" with --runs 4, watches four cards finish, and then lands on run #1 rather than staying on the scenario page where all four are visible.

packages/skills-evals/src/app/jobs.ts · packages/skills-evals/src/app/scenario.tsx
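The PR summary states that multi-run plain jobs now stay on the scenario page, and the Testing notes mention "no initial resultUrl" for a four-run job. One way to express that rule is sketched below; `FinishedRun` and `resolveResultUrl` are hypothetical names, and the actual logic in jobs.ts may be structured differently.

```typescript
// Hypothetical shape for a finished run as seen by the job tracker.
type FinishedRun = {resultUrl: string};

// Returns a redirect target only when exactly one run was produced.
// For batched jobs it returns null so the UI stays on the scenario page,
// where every run card remains visible.
function resolveResultUrl(results: readonly FinishedRun[]): string | null {
  const [only] = results;
  if (results.length === 1 && only !== undefined) {
    return only.resultUrl;
  }

  return null;
}
```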

Comment thread packages/skills-evals/src/cli.ts Outdated
Comment thread packages/skills-evals/src/app/jobs.ts Outdated
Comment thread packages/skills-evals/src/run-skill-eval.ts
@vercel vercel Bot temporarily deployed to Preview – remotion May 15, 2026 14:01 Inactive
@samohovets samohovets merged commit a07d5a7 into main May 15, 2026
16 checks passed
@samohovets samohovets deleted the cursor/skills-evals-batched-runs-834b branch May 15, 2026 14:28