@remotion/skills-evals: Add visual skill eval runner #7337
Important
Two issues will keep the runner from working out of the box on a clean checkout: the sandbox project lands inside the monorepo's `packages/**` workspace glob (so `bun install` resolves Remotion to the in-repo workspace instead of the registry), and `getCurrentRemotionVersion()` returns the not-yet-published version that every release-targeted PR bumps to. There is also a credential-leak risk worth fixing before this is wired into anything shared: the full `process.env` is forwarded to the `pi` agent, and the recursive walker follows symlinks back out of the sandbox.
TL;DR — Adds a private @remotion/skills-evals package that drives the external pi CLI through a fresh blank-template sandbox, collects visual artifacts and Pi session exports, and renders a static dark-mode HTML gallery. Aimed at #7256.
Key changes
- Add `@remotion/skills-evals` package: private Bun-CLI package with `eval list|run|gallery` commands, scenario registry, and an animated bar chart seed scenario.
- Sandbox runner: copies `packages/template-blank` + `packages/skills/skills/remotion` per run, patches Remotion deps to the current monorepo version, runs `bun install`, then shells out to `pi`.
- Manifest + gallery: writes a `manifest.json` per run capturing prompt, prompt hash, skill snapshot hash, command logs, and discovered artifacts; `gallery.ts` aggregates manifests across `.runs/` into a single static HTML page.
- Workspace wiring: `.gitignore` entry for `packages/skills-evals/.runs`, `bun.lock` workspace stub, and a new project reference in the root `tsconfig.json`.
Summary | 15 files | 1 commit | base: main ← skills-evals-package
Sandbox install resolves against the monorepo, not the registry
Before: N/A (new package).
After: `runSkillEval` writes the per-run project to `packages/skills-evals/.runs/<id>/<ts>/project`, then runs `bun install` there.
The root package.json workspaces.packages glob is packages/** with no exclusion for .runs/**, so Bun walks up from the project cwd, finds the monorepo root, and treats the sandbox package.json as a workspace member. That collapses the "install the published Remotion versions" goal into "symlink the in-repo workspace" and makes results non-portable. Either nest .runs/ outside packages/** or add an !packages/skills-evals/.runs/** entry to the workspaces list.
package.json · run-skill-eval.ts
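
A minimal sketch of how a runner could detect this situation before installing: walk up from the sandbox directory and report the first ancestor whose `package.json` declares a `workspaces` field. The helper name `findEnclosingWorkspaceRoot` and the injectable `readManifest` parameter are hypothetical and not part of the PR. The check only detects that a workspace root exists above the sandbox; it does not evaluate the glob itself.

```ts
import {existsSync, readFileSync} from 'node:fs';
import {dirname, join} from 'node:path';

type PackageManifest = {workspaces?: unknown};
type ReadManifest = (dir: string) => PackageManifest | null;

// Default reader: parse <dir>/package.json if it exists.
const readManifestFromDisk: ReadManifest = (dir) => {
  const manifestPath = join(dir, 'package.json');
  if (!existsSync(manifestPath)) {
    return null;
  }
  return JSON.parse(readFileSync(manifestPath, 'utf8')) as PackageManifest;
};

// Hypothetical helper: returns the nearest ancestor (or startDir itself)
// whose package.json has a `workspaces` field, or null if there is none.
export const findEnclosingWorkspaceRoot = (
  startDir: string,
  readManifest: ReadManifest = readManifestFromDisk,
): string | null => {
  let current = startDir;
  for (;;) {
    const manifest = readManifest(current);
    if (manifest?.workspaces !== undefined) {
      return current;
    }
    const parent = dirname(current);
    if (parent === current) {
      return null; // reached the filesystem root
    }
    current = parent;
  }
};
```

If the helper returns a path for the sandbox directory, the run could fail fast with a message pointing at the workspace configuration instead of silently linking in-repo packages.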
`bun install` will fail on every release-targeted PR
Before: N/A.
After: `getCurrentRemotionVersion` reads `packages/core/src/version.ts` and writes that exact version into the sandbox `package.json`.
AGENTS.md instructs every PR to bump `version.ts` to a version that has not been published yet; `bun install` will then fail with `No matching version found for remotion@<unpublished>`. There's no `workspace:*` fallback, no local-tarball install, and no registry-availability check. Either resolve to the latest published version (an `npm view remotion version` lookup or a manual override flag) or pin `workspace:*` against the cooperating workspace stub.
run-skill-eval.ts · packages/core/src/version.ts
Symlink-following + full env forwarding combine into a leak primitive
Before: N/A.
After: Recursive walkers use `entry.isDirectory()`, which returns `true` for symlinks-to-directories; `runCommand` forwards the entire `process.env` to both `bun install` and `pi`.
The walker in run-skill-eval.ts (and the duplicates in pi.ts/gallery.ts) follows directory symlinks. The agent under test runs inside the sandbox and can plant a symlink such as projectRoot/leak -> / or -> $HOME — discoverVisualArtifacts will then emit absolute paths into the manifest and the gallery will render <img src="file:///etc/..."> style links. hashDirectory is also exposed: a symlink to /dev/zero or a large host file inflates memory unboundedly. Independently, command.ts spreads process.env into the spawned subprocess, so any agent prompt that says "echo your env" lands the full set (OPENAI_API_KEY, GITHUB_TOKEN, AWS_*) into Pi's stdout, which then gets logged and linked from the gallery as "Pi stdout". Use lstat (or readdir({recursive: true}) plus an explicit realpath containment check) and pass an explicit env allowlist (PATH, HOME, TMPDIR, plus whatever Pi requires) instead of spreading process.env.
run-skill-eval.ts · command.ts · pi.ts
Convention drift on root tsconfig.json and README
Before: Other private Bun-CLI packages (`packages/codex-plugin`) are NOT listed in the root `tsconfig.json` references.
After: `skills-evals` is added to root references with `noEmit: true`.
The leaf tsconfig matches codex-plugin shape (module: es2020, moduleResolution: bundler, noEmit: true), but codex-plugin is intentionally absent from root references. Adding skills-evals there gives tsc -b a no-emit project to traverse with no .tsbuildinfo invalidation contract — drop the entry to match the established pattern. Separately, the README does not mention that pi is a hard external prerequisite (run-skill-eval.ts shells out to pi directly); a contributor running bun run eval run animated-bar-chart cold will see an opaque ENOENT.
tsconfig.json · README.md · packages/codex-plugin/tsconfig.json
Claude Opus | 𝕏

    const install = await runCommand({
      command: ['bun', 'install'],
      cwd: projectRoot,
    });

The sandbox project at .runs/<id>/<ts>/project falls inside the root packages/** workspaces glob, so this bun install walks up to the monorepo root and treats the sandbox package.json as a workspace member — Remotion deps get symlinked to the in-repo workspace instead of being installed from the registry, defeating the per-run-fresh-install intent. Either move .runs/ outside packages/**, or add !packages/skills-evals/.runs/** to the root workspaces.packages list.

        throw new Error('Could not read the current Remotion version.');
      }

      return match[1];

packages/core/src/version.ts is bumped to a not-yet-published version on every release-targeted PR (per AGENTS.md), so bun install will fail with No matching version found for remotion@<unpublished> until that release lands on npm. Resolve to the latest published version (e.g. npm view remotion version) with an override flag for testing the in-progress version explicitly.
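
A minimal sketch of the suggested resolution order, assuming `npm` is on `PATH`: an explicit override wins, otherwise the latest published version is looked up with `npm view remotion version`. The function name `resolveRemotionVersion` and its options are hypothetical; the lookup is injectable so the logic can be exercised without network access.

```ts
import {execFileSync} from 'node:child_process';

type ResolveOptions = {
  // Explicit version, e.g. for testing an in-progress release on purpose.
  override?: string;
  // Injectable lookup; defaults to asking the npm registry.
  lookupLatest?: () => string;
};

const lookupLatestFromNpm = (): string =>
  execFileSync('npm', ['view', 'remotion', 'version'], {
    encoding: 'utf8',
  }).trim();

const SEMVER_LIKE = /^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?$/;

// Hypothetical helper: pick the version to write into the sandbox package.json.
export const resolveRemotionVersion = ({
  override,
  lookupLatest = lookupLatestFromNpm,
}: ResolveOptions = {}): string => {
  const version = (override ?? lookupLatest()).trim();
  if (!SEMVER_LIKE.test(version)) {
    throw new Error(`Unexpected Remotion version: ${JSON.stringify(version)}`);
  }
  return version;
};
```

Calling `resolveRemotionVersion()` with no arguments would perform the registry lookup; a CLI flag could feed `override` when the unpublished version is wanted deliberately.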

      const startedAt = Date.now();
      const subprocess = Bun.spawn(command, {
        cwd,
        env: {...process.env, ...env},

Spreading process.env forwards every credential the runner has (OPENAI_API_KEY, GITHUB_TOKEN, AWS_*) into both bun install and the pi agent subprocess. The agent under test can read its own env and exfiltrate it through stdout (which is logged and linked from the gallery). Use an explicit allowlist (PATH, HOME, TMPDIR, plus whatever Pi truly needs).
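
A minimal sketch of an allowlist-based environment builder. The name `buildChildEnv` is hypothetical, and which extra variables Pi actually needs is not stated in this thread, so they are left to the caller.

```ts
// Variables every subprocess gets; taken from the list in the comment above.
const BASE_ALLOWLIST = ['PATH', 'HOME', 'TMPDIR'] as const;

// Hypothetical helper: copy only allowlisted variables from `source`,
// then apply explicit per-command overrides.
export const buildChildEnv = (
  source: Record<string, string | undefined>,
  extraAllowed: readonly string[] = [],
  overrides: Record<string, string> = {},
): Record<string, string> => {
  const env: Record<string, string> = {};
  for (const name of [...BASE_ALLOWLIST, ...extraAllowed]) {
    const value = source[name];
    if (value !== undefined) {
      env[name] = value;
    }
  }
  return {...env, ...overrides};
};
```

In `runCommand`, the spawn options would then use something like `env: buildChildEnv(process.env, piVariables, env)` in place of the spread, where `piVariables` is an assumed, explicitly maintained list.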

      });
      const timeout = timeoutMs
        ? setTimeout(() => {
            subprocess.kill();

On timeout, subprocess.kill() sends SIGTERM to pi only — any children it spawned (e.g. remotion render/Chromium) become orphans, and the returned CommandResult carries a non-zero exitCode indistinguishable from a real failure. Consider escalating to SIGKILL after a grace window and surfacing a timedOut: true flag so callers can differentiate.
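
A minimal sketch of the escalation described above, written against `node:child_process` rather than `Bun.spawn` and assuming a POSIX host: the child is started in its own process group, the whole group receives SIGTERM on timeout and SIGKILL after a grace window, and the result carries a `timedOut` flag. The name `runWithTimeout` and the result shape are hypothetical.

```ts
import {spawn} from 'node:child_process';

type TimedCommandResult = {
  exitCode: number | null;
  signal: string | null;
  timedOut: boolean;
};

// Hypothetical helper; `process.kill(-pid, ...)` targets a process group
// and only works on POSIX systems.
export const runWithTimeout = (
  command: string[],
  {timeoutMs, graceMs = 5_000}: {timeoutMs: number; graceMs?: number},
): Promise<TimedCommandResult> =>
  new Promise((resolve, reject) => {
    const [file, ...args] = command;
    // detached: true makes the child the leader of a new process group.
    const child = spawn(file, args, {detached: true, stdio: 'ignore'});
    let timedOut = false;
    let killTimer: ReturnType<typeof setTimeout> | undefined;

    const signalGroup = (signal: 'SIGTERM' | 'SIGKILL') => {
      if (child.pid === undefined) {
        return;
      }
      try {
        process.kill(-child.pid, signal);
      } catch {
        // The group has already exited.
      }
    };

    const timer = setTimeout(() => {
      timedOut = true;
      signalGroup('SIGTERM');
      killTimer = setTimeout(() => signalGroup('SIGKILL'), graceMs);
    }, timeoutMs);

    child.on('error', (err) => {
      clearTimeout(timer);
      clearTimeout(killTimer);
      reject(err);
    });
    child.on('exit', (exitCode, signal) => {
      clearTimeout(timer);
      clearTimeout(killTimer);
      if (timedOut) {
        // Final sweep for descendants that outlived the direct child.
        signalGroup('SIGKILL');
      }
      resolve({exitCode, signal, timedOut});
    });
  });
```

A caller such as `runPi` could then branch on `result.timedOut` and report a hung session separately from a crash.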

    const listFilesRecursively = async (dir: string): Promise<string[]> => {
      const entries = await readdir(dir, {withFileTypes: true});
      const files = await Promise.all(
        entries.map((entry) => {
          const absolutePath = join(dir, entry.name);

          if (entry.isDirectory()) {
            if (entry.name === 'node_modules') {
              return [];
            }

            return listFilesRecursively(absolutePath);
          }

          return [absolutePath];
        }),
      );

      return files.flat().sort();
    };

entry.isDirectory() returns true for symlinks pointing at directories, so the walker happily descends out of the sandbox if the agent under test (or any artifact written by pi) plants a symlink like leak -> / or -> $HOME. discoverVisualArtifacts then emits absolute host paths into the manifest, and hashDirectory reads arbitrary host files into memory. Switch to lstat (or guard with realpath containment) before recursing.
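
A minimal sketch of an `lstat`-based walker, assuming it is acceptable to ignore symbolic links entirely. It skips links explicitly instead of relying on how `isDirectory()` treats them, and only returns regular files, so device nodes and FIFOs are excluded too. The name `listFilesInsideRoot` is hypothetical.

```ts
import {lstat, readdir, realpath} from 'node:fs/promises';
import {join} from 'node:path';

// Hypothetical replacement for listFilesRecursively: never follows symlinks,
// so every returned path stays under the (real) root directory.
export const listFilesInsideRoot = async (root: string): Promise<string[]> => {
  const walk = async (dir: string): Promise<string[]> => {
    const names = await readdir(dir);
    const nested = await Promise.all(
      names.map(async (name) => {
        const absolutePath = join(dir, name);
        // lstat describes the entry itself, not the target of a link.
        const stats = await lstat(absolutePath);

        if (stats.isSymbolicLink()) {
          return [];
        }

        if (stats.isDirectory()) {
          return name === 'node_modules' ? [] : walk(absolutePath);
        }

        return stats.isFile() ? [absolutePath] : [];
      }),
    );

    return nested.flat().sort();
  };

  return walk(await realpath(root));
};
```

If links inside the sandbox need to be supported later, the alternative is to resolve each one with `realpath` and keep it only when the result is still contained in the root.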

      }
    };

    main();

`main()` is invoked without a `.catch(...)`, so rejections from `runSkillEval`/`generateGallery` surface as unhandled-rejection warnings rather than a clean non-zero exit. Wrap with ``main().catch((err) => { process.stderr.write(`${err?.stack ?? err}\n`); process.exit(1); });``.

        sessionDir,
        '-p',
        prompt,
      ];

`model` and `sessionDir` flow into Pi's argv unchecked, so a value starting with `--` would be reinterpreted by Pi as a flag (argument injection). Today the scenarios are checked in, but `runSkillEval` is exported from `index.ts`, so a future caller could supply attacker-controlled values. Either reject values with a leading `--`, or pass `--` as an end-of-options separator before user values.
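
A minimal sketch of the rejection variant. The helper name `assertNotOption` is hypothetical; it rejects any leading `-`, which is slightly stricter than checking for `--` only. Whether Pi honours a bare `--` separator is not confirmed here, so the sketch does not rely on it.

```ts
// Hypothetical guard: refuse values that a CLI parser could read as a flag.
export const assertNotOption = (name: string, value: string): string => {
  if (value.startsWith('-')) {
    throw new Error(
      `${name} must not start with "-": ${JSON.stringify(value)}`,
    );
  }
  return value;
};
```

The argv builder would pass `assertNotOption('model', model)` and `assertNotOption('sessionDir', sessionDir)` instead of the raw values.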

        return `<a href="${href}"><img src="${href}" alt="${escapeHtml(
          artifact.relativePath,
        )}" /></a>`;
      }

      return `<video src="${href}" controls preload="metadata"></video>`;

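`href` is built from `relative(...)` + `encodeURI(...)` and interpolated into the attribute without HTML escaping. `encodeURI` percent-encodes `"` (as `%22`) but leaves `&` and `'` untouched, so the value is still not attribute-safe: a filename containing something like `&lt;` is read as a character reference and the link points at the wrong file, and a `'` would close the attribute if the markup ever switches to single quotes. Same issue on lines 48 (the `<a>`/`<img>` pair) and 97–108 (Pi export, manifest, and log links). Wrap each href with the existing `escapeHtml()` helper after `encodeURI`.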
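
A minimal sketch of the combined encoding: URL-encode first, then HTML-escape for the attribute context, so characters that `encodeURI` leaves alone (such as `&` and `'`) cannot be misread inside the attribute. The `escapeHtml` below is a stand-in for the package's own helper, whose implementation is not shown in this thread, and `toAttributeSafeHref` is a hypothetical name.

```ts
// Stand-in for the existing escapeHtml helper (assumed behaviour).
const escapeHtml = (value: string): string =>
  value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');

// Hypothetical helper: safe to interpolate into href="..." or src="...".
export const toAttributeSafeHref = (relativePath: string): string =>
  escapeHtml(encodeURI(relativePath));
```

Every template in the gallery that currently interpolates `href` directly would call this helper once at the point where the URL is built.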

          "path": "./packages/sfx"
        },
        {
          "path": "./packages/skills-evals"

Other private Bun-CLI packages (packages/codex-plugin) deliberately stay out of the root references list because they're noEmit: true with no project-build contract. Adding skills-evals here gives tsc -b a no-emit project to traverse without a .tsbuildinfo invalidation contract — drop the entry to match the established pattern.

    bun run eval run animated-bar-chart
    bun run eval run --all
    bun run eval gallery
    ```

The runner shells out to a pi binary (src/pi.ts) that is not a workspace dependency. A contributor running bun run eval run animated-bar-chart cold will see an opaque ENOENT — please document pi as a prerequisite and how to install it.
Move the eval UI to a React dashboard and simplify the scenario comparison flow so skill iteration is easier to inspect.
Compare skills against HEAD, expose plain scenario runs when there is no diff, and make Pi render artifacts to a predictable project path.
Keep the private skills eval tool out of the publishable package registry so monorepo consistency checks only validate real Remotion packages.
Important
A few correctness and lifecycle issues that bite as soon as a second scenario or a timeout shows up. The HTTP server also needs a top-level error handler — a couple of paths currently throw straight into Bun's default 500.
TL;DR — Adds a private @remotion/skills-evals package with a Bun-served dashboard, CLI, and Pi-based comparison runner that copies the blank template, snapshots packages/skills/skills, runs two Pi sessions per scenario, and diffs the resulting artifacts against the HEAD baseline.
Key changes
- New `@remotion/skills-evals` private package: sets up the workspace entry (`package.json`, `tsconfig.json`, `eslint.config.mjs`) and wires it into `bun.lock`, `packages/cli`/`create-video` package lists, `packages/studio-shared/package-info.ts`, and the root `tsconfig.json`.
- Pi-driven eval pipeline: `runSkillEval` materializes the blank template, copies skills into `.pi/skills/remotion`, runs `bun install` plus two Pi turns (scenario then render), exports the session, and writes a manifest with logged commands and discovered artifacts.
- Comparison runner: `runSkillEvalComparison` archives `HEAD`'s `packages/skills/skills` via `git archive | tar -x`, diffs it against the working tree, and runs `before`/`after` Pi sessions in parallel under `.runs/comparisons/<scenario>/<id>`.
- Local React dashboard: `Bun.serve` on `127.0.0.1:3030` with SSR routes for home, scenario, run, and comparison pages, plus `/files/*` for serving artifacts and `/api/{compare,run,jobs}` for kicking off and polling jobs.
- Pi stream parsing & seed scenario: `pi-stream-extension.ts` parses Pi's streaming output to surface phase progress in the dashboard, and `scenarios.ts` seeds an animated bar chart scenario; `packages/skills/skills/remotion/SKILL.md` gains typography/readability guidance.
Summary | 31 files | 11 commits | base: main ← skills-evals-package
Scenario id sanitization is asymmetric between disk and URLs
Before: N/A (new code).
After: Runs write to `join(runsRoot, sanitizePathPart(id), …)` but URLs are built from the raw `manifest.id`.
runSkillEval writes to join(runsRoot, sanitizePathPart(input.id), …) (run-skill-eval.ts:280) and runSkillEvalComparison writes to join(comparisonsRoot, sanitizePathPart(scenario.id), …), but the URLs and lookups use the raw id (jobs.ts:236-238, run.tsx:64, scenario.tsx:80, scenario.tsx:57's loadPlainRuns(scenario.id)). The seed animated-bar-chart is a fixed point of sanitizePathPart, so the dashboard works today; the moment anyone adds a scenario with an uppercase letter, space, or id longer than 80 characters, completed runs return 404. Either pass the sanitized id everywhere a URL or directory name is built, or persist the sanitized id on the manifest and route off that.
run-skill-eval.ts · jobs.ts · scenario.tsx
Subprocess timeout doesn't reap the tree and is invisible to callers
Before: N/A.
After: `setTimeout(() => subprocess.kill(), timeoutMs)` only SIGTERMs the direct child.
Scenarios pass a 20-minute timeoutMs through to pi, but subprocess.kill() (command.ts:71) sends a single SIGTERM to the immediate child — the actual workloads (pi's provider client, bun install, sh -c 'git archive | tar -x') live in descendant processes, which keep running after the parent exits, hold file descriptors on .runs/, and continue burning API credits on the Pi side. runCommand also gives the caller no signal that a timeout fired: it just returns whatever exit code the killed process produced (commonly 143/null), so runPi throws Pi failed (143) and the dashboard cannot distinguish a hung session from a real crash.
Comparison runs orphan their sibling on first failure
Before: N/A.
After: `Promise.allSettled([runSnapshot('before'), runSnapshot('after')])` waits for both.
When one Pi snapshot rejects early (e.g. bun install fails in seconds), Promise.allSettled keeps awaiting the surviving 20-minute Pi run before re-throwing — the function will reliably report the failure, but only after the other side has finished spending the full timeout's worth of compute. Either race the two with cancellation (AbortSignal plumbed through runCommand, plus SIGTERM/SIGKILL on the sibling) or sequence them.
/files/* boundary check is fragile
Before: N/A.
After: ``if (!file.startsWith(`${resolvedRunsRoot}/`)) return notFound();``
Two issues with one line. The hard-coded `/` makes this guard reject every path on Windows, where `path.resolve` returns backslash-separated strings, so `/files/*` is unusable there. Separately, if anything inside `.runs/` ever becomes a symlink (workspace `node_modules` from `bun install` regularly contains them), `resolve` won't follow it and the check passes for a path whose realpath is outside the runs root. The `path.relative(root, file)` idiom already used in `shared.tsx`'s `toFileUrl` fixes the separator problem; because `relative` is also purely lexical, closing the symlink hole additionally requires comparing `realpath` results.
Bun.serve has no error handler
Before: N/A.
After: Any route handler throw bubbles to Bun's default 500 with no log line.
decodeURIComponent(pathname.replace(/^\/files\//, '')) (routes.ts:65) throws URIError on any malformed percent-encoded request; renderScenario awaits getSkillDiffState() which throws when git is absent or returns non-zero; loadComparison / loadRun read absolute paths stored in manifest.json / comparison.json, so a relocated .runs/ (different checkout, different machine) throws straight out of the handler. None of these have a try { … } catch and Bun.serve is configured without error() (server.tsx:6-11), so the developer gets opaque 500s with no console output. Adding error(err) { console.error(err); return new Response('Internal Server Error', {status: 500}); } and a try/catch around the git call in renderScenario covers the realistic failure modes.
server.tsx · routes.ts · scenario.tsx

      const file = resolve(runsRoot, relativePath);
      const resolvedRunsRoot = resolve(runsRoot);

      if (!file.startsWith(`${resolvedRunsRoot}/`)) {

The hard-coded `/` makes this guard reject every legitimate path on Windows, where `path.resolve` returns backslash-separated strings. In addition, `resolve` won't follow symlinks if anything inside `.runs/` ever ends up symlinked (workspace `node_modules` from `bun install` regularly contains them). The `path.relative(root, file)` idiom already used in `shared.tsx#toFileUrl` handles the separator case; the symlink case also needs a `realpath` comparison, since `relative` is purely lexical.

    - if (!file.startsWith(`${resolvedRunsRoot}/`)) {
    + const relativeToRoot = relative(resolvedRunsRoot, file);
    + if (relativeToRoot.startsWith('..') || isAbsolute(relativeToRoot)) {
    +   return notFound();
    + }

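
A minimal sketch of a containment check that covers both concerns, under the assumption that served files must be regular paths whose real location is inside the runs root. It first applies the lexical `relative` check, then repeats it on `realpath` results. The names `isInsideRoot` and `resolveServablePath` are hypothetical. Compared with a plain `startsWith('..')`, the lexical check also accepts legitimate names that merely begin with two dots, such as `..hidden`.

```ts
import {realpathSync} from 'node:fs';
import {isAbsolute, relative, resolve, sep} from 'node:path';

// Lexical containment: true only for paths strictly below `root`.
export const isInsideRoot = (root: string, candidate: string): boolean => {
  const rel = relative(resolve(root), resolve(candidate));
  if (rel === '') {
    return false; // the root itself is not a servable file
  }
  return rel !== '..' && !rel.startsWith(`..${sep}`) && !isAbsolute(rel);
};

// Hypothetical helper for the /files/* route: returns the real path to
// serve, or null when the request escapes the root or does not exist.
export const resolveServablePath = (
  root: string,
  requestPath: string,
): string | null => {
  const candidate = resolve(root, requestPath);
  if (!isInsideRoot(root, candidate)) {
    return null;
  }
  try {
    const realRoot = realpathSync(root);
    const realCandidate = realpathSync(candidate);
    return isInsideRoot(realRoot, realCandidate) ? realCandidate : null;
  } catch {
    return null; // missing file or dangling link
  }
};
```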

        ? setTimeout(() => {
            subprocess.kill();
          }, timeoutMs)
        : null;

subprocess.kill() only SIGTERMs the direct child, so a 20-minute Pi timeout leaves the provider client, bun install, and sh -c 'git archive | tar -x' descendants running after this resolves — they keep burning credits and hold file descriptors on .runs/. Also nothing tells the caller a timeout fired: runPi throws Pi failed (143) and the dashboard cannot distinguish a hung run from a real crash. Spawn with a new process group and on timeout signal -pgid, escalate to SIGKILL after a grace period, and surface a timedOut flag on CommandResult.

      const [beforeResult, afterResult] = await Promise.allSettled([
        runSnapshot('before'),
        runSnapshot('after'),
      ]);

Promise.allSettled waits for both snapshots even after one rejects, so a fast failure on one side leaves the surviving 20-minute Pi run to finish before this function re-throws — wasted compute and model budget every time. Plumb an AbortSignal through runCommand and cancel the sibling on first rejection (or run them sequentially).
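
A minimal sketch of the cancel-on-first-rejection variant. Each task receives a shared `AbortSignal`; when one rejects, the controller aborts, the siblings are given the chance to wind down, and the original error is re-thrown. The name `allOrAbort` is hypothetical, and how `runCommand` would honour the signal (for example by killing its process group) is assumed rather than shown.

```ts
type Task<T> = (signal: AbortSignal) => Promise<T>;

// Hypothetical helper: like Promise.all, but aborts the surviving tasks
// as soon as one of them rejects.
export async function allOrAbort<T>(tasks: readonly Task<T>[]): Promise<T[]> {
  const controller = new AbortController();
  const running = tasks.map((task) => task(controller.signal));

  try {
    return await Promise.all(running);
  } catch (err) {
    controller.abort();
    // Wait for siblings to finish their cleanup before reporting.
    await Promise.allSettled(running);
    throw err;
  }
}
```

A task that ignores the signal still runs to completion, so the benefit depends on the signal being plumbed all the way down to the subprocess.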

      job.message = 'Run complete.';
      job.resultUrl = `/runs/${encodeURIComponent(
        result.manifest.id,
      )}/${encodeURIComponent(basename(result.manifest.runDir))}`;

result.manifest.id is the raw scenario id, but result.manifest.runDir was built with sanitizePathPart(input.id) — for any scenario id that isn't a fixed point of the sanitizer (uppercase, spaces, > 80 chars), /runs/<rawId>/<runDir> will 404 because the route handler joins the raw id back into runsRoot. The seed animated-bar-chart happens to work, so this is latent until a new scenario is added. Persist the sanitized id on the manifest (or apply sanitizePathPart here) and use it consistently when building hrefs and reading by id (e.g. run.tsx:9, scenario.tsx:57).
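
A minimal sketch of the "persist the sanitized id" option. The `sanitizePathPart` below is only a stand-in: the real helper's rules are not shown in this thread, and the lowercase, dash-replacement, and 80-character behaviour is an assumption inferred from the comment above. `toManifestIds` and `runUrl` are hypothetical names.

```ts
// Stand-in for the package's sanitizePathPart (assumed behaviour).
export const sanitizePathPart = (value: string): string =>
  value
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, '-')
    .replace(/^-+|-+$/g, '')
    .slice(0, 80);

type ManifestIds = {
  id: string; // raw scenario id, for display
  pathId: string; // sanitized id, for directories and URLs
};

export const toManifestIds = (scenarioId: string): ManifestIds => ({
  id: scenarioId,
  pathId: sanitizePathPart(scenarioId),
});

// Build the run URL from the same value that was used on disk.
export const runUrl = (ids: ManifestIds, runDirName: string): string =>
  `/runs/${encodeURIComponent(ids.pathId)}/${encodeURIComponent(runDirName)}`;
```

Route handlers would then look runs up by `pathId`, keeping the raw `id` for titles only.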

      '/files/*': (request: Request) => {
        const {pathname} = new URL(request.url);
        const relativePath = decodeURIComponent(pathname.replace(/^\/files\//, ''));

decodeURIComponent throws URIError on malformed percent-encoded paths (e.g. /files/%E0%A4%A), which propagates to Bun's default error handling because Bun.serve is configured without an error() callback. Wrap in try { … } catch { return notFound(); } here, and add an error handler to the Bun.serve config in server.tsx so unexpected throws (stale absolute paths in manifests, missing git, etc.) at least log the request that produced them.
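
A minimal sketch of the guarded decode, so the route can map malformed input to a 404 instead of an uncaught exception. The helper name `safeDecodePath` is hypothetical.

```ts
// Returns the decoded path, or null when the input is not valid
// percent-encoding. Other errors are re-thrown unchanged.
export const safeDecodePath = (encoded: string): string | null => {
  try {
    return decodeURIComponent(encoded);
  } catch (err) {
    if (err instanceof URIError) {
      return null;
    }
    throw err;
  }
};
```

In the `/files/*` handler, a `null` result would return `notFound()` before any path resolution happens.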
Co-authored-by: Cursor <cursoragent@cursor.com>

Summary
- `@remotion/skills-evals` package for Pi-based visual eval scenarios against repo-local Remotion skills.

Test plan
- `cd packages/skills-evals && bun run format`
- `cd packages/skills-evals && bun run lint`