Skip to content

@remotion/skills-evals: Add visual skill eval runner#7337

Merged
samohovets merged 16 commits into
mainfrom
skills-evals-package
May 14, 2026
Merged

@remotion/skills-evals: Add visual skill eval runner#7337
samohovets merged 16 commits into
mainfrom
skills-evals-package

Conversation

@samohovets
Copy link
Copy Markdown
Member

@samohovets samohovets commented May 12, 2026

Summary

  • Add a private @remotion/skills-evals package for Pi-based visual eval scenarios against repo-local Remotion skills.
  • Provide a local React dashboard for scenario-scoped comparisons, before/after video review, Pi exports, manifests, logs, and rendered skill diffs.
  • Copy the blank template and skill snapshots into isolated run folders, then prompt Pi in two turns: scenario creation first, render-artifact follow-up in the same session.
  • Support iteration by comparing a scenario against its latest successful comparison before falling back to the appropriate Git baseline.
  • Seed an animated bar chart scenario and add typography/readability guidance for Remotion skills.
  • Should help us to address Skills: How do we prevent the AI from making the font sizes / proportions too small? #7256.

Test plan

  • cd packages/skills-evals && bun run format
  • cd packages/skills-evals && bun run lint
  • Browser smoke check for the dashboard home, scenario page, and comparison page

@vercel
Copy link
Copy Markdown
Contributor

vercel Bot commented May 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
bugs Ready Ready Preview, Comment May 14, 2026 0:15am
1 Skipped Deployment
Project Deployment Actions Updated (UTC)
remotion Skipped Skipped May 14, 2026 0:15am

Request Review

@samohovets samohovets changed the title : Add visual skill eval runner Skills: Add visual skill eval runner May 12, 2026
@samohovets samohovets changed the title Skills: Add visual skill eval runner @remotion/skills-evals: Add visual skill eval runner May 12, 2026
@samohovets samohovets marked this pull request as draft May 12, 2026 17:01
Copy link
Copy Markdown
Contributor

@pullfrog pullfrog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Two issues will keep the runner from working out of the box on a clean checkout: the sandbox project lands inside the monorepo's packages/** workspace glob (so bun install resolves Remotion to the in-repo workspace instead of the registry), and getCurrentRemotionVersion() returns the not-yet-published version every release-targeted PR bumps to. There's also a credential-leak shape worth fixing before this is wired into anything shared (full process.env is forwarded to the pi agent, and the recursive walker follows symlinks back out of the sandbox).

TL;DR — Adds a private @remotion/skills-evals package that drives the external pi CLI through a fresh blank-template sandbox, collects visual artifacts and Pi session exports, and renders a static dark-mode HTML gallery. Aimed at #7256.

Key changes

  • Add @remotion/skills-evals package — private Bun-CLI package with eval list|run|gallery commands, scenario registry, and an animated bar chart seed scenario.
  • Sandbox runner — copies packages/template-blank + packages/skills/skills/remotion per run, patches Remotion deps to the current monorepo version, runs bun install, then shells out to pi.
  • Manifest + gallery — writes a manifest.json per run capturing prompt, prompt hash, skill snapshot hash, command logs, and discovered artifacts; gallery.ts aggregates manifests across .runs/ into a single static HTML page.
  • Workspace wiring.gitignore entry for packages/skills-evals/.runs, bun.lock workspace stub, and a new project reference in the root tsconfig.json.

Summary | 15 files | 1 commit | base: mainskills-evals-package


Sandbox install resolves against the monorepo, not the registry

Before: N/A (new package).
After: runSkillEval writes the per-run project to packages/skills-evals/.runs/<id>/<ts>/project, then runs bun install there.

The root package.json workspaces.packages glob is packages/** with no exclusion for .runs/**, so Bun walks up from the project cwd, finds the monorepo root, and treats the sandbox package.json as a workspace member. That collapses the "install the published Remotion versions" goal into "symlink the in-repo workspace" and makes results non-portable. Either nest .runs/ outside packages/** or add an !packages/skills-evals/.runs/** entry to the workspaces list.

package.json · run-skill-eval.ts


bun install will fail every release-targeted PR

Before: N/A.
After: getCurrentRemotionVersion reads packages/core/src/version.ts and writes that exact version into the sandbox package.json.

AGENTS.md instructs every PR to bump version.ts to a version that has not been published yet; bun install will then fail with No matching version found for remotion@<unpublished>. There's no workspace:* fallback, no local-tarball install, and no registry-availability check. Either resolve to the latest published version (a npm view remotion version lookup or a manual override flag) or pin workspace:* against the cooperating workspace stub.

run-skill-eval.ts · packages/core/src/version.ts


Symlink-following + full env forwarding combine into a leak primitive

Before: N/A.
After: Recursive walkers use entry.isDirectory(), which returns true for symlinks-to-directories; runCommand forwards the entire process.env to both bun install and pi.

The walker in run-skill-eval.ts (and the duplicates in pi.ts/gallery.ts) follows directory symlinks. The agent under test runs inside the sandbox and can plant a symlink such as projectRoot/leak -> / or -> $HOMEdiscoverVisualArtifacts will then emit absolute paths into the manifest and the gallery will render <img src="file:///etc/..."> style links. hashDirectory is also exposed: a symlink to /dev/zero or a large host file inflates memory unboundedly. Independently, command.ts spreads process.env into the spawned subprocess, so any agent prompt that says "echo your env" lands the full set (OPENAI_API_KEY, GITHUB_TOKEN, AWS_*) into Pi's stdout, which then gets logged and linked from the gallery as "Pi stdout". Use lstat (or readdir({recursive: true}) plus an explicit realpath containment check) and pass an explicit env allowlist (PATH, HOME, TMPDIR, plus whatever Pi requires) instead of spreading process.env.

run-skill-eval.ts · command.ts · pi.ts


Convention drift on root tsconfig.json and README

Before: Other private Bun-CLI packages (packages/codex-plugin) are NOT listed in the root tsconfig.json references.
After: skills-evals is added to root references with noEmit: true.

The leaf tsconfig matches codex-plugin shape (module: es2020, moduleResolution: bundler, noEmit: true), but codex-plugin is intentionally absent from root references. Adding skills-evals there gives tsc -b a no-emit project to traverse with no .tsbuildinfo invalidation contract — drop the entry to match the established pattern. Separately, the README does not mention that pi is a hard external prerequisite (run-skill-eval.ts shells out to pi directly); a contributor running bun run eval run animated-bar-chart cold will see an opaque ENOENT.

tsconfig.json · README.md · packages/codex-plugin/tsconfig.json

Pullfrog  | Fix all ➔Fix 👍s ➔View workflow run | Using Claude Opus𝕏

const install = await runCommand({
command: ['bun', 'install'],
cwd: projectRoot,
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sandbox project at .runs/<id>/<ts>/project falls inside the root packages/** workspaces glob, so this bun install walks up to the monorepo root and treats the sandbox package.json as a workspace member — Remotion deps get symlinked to the in-repo workspace instead of being installed from the registry, defeating the per-run-fresh-install intent. Either move .runs/ outside packages/**, or add !packages/skills-evals/.runs/** to the root workspaces.packages list.

throw new Error('Could not read the current Remotion version.');
}

return match[1];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

packages/core/src/version.ts is bumped to a not-yet-published version on every release-targeted PR (per AGENTS.md), so bun install will fail with No matching version found for remotion@<unpublished> until that release lands on npm. Resolve to the latest published version (e.g. npm view remotion version) with an override flag for testing the in-progress version explicitly.

Comment thread packages/skills-evals/src/command.ts Outdated
const startedAt = Date.now();
const subprocess = Bun.spawn(command, {
cwd,
env: {...process.env, ...env},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spreading process.env forwards every credential the runner has (OPENAI_API_KEY, GITHUB_TOKEN, AWS_*) into both bun install and the pi agent subprocess. The agent under test can read its own env and exfiltrate it through stdout (which is logged and linked from the gallery). Use an explicit allowlist (PATH, HOME, TMPDIR, plus whatever Pi truly needs).

Comment thread packages/skills-evals/src/command.ts Outdated
});
const timeout = timeoutMs
? setTimeout(() => {
subprocess.kill();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On timeout, subprocess.kill() sends SIGTERM to pi only — any children it spawned (e.g. remotion render/Chromium) become orphans, and the returned CommandResult carries a non-zero exitCode indistinguishable from a real failure. Consider escalating to SIGKILL after a grace window and surfacing a timedOut: true flag so callers can differentiate.

Comment on lines +94 to +113
const listFilesRecursively = async (dir: string): Promise<string[]> => {
const entries = await readdir(dir, {withFileTypes: true});
const files = await Promise.all(
entries.map((entry) => {
const absolutePath = join(dir, entry.name);

if (entry.isDirectory()) {
if (entry.name === 'node_modules') {
return [];
}

return listFilesRecursively(absolutePath);
}

return [absolutePath];
}),
);

return files.flat().sort();
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

entry.isDirectory() returns true for symlinks pointing at directories, so the walker happily descends out of the sandbox if the agent under test (or any artifact written by pi) plants a symlink like leak -> / or -> $HOME. discoverVisualArtifacts then emits absolute host paths into the manifest, and hashDirectory reads arbitrary host files into memory. Switch to lstat (or guard with realpath containment) before recursing.

}
};

main();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main() is invoked without a .catch(...) so rejections from runSkillEval/generateGallery surface as unhandled-rejection warnings rather than a clean non-zero exit. Wrap with main().catch((err) => { process.stderr.write(${err?.stack ?? err}
); process.exit(1); });.

sessionDir,
'-p',
prompt,
];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model and sessionDir flow into Pi's argv unchecked — a value starting with -- would be reinterpreted by Pi as a flag (argument injection). Today the scenarios are checked-in, but runSkillEval is exported from index.ts so a future caller could supply attacker-controlled values; either reject leading--- values or pass -- as an end-of-options separator before user values.

Comment thread packages/skills-evals/src/gallery.ts Outdated
Comment on lines +48 to +53
return `<a href="${href}"><img src="${href}" alt="${escapeHtml(
artifact.relativePath,
)}" /></a>`;
}

return `<video src="${href}" controls preload="metadata"></video>`;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

href is built from relative(...) + encodeURI(...), which does not encode " (legal in filenames on Linux/macOS) — a filename containing a literal " will close the src attribute and inject HTML. Same issue on lines 48 (the <a>/<img> pair) and 97–108 (Pi export, manifest, and log links). Wrap each href with escapeHtml() after encodeURI, or reuse the existing escapeHtml helper for attribute contexts.

Comment thread tsconfig.json Outdated
"path": "./packages/sfx"
},
{
"path": "./packages/skills-evals"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other private Bun-CLI packages (packages/codex-plugin) deliberately stay out of the root references list because they're noEmit: true with no project-build contract. Adding skills-evals here gives tsc -b a no-emit project to traverse without a .tsbuildinfo invalidation contract — drop the entry to match the established pattern.

Comment thread packages/skills-evals/README.md Outdated
bun run eval run animated-bar-chart
bun run eval run --all
bun run eval gallery
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runner shells out to a pi binary (src/pi.ts) that is not a workspace dependency. A contributor running bun run eval run animated-bar-chart cold will see an opaque ENOENT — please document pi as a prerequisite and how to install it.

Move the eval UI to a React dashboard and simplify the scenario comparison flow so skill iteration is easier to inspect.
Compare skills against HEAD, expose plain scenario runs when there is no diff, and make Pi render artifacts to a predictable project path.
Keep the private skills eval tool out of the publishable package registry so monorepo consistency checks only validate real Remotion packages.
@vercel vercel Bot temporarily deployed to Preview – remotion May 13, 2026 16:59 Inactive
@samohovets samohovets marked this pull request as ready for review May 13, 2026 17:25
Copy link
Copy Markdown
Contributor

@pullfrog pullfrog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

A few correctness and lifecycle issues that bite as soon as a second scenario or a timeout shows up. The HTTP server also needs a top-level error handler — a couple of paths currently throw straight into Bun's default 500.

TL;DR — Adds a private @remotion/skills-evals package with a Bun-served dashboard, CLI, and Pi-based comparison runner that copies the blank template, snapshots packages/skills/skills, runs two Pi sessions per scenario, and diffs the resulting artifacts against the HEAD baseline.

Key changes

  • New @remotion/skills-evals private package — sets up the workspace entry (package.json, tsconfig.json, eslint.config.mjs) and wires it into bun.lock, packages/cli/create-video package lists, packages/studio-shared/package-info.ts, and the root tsconfig.json.
  • Pi-driven eval pipelinerunSkillEval materializes the blank template, copies skills into .pi/skills/remotion, runs bun install plus two Pi turns (scenario then render), exports the session, and writes a manifest with logged commands and discovered artifacts.
  • Comparison runnerrunSkillEvalComparison archives HEAD's packages/skills/skills via git archive | tar -x, diffs it against the working tree, and runs before/after Pi sessions in parallel under .runs/comparisons/<scenario>/<id>.
  • Local React dashboardBun.serve on 127.0.0.1:3030 with SSR routes for home, scenario, run, and comparison pages, plus /files/* for serving artifacts and /api/{compare,run,jobs} for kicking off and polling jobs.
  • Pi stream parsing & seed scenariopi-stream-extension.ts parses Pi's streaming output to surface phase progress in the dashboard, and scenarios.ts seeds an animated bar chart scenario; packages/skills/skills/remotion/SKILL.md gains typography/readability guidance.

Summary | 31 files | 11 commits | base: mainskills-evals-package


Scenario id sanitization is asymmetric between disk and URLs

Before: N/A (new code).
After: Runs write to join(runsRoot, sanitizePathPart(id), …) but URLs are built from the raw manifest.id.

runSkillEval writes to join(runsRoot, sanitizePathPart(input.id), …) (run-skill-eval.ts:280) and runSkillEvalComparison writes to join(comparisonsRoot, sanitizePathPart(scenario.id), …), but the URLs and lookups use the raw id (jobs.ts:236-238, run.tsx:64, scenario.tsx:80, scenario.tsx:57's loadPlainRuns(scenario.id)). The seed animated-bar-chart is a fixed point of sanitizePathPart, so the dashboard works today; the moment anyone adds a scenario with an uppercase letter, space, or id longer than 80 characters, completed runs return 404. Either pass the sanitized id everywhere a URL or directory name is built, or persist the sanitized id on the manifest and route off that.

run-skill-eval.ts · jobs.ts · scenario.tsx


Subprocess timeout doesn't reap the tree and is invisible to callers

Before: N/A.
After: setTimeout(() => subprocess.kill(), timeoutMs) only SIGTERMs the direct child.

Scenarios pass a 20-minute timeoutMs through to pi, but subprocess.kill() (command.ts:71) sends a single SIGTERM to the immediate child — the actual workloads (pi's provider client, bun install, sh -c 'git archive | tar -x') live in descendant processes, which keep running after the parent exits, hold file descriptors on .runs/, and continue burning API credits on the Pi side. runCommand also gives the caller no signal that a timeout fired: it just returns whatever exit code the killed process produced (commonly 143/null), so runPi throws Pi failed (143) and the dashboard cannot distinguish a hung session from a real crash.

command.ts · pi.ts


Comparison runs orphan their sibling on first failure

Before: N/A.
After: Promise.allSettled([runSnapshot('before'), runSnapshot('after')]) waits for both.

When one Pi snapshot rejects early (e.g. bun install fails in seconds), Promise.allSettled keeps awaiting the surviving 20-minute Pi run before re-throwing — the function will reliably report the failure, but only after the other side has finished spending the full timeout's worth of compute. Either race the two with cancellation (AbortSignal plumbed through runCommand, plus SIGTERM/SIGKILL on the sibling) or sequence them.

compare.ts


/files/* boundary check is fragile

Before: N/A.
After: if (!file.startsWith(\${resolvedRunsRoot}/`)) return notFound();`

Two issues with one line. The hard-coded / makes this guard reject every path on Windows, where path.resolve returns backslash-separated strings — so /files/* is unusable there. Separately, if anything inside .runs/ ever becomes a symlink (workspace node_modules from bun install regularly contains them), resolve won't follow it and the check passes for a path whose realpath is outside the runs root. The same path.relative(root, file) idiom already used in shared.tsx's toFileUrl handles both cases.

files.ts · shared.tsx


Bun.serve has no error handler

Before: N/A.
After: Any route handler throw bubbles to Bun's default 500 with no log line.

decodeURIComponent(pathname.replace(/^\/files\//, '')) (routes.ts:65) throws URIError on any malformed percent-encoded request; renderScenario awaits getSkillDiffState() which throws when git is absent or returns non-zero; loadComparison / loadRun read absolute paths stored in manifest.json / comparison.json, so a relocated .runs/ (different checkout, different machine) throws straight out of the handler. None of these have a try { … } catch and Bun.serve is configured without error() (server.tsx:6-11), so the developer gets opaque 500s with no console output. Adding error(err) { console.error(err); return new Response('Internal Server Error', {status: 500}); } and a try/catch around the git call in renderScenario covers the realistic failure modes.

server.tsx · routes.ts · scenario.tsx

Pullfrog  | Fix all ➔Fix 👍s ➔View workflow run | Using Claude Opus𝕏

Comment thread packages/skills-evals/src/app/files.ts Outdated
const file = resolve(runsRoot, relativePath);
const resolvedRunsRoot = resolve(runsRoot);

if (!file.startsWith(`${resolvedRunsRoot}/`)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coded / makes this guard reject every legitimate path on Windows where path.resolve returns backslash-separated strings, and resolve won't follow symlinks if anything inside .runs/ ever ends up symlinked (workspace node_modules from bun install regularly contains them). Same path.relative(root, file) idiom already used in shared.tsx#toFileUrl handles both cases.

Suggested change
if (!file.startsWith(`${resolvedRunsRoot}/`)) {
const relativeToRoot = relative(resolvedRunsRoot, file);
if (relativeToRoot.startsWith('..') || isAbsolute(relativeToRoot)) {
return notFound();
}

? setTimeout(() => {
subprocess.kill();
}, timeoutMs)
: null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subprocess.kill() only SIGTERMs the direct child, so a 20-minute Pi timeout leaves the provider client, bun install, and sh -c 'git archive | tar -x' descendants running after this resolves — they keep burning credits and hold file descriptors on .runs/. Also nothing tells the caller a timeout fired: runPi throws Pi failed (143) and the dashboard cannot distinguish a hung run from a real crash. Spawn with a new process group and on timeout signal -pgid, escalate to SIGKILL after a grace period, and surface a timedOut flag on CommandResult.

const [beforeResult, afterResult] = await Promise.allSettled([
runSnapshot('before'),
runSnapshot('after'),
]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Promise.allSettled waits for both snapshots even after one rejects, so a fast failure on one side leaves the surviving 20-minute Pi run to finish before this function re-throws — wasted compute and model budget every time. Plumb an AbortSignal through runCommand and cancel the sibling on first rejection (or run them sequentially).

job.message = 'Run complete.';
job.resultUrl = `/runs/${encodeURIComponent(
result.manifest.id,
)}/${encodeURIComponent(basename(result.manifest.runDir))}`;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result.manifest.id is the raw scenario id, but result.manifest.runDir was built with sanitizePathPart(input.id) — for any scenario id that isn't a fixed point of the sanitizer (uppercase, spaces, > 80 chars), /runs/<rawId>/<runDir> will 404 because the route handler joins the raw id back into runsRoot. The seed animated-bar-chart happens to work, so this is latent until a new scenario is added. Persist the sanitized id on the manifest (or apply sanitizePathPart here) and use it consistently when building hrefs and reading by id (e.g. run.tsx:9, scenario.tsx:57).

Comment thread packages/skills-evals/src/app/routes.ts Outdated

'/files/*': (request: Request) => {
const {pathname} = new URL(request.url);
const relativePath = decodeURIComponent(pathname.replace(/^\/files\//, ''));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decodeURIComponent throws URIError on malformed percent-encoded paths (e.g. /files/%E0%A4%A), which propagates to Bun's default error handling because Bun.serve is configured without an error() callback. Wrap in try { … } catch { return notFound(); } here, and add an error handler to the Bun.serve config in server.tsx so unexpected throws (stale absolute paths in manifests, missing git, etc.) at least log the request that produced them.

Co-authored-by: Cursor <cursoragent@cursor.com>
@vercel vercel Bot temporarily deployed to Preview – remotion May 14, 2026 12:15 Inactive
@samohovets samohovets merged commit 56eae11 into main May 14, 2026
18 checks passed
@samohovets samohovets deleted the skills-evals-package branch May 14, 2026 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants