feat(sandbox): port buildOrgSnapshotWorkflow + Vercel Workflow runtime #531
Conversation
Closes the biggest remaining cutover gap: when a recoupable org repo is requested via POST /api/sandbox and no snapshot exists yet, fire a durable background workflow that builds one. The current request still pays the slow full-clone path; what's protected is that *future* sessions for the same org warm-boot from the new snapshot in seconds.

This is the first Vercel Workflow integration in api. Per the direction of using Vercel Workflow as the workflow runner going forward (deprecating Trigger.dev over time), this PR also lands the infra: the `workflow` package, the `withWorkflow(nextConfig)` wrapper, and the `app/workflows/` convention.

Files:
- `next.config.ts` — wraps the config with `withWorkflow` from `workflow/next` so workflow files (with `"use workflow"` / `"use step"` directives) are deployable
- `app/workflows/buildOrgSnapshot.ts` — preserves the open-agents workflow code structure exactly. One step that calls api's existing `refreshBaseSnapshot` (inlined via PR #507) to provision a sandbox named `<orgRepoName>`, run `git clone --depth=1 <cloneUrl>`, and snapshot the result
- `lib/sandbox/kickBuildOrgSnapshotWorkflow.ts` — fire-and-forget invocation via `start(buildOrgSnapshotWorkflow, [input])` from `workflow/api`. Failures are logged but never surface in the request — the request always falls back to the slow full-clone path
- `lib/sandbox/defaultBaseSnapshotId.ts` — `DEFAULT_SANDBOX_BASE_SNAPSHOT_ID` constant, env-overridable. Used by the build workflow as the base image (bun + jq + agent-browser + chromium + code-server) so the workflow only does the clone, not the runtime install
- `lib/sandbox/createSandboxHandler.ts` — wired in: when `extractOrgRepoName` returns non-null and `findOrgSnapshot` misses, kick the workflow

TDD red -> green:
- 3 new createSandboxHandler tests:
  - kicks the workflow on recoupable miss with the right input
  - does NOT kick when an existing snapshot is found (hit case)
  - does NOT kick for non-recoupable repos
- All existing tests pass (the mock for kickBuildOrgSnapshotWorkflow is a no-op so unchanged tests don't see the kick)
- Suite: 2574 -> 2577 (+3). pnpm lint:check + tsc --noEmit clean.

Phase 2 (sandboxLifecycleWorkflow + evaluateSandboxLifecycle) follows in a separate PR — that one's much bigger and ports the stateful auto-pause / status-check-overdue logic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
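The fire-and-forget contract described above — failures logged, never surfaced to the request — can be sketched as follows. This is a minimal sketch, not the PR's implementation: `startWorkflow` stands in for `start(buildOrgSnapshotWorkflow, [input])` from `workflow/api`, and the injected `log` parameter and `StartFn` type are illustrative.

```typescript
// Minimal sketch of the fire-and-forget kick pattern (illustrative names).
interface BuildOrgSnapshotInput {
  cloneUrl: string;
  sandboxName: string;
}

type StartFn = (input: BuildOrgSnapshotInput) => Promise<{ runId: string }>;

function kickBuildOrgSnapshotWorkflow(
  input: BuildOrgSnapshotInput,
  startWorkflow: StartFn,
  log: (msg: string) => void,
): Promise<void> {
  // The returned promise never rejects: failures are logged and swallowed,
  // so the request path can `void` this call and always fall back to the
  // slow full-clone path regardless of the kick's outcome.
  return startWorkflow(input)
    .then((run) => log(`kicked build for '${input.sandboxName}' (run ${run.runId})`))
    .catch((error) => log(`kick failed for '${input.sandboxName}': ${String(error)}`));
}
```

At the call site the handler would simply `void kickBuildOrgSnapshotWorkflow(...)` — the request never awaits or observes the result.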
📝 Walkthrough

This PR introduces a Vercel durable workflow system for building and caching organization repository snapshots. When a sandbox is created for an organization repo without a cached snapshot, the handler asynchronously initiates a background workflow to clone the repo into a base sandbox and save the result, enabling faster provisioning on subsequent accesses.

Changes: Organization Snapshot Build Workflow
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
⚠️ Warning — review ran into problems: timed out fetching pipeline failures after 30000ms.
4 issues found across 8 files
Confidence score: 2/5
- There is a high-confidence security risk in `app/workflows/buildOrgSnapshot.ts`: interpolating `cloneUrl` directly into a shell command can enable command injection, which is a merge-blocking concern unless properly quoted/escaped.
- `lib/sandbox/kickBuildOrgSnapshotWorkflow.ts` logs the full `cloneUrl`, which can expose embedded credentials/tokens and creates meaningful security and compliance risk.
- `lib/sandbox/createSandboxHandler.ts` may trigger duplicate snapshot builds for the same org when a prior build is still running, introducing concrete regression risk (extra load, conflicting workflow state).
- Pay close attention to `app/workflows/buildOrgSnapshot.ts`, `lib/sandbox/kickBuildOrgSnapshotWorkflow.ts`, and `lib/sandbox/createSandboxHandler.ts` — command execution safety, secret leakage in logs, and duplicate workflow triggering.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="app/workflows/buildOrgSnapshot.ts">
<violation number="1" location="app/workflows/buildOrgSnapshot.ts:31">
P0: Avoid interpolating `cloneUrl` directly into a shell command; quote/escape it before execution to prevent command injection.</violation>
<violation number="2" location="app/workflows/buildOrgSnapshot.ts:38">
P2: Custom agent: **Module should export a single primary function whose name matches the filename**
Primary export name must match the filename basename; `buildOrgSnapshotWorkflow` does not match `buildOrgSnapshot`.</violation>
</file>
<file name="lib/sandbox/kickBuildOrgSnapshotWorkflow.ts">
<violation number="1" location="lib/sandbox/kickBuildOrgSnapshotWorkflow.ts:27">
P1: Avoid logging the full `cloneUrl`; it can leak embedded repository credentials or tokens.</violation>
</file>
<file name="lib/sandbox/createSandboxHandler.ts">
<violation number="1" location="lib/sandbox/createSandboxHandler.ts:75">
P2: The new snapshot-miss branch can repeatedly start duplicate build workflows for the same org while an earlier build is still in progress.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Client as External Client
participant Handler as createSandboxHandler
participant Extract as extractOrgRepoName
participant Snap as findOrgSnapshot
participant Kick as kickBuildOrgSnapshotWorkflow
participant Workflow as Vercel Workflow Runtime
participant WFCode as buildOrgSnapshotWorkflow
participant Step as buildSnapshotStep
participant Refresh as refreshBaseSnapshot
participant Tokens as getServiceGithubToken
participant SandboxAPI as Sandbox Service (Vercel)
participant FutureClient as Future Client (same org)
Note over Client,FutureClient: POST /api/sandbox - Recoupable Org Snapshot Miss
Client->>Handler: POST /api/sandbox
Handler->>Handler: Auth + ownership (unchanged)
Handler->>Extract: extractOrgRepoName(repoUrl)
Extract-->>Handler: orgRepoName or null
alt Org repo detected
Handler->>Snap: findOrgSnapshot(orgRepoName)
alt No snapshot exists (miss)
Snap-->>Handler: null
Note over Handler,FutureClient: NEW: Fire-and-forget background workflow
Handler->>Kick: kickBuildOrgSnapshotWorkflow(cloneUrl, sandboxName)
Kick->>Kick: fire-and-forget (void promise)
alt Workflow starts successfully
Kick->>Workflow: start(buildOrgSnapshotWorkflow, [input])
Workflow-->>Kick: run { runId }
Kick->>Kick: log success
else Workflow fails to start
Workflow-->>Kick: error
Kick->>Kick: log error (never surface to client)
end
par Background workflow execution (outside request lifecycle)
Workflow->>WFCode: execute buildOrgSnapshotWorkflow(input)
WFCode->>WFCode: use workflow
WFCode->>Step: buildSnapshotStep(input)
Step->>Step: use step
Step->>Tokens: getServiceGithubToken()
alt Token available
Tokens-->>Step: githubToken
Step->>Refresh: refreshBaseSnapshot(...)
Refresh->>SandboxAPI: provision sandbox from base snapshot
Refresh->>SandboxAPI: execute git clone
Refresh->>SandboxAPI: create snapshot
SandboxAPI-->>Refresh: snapshotId
Refresh-->>Step: result { snapshotId }
Step-->>WFCode: snapshotId
WFCode-->>Workflow: return { success: true, snapshotId }
else Token missing
Tokens-->>Step: undefined
Step->>Step: throw Error
Step-->>WFCode: error caught
WFCode-->>Workflow: return { success: false, error }
end
end
else Snapshot exists (hit)
Snap-->>Handler: snapshotId
Note over Handler: NEW: Do NOT kick workflow
end
end
Note over Handler,FutureClient: Request continues with full-clone path (unchanged)
Handler->>Handler: connectSandbox (cold-start full clone)
Handler-->>Client: 200 OK
Note over FutureClient,FutureClient: Future requests for same org
FutureClient->>Handler: POST /api/sandbox (same org)
Handler->>Snap: findOrgSnapshot(orgRepoName)
Snap-->>Handler: snapshotId (created by workflow)
Handler->>Handler: warm-boot from snapshot (seconds)
Note over FutureClient,FutureClient: Dramatically reduced cold-start
```ts
  // the *next* session warm-boots from it. This request still pays the
  // full-clone cold-start path — the workflow runs durably outside the
  // request lifecycle.
  if (orgRepoName && !orgSnapshotId) {
```
P2: The new snapshot-miss branch can repeatedly start duplicate build workflows for the same org while an earlier build is still in progress.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/sandbox/createSandboxHandler.ts, line 75:
<comment>The new snapshot-miss branch can repeatedly start duplicate build workflows for the same org while an earlier build is still in progress.</comment>
<file context>
@@ -67,6 +68,17 @@ export async function createSandboxHandler(request: NextRequest): Promise<NextRe
+ // the *next* session warm-boots from it. This request still pays the
+ // full-clone cold-start path — the workflow runs durably outside the
+ // request lifecycle.
+ if (orgRepoName && !orgSnapshotId) {
+ kickBuildOrgSnapshotWorkflow({
+ cloneUrl: body.repoUrl,
</file context>
```ts
  sandboxName: string;
}

async function buildSnapshotStep(input: BuildOrgSnapshotInput): Promise<string> {
```
SRP - new lib file for buildSnapshotStep.ts
```ts
  return result.snapshotId;
}

export async function buildOrgSnapshotWorkflow(input: BuildOrgSnapshotInput) {
```
SRP - new lib file for buildOrgSnapshotWorkflow.ts
The previous hardcoded id (snap_EjsphVxi07bFKrfojljJdIS41KHT) lived in open-agents' Vercel team and returned 404 from api's project, breaking the build-org-snapshot workflow on its first step.

Built a recoup-team equivalent snapshot manually via the @vercel/sandbox SDK by provisioning a clean Amazon Linux 2023 sandbox and running:
- sudo dnf install -y jq
- curl -fsSL https://bun.sh/install | sudo BUN_INSTALL=/usr/local bash
- sudo npm install -g agent-browser
- curl -fsSL https://code-server.dev/install.sh | sudo sh

Then snapshotted via `vercel sandbox snapshot <id> --stop`. Result: snap_RgVtpDO4y1BJHQiUbptMwS3Rt2EQ.

Tooling note (in the file comment): chromium is intentionally NOT in this base. Amazon Linux 2023's default repo doesn't carry it, and agent-browser fetches a managed Playwright browser on first use, so having it in the base would just bloat the image.

The header comment now also documents the install commands and refresh procedure so the next person who needs to add tooling has a clear path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end workflow loop verified

You spotted that the first run failed in the dashboard. Caught the root cause via runtime logs (`vercel logs --query "build-org-snapshot" --expand`): that snapshot id lives in open-agents' Vercel team, not recoup's, so the create-from-base call returned 404 and the workflow retried 3× then bubbled.

Fix: built a recoup-team base snapshot

Did this manually via the @vercel/sandbox SDK because we couldn't access open-agents' base directly. Probed the runtime first to determine the OS, then provisioned a clean sandbox and ran the install commands. `vercel sandbox snapshot --stop` → `snap_RgVtpDO4y1BJHQiUbptMwS3Rt2EQ`. Pushed `60265ee5` to point `defaultBaseSnapshotId.ts` at it. The comment in the file documents the install commands + refresh procedure for the next person who needs to add tooling. (Chromium intentionally NOT in this base — Amazon Linux 2023's default repo doesn't carry it, and `agent-browser` fetches a managed Playwright browser on first use anyway.)

Re-test — full loop succeeded

1) Cold POST → kick → workflow ran end-to-end in ~21s. `vercel sandbox snapshots list` confirms: `snap_g6FiWyWxIA0g1psgHvLqY3MJiPhH` (697 MB, status `created`).
2) Subsequent POST → hit.

What's now runtime-proven: the Phase 1 cutover gap (org snapshot building for new orgs) is closed end-to-end.
Three review fixes bundled:
1) SRP per sweetmantech: split app/workflows/buildOrgSnapshot.ts into
- app/workflows/buildSnapshotStep.ts (the "use step" function)
- app/workflows/buildOrgSnapshotWorkflow.ts (the "use workflow"
function — filename now matches the export, addressing the
cubic P2 about filename mismatch)
2) P0 command injection (cubic): wrap cloneUrl with shellEscape() in
the `git clone --depth=1 ...` command. The Zod + parseGitHubRepoUrl
validators upstream of this workflow already reject anything that
doesn't match `^https:\/\/github\.com\/recoupable\/...`, so this
isn't exploitable today — but shell-escaping is defense-in-depth
that survives validator changes. Zero behavior change for valid
inputs (extra single quotes around an already-clean URL).
3) P1 credential leak (cubic): dropped `cloneUrl` from every log line
in the workflow + kick. The `https://user:token@github.com/...`
shape is also blocked by the regex validator, but log scrubbing
is the cheaper of the two layers if the validator ever loosens.
`sandboxName` (the regex-extracted repo name only) remains in logs
— uniquely identifies the org for observability without
exposing tokens.
Out of scope: cubic's P2 about duplicate workflow runs (same
fire-and-forget pattern as open-agents — duplicates are wasteful but
not incorrect; can add a Vercel Workflow idempotency key in a
follow-up if observed in practice).
Tests: existing handler tests still pass unchanged (the kick is mocked
at the lib path, which didn't change). Suite remains 2577 / 2577.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
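Fix (2)'s escaping follows the standard POSIX single-quote pattern: wrap in single quotes and replace each embedded quote with `'\''`. A minimal sketch — the function name matches the commit's `shellEscape`, but this body is an assumption; the PR's actual helper may differ:

```typescript
// POSIX single-quote escaping: a single-quoted string in sh has no
// metacharacters, so the only character to handle is the quote itself,
// replaced with '\'' (close quote, escaped quote, reopen quote).
// Illustrative sketch — the PR's actual shellEscape helper may differ.
function shellEscape(arg: string): string {
  return `'${arg.replace(/'/g, `'\\''`)}'`;
}
```

For a clean URL this only adds surrounding quotes (the "zero behavior change for valid inputs" noted above), while a hostile value like `https://x; rm -rf /` stays a single inert argument to `git clone`.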
|
Pushed `27cce760` addressing 3 of the 4 actionable comments.
Skipped cubic's P2 about duplicate workflow runs: same fire-and-forget pattern as open-agents — duplicates waste compute but aren't incorrect. Worth a Vercel Workflow idempotency key in a follow-up if it's observed in practice. Tests + lint + tsc all green. Suite stays at 2577 (no test changes — the existing handler test mocks the kick at `@/lib/sandbox/kickBuildOrgSnapshotWorkflow`, which didn't change path).
Smoke test re-run after `27cce760` (with new org URL)

Used the URL you suggested: `https://github.com/recoupable/org-myco-wtf-80263819-9dfd-4bbf-9371-60a6185122d6`. Full miss → workflow → hit loop visible in the logs in correct chronological order:

```
16:13:27 [step:start] name='org-myco-wtf-...'
16:13:48 Built snap_5w67z2k52zns9aocc7JjuLmItm46 for 'org-myco-wtf-...'
16:16:20 POST → [findOrgSnapshot] 'org-myco-wtf-...' → hit snap_5w67z2k52zns9aocc7JjuLmItm46 (1 total snapshots returned)
```

Both review fixes from `27cce760` are visible in the runtime evidence.
Negative paths
Honest observation about readyMs
The hit case was ~6s slower than the cold path. Restoring a 727MB snapshot has overhead that exceeds the time saved on a tiny repo's clone. The infrastructure works correctly end-to-end. The performance win will appear once the workflow is extended to include a post-clone install step (e.g., …).
Actionable comments posted: 2
🧹 Nitpick comments (2)
app/workflows/buildSnapshotStep.ts (1)
28-49: ⚡ Quick win: Trim `buildSnapshotStep` into smaller helpers for readability.

It's just over the function-length threshold; extracting token resolution and refresh payload construction would keep the step tighter and easier to scan.

As per coding guidelines, "Flag functions longer than 20 lines or classes with >200 lines" and "Keep functions small and focused".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/workflows/buildSnapshotStep.ts` around lines 28 - 49, The buildSnapshotStep function is too long; extract the token resolution and the refreshBaseSnapshot payload construction into small helpers: create a getGithubTokenOrThrow() that encapsulates getServiceGithubToken() and throws the existing error if not present, and create a buildRefreshPayload(input) (or buildRefreshOptions) that returns the object passed into refreshBaseSnapshot (baseSnapshotId, sandboxName, sandboxTimeoutMs, commandTimeoutMs, githubToken, commands, log) using shellEscape(input.cloneUrl) for the clone command; then simplify buildSnapshotStep to call those two helpers and await refreshBaseSnapshot(payload). Ensure helper names (getGithubTokenOrThrow, buildRefreshPayload) match references in buildSnapshotStep.

lib/sandbox/createSandboxHandler.ts (1)
33-157: 🏗️ Heavy lift: Split `createSandboxHandler` into smaller focused units.

This handler is carrying too many responsibilities in one function. Please extract at least auth/session resolution, snapshot warmup decision, and provisioning/update paths into helpers.

As per coding guidelines, "Flag functions longer than 20 lines or classes with >200 lines", "Keep functions small and focused", and for `lib/**/*.ts`: "Keep functions under 50 lines" and "Apply Single Responsibility Principle (SRP): one exported function per file; each file should do one thing well".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/sandbox/createSandboxHandler.ts` around lines 33 - 157, The createSandboxHandler is too large and does multiple responsibilities; split it into focused helpers: extractSessionAndAuth(request) which wraps validateCreateSandboxBody and selectSessions + permission checks (use symbols validateCreateSandboxBody, selectSessions, sessionRow), determineOrgSnapshot(repoUrl) which encapsulates extractOrgRepoName, findOrgSnapshot and kickBuildOrgSnapshotWorkflow (use orgRepoName, findOrgSnapshot, kickBuildOrgSnapshotWorkflow), and provisionAndUpdateSandbox(params) which handles connectSandbox, updateSession and installSessionGlobalSkills (use connectSandbox, updateSession, installSessionGlobalSkills) and returns the final response payload; then have createSandboxHandler orchestrate these smaller functions — keep each helper under ~50 lines and export only createSandboxHandler per file to follow SRP.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/workflows/buildOrgSnapshotWorkflow.ts`:
- Around line 20-28: The current try/catch in buildOrgSnapshotWorkflow swallows
errors and returns { success: false } which prevents Vercel Workflow from
marking the run as failed; instead, after logging the error (preserve the
existing console.error call that formats message for input.sandboxName), rethrow
the original error (or throw a new Error with the message) so the workflow run
fails and can be retried; update the catch block around buildSnapshotStep so it
no longer returns { success: false } but logs then throws the error (or remove
the catch entirely to let the error bubble).
In `@lib/sandbox/createSandboxHandler.ts`:
- Around line 75-80: The current call to kickBuildOrgSnapshotWorkflow can start
duplicate workflows when parallel requests race; update
kickBuildOrgSnapshotWorkflow (and its internal call to start) to pass a
deterministic runId based on the org identifier (e.g., use `orgRepoName`) so
identical requests reuse the same workflow run; specifically, modify the
start(...) invocation inside kickBuildOrgSnapshotWorkflow to include an options
object with runId set to a stable string like `org-snapshot-${orgRepoName}`
derived from the same org identifier used in the if block, ensuring idempotent
workflow starts.
---
Nitpick comments:
In `@app/workflows/buildSnapshotStep.ts`:
- Around line 28-49: The buildSnapshotStep function is too long; extract the
token resolution and the refreshBaseSnapshot payload construction into small
helpers: create a getGithubTokenOrThrow() that encapsulates
getServiceGithubToken() and throws the existing error if not present, and create
a buildRefreshPayload(input) (or buildRefreshOptions) that returns the object
passed into refreshBaseSnapshot (baseSnapshotId, sandboxName, sandboxTimeoutMs,
commandTimeoutMs, githubToken, commands, log) using shellEscape(input.cloneUrl)
for the clone command; then simplify buildSnapshotStep to call those two helpers
and await refreshBaseSnapshot(payload). Ensure helper names
(getGithubTokenOrThrow, buildRefreshPayload) match references in
buildSnapshotStep.
In `@lib/sandbox/createSandboxHandler.ts`:
- Around line 33-157: The createSandboxHandler is too large and does multiple
responsibilities; split it into focused helpers: extractSessionAndAuth(request)
which wraps validateCreateSandboxBody and selectSessions + permission checks
(use symbols validateCreateSandboxBody, selectSessions, sessionRow),
determineOrgSnapshot(repoUrl) which encapsulates extractOrgRepoName,
findOrgSnapshot and kickBuildOrgSnapshotWorkflow (use orgRepoName,
findOrgSnapshot, kickBuildOrgSnapshotWorkflow), and
provisionAndUpdateSandbox(params) which handles connectSandbox, updateSession
and installSessionGlobalSkills (use connectSandbox, updateSession,
installSessionGlobalSkills) and returns the final response payload; then have
createSandboxHandler orchestrate these smaller functions — keep each helper
under ~50 lines and export only createSandboxHandler per file to follow SRP.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cd168a8d-c3d5-422f-a893-de6e8819b05d
⛔ Files ignored due to path filters (4)
- `lib/sandbox/__tests__/createSandboxHandler.test.ts` is excluded by `!**/*.test.*`, `!**/__tests__/**` and included by `lib/**`
- `next.config.ts` is excluded by none and included by none
- `package.json` is excluded by none and included by none
- `pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml` and included by none
📒 Files selected for processing (5)
- `app/workflows/buildOrgSnapshotWorkflow.ts`
- `app/workflows/buildSnapshotStep.ts`
- `lib/sandbox/createSandboxHandler.ts`
- `lib/sandbox/defaultBaseSnapshotId.ts`
- `lib/sandbox/kickBuildOrgSnapshotWorkflow.ts`
```ts
  try {
    const snapshotId = await buildSnapshotStep(input);
    console.log(`[build-org-snapshot] Built ${snapshotId} for '${input.sandboxName}'`);
    return { success: true as const, snapshotId };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`[build-org-snapshot] Failed for '${input.sandboxName}':`, message);
    return { success: false as const, error: message };
  }
```
🧩 Analysis chain
🌐 Web query:
In Vercel Workflow (workflow@4.2.0-beta.72), if a workflow catches an exception and returns a value instead of throwing, is the run considered failed and eligible for automatic retries?
💡 Result:
No. In Vercel Workflow, if you catch an exception and return a value instead of throwing a failure (i.e., you don’t let the step/workflow end in an error), the run/step is considered successful rather than failed, so it is not eligible for the automatic retry mechanism. What counts as retryable is tied to failures being surfaced as errors: by default, “errors thrown inside steps are retried” [1], and you can explicitly control retrying by throwing RetryableError (or stop retrying by throwing FatalError) [1][2]. If you catch the exception and simply return a value, you’re preventing the thrown error from occurring, so there’s no failure state to trigger retries. Notes: - If a step throws until it exhausts retries, the run is marked failed (and you can inspect the run status as “failed”) [3]. - Separately, retries may also occur due to certain infrastructure/runtime issues, but those still originate from an error path rather than a caught-and-returned success value [4].
Citations:
- 1: https://mintlify.com/vercel/workflow/concepts/errors-and-retries
- 2: https://vercel.com/academy/svelte-on-vercel/workflow-error-handling
- 3: https://mintlify.com/vercel/workflow/api-reference/workflow-api/get-workflow-state
- 4: https://github.com/vercel/workflow/blob/main/packages/core/src/runtime/step-handler.ts
Rethrow errors to preserve run failure state and enable automatic retries.
Catching errors and returning { success: false } prevents Vercel Workflow from marking the run as failed, which disables automatic retries and obscures true failure conditions. Log the error, then rethrow it to surface the failure properly.
💡 Proposed fix

```diff
   catch (error) {
     const message = error instanceof Error ? error.message : String(error);
     console.error(`[build-org-snapshot] Failed for '${input.sandboxName}':`, message);
-    return { success: false as const, error: message };
+    throw error instanceof Error ? error : new Error(message);
   }
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/workflows/buildOrgSnapshotWorkflow.ts` around lines 20 - 28, The current
try/catch in buildOrgSnapshotWorkflow swallows errors and returns { success:
false } which prevents Vercel Workflow from marking the run as failed; instead,
after logging the error (preserve the existing console.error call that formats
message for input.sandboxName), rethrow the original error (or throw a new Error
with the message) so the workflow run fails and can be retried; update the catch
block around buildSnapshotStep so it no longer returns { success: false } but
logs then throws the error (or remove the catch entirely to let the error
bubble).
```ts
if (orgRepoName && !orgSnapshotId) {
  kickBuildOrgSnapshotWorkflow({
    cloneUrl: body.repoUrl,
    sandboxName: orgRepoName,
  });
}
```
🧩 Analysis chain
🌐 Web query:
In workflow@4.2.0-beta.72, what is the recommended way to deduplicate start(...) calls for the same logical job (idempotency key / unique run key / concurrency controls)?
💡 Result:
In workflow@4.2.0-beta.72, the recommended way to deduplicate start(...) calls for the same logical job (idempotency key / unique run key) is to provide a custom runId in the StartOptions object. If a run with that runId already exists, start() returns the existing run instead of creating a duplicate.[1][2] Example: const run = await start(myWorkflow, args, { runId: `order-${orderId}` }); This ensures idempotency: repeated calls with the same runId return the same run object.[1][2] For concurrency controls, there is an ongoing discussion about a native lock primitive (distributed semaphore) to limit concurrent executions for the same key inside workflows, but it appears not yet implemented in beta.72 (no mention in release notes).[3][4] Use application-level checks (e.g., check run status before starting) or the custom runId for deduplication as primary mechanism.[5][6]
Citations:
- 1: Add support for idempotent workflow creation vercel/workflow#85
- 2: feat: add custom runId support for idempotent workflow creation vercel/workflow#86
- 3: Concurrency Controls vercel/workflow#301
- 4: https://github.com/vercel/workflow/releases/tag/workflow%404.2.0-beta.72
- 5: https://useworkflow.dev/docs/foundations/starting-workflows
- 6: https://useworkflow.dev/docs/api-reference/workflow-api/start
Use deterministic runId to prevent duplicate org snapshot workflow kicks.
When parallel requests miss the snapshot check simultaneously, both can trigger identical builds. Implement idempotency by passing a deterministic runId based on the org identifier to the start() call in kickBuildOrgSnapshotWorkflow. This ensures that concurrent identical requests reuse the same workflow run instead of spawning duplicates:
```ts
const run = await start(myWorkflow, args, {
  runId: `org-snapshot-${orgRepoName}`,
});
```
This prevents redundant clones, snapshot operations, and avoidable cost/rate pressure from concurrent requests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/sandbox/createSandboxHandler.ts` around lines 75 - 80, The current call
to kickBuildOrgSnapshotWorkflow can start duplicate workflows when parallel
requests race; update kickBuildOrgSnapshotWorkflow (and its internal call to
start) to pass a deterministic runId based on the org identifier (e.g., use
`orgRepoName`) so identical requests reuse the same workflow run; specifically,
modify the start(...) invocation inside kickBuildOrgSnapshotWorkflow to include
an options object with runId set to a stable string like
`org-snapshot-${orgRepoName}` derived from the same org identifier used in the
if block, ensuring idempotent workflow starts.
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="app/workflows/buildOrgSnapshotWorkflow.ts">
<violation number="1" location="app/workflows/buildOrgSnapshotWorkflow.ts:27">
P2: Do not swallow workflow errors; rethrow after logging so failed runs are recorded correctly.</violation>
</file>
```ts
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`[build-org-snapshot] Failed for '${input.sandboxName}':`, message);
    return { success: false as const, error: message };
```
P2: Do not swallow workflow errors; rethrow after logging so failed runs are recorded correctly.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At app/workflows/buildOrgSnapshotWorkflow.ts, line 27:
<comment>Do not swallow workflow errors; rethrow after logging so failed runs are recorded correctly.</comment>
<file context>
@@ -0,0 +1,29 @@
+ } catch (error) {
+ const message = error instanceof Error ? error.message : String(error);
+ console.error(`[build-org-snapshot] Failed for '${input.sandboxName}':`, message);
+ return { success: false as const, error: message };
+ }
+}
</file context>
* feat(sandbox): port sandbox lifecycle workflow + kicks (Phase 2) Closes the auto-pause and lifecycle-self-heal gaps from the cutover analysis. Sequel to PR #531 (build-org-snapshot workflow). The Vercel Workflow runtime + withWorkflow already shipped — this PR adds the second workflow on top. What it does: - POST /api/sandbox now kicks the sandbox-lifecycle workflow with reason="sandbox-created" after provisioning + skill install. The workflow sleeps until `hibernate_after` / `sandbox_expires_at` (whichever is sooner), then evaluates: pause if idle, retry on next iteration if a chat stream is active or the user extended the timing. - GET /api/sandbox/status now kicks with reason="status-check-overdue" when the session row says lifecycle_state="active" but the workflow's due-time has already passed (its lease has died). The kick reclaims the stale lease and starts a fresh workflow. Components ported faithfully from open-agents: Sandbox state predicates (one fn per file per api convention): - getPersistentSandboxName, getResumableSandboxName, hasResumableSandboxState, hasPausedSandboxState - canOperateOnSandbox, clearSandboxState Lifecycle types + config: - sandboxLifecycleTypes.ts: SandboxLifecycleState/Reason types - sandboxLifecycleConfig.ts: SANDBOX_INACTIVITY_TIMEOUT_MS (5min), SANDBOX_EXPIRES_BUFFER_MS (10s), SANDBOX_LIFECYCLE_MIN_SLEEP_MS (5s), SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS (2min) Lifecycle update builders + due-at: - buildLifecycleActivityUpdate, buildActiveLifecycleUpdate, buildHibernatedLifecycleUpdate - getLifecycleDueAtMs (the wake-time calculator) - getNextLifecycleVersion, getSandboxExpiresAtDate Active-stream guard (don't pause mid-chat): - hasActiveStreamForSession + Supabase helper selectChatsBySession Concurrency: - claimSessionLifecycleRunId — atomic UPDATE that fails when the expected current value doesn't match. Prevents two kicks from starting duplicate workflows. 
  The evaluator (the heart):
  - evaluateSandboxLifecycle (~110 lines) — single-pass FSM that decides skip / hibernate / fail. Mirrors open-agents' evaluator with the same skip reasons and self-heal pattern.

  The workflow:
  - app/workflows/sandboxLifecycleWorkflow.ts — `while(true)` loop with `sleep(new Date(wakeAtMs))` between iterations, lease-claim pattern, and step boundaries via "use step". Loops on not-due-yet / active-workflow defers, terminates on hibernated / failed / sandbox-gone.

  The kick:
  - kickSandboxLifecycleWorkflow.ts — fire-and-forget invocation with stale-run detection, lease claiming, and graceful handling when start() fails (clears the lease so a retry can succeed).

  TDD red -> green:
  - 2 new createSandboxHandler tests asserting the kick fires with the right reason on session-bound provisions, and is skipped without sessionId
  - All existing handler tests continue to pass with the new mocks
  - Suite: 2577 -> 2579 (+2 new tests). pnpm lint:check + tsc clean for new files.

  Out of scope (deliberately deferred):
  - Tests for the workflow itself + evaluator. The workflow primitives (`"use workflow"`, `"use step"`, `sleep(date)`) require a Vercel Workflow test harness we haven't introduced yet. The handler tests + open-agents parity give enough confidence; if the workflow misbehaves at runtime, the smoke test will catch it.
  - Reason-specific telemetry beyond what evaluateSandboxLifecycle already logs.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sandbox): adopt open-agents scheduleBackgroundWork pattern for lifecycle kick

  Smoke test caught: the lifecycle kick fired but its async chain (selectSessions → claim lease → start workflow) never completed because the serverless function tore down on response. No [kickSandboxLifecycleWorkflow] log lines, no workflow run started.
  The original PR placed `after()` from next/server inside the kick lib — which violated open-agents' architecture: the kick is a generic fan-out lib; the platform-specific scheduler belongs at the call site.

  Refactored to match open-agents' open / api conventions:
  - `kickSandboxLifecycleWorkflow` accepts an optional `scheduleBackgroundWork: (task: Promise<void>) => void` parameter. Default behavior is `void task` (matches the open-agents fallback); the scheduler is used when provided.
  - createSandboxHandler + getSandboxStatusHandler pass `scheduleBackgroundWork: task => after(() => task)`. Same pattern api already uses in `lib/agents/createPlatformRoutes.ts:62` and `app/api/chat/slack/route.ts:16` for waitUntil-style background work.

  Test updated to use `expect.objectContaining` since the kick now receives an extra `scheduleBackgroundWork` callback. Suite stays at 2579.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(sandbox): KISS deletions + SRP extractions per review feedback

  Addresses 17 review comments on PR #533. All KISS deletions and SRP extractions in one pass — no behavior change, just file/function reorganization to match api conventions (one exported function per file, filename matches the export, no thin wrappers).

  KISS deletions (3 files removed):
  - canOperateOnSandbox.ts → callers use hasRuntimeSandboxState directly (was a one-line wrapper around it)
  - getNextLifecycleVersion.ts → was unused; existing inline calc kept
  - hasResumableSandboxState.ts → callers use `getResumableSandboxName(state) !== null` (one-line wrapper)

  KISS rename:
  - selectChatsBySession.ts → selectChats.ts with a generic `{ id?, sessionId? }` filter parameter, mirroring selectSessions

  SRP split — Supabase queries:
  - claimSessionLifecycleRunId.ts split into:
    - claimSessionLifecycleRunIdIfNull.ts (initial claim)
    - claimSessionLifecycleRunIdIfMatch.ts (lease refresh)
    - claimSessionLifecycleRunId.ts (combiner — picks the right one)

  SRP extractions — kick logic:
  - runKick.ts (the async chain)
  - reclaimStaleLease.ts
  - shouldStartLifecycle.ts
  - isLifecycleRunStale.ts
  - createLifecycleRunId.ts
  - kickSandboxLifecycleWorkflow.ts now thin: just runKick + scheduler

  SRP extractions — due-at math:
  - computeInactivityDueAtMs.ts
  - computeExpiryDueAtMs.ts
  - getLifecycleDueAtMs.ts now thin: composes the two

  SRP extractions — evaluator helpers:
  - restoreActiveLifecycleState.ts
  - wasLifecycleTimingExtended.ts

  SRP extractions — workflow steps (each preserves "use step"):
  - computeLifecycleWakeDecision.ts
  - runLifecycleEvaluation.ts
  - clearLifecycleRunIdIfOwned.ts
  - sandboxLifecycleWorkflow.ts now thin: just composes the steps with the while-sleep-evaluate loop

  Net delta: 16 new files, 4 deletions, no behavior change. Tests: 2579 / 2579 still pass. pnpm lint:check + tsc clean for new/changed files (pre-existing TS errors in processCreateSandbox / updateSnapshotPatchHandler unrelated).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sandbox): actually delegate getLifecycleDueAtMs to the extracted helpers

  In the previous SRP refactor I created computeInactivityDueAtMs.ts and computeExpiryDueAtMs.ts, but the Write to update getLifecycleDueAtMs.ts errored on "File has not been read yet" and I missed it in the batch output. The two helper files were left as orphans while the original file kept its inline copies — visible by the SANDBOX_*_MS imports + both functions still inlined. Now getLifecycleDueAtMs properly imports computeInactivityDueAtMs / computeExpiryDueAtMs, and the parent file just composes them. No behavior change.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sandbox): KISS lifecycle claim — combiner under lib/sessions, IfMatch via updateSession

  Two review fixes:
  1) Move `claimSessionLifecycleRunId.ts` from `lib/supabase/sessions/` to `lib/sessions/`. The combiner doesn't directly query Supabase — it composes two underlying helpers, so per api convention it belongs alongside other domain composers, not under the Supabase namespace.
  2) KISS: delete `claimSessionLifecycleRunIdIfMatch.ts` and use the existing `updateSession` helper for the lease-refresh path. The refresh writes the same `lifecycle_run_id` value back, so an unconditional `updateSession({ lifecycle_run_id: runId })` does the work — this accepts a small race where a concurrent stale-reclaim could be overwritten, but the kick path (which DOES use the atomic `claimSessionLifecycleRunIdIfNull`) remains the primary concurrency guard.

  Two callers (runKick, computeLifecycleWakeDecision) updated to import from the new path. No behavior change at the workflow level. Tests: 2579 / 2579 still pass. lint + tsc clean.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sandbox): KISS lifecycle claim — drop IfNull helper, use updateSession

  Per latest review: the `IfNull` Supabase helper is also just an `updateSession` with a conditional WHERE — KISS it the same way the `IfMatch` helper was simplified.

  Changes:
  - Delete `lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts`.
  - Delete `lib/sessions/claimSessionLifecycleRunId.ts` (combiner — no longer needed once both branches collapse to plain `updateSession`).
  - `runKick`: write the new lease via `updateSession(id, { lifecycle_run_id: runId })`. The atomic `WHERE lifecycle_run_id IS NULL` guard is gone; `shouldStartLifecycle` already filters out rows that already hold a lease — best-effort concurrency, accepting the small race window.
  - `computeLifecycleWakeDecision`: replace the IfMatch lease re-claim with a SELECT-based ownership check using the row already read in the same step. If `lifecycle_run_id` was overwritten by a concurrent kick, return `run-replaced` and bail.

  Tests 2579 / 2579 still pass. Lint clean. The pre-existing tsc errors in unrelated test files (orgId on ApiKeyDetails, SlackApiResponse messages) are not introduced here.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
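The while-sleep-evaluate shape those commits describe can be sketched in isolation. This is a hypothetical reduction, not the ported file: the Vercel Workflow primitives (`sleep`, `"use step"`) and the Supabase-backed evaluator are replaced by injected stand-ins so the loop's terminate/defer logic is visible on its own.

```typescript
// Minimal sketch of the lifecycle loop. `evaluate` and `sleepUntil` are
// injected stand-ins; the real workflow uses Vercel Workflow's durable
// sleep and wraps each iteration in a "use step" boundary.
type Evaluation =
  | { kind: "defer"; wakeAtMs: number } // not due yet / active stream: loop again
  | { kind: "hibernated" }              // sandbox paused: terminate
  | { kind: "failed"; reason: string }; // sandbox gone / error: terminate

async function runLifecycleLoop(
  evaluate: () => Promise<Evaluation>,
  sleepUntil: (wakeAt: Date) => Promise<void>,
): Promise<Evaluation> {
  while (true) {
    const result = await evaluate();
    if (result.kind !== "defer") return result; // hibernated / failed ends the run
    await sleepUntil(new Date(result.wakeAtMs)); // durable sleep between iterations
  }
}
```

In the ported workflow the loop body additionally checks lease ownership and bails with `run-replaced` when `lifecycle_run_id` no longer matches.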
Summary
Closes the biggest remaining cutover gap from the gap analysis: when a recoupable org repo is requested but no snapshot exists yet, fire a durable background workflow that builds one. The current request still pays the slow full-clone path; what's protected is that future sessions for the same org warm-boot from the new snapshot in seconds.
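The fire-and-forget contract (failures logged, never surfaced to the request) can be sketched as follows. This is an illustrative reduction: `startWorkflow` stands in for `start(buildOrgSnapshotWorkflow, [input])` from `workflow/api`, and the input shape is assumed from the PR description.

```typescript
// Fire-and-forget sketch: attempt to start the workflow, swallow and log any
// failure so the request handler never sees it and always falls back to the
// slow full-clone path.
type BuildOrgSnapshotInput = { orgRepoName: string; cloneUrl: string };

async function kickBuildOrgSnapshotWorkflow(
  input: BuildOrgSnapshotInput,
  startWorkflow: (input: BuildOrgSnapshotInput) => Promise<void>,
  log: (msg: string) => void = console.error,
): Promise<void> {
  try {
    await startWorkflow(input);
  } catch (err) {
    // Never rethrow: a failed kick only costs the warm-boot optimization.
    log(`[kickBuildOrgSnapshotWorkflow] failed: ${err instanceof Error ? err.message : String(err)}`);
  }
}
```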
This is the first Vercel Workflow integration in api. Per the direction of using Vercel Workflow as the runner going forward (deprecating Trigger.dev over time), this PR lands the infra alongside the first workflow.
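Per the file list, the `next.config.ts` wiring is small. A minimal sketch (the surrounding config options are placeholders; only the `withWorkflow` wrapper from `workflow/next` is from the PR):

```typescript
// next.config.ts — sketch: withWorkflow makes files carrying the
// "use workflow" / "use step" directives deployable as durable workflows.
import type { NextConfig } from "next";
import { withWorkflow } from "workflow/next";

const nextConfig: NextConfig = {
  // existing app config goes here unchanged
};

export default withWorkflow(nextConfig);
```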
What's new
- `next.config.ts` — wrapped with `withWorkflow` from `workflow/next` so workflow files (with `"use workflow"` / `"use step"` directives) are deployable
- `app/workflows/buildOrgSnapshot.ts` — one step that calls api's existing `refreshBaseSnapshot` to provision a sandbox named `<orgRepoName>`, run `git clone --depth=1 <cloneUrl>`, and snapshot the result
- `lib/sandbox/kickBuildOrgSnapshotWorkflow.ts` — fire-and-forget invocation via `start(buildOrgSnapshotWorkflow, [input])` from `workflow/api`; failures are logged but never surface in the request
- `lib/sandbox/defaultBaseSnapshotId.ts` — `DEFAULT_SANDBOX_BASE_SNAPSHOT_ID` constant, env-overridable, used as the workflow's base image
- `lib/sandbox/createSandboxHandler.ts` — kicks the workflow when `extractOrgRepoName` returns non-null and `findOrgSnapshot` misses

Plus the `workflow@^4.2.0-beta.72` package and lockfile.
Behavior
POST `/api/sandbox` flow now: when `extractOrgRepoName` returns a non-null org repo name and `findOrgSnapshot` misses, the handler kicks `kickBuildOrgSnapshotWorkflow` and continues down the slow full-clone path as before.
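A hypothetical sketch of that handler-side branch, with the helpers the PR names (`extractOrgRepoName`, `findOrgSnapshot`, the kick) injected as stand-ins so the hit/miss/non-recoupable cases are explicit:

```typescript
// Sketch of the recoupable-miss branch in createSandboxHandler. All three
// dependencies are illustrative stand-ins for the real helpers.
type Deps = {
  extractOrgRepoName: (repoUrl: string) => string | null;
  findOrgSnapshot: (orgRepoName: string) => Promise<string | null>;
  kickBuildOrgSnapshotWorkflow: (input: { orgRepoName: string; cloneUrl: string }) => void;
};

async function maybeKickSnapshotBuild(repoUrl: string, deps: Deps): Promise<"kicked" | "skipped"> {
  const orgRepoName = deps.extractOrgRepoName(repoUrl);
  if (orgRepoName === null) return "skipped"; // non-recoupable repo: nothing to build
  const snapshotId = await deps.findOrgSnapshot(orgRepoName);
  if (snapshotId !== null) return "skipped"; // hit: warm-boot from the existing snapshot
  deps.kickBuildOrgSnapshotWorkflow({ orgRepoName, cloneUrl: repoUrl }); // miss: build one
  return "kicked";
}
```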
The workflow itself is a faithful port of open-agents' `buildOrgSnapshotWorkflow`. Same one-step structure, same `refreshBaseSnapshot` call (already in api), same failure path (caught + logged, returns `{success: false, error}`).
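That one-step structure can be sketched like this. A reduction, not the ported file: `refreshBaseSnapshot` is stubbed (the real one provisions a sandbox named after the org repo, shallow-clones, and snapshots), and the directive strings are inert outside the `withWorkflow` build.

```typescript
// One-step workflow sketch. In the real file the "use workflow" / "use step"
// directives mark boundaries for the compiler; here they are inert strings.
type StepResult = { success: true; snapshotId: string } | { success: false; error: string };

// Stub standing in for api's refreshBaseSnapshot.
async function refreshBaseSnapshot(opts: { sandboxName: string; cloneUrl: string }): Promise<{ snapshotId: string }> {
  return { snapshotId: `snap_for_${opts.sandboxName}` };
}

async function buildSnapshotStep(input: { orgRepoName: string; cloneUrl: string }): Promise<StepResult> {
  "use step";
  try {
    const { snapshotId } = await refreshBaseSnapshot({ sandboxName: input.orgRepoName, cloneUrl: input.cloneUrl });
    return { success: true, snapshotId };
  } catch (err) {
    // Same failure path as the port: caught, returned as data, never thrown.
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}

async function buildOrgSnapshotWorkflow(input: { orgRepoName: string; cloneUrl: string }): Promise<StepResult> {
  "use workflow";
  return buildSnapshotStep(input);
}
```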
TDD
Strict red → green. +3 net new tests (suite 2574 → 2577):
- kicks the workflow on recoupable miss with the right input
- does NOT kick when an existing snapshot is found (hit case)
- does NOT kick for non-recoupable repos
Verification
Test plan
What follows (Phase 2)
This was Phase 1 (build-org-snapshot — small, self-contained). Phase 2 is the bigger one: `sandboxLifecycleWorkflow` + `evaluateSandboxLifecycle`, porting the stateful auto-pause / status-check-overdue logic.
Will land in its own PR. Phase 1 already closes the largest user-visible cutover gap (~75s cold-start regression for new orgs).
🤖 Generated with Claude Code
Summary by cubic
Adds a background Vercel Workflow that builds an org snapshot when a recoupable repo is requested and no snapshot exists, so future sessions start in seconds. Enables the Vercel Workflow runtime and hardens the build step.
New Features
- `buildOrgSnapshotWorkflow` + `buildSnapshotStep` to provision, shallow-clone, and snapshot org repos via `refreshBaseSnapshot`; uses `DEFAULT_SANDBOX_BASE_SNAPSHOT_ID` (env overridable).
- `createSandboxHandler` to fire-and-forget `kickBuildOrgSnapshotWorkflow` on a recoupable snapshot miss; the current request still full-clones.
- `withWorkflow(nextConfig)`; workflows live in `app/workflows/*`; added `workflow@^4.2.4`.

Bug Fixes
- `DEFAULT_SANDBOX_BASE_SNAPSHOT_ID` now defaults to a Recoup team snapshot (`snap_RgVtpDO4y1BJHQiUbptMwS3Rt2EQ`) to stop 404s.
- Sanitized the `git clone` URL to prevent command injection; removed `cloneUrl` from logs to avoid leaking credentials.

Written for commit 27cce76.