fix(background): recategorize user/recovery failures as errors, not trigger faults by TheodoreSpeaks · Pull Request #4860 · simstudioai/sim

TheodoreSpeaks · 2026-06-03T03:17:56Z

Problem

#eng-errors is flooded with webhook-execution trigger.dev run failures, and a trickle of schedule-execution ones. The errors are almost all things that shouldn't fault a background job:

Webhook — user-caused workflow failures, already recorded in the user's execution logs:

Gmail 2 is missing required fields: Label (serializer validation)
Knowledge 2: "semantic_query" doesn't exist on block "restructure" (invalid field reference)
Firecrawl 1: Invalid URL, Router 1: Invalid request body, provider 401/400/404, etc.

Schedule — a secondary DB failure during error recovery (e.g. a transient blip while releasing the claim), which double-faults during cleanup.

webhook-execution re-threw every error; with maxAttempts: 1 any throw marks the run failed and alerts — even for a user/workflow problem. schedule-execution already treats workflow failures as recorded errors (not faults), but its outermost catch's own recovery code was unguarded, so a cleanup-write failure escaped run().

Changes

`webhook-execution.ts` — fault vs error

await loggingSession.waitForPostExecution() before reading the finalized flag (set inside a fire-and-forget post-execution promise; the read must be reliable once branches diverge).
Finalized by core (workflow ran, failure already in execution logs = user/workflow error) → return { success: false, ... }; the run completes, no alert.
Not finalized (error from the webhook pipeline itself) → record and re-throw to fault the run, preserving alerting on genuine problems.

Every error in the logs is thrown from inside executeWorkflowCore (serializer validation at execution-core.ts:474, or block/provider failures via the DAG executor), which force-starts logging and finalizes before re-throwing — so they're all finalized-by-core and stop faulting. Verified against prod run run_cmpwquraf4tb60iogze9dg1fr.

`schedule-execution.ts` — guard the recovery path

The outermost catch already turns workflow failures into schedule-state updates without faulting. Its own recovery calls (infra-retry, releaseClaim) were unguarded — wrap them in a try/catch that logs and records the exception on the span without re-throwing. The claim expires on its TTL and the next tick re-claims the schedule, so swallowing a cleanup-write failure is safe.

Both — keep investigability

The exception is recorded on the run's OTel span via recordException (no ERROR status, so it doesn't fault the run) so these stay visible in Tempo; they also remain in the execution logs and logger.error → Loki.

Notes / risk

No retry behavior changes — webhook-execution is maxAttempts: 1, so nothing was ever retried; only the run's final status changes (FAILED → COMPLETED). Schedule's recovery swallow is also safe (TTL + next tick).
Webhook idempotency marker flips failed→completed; since webhookIdempotency is retryFailures: false, this reduces alerts (duplicate deliveries of a failed webhook previously re-threw and re-faulted).
The one genuine tradeoff: a real in-core bug now shows as a completed run instead of failed — but it stays fully investigable via execution logs + the recorded span exception (Tempo) + logger.error (Loki).
The knowledge knowledge-process-document task (maxAttempts: 3, Unsupported fileUrl scheme user error) is a separate follow-up.

Test plan

bun run test background/webhook-execution.test.ts — added cases: finalized → resolves { success: false } + records span exception; non-finalized → throws + logs.
bun run type-check — clean.
bun run lint:check — clean (2 warnings are pre-existing in untouched service.ts).
Prod sanity check via trigger MCP.

🤖 Generated with Claude Code

Webhook-triggered executions re-threw every error, so trigger.dev marked the run failed and fired #eng-errors alerts. The vast majority of these are user-caused workflow failures (missing required fields, invalid field references, bad URLs, provider 4xx, expired models, low credit) that are already recorded in the execution logs. Distinguish fault vs error in executeWebhookJobInternal: when the failure was finalized by core (the workflow ran and its failure is logged), complete the run with { success: false } instead of throwing. Errors that were not finalized came from the webhook pipeline itself and still re-throw to fault the run. Await waitForPostExecution first so the finalized flag is reliable. The error is still recorded on the run's OTel span via recordException (no ERROR status, so the run isn't faulted) and remains in the execution logs, so these stay investigable in Tempo/Loki without false alerts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel · 2026-06-03T03:18:03Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
docs	Skipped		Jun 3, 2026 5:44am

cursor · 2026-06-03T03:18:06Z

PR Summary

Medium Risk
Changes how Trigger.dev run success is determined for webhooks and schedule recovery; real in-core bugs may show as completed runs, though execution logs and span exceptions remain for investigation.

Overview
Background jobs were treating user/workflow failures and transient recovery glitches like infrastructure faults, which flooded #eng-errors on Trigger.dev.

Webhook execution now awaits loggingSession.waitForPostExecution() before checking wasExecutionFinalizedByCore. When core already finalized the failure (validation, block/provider errors, etc.), the job returns { success: false, ... } instead of re-throwing, so the run completes without a false alert; it still recordExceptions on the OTel span (without ERROR status) for Tempo. Pipeline/setup failures that were not finalized still log via safeCompleteWithError and re-throw to fault the run.

Schedule execution wraps the outer catch’s recovery path (infra retry, releaseClaim) in a nested try/catch so a secondary DB failure during cleanup logs and records on the span without faulting; the claim TTL still allows re-claim on the next tick.

Tests add executeWebhookJob fault vs error cases with the needed mocks.

^{Reviewed by Cursor Bugbot for commit 3fbd65f. Bugbot is set up for automated code reviews on this repo. Configure here.}

TheodoreSpeaks · 2026-06-03T03:18:10Z

@greptile review

greptile-apps · 2026-06-03T03:22:47Z

Greptile Summary

This PR fixes two sources of false-positive trigger.dev run failures: webhook jobs that threw on user/workflow errors (which were already recorded in execution logs), and schedule jobs whose cleanup path could fault the run on a secondary DB error during recovery.

webhook-execution.ts: Awaits waitForPostExecution() so the finalized-by-core signal is reliable, then branches on wasExecutionFinalizedByCore — user/workflow errors return { success: false } without faulting the run, while genuine pipeline errors still re-throw. The exception is recorded on the OTel span in both paths for investigability.
schedule-execution.ts: Wraps the existing outer-catch recovery block (retryScheduleAfterInfraFailure / releaseClaim) in a try/catch so a secondary transient DB failure during cleanup logs and records on the span rather than escaping and faulting the run.
webhook-execution.test.ts: Adds two new test cases covering the finalized and non-finalized branches, including assertions that waitForPostExecution is called on both paths.

Confidence Score: 5/5

Safe to merge — changes are well-scoped, the finalized-by-core signal is correctly gated behind waitForPostExecution, and the schedule recovery guard is a straightforward containment that relies on TTL expiry as a fallback.

Both changed code paths are narrow and defensive: webhook-execution now correctly branches on a reliable finalized signal rather than re-throwing all errors, and schedule-execution wraps an already-isolated cleanup block. No new async races are introduced, the OTel span recording preserves investigability, and the new test cases cover both branches including the race-condition guard. No logic defects were found.

No files require special attention.

Important Files Changed

Filename	Overview
apps/sim/background/webhook-execution.ts	Core fault-vs-error split: adds waitForPostExecution() before the finalized check, returns a completed result for user/workflow errors, and re-throws only genuine pipeline errors. Logic is sound and consistent with how markExecutionFinalizedByCore / wasExecutionFinalizedByCore work in execution-core.ts.
apps/sim/background/schedule-execution.ts	Guards the recovery path (infra retry + releaseClaim) with an inner try/catch so a secondary transient failure during cleanup logs and records on the span without faulting the run. Safe because the claim expires on TTL and the next tick re-claims.
apps/sim/background/webhook-execution.test.ts	Adds comprehensive test infrastructure and two new test cases for the fault-vs-error branching. Both tests assert that waitForPostExecution is called, covering the race-condition guard on both the finalized and non-finalized paths.

Sequence Diagram

sequenceDiagram
    participant TDev as trigger.dev
    participant WH as webhook-execution
    participant Core as executeWorkflowCore
    participant LS as LoggingSession
    participant Span as OTel Span

    TDev->>WH: run(payload)
    WH->>Core: executeWorkflowCore(...)
    alt Workflow/user error (block fail, provider 4xx, etc.)
        Core-->>LS: markExecutionFinalizedByCore(error, executionId)
        Core-->>WH: throws error (finalized)
        WH->>LS: waitForPostExecution() [ensures finalized flag is set]
        LS-->>WH: resolved
        WH->>WH: wasExecutionFinalizedByCore(error) → true
        WH->>Span: recordException(error) [visible in Tempo, no ERROR status]
        WH-->>TDev: "return { success: false } [run COMPLETES, no alert]"
    else Pipeline/infra error (workflow not found, DB error, etc.)
        Core-->>WH: throws error (not finalized)
        WH->>LS: waitForPostExecution()
        LS-->>WH: resolved
        WH->>WH: wasExecutionFinalizedByCore(error) → false
        WH->>LS: safeStart + safeCompleteWithError
        WH-->>TDev: re-throws [run FAULTS, alert fires]
    end

_{Reviews (4): Last reviewed commit: "test(webhook): assert waitForPostExecuti..." | Re-trigger Greptile}

greptile-apps · 2026-06-03T03:29:41Z

Greptile Summary

This PR fixes false trigger.dev run failures in the webhook execution pipeline by distinguishing user/workflow errors (already recorded in execution logs) from genuine infra errors. The core change adds await loggingSession.waitForPostExecution() before the wasExecutionFinalizedByCore check to eliminate a race condition (the finalized flag is set inside a fire-and-forget promise), then returns {success: false} instead of re-throwing when the failure was finalized by core.

webhook-execution.ts: In the catch block, awaits post-execution finalization before branching on wasExecutionFinalizedByCore; finalized-by-core errors now complete the trigger.dev run (recording the exception on the OTel span for Tempo visibility) while non-finalized errors still re-throw to fault the run.
webhook-execution.test.ts: Adds two new integration-style test cases covering the finalized path (asserts return value + span recording, no double-logging) and the non-finalized path (asserts rejection + logging session error completion).

Confidence Score: 4/5

Safe to merge; the change correctly narrows alerting to genuine infra failures while preserving full observability of user errors through execution logs, Loki, and Tempo.

The logic is well-reasoned and the waitForPostExecution addition eliminates the race condition that made the finalized check unreliable. The non-finalized test case is missing an assertion that waitForPostExecution was called, leaving a small gap where a future refactor could silently re-introduce the race without breaking tests.

The test file would benefit from the additional assertion in the non-finalized path; no concerns with the production code in webhook-execution.ts.

Important Files Changed

Filename	Overview
apps/sim/background/webhook-execution.ts	Adds `waitForPostExecution()` before finalized-by-core check to eliminate a race condition, then returns `{success:false}` instead of re-throwing for user/workflow errors so trigger.dev runs complete rather than fault.
apps/sim/background/webhook-execution.test.ts	Adds two new test cases covering the finalized vs non-finalized execution paths with mocks for core, logging session, and OTel span; finalized path asserts return value and span recording, non-finalized path asserts throw and logging.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[executeWebhookJobInternal] --> B[executeWorkflowCore]
    B -->|success| C[handleExecutionResult\nwaitForPostExecution]
    C --> D[return success result]
    B -->|throws| E[catch block\nlogger.error]
    E --> F[await loggingSession.waitForPostExecution\nensures finalized flag is set]
    F --> G{wasExecutionFinalizedByCore?}
    G -->|YES — user/workflow error| H[recordException on OTel span]
    H --> I[return success:false\ntrigger.dev run COMPLETES]
    G -->|NO — infra/pipeline error| J[safeStart + safeCompleteWithError\nrecord to execution logs]
    J --> K[throw error\ntrigger.dev run FAULTS]

_{Reviews (2): Last reviewed commit: "fix(webhook): don't fault trigger run on..." | Re-trigger Greptile}

The schedule task already treats workflow-execution failures as recorded errors rather than trigger faults, but the outermost catch's own recovery code (the infra-retry and releaseClaim calls) was unguarded. A secondary DB blip while releasing the claim re-threw and escaped run(), faulting the trigger.dev run and firing an alert — a double-fault during cleanup. Wrap the recovery path in a try/catch: log and record the exception on the span without re-throwing. The claim expires on its TTL and the next tick re-claims the schedule, so swallowing the cleanup failure is safe. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

TheodoreSpeaks · 2026-06-03T05:37:59Z

@greptile review

…path Guards the race fix on the infra-error path so a future refactor can't silently drop the await. Addresses Greptile review feedback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

TheodoreSpeaks · 2026-06-03T05:44:44Z

@greptile review

greptile-apps Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread apps/sim/background/webhook-execution.test.ts

vercel Bot temporarily deployed to Preview June 3, 2026 05:37 Inactive

TheodoreSpeaks changed the title ~~fix(webhook): don't fault trigger run on user/workflow execution errors~~ fix(background): recategorize user/recovery failures as errors, not trigger faults Jun 3, 2026

test(webhook): assert waitForPostExecution runs on the non-finalized …

3fbd65f

…path Guards the race fix on the infra-error path so a future refactor can't silently drop the await. Addresses Greptile review feedback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel Bot temporarily deployed to Preview June 3, 2026 05:44 Inactive

TheodoreSpeaks merged commit 928bd91 into staging Jun 3, 2026
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(background): recategorize user/recovery failures as errors, not trigger faults#4860

fix(background): recategorize user/recovery failures as errors, not trigger faults#4860
TheodoreSpeaks merged 3 commits into
stagingfrom
fix/recategorize-trigger-faults

TheodoreSpeaks commented Jun 3, 2026 •

edited

Loading

Uh oh!

vercel Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

cursor Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Jun 3, 2026

Uh oh!

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TheodoreSpeaks commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

webhook-execution.ts — fault vs error

schedule-execution.ts — guard the recovery path

Both — keep investigability

Notes / risk

Test plan

Uh oh!

vercel Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

greptile-apps Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

TheodoreSpeaks commented Jun 3, 2026 •

edited

Loading

`webhook-execution.ts` — fault vs error

`schedule-execution.ts` — guard the recovery path

vercel Bot commented Jun 3, 2026 •

edited

Loading

cursor Bot commented Jun 3, 2026 •

edited

Loading

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading