Skip to content

fix(background): recategorize user/recovery failures as errors, not trigger faults#4860

Merged
TheodoreSpeaks merged 3 commits into
stagingfrom
fix/recategorize-trigger-faults
Jun 3, 2026
Merged

fix(background): recategorize user/recovery failures as errors, not trigger faults#4860
TheodoreSpeaks merged 3 commits into
stagingfrom
fix/recategorize-trigger-faults

Conversation

@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator

@TheodoreSpeaks TheodoreSpeaks commented Jun 3, 2026

Problem

#eng-errors is flooded with webhook-execution trigger.dev run failures, and a trickle of schedule-execution ones. The errors are almost all things that shouldn't fault a background job:

Webhook — user-caused workflow failures, already recorded in the user's execution logs:

  • Gmail 2 is missing required fields: Label (serializer validation)
  • Knowledge 2: "semantic_query" doesn't exist on block "restructure" (invalid field reference)
  • Firecrawl 1: Invalid URL, Router 1: Invalid request body, provider 401/400/404, etc.

Schedule — a secondary DB failure during error recovery (e.g. a transient blip while releasing the claim), which double-faults during cleanup.

webhook-execution re-threw every error; with maxAttempts: 1 any throw marks the run failed and alerts — even for a user/workflow problem. schedule-execution already treats workflow failures as recorded errors (not faults), but its outermost catch's own recovery code was unguarded, so a cleanup-write failure escaped run().

Changes

webhook-execution.ts — fault vs error

  • await loggingSession.waitForPostExecution() before reading the finalized flag (set inside a fire-and-forget post-execution promise; the read must be reliable once branches diverge).
  • Finalized by core (workflow ran, failure already in execution logs = user/workflow error) → return { success: false, ... }; the run completes, no alert.
  • Not finalized (error from the webhook pipeline itself) → record and re-throw to fault the run, preserving alerting on genuine problems.

Every error in the logs is thrown from inside executeWorkflowCore (serializer validation at execution-core.ts:474, or block/provider failures via the DAG executor), which force-starts logging and finalizes before re-throwing — so they're all finalized-by-core and stop faulting. Verified against prod run run_cmpwquraf4tb60iogze9dg1fr.

schedule-execution.ts — guard the recovery path

  • The outermost catch already turns workflow failures into schedule-state updates without faulting. Its own recovery calls (infra-retry, releaseClaim) were unguarded — wrap them in a try/catch that logs and records the exception on the span without re-throwing. The claim expires on its TTL and the next tick re-claims the schedule, so swallowing a cleanup-write failure is safe.

Both — keep investigability

The exception is recorded on the run's OTel span via recordException (no ERROR status, so it doesn't fault the run) so these stay visible in Tempo; they also remain in the execution logs and logger.errorLoki.

Notes / risk

  • No retry behavior changeswebhook-execution is maxAttempts: 1, so nothing was ever retried; only the run's final status changes (FAILED → COMPLETED). Schedule's recovery swallow is also safe (TTL + next tick).
  • Webhook idempotency marker flips failedcompleted; since webhookIdempotency is retryFailures: false, this reduces alerts (duplicate deliveries of a failed webhook previously re-threw and re-faulted).
  • The one genuine tradeoff: a real in-core bug now shows as a completed run instead of failed — but it stays fully investigable via execution logs + the recorded span exception (Tempo) + logger.error (Loki).
  • The knowledge knowledge-process-document task (maxAttempts: 3, Unsupported fileUrl scheme user error) is a separate follow-up.

Test plan

  • bun run test background/webhook-execution.test.ts — added cases: finalized → resolves { success: false } + records span exception; non-finalized → throws + logs.
  • bun run type-check — clean.
  • bun run lint:check — clean (2 warnings are pre-existing in untouched service.ts).
  • Prod sanity check via trigger MCP.

🤖 Generated with Claude Code

Webhook-triggered executions re-threw every error, so trigger.dev marked
the run failed and fired #eng-errors alerts. The vast majority of these are
user-caused workflow failures (missing required fields, invalid field
references, bad URLs, provider 4xx, expired models, low credit) that are
already recorded in the execution logs.

Distinguish fault vs error in executeWebhookJobInternal: when the failure
was finalized by core (the workflow ran and its failure is logged), complete
the run with { success: false } instead of throwing. Errors that were not
finalized came from the webhook pipeline itself and still re-throw to fault
the run. Await waitForPostExecution first so the finalized flag is reliable.

The error is still recorded on the run's OTel span via recordException (no
ERROR status, so the run isn't faulted) and remains in the execution logs,
so these stay investigable in Tempo/Loki without false alerts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown

vercel Bot commented Jun 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 3, 2026 5:44am

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented Jun 3, 2026

PR Summary

Medium Risk
Changes how Trigger.dev run success is determined for webhooks and schedule recovery; real in-core bugs may show as completed runs, though execution logs and span exceptions remain for investigation.

Overview
Background jobs were treating user/workflow failures and transient recovery glitches like infrastructure faults, which flooded #eng-errors on Trigger.dev.

Webhook execution now awaits loggingSession.waitForPostExecution() before checking wasExecutionFinalizedByCore. When core already finalized the failure (validation, block/provider errors, etc.), the job returns { success: false, ... } instead of re-throwing, so the run completes without a false alert; it still recordExceptions on the OTel span (without ERROR status) for Tempo. Pipeline/setup failures that were not finalized still log via safeCompleteWithError and re-throw to fault the run.

Schedule execution wraps the outer catch’s recovery path (infra retry, releaseClaim) in a nested try/catch so a secondary DB failure during cleanup logs and records on the span without faulting; the claim TTL still allows re-claim on the next tick.

Tests add executeWebhookJob fault vs error cases with the needed mocks.

Reviewed by Cursor Bugbot for commit 3fbd65f. Bugbot is set up for automated code reviews on this repo. Configure here.

@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

@greptile review

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR fixes two sources of false-positive trigger.dev run failures: webhook jobs that threw on user/workflow errors (which were already recorded in execution logs), and schedule jobs whose cleanup path could fault the run on a secondary DB error during recovery.

  • webhook-execution.ts: Awaits waitForPostExecution() so the finalized-by-core signal is reliable, then branches on wasExecutionFinalizedByCore — user/workflow errors return { success: false } without faulting the run, while genuine pipeline errors still re-throw. The exception is recorded on the OTel span in both paths for investigability.
  • schedule-execution.ts: Wraps the existing outer-catch recovery block (retryScheduleAfterInfraFailure / releaseClaim) in a try/catch so a secondary transient DB failure during cleanup logs and records on the span rather than escaping and faulting the run.
  • webhook-execution.test.ts: Adds two new test cases covering the finalized and non-finalized branches, including assertions that waitForPostExecution is called on both paths.

Confidence Score: 5/5

Safe to merge — changes are well-scoped, the finalized-by-core signal is correctly gated behind waitForPostExecution, and the schedule recovery guard is a straightforward containment that relies on TTL expiry as a fallback.

Both changed code paths are narrow and defensive: webhook-execution now correctly branches on a reliable finalized signal rather than re-throwing all errors, and schedule-execution wraps an already-isolated cleanup block. No new async races are introduced, the OTel span recording preserves investigability, and the new test cases cover both branches including the race-condition guard. No logic defects were found.

No files require special attention.

Important Files Changed

Filename Overview
apps/sim/background/webhook-execution.ts Core fault-vs-error split: adds waitForPostExecution() before the finalized check, returns a completed result for user/workflow errors, and re-throws only genuine pipeline errors. Logic is sound and consistent with how markExecutionFinalizedByCore / wasExecutionFinalizedByCore work in execution-core.ts.
apps/sim/background/schedule-execution.ts Guards the recovery path (infra retry + releaseClaim) with an inner try/catch so a secondary transient failure during cleanup logs and records on the span without faulting the run. Safe because the claim expires on TTL and the next tick re-claims.
apps/sim/background/webhook-execution.test.ts Adds comprehensive test infrastructure and two new test cases for the fault-vs-error branching. Both tests assert that waitForPostExecution is called, covering the race-condition guard on both the finalized and non-finalized paths.

Sequence Diagram

sequenceDiagram
    participant TDev as trigger.dev
    participant WH as webhook-execution
    participant Core as executeWorkflowCore
    participant LS as LoggingSession
    participant Span as OTel Span

    TDev->>WH: run(payload)
    WH->>Core: executeWorkflowCore(...)
    alt Workflow/user error (block fail, provider 4xx, etc.)
        Core-->>LS: markExecutionFinalizedByCore(error, executionId)
        Core-->>WH: throws error (finalized)
        WH->>LS: waitForPostExecution() [ensures finalized flag is set]
        LS-->>WH: resolved
        WH->>WH: wasExecutionFinalizedByCore(error) → true
        WH->>Span: recordException(error) [visible in Tempo, no ERROR status]
        WH-->>TDev: "return { success: false } [run COMPLETES, no alert]"
    else Pipeline/infra error (workflow not found, DB error, etc.)
        Core-->>WH: throws error (not finalized)
        WH->>LS: waitForPostExecution()
        LS-->>WH: resolved
        WH->>WH: wasExecutionFinalizedByCore(error) → false
        WH->>LS: safeStart + safeCompleteWithError
        WH-->>TDev: re-throws [run FAULTS, alert fires]
    end
Loading

Reviews (4): Last reviewed commit: "test(webhook): assert waitForPostExecuti..." | Re-trigger Greptile

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR fixes false trigger.dev run failures in the webhook execution pipeline by distinguishing user/workflow errors (already recorded in execution logs) from genuine infra errors. The core change adds await loggingSession.waitForPostExecution() before the wasExecutionFinalizedByCore check to eliminate a race condition (the finalized flag is set inside a fire-and-forget promise), then returns {success: false} instead of re-throwing when the failure was finalized by core.

  • webhook-execution.ts: In the catch block, awaits post-execution finalization before branching on wasExecutionFinalizedByCore; finalized-by-core errors now complete the trigger.dev run (recording the exception on the OTel span for Tempo visibility) while non-finalized errors still re-throw to fault the run.
  • webhook-execution.test.ts: Adds two new integration-style test cases covering the finalized path (asserts return value + span recording, no double-logging) and the non-finalized path (asserts rejection + logging session error completion).

Confidence Score: 4/5

Safe to merge; the change correctly narrows alerting to genuine infra failures while preserving full observability of user errors through execution logs, Loki, and Tempo.

The logic is well-reasoned and the waitForPostExecution addition eliminates the race condition that made the finalized check unreliable. The non-finalized test case is missing an assertion that waitForPostExecution was called, leaving a small gap where a future refactor could silently re-introduce the race without breaking tests.

The test file would benefit from the additional assertion in the non-finalized path; no concerns with the production code in webhook-execution.ts.

Important Files Changed

Filename Overview
apps/sim/background/webhook-execution.ts Adds waitForPostExecution() before finalized-by-core check to eliminate a race condition, then returns {success:false} instead of re-throwing for user/workflow errors so trigger.dev runs complete rather than fault.
apps/sim/background/webhook-execution.test.ts Adds two new test cases covering the finalized vs non-finalized execution paths with mocks for core, logging session, and OTel span; finalized path asserts return value and span recording, non-finalized path asserts throw and logging.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[executeWebhookJobInternal] --> B[executeWorkflowCore]
    B -->|success| C[handleExecutionResult\nwaitForPostExecution]
    C --> D[return success result]
    B -->|throws| E[catch block\nlogger.error]
    E --> F[await loggingSession.waitForPostExecution\nensures finalized flag is set]
    F --> G{wasExecutionFinalizedByCore?}
    G -->|YES — user/workflow error| H[recordException on OTel span]
    H --> I[return success:false\ntrigger.dev run COMPLETES]
    G -->|NO — infra/pipeline error| J[safeStart + safeCompleteWithError\nrecord to execution logs]
    J --> K[throw error\ntrigger.dev run FAULTS]
Loading

Reviews (2): Last reviewed commit: "fix(webhook): don't fault trigger run on..." | Re-trigger Greptile

Comment thread apps/sim/background/webhook-execution.test.ts
The schedule task already treats workflow-execution failures as recorded
errors rather than trigger faults, but the outermost catch's own recovery
code (the infra-retry and releaseClaim calls) was unguarded. A secondary DB
blip while releasing the claim re-threw and escaped run(), faulting the
trigger.dev run and firing an alert — a double-fault during cleanup.

Wrap the recovery path in a try/catch: log and record the exception on the
span without re-throwing. The claim expires on its TTL and the next tick
re-claims the schedule, so swallowing the cleanup failure is safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@TheodoreSpeaks TheodoreSpeaks changed the title fix(webhook): don't fault trigger run on user/workflow execution errors fix(background): recategorize user/recovery failures as errors, not trigger faults Jun 3, 2026
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

@greptile review

…path

Guards the race fix on the infra-error path so a future refactor can't
silently drop the await. Addresses Greptile review feedback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

@greptile review

@TheodoreSpeaks TheodoreSpeaks merged commit 928bd91 into staging Jun 3, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant