fix(background): recategorize user/recovery failures as errors, not trigger faults #4860
Webhook-triggered executions re-threw every error, so trigger.dev marked
the run failed and fired #eng-errors alerts. The vast majority of these are
user-caused workflow failures (missing required fields, invalid field
references, bad URLs, provider 4xx, expired models, low credit) that are
already recorded in the execution logs.
Distinguish fault vs error in executeWebhookJobInternal: when the failure
was finalized by core (the workflow ran and its failure is logged), complete
the run with { success: false } instead of throwing. Errors that were not
finalized came from the webhook pipeline itself and still re-throw to fault
the run. Await waitForPostExecution first so the finalized flag is reliable.
The error is still recorded on the run's OTel span via recordException (no
ERROR status, so the run isn't faulted) and remains in the execution logs,
so these stay investigable in Tempo/Loki without false alerts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
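The branching described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in webhook-execution.ts: `runWebhookJob`, `SpanLike`, `LoggingSessionLike`, and `WebhookJobResult` are hypothetical stand-ins, and the finalized check is passed in as a parameter because its real signature is not shown here.

```typescript
// Hypothetical stand-ins for the repo's real types (assumptions for illustration).
interface SpanLike {
  recordException(error: Error): void
}

interface LoggingSessionLike {
  waitForPostExecution(): Promise<void>
}

interface WebhookJobResult {
  success: boolean
  error?: string
}

// Assumed shape: returns true when core already logged and finalized the failure.
type FinalizedCheck = (error: unknown) => boolean

async function runWebhookJob(
  execute: () => Promise<void>,
  session: LoggingSessionLike,
  span: SpanLike,
  wasExecutionFinalizedByCore: FinalizedCheck
): Promise<WebhookJobResult> {
  try {
    await execute()
    return { success: true }
  } catch (error) {
    // The finalized flag is set by post-execution work, so wait for that
    // work to settle before branching on the flag.
    await session.waitForPostExecution()

    if (wasExecutionFinalizedByCore(error)) {
      const err = error instanceof Error ? error : new Error(String(error))
      // Error: record it on the span for tracing, but neither set an ERROR
      // status nor throw, so the run completes and no alert fires.
      span.recordException(err)
      return { success: false, error: err.message }
    }

    // Fault: the webhook pipeline itself failed, so re-throw to fault the run.
    throw error
  }
}
```

The key design point is that only the non-finalized branch throws; the finalized branch turns the failure into a return value.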
PR Summary
Medium Risk
Overview
Webhook execution now … Schedule execution wraps the outer catch's recovery path (infra retry, … Tests add …
Reviewed by Cursor Bugbot for commit 3fbd65f.
@greptile review
Greptile Summary
This PR fixes two sources of false-positive trigger.dev run failures: webhook jobs that threw on user/workflow errors (which were already recorded in execution logs), and schedule jobs whose cleanup path could fault the run on a secondary DB error during recovery.
Confidence Score: 5/5
Safe to merge: changes are well-scoped, the finalized-by-core signal is correctly gated behind waitForPostExecution, and the schedule recovery guard is a straightforward containment that relies on TTL expiry as a fallback.
Both changed code paths are narrow and defensive: webhook-execution now correctly branches on a reliable finalized signal rather than re-throwing all errors, and schedule-execution wraps an already-isolated cleanup block. No new async races are introduced, the OTel span recording preserves investigability, and the new test cases cover both branches including the race-condition guard. No logic defects were found.
No files require special attention.
Sequence Diagram
sequenceDiagram
participant TDev as trigger.dev
participant WH as webhook-execution
participant Core as executeWorkflowCore
participant LS as LoggingSession
participant Span as OTel Span
TDev->>WH: run(payload)
WH->>Core: executeWorkflowCore(...)
alt Workflow/user error (block fail, provider 4xx, etc.)
Core-->>LS: markExecutionFinalizedByCore(error, executionId)
Core-->>WH: throws error (finalized)
WH->>LS: waitForPostExecution() [ensures finalized flag is set]
LS-->>WH: resolved
WH->>WH: wasExecutionFinalizedByCore(error) → true
WH->>Span: recordException(error) [visible in Tempo, no ERROR status]
WH-->>TDev: "return { success: false } [run COMPLETES, no alert]"
else Pipeline/infra error (workflow not found, DB error, etc.)
Core-->>WH: throws error (not finalized)
WH->>LS: waitForPostExecution()
LS-->>WH: resolved
WH->>WH: wasExecutionFinalizedByCore(error) → false
WH->>LS: safeStart + safeCompleteWithError
WH-->>TDev: re-throws [run FAULTS, alert fires]
end
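The diagram relies on the finalized flag being readable only after waitForPostExecution() resolves. A minimal sketch of why, assuming the flag is tracked per error object and set inside a promise that core does not await: the WeakMap storage, `LoggingSessionSketch`, and `finalizeInBackground` are hypothetical, and only the names markExecutionFinalizedByCore, wasExecutionFinalizedByCore, and waitForPostExecution come from the PR.

```typescript
// Hypothetical storage: maps a thrown error object to the execution that finalized it.
const finalizedErrors = new WeakMap<object, string>()

function markExecutionFinalizedByCore(error: unknown, executionId: string): void {
  if (typeof error === 'object' && error !== null) {
    finalizedErrors.set(error, executionId)
  }
}

function wasExecutionFinalizedByCore(error: unknown): boolean {
  return typeof error === 'object' && error !== null && finalizedErrors.has(error)
}

class LoggingSessionSketch {
  private postExecution: Promise<void> = Promise.resolve()

  // Assumed behaviour: core starts this without awaiting it, then re-throws.
  finalizeInBackground(error: unknown, executionId: string): void {
    this.postExecution = Promise.resolve().then(() => {
      // Persisting the failure to the execution logs would happen here.
      markExecutionFinalizedByCore(error, executionId)
    })
  }

  // Resolves once the background finalization has run.
  waitForPostExecution(): Promise<void> {
    return this.postExecution
  }
}
```

Under these assumptions, a read of wasExecutionFinalizedByCore(error) made immediately after the throw would see `false`, while the same read after `await session.waitForPostExecution()` sees `true`.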
Reviews (4): Last reviewed commit: "test(webhook): assert waitForPostExecuti..."
Greptile Summary
This PR fixes false trigger.dev run failures in the webhook execution pipeline by distinguishing user/workflow errors (already recorded in execution logs) from genuine infra errors. The core change adds …
Confidence Score: 4/5
Safe to merge; the change correctly narrows alerting to genuine infra failures while preserving full observability of user errors through execution logs, Loki, and Tempo.
The logic is well-reasoned and the … The test file would benefit from the additional assertion in the non-finalized path; no concerns with the production code in …
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[executeWebhookJobInternal] --> B[executeWorkflowCore]
B -->|success| C[handleExecutionResult\nwaitForPostExecution]
C --> D[return success result]
B -->|throws| E[catch block\nlogger.error]
E --> F[await loggingSession.waitForPostExecution\nensures finalized flag is set]
F --> G{wasExecutionFinalizedByCore?}
G -->|YES — user/workflow error| H[recordException on OTel span]
H --> I[return success:false\ntrigger.dev run COMPLETES]
G -->|NO — infra/pipeline error| J[safeStart + safeCompleteWithError\nrecord to execution logs]
J --> K[throw error\ntrigger.dev run FAULTS]
Reviews (2): Last reviewed commit: "fix(webhook): don't fault trigger run on..."
The schedule task already treats workflow-execution failures as recorded errors rather than trigger faults, but the outermost catch's own recovery code (the infra-retry and releaseClaim calls) was unguarded. A secondary DB blip while releasing the claim re-threw and escaped run(), faulting the trigger.dev run and firing an alert: a double-fault during cleanup.

Wrap the recovery path in a try/catch: log and record the exception on the span without re-throwing. The claim expires on its TTL and the next tick re-claims the schedule, so swallowing the cleanup failure is safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
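The recovery guard described in this commit can be sketched as below. It is an illustration under assumptions, not the code in schedule-execution.ts: `recoverFromOuterFailure`, `RecoveryDeps`, `scheduleInfraRetry`, and `logError` are hypothetical names, and only releaseClaim and the span's recordException are named in the PR.

```typescript
// Hypothetical dependency bundle; only releaseClaim and recordException
// correspond to names mentioned in the PR.
interface RecoveryDeps {
  scheduleInfraRetry(error: unknown): Promise<void>
  releaseClaim(): Promise<void>
  logError(message: string, error: Error): void
  recordException(error: Error): void
}

// Runs the outer catch's recovery steps and never throws: a failure here
// is logged and recorded, then swallowed.
async function recoverFromOuterFailure(error: unknown, deps: RecoveryDeps): Promise<void> {
  try {
    await deps.scheduleInfraRetry(error)
    await deps.releaseClaim()
  } catch (recoveryError) {
    const err =
      recoveryError instanceof Error ? recoveryError : new Error(String(recoveryError))
    // Safe to swallow per the PR: the claim expires on its TTL and the
    // next tick re-claims the schedule.
    deps.logError('Schedule recovery failed; relying on claim TTL', err)
    deps.recordException(err)
  }
}
```

Keeping the guard in one function that is guaranteed not to reject means a secondary failure during cleanup cannot escape run() and fault the trigger.dev run.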
@greptile review
…path

Guards the race fix on the infra-error path so a future refactor can't silently drop the await. Addresses Greptile review feedback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
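One way such an ordering guard could be written, sketched without a test framework: the recorder logs when post-execution work actually resolves and when the finalized flag is read, and the assertion fails if the read came first. All names here (`createRecorder`, `assertAwaitedBeforeRead`, the step labels) are hypothetical; the actual test in webhook-execution.test.ts may be structured differently.

```typescript
type Step = 'postExecutionResolved' | 'flagRead'

// Builds instrumented stand-ins that record the order of events.
function createRecorder() {
  const steps: Step[] = []
  return {
    steps,
    session: {
      // Records on resolution (a microtask later), not on call, so a
      // dropped await shows up as 'flagRead' landing first.
      waitForPostExecution: (): Promise<void> =>
        Promise.resolve().then(() => {
          steps.push('postExecutionResolved')
        }),
    },
    wasExecutionFinalizedByCore: (_error: unknown): boolean => {
      steps.push('flagRead')
      return false // simulate the infra-error (non-finalized) path
    },
  }
}

// Throws unless post-execution resolved before the flag was read.
function assertAwaitedBeforeRead(steps: Step[]): void {
  const resolvedAt = steps.indexOf('postExecutionResolved')
  const readAt = steps.indexOf('flagRead')
  if (readAt === -1) throw new Error('finalized flag was never read')
  if (resolvedAt === -1 || resolvedAt > readAt) {
    throw new Error('finalized flag was read before waitForPostExecution resolved')
  }
}
```

A handler that awaits the session before calling the check passes the assertion; one that calls waitForPostExecution() without awaiting it fails, which is the regression the commit is guarding against.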
@greptile review
Problem

#eng-errors is flooded with webhook-execution trigger.dev run failures, and a trickle of schedule-execution ones. The errors are almost all things that shouldn't fault a background job:

Webhook: user-caused workflow failures, already recorded in the user's execution logs:
- Gmail 2 is missing required fields: Label (serializer validation)
- Knowledge 2: "semantic_query" doesn't exist on block "restructure" (invalid field reference)
- Firecrawl 1: Invalid URL, Router 1: Invalid request body, provider 401/400/404, etc.

Schedule: a secondary DB failure during error recovery (e.g. a transient blip while releasing the claim), which double-faults during cleanup.

webhook-execution re-threw every error; with maxAttempts: 1 any throw marks the run failed and alerts, even for a user/workflow problem. schedule-execution already treats workflow failures as recorded errors (not faults), but its outermost catch's own recovery code was unguarded, so a cleanup-write failure escaped run().

Changes

webhook-execution.ts: fault vs error
- await loggingSession.waitForPostExecution() before reading the finalized flag (set inside a fire-and-forget post-execution promise; the read must be reliable once branches diverge).
- When the failure was finalized by core, return { success: false, ... }; the run completes, no alert.

Every error in the logs is thrown from inside executeWorkflowCore (serializer validation at execution-core.ts:474, or block/provider failures via the DAG executor), which force-starts logging and finalizes before re-throwing, so they're all finalized-by-core and stop faulting. Verified against prod run run_cmpwquraf4tb60iogze9dg1fr.

schedule-execution.ts: guard the recovery path
- The outermost catch's recovery calls (the infra retry and releaseClaim) were unguarded; wrap them in a try/catch that logs and records the exception on the span without re-throwing. The claim expires on its TTL and the next tick re-claims the schedule, so swallowing a cleanup-write failure is safe.

Both: keep investigability

The exception is recorded on the run's OTel span via recordException (no ERROR status, so it doesn't fault the run) so these stay visible in Tempo; they also remain in the execution logs and logger.error → Loki.

Notes / risk
- webhook-execution is maxAttempts: 1, so nothing was ever retried; only the run's final status changes (FAILED → COMPLETED). Schedule's recovery swallow is also safe (TTL + next tick).
- failed → completed; since webhookIdempotency is retryFailures: false, this reduces alerts (duplicate deliveries of a failed webhook previously re-threw and re-faulted).
- logger.error (Loki).
- knowledge-process-document task (maxAttempts: 3, Unsupported fileUrl scheme user error) is a separate follow-up.

Test plan
- bun run test background/webhook-execution.test.ts: added cases: finalized → resolves { success: false } + records span exception; non-finalized → throws + logs.
- bun run type-check: clean.
- bun run lint:check: clean (2 warnings are pre-existing in untouched service.ts).

🤖 Generated with Claude Code