-
-
Notifications
You must be signed in to change notification settings - Fork 847
fix(run-engine): retry non-zero exit code errors #2467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Caution Review failedThe pull request is closed. Walkthrough
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 💡 Knowledge Base configuration:
You can enable these sources in your CodeRabbit configuration. 📒 Files selected for processing (6)
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal-packages/run-engine/src/engine/retrying.ts (1)
85-88
: Global max-attempts check is off-by-one.Using
>
allows one extra attempt; use>=
to enforce the cap.Apply this diff:
- if (attemptNumber !== null && attemptNumber > MAX_TASK_RUN_ATTEMPTS) { + if (attemptNumber !== null && attemptNumber >= MAX_TASK_RUN_ATTEMPTS) { return { outcome: "fail_run", sanitizedError }; }
🧹 Nitpick comments (7)
internal-packages/run-engine/src/engine/systems/dequeueSystem.ts (1)
406-409
: Lock-once retry config: good guard; minor naming nit.Logic correctly avoids overwriting an existing per-run lock. Consider a more explicit name to signal intent at the callsite.
Apply this diff:
- const lockedRetryConfig = result.run.lockedRetryConfig - ? undefined - : result.task.retryConfig; + const lockedRetryConfigToPersist = result.run.lockedRetryConfig + ? undefined + : result.task.retryConfig;internal-packages/database/prisma/schema.prisma (1)
651-657
: New per-run retry lock field: add a brief docstring for future readers.Json? is appropriate; a short comment clarifying “locked at first dequeue, never overwritten” will prevent misuse.
maxAttempts Int? - lockedRetryConfig Json? + /// Retry options locked to this run at first dequeue; preserved across redeploys + lockedRetryConfig Json?internal-packages/run-engine/src/engine/retrying.ts (5)
90-99
: Prefer findUnique when querying by primary key.
findFirst
withid
can befindUnique
for clarity and minor perf gains.Apply this diff:
- const run = await prisma.taskRun.findFirst({ + const run = await prisma.taskRun.findUnique({ where: { id: runId, }, select: { maxAttempts: true, lockedRetryConfig: true, }, });
115-155
: Deriving retry settings from locked config looks solid; avoid shadowing param.Local
const retrySettings = ...
shadows the function param and may confuse future edits. Rename.Apply this diff:
- const retrySettings = { + const inferredRetrySettings = { timestamp: Date.now() + nextDelay, delay: nextDelay, }; return { outcome: "retry", method: "queue", // we'll always retry on the queue because usually having no settings means something bad happened - settings: retrySettings, + settings: inferredRetrySettings, };
54-74
: Consider using enhanced error for OOM detection too.
isOOMRunError(error)
might miss cases the enhancer reclassifies as OOM. UsingenhancedError
(or enhancing before this check) could improve detection.Apply this diff:
- // OOM error (retry on a larger machine or fail) - if (isOOMRunError(error)) { + // OOM error (retry on a larger machine or fail) + const enhancedError = taskRunErrorEnhancer(error); + if (isOOMRunError(enhancedError)) {Note: retain the existing later
enhancedError
or reuse this one.
164-178
: Use findUnique here as well.Primary-key fetch by
id
can usefindUnique
.Apply this diff:
- const run = await prisma.taskRun.findFirst({ + const run = await prisma.taskRun.findUnique({ where: { id: runId, },
205-212
: Prefer structured logger over console.error.Stay consistent with the rest of the codebase’s logging and include error metadata.
Apply this diff:
- } catch (error) { - console.error("[FailedTaskRunRetryHelper] Failed to get execution retry", { - runId, - error, - }); + } catch (error) { + logger?.error?.("[FailedTaskRunRetryHelper] Failed to get execution retry", { + runId, + error, + });
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
internal-packages/database/prisma/migrations/20250902112516_add_locked_retry_config_to_task_run/migration.sql
(1 hunks)internal-packages/database/prisma/migrations/migration_lock.toml
(1 hunks)internal-packages/database/prisma/schema.prisma
(2 hunks)internal-packages/run-engine/src/engine/retrying.ts
(5 hunks)internal-packages/run-engine/src/engine/systems/dequeueSystem.ts
(2 hunks)packages/core/src/v3/errors.ts
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}
: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations
Files:
internal-packages/run-engine/src/engine/systems/dequeueSystem.ts
packages/core/src/v3/errors.ts
internal-packages/run-engine/src/engine/retrying.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
We use zod a lot in packages/core and in the webapp
Files:
packages/core/src/v3/errors.ts
🧠 Learnings (1)
📚 Learning: 2024-10-18T15:41:52.352Z
Learnt from: nicktrn
PR: triggerdotdev/trigger.dev#1418
File: packages/core/src/v3/errors.ts:364-371
Timestamp: 2024-10-18T15:41:52.352Z
Learning: In `packages/core/src/v3/errors.ts`, within the `taskRunErrorEnhancer` function, `error.message` is always defined, so it's safe to directly call `error.message.includes("SIGTERM")` without additional checks.
Applied to files:
packages/core/src/v3/errors.ts
🧬 Code graph analysis (2)
packages/core/src/v3/errors.ts (1)
packages/core/src/v3/schemas/common.ts (2)
TaskRunError
(200-205)TaskRunError
(207-207)
internal-packages/run-engine/src/engine/retrying.ts (2)
packages/core/src/v3/errors.ts (3)
taskRunErrorEnhancer
(671-781)shouldRetryError
(285-347)shouldLookupRetrySettings
(349-373)apps/webapp/app/v3/services/completeAttempt.server.ts (1)
run
(452-550)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (11)
internal-packages/database/prisma/migrations/migration_lock.toml (1)
2-3
: No-op change acknowledged.Comment tweak looks fine; provider remains unchanged.
internal-packages/run-engine/src/engine/systems/dequeueSystem.ts (1)
419-419
: Correctly avoids unintended writes.Using
?? undefined
ensures Prisma skips the field when we don't want to touch it.- lockedRetryConfig: lockedRetryConfig ?? undefined, + lockedRetryConfig: lockedRetryConfigToPersist ?? undefined,internal-packages/database/prisma/schema.prisma (1)
154-154
: Minor formatting change is fine.internal-packages/database/prisma/migrations/20250902112516_add_locked_retry_config_to_task_run/migration.sql (1)
1-2
: Migration matches schema and intent.JSONB is the right choice; nullable aligns with “set on first lock”.
packages/core/src/v3/errors.ts (1)
349-373
: Predicate is precise and side-effect free.Returning true only for
TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE
matches the PR scope.internal-packages/run-engine/src/engine/retrying.ts (6)
6-6
: Good call importing shouldLookupRetrySettings.This aligns retry decisions with core error semantics.
76-83
: Use of enhanced error for retryability = correct.Evaluating
shouldRetryError
againstenhancedError
is the right move; it captures normalized error codes.
105-108
: Failing when maxAttempts is null — confirm schema invariant.If
maxAttempts
can be null at runtime, this path hard-fails all such runs. Confirm DB defaults/constraints guarantee a non-nullmaxAttempts
post-migration; otherwise, consider a sensible default or clearer error.
151-154
: Queue-only inference path — ensure upstream honors it.CompleteAttemptService has a special flow for “inferred” retries. If the caller relies on an
executionRetryInferred
flag (vs.method === "queue"
), ensure it’s still being set upstream for this path.
175-205
: OOM retry config path reads from lockedRetryConfig — LGTM.Validates JSON shape, respects preset equality, and returns machine + retry options.
139-143
: No changes needed:calculateNextRetryDelay
expects 1-based attempts. The helper’s JSDoc states “If the first attempt has failed, this will be 1,” so using(attemptNumber ?? 1)
correctly aligns with its contract.
# Please do not edit this file manually | ||
# It should be added in your version-control system (i.e. Git) | ||
provider = "postgresql" | ||
# It should be added in your version-control system (e.g., Git) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's with this?
We’re also now saving the retryConfig from the BackgroundWorkerTask on TaskRun.lockedRetryConfig when the run is first locked to the version
215d4a8
to
75185b2
Compare
We’re also now saving the retryConfig from the BackgroundWorkerTask on
TaskRun.lockedRetryConfig when the run is first locked to the version