fix(run-engine): retry non-zero exit code errors #2467

ericallam · 2025-09-02T12:13:35Z

We’re also now saving the retryConfig from the BackgroundWorkerTask on
TaskRun.lockedRetryConfig when the run is first locked to the version.

changeset-bot · 2025-09-02T12:13:39Z

⚠️ No Changeset found

Latest commit: 75185b2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai · 2025-09-02T12:13:42Z

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a JSONB column lockedRetryConfig to public.TaskRun via a Prisma migration and updates schema.prisma with lockedRetryConfig Json? on the TaskRun model.
dequeueSystem sets lockedRetryConfig on first dequeue from the task’s retryConfig if not already present and avoids overwriting existing values.
When retrySettings are missing, retrying.ts may now look up and parse run.lockedRetryConfig, compute the next retry delay, and apply OOM-specific logic using the locked config. It returns retry plans or failures, with added guards and enhanced error handling.
Adds shouldLookupRetrySettings(error: TaskRunError) in core errors to decide when to consult locked retry settings.
Minor whitespace/newline edits to migration lock file.
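
As a rough illustration of the fallback described in the walkthrough, the sketch below narrows an unknown JSON value to retry options and turns it into a retry-or-fail decision. Everything in it is hypothetical: `SketchRetryOptions`, `parseRetryOptions`, `inferRetry`, and the injected `nextDelay` callback are stand-ins for illustration, not the code in retrying.ts.

```typescript
// Hypothetical, simplified shapes; the real types live in the PR's codebase.
type SketchRetryOptions = { maxAttempts: number; minTimeoutInMs: number };

type SketchRetryOutcome =
  | { outcome: "retry"; method: "queue"; settings: { timestamp: number; delay: number } }
  | { outcome: "fail_run" };

// Narrow an unknown JSON value (for example, a JSONB column) to retry options.
function parseRetryOptions(value: unknown): SketchRetryOptions | undefined {
  if (typeof value !== "object" || value === null) return undefined;
  const candidate = value as Record<string, unknown>;
  const maxAttempts = candidate["maxAttempts"];
  const minTimeoutInMs = candidate["minTimeoutInMs"];
  if (typeof maxAttempts !== "number" || typeof minTimeoutInMs !== "number") return undefined;
  return { maxAttempts, minTimeoutInMs };
}

// Decide between a queued retry and a failure when no retry settings were supplied.
function inferRetry(
  lockedRetryConfig: unknown,
  attemptNumber: number,
  now: number,
  nextDelay: (options: SketchRetryOptions, attempt: number) => number | undefined
): SketchRetryOutcome {
  const options = parseRetryOptions(lockedRetryConfig);
  if (!options) return { outcome: "fail_run" }; // missing or malformed config
  const delay = nextDelay(options, attemptNumber);
  if (delay === undefined) return { outcome: "fail_run" }; // no attempts left
  return { outcome: "retry", method: "queue", settings: { timestamp: now + delay, delay } };
}
```

Passing `now` and `nextDelay` as parameters is a choice made here only to keep the sketch deterministic and testable.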

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 215d4a8 and 75185b2.

📒 Files selected for processing (6)

internal-packages/database/prisma/migrations/20250902112516_add_locked_retry_config_to_task_run/migration.sql (1 hunks)
internal-packages/database/prisma/migrations/migration_lock.toml (1 hunks)
internal-packages/database/prisma/schema.prisma (2 hunks)
internal-packages/run-engine/src/engine/retrying.ts (5 hunks)
internal-packages/run-engine/src/engine/systems/dequeueSystem.ts (2 hunks)
packages/core/src/v3/errors.ts (1 hunks)

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal-packages/run-engine/src/engine/retrying.ts (1)
85-88: Global max-attempts check is off-by-one.

Using > allows one extra attempt; use >= to enforce the cap.

Apply this diff:
-  if (attemptNumber !== null && attemptNumber > MAX_TASK_RUN_ATTEMPTS) {
+  if (attemptNumber !== null && attemptNumber >= MAX_TASK_RUN_ATTEMPTS) {
     return { outcome: "fail_run", sanitizedError };
   }
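
To see why the comparison operator matters, here is a small simulation. It assumes the guard is consulted after each failed attempt with that attempt's 1-based number, which is an assumption about the call site that the review does not show. `attemptsExecuted` and both guard functions are hypothetical helpers.

```typescript
type Guard = (attemptNumber: number, max: number) => boolean;

// Guard as currently written: fails the run only once the cap is exceeded.
const strictGuard: Guard = (attemptNumber, max) => attemptNumber > max;

// Guard as suggested: fails the run as soon as the cap is reached.
const inclusiveGuard: Guard = (attemptNumber, max) => attemptNumber >= max;

// Counts how many attempts run when every attempt fails and the guard is
// checked after each failure to decide whether another attempt is allowed.
function attemptsExecuted(guard: Guard, max: number): number {
  let attemptNumber = 1;
  while (!guard(attemptNumber, max)) {
    attemptNumber++;
  }
  return attemptNumber;
}
```

Under that assumption and a cap of 3, `attemptsExecuted(strictGuard, 3)` returns 4 while `attemptsExecuted(inclusiveGuard, 3)` returns 3. If the real code checks the number of the attempt about to start rather than the one that just failed, `>` would already be correct, so the call site is worth confirming before applying the diff.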

🧹 Nitpick comments (7)

internal-packages/run-engine/src/engine/systems/dequeueSystem.ts (1)
406-409: Lock-once retry config: good guard; minor naming nit.

Logic correctly avoids overwriting an existing per-run lock. Consider a more explicit name to signal intent at the callsite.

Apply this diff:
-              const lockedRetryConfig = result.run.lockedRetryConfig
-                ? undefined
-                : result.task.retryConfig;
+              const lockedRetryConfigToPersist = result.run.lockedRetryConfig
+                ? undefined
+                : result.task.retryConfig;
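
The lock-once rule can be sketched in isolation as below. `RetryConfigJson` and `retryConfigToPersist` are hypothetical names, and returning `undefined` to mean "leave the stored value untouched" mirrors the convention this review attributes to Prisma updates.

```typescript
type RetryConfigJson = Record<string, unknown>;

// Returns the value to write, or undefined to leave the stored value untouched.
function retryConfigToPersist(
  existing: RetryConfigJson | null | undefined,
  taskRetryConfig: RetryConfigJson | null | undefined
): RetryConfigJson | undefined {
  if (existing) return undefined; // already locked: never overwrite
  return taskRetryConfig ?? undefined; // first lock: copy the task's config, if any
}
```

With this shape, a run locked with one config keeps it on later dequeues even if the task has since been redeployed with a different retry config.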

internal-packages/database/prisma/schema.prisma (1)
651-657: New per-run retry lock field: add a brief docstring for future readers.

Json? is appropriate; a short comment clarifying “locked at first dequeue, never overwritten” will prevent misuse.
   maxAttempts       Int?
-  lockedRetryConfig Json?
+  /// Retry options locked to this run at first dequeue; preserved across redeploys
+  lockedRetryConfig Json?

internal-packages/run-engine/src/engine/retrying.ts (5)
90-99: Prefer findUnique when querying by primary key.

findFirst with id can be findUnique for clarity and minor perf gains.

Apply this diff:
-  const run = await prisma.taskRun.findFirst({
+  const run = await prisma.taskRun.findUnique({
     where: {
       id: runId,
     },
     select: {
       maxAttempts: true,
       lockedRetryConfig: true,
     },
   });

115-155: Deriving retry settings from locked config looks solid; avoid shadowing param.

Local const retrySettings = ... shadows the function param and may confuse future edits. Rename.

Apply this diff:
-    const retrySettings = {
+    const inferredRetrySettings = {
       timestamp: Date.now() + nextDelay,
       delay: nextDelay,
     };
 
     return {
       outcome: "retry",
       method: "queue", // we'll always retry on the queue because usually having no settings means something bad happened
-      settings: retrySettings,
+      settings: inferredRetrySettings,
     };
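
A minimal sketch of the renamed shape, with a hypothetical `planRetry` function standing in for the real code path: the parameter keeps the name `retrySettings`, while the locally derived value gets its own name. The `now` parameter is an addition made here for determinism; the original diff uses `Date.now()`.

```typescript
type RetrySettings = { timestamp: number; delay: number };

// Prefers settings supplied by the caller; otherwise derives them from a delay.
function planRetry(
  retrySettings: RetrySettings | undefined,
  nextDelay: number,
  now: number
): RetrySettings {
  if (retrySettings) {
    return retrySettings;
  }
  // A distinct name makes it obvious this value was inferred, not passed in.
  const inferredRetrySettings: RetrySettings = {
    timestamp: now + nextDelay,
    delay: nextDelay,
  };
  return inferredRetrySettings;
}
```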

54-74: Consider using enhanced error for OOM detection too.

isOOMRunError(error) might miss cases the enhancer reclassifies as OOM. Using enhancedError (or enhancing before this check) could improve detection.

Apply this diff:
-  // OOM error (retry on a larger machine or fail)
-  if (isOOMRunError(error)) {
+  // OOM error (retry on a larger machine or fail)
+  const enhancedError = taskRunErrorEnhancer(error);
+  if (isOOMRunError(enhancedError)) {
Note: retain the existing later enhancedError or reuse this one.

164-178: Use findUnique here as well.

Primary-key fetch by id can use findUnique.

Apply this diff:
-    const run = await prisma.taskRun.findFirst({
+    const run = await prisma.taskRun.findUnique({
       where: {
         id: runId,
       },

205-212: Prefer structured logger over console.error.

Stay consistent with the rest of the codebase’s logging and include error metadata.

Apply this diff:
-  } catch (error) {
-    console.error("[FailedTaskRunRetryHelper] Failed to get execution retry", {
-      runId,
-      error,
-    });
+  } catch (error) {
+    logger?.error?.("[FailedTaskRunRetryHelper] Failed to get execution retry", {
+      runId,
+      error,
+    });

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between ddbae6b and 215d4a8.

🧰 Additional context used

📓 Path-based instructions (2)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

internal-packages/run-engine/src/engine/systems/dequeueSystem.ts
packages/core/src/v3/errors.ts
internal-packages/run-engine/src/engine/retrying.ts

{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

We use zod a lot in packages/core and in the webapp

Files:

packages/core/src/v3/errors.ts

🧠 Learnings (1)

📚 Learning: 2024-10-18T15:41:52.352Z

Learnt from: nicktrn
PR: triggerdotdev/trigger.dev#1418
File: packages/core/src/v3/errors.ts:364-371
Timestamp: 2024-10-18T15:41:52.352Z
Learning: In `packages/core/src/v3/errors.ts`, within the `taskRunErrorEnhancer` function, `error.message` is always defined, so it's safe to directly call `error.message.includes("SIGTERM")` without additional checks.

Applied to files:

packages/core/src/v3/errors.ts

🧬 Code graph analysis (2)

packages/core/src/v3/errors.ts (1)
- packages/core/src/v3/schemas/common.ts (2): TaskRunError (200-205), TaskRunError (207-207)

internal-packages/run-engine/src/engine/retrying.ts (2)
- packages/core/src/v3/errors.ts (3): taskRunErrorEnhancer (671-781), shouldRetryError (285-347), shouldLookupRetrySettings (349-373)
- apps/webapp/app/v3/services/completeAttempt.server.ts (1): run (452-550)

🔇 Additional comments (11)

internal-packages/database/prisma/migrations/migration_lock.toml (1)

2-3: No-op change acknowledged.

Comment tweak looks fine; provider remains unchanged.

internal-packages/run-engine/src/engine/systems/dequeueSystem.ts (1)
419-419: Correctly avoids unintended writes.

Using ?? undefined ensures Prisma skips the field when we don't want to touch it.
-                  lockedRetryConfig: lockedRetryConfig ?? undefined,
+                  lockedRetryConfig: lockedRetryConfigToPersist ?? undefined,

internal-packages/database/prisma/schema.prisma (1)

154-154: Minor formatting change is fine.

internal-packages/database/prisma/migrations/20250902112516_add_locked_retry_config_to_task_run/migration.sql (1)

1-2: Migration matches schema and intent.

JSONB is the right choice; nullable aligns with “set on first lock”.

packages/core/src/v3/errors.ts (1)

349-373: Predicate is precise and side-effect free.

Returning true only for TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE matches the PR scope.
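
For readers without the file open, a predicate of this kind might look like the sketch below. The `SketchTaskRunError` union is a deliberately reduced, hypothetical stand-in for the real TaskRunError type, and the function name carries a `Sketch` suffix to make clear it is not the implementation in errors.ts; only the error code string comes from the review.

```typescript
// Reduced, hypothetical stand-in for the real TaskRunError union.
type SketchTaskRunError =
  | { type: "INTERNAL_ERROR"; code: string }
  | { type: "STRING_ERROR"; raw: string }
  | { type: "BUILT_IN_ERROR"; name: string; message: string };

// Pure predicate: true only for the non-zero exit code internal error.
function shouldLookupRetrySettingsSketch(error: SketchTaskRunError): boolean {
  switch (error.type) {
    case "INTERNAL_ERROR":
      return error.code === "TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE";
    default:
      return false;
  }
}
```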

internal-packages/run-engine/src/engine/retrying.ts (6)

6-6: Good call importing shouldLookupRetrySettings.

This aligns retry decisions with core error semantics.

76-83: Use of enhanced error for retryability = correct.

Evaluating shouldRetryError against enhancedError is the right move; it captures normalized error codes.

105-108: Failing when maxAttempts is null — confirm schema invariant.

If maxAttempts can be null at runtime, this path hard-fails all such runs. Confirm DB defaults/constraints guarantee a non-null maxAttempts post-migration; otherwise, consider a sensible default or clearer error.

151-154: Queue-only inference path — ensure upstream honors it.

CompleteAttemptService has a special flow for “inferred” retries. If the caller relies on an executionRetryInferred flag (vs. method === "queue"), ensure it’s still being set upstream for this path.

175-205: OOM retry config path reads from lockedRetryConfig — LGTM.

Validates JSON shape, respects preset equality, and returns machine + retry options.
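
One possible reading of that path, sketched with hand-rolled validation: the `outOfMemory.machine` field name, the `OomDecision` type, and both helper functions are assumptions for illustration, and "preset equality" is interpreted here as "do not retry when the configured OOM machine is the one the run already uses".

```typescript
type OomDecision = { outcome: "retry"; machine: string } | { outcome: "fail_run" };

// Extracts outOfMemory.machine from an unknown JSON value, if it has that shape.
function readOomMachine(lockedRetryConfig: unknown): string | undefined {
  if (typeof lockedRetryConfig !== "object" || lockedRetryConfig === null) return undefined;
  const outOfMemory = (lockedRetryConfig as Record<string, unknown>)["outOfMemory"];
  if (typeof outOfMemory !== "object" || outOfMemory === null) return undefined;
  const machine = (outOfMemory as Record<string, unknown>)["machine"];
  return typeof machine === "string" ? machine : undefined;
}

// Retries on the configured machine unless it is missing or already in use.
function decideOomRetry(lockedRetryConfig: unknown, currentMachine: string): OomDecision {
  const machine = readOomMachine(lockedRetryConfig);
  if (!machine || machine === currentMachine) return { outcome: "fail_run" };
  return { outcome: "retry", machine };
}
```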

139-143: No changes needed: calculateNextRetryDelay expects 1-based attempts.

The helper’s JSDoc states “If the first attempt has failed, this will be 1,” so using (attemptNumber ?? 1) correctly aligns with its contract.
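
To make the 1-based convention concrete, here is a generic exponential backoff sketch. It is not the body of calculateNextRetryDelay: `BackoffOptions` and `nextRetryDelay` are hypothetical, the option names are assumed, and jitter is left out so the output stays deterministic.

```typescript
type BackoffOptions = {
  maxAttempts: number;
  factor: number;
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
};

// `attempt` is 1-based: 1 means the first attempt has just failed.
// Returns undefined when no further attempt is allowed.
function nextRetryDelay(options: BackoffOptions, attempt: number): number | undefined {
  if (attempt >= options.maxAttempts) return undefined;
  // The first failure waits minTimeoutInMs, and each later failure multiplies by factor.
  const raw = options.minTimeoutInMs * Math.pow(options.factor, attempt - 1);
  return Math.round(Math.min(options.maxTimeoutInMs, raw));
}
```

With a 0-based attempt number the exponent would be off by one, so the first retry would wait `minTimeoutInMs / factor` instead of `minTimeoutInMs`.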

nicktrn · 2025-09-02T12:59:33Z

internal-packages/database/prisma/migrations/migration_lock.toml

 # Please do not edit this file manually
-# It should be added in your version-control system (i.e. Git)
-provider = "postgresql"
+# It should be added in your version-control system (e.g., Git)


What's with this?

coderabbitai bot reviewed Sep 2, 2025

nicktrn reviewed Sep 2, 2025

nicktrn approved these changes Sep 2, 2025

fix(run-engine): retry non-zero exit code errors

75185b2

We’re also now saving the retryConfig from the BackgroundWorkerTask on TaskRun.lockedRetryConfig when the run is first locked to the version

ericallam force-pushed the fix/non-zero-exit-codes-retrying branch from 215d4a8 to 75185b2 Compare September 2, 2025 13:09

ericallam merged commit 9b1877b into main Sep 2, 2025
27 of 28 checks passed

ericallam deleted the fix/non-zero-exit-codes-retrying branch September 2, 2025 13:12
