fix(run-engine): retry SIGSEGV errors #2514
Walkthrough
The change updates the retry logic in packages/core/src/v3/errors.ts by modifying shouldLookupRetrySettings. It now returns true for INTERNAL_ERROR when the error code is TASK_PROCESS_SIGSEGV, in addition to the existing cases TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE and TASK_PROCESS_SIGTERM. No other logic, function signatures, or exports are altered.
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/core/src/v3/errors.ts (1)
285-333: Blocking: SIGSEGV is still marked non‑retryable in shouldRetryError, undermining the PR.
TASK_PROCESS_SIGSEGV is listed in the “return false” block, so we won't actually retry even if we look up retry settings. Move it to the retryable list.
Apply this diff:

         case "INTERNAL_ERROR": {
           switch (error.code) {
             case "COULD_NOT_FIND_EXECUTOR":
             case "COULD_NOT_FIND_TASK":
             case "COULD_NOT_IMPORT_TASK":
             case "CONFIGURED_INCORRECTLY":
             case "TASK_ALREADY_RUNNING":
             case "TASK_PROCESS_SIGKILL_TIMEOUT":
    -        case "TASK_PROCESS_SIGSEGV":
             case "TASK_PROCESS_OOM_KILLED":
             case "TASK_PROCESS_MAYBE_OOM_KILLED":
             case "TASK_RUN_CANCELLED":
             case "MAX_DURATION_EXCEEDED":
             case "DISK_SPACE_EXCEEDED":
             case "OUTDATED_SDK_VERSION":
             case "TASK_RUN_HEARTBEAT_TIMEOUT":
             case "TASK_DID_CONCURRENT_WAIT":
             case "RECURSIVE_WAIT_DEADLOCK":
             // run engine errors
             case "TASK_DEQUEUED_INVALID_STATE":
             case "TASK_DEQUEUED_QUEUE_NOT_FOUND":
    -        case "TASK_HAS_N0_EXECUTION_SNAPSHOT":
    +        case "TASK_HAS_N0_EXECUTION_SNAPSHOT":
             case "TASK_RUN_DEQUEUED_MAX_RETRIES":
               return false;

             //new heartbeat error
             //todo
             case "TASK_RUN_STALLED_EXECUTING":
             case "TASK_RUN_STALLED_EXECUTING_WITH_WAITPOINTS":
             case "GRACEFUL_EXIT_TIMEOUT":
             case "HANDLE_ERROR_ERROR":
             case "TASK_INPUT_ERROR":
             case "TASK_OUTPUT_ERROR":
             case "TASK_MIDDLEWARE_ERROR":
             case "POD_EVICTED":
             case "POD_UNKNOWN_ERROR":
             case "TASK_EXECUTION_ABORTED":
             case "TASK_EXECUTION_FAILED":
             case "TASK_RUN_CRASHED":
             case "TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE":
             case "TASK_PROCESS_SIGTERM":
    +        case "TASK_PROCESS_SIGSEGV":
               return true;
🧹 Nitpick comments (2)
packages/core/src/v3/errors.ts (2)
849-856: Augment exception events for SIGSEGV too.
exceptionEventEnhancer directly prettifies types for OOM/MAYBE_OOM/SIGTERM but not SIGSEGV. Add it for consistent UX.

    -    case TaskRunErrorCodes.TASK_PROCESS_MAYBE_OOM_KILLED:
    -    case TaskRunErrorCodes.TASK_PROCESS_OOM_KILLED:
    -    case TaskRunErrorCodes.TASK_PROCESS_SIGTERM: {
    +    case TaskRunErrorCodes.TASK_PROCESS_MAYBE_OOM_KILLED:
    +    case TaskRunErrorCodes.TASK_PROCESS_OOM_KILLED:
    +    case TaskRunErrorCodes.TASK_PROCESS_SIGTERM:
    +    case TaskRunErrorCodes.TASK_PROCESS_SIGSEGV: {
           return {
             ...exception,
             ...getPrettyExceptionEvent(exception.type),
           };
         }
349-357: Add tests to lock behavior: SIGSEGV retries and settings lookup.
Please add unit tests asserting:
- shouldRetryError({ type: "INTERNAL_ERROR", code: "TASK_PROCESS_SIGSEGV", ... }) === true
- shouldLookupRetrySettings(...) === true for SIGSEGV
I can scaffold these if helpful.
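A rough sketch of what those two assertions could look like. The functions below are simplified stand-ins, not the real implementations in packages/core/src/v3/errors.ts: the names `shouldRetryErrorSketch`, `shouldLookupRetrySettingsSketch`, `InternalErrorSketch`, and `check` are hypothetical, and only the codes discussed in this review are modelled. A real test would import the actual functions and use the repository's test runner instead.

```typescript
// Simplified stand-ins for illustration only. The real functions handle many
// more error types and codes; here every unlisted code falls through to false.
type InternalErrorSketch = {
  type: "INTERNAL_ERROR";
  code: string;
};

// Assumes the blocking fix is applied: SIGSEGV sits in the retryable group.
function shouldRetryErrorSketch(error: InternalErrorSketch): boolean {
  switch (error.code) {
    case "TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE":
    case "TASK_PROCESS_SIGTERM":
    case "TASK_PROCESS_SIGSEGV":
      return true;
    default:
      return false;
  }
}

// Mirrors the change described in the walkthrough: SIGSEGV joins the two
// existing codes that trigger a retry-settings lookup.
function shouldLookupRetrySettingsSketch(error: InternalErrorSketch): boolean {
  switch (error.code) {
    case "TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE":
    case "TASK_PROCESS_SIGTERM":
    case "TASK_PROCESS_SIGSEGV":
      return true;
    default:
      return false;
  }
}

// Tiny assertion helper so the sketch does not depend on a test framework.
function check(condition: boolean, message: string): void {
  if (!condition) {
    throw new Error(message);
  }
}

const sigsegv: InternalErrorSketch = {
  type: "INTERNAL_ERROR",
  code: "TASK_PROCESS_SIGSEGV",
};
const oomKilled: InternalErrorSketch = {
  type: "INTERNAL_ERROR",
  code: "TASK_PROCESS_OOM_KILLED",
};

check(shouldRetryErrorSketch(sigsegv), "SIGSEGV should be retryable");
check(shouldLookupRetrySettingsSketch(sigsegv), "SIGSEGV should look up retry settings");
// Guard against over-broadening: OOM kills remain in the non-retryable group.
check(!shouldRetryErrorSketch(oomKilled), "OOM_KILLED should stay non-retryable");
```

Asserting both functions together matters here because the lookup alone has no effect while shouldRetryError still returns false for the same code.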
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
packages/core/src/v3/errors.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations
Files:
packages/core/src/v3/errors.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
We use zod a lot in packages/core and in the webapp
Files:
packages/core/src/v3/errors.ts
🧠 Learnings (2)
📓 Common learnings
Learnt from: nicktrn
PR: triggerdotdev/trigger.dev#1418
File: packages/core/src/v3/errors.ts:364-371
Timestamp: 2024-10-18T15:41:52.352Z
Learning: In `packages/core/src/v3/errors.ts`, within the `taskRunErrorEnhancer` function, `error.message` is always defined, so it's safe to directly call `error.message.includes("SIGTERM")` without additional checks.
📚 Learning: 2024-10-18T15:41:52.352Z
Learnt from: nicktrn
PR: triggerdotdev/trigger.dev#1418
File: packages/core/src/v3/errors.ts:364-371
Timestamp: 2024-10-18T15:41:52.352Z
Learning: In `packages/core/src/v3/errors.ts`, within the `taskRunErrorEnhancer` function, `error.message` is always defined, so it's safe to directly call `error.message.includes("SIGTERM")` without additional checks.
Applied to files:
packages/core/src/v3/errors.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (2)
packages/core/src/v3/errors.ts (2)
352-356: LGTM: Include SIGSEGV in retry-settings lookup.
Adding TASK_PROCESS_SIGSEGV to shouldLookupRetrySettings aligns with the PR intent.
306-311: Confirm and fix 'TASK_HAS_N0_EXECUTION_SNAPSHOT' (zero vs letter O)
Repo consistently uses the zero form at:
- packages/core/src/v3/schemas/common.ts:181
- packages/core/src/v3/errors.ts:308
- internal-packages/run-engine/src/engine/errors.ts:55
If the intended code is TASK_HAS_NO_EXECUTION_SNAPSHOT, update the canonical enum in schemas/common.ts and all usages — this alters the public schema and may be a breaking change.
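If a rename is pursued, one possible way to soften the breaking change is to keep accepting the zero form on input and normalise it to the corrected spelling. The sketch below is only an illustration of that idea: `LEGACY_CODE`, `CORRECTED_CODE`, and `normalizeErrorCode` are hypothetical names that do not exist in the repository, and whether such a shim is appropriate depends on where the code is persisted and consumed.

```typescript
// Hypothetical compatibility shim; none of these names exist in the repo.
// Digit zero, as the code is currently spelled across the repository.
const LEGACY_CODE = "TASK_HAS_N0_EXECUTION_SNAPSHOT";
// Letter O, the presumed intended spelling (an assumption, not confirmed).
const CORRECTED_CODE = "TASK_HAS_NO_EXECUTION_SNAPSHOT";

// Maps the legacy spelling to the corrected one and leaves every other
// code untouched, so previously stored errors keep resolving after a rename.
function normalizeErrorCode(code: string): string {
  return code === LEGACY_CODE ? CORRECTED_CODE : code;
}

const normalized = normalizeErrorCode("TASK_HAS_N0_EXECUTION_SNAPSHOT");
const untouched = normalizeErrorCode("TASK_PROCESS_SIGSEGV");
```

Here `normalized` holds the letter-O spelling and `untouched` is returned unchanged.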
Replays tend to go through