feat(workflow-engine): add retryOnTimeout opt-in for step timeout retries #5089

Draft
abcxff wants to merge 1 commit into 05-18-fix_rivetkit_wire_cancellationtoken_through_waitfornamesavailable_to_eliminate_kv_busy-polling from 05-21-feat_workflow-engine_add_retryontimeout_opt-in_for_step_timeout_retries

Conversation

@abcxff
Contributor

@abcxff abcxff commented May 21, 2026

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app

railway-app Bot commented May 21, 2026

🚅 Deployed to the rivet-pr-5089 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| website | 😴 Sleeping (View Logs) | Web | May 21, 2026 at 8:01 pm |
| frontend-cloud | 😴 Sleeping (View Logs) | Web | May 21, 2026 at 7:58 pm |
| ladle | ✅ Success (View Logs) | Web | May 21, 2026 at 7:49 pm |
| mcp-hub | ✅ Success (View Logs) | Web | May 21, 2026 at 7:47 pm |
| frontend-inspector | ❌ Build Failed (View Logs) | Web | May 21, 2026 at 7:46 pm |
| kitchen-sink | ❌ Build Failed (View Logs) | Web | May 21, 2026 at 7:46 pm |

@abcxff abcxff mentioned this pull request May 21, 2026
11 tasks
@claude
Contributor

claude Bot commented May 21, 2026

Code Review: feat(workflow-engine): add retryOnTimeout opt-in for step timeout retries

Overview

This PR adds a retryOnTimeout?: boolean flag to StepConfig so step timeouts can opt into normal retry behavior instead of always being treated as critical failures. The change is small and well-targeted.


Correctness

The core logic change in context.ts is correct: the guard error instanceof StepTimeoutError && !config.retryOnTimeout short-circuits to a critical failure only when the flag is absent or false, and falls through to the normal retry path otherwise.

The kind: "timeout" disambiguation for the exhausted-after-retries case is also correct — TryStepCatchKind already had "timeout" as a variant, and it now consistently means "this step failed due to a timeout" whether retried or not.
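To make the two points above concrete, here is a minimal sketch of the decision being described. It is not the code from context.ts: the `decide` function, the `Decision` type, the `"exhausted"` kind, the default of 3 retries, and the local `StepTimeoutError` and `StepConfig` definitions are all hypothetical stand-ins, assumed only for illustration.

```typescript
// Hypothetical stand-ins for illustration; not the engine's real types.
class StepTimeoutError extends Error {
  constructor(message: string) {
    super(message);
    // Keep instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, StepTimeoutError.prototype);
  }
}

interface StepConfig {
  maxRetries?: number; // assumed: retries allowed after the first attempt
  retryOnTimeout?: boolean;
}

type Decision =
  | { action: "retry" }
  | { action: "fail"; kind: "timeout" | "exhausted" };

// retriesSoFar counts the retries already made for this step.
function decide(error: unknown, retriesSoFar: number, config: StepConfig): Decision {
  const isTimeout = error instanceof StepTimeoutError;

  // Timeouts fail immediately unless the step opted in to retries.
  if (isTimeout && !config.retryOnTimeout) {
    return { action: "fail", kind: "timeout" };
  }

  const maxRetries = config.maxRetries ?? 3; // assumed default, not from the PR
  if (retriesSoFar < maxRetries) {
    return { action: "retry" };
  }

  // Retries are used up: a timeout still reports kind "timeout".
  return { action: "fail", kind: isTimeout ? "timeout" : "exhausted" };
}
```

Under these assumptions, a timeout produces kind "timeout" on both paths: immediately when the flag is off, and after the last retry when it is on.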


Issues

Missing tryStep test coverage

The new behavior interacts with tryStep's catch filter, but there is no test for that combination. Specifically:

await ctx.tryStep({
  name: "foo",
  retryOnTimeout: true,
  maxRetries: 2,
  catch: ["timeout"],
  run: async () => { /* always times out */ },
});

This should resolve to { ok: false, failure: { kind: "timeout", ... } } rather than throwing. Without a test, it is easy for a future refactor to silently break this. There is a try.test.ts that already covers the tryStep + catch surface — adding a case there would be natural.
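As a rough illustration of what such a test would assert, the sketch below models the behavior with a toy runner. Everything in it is hypothetical: `toyTryStep`, `withTimeout`, the `timeoutMs` option, the `"error"` kind, and the reading of `maxRetries` as "retries after the first attempt" are assumptions, and none of it is the real tryStep API or the try.test.ts harness.

```typescript
// Toy model only; names and semantics here are assumptions, not the engine's API.
class StepTimeoutError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, StepTimeoutError.prototype);
  }
}

type TryResult<T> =
  | { ok: true; value: T }
  | { ok: false; failure: { kind: string; error: unknown } };

interface ToyTryStepOptions<T> {
  timeoutMs: number;
  maxRetries: number; // assumed: retries allowed after the first attempt
  retryOnTimeout?: boolean;
  catch: string[];
  run: () => Promise<T>;
}

// Reject with StepTimeoutError if the promise does not settle within ms.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new StepTimeoutError("step timed out")), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

async function toyTryStep<T>(opts: ToyTryStepOptions<T>): Promise<TryResult<T>> {
  let retries = 0;
  for (;;) {
    try {
      const value = await withTimeout(opts.run(), opts.timeoutMs);
      return { ok: true, value };
    } catch (error) {
      const isTimeout = error instanceof StepTimeoutError;
      const retryable = !isTimeout || opts.retryOnTimeout === true;
      if (retryable && retries < opts.maxRetries) {
        retries++;
        continue;
      }
      // Out of retries (or not retryable): apply the catch filter.
      const kind = isTimeout ? "timeout" : "error";
      if (opts.catch.indexOf(kind) !== -1) {
        return { ok: false, failure: { kind, error } };
      }
      throw error;
    }
  }
}
```

In this model, a step that always exceeds its timeout with `retryOnTimeout: true`, `maxRetries: 2`, and `catch: ["timeout"]` runs three times and then resolves to a failure result instead of throwing, which is the shape the missing test should pin down against the real implementation.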

Multi-line comment in context.ts

Per project conventions, comments should be one short line max. The new two-line comment at lines 946-947 should be condensed:

// Timeout errors are critical by default; opt in to retries with retryOnTimeout: true.
if (error instanceof StepTimeoutError && !config.retryOnTimeout) {

Grammar nit in retries.md

when retries exhaust on a timeout the try-step failure kind is "timeout"

Should be "when retries are exhausted".


Minor observations

  • The test timing (5 ms step timeout, 25 ms step duration, 10 ms inter-run wait) is tight but consistent with the existing steps.test.ts pattern for yield-mode tests, so it is acceptable.
  • The StepConfig JSDoc comment for retryOnTimeout is appropriately concise.
  • No persistence schema changes are required since retryOnTimeout is call-site config, not stored state.

Summary

The implementation is correct and the docs update is accurate. Two things worth addressing before merge:

  1. Add a tryStep + catch: ["timeout"] + retryOnTimeout: true test to try.test.ts.
  2. Collapse the two-line comment in context.ts to one line.

The grammar fix in the docs is minor and can be bundled with the above.
