0e2907a to 13c58e1
🚀 PR Environment Deployed
Your PR environment has been successfully deployed. The environment will be automatically cleaned up when this PR is closed or merged.
zdumitru previously approved these changes Mar 4, 2026
13c58e1 to 84a60da
…odule Move the inline max-retries reconciliation logic in the workflow executor into the dedicated max-retries-reconciler module, calling getFailedMaxRetriesNodeIds and reconcileMaxRetriesFailures instead of duplicating the logic. This ensures the unit tests exercise the actual production code path.
zdumitru approved these changes Mar 4, 2026
🧹 PR Environment Cleaned Up
The PR environment has been successfully deleted. All resources have been cleaned up and will no longer incur costs.
joelorzet added a commit that referenced this pull request Mar 4, 2026
The max-retries reconciliation added in PR #487 used an HTTP loopback to fetch execution logs from the DB. This fails silently in production (network policy / missing env vars), leaving workflows marked as error despite all steps succeeding. Replace with a module-level Map populated by withStepLogging after each successful step. Both run in the same Node.js process so no HTTP needed.
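The module-level Map described in this commit message could look roughly like the sketch below. It is only an illustration under assumptions: the real `withStepLogging` signature is not shown in this PR, so the wrapper here is named `withStepLoggingSketch`, and `successfulSteps`, `recordStepSuccess`, `getStepSuccess`, and `clearExecution` are hypothetical names.

```typescript
// Hypothetical sketch: all identifiers below are assumptions, not the PR's code.
type StepOutput = unknown;

// Module-level store: executionId -> (nodeId -> output of the successful step).
// It lives in process memory, so it only works when writer and reader share a process.
const successfulSteps = new Map<string, Map<string, StepOutput>>();

// Record that a step finished successfully, keeping its output for later reconciliation.
function recordStepSuccess(executionId: string, nodeId: string, output: StepOutput): void {
  let byNode = successfulSteps.get(executionId);
  if (byNode === undefined) {
    byNode = new Map<string, StepOutput>();
    successfulSteps.set(executionId, byNode);
  }
  byNode.set(nodeId, output);
}

// Look up a recorded success. `found` distinguishes "no record" from "recorded undefined".
function getStepSuccess(
  executionId: string,
  nodeId: string
): { found: boolean; output?: StepOutput } {
  const byNode = successfulSteps.get(executionId);
  if (byNode === undefined || !byNode.has(nodeId)) {
    return { found: false };
  }
  return { found: true, output: byNode.get(nodeId) };
}

// Drop all records for an execution once it has been reconciled.
function clearExecution(executionId: string): void {
  successfulSteps.delete(executionId);
}

// Simplified stand-in for a logging wrapper: run the step, record only on success.
async function withStepLoggingSketch<T>(
  executionId: string,
  nodeId: string,
  step: () => Promise<T>
): Promise<T> {
  const output = await step(); // if the step throws, nothing is recorded
  recordStepSuccess(executionId, nodeId, output);
  return output;
}
```

In a sketch like this, calling `clearExecution` after the final status is written keeps the Map from growing for the lifetime of the process.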
Summary
Workflows are being incorrectly marked as `error` in `workflow_executions` even though all step logs show `status: success` and the SDK run shows `status: completed`. The error message is always:
`Step "step//...//stepName" exceeded max retries (X retries).`
This affects `sendWebhookStep` and `conditionStep` most frequently.
Root Cause
The Workflow DevKit's `"use step"` durability layer tracks step completion through an internal event-sourcing mechanism. When the SDK's state tracking encounters a conflict (e.g., step already completed, state replay mismatch), it throws "exceeded max retries" even though:
- `withStepLogging` already recorded a `status: success` log in `workflow_execution_logs`
The error is caught by `executeNode()`'s catch block and stored as `results[nodeId] = { success: false, error: ... }`, which causes `finalSuccess` to be `false` -- marking the entire workflow as `error`.
Fix
After all nodes finish execution but before computing `finalSuccess`, cross-reference any failed results that contain "exceeded max retries" errors against the actual execution logs in `workflow_execution_logs`:
- If the logs show `status: success` (no error logs at all), the SDK error was spurious -- override the in-memory `results` to success using the logged output
The in-memory `results` mutation happens before `finalSuccess` is computed, so `triggerStep({ _workflowComplete: { status } })` writes the correct status to the DB. No separate DB update is needed.
Architecture: HTTP Loopback Pattern
The workflow bundler rejects Node.js modules (like `nanoid` pulled in via `db/schema`), so `.workflow.ts` files cannot import DB modules -- not even dynamically (the bundler traces `import()` calls), and functions cannot be passed as parameters (the SDK serializes workflow arguments for durability). To work around this:
- `keeperhub/api/internal/execution-logs/route.ts` -- internal API endpoint that queries `workflow_execution_logs` for a given execution + node IDs
- `keeperhub/lib/fetch-execution-logs.ts` -- HTTP loopback helper (same pattern as `execution-fallback.ts`) that calls the internal endpoint via `fetch()`. No DB imports, safe for the workflow bundle
- `lib/workflow-executor.workflow.ts` -- imports `fetchExecutionLogs` and calls it after node execution, before `finalSuccess` computation
This approach:
- Leaves `maxRetries` unchanged -- keeping it at 0 means no duplicate step executions (no double webhook calls)
- Leaves `WorkflowExecutionInput` unchanged -- no caller modifications needed
Changes
- `lib/workflow-executor.workflow.ts` -- reconciles after node execution, before `finalSuccess` computation
- `keeperhub/lib/fetch-execution-logs.ts` -- HTTP loopback helper to fetch execution logs without DB imports
- `keeperhub/api/internal/execution-logs/route.ts` -- internal API endpoint for execution log queries
- `app/api/internal/execution-logs/route.ts` -- thin wrapper (re-exports from keeperhub)
- `keeperhub/lib/max-retries-reconciler.ts` -- pure reconciliation logic (used by tests)
- `tests/unit/max-retries-reconciler.test.ts` -- 8 unit tests covering all reconciliation scenarios
Scenarios Tested
- `success: false`
- `success: false`
- `success: true`
- `success: false`
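
The scenarios above exercise the reconciliation logic. A minimal sketch of that logic follows, under stated assumptions: the function names `getFailedMaxRetriesNodeIds` and `reconcileMaxRetriesFailures` come from this PR, but their signatures, the `NodeResult` and `ExecutionLog` shapes, and the rule that a node with no logs stays failed are guesses made for illustration.

```typescript
// Assumed shapes; the real types in max-retries-reconciler.ts may differ.
type NodeResult = { success: boolean; error?: string; data?: unknown };
type ExecutionLog = { nodeId: string; status: "success" | "error"; output?: unknown };

const MAX_RETRIES_MARKER = "exceeded max retries";

// Collect node IDs whose in-memory result failed with the SDK's max-retries error.
function getFailedMaxRetriesNodeIds(results: Record<string, NodeResult>): string[] {
  const failed: string[] = [];
  for (const nodeId of Object.keys(results)) {
    const result = results[nodeId];
    if (result && !result.success && (result.error ?? "").includes(MAX_RETRIES_MARKER)) {
      failed.push(nodeId);
    }
  }
  return failed;
}

// Override spurious failures in place; returns the node IDs that were flipped to success.
function reconcileMaxRetriesFailures(
  results: Record<string, NodeResult>,
  logs: ExecutionLog[]
): string[] {
  const overridden: string[] = [];
  for (const nodeId of getFailedMaxRetriesNodeIds(results)) {
    let sawSuccess = false;
    let sawError = false;
    let lastOutput: unknown = undefined;
    for (const log of logs) {
      if (log.nodeId !== nodeId) continue;
      if (log.status === "success") {
        sawSuccess = true;
        lastOutput = log.output; // keep the most recent logged output
      } else {
        sawError = true;
      }
    }
    // Assumption: override only when at least one success log exists and none are errors.
    if (sawSuccess && !sawError) {
      results[nodeId] = { success: true, data: lastOutput };
      overridden.push(nodeId);
    }
  }
  return overridden;
}
```

Because the sketch mutates `results` in place, a caller would run it before computing `finalSuccess`, which matches the ordering described under Fix.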