Skip to content

hotfix: KEEP-1541 reconcile spurious max-retries failures in workflow executor#487

Merged
joelorzet merged 2 commits intostagingfrom
KEEP-1541-Hotfix-Workflows-wrongly-as-success-when-a-step-fails-Webhook-node
Mar 4, 2026
Merged

hotfix: KEEP-1541 reconcile spurious max-retries failures in workflow executor#487
joelorzet merged 2 commits intostagingfrom
KEEP-1541-Hotfix-Workflows-wrongly-as-success-when-a-step-fails-Webhook-node

Conversation

@joelorzet
Copy link
Collaborator

@joelorzet joelorzet commented Mar 4, 2026

Summary

Workflows are being incorrectly marked as error in workflow_executions even though all step logs show status: success and the SDK run shows status: completed. The error message is always: Step "step//...//stepName" exceeded max retries (X retries).

This affects sendWebhookStep and conditionStep most frequently.

Root Cause

The Workflow DevKit's "use step" durability layer tracks step completion through an internal event-sourcing mechanism. When the SDK's state tracking encounters a conflict (e.g., step already completed, state replay mismatch), it throws "exceeded max retries" even though:

  1. The step function itself returned a success result
  2. withStepLogging already recorded a status: success log in workflow_execution_logs
  3. The actual side effect (webhook call, condition evaluation) completed correctly

The error is caught by executeNode()'s catch block and stored as results[nodeId] = { success: false, error: ... }, which causes finalSuccess to be false -- marking the entire workflow as error.

Fix

After all nodes finish execution but before computing finalSuccess, cross-reference any failed results that contain "exceeded max retries" errors against the actual execution logs in workflow_execution_logs:

  • If every log entry for that node has status: success (no error logs at all), the SDK error was spurious -- override the in-memory results to success using the logged output
  • If there is any error log for the node, keep the failure -- this prevents masking real failures where a retry legitimately failed

The in-memory results mutation happens before finalSuccess is computed, so triggerStep({ _workflowComplete: { status } }) writes the correct status to DB. No separate DB update is needed.

Architecture: HTTP Loopback Pattern

The workflow bundler rejects Node.js modules (like nanoid pulled in via db/schema), so .workflow.ts files cannot import DB modules -- not even dynamically (the bundler traces import() calls), and functions cannot be passed as parameters (the SDK serializes workflow arguments for durability). To work around this:

  1. keeperhub/api/internal/execution-logs/route.ts -- internal API endpoint that queries workflow_execution_logs for a given execution + node IDs
  2. keeperhub/lib/fetch-execution-logs.ts -- HTTP loopback helper (same pattern as execution-fallback.ts) that calls the internal endpoint via fetch(). No DB imports, safe for the workflow bundle
  3. lib/workflow-executor.workflow.ts -- imports fetchExecutionLogs and calls it after node execution, before finalSuccess computation

This approach:

  • Does not change maxRetries -- keeping it at 0 means no duplicate step executions (no double webhook calls)
  • Does not modify the catch block -- the error is still caught and recorded normally
  • Only reconciles at the end -- no interference with step execution flow
  • Is conservative -- only overrides when there is zero evidence of actual failure in the logs
  • Does not change WorkflowExecutionInput -- no caller modifications needed

Changes

  • lib/workflow-executor.workflow.ts -- reconciles after node execution, before finalSuccess computation
  • keeperhub/lib/fetch-execution-logs.ts -- HTTP loopback helper to fetch execution logs without DB imports
  • keeperhub/api/internal/execution-logs/route.ts -- internal API endpoint for execution log queries
  • app/api/internal/execution-logs/route.ts -- thin wrapper (re-exports from keeperhub)
  • keeperhub/lib/max-retries-reconciler.ts -- pure reconciliation logic (used by tests)
  • tests/unit/max-retries-reconciler.test.ts -- 8 unit tests covering all reconciliation scenarios

Scenarios Tested

Scenario Logs in DB Result Behavior
Step genuinely fails (HTTP 500) error log success: false Unchanged -- no max-retries error
Step fails + SDK throws max retries success + error logs success: false Kept as failure (error log exists)
Step succeeds + SDK throws spurious max retries success log only success: true Overridden to success
No logs exist for node none success: false Kept as failure (no evidence of success)
Multiple nodes, mixed outcomes varies per node per-node Each node reconciled independently

@joelorzet joelorzet force-pushed the KEEP-1541-Hotfix-Workflows-wrongly-as-success-when-a-step-fails-Webhook-node branch 4 times, most recently from 0e2907a to 13c58e1 Compare March 4, 2026 14:31
@github-actions
Copy link

github-actions bot commented Mar 4, 2026

🚀 PR Environment Deployed

Your PR environment has been successfully deployed!

Environment Details:

Components:

  • Keeperhub Application
  • PostgreSQL Database (isolated instance)
  • LocalStack (SQS emulation)
  • Redis (isolated instance)
  • Scheduler Dispatcher (staging image)
  • Scheduler Executor (staging image)
  • SC Event Tracker (staging image)
  • SC Event Worker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

zdumitru
zdumitru previously approved these changes Mar 4, 2026
@joelorzet joelorzet force-pushed the KEEP-1541-Hotfix-Workflows-wrongly-as-success-when-a-step-fails-Webhook-node branch from 13c58e1 to 84a60da Compare March 4, 2026 15:24
…odule

Move the inline max-retries reconciliation logic in the workflow
executor into the dedicated max-retries-reconciler module, calling
getFailedMaxRetriesNodeIds and reconcileMaxRetriesFailures instead
of duplicating the logic. This ensures the unit tests exercise the
actual production code path.
@github-actions
Copy link

github-actions bot commented Mar 4, 2026

🚀 PR Environment Deployed

Your PR environment has been successfully deployed!

Environment Details:

Components:

  • Keeperhub Application
  • PostgreSQL Database (isolated instance)
  • LocalStack (SQS emulation)
  • Redis (isolated instance)
  • Scheduler Dispatcher (staging image)
  • Scheduler Executor (staging image)
  • SC Event Tracker (staging image)
  • SC Event Worker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

@joelorzet joelorzet requested a review from zdumitru March 4, 2026 15:40
@joelorzet joelorzet merged commit 8812ff4 into staging Mar 4, 2026
24 checks passed
@joelorzet joelorzet deleted the KEEP-1541-Hotfix-Workflows-wrongly-as-success-when-a-step-fails-Webhook-node branch March 4, 2026 16:06
@github-actions
Copy link

github-actions bot commented Mar 4, 2026

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-487
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

@github-actions
Copy link

github-actions bot commented Mar 4, 2026

🚀 PR Environment Deployed

Your PR environment has been successfully deployed!

Environment Details:

Components:

  • Keeperhub Application
  • PostgreSQL Database (isolated instance)
  • LocalStack (SQS emulation)
  • Redis (isolated instance)
  • Scheduler Dispatcher (staging image)
  • Scheduler Executor (staging image)
  • SC Event Tracker (staging image)
  • SC Event Worker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

joelorzet added a commit that referenced this pull request Mar 4, 2026
The max-retries reconciliation added in PR #487 used an HTTP loopback
to fetch execution logs from the DB. This fails silently in production
(network policy / missing env vars), leaving workflows marked as error
despite all steps succeeding.

Replace with a module-level Map populated by withStepLogging after each
successful step. Both run in the same Node.js process so no HTTP needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants