hotfix: KEEP-1541 reconcile spurious max-retries failures in workflow executor by joelorzet · Pull Request #487 · techops-services/keeperhub

joelorzet · 2026-03-04T14:02:10Z

Summary

Workflows are being incorrectly marked as error in workflow_executions even though all step logs show status: success and the SDK run shows status: completed. The error message is always: Step "step//...//stepName" exceeded max retries (X retries).

This affects sendWebhookStep and conditionStep most frequently.

Root Cause

The Workflow DevKit's "use step" durability layer tracks step completion through an internal event-sourcing mechanism. When the SDK's state tracking encounters a conflict (e.g., step already completed, state replay mismatch), it throws "exceeded max retries" even though:

The step function itself returned a success result
withStepLogging already recorded a status: success log in workflow_execution_logs
The actual side effect (webhook call, condition evaluation) completed correctly

The error is caught by executeNode()'s catch block and stored as results[nodeId] = { success: false, error: ... }, which causes finalSuccess to be false -- marking the entire workflow as error.

Fix

After all nodes finish execution but before computing finalSuccess, cross-reference any failed results that contain "exceeded max retries" errors against the actual execution logs in workflow_execution_logs:

If every log entry for that node has status: success (no error logs at all), the SDK error was spurious -- override the in-memory results to success using the logged output
If there is any error log for the node, keep the failure -- this prevents masking real failures where a retry legitimately failed

The in-memory results mutation happens before finalSuccess is computed, so triggerStep({ _workflowComplete: { status } }) writes the correct status to DB. No separate DB update is needed.

Architecture: HTTP Loopback Pattern

The workflow bundler rejects Node.js modules (like nanoid pulled in via db/schema), so .workflow.ts files cannot import DB modules -- not even dynamically (the bundler traces import() calls), and functions cannot be passed as parameters (the SDK serializes workflow arguments for durability). To work around this:

keeperhub/api/internal/execution-logs/route.ts -- internal API endpoint that queries workflow_execution_logs for a given execution + node IDs
keeperhub/lib/fetch-execution-logs.ts -- HTTP loopback helper (same pattern as execution-fallback.ts) that calls the internal endpoint via fetch(). No DB imports, safe for the workflow bundle
lib/workflow-executor.workflow.ts -- imports fetchExecutionLogs and calls it after node execution, before finalSuccess computation

This approach:

Does not change maxRetries -- keeping it at 0 means no duplicate step executions (no double webhook calls)
Does not modify the catch block -- the error is still caught and recorded normally
Only reconciles at the end -- no interference with step execution flow
Is conservative -- only overrides when there is zero evidence of actual failure in the logs
Does not change WorkflowExecutionInput -- no caller modifications needed

Changes

lib/workflow-executor.workflow.ts -- reconciles after node execution, before finalSuccess computation
keeperhub/lib/fetch-execution-logs.ts -- HTTP loopback helper to fetch execution logs without DB imports
keeperhub/api/internal/execution-logs/route.ts -- internal API endpoint for execution log queries
app/api/internal/execution-logs/route.ts -- thin wrapper (re-exports from keeperhub)
keeperhub/lib/max-retries-reconciler.ts -- pure reconciliation logic (used by tests)
tests/unit/max-retries-reconciler.test.ts -- 8 unit tests covering all reconciliation scenarios

Scenarios Tested

Scenario	Logs in DB	Result	Behavior
Step genuinely fails (HTTP 500)	error log	`success: false`	Unchanged -- no max-retries error
Step fails + SDK throws max retries	success + error logs	`success: false`	Kept as failure (error log exists)
Step succeeds + SDK throws spurious max retries	success log only	`success: true`	Overridden to success
No logs exist for node	none	`success: false`	Kept as failure (no evidence of success)
Multiple nodes, mixed outcomes	varies per node	per-node	Each node reconciled independently

github-actions · 2026-03-04T14:49:41Z

🚀 PR Environment Deployed

Your PR environment has been successfully deployed!

Environment Details:

🌐 URL: https://app-pr-487.keeperhub.com
📦 Namespace: pr-487
🏷️ Image Tag: 13c58e1
🔢 Commit: 13c58e10f21ae67075658de7205fb0673a537c9a

Components:

Keeperhub Application
PostgreSQL Database (isolated instance)
LocalStack (SQS emulation)
Redis (isolated instance)
Scheduler Dispatcher (staging image)
Scheduler Executor (staging image)
SC Event Tracker (staging image)
SC Event Worker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

…odule Move the inline max-retries reconciliation logic in the workflow executor into the dedicated max-retries-reconciler module, calling getFailedMaxRetriesNodeIds and reconcileMaxRetriesFailures instead of duplicating the logic. This ensures the unit tests exercise the actual production code path.

github-actions · 2026-03-04T15:38:07Z

🚀 PR Environment Deployed

Your PR environment has been successfully deployed!

Environment Details:

🌐 URL: https://app-pr-487.keeperhub.com
📦 Namespace: pr-487
🏷️ Image Tag: 82fb90f
🔢 Commit: 82fb90ff49bfed6e02e3e6d8b5f3dff2312e6732

Components:

Keeperhub Application
PostgreSQL Database (isolated instance)
LocalStack (SQS emulation)
Redis (isolated instance)
Scheduler Dispatcher (staging image)
Scheduler Executor (staging image)
SC Event Tracker (staging image)
SC Event Worker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

github-actions · 2026-03-04T16:07:43Z

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

Namespace: pr-487
All Helm releases (Keeperhub, Scheduler, Event services)
PostgreSQL Database (including data)
LocalStack, Redis
All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

github-actions · 2026-03-04T16:24:18Z

🚀 PR Environment Deployed

Your PR environment has been successfully deployed!

Environment Details:

🌐 URL: https://app-pr-487.keeperhub.com
📦 Namespace: pr-487
🏷️ Image Tag: 84a60da
🔢 Commit: 84a60dab35b638328b9f9b63688be15a74620f13

Components:

Keeperhub Application
PostgreSQL Database (isolated instance)
LocalStack (SQS emulation)
Redis (isolated instance)
Scheduler Dispatcher (staging image)
Scheduler Executor (staging image)
SC Event Tracker (staging image)
SC Event Worker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

The max-retries reconciliation added in PR #487 used an HTTP loopback to fetch execution logs from the DB. This fails silently in production (network policy / missing env vars), leaving workflows marked as error despite all steps succeeding. Replace with a module-level Map populated by withStepLogging after each successful step. Both run in the same Node.js process so no HTTP needed.

joelorzet force-pushed the KEEP-1541-Hotfix-Workflows-wrongly-as-success-when-a-step-fails-Webhook-node branch 4 times, most recently from 0e2907a to 13c58e1 Compare March 4, 2026 14:31

joelorzet added the deploy-pr-environment label Mar 4, 2026

joelorzet temporarily deployed to staging March 4, 2026 14:36 — with GitHub Actions Inactive

joelorzet temporarily deployed to staging March 4, 2026 14:41 — with GitHub Actions Inactive

zdumitru previously approved these changes Mar 4, 2026

View reviewed changes

fix: reconcile spurious max-retries failures in workflow executor

84a60da

joelorzet dismissed zdumitru’s stale review via 84a60da March 4, 2026 15:24

joelorzet force-pushed the KEEP-1541-Hotfix-Workflows-wrongly-as-success-when-a-step-fails-Webhook-node branch from 13c58e1 to 84a60da Compare March 4, 2026 15:24

joelorzet temporarily deployed to staging March 4, 2026 15:24 — with GitHub Actions Inactive

joelorzet temporarily deployed to staging March 4, 2026 15:29 — with GitHub Actions Inactive

joelorzet temporarily deployed to staging March 4, 2026 15:35 — with GitHub Actions Inactive

joelorzet requested a review from zdumitru March 4, 2026 15:40

zdumitru approved these changes Mar 4, 2026

View reviewed changes

joelorzet merged commit 8812ff4 into staging Mar 4, 2026
24 checks passed

joelorzet deleted the KEEP-1541-Hotfix-Workflows-wrongly-as-success-when-a-step-fails-Webhook-node branch March 4, 2026 16:06

joelorzet temporarily deployed to staging March 4, 2026 16:06 — with GitHub Actions Inactive

joelorzet mentioned this pull request Mar 4, 2026

fix: KEEP-1543 replace HTTP loopback reconciliation with in-memory step tracking #490

Merged

suisuss mentioned this pull request Mar 4, 2026

fix: KEEP-1541 code review fixes for max-retries reconciler #493

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hotfix: KEEP-1541 reconcile spurious max-retries failures in workflow executor#487

hotfix: KEEP-1541 reconcile spurious max-retries failures in workflow executor#487
joelorzet merged 2 commits intostagingfrom
KEEP-1541-Hotfix-Workflows-wrongly-as-success-when-a-step-fails-Webhook-node

joelorzet commented Mar 4, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 4, 2026

Uh oh!

github-actions bot commented Mar 4, 2026

Uh oh!

Uh oh!

github-actions bot commented Mar 4, 2026

Uh oh!

github-actions bot commented Mar 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joelorzet commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Fix

Architecture: HTTP Loopback Pattern

Changes

Scenarios Tested

Uh oh!

github-actions bot commented Mar 4, 2026

🚀 PR Environment Deployed

Uh oh!

github-actions bot commented Mar 4, 2026

🚀 PR Environment Deployed

Uh oh!

Uh oh!

github-actions bot commented Mar 4, 2026

🧹 PR Environment Cleaned Up

Uh oh!

github-actions bot commented Mar 4, 2026

🚀 PR Environment Deployed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joelorzet commented Mar 4, 2026 •

edited

Loading