fix(platform): improve workflow engine reliability and prevent execution buildup #433

Merged: Israeltheminer merged 2 commits into main from fix/workflow-engine-reliability on Feb 12, 2026

@Israeltheminer Israeltheminer commented Feb 12, 2026

Summary

  • Concurrency guard: Scheduler now checks for running/pending executions before triggering a workflow, preventing duplicate executions from accumulating
  • Stuck execution recovery: New cron job (every 5 min) marks executions stuck in running or pending for >30 minutes as failed, breaking the orphan accumulation cycle
  • IMAP connection timeout: All 3 ImapFlow clients now set socketTimeout: 30s and greetingTimeout: 15s, preventing the indefinite hangs that kept onComplete from ever firing
  • Variable cleanup on completion: Transient keys (lastOutput, steps) are stripped from execution variables on completion, reducing per-execution storage footprint
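
As an illustration of the IMAP timeout bullet, here is a minimal sketch of sharing the two timeouts across several clients. It does not import ImapFlow: `ImapTimeoutOptions`, `IMAP_TIMEOUTS`, and `withImapTimeouts` are hypothetical names, and the host is a placeholder. The option names follow the follow-up commit in this PR (`greetingTimeout`, not `greetTimeout`), and the values are assumed to be in milliseconds.

```typescript
// Hypothetical local type: mirrors only the two timeout options discussed in this PR.
interface ImapTimeoutOptions {
  socketTimeout: number;   // ms of socket inactivity tolerated before the connection is closed
  greetingTimeout: number; // ms to wait for the server greeting after the connection opens
}

// Values taken from the PR summary: 30s socket timeout, 15s greeting timeout.
const IMAP_TIMEOUTS: ImapTimeoutOptions = {
  socketTimeout: 30_000,
  greetingTimeout: 15_000,
};

// Merge the shared timeouts into per-client options (host, port, auth, ...).
// Timeouts are spread last, so they override any value already present in `base`.
function withImapTimeouts<T extends object>(base: T): T & ImapTimeoutOptions {
  return { ...base, ...IMAP_TIMEOUTS };
}

// Placeholder connection details; the result would be passed to the ImapFlow constructor.
const options = withImapTimeouts({ host: "imap.example.com", port: 993, secure: true });
```

Keeping the timeouts in one constant means a fourth client cannot silently miss them, which is the failure mode the PR is closing off.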

Root cause

The workflow scheduler triggered new executions every minute without checking if one was already running. When IMAP connections hung indefinitely, the onComplete callback never fired, leaving executions permanently stuck in running status. Each cron tick spawned another duplicate, causing unbounded growth of wfExecutions, _scheduled_jobs, and related tables (927MB local DB).
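
The missing check can be sketched as a pure predicate. This is an assumption-laden illustration, not the PR's `shouldTriggerWorkflow`: the name `canTrigger`, the `ExecutionSummary` shape, and the `completed` status value are hypothetical, and the real helper may also take the schedule into account.

```typescript
// Assumed status values; only "running", "pending" and "failed" are named in the PR.
type ExecutionStatus = "pending" | "running" | "completed" | "failed";

// Hypothetical minimal view of an execution row.
interface ExecutionSummary {
  wfDefinitionId: string;
  status: ExecutionStatus;
}

// Returns true only when no execution of the given workflow is still in flight.
function canTrigger(wfDefinitionId: string, executions: ExecutionSummary[]): boolean {
  const inFlight = executions.some(
    (e) =>
      e.wfDefinitionId === wfDefinitionId &&
      (e.status === "running" || e.status === "pending"),
  );
  return !inFlight;
}
```

With a guard like this, a hung execution blocks further triggers for that workflow instead of letting each cron tick add another row, which is why the PR pairs it with stuck-execution recovery.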

Test plan

  • 7 tests for shouldTriggerWorkflow (scheduler dedup behavior)
  • 5 tests for recoverStuckExecutions (recovery of stuck running/pending, skip recent, batch handling)
  • 5 tests for stripTransientVariables (key stripping, edge cases)
  • All 254 tests pass, 0 lint errors

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added automatic recovery mechanism for stuck workflow executions (runs every 5 minutes)
    • Enhanced IMAP email connection stability with extended timeout configurations
    • Implemented workflow variable serialization and storage to prevent unbounded growth
    • Added execution conflict detection to prevent triggering workflows that are already running
  • Tests

    • Added comprehensive test coverage for stuck execution recovery scenarios
    • Added tests for workflow trigger scheduling logic
    • Added tests for variable cleanup utility

…ion buildup

- Add concurrency guard to scheduler to skip workflows with running executions
- Add stuck execution recovery cron (every 5 min) to mark hung executions as failed after 30 min
- Add 30s socket timeout and 15s greeting timeout to all IMAP connections
- Strip transient variables (lastOutput, steps) on execution completion to reduce storage
- Add tests for scheduler behavior, stuck recovery, and variable stripping
Fix TypeScript error - ImapFlowOptions uses greetingTimeout, not greetTimeout.

coderabbitai Bot commented Feb 12, 2026

Walkthrough

This pull request introduces workflow execution recovery and scheduling improvements. Changes include: a new 5-minute cron job to recover stuck workflow executions, a recovery utility that marks running or pending executions older than 30 minutes as failed, new scheduler helpers to query for running executions, logic to skip scheduling workflows that already have running instances, a utility to strip transient keys from workflow variables during serialization, and timeout configuration updates for IMAP email providers. Comprehensive tests cover recovery scenarios, scheduler logic, and variable stripping behavior. All changes are localized to the workflow engine, scheduler, and email provider modules.
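
The recovery rule described in the walkthrough can be sketched without any database access. The names below (`Execution`, `startedAt`, `findStuckExecutions`, `markFailed`) are hypothetical; the PR's `recoverStuckExecutions` runs inside the workflow engine and may use a different timestamp field and process rows in batches.

```typescript
// 30 minutes, the threshold stated in the PR.
const STUCK_THRESHOLD_MS = 30 * 60 * 1000;

// Hypothetical minimal execution shape; `startedAt` is epoch milliseconds.
interface Execution {
  id: string;
  status: string;
  startedAt: number;
}

// Select executions that are still running or pending and older than the threshold.
function findStuckExecutions(
  executions: Execution[],
  now: number,
  thresholdMs: number = STUCK_THRESHOLD_MS,
): Execution[] {
  return executions.filter(
    (e) =>
      (e.status === "running" || e.status === "pending") &&
      now - e.startedAt > thresholdMs,
  );
}

// Return a copy marked as failed; the input object is left untouched.
function markFailed(execution: Execution): Execution {
  return { ...execution, status: "failed" };
}
```

A cron handler would call `findStuckExecutions` with the current time on each 5-minute tick and persist the result of `markFailed` for every match; recent executions are skipped because their age is under the threshold.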

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.11%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main objectives: improving workflow engine reliability through execution deduplication, stuck execution recovery, and preventing execution buildup via multiple mechanisms. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@services/platform/convex/workflow_engine/helpers/engine/serialize_and_complete_execution_handler.ts`:
- Around line 58-65: The current code makes an unnecessary shallow copy via
Object.fromEntries(Object.entries(output)) before calling
stripTransientVariables; since stripTransientVariables is non-mutating, change
parsedVars to reference output directly when it's a non-null plain object (i.e.,
set parsedVars = typeof output === 'object' && output !== null &&
!Array.isArray(output) ? output as Record<string, unknown> : {}), then pass
parsedVars into stripTransientVariables and onward to serializeVariables
(preserving usage of stripTransientVariables, serializeVariables,
varsSerialized, and varsStorageId).
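
A minimal sketch of why the copy is redundant, assuming the stripping helper builds a new object. `stripTransient` and `toVariables` are hypothetical stand-ins; the PR's `stripTransientVariables` may be implemented differently, and only the key names `lastOutput` and `steps` come from the PR description.

```typescript
// Transient keys named in the PR description.
const TRANSIENT_KEYS = new Set<string>(["lastOutput", "steps"]);

// Non-mutating: builds a new object without the transient keys and leaves the input as-is.
function stripTransient(vars: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(vars).filter(([key]) => !TRANSIENT_KEYS.has(key)),
  );
}

// Narrow an unknown output to a plain object without copying it first.
// Arrays and null are rejected and replaced by an empty object.
function toVariables(output: unknown): Record<string, unknown> {
  return typeof output === "object" && output !== null && !Array.isArray(output)
    ? (output as Record<string, unknown>)
    : {};
}

const rawOutput: unknown = { customerId: 42, lastOutput: "large payload", steps: [1, 2, 3] };
const cleaned = stripTransient(toVariables(rawOutput));
```

Because `stripTransient` already returns a fresh object, copying `output` beforehand only adds an extra pass over its entries.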

In
`@services/platform/convex/workflow_engine/helpers/scheduler/scan_and_trigger.ts`:
- Around line 43-46: Remove the unnecessary type cast on runningExecutionsObj to
match lastExecutionTimesObj: call
ctx.runQuery(internal.workflow_engine.internal_queries.getRunningExecutions, {
wfDefinitionIds }) and assign its result directly to runningExecutionsObj
without "as Record<string, boolean>" so both query results follow the same
uncast pattern; update any downstream assumptions if needed to use the returned
structure from getRunningExecutions.

@Israeltheminer Israeltheminer merged commit 3e14417 into main Feb 12, 2026
10 of 15 checks passed