fix(platform): improve workflow engine reliability and prevent execution buildup #433
…ion buildup
- Add concurrency guard to scheduler to skip workflows with running executions
- Add stuck execution recovery cron (every 5 min) to mark hung executions as failed after 30 min
- Add 30s socket timeout and 15s greeting timeout to all IMAP connections
- Strip transient variables (lastOutput, steps) on execution completion to reduce storage
- Add tests for scheduler behavior, stuck recovery, and variable stripping
Fix TypeScript error - ImapFlowOptions uses greetingTimeout, not greetTimeout.
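As a rough illustration of the timeout settings named in these commits, the sketch below builds a connection options object with a 30s socket timeout and a 15s greeting timeout. The `ImapTimeoutOptions` interface is a local stand-in for the two relevant `ImapFlowOptions` fields, and `withImapTimeouts`, the host, and the port are hypothetical names chosen for the example, not code from this PR.

```typescript
// Local mirror of the two option names discussed above. The real
// ImapFlowOptions type has many more fields (host, port, auth, ...).
interface ImapTimeoutOptions {
  socketTimeout: number; // ms of socket inactivity tolerated before giving up
  greetingTimeout: number; // ms to wait for the server greeting after connecting
}

const IMAP_TIMEOUTS: ImapTimeoutOptions = {
  socketTimeout: 30 * 1000,
  greetingTimeout: 15 * 1000,
};

// Hypothetical helper: merge the shared timeouts into provider-specific options
// so every IMAP connection gets the same bounds.
function withImapTimeouts<T extends object>(base: T): T & ImapTimeoutOptions {
  return { ...base, ...IMAP_TIMEOUTS };
}

// Assumed example values; a real provider would pass its own host and auth.
const imapOptions = withImapTimeouts({
  host: "imap.example.com",
  port: 993,
  secure: true,
});
```

Keeping the timeouts in one constant means a misspelled key such as `greetTimeout` is caught by the type checker in a single place.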
📝 Walkthrough
This pull request introduces workflow execution recovery and scheduling improvements. Changes include: a new 5-minute cron job to recover stuck workflow executions, a recovery utility that marks running or pending executions older than 30 minutes as failed, new scheduler helpers to query for running executions, logic to skip scheduling workflows that already have running instances, a utility to strip transient keys from workflow variables during serialization, and timeout configuration updates for IMAP email providers. Comprehensive tests cover recovery scenarios, scheduler logic, and variable stripping behavior. All changes are localized to the workflow engine, scheduler, and email provider modules.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~35 minutes
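The recovery rule described in the walkthrough (running or pending executions older than 30 minutes become failed) can be sketched as a pure function. This is a minimal sketch under assumptions: the `Execution` shape, the `startedAt` field, and both function names are hypothetical, and the actual utility runs against the database from a cron job rather than over an in-memory array.

```typescript
type ExecutionStatus = "pending" | "running" | "completed" | "failed";

// Hypothetical record shape; the real table has more fields.
interface Execution {
  id: string;
  status: ExecutionStatus;
  startedAt: number; // epoch milliseconds
}

const STUCK_THRESHOLD_MS = 30 * 60 * 1000; // 30 minutes

// An execution counts as stuck if it is still active and older than the threshold.
function isStuck(execution: Execution, now: number): boolean {
  const active = execution.status === "running" || execution.status === "pending";
  return active && now - execution.startedAt > STUCK_THRESHOLD_MS;
}

// Returns a new array in which stuck executions are marked as failed.
// Recent or already finished executions are returned unchanged.
function markStuckAsFailed(executions: Execution[], now: number): Execution[] {
  return executions.map((execution): Execution =>
    isStuck(execution, now) ? { ...execution, status: "failed" } : execution,
  );
}
```

Passing `now` as a parameter keeps the rule deterministic, which makes the "skip recent" case easy to cover in a unit test.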
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@services/platform/convex/workflow_engine/helpers/engine/serialize_and_complete_execution_handler.ts`:
- Around lines 58-65: The current code makes an unnecessary shallow copy via
Object.fromEntries(Object.entries(output)) before calling
stripTransientVariables; since stripTransientVariables is non-mutating, change
parsedVars to reference output directly when it's a non-null plain object (i.e.,
set parsedVars = typeof output === 'object' && output !== null &&
!Array.isArray(output) ? output as Record<string, unknown> : {}), then pass
parsedVars into stripTransientVariables and onward to serializeVariables
(preserving usage of stripTransientVariables, serializeVariables,
varsSerialized, and varsStorageId).
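To make the suggestion concrete, here is a minimal sketch of a non-mutating `stripTransientVariables` together with the plain-object guard from the comment. The function body and the `toPlainVars` helper are assumptions written for illustration, since the PR's actual implementation is not shown here; only the transient key names (`lastOutput`, `steps`) come from the PR description.

```typescript
// Keys the PR describes as transient.
const TRANSIENT_KEYS: ReadonlyArray<string> = ["lastOutput", "steps"];

// Returns a new object without the transient keys; the input is left untouched,
// so callers do not need to copy it first.
function stripTransientVariables(
  vars: Record<string, unknown>,
): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const key of Object.keys(vars)) {
    if (TRANSIENT_KEYS.indexOf(key) === -1) {
      result[key] = vars[key];
    }
  }
  return result;
}

// Hypothetical helper mirroring the guard in the review comment:
// reference the output directly when it is a non-null, non-array object.
function toPlainVars(output: unknown): Record<string, unknown> {
  return typeof output === "object" && output !== null && !Array.isArray(output)
    ? (output as Record<string, unknown>)
    : {};
}

const stripped = stripTransientVariables(
  toPlainVars({ lastOutput: "ok", steps: [], customerId: "c1" }),
);
```

Because the strip step already builds a fresh object, the extra `Object.fromEntries(Object.entries(output))` pass adds a copy without adding safety.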
In
`@services/platform/convex/workflow_engine/helpers/scheduler/scan_and_trigger.ts`:
- Around line 43-46: Remove the unnecessary type cast on runningExecutionsObj to
match lastExecutionTimesObj: call
ctx.runQuery(internal.workflow_engine.internal_queries.getRunningExecutions, {
wfDefinitionIds }) and assign its result directly to runningExecutionsObj
without "as Record<string, boolean>" so both query results follow the same
uncast pattern; update any downstream assumptions if needed to use the returned
structure from getRunningExecutions.
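For context on how the result of `getRunningExecutions` is consumed, the sketch below filters workflow definitions against a map of running executions. The `Record<string, boolean>` shape is taken from the cast mentioned in the comment; the function name `selectWorkflowsToTrigger` and the sample IDs are hypothetical, and the real scheduler also weighs last execution times, which this sketch leaves out.

```typescript
// Hypothetical guard: keep only workflow definitions with no running execution.
// `running` is assumed to be keyed by workflow definition ID.
function selectWorkflowsToTrigger(
  wfDefinitionIds: string[],
  running: Record<string, boolean>,
): string[] {
  return wfDefinitionIds.filter((id) => running[id] !== true);
}

// Assumed sample data: wf_b already has an execution in flight.
const toTrigger = selectWorkflowsToTrigger(
  ["wf_a", "wf_b", "wf_c"],
  { wf_a: false, wf_b: true },
);
```

Treating a missing key the same as `false` means a workflow that has never run is still eligible on the next tick.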
Summary
- Executions stuck in `running` or `pending` for >30 minutes are marked as `failed`, breaking the orphan accumulation cycle
- IMAP connections set `socketTimeout: 30s` and `greetingTimeout: 15s`, preventing indefinite hangs that caused `onComplete` to never fire
- Transient variables (`lastOutput`, `steps`) are stripped from execution variables on completion, reducing per-execution storage footprint
Root cause
The workflow scheduler triggered new executions every minute without checking if one was already running. When IMAP connections hung indefinitely, the `onComplete` callback never fired, leaving executions permanently stuck in `running` status. Each cron tick spawned another duplicate, causing unbounded growth of `wfExecutions`, `_scheduled_jobs`, and related tables (927MB local DB).
Test plan
- `shouldTriggerWorkflow` (scheduler dedup behavior)
- `recoverStuckExecutions` (recovery of stuck running/pending, skip recent, batch handling)
- `stripTransientVariables` (key stripping, edge cases)

🤖 Generated with Claude Code