
fix(terminal): terminal console update for child spans + hitl state machine#4450

Merged
icecrasher321 merged 6 commits into staging from fix/terminal-logs
May 6, 2026

Conversation

icecrasher321 (Collaborator) commented May 5, 2026

Summary

Fix terminal log races so child workflow errors, error-path executions, reconnect replays, and scoped cancellations preserve the correct final block status instead of showing stale canceled or duplicate rows.

Also fix the HITL (human-in-the-loop) state machine so that executions are no longer left stuck in the wrong state.

Type of Change

  • Bug fix

Testing

Tested manually

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

vercel Bot commented May 5, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped May 6, 2026 2:24am

cursor Bot commented May 5, 2026

PR Summary

Medium Risk
Touches workflow execution streaming, reconnect replay, and cancellation paths (including Redis-backed run buffers and HITL state transitions), which are core runtime behaviors and can surface as stuck/incorrect run status if wrong.

Overview
Durable SSE run buffering and replay: execution now initializes stream metadata via initializeExecutionStreamMeta, writes buffered events with server-assigned ids, and publishes terminal events via writeTerminal (with fallback flushExecutionStreamReplayBuffer + stream erroring if a terminal event can’t be durably recorded).

Reconnect stream correctness: /executions/[executionId]/stream switches to readExecutionMetaState/readExecutionEventsState, adds explicit handling for unavailable/pruned buffers, validates strictly-increasing replay order (allowing gaps), and ensures a terminal event is replayed before sending [DONE].
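
The replay-ordering rule above (ids must strictly increase, gaps are tolerated, and a terminal event must be seen before `[DONE]` is sent) could be sketched roughly as follows. This is an illustration only, not the code in this PR: the `ReplayEvent` shape, the `validateReplay` name, and every terminal type other than `execution:cancelled` are assumptions.

```typescript
// Hypothetical event shape; the real buffer entries likely carry more fields.
interface ReplayEvent {
  eventId: number;
  type: string;
}

// Assumed terminal types. Only "execution:cancelled" is named in this PR.
const TERMINAL_TYPES = new Set<string>([
  "execution:completed",
  "execution:error",
  "execution:cancelled",
]);

interface ReplayCheck {
  ok: boolean;
  sawTerminal: boolean;
  lastEventId: number;
  reason?: string;
}

// Checks that replayed ids strictly increase after `afterEventId`.
// Gaps (for example 3 -> 7) are allowed; repeats and regressions are not.
function validateReplay(events: ReplayEvent[], afterEventId: number): ReplayCheck {
  let last = afterEventId;
  let sawTerminal = false;
  for (const event of events) {
    if (event.eventId <= last) {
      return {
        ok: false,
        sawTerminal,
        lastEventId: last,
        reason: `event ${event.eventId} is not after ${last}`,
      };
    }
    last = event.eventId;
    sawTerminal = sawTerminal || TERMINAL_TYPES.has(event.type);
  }
  return { ok: true, sawTerminal, lastEventId: last };
}
```

Under these assumptions, a stream route would send `[DONE]` only when both `ok` and `sawTerminal` are true, and otherwise keep the connection open or report the buffer as unavailable.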

Cancellation + HITL state machine: cancel endpoint now supports a paused-execution cancellation flow (beginPausedCancellation/completePausedCancellation + retries), blocks queued resumes after successful cancellation, and guarantees a durable execution:cancelled terminal event before confirming paused cancellation (with new failure reason reporting).

Client terminal/recovery fixes: SSE processing now awaits event handlers before acknowledging onEventId and surfaces SSEEventHandlerError/SSEStreamInterruptedError as recoverable, enabling execution-pointer checkpointing and more robust reconnect ownership/cancellation semantics; console reconciliation now avoids duplicate start rows and correctly applies child-workflow span updates (including nested/iterated spans), plus scopes “cancel running entries” by executionId.
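
The "await the handler, then acknowledge the id" ordering described above can be illustrated with a small sketch. The error class is named after the one mentioned in the summary, but its fields, the `StreamEvent` shape, and the `processEvents` helper are assumptions made for this example, not the hook's actual API.

```typescript
// Hypothetical event shape for illustration.
interface StreamEvent {
  eventId: number;
  type: string;
}

// Named after the error in the summary; the class body is an assumption.
class SSEEventHandlerError extends Error {
  readonly eventId: number;
  readonly original: unknown;
  constructor(eventId: number, original: unknown) {
    super(`handler failed for event ${eventId}`);
    this.name = "SSEEventHandlerError";
    this.eventId = eventId;
    this.original = original;
  }
}

// Runs each handler to completion before acknowledging its id, so a
// checkpoint never points past an event that was not fully applied.
async function processEvents(
  events: StreamEvent[],
  handle: (event: StreamEvent) => Promise<void> | void,
  onEventId: (eventId: number) => void,
): Promise<void> {
  for (const event of events) {
    try {
      await handle(event);
    } catch (err) {
      // Surface the failure as recoverable; the id is NOT acknowledged.
      throw new SSEEventHandlerError(event.eventId, err);
    }
    onEventId(event.eventId);
  }
}
```

With this ordering, a reconnect that resumes from the last acknowledged id replays the event whose handler failed instead of skipping it.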

Executor resume edge-case: execution engine skips starter-block fallback when resuming from a snapshot with no downstream work.

Reviewed by Cursor Bugbot for commit 6af6598.

greptile-apps Bot (Contributor) commented May 5, 2026

Greptile Summary

This PR overhauls the terminal console update pipeline and HITL state machine to eliminate race conditions: child-workflow span reconciliation is replaced with reconcileChildTraceSpans + spanConsoleIdentity, the reconnect flow now correctly drains running entries via finishRunningEntries, and the HITL cancel path becomes a proper two-phase transaction (beginPausedCancellation → completePausedCancellation).

  • Terminal log races: executionFinished guard, onEventId moved after event dispatch, and pendingChildWorkflowStarts map fix the main ordering bugs that caused stale or duplicate console rows.
  • Reconnect reliability: releaseReconnectStateWithoutTerminal now calls finishRunningEntries (fixing previously-stuck running entries), and the execution pointer is moved from IndexedDB to sessionStorage for tab-scoped ownership.
  • HITL cancellation: Two-phase cancel with sentinel status 'cancelling' guards against concurrent resume/cancel races; markResumeAttemptFailed allows retryable resume errors to roll back cleanly.
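
The two-phase cancel with a `'cancelling'` sentinel, as summarized in the last bullet, might look like the in-memory sketch below. The function names follow the ones in this PR, but the bodies, the `PausedExecution` shape, the `previousStatus` field, and the status names other than `'cancelling'` are assumptions; the real implementation persists this state in the database.

```typescript
// Status names other than "cancelling" are assumed for illustration.
type PausedStatus = "paused" | "cancelling" | "cancelled";

interface PausedExecution {
  status: PausedStatus;
  previousStatus?: PausedStatus; // hypothetical field used for rollback
  activeResumeClaimed: boolean;
}

// Phase 1: record the cancel intent. Returns true only when no resume is in
// flight, meaning the caller may publish the terminal event itself.
function beginPausedCancellation(row: PausedExecution): boolean {
  if (row.status !== "paused") return false;
  row.previousStatus = row.status;
  row.status = "cancelling";
  return !row.activeResumeClaimed;
}

// Phase 2: finalize once the terminal event has been durably recorded.
function completePausedCancellation(row: PausedExecution): boolean {
  if (row.status !== "cancelling") return false;
  row.status = "cancelled";
  delete row.previousStatus;
  return true;
}

// Rollback: restore the prior status if the terminal event could not be published.
function clearPausedCancellationIntent(row: PausedExecution): void {
  if (row.status === "cancelling" && row.previousStatus !== undefined) {
    row.status = row.previousStatus;
    delete row.previousStatus;
  }
}

// A queued resume must not start once cancellation has begun.
function canStartResume(row: PausedExecution): boolean {
  return row.status === "paused";
}
```

The sentinel is what closes the race: between phase 1 and phase 2 the row is neither resumable nor finally cancelled, so a concurrent resume sees `'cancelling'` and backs off.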

Confidence Score: 4/5

The core reconnect and terminal-log race fixes are sound; the main concern is the complexity of the new two-phase HITL cancellation path, which has an edge case in status rollback logic.

The PR correctly addresses the two main defects called out in prior review threads (stuck running entries after non-retryable reconnect, and child-span field mismatch silencing reconciliation). The new clearPausedCancellationIntent SQL misses the fully-resumed branch, leaving the pausedExecutions row with a transiently wrong status. No data loss results from this, but the incorrect intermediate status is observable in the DB and could confuse future reads until the next update corrects it.

apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts — the clearPausedCancellationIntent status rollback logic and the newly identical RESUMABLE_PAUSED_STATUSES/CANCELLABLE_PAUSED_STATUSES constants that will silently diverge if either needs to change.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts | Major refactor of HITL state machine: adds beginPausedCancellation/completePausedCancellation two-phase cancel, merging pause-point records instead of upsert, processQueuedResumes now loops to drain stale entries, and markResumeAttemptFailed for retryable resume errors. Has a logic gap in clearPausedCancellationIntent that misclassifies a fully-resumed-but-cancelling-rollback execution as partially_resumed. |
| apps/sim/app/workspace/[workspaceId]/w/[workflowId]/hooks/use-workflow-execution.ts | Reconnect flow refactored: uses reconnectAttemptNonce to trigger retries, adds releaseReconnectStateWithoutTerminal that now calls finishRunningEntries (fixing stuck-running entries), replaces immediate cancel with async cancel-then-await-terminal-event, adds onExecutionPaused handling. |
| apps/sim/app/workspace/[workspaceId]/w/[workflowId]/utils/workflow-execution-utils.ts | New reconcileChildTraceSpans and spanConsoleIdentity replace the old broken child-span reconciliation; adds pendingChildWorkflowStarts to handle races between block:childWorkflowStarted and block:started events; childWorkflowInstanceId correctly passed for reconciliation. |
| apps/sim/stores/terminal/console/storage.ts | Execution pointer moved from IndexedDB to sessionStorage (tab-scoped); adds merge-on-write for IndexedDB entries to preserve concurrently-updated rows; persist() now returns a Promise and clearWorkflowConsole correctly uses merge: false. |
| apps/sim/stores/terminal/console/store.ts | Adds finishRunningEntries (marks running entries done without cancelled flag) and childWorkflowInstanceId discrimination in matchesEntryForUpdate; cancelRunningEntries now scoped to optional executionId. |
| apps/sim/lib/execution/event-buffer.ts | Adds in-memory event buffer for non-Redis environments, resetExecutionStreamBuffer for resume replay, initializeExecutionStreamMeta with retry, writeTerminal for atomic terminal event + status update via Lua script. |
| apps/sim/app/api/workflows/[id]/executions/[executionId]/cancel/route.ts | Replaces direct cancelPausedExecution with a two-phase beginPausedCancellation + completePausedCancellationWithRetry, adds ensurePausedCancellationEventPublished to push the terminal event, and handles the 'cancelling'-while-actively-resuming race. |
| apps/sim/hooks/use-execution-stream.ts | All callbacks now async-awaited; SSEEventHandlerError/SSEStreamInterruptedError added for recoverable stream failures; onEventId moved after event dispatch so executionFinished guard works correctly; per-stream abort-key namespacing added. |
| apps/sim/lib/workflows/executor/queued-workflow-execution.ts | Terminal events now written with writeTerminal (atomic event + status); initializeExecutionStreamMeta failures abort execution early; TERMINAL_PUBLISH_ERROR sentinel propagated to HITL manager for retry classification. |
| apps/sim/lib/workflows/executor/execution-core.ts | Block lifecycle callbacks (start/complete) are now fire-and-forget for SSE publishing but synchronous for DB persistence; waitForLifecycleCallbacks drains pending SSE publishes before emitting terminal events. |
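
The difference between cancelRunningEntries (now scoped by executionId) and finishRunningEntries (done, but not flagged as cancelled), as described for the console store above, can be sketched as pure functions. The `ConsoleEntry` shape, the flag names, and the optional `executionId` parameter on `finishRunningEntries` are assumptions for illustration; the real store is stateful and its entries carry more fields.

```typescript
// Hypothetical console entry shape.
interface ConsoleEntry {
  id: string;
  executionId: string;
  isRunning: boolean;
  isCanceled: boolean;
}

// Shared helper: settles running entries, optionally limited to one execution.
function settleRunningEntries(
  entries: ConsoleEntry[],
  markCanceled: boolean,
  executionId?: string,
): ConsoleEntry[] {
  return entries.map((entry) => {
    const inScope = executionId === undefined || entry.executionId === executionId;
    if (!entry.isRunning || !inScope) return entry;
    return { ...entry, isRunning: false, isCanceled: markCanceled };
  });
}

// Marks running entries as cancelled. With an executionId, entries that
// belong to other executions are left untouched.
function cancelRunningEntries(entries: ConsoleEntry[], executionId?: string): ConsoleEntry[] {
  return settleRunningEntries(entries, true, executionId);
}

// Marks running entries as finished without setting the cancelled flag.
function finishRunningEntries(entries: ConsoleEntry[], executionId?: string): ConsoleEntry[] {
  return settleRunningEntries(entries, false, executionId);
}
```

Scoping matters when two executions of the same workflow write to one console: cancelling the first should not mark the second one's still-running rows as cancelled.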

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Cancel Request] --> B{isPausedCancellationPath?}
    B -- Yes --> C[beginPausedCancellation]
    C --> D{activeResume claimed?}
    D -- No --> E[status = cancelling, return true]
    D -- Yes --> F[status = cancelling, return false]
    F --> G[getPausedCancellationStatus]
    G -- null active claim --> H[fall to Redis cancel path]
    E --> I[ensurePausedCancellationEventPublished]
    I -- success --> J[completePausedCancellationWithRetry]
    I -- fail --> K[clearPausedCancellationIntent]
    J --> L[respond success:true]
    B -- No --> M[markExecutionCancelled via Redis]
    M --> N[abortManualExecution]
    N --> O[respond success: durablyRecorded OR locallyAborted]

Reviews (4): Last reviewed commit: "address greptile"

icecrasher321 changed the title from "fix(terminal): terminal console update for child spans" to "fix(terminal): terminal console update for child spans + hitl state machine" on May 6, 2026
@icecrasher321
Collaborator Author

bugbot run

@icecrasher321
Collaborator Author

@greptile

Comment thread apps/sim/lib/execution/event-buffer.ts
Comment thread apps/sim/hooks/use-execution-stream.ts Outdated
@icecrasher321
Collaborator Author

bugbot run

@icecrasher321
Collaborator Author

@greptile

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 6af6598.

@icecrasher321
Collaborator Author

@greptile

@icecrasher321 icecrasher321 merged commit cef351f into staging May 6, 2026
13 checks passed