fix(convex): shard workflows by executionId to eliminate OCC contention #425

Merged: Israeltheminer merged 2 commits into main from fix/occ-shard-by-execution-id on Feb 10, 2026

Israeltheminer (Collaborator) commented Feb 10, 2026

Summary

  • Changed workflow shard routing from a hash of wfDefinitionId (identical for every execution of a definition) to a hash of the unique executionId
  • The previous approach funnelled all concurrent executions of the same workflow definition into a single shard, causing persistent OCC failures on pendingStart (32 events), runStatus (13 events), and pendingCompletion (5 events)
  • Now the shard is derived after the execution record is inserted, so concurrent starts of the same definition spread across all 4 component instances
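To illustrate the routing change described above, here is a minimal sketch. The hash is an assumption: this PR page does not show the body of getShardIndex, so FNV-1a is swapped in purely for illustration, and the IDs are made up.

```typescript
// Hypothetical sketch: the real hash inside getShardIndex is not shown in this PR.
const NUM_SHARDS = 4; // the PR mentions 4 component instances

function getShardIndex(id: string, numShards: number = NUM_SHARDS): number {
  // 32-bit FNV-1a, chosen only for illustration (assumption).
  let hash = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  // Force an unsigned value before taking the modulus.
  return (hash >>> 0) % numShards;
}

// Before: every execution of one definition hashes the same key,
// so all of them are routed to the same shard.
const definitionId = "wfDef_invoice"; // made-up ID
const before = new Set<number>(
  ["exec_1", "exec_2", "exec_3"].map(() => getShardIndex(definitionId)),
);

// After: each execution hashes its own unique ID,
// so concurrent starts can be routed to different shards.
const after = ["exec_1", "exec_2", "exec_3"].map((id) => getShardIndex(id));

console.log("shards used before:", before.size, "shards chosen after:", after);
```

With the old key, `before` can only ever contain one shard index no matter how many executions start at once, which is why the contention concentrated on a single component instance.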

Test plan

  • Unit tests pass (11/11 in shard.test.ts)
  • New distribution test verifies 1000 unique IDs spread across all 4 shards
  • Lint passes (0 errors)
  • Monitor GlitchTip OCC events after deploy to confirm reduction
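The distribution check in the test plan could look roughly like the sketch below. It is not the contents of shard.test.ts: the hash (FNV-1a), the helper name countPerShard, and the generated `exec_N` IDs are all assumptions, and plain functions stand in for a test runner.

```typescript
// Hypothetical sketch of a distribution check; not the real shard.test.ts.
const NUM_SHARDS = 4;

function getShardIndex(id: string, numShards: number = NUM_SHARDS): number {
  // 32-bit FNV-1a, an assumed stand-in for the real hash.
  let hash = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % numShards;
}

// Count how many IDs are routed to each shard (hypothetical helper).
function countPerShard(ids: string[], numShards: number = NUM_SHARDS): number[] {
  const counts: number[] = new Array<number>(numShards).fill(0);
  for (const id of ids) {
    const shard = getShardIndex(id, numShards);
    counts[shard] = (counts[shard] ?? 0) + 1;
  }
  return counts;
}

// 1000 unique, made-up execution IDs.
const ids = Array.from({ length: 1000 }, (_, i) => `exec_${i}`);
const counts = countPerShard(ids);

// The property from the test plan: no shard is left empty.
const allShardsUsed = counts.every((n) => n > 0);
console.log(counts, allShardsUsed);
```

This only checks that every shard receives some work; a stricter test could also bound how far each count may drift from the mean.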

Summary by CodeRabbit

Release Notes

  • Documentation

    • Updated documentation to clarify execution sharding approach.
  • Tests

    • Added test coverage to verify even distribution of workflow executions across internal shards.
  • Refactor

    • Optimized workflow execution distribution logic to improve reliability during concurrent workflow starts.

The previous sharding used a deterministic hash of wfDefinitionId,
which funnelled all concurrent executions of the same workflow
definition into a single shard. This caused persistent OCC failures
on pendingStart, runStatus, and pendingCompletion tables.

Now the shard is derived from the unique executionId after insert,
so concurrent starts of the same definition spread across all 4
component instances.

coderabbitai Bot commented Feb 10, 2026

Walkthrough

This pull request shifts the workflow engine's shard distribution mechanism from being based on workflow definition IDs to being based on execution IDs. The getShardIndex function parameter is renamed from wfDefinitionId to id to reflect this change. The handleStartWorkflow function signature is updated to accept an array of WorkflowManager instances instead of a single manager and shard index. The shard index is now computed after the execution is created (using the execution ID) and used to select the appropriate manager and patch the execution record. Supporting tests are added to verify execution IDs are distributed across all available shards.
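The flow in the walkthrough (insert, derive shard, select manager, patch) can be sketched as follows. Everything here is a simplified in-memory stand-in: the WorkflowManager interface, the Execution shape, the patched fields, the hash, and the synchronous signature are assumptions, not the real Convex or handler code.

```typescript
// Simplified stand-ins; not the real WorkflowManager or Convex database APIs.
interface WorkflowManager {
  start(executionId: string): string; // returns a hypothetical component workflow ID
}

interface Execution {
  id: string;
  wfDefinitionId: string;
  shardIndex?: number; // assumed field, patched after insert
  componentWorkflowId?: string; // assumed field, patched after insert
}

const executions = new Map<string, Execution>(); // stands in for the executions table
let nextId = 0;

function getShardIndex(id: string, numShards: number): number {
  // 32-bit FNV-1a, an assumed stand-in for the real hash.
  let hash = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % numShards;
}

function handleStartWorkflow(managers: WorkflowManager[], wfDefinitionId: string): Execution {
  // 1. Insert the execution record first so a unique ID exists.
  const execution: Execution = { id: `exec_${nextId++}`, wfDefinitionId };
  executions.set(execution.id, execution);

  // 2. Derive the shard from the execution ID, not the definition ID.
  const shardIndex = getShardIndex(execution.id, managers.length);

  // 3. Select the manager for that shard.
  const manager = managers[shardIndex];
  if (manager === undefined) {
    throw new RangeError(`no manager for shard ${shardIndex}`);
  }

  // 4. Start the workflow and patch the record with the routing result.
  execution.shardIndex = shardIndex;
  execution.componentWorkflowId = manager.start(execution.id);
  return execution;
}

// Usage: four managers, several concurrent starts of the same definition.
const managers: WorkflowManager[] = [0, 1, 2, 3].map((i) => ({
  start: (executionId: string) => `shard${i}:${executionId}`,
}));
const started = [1, 2, 3, 4].map(() => handleStartWorkflow(managers, "wfDef_invoice"));
console.log(started.map((e) => e.shardIndex));
```

The ordering matters: because the shard depends on the execution ID, the record has to exist before a manager can be chosen, which is why the handler now takes the whole array rather than a pre-selected manager.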

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(convex): shard workflows by executionId to eliminate OCC contention' directly describes the main change: shifting from sharding by wfDefinitionId to sharding by executionId to solve OCC contention issues, which is the core objective of this PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@services/platform/convex/workflow_engine/helpers/engine/start_workflow_handler.ts`:
- Around line 76-80: The call to handleStartWorkflow is passing a single
WorkflowManager instance (workflowManager) but the function signature expects
managers: WorkflowManager[]; update the caller in mutations.ts to pass the
four-element workflowManagers array instead of workflowManager so
managers[shardIndex] works; keep the existing getShardIndex(executionId)
sharding logic and ensure you're passing the same workflowManagers variable that
internal_mutations.ts uses to match the expected array type.

The public startWorkflow mutation was passing a single workflowManager
instead of the managers array, which would crash on managers[shardIndex].
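A rough sketch of why the single-manager call fails at runtime, assuming stand-in types: the WorkflowManager interface, the selectManager helper, and the manager objects below are hypothetical and only mirror the variable names from the review comment.

```typescript
// Stand-in type; not the real WorkflowManager class.
interface WorkflowManager {
  start(executionId: string): string;
}

// Hypothetical helper that mirrors the managers[shardIndex] lookup.
function selectManager(managers: WorkflowManager[], shardIndex: number): WorkflowManager {
  const manager = managers[shardIndex];
  if (manager === undefined) {
    throw new RangeError(`no manager at shard index ${shardIndex}`);
  }
  return manager;
}

// The four-element array the handler expects.
const workflowManagers: WorkflowManager[] = [0, 1, 2, 3].map((i) => ({
  start: (executionId: string) => `shard${i}:${executionId}`,
}));

// The single instance the public mutation was passing.
const workflowManager: WorkflowManager = {
  start: (executionId: string) => `single:${executionId}`,
};

// Correct call: index into the array.
const ok = selectManager(workflowManagers, 2).start("exec_7");

// Buggy call: indexing a lone manager object finds nothing at [2].
// The cast simulates a caller that bypassed or predated the new signature.
let failed = false;
try {
  selectManager(workflowManager as unknown as WorkflowManager[], 2);
} catch {
  failed = true;
}
console.log(ok, failed);
```

Keeping the parameter typed as `WorkflowManager[]` lets the compiler flag this kind of caller once it is type-checked against the new signature.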
Israeltheminer merged commit 137a991 into main on Feb 10, 2026
14 of 15 checks passed
Israeltheminer deleted the fix/occ-shard-by-execution-id branch on February 10, 2026, 20:10