Skip to content

fix(cjson): santize bullmq payloads accurately#3892

Merged
icecrasher321 merged 11 commits intodevfrom
fix/q-sys-backup
Apr 2, 2026
Merged

fix(cjson): santize bullmq payloads accurately#3892
icecrasher321 merged 11 commits intodevfrom
fix/q-sys-backup

Conversation

@icecrasher321
Copy link
Copy Markdown
Collaborator

Summary

Sanitize bullmq payloads since cjson within bullmq is treating empty arrays as objects (known issue).

Type of Change

  • Bug fix

Testing

Tested manually

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@cursor
Copy link
Copy Markdown

cursor bot commented Apr 2, 2026

PR Summary

Medium Risk
Touches async execution/dispatch paths (workflow/webhook/schedule) and Redis-backed queueing behavior; mistakes could break background executions or leave jobs unprocessed/over-retained. Changes are mostly defensive normalization and config simplification, but they affect production concurrency/retention settings.

Overview
Fixes BullMQ payload corruption caused by Redis Lua cjson round-tripping empty arrays as objects by adding sanitizeBullMQPayload/ensureArray and using them at worker/executor boundaries (e.g., queued-workflow-execution, workflow-execution, snapshot/selectedOutputs handling).

Simplifies async backend selection by removing shouldExecuteInline/shouldUseBullMQ and basing routing/inline execution decisions on isBullMQEnabled() plus isTriggerDevEnabled across workflow execution, schedules, and webhook polling.

Adjusts BullMQ operational settings: BullMQ is now enabled solely by presence of REDIS_URL (removing CONCURRENCY_CONTROL_ENABLED), queue job cleanup defaults are shortened and capped, workspace-dispatch Redis job TTL drops to 2h, the claim Lua script returns jobId and moves record status updates to TypeScript, and completed/failed dispatch records clear bullmqPayload to reduce retained payload size.

Written by Cursor Bugbot for commit 07716cd. Configure here.

@vercel
Copy link
Copy Markdown

vercel bot commented Apr 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Apr 2, 2026 1:17am

Request Review

@icecrasher321 icecrasher321 merged commit 7c68061 into dev Apr 2, 2026
7 checks passed
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 2, 2026

Greptile Summary

This PR fixes a known Lua cjson issue where BullMQ's internal Redis Lua scripts silently corrupt empty JSON arrays ([]) into empty objects ({}) when deserializing job payloads. The fix introduces a new sanitizeBullMQPayload utility that normalises all known array fields at the BullMQ worker boundary, and applies targeted Array.isArray guards at every point where those fields are consumed in the execution engine.

Alongside the core cjson fix, the PR also:

  • Simplifies BullMQ activation: removes the CONCURRENCY_CONTROL_ENABLED env-var gate so that BullMQ is now enabled whenever REDIS_URL is set. This is a potentially breaking change for self-hosted deployments that have Redis configured for non-BullMQ purposes (caching, sessions) without running BullMQ workers.
  • Refactors the Redis Lua claim script: the Lua script now returns only a jobId instead of the full record; the status: 'admitting' mutation is moved to TypeScript, while the atomicity-critical operations (lease addition and lane removal) remain in Lua.
  • Reduces job retention TTLs from 24 h – 7 days to 2 h across all queue types and the dispatch Redis store, cutting the post-failure investigation window.
  • Removes shouldExecuteInline / shouldUseBullMQ wrappers from async-jobs, replacing them with direct calls to isBullMQEnabled() and the isTriggerDevEnabled constant.

Key items to verify before merge:

  • Confirm that all production and self-hosted deployments that set REDIS_URL also have BullMQ workers deployed, or add documentation/guards for the new implicit activation behaviour.
  • Consider whether a 2-hour removeOnFail TTL is sufficient for on-call debugging and job recovery.
  • Add unit tests for sanitizeBullMQPayload to cover nested field paths and edge cases.

Confidence Score: 4/5

  • Safe to merge once the implicit BullMQ activation behaviour is confirmed intentional for all deployment targets.
  • The core cjson sanitization fix is correct and well-placed. The main concern is the removal of CONCURRENCY_CONTROL_ENABLED which makes BullMQ implicit for any REDIS_URL deployment — a P1 behavioural change that could silently break self-hosted setups without BullMQ workers. The TTL reduction is a P2 operational concern. All other changes (Lua refactor, wrapper removal, array guards) look correct.
  • apps/sim/lib/core/bullmq/connection.ts (implicit BullMQ activation), apps/sim/lib/core/bullmq/queues.ts and apps/sim/lib/core/workspace-dispatch/redis-store.ts (TTL reduction)

Important Files Changed

Filename Overview
apps/sim/lib/core/utils/json-sanitize.ts New utility that normalizes empty-array fields corrupted by Lua cjson round-tripping through BullMQ; logic is sound but lacks unit tests for the nested sanitization paths.
apps/sim/lib/core/bullmq/connection.ts Removes the CONCURRENCY_CONTROL_ENABLED guard so BullMQ is now implicitly active whenever REDIS_URL is set — a potential breaking change for deployments with Redis but no BullMQ workers.
apps/sim/lib/core/bullmq/queues.ts Uniformly reduces all job retention TTLs from 24h/7d to 2h for both completed and failed jobs, which cuts the debugging/recovery window significantly.
apps/sim/lib/core/workspace-dispatch/redis-store.ts Significant Lua script refactor: the claim script now returns only jobId instead of the full record; status/lease mutation is moved to TypeScript, preserving atomic lane-removal + lease-add invariant. Edge case when the fetched record has already expired is handled gracefully.
apps/sim/lib/workflows/executor/queued-workflow-execution.ts Correctly applies sanitizeBullMQPayload before constructing ExecutionSnapshot, ensuring cjson-corrupted empty arrays are restored prior to any execution logic.
apps/sim/background/workflow-execution.ts Trigger.dev path uses ensureArray only for the top-level callChain field; full sanitizeBullMQPayload is intentionally omitted as Trigger.dev doesn't suffer from the cjson empty-array issue.
apps/sim/lib/webhooks/processor.ts Polling webhook dispatch logic inverted to be explicit about isTriggerDevEnabled vs isBullMQEnabled paths; functional behaviour is preserved across all backend configurations.
apps/sim/lib/core/async-jobs/config.ts Removes the shouldExecuteInline and shouldUseBullMQ wrapper functions; call sites now call isBullMQEnabled() and isTriggerDevEnabled directly, simplifying the indirection.

Sequence Diagram

sequenceDiagram
    participant R as Route/Webhook
    participant Q as WorkspaceDispatch
    participant B as BullMQ Worker
    participant S as sanitizeBullMQPayload
    participant E as ExecutionEngine

    R->>R: isBullMQEnabled()?<br/>(now: Boolean(REDIS_URL))
    alt BullMQ enabled
        R->>Q: enqueueWorkspaceDispatch(bullmqPayload)
        Q->>Q: Lua: atomically claim<br/>(add lease + remove from lane)<br/>returns jobId only
        Q->>Q: TypeScript: fetch record,<br/>update status to 'admitting',<br/>save back to Redis
        Q->>B: dispatch BullMQ job
        B->>S: sanitizeBullMQPayload(payload)<br/>fix cjson {} → [] for empty arrays
        S->>E: executeQueuedWorkflowJob(sanitized)
    else TriggerDev enabled
        R->>Q: getJobQueue().enqueue(...)
        Q->>E: executeWorkflowJob(payload)<br/>ensureArray(callChain) only
    else Database/inline mode
        R->>E: executeInline()
    end
Loading

Reviews (1): Last reviewed commit: "fix(cjson): santization bullmq payloads ..." | Re-trigger Greptile

Comment on lines 4 to 6
export function isBullMQEnabled(): boolean {
return isTruthy(env.CONCURRENCY_CONTROL_ENABLED) && Boolean(env.REDIS_URL)
return Boolean(env.REDIS_URL)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Implicit BullMQ activation breaks deployments without workers

isBullMQEnabled() now returns true whenever REDIS_URL is set. Previously it also required CONCURRENCY_CONTROL_ENABLED=true, which was the explicit opt-in gate. Many self-hosted deployments configure REDIS_URL for caching, rate-limiting, or session storage without running BullMQ workers. After this change, any such deployment will start routing async workflow executions to BullMQ queues that are never drained — jobs silently queue up and never run.

The old guard (isTruthy(env.CONCURRENCY_CONTROL_ENABLED) && Boolean(env.REDIS_URL)) ensured both the infrastructure (Redis) and the operator intent (CONCURRENCY_CONTROL_ENABLED=true) were present. Removing CONCURRENCY_CONTROL_ENABLED entirely eliminates that intent signal with no migration path.

If the goal is to consolidate configuration, consider keeping REDIS_URL as the sole gate but documenting this clearly, and ensuring all deployment guides explicitly note that providing REDIS_URL now enables BullMQ.

Comment on lines +31 to 35
removeOnComplete: { age: 2 * 60 * 60, count: 1000 },
removeOnFail: { age: 2 * 60 * 60, count: 500 },
}
case 'webhook-execution':
return {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Aggressive TTL reduction limits observability and retry windows

All BullMQ queue retention has been reduced from 24 h / 7 days down to 2 hours, and the same change applies to JOB_TTL_SECONDS in redis-store.ts (previously 48 h). While shorter retention helps control Redis memory, 2 hours is a very tight window:

  • A failed workflow job that fails at 01:00 and isn't noticed until 03:30 will have already been evicted — no state to inspect or manually retry.
  • removeOnFail at 2 h means failed jobs disappear before an on-call engineer in a different timezone can investigate them.

The previous removeOnFail values (7 days for workflow-execution, 3 days for webhook-execution and schedule-execution) provided a more reasonable recovery window. Consider retaining a longer failure TTL (e.g., 24–72 h) while keeping the shorter success TTL for completed jobs.

Comment on lines +33 to +67
export function sanitizeBullMQPayload(payload: any): void {
if (!payload) return

payload.selectedOutputs = ensureArray(payload.selectedOutputs)

if (payload.metadata) {
payload.metadata.callChain = ensureArray(payload.metadata.callChain)

if (payload.metadata.pendingBlocks !== undefined) {
payload.metadata.pendingBlocks = ensureArray(payload.metadata.pendingBlocks)
}

if (payload.metadata.workflowStateOverride?.edges !== undefined) {
payload.metadata.workflowStateOverride.edges = ensureArray(
payload.metadata.workflowStateOverride.edges
)
}
}

if (payload.runFromBlock?.sourceSnapshot) {
const state = payload.runFromBlock.sourceSnapshot
for (const field of EXECUTION_STATE_ARRAY_FIELDS) {
if (field in state && !Array.isArray(state[field])) {
state[field] = []
}
}

if (state.dagIncomingEdges && typeof state.dagIncomingEdges === 'object') {
for (const key of Object.keys(state.dagIncomingEdges)) {
if (!Array.isArray(state.dagIncomingEdges[key])) {
state.dagIncomingEdges[key] = []
}
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 sanitizeBullMQPayload has no unit tests

The PR checklist marks "Tests added/updated and passing", but no test file for json-sanitize.ts appears in the diff. The sanitization logic handles several nested paths (payload.metadata.callChain, payload.metadata.workflowStateOverride.edges, payload.runFromBlock.sourceSnapshot.*, and dagIncomingEdges per-key) that are easy to accidentally break.

Consider adding a dedicated test file that exercises:

  • Top-level fields (selectedOutputs: {}[])
  • Nested metadata fields
  • The EXECUTION_STATE_ARRAY_FIELDS loop with mixed present/absent keys
  • The dagIncomingEdges per-key fix
  • A null/undefined payload (guard at line 34)

)

if (shouldExecuteInline()) {
if (!isBullMQEnabled() && !isTriggerDevEnabled) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 isTriggerDevEnabled used inconsistently as a value vs. function call

isBullMQEnabled() is called as a function (parentheses), while isTriggerDevEnabled is referenced as a constant (no parentheses). Both are correct — isBullMQEnabled is a function and isTriggerDevEnabled is a module-level constant — but the asymmetry can confuse readers who might expect both to be function calls given the naming convention isFoo.

The same pattern appears at apps/sim/app/api/workflows/[id]/execute/route.ts (line 246) and apps/sim/lib/webhooks/processor.ts (line 1268). No code change is strictly necessary, but adding an inline comment clarifying that isTriggerDevEnabled is a constant (evaluated at module load) would aid readability.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@waleedlatif1 waleedlatif1 deleted the fix/q-sys-backup branch April 2, 2026 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant