
Conversation

ericallam
Member

No description provided.

…rrency debugging and repairing

feat(run-engine): ability to repair runs in QUEUED, SUSPENDED, and FINISHED execution status

changeset-bot bot commented Sep 26, 2025

⚠️ No Changeset found

Latest commit: a2bd982

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


Contributor

coderabbitai bot commented Sep 26, 2025

Warning

Rate limit exceeded

@ericallam has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 10 minutes and 58 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 7ae67a6 and a2bd982.

📒 Files selected for processing (1)
  • internal-packages/run-engine/src/engine/index.ts (5 hunks)

Walkthrough

  • Web route updates repair-queues handler to call Engine.repairQueue with repairEnvironmentResults.runIds and return objects shaped as { queue, ...repair }.
  • RunEngine adds a repairSnapshot worker and private handler, introduces repairSnapshotTimeoutMs, refactors repairEnvironment to delegate to #repairRuns, adds public repairQueue and private #repairRuns/#repairRun methods, and schedules repairSnapshot jobs.
  • RunEngine options gain optional repairSnapshotTimeoutMs.
  • Worker catalog adds repairSnapshot with schema { runId, snapshotId, executionStatus } and visibilityTimeoutMs 30_000.
  • RunQueue adds clearMessageFromConcurrencySets API and corresponding Redis Lua command to remove message IDs from concurrency sets.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–75 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name | Status | Explanation | Resolution
Description Check | ⚠️ Warning | The pull request description is completely missing and does not include any of the required sections such as issue linkage, checklist, testing steps, changelog, or screenshots as specified by the repository’s template. | Please add a detailed description following the repository template, including the linked issue number, completed checklist items, testing steps, a concise changelog entry, and any relevant screenshots.
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name | Status | Explanation
Title Check | ✅ Passed | The title succinctly conveys the new feature in the run-engine module and highlights the core capability added, namely the ability to repair runs in various execution statuses, matching the primary change in the pull request.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
internal-packages/run-engine/src/engine/workerCatalog.ts (1)

19-26: Constrain executionStatus with z.enum

Use an explicit enum for executionStatus to prevent invalid values and tighten validation.

   repairSnapshot: {
-    schema: z.object({
-      runId: z.string(),
-      snapshotId: z.string(),
-      executionStatus: z.string(),
-    }),
+    schema: z.object({
+      runId: z.string(),
+      snapshotId: z.string(),
+      executionStatus: z.enum([
+        "RUN_CREATED",
+        "QUEUED_EXECUTING",
+        "PENDING_EXECUTING",
+        "EXECUTING",
+        "EXECUTING_WITH_WAITPOINTS",
+        "SUSPENDED",
+        "PENDING_CANCEL",
+        "FINISHED",
+        "QUEUED",
+      ]),
+    }),
     visibilityTimeoutMs: 30_000,
   },
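The effect of constraining executionStatus can be illustrated without pulling in zod. The following is a dependency-free sketch, not the PR's code: EXECUTION_STATUSES, isExecutionStatus, and parseRepairSnapshotPayload are hypothetical names, and only the status strings are taken from the suggestion above.

```typescript
// Status strings copied from the suggested z.enum above.
const EXECUTION_STATUSES = [
  "RUN_CREATED",
  "QUEUED_EXECUTING",
  "PENDING_EXECUTING",
  "EXECUTING",
  "EXECUTING_WITH_WAITPOINTS",
  "SUSPENDED",
  "PENDING_CANCEL",
  "FINISHED",
  "QUEUED",
] as const;

type ExecutionStatus = (typeof EXECUTION_STATUSES)[number];

type RepairSnapshotPayload = {
  runId: string;
  snapshotId: string;
  executionStatus: ExecutionStatus;
};

// Hypothetical type guard: true only for one of the known status strings.
function isExecutionStatus(value: unknown): value is ExecutionStatus {
  return (
    typeof value === "string" &&
    (EXECUTION_STATUSES as readonly string[]).indexOf(value) !== -1
  );
}

// Hypothetical parser: returns the typed payload, or undefined when the
// status is not a known value (where z.enum would report a validation error).
function parseRepairSnapshotPayload(input: {
  runId: string;
  snapshotId: string;
  executionStatus: string;
}): RepairSnapshotPayload | undefined {
  const status = input.executionStatus;
  if (!isExecutionStatus(status)) {
    return undefined;
  }
  return { runId: input.runId, snapshotId: input.snapshotId, executionStatus: status };
}
```

With a plain string field, a misspelled status would reach the worker handler unchecked; with the constrained form it is rejected at the job boundary.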
internal-packages/run-engine/src/run-queue/index.ts (2)

934-941: Public API to clear concurrency sets: good addition

Surface area is minimal and delegates to the private call. Consider adding basic tracing attributes for observability (optional).


2693-2712: Lua command definition is sound

Command signature matches the TS declaration and usage. Optionally return a count of removed entries for metrics.
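To make the optional "return a count" idea concrete, here is a minimal in-memory sketch. It assumes the command amounts to removing one message ID from several sets (as the SREM-based description elsewhere in this review suggests); the SetStore type, the clearMessageFromSets function, and the key names are hypothetical and do not reflect the real key schema or Lua script.

```typescript
// In-memory stand-in for Redis sets: key -> members. Not the real RunQueue.
type SetStore = Map<string, Set<string>>;

// Removes messageId from every listed set and reports how many sets
// actually contained it, which is what summing the SREM replies would give.
function clearMessageFromSets(store: SetStore, keys: string[], messageId: string): number {
  let removed = 0;
  for (const key of keys) {
    const members = store.get(key);
    // Set.delete returns true only when the member was present,
    // similar to SREM replying with 1.
    if (members !== undefined && members.delete(messageId)) {
      removed += 1;
    }
  }
  return removed;
}

// Hypothetical key names, for illustration only.
const store: SetStore = new Map([
  ["queue:currentConcurrency", new Set(["run_1", "run_2"])],
  ["env:currentConcurrency", new Set(["run_1"])],
  ["queue:currentDequeued", new Set<string>()],
  ["env:currentDequeued", new Set<string>()],
]);

// run_1 is present in two of the four sets, so two removals are counted.
const removedCount = clearMessageFromSets(store, Array.from(store.keys()), "run_1");
```

A count of zero would tell the caller that the run was already absent from every set, which is useful for distinguishing a real repair from a no-op in metrics.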

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9aedda2 and 7ae67a6.

📒 Files selected for processing (5)
  • apps/webapp/app/routes/admin.api.v1.environments.$environmentId.engine.repair-queues.ts (1 hunks)
  • internal-packages/run-engine/src/engine/index.ts (5 hunks)
  • internal-packages/run-engine/src/engine/types.ts (1 hunks)
  • internal-packages/run-engine/src/engine/workerCatalog.ts (1 hunks)
  • internal-packages/run-engine/src/run-queue/index.ts (5 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

  • apps/webapp/app/routes/admin.api.v1.environments.$environmentId.engine.repair-queues.ts
  • internal-packages/run-engine/src/run-queue/index.ts
  • internal-packages/run-engine/src/engine/types.ts
  • internal-packages/run-engine/src/engine/workerCatalog.ts
  • internal-packages/run-engine/src/engine/index.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

We use zod a lot in packages/core and in the webapp

Files:

  • apps/webapp/app/routes/admin.api.v1.environments.$environmentId.engine.repair-queues.ts
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

When importing from @trigger.dev/core in the webapp, never import the root package path; always use one of the documented subpath exports from @trigger.dev/core’s package.json

Files:

  • apps/webapp/app/routes/admin.api.v1.environments.$environmentId.engine.repair-queues.ts
{apps/webapp/app/**/*.server.{ts,tsx},apps/webapp/app/routes/**/*.ts}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Access environment variables only via the env export from app/env.server.ts; do not reference process.env directly

Files:

  • apps/webapp/app/routes/admin.api.v1.environments.$environmentId.engine.repair-queues.ts
apps/webapp/app/**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Modules intended for test consumption under apps/webapp/app/**/*.ts must not read environment variables; accept configuration via options instead

Files:

  • apps/webapp/app/routes/admin.api.v1.environments.$environmentId.engine.repair-queues.ts
🧬 Code graph analysis (2)
internal-packages/run-engine/src/run-queue/index.ts (3)
internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts (2)
  • runId (1462-1591)
  • runId (1849-1899)
internal-packages/run-engine/src/run-queue/types.ts (1)
  • RunQueueKeyProducerEnvironment (49-52)
internal-packages/run-engine/src/run-queue/keyProducer.ts (5)
  • messageKey (254-258)
  • queueCurrentConcurrencyKey (157-165)
  • envCurrentConcurrencyKey (199-217)
  • queueCurrentDequeuedKey (149-155)
  • envCurrentDequeuedKey (231-247)
internal-packages/run-engine/src/engine/index.ts (3)
internal-packages/run-engine/src/run-queue/index.ts (1)
  • runId (1812-1849)
internal-packages/run-engine/src/shared/index.ts (1)
  • AuthenticatedEnvironment (4-6)
internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts (1)
  • getLatestExecutionSnapshot (95-113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: typecheck / typecheck
🔇 Additional comments (11)
internal-packages/run-engine/src/engine/types.ts (1)

74-74: Config option addition looks good; ensure it’s wired into defaults/docs

The optional repairSnapshotTimeoutMs aligns with the new repair workflow. Confirm it’s documented and defaulted as intended in the engine constructor.

internal-packages/run-engine/src/run-queue/index.ts (3)

43-44: Type import looks correct

Importing RunQueueKeyProducerEnvironment improves API clarity for environment-scoped operations.


1812-1850: Lua-backed clear operation is correct

Key selection and SREM updates are consistent with your key schema. The log includes useful context.

Please confirm there’s an automated test covering the SUSPENDED-repair path calling this API (see engine #handleRepairSnapshot). If missing, I can help scaffold one.


2796-2806: Commander interface extended correctly

The new clearMessageFromConcurrencySets signature matches the Lua script and call sites.

internal-packages/run-engine/src/engine/index.ts (6)

74-74: Private timeout field added: OK

Matches the new RunEngineOptions and used to schedule repair jobs.


195-197: New repairSnapshot worker job: OK

Job wiring matches workerCatalog and the private handler.


248-249: Default timeout applied

Reasonable default of 60s when not provided.
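A minimal sketch of how such a default might be resolved follows. The option name repairSnapshotTimeoutMs and the 60s value come from this review; the type name, constant, and function are hypothetical, and the use of ?? is an assumption about the constructor rather than a statement of what the PR does.

```typescript
// Hypothetical slice of the engine options; only this field is from the PR.
type RunEngineOptionsSketch = {
  repairSnapshotTimeoutMs?: number;
};

const DEFAULT_REPAIR_SNAPSHOT_TIMEOUT_MS = 60_000;

function resolveRepairSnapshotTimeoutMs(options: RunEngineOptionsSketch): number {
  // ?? falls back only on undefined or null, so an explicit 0 is preserved.
  return options.repairSnapshotTimeoutMs ?? DEFAULT_REPAIR_SNAPSHOT_TIMEOUT_MS;
}
```

If the real constructor uses || instead, an explicit 0 would silently become 60s, which may or may not be intended.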


1183-1184: Delegation to #repairRuns is clean

Returns unified shape for environment repairs.


1186-1197: Queue repair with ignore list: good dedupe

Filters pre-repaired runIds to avoid double work per queue.

Confirm the admin route always calls repairEnvironment before per-queue repair to ensure ignore lists are correct (current route does).
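The dedupe described here can be sketched as a simple set difference. The function and parameter names are hypothetical; the real repairQueue signature is not shown in this review.

```typescript
// Hypothetical helper: drop run IDs that the environment-level pass already repaired.
function filterIgnoredRunIds(queueRunIds: string[], ignoreRunIds: string[]): string[] {
  // A Set gives constant-time lookups, so the filter stays linear in the queue size.
  const ignored = new Set(ignoreRunIds);
  return queueRunIds.filter((runId) => !ignored.has(runId));
}
```

This only dedupes correctly if the ignore list is produced before the per-queue pass runs, which is why the ordering of the two calls in the admin route matters.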


1199-1221: Repair orchestration via pMap: OK

Bounded concurrency (5) is sensible.
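For readers unfamiliar with bounded-concurrency mapping, the idea can be sketched without dependencies as below. This is a hypothetical stand-in, not p-map's implementation and not the PR's code; mapWithConcurrency is an illustrative name.

```typescript
// Runs mapper over items with at most `concurrency` calls in flight,
// preserving input order in the result array.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  mapper: (item: T) => Promise<R>,
  concurrency: number
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let nextIndex = 0;

  // Each worker claims the next unprocessed index until the input is exhausted.
  // Claiming is synchronous, so two workers never take the same index.
  async function worker(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex;
      nextIndex += 1;
      results[index] = await mapper(items[index] as T);
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, items.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i += 1) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```

With a limit of 5, a large environment repair issues at most five run repairs at a time, which keeps pressure on the database and Redis predictable.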

apps/webapp/app/routes/admin.api.v1.environments.$environmentId.engine.repair-queues.ts (1)

89-99: Queue-level repair results shape: OK

Passing environment repair runIds to ignore avoids redundant queue repairs; returning { queue, ...repair } is a clear response shape.

Consider including friendlyId alongside queue for UX parity with other endpoints (optional).

Comment on lines 1655 to 1764
async #handleRepairSnapshot({
  runId,
  snapshotId,
  executionStatus,
}: {
  runId: string;
  snapshotId: string;
  executionStatus: string;
}) {
  return await this.runLock.lock("handleRepairSnapshot", [runId], async () => {
    const latestSnapshot = await getLatestExecutionSnapshot(this.prisma, runId);

    if (latestSnapshot.id !== snapshotId) {
      this.logger.log(
        "RunEngine.handleRepairSnapshot no longer the latest snapshot, stopping the repair.",
        {
          runId,
          snapshotId,
          latestSnapshotExecutionStatus: latestSnapshot.executionStatus,
          repairExecutionStatus: executionStatus,
        }
      );

      return;
    }

    // Okay, so this means we haven't transitioned to a new status yet, so we need to do something
    switch (latestSnapshot.executionStatus) {
      case "EXECUTING":
      case "EXECUTING_WITH_WAITPOINTS":
      case "FINISHED":
      case "PENDING_CANCEL":
      case "PENDING_EXECUTING":
      case "QUEUED_EXECUTING":
      case "RUN_CREATED": {
        // Do nothing;
        return;
      }
      case "QUEUED": {
        this.logger.log("RunEngine.handleRepairSnapshot QUEUED", {
          runId,
          snapshotId,
        });

        //it will automatically be requeued X times depending on the queue retry settings
        const gotRequeued = await this.runQueue.nackMessage({
          orgId: latestSnapshot.organizationId,
          messageId: runId,
        });

        if (!gotRequeued) {
          this.logger.error("RunEngine.handleRepairSnapshot QUEUED repair failed", {
            runId,
            snapshot: latestSnapshot,
          });
        } else {
          this.logger.log("RunEngine.handleRepairSnapshot QUEUED repair successful", {
            runId,
            snapshot: latestSnapshot,
          });
        }

        break;
      }
      case "SUSPENDED": {
        this.logger.log("RunEngine.handleRepairSnapshot SUSPENDED", {
          runId,
          snapshotId,
        });

        const taskRun = await this.prisma.taskRun.findFirst({
          where: { id: runId },
          select: {
            queue: true,
          },
        });

        if (!taskRun) {
          this.logger.error("RunEngine.handleRepairSnapshot SUSPENDED task run not found", {
            runId,
            snapshotId,
          });
          return;
        }

        // We need to clear this run from the current concurrency sets
        await this.runQueue.clearMessageFromConcurrencySets({
          runId,
          orgId: latestSnapshot.organizationId,
          queue: taskRun.queue,
          env: {
            id: latestSnapshot.environmentId,
            type: latestSnapshot.environmentType,
            project: {
              id: latestSnapshot.projectId,
            },
            organization: {
              id: latestSnapshot.organizationId,
            },
          },
        });

        break;
      }
      default: {
        assertNever(latestSnapshot.executionStatus);
      }
    }
  });
}
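The control flow of the handler above (stale-snapshot guard, then an exhaustive switch) can be summarized as a pure function. This is a hypothetical sketch for reasoning about the branches, not code from the PR: decideRepairAction and RepairAction are illustrative names, and assertNever is re-declared here under the assumption that it is the usual never-typed helper.

```typescript
type ExecutionStatus =
  | "RUN_CREATED"
  | "QUEUED_EXECUTING"
  | "PENDING_EXECUTING"
  | "EXECUTING"
  | "EXECUTING_WITH_WAITPOINTS"
  | "SUSPENDED"
  | "PENDING_CANCEL"
  | "FINISHED"
  | "QUEUED";

type RepairAction = "stale" | "none" | "nack" | "clear-concurrency-sets";

// Compile-time exhaustiveness check: adding a new status to the union makes
// the default branch below fail to type-check until a case handles it.
function assertNever(value: never): never {
  throw new Error(`Unexpected execution status: ${String(value)}`);
}

// Mirrors the branching of the handler as quoted above, without any I/O.
function decideRepairAction(
  latest: { id: string; executionStatus: ExecutionStatus },
  scheduledSnapshotId: string
): RepairAction {
  // The run moved on since the repair job was scheduled, so leave it alone.
  if (latest.id !== scheduledSnapshotId) {
    return "stale";
  }

  switch (latest.executionStatus) {
    case "EXECUTING":
    case "EXECUTING_WITH_WAITPOINTS":
    case "FINISHED":
    case "PENDING_CANCEL":
    case "PENDING_EXECUTING":
    case "QUEUED_EXECUTING":
    case "RUN_CREATED":
      return "none";
    case "QUEUED":
      return "nack";
    case "SUSPENDED":
      return "clear-concurrency-sets";
    default:
      return assertNever(latest.executionStatus);
  }
}
```

Written this way, it is easy to see that FINISHED currently maps to "none", which is the gap the comment below addresses.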
Contributor


🛠️ Refactor suggestion | 🟠 Major

Handle FINISHED runs: attempt ack or clear sets

Currently FINISHED is grouped under a “do nothing” branch. Add explicit FINISHED handling to ack the message if present, else clear concurrency sets, to actually repair stuck finished runs.

-      switch (latestSnapshot.executionStatus) {
-        case "EXECUTING":
-        case "EXECUTING_WITH_WAITPOINTS":
-        case "FINISHED":
+      switch (latestSnapshot.executionStatus) {
+        case "EXECUTING":
+        case "EXECUTING_WITH_WAITPOINTS":
         case "PENDING_CANCEL":
         case "PENDING_EXECUTING":
         case "QUEUED_EXECUTING":
         case "RUN_CREATED": {
           // Do nothing;
           return;
         }
         case "QUEUED": {
           ...
           break;
         }
+        case "FINISHED": {
+          this.logger.log("RunEngine.handleRepairSnapshot FINISHED", {
+            runId,
+            snapshotId,
+          });
+
+          // If the message still exists, ack to fully clean up queue and concurrency artifacts.
+          const hasMessage = await this.runQueue.messageExists(
+            latestSnapshot.organizationId,
+            runId
+          );
+          if (hasMessage) {
+            await this.runQueue.acknowledgeMessage(latestSnapshot.organizationId, runId, {
+              skipDequeueProcessing: true,
+              removeFromWorkerQueue: true,
+            });
+            break;
+          }
+
+          // Fallback: ensure concurrency sets are cleared if the message is gone.
+          const taskRun = await this.prisma.taskRun.findFirst({
+            where: { id: runId },
+            select: { queue: true },
+          });
+          if (!taskRun) {
+            this.logger.error("RunEngine.handleRepairSnapshot FINISHED task run not found", {
+              runId,
+              snapshotId,
+            });
+            return;
+          }
+          await this.runQueue.clearMessageFromConcurrencySets({
+            runId,
+            orgId: latestSnapshot.organizationId,
+            queue: taskRun.queue,
+            env: {
+              id: latestSnapshot.environmentId,
+              type: latestSnapshot.environmentType,
+              project: { id: latestSnapshot.projectId },
+              organization: { id: latestSnapshot.organizationId },
+            },
+          });
+          break;
+        }
         case "SUSPENDED": {
           ...
           break;
         }
         default: {
           assertNever(latestSnapshot.executionStatus);
         }
       }

This makes FINISHED repair effective without relying on the periodic sweeper.
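The ack-or-clear decision in the suggested FINISHED branch above can be exercised in isolation with a small interface. The method names messageExists, acknowledgeMessage, and clearMessageFromConcurrencySets are taken from the suggestion; their signatures here are simplified assumptions, and RunQueueLike and repairFinishedRun are hypothetical names.

```typescript
// Minimal interface covering only the calls used in the suggestion.
// Signatures are simplified assumptions, not the real RunQueue API.
type RunQueueLike = {
  messageExists(orgId: string, runId: string): Promise<boolean>;
  acknowledgeMessage(orgId: string, runId: string): Promise<void>;
  clearMessageFromConcurrencySets(params: {
    orgId: string;
    runId: string;
    queue: string;
  }): Promise<void>;
};

type FinishedRepairOutcome = "acked" | "cleared";

async function repairFinishedRun(
  runQueue: RunQueueLike,
  params: { orgId: string; runId: string; queue: string }
): Promise<FinishedRepairOutcome> {
  // If the message is still present, acknowledging it is assumed to clean up
  // both the queue entry and the concurrency bookkeeping.
  if (await runQueue.messageExists(params.orgId, params.runId)) {
    await runQueue.acknowledgeMessage(params.orgId, params.runId);
    return "acked";
  }

  // Otherwise only stale concurrency-set entries can remain, so clear those.
  await runQueue.clearMessageFromConcurrencySets(params);
  return "cleared";
}
```

Keeping the decision behind a narrow interface like this makes it straightforward to cover both paths with a fake queue in a unit test.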

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Okay, so this means we haven't transitioned to a new status yet, so we need to do something
switch (latestSnapshot.executionStatus) {
  case "EXECUTING":
  case "EXECUTING_WITH_WAITPOINTS":
  case "PENDING_CANCEL":
  case "PENDING_EXECUTING":
  case "QUEUED_EXECUTING":
  case "RUN_CREATED": {
    // Do nothing;
    return;
  }
  case "QUEUED": {
    this.logger.log("RunEngine.handleRepairSnapshot QUEUED", {
      runId,
      snapshotId,
    });
    //it will automatically be requeued X times depending on the queue retry settings
    const gotRequeued = await this.runQueue.nackMessage({
      orgId: latestSnapshot.organizationId,
      messageId: runId,
    });
    if (!gotRequeued) {
      this.logger.error("RunEngine.handleRepairSnapshot QUEUED repair failed", {
        runId,
        snapshot: latestSnapshot,
      });
    } else {
      this.logger.log("RunEngine.handleRepairSnapshot QUEUED repair successful", {
        runId,
        snapshot: latestSnapshot,
      });
    }
    break;
  }
  case "FINISHED": {
    this.logger.log("RunEngine.handleRepairSnapshot FINISHED", {
      runId,
      snapshotId,
    });
    // If the message still exists, ack to fully clean up queue and concurrency artifacts.
    const hasMessage = await this.runQueue.messageExists(
      latestSnapshot.organizationId,
      runId
    );
    if (hasMessage) {
      await this.runQueue.acknowledgeMessage(latestSnapshot.organizationId, runId, {
        skipDequeueProcessing: true,
        removeFromWorkerQueue: true,
      });
      break;
    }
    // Fallback: ensure concurrency sets are cleared if the message is gone.
    const taskRun = await this.prisma.taskRun.findFirst({
      where: { id: runId },
      select: { queue: true },
    });
    if (!taskRun) {
      this.logger.error("RunEngine.handleRepairSnapshot FINISHED task run not found", {
        runId,
        snapshotId,
      });
      return;
    }
    await this.runQueue.clearMessageFromConcurrencySets({
      runId,
      orgId: latestSnapshot.organizationId,
      queue: taskRun.queue,
      env: {
        id: latestSnapshot.environmentId,
        type: latestSnapshot.environmentType,
        project: { id: latestSnapshot.projectId },
        organization: { id: latestSnapshot.organizationId },
      },
    });
    break;
  }
  case "SUSPENDED": {
    this.logger.log("RunEngine.handleRepairSnapshot SUSPENDED", {
      runId,
      snapshotId,
    });
    const taskRun = await this.prisma.taskRun.findFirst({
      where: { id: runId },
      select: {
        queue: true,
      },
    });
    if (!taskRun) {
      this.logger.error("RunEngine.handleRepairSnapshot SUSPENDED task run not found", {
        runId,
        snapshotId,
      });
      return;
    }
    // We need to clear this run from the current concurrency sets
    await this.runQueue.clearMessageFromConcurrencySets({
      runId,
      orgId: latestSnapshot.organizationId,
      queue: taskRun.queue,
      env: {
        id: latestSnapshot.environmentId,
        type: latestSnapshot.environmentType,
        project: {
          id: latestSnapshot.projectId,
        },
        organization: {
          id: latestSnapshot.organizationId,
        },
      },
    });
    break;
  }
  default: {
    assertNever(latestSnapshot.executionStatus);
  }
}
🤖 Prompt for AI Agents
internal-packages/run-engine/src/engine/index.ts around lines 1655 to 1764: the
FINISHED case is currently lumped into the "do nothing" branch; implement
explicit FINISHED handling that first tries to ack the queue message (await
this.runQueue.ackMessage({ orgId: latestSnapshot.organizationId, messageId:
runId })) and if ack returns false fall back to clearing the run from
concurrency sets (fetch taskRun.queue like in SUSPENDED, log and return if not
found, then call this.runQueue.clearMessageFromConcurrencySets with the same env
shape used in SUSPENDED); log success/failure paths similarly to the
QUEUED/SUSPENDED branches.

@ericallam ericallam merged commit 558fb11 into main Sep 26, 2025
28 checks passed
@ericallam ericallam deleted the fix/repair-more-runs branch September 26, 2025 14:16
nicktrn pushed a commit that referenced this pull request Sep 30, 2025
…NISHED execution status (#2564)

* feat(server): add two admin endpoints for queue and environment concurrency debugging and repairing
feat(run-engine): ability to repair runs in QUEUED, SUSPENDED, and FINISHED execution status

* Handle FINISHED snapshot in the repair
samejr pushed a commit that referenced this pull request Oct 4, 2025
…NISHED execution status (#2564)

* feat(server): add two admin endpoints for queue and environment concurrency debugging and repairing
feat(run-engine): ability to repair runs in QUEUED, SUSPENDED, and FINISHED execution status

* Handle FINISHED snapshot in the repair