Skip to content

feat(webapp,redis): handle READONLY / LOADING during ElastiCache failover#3548

Merged
ericallam merged 2 commits into
mainfrom
feat/tri-8868-reconnect-on-error
May 11, 2026
Merged

feat(webapp,redis): handle READONLY / LOADING during ElastiCache failover#3548
ericallam merged 2 commits into
mainfrom
feat/tri-8868-reconnect-on-error

Conversation

@ericallam
Copy link
Copy Markdown
Member

@ericallam ericallam commented May 11, 2026

Summary

During an ElastiCache role swap (failover) or node-type change (vertical scale), the ioredis TCP/TLS connection stays open but the server starts answering with READONLY (the client is talking to a node that became a replica) or LOADING (node still loading data from disk). Without an explicit hook, those errors surface to caller code as ReplyError instances — every write op on the affected connection fails until the cluster fully cuts over.

This PR adds reconnectOnError to every prod ioredis client so the disconnect + reconnect + retry cycle absorbs these errors and caller code never sees them.

Fix

export function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (msg.startsWith("READONLY") || msg.startsWith("LOADING")) return 2;
  return false;
}

Returning 2 tells ioredis to disconnect, reconnect, and re-issue the failed command. After reconnect, DNS / SG state routes the new socket to a writable node.

The helper lives in @internal/redis and is wired into both the shared createRedisClient (which covers RunQueue, schedule-engine, redis-worker, and every other internal-package consumer) and the direct new Redis(...) call sites in the webapp.

V1-only marqs files are intentionally not migrated.

Test plan

  • pnpm run typecheck --filter webapp
  • pnpm run typecheck --filter @internal/run-engine
  • Verified end-to-end against a live ElastiCache vertical-scale event — caller-surfaced errors went from tens of thousands during the cutover window down to a handful per ioredis client
  • Confirm steady-state behavior unchanged after deploy

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 11, 2026

⚠️ No Changeset found

Latest commit: 4df210c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 11, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c5240c63-938d-431e-af51-d0f85de0fac4

📥 Commits

Reviewing files that changed from the base of the PR and between 0ed192d and 4df210c.

📒 Files selected for processing (11)
  • .server-changes/redis-reconnect-on-readonly-loading.md
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/redis.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • internal-packages/redis/src/index.ts
✅ Files skipped from review due to trivial changes (2)
  • .server-changes/redis-reconnect-on-readonly-loading.md
  • apps/webapp/app/services/platformNotificationCounter.server.ts
🚧 Files skipped from review as they are similar to previous changes (9)
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • internal-packages/redis/src/index.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/redis.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)

Walkthrough

This PR defines and exports a new defaultReconnectOnError(err) in the internal Redis package that maps READONLY/LOADING reply errors to a value instructing ioredis to disconnect, reconnect, and retry. That helper is added to the package's default options and applied across the webapp: the primary Redis factory (cluster and non-cluster) and multiple service/presenter/Socket.IO Redis clients. A server-changes note documents the behavior.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly identifies the main change: adding reconnectOnError handling for READONLY/LOADING errors during ElastiCache failover, which matches the core objective across the changeset.
Description check ✅ Passed The description provides a comprehensive summary with problem statement, fix implementation, scope (shared and direct clients, V1 exclusions), and test results—exceeding the template's basic requirements.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/tri-8868-reconnect-on-error

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

ericallam added 2 commits May 11, 2026 06:43
During an ElastiCache role swap or node-type change, the TCP/TLS
connection stays open but the server starts answering with READONLY
(the client is talking to a node that became a replica) or LOADING
(node still loading data from disk). Without an explicit hook, these
errors surface to caller code as ReplyError instances — every write
op on the affected connection fails until the cluster cuts over.

Returning 2 from reconnectOnError tells ioredis to tear down the
connection, reconnect, and re-issue the failed command. After
reconnect, DNS / SG state routes the new socket to a writable node.

The shared createRedisClient helper picks this up automatically.
defaultReconnectOnError is exported for direct ioredis call sites
that bypass createRedisClient.
The webapp constructs ioredis clients directly in several prod files
that bypass the shared createRedisClient helper. Each of these needs
the same reconnectOnError hook so an ElastiCache role swap or
node-type change on the realtime / cache / socket-io / cluster Redis
instances doesn't surface READONLY / LOADING reply errors to caller
code.

V1-only marqs files are intentionally not migrated.
@ericallam ericallam force-pushed the feat/tri-8868-reconnect-on-error branch from 0ed192d to 4df210c Compare May 11, 2026 05:43
@ericallam ericallam changed the title feat(redis): handle READONLY / LOADING during ElastiCache failover and scale-up feat(webapp,redis): handle READONLY / LOADING during ElastiCache failover May 11, 2026
@ericallam ericallam marked this pull request as ready for review May 11, 2026 05:44
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts`:
- Around line 39-40: The reconnectOnError property is currently placed before
the spread of ...this.options.redis which lets callers override or nullify the
default failover handler; move the reconnectOnError assignment to after the
spread and set it using a nullish-coalescing fallback (i.e., use the
caller-provided reconnectOnError if explicitly set, otherwise
defaultReconnectOnError) so that reconnectOnError, defaultReconnectOnError and
...this.options.redis are applied in the correct order and the failover handler
is preserved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c4d97766-1dd3-468d-86fb-5ad3635c68f1

📥 Commits

Reviewing files that changed from the base of the PR and between 6cdd881 and 0ed192d.

📒 Files selected for processing (11)
  • .server-changes/redis-reconnect-on-readonly-loading.md
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/redis.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • internal-packages/redis/src/index.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (10)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • internal-packages/redis/src/index.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • internal-packages/redis/src/index.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • internal-packages/redis/src/index.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: Access environment variables through the env export of env.server.ts instead of directly accessing process.env
Use subpath exports from @trigger.dev/core package instead of importing from the root @trigger.dev/core path

Use named constants for sentinel/placeholder values (e.g. const UNSET_VALUE = '__unset__') instead of raw string literals scattered across comparisons

Files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
apps/webapp/**/*.server.ts

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

apps/webapp/**/*.server.ts: Never use request.signal for detecting client disconnects. Use getRequestAbortSignal() from app/services/httpAsyncStorage.server.ts instead, which is wired directly to Express res.on('close') and fires reliably
Access environment variables via env export from app/env.server.ts. Never use process.env directly
Always use findFirst instead of findUnique in Prisma queries. findUnique has an implicit DataLoader that batches concurrent calls and has active bugs even in Prisma 6.x (uppercase UUIDs returning null, composite key SQL correctness issues, 5-10x worse performance). findFirst is never batched and avoids this entire class of issues

Files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
{apps,internal-packages}/**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

Use pnpm run typecheck to verify changes in apps and internal packages (apps/*, internal-packages/*) instead of build, which proves almost nothing about correctness

Files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • internal-packages/redis/src/index.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
{package.json,**/*.{ts,tsx,js}}

📄 CodeRabbit inference engine (CLAUDE.md)

Pin Zod to version 3.25.76 exactly across the entire monorepo - never use a different version or version range

Files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • internal-packages/redis/src/index.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx,js}: Import from @trigger.dev/core using subpaths only, never the root export
Always import tasks from @trigger.dev/sdk, never from @trigger.dev/sdk/v3 or deprecated client.defineJob
Add crumbs to code using // @Crumbs comments or `// `#region` `@crumbs blocks for debug tracing during development

Files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • internal-packages/redis/src/index.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
**/*.{ts,tsx,js,jsx,json,md,css,scss}

📄 CodeRabbit inference engine (AGENTS.md)

Code formatting is enforced using Prettier. Run pnpm run format before committing

Files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • internal-packages/redis/src/index.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
🧠 Learnings (6)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • internal-packages/redis/src/index.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • internal-packages/redis/src/index.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
📚 Learning: 2026-03-26T09:02:07.973Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.

Applied to files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
📚 Learning: 2026-05-05T09:38:02.512Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3523
File: apps/webapp/app/routes/api.v3.batches.ts:178-181
Timestamp: 2026-05-05T09:38:02.512Z
Learning: When reviewing code that catches `ServiceValidationError` in `*.server.ts` files, do not blindly forward `error.status` to HTTP responses, because SVEs may be thrown with non-default statuses (e.g., 400/500) and forwarding them can cause client-visible behavioral regressions (e.g., surfacing 500s to clients). Prefer a safe default response status of `error.status ?? 422`, but only after confirming via the reachable call graph that the caught `ServiceValidationError` instances are expected to carry those non-default statuses; otherwise, normalize to `422` to avoid unexpected client-visible 5xx behavior.

Applied to files:

  • apps/webapp/app/services/taskIdentifierCache.server.ts
  • apps/webapp/app/services/autoIncrementCounter.server.ts
  • apps/webapp/app/presenters/v3/DevPresence.server.ts
  • apps/webapp/app/services/platformNotificationCounter.server.ts
  • apps/webapp/app/v3/handleSocketIo.server.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
  • apps/webapp/app/services/inputStreamWaitpointCache.server.ts
  • apps/webapp/app/redis.server.ts
📚 Learning: 2026-02-06T19:53:38.843Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/presenters/v3/DeploymentListPresenter.server.ts:233-237
Timestamp: 2026-02-06T19:53:38.843Z
Learning: When constructing Vercel dashboard URLs from deployment IDs, always strip the dpl_ prefix from the ID. Implement this by transforming the ID with .replace(/^dpl_/, "") before concatenating into the URL: https://vercel.com/${teamSlug}/${projectName}/${cleanedDeploymentId}. Consider centralizing this logic in a small helper (e.g., getVercelDeploymentId(id) or a URL builder) and add tests to verify both prefixed and non-prefixed inputs.

Applied to files:

  • apps/webapp/app/presenters/v3/DevPresence.server.ts
📚 Learning: 2026-03-29T19:16:28.864Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3291
File: apps/webapp/app/v3/featureFlags.ts:53-65
Timestamp: 2026-03-29T19:16:28.864Z
Learning: When reviewing TypeScript code that uses Zod v3, treat `z.coerce.*()` schemas as their direct Zod type (e.g., `z.coerce.boolean()` returns a `ZodBoolean` with `_def.typeName === "ZodBoolean"`) rather than a `ZodEffects`. Only `.preprocess()`, `.refine()`/`.superRefine()`, and `.transform()` are expected to wrap schemas in `ZodEffects`. Therefore, in reviewers’ logic like `getFlagControlType`, do not flag/unblock failures that require unwrapping `ZodEffects` when the input schema is a `z.coerce.*` schema.

Applied to files:

  • apps/webapp/app/v3/handleSocketIo.server.ts
🔇 Additional comments (10)
internal-packages/redis/src/index.ts (1)

6-23: Reconnect-on-error helper is cleanly centralized and correctly wired.

The mapping and default option integration are consistent with the failover goal and keep client behavior uniform across consumers.

Also applies to: 31-31

.server-changes/redis-reconnect-on-readonly-loading.md (1)

1-7: Change note accurately reflects behavior and scope.

This doc update is aligned with the Redis reconnect handling introduced in code.

apps/webapp/app/services/taskIdentifierCache.server.ts (1)

2-2: Good consistency update for cache Redis initialization.

Applying the shared reconnect policy here keeps behavior aligned with the rest of the Redis-backed services.

Also applies to: 57-57

apps/webapp/app/v3/handleSocketIo.server.ts (1)

18-18: Socket.IO Redis adapter path is now correctly aligned with reconnect policy.

Nice to see the pub/sub transport included so failover behavior is consistent across realtime traffic too.

Also applies to: 97-97

apps/webapp/app/redis.server.ts (1)

2-2: Cluster and standalone paths are both covered correctly.

Wiring the same reconnect hook into both constructors avoids behavioral drift between modes.

Also applies to: 46-46, 74-74

apps/webapp/app/services/autoIncrementCounter.server.ts (1)

2-2: Reconnect policy integration in AutoIncrementCounter looks good.

The change is focused and preserves existing transactional behavior while improving failover resilience.

Also applies to: 15-15

apps/webapp/app/presenters/v3/DevPresence.server.ts (1)

2-2: Dev presence client update is straightforward and consistent.

This keeps reconnect behavior aligned without altering presence semantics.

Also applies to: 11-11

apps/webapp/app/services/sessionStreamWaitpointCache.server.ts (1)

2-2: Session stream waitpoint cache wiring is correct.

Good consistency with the shared reconnect strategy across Redis-backed services.

Also applies to: 32-32

apps/webapp/app/services/platformNotificationCounter.server.ts (1)

2-2: Reconnect-on-error wiring looks good.

Line 22 correctly applies the shared defaultReconnectOnError handler to this Redis client and matches the failover objective.

Also applies to: 22-22

apps/webapp/app/services/inputStreamWaitpointCache.server.ts (1)

2-2: Looks good — cache client now has failover-aware reconnect behavior.

The added reconnectOnError: defaultReconnectOnError is correctly applied for this Redis instance.

Also applies to: 28-28

Comment thread apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 0 potential issues.

View 2 additional findings in Devin Review.

Open in Devin Review

@ericallam ericallam merged commit 567e2a2 into main May 11, 2026
44 checks passed
@ericallam ericallam deleted the feat/tri-8868-reconnect-on-error branch May 11, 2026 06:17
ericallam added a commit that referenced this pull request May 11, 2026
…3549)

## Summary

When ElastiCache demotes a primary to replica — during a Multi-AZ
failover or a vertical node-type change — the demoting primary issues an
`UNBLOCKED` reply to any in-flight blocking commands (`BLPOP`, `BRPOP`,
`BLMOVE`, `XREADGROUP ... BLOCK`, etc.) to clear them before the role
flips. ioredis surfaces these as `ReplyError` to caller code.

The shared `defaultReconnectOnError` added in #3548 only matches
`READONLY` and `LOADING`. This extends it to `UNBLOCKED` so the
disconnect-reconnect-retry cycle handles BLPOP-shaped errors the same
way the existing two cases handle non-blocking-command errors.

## Fix

```ts
export function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (
    msg.startsWith("READONLY") ||
    msg.startsWith("LOADING") ||
    msg.startsWith("UNBLOCKED")
  ) {
    return 2;
  }
  return false;
}
```

Returning `2` tells ioredis to disconnect, reconnect, and re-issue the
command. For a BLPOP that means a fresh BLPOP against the new primary
instead of the `UNBLOCKED` error escaping to the caller.

## Test plan

- [ ] CI green
- [ ] Trigger a Multi-AZ failover or a vertical scale event on an
ElastiCache replication group whose clients are running blocking
commands and confirm no `UNBLOCKED` errors surface to caller code during
the cutover.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants