
bug: Escalating duplicate reconnections in failedPodHandler #2623

@NERLOE

Description


Provide environment information

  • Trigger.dev Version: 4.0.4
  • Deployment: Self-hosted Kubernetes (GKE Autopilot)

Describe the bug

The failedPodHandler in the supervisor creates escalating duplicate connections when the Kubernetes informer experiences disconnections. Each time the informer disconnects (typically every 5 minutes due to Kubernetes watch timeouts), multiple error events fire, and each one independently calls informer.start(), creating duplicate handlers that compound over time.
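For illustration only (this is not the actual trigger.dev source), an unguarded handler of the shape below reproduces the behavior: every error event schedules its own informer.start(), so N error events from one disconnect produce N restarts. The MinimalInformer interface only mirrors the on/start pieces of the @kubernetes/client-node informer the supervisor uses, and the function name is hypothetical.

// Illustrative TypeScript sketch of the unguarded pattern (hypothetical names).
interface MinimalInformer {
  start(): Promise<void>;
  on(event: "error", cb: (err?: unknown) => void): void;
}

export function registerUnguardedErrorHandler(informer: MinimalInformer) {
  informer.on("error", (err) => {
    console.error("error event fired", err);
    // No guard: each of the N error events from a single disconnect
    // schedules its own restart, and the restarts compound across
    // subsequent disconnects.
    setTimeout(() => informer.start(), 1_000);
  });
}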

Observed escalation pattern:

  • First disconnect: 2 errors → 2 reconnects
  • Second disconnect: 3 errors → 3 reconnects
  • Third disconnect: 5 errors → 5 reconnects
  • Fourth disconnect: 9 errors → 9 reconnects
  • Fifth disconnect: 17 errors → 33 reconnects (!)

This creates growing CPU usage, API server load, and log pollution.

Reproduction repo

N/A - Bug occurs in the main trigger.dev supervisor when self-hosted on Kubernetes

To reproduce

  1. Deploy Trigger.dev self-hosted on Kubernetes with RBAC configured
  2. Monitor supervisor logs for the failed-pod-handler component
  3. Wait for normal Kubernetes watch timeout (5-10 minutes)
  4. Observe error events and reconnection messages escalating over multiple disconnects

Expected: 1 error → 1 reconnect per disconnect
Actual: Errors and reconnects escalate: 2 → 3 → 5 → 9 → 17 → 33+

Additional information

Production Log Evidence:

Initial disconnect (5 errors):

{"timestamp":"2025-10-22T08:21:05.843Z","message":"error event fired","$name":"failed-pod-handler"...}
{"timestamp":"2025-10-22T08:21:05.844Z","message":"error event fired","$name":"failed-pod-handler"...}
[5 total errors, followed by 5 reconnects]

Later disconnect (17 errors → 33 reconnects!):

{"timestamp":"2025-10-22T08:31:07.893Z","message":"error event fired"...} // x17
{"timestamp":"2025-10-22T08:31:08.893Z","message":"informer connected"...} // x33 over 100ms

Root Cause: The error handler has no guard against concurrent reconnections. When multiple error events fire, each independently calls informer.start(), creating compounding duplicate handlers.

Proposed Fix: Add a reconnection guard flag to ensure only one reconnection happens at a time.
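A minimal sketch of that guard, assuming an informer-like object that exposes on("error", ...) and start() as in @kubernetes/client-node; names such as RECONNECT_DELAY_MS, reconnecting, and registerGuardedErrorHandler are illustrative, not taken from the codebase:

// Hypothetical TypeScript sketch of a reconnection guard flag.
interface MinimalInformer {
  start(): Promise<void>;
  on(event: "error", cb: (err?: unknown) => void): void;
}

const RECONNECT_DELAY_MS = 1_000;

export function registerGuardedErrorHandler(informer: MinimalInformer) {
  let reconnecting = false; // only one restart may be in flight at a time

  informer.on("error", (err) => {
    console.error("error event fired", err);

    if (reconnecting) {
      // A restart is already scheduled for this disconnect; ignoring the
      // extra error events stops the 2 → 3 → 5 → 9 → ... escalation.
      return;
    }
    reconnecting = true;

    setTimeout(async () => {
      try {
        await informer.start();
        console.log("informer reconnected");
      } catch (restartErr) {
        console.error("informer restart failed", restartErr);
      } finally {
        reconnecting = false;
      }
    }, RECONNECT_DELAY_MS);
  });
}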
