Description
Provide environment information
- Trigger.dev Version: 4.0.4
- Deployment: Self-hosted Kubernetes (GKE Autopilot)
Describe the bug
The failedPodHandler in the supervisor creates escalating duplicate connections when the Kubernetes informer experiences disconnections. Each time the informer disconnects (typically every 5 minutes due to Kubernetes watch timeouts), multiple error events fire, and each one independently calls informer.start(), creating duplicate handlers that compound over time.
Observed escalation pattern:
- First disconnect: 2 errors → 2 reconnects
- Second disconnect: 3 errors → 3 reconnects
- Third disconnect: 5 errors → 5 reconnects
- Fourth disconnect: 9 errors → 9 reconnects
- Fifth disconnect: 17 errors → 33 reconnects (!)
This creates growing CPU usage, API server load, and log pollution. The error counts track 2^n + 1 across disconnects, consistent with the set of duplicate watches roughly doubling each time (see the sketch below).
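For context, here is a minimal sketch of the unguarded pattern, assuming an informer built with @kubernetes/client-node v1.x (the namespace, watch path, and restart delay are illustrative, not the actual supervisor code):

```ts
import * as k8s from "@kubernetes/client-node";

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const coreApi = kc.makeApiClient(k8s.CoreV1Api);

// Watch pods the way a failed-pod handler would (namespace is illustrative).
const informer = k8s.makeInformer(
  kc,
  "/api/v1/namespaces/default/pods",
  () => coreApi.listNamespacedPod({ namespace: "default" })
);

// Unguarded restart: every error event schedules another informer.start().
// Concurrent starts create duplicate internal watches, and each duplicate
// fires its own error on the next timeout, so the counts compound.
informer.on("error", (err) => {
  console.error("error event fired", err);
  setTimeout(() => informer.start(), 1000);
});

informer.start();
```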
Reproduction repo
N/A - Bug occurs in the main trigger.dev supervisor when self-hosted on Kubernetes
To reproduce
- Deploy Trigger.dev self-hosted on Kubernetes with RBAC configured
- Monitor supervisor logs for the `failed-pod-handler` component
- Wait for the normal Kubernetes watch timeout (5-10 minutes)
- Observe error events and reconnection messages escalating over multiple disconnects
Expected: 1 error → 1 reconnect per disconnect
Actual: Errors and reconnects escalate: 2 → 3 → 5 → 9 → 17 → 33+
Additional information
Production Log Evidence:
Initial disconnect (5 errors):
{"timestamp":"2025-10-22T08:21:05.843Z","message":"error event fired","$name":"failed-pod-handler"...}
{"timestamp":"2025-10-22T08:21:05.844Z","message":"error event fired","$name":"failed-pod-handler"...}
[5 total errors, followed by 5 reconnects]

Later disconnect (17 errors → 33 reconnects!):
{"timestamp":"2025-10-22T08:31:07.893Z","message":"error event fired"...} // x17
{"timestamp":"2025-10-22T08:31:08.893Z","message":"informer connected"...} // x33 over 100msRoot Cause: The error handler has no guard against concurrent reconnections. When multiple error events fire, each independently calls informer.start(), creating compounding duplicate handlers.
Proposed Fix: Add a reconnection guard flag to ensure only one reconnection happens at a time.
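A minimal sketch of that guard, continuing the example above (the `reconnecting` flag name and one-second delay are hypothetical, not the actual patch):

```ts
let reconnecting = false;

informer.on("error", (err) => {
  console.error("error event fired", err);
  if (reconnecting) return; // coalesce the burst of error events into one restart
  reconnecting = true;
  setTimeout(() => {
    informer
      .start()
      .catch((startErr) => console.error("reconnect failed", startErr))
      .finally(() => {
        reconnecting = false; // re-arm once this restart settles
      });
  }, 1000);
});
```

With the flag in place, each disconnect triggers at most one in-flight informer.start(), restoring the expected 1 error → 1 reconnect behavior no matter how many error events fire.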