Description
Summary
Found a bug in the AWS Node Termination Handler (NTH) when processing ASG terminations: heartbeat goroutines are not cleaned up when cordon and drain operations fail. The leaked goroutines consume memory and keep calling AWS, which can lead to throttling from excessive API calls.
Highlight of Flaw in Process
Expected Processing Flow
When NTH processes ASG termination messages successfully:
- Execute PreDrain Task which starts heartbeat goroutine
- Successfully Cordon and Drain node
- Run MarkAllAsProcessed then PostDrainTask which closes the heartbeat routine
Problematic Flow
When NTH encounters a problem processing ASG termination messages:
- Execute PreDrain Task which starts heartbeat goroutine
- Fail to Cordon and Drain the node for any reason
- Run CancelInterruptionEvent, which removes the event from the Event Store so that NTH can reprocess the message
- Exit without ever cleaning up the heartbeat goroutine
- Pick up the failed message again 20 seconds later (NTH's hard-coded visibility timeout) and reprocess it repeatedly, running PreDrain each time, until processing succeeds or the associated EC2 instance is terminated
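The flow above can be reproduced in a small, self-contained sketch. This is not NTH's actual code: `startHeartbeat`, `processOnce`, and `leakedAfterRetries` are hypothetical names, and the heartbeat body is a placeholder where the real handler would call AWS.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// startHeartbeat is a hypothetical stand-in for the PreDrain task. It starts
// a goroutine that ticks until the returned stop function is called, and it
// tracks the number of live goroutines in running.
func startHeartbeat(running *int64, interval time.Duration) (stop func()) {
	done := make(chan struct{})
	atomic.AddInt64(running, 1)
	go func() {
		defer atomic.AddInt64(running, -1)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				// The real handler would send a lifecycle heartbeat to AWS here.
			case <-done:
				return
			}
		}
	}()
	return func() { close(done) }
}

// processOnce mirrors the problematic flow: the heartbeat is only stopped on
// the success path.
func processOnce(running *int64, drain func() error) error {
	stop := startHeartbeat(running, time.Second)
	if err := drain(); err != nil {
		// Equivalent of CancelInterruptionEvent: the event is forgotten and
		// the function returns, but stop is never called.
		return err
	}
	stop() // equivalent of PostDrainTask
	return nil
}

// leakedAfterRetries simulates the message being reprocessed several times
// while cordon and drain keeps failing, and reports the goroutines left over.
func leakedAfterRetries(attempts int) int64 {
	var running int64
	failingDrain := func() error { return errors.New("cordon and drain failed") }
	for i := 0; i < attempts; i++ {
		_ = processOnce(&running, failingDrain)
	}
	return atomic.LoadInt64(&running)
}

func main() {
	fmt.Println("heartbeat goroutines still running:", leakedAfterRetries(3))
}
```

In this sketch, three failed attempts leave three heartbeat goroutines running, one per retry.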
Root Cause
NTH doesn't have logic to clean up started heartbeat goroutines when processing fails. The only existing cleanup runs after cordoning and draining succeed. This becomes problematic when processing large numbers of EC2 instances that each require hours to gracefully terminate.
Each time cordon and drain fails, the message is picked up again 20 seconds later and another heartbeat goroutine is started for the same lifecycle hook, so duplicate heartbeats accumulate for as long as the failure persists.
Impact
While NTH is unlikely to fail cordoning and draining under normal circumstances, each failure leaves extra goroutines pinging AWS. This raises concerns about cascading failures affecting other components, especially in production environments with extended graceful shutdown periods.
Possible Simple Fix
Add an "early exit" function to the ASG Termination Event definition alongside PreDrain and PostDrain. Right before running CancelInterruptionEvent, check whether it is set, the same way PreDrain and PostDrain are checked, and call it to close the heartbeat.