AWS Node Termination Handler Bug Fix Request Heartbeat #1172

Closed
@pshand-1

Description

Summary

When the AWS Node Termination Handler (NTH) processes ASG termination events, heartbeat goroutines are not cleaned up if the cordon and drain operation fails. The leaked goroutines accumulate, causing memory leaks and AWS API throttling from the excess calls.

Highlight of Flaw in Process

Expected Processing Flow

When NTH processes ASG termination messages successfully:

Problematic Flow

When NTH encounters a problem processing ASG termination messages:

  • Executes the PreDrain task, which starts the heartbeat goroutine
  • Fails to cordon and drain, for any reason
  • Runs CancelInterruptionEvent, which removes the event from the Event Store so that NTH can reprocess the message
  • Exits without cleaning up the heartbeat goroutine
  • Picks up the failed message again 20 seconds later (the visibility timeout hard-coded in NTH) and reprocesses it, running PreDrain each time, until it succeeds or the associated EC2 instance is terminated
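The flow above can be reproduced in a small standalone sketch. This is not NTH's code: `startHeartbeat`, `processOnce`, and `leakedAfterRetries` are hypothetical names, and the failing drain is simulated. It only shows that a heartbeat goroutine started before a failed drain keeps running when nothing closes its stop channel.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// startHeartbeat stands in for the PreDrain task (hypothetical, not NTH's API).
// It starts a ticking goroutine and returns the channel that stops it.
func startHeartbeat(active *atomic.Int64, interval time.Duration) chan<- struct{} {
	stop := make(chan struct{})
	active.Add(1) // counted before the goroutine starts, so the total is deterministic
	go func() {
		defer active.Add(-1)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				// A real handler would send a lifecycle heartbeat to AWS here.
			}
		}
	}()
	return stop
}

// processOnce mirrors the problematic flow: the heartbeat is only closed
// on the success path.
func processOnce(active *atomic.Int64, drain func() error) error {
	stop := startHeartbeat(active, time.Millisecond)
	if err := drain(); err != nil {
		// The event would be cancelled for retry here, but stop is never
		// closed, so the goroutine outlives this attempt.
		return err
	}
	close(stop)
	return nil
}

// leakedAfterRetries runs the given number of failing attempts and reports
// how many heartbeat goroutines are still running afterwards.
func leakedAfterRetries(attempts int) int64 {
	var active atomic.Int64
	alwaysFail := func() error { return errors.New("drain failed") }
	for i := 0; i < attempts; i++ {
		_ = processOnce(&active, alwaysFail)
	}
	return active.Load()
}

func main() {
	fmt.Println("heartbeats still running:", leakedAfterRetries(5))
}
```

Every failed attempt leaves one more goroutine behind, which matches the one-extra-heartbeat-per-retry behavior described in the list.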

Root Cause

NTH has no logic to clean up a started heartbeat goroutine when processing fails; the only existing cleanup runs after a successful cordon and drain. This becomes a problem when processing large numbers of EC2 instances that each take hours to terminate gracefully.

Each time cordon and drain fails, the message is picked up again 20 seconds later and another heartbeat is started for the same lifecycle hook, so the heartbeats for one instance can be duplicated every 20 seconds.

Impact

NTH is unlikely to fail cordoning and draining under normal circumstances. When it does, however, the extra goroutines calling AWS could trigger cascading failures in other components, especially in production environments with extended graceful shutdown periods.

Possible Simple Fix

Add an "early exit" function to the ASG termination event definition, alongside PreDrain and PostDrain. Right before running CancelInterruptionEvent, check whether it is set, the same way PreDrain and PostDrain are checked, and call it to close the heartbeat.
