Description
Motivation
Respond to spot instance terminations more gracefully, i.e. avoid failed requests while traffic migrates from the terminating instance to another one that is healthy.
Questions
- What is the current behavior, and what would this achieve that's better? Does the cluster autoscaler help with this at all?
Description
- https://github.com/aws/aws-node-termination-handler
- https://itnext.io/the-definitive-guide-to-running-ec2-spot-instances-as-kubernetes-worker-nodes-68ef2095e767
Edit (Research)
Some relevant articles here:
- https://aws.amazon.com/blogs/compute/best-practices-for-handling-ec2-spot-instance-interruptions/
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html#spot-instance-termination-notices
- https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html
- https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#use-kubectl-drain-to-remove-a-node-from-service
If we add aws-node-termination-handler and have it kubectl-drain the node upon the termination notice, then I think the serving container will react by rejecting the requests currently in its queue and letting those that are still being processed finish. For testing, killing/terminating the instance is probably not the right way to exercise this; instead, we need a way to reproduce the termination notice that AWS emits. A minimal sketch of the mechanism is below.
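For illustration only, here is a rough sketch of what the handler flow does, assuming IMDSv1 access to the instance metadata endpoint and kubectl credentials on the node; the node name and polling interval are hypothetical, and aws-node-termination-handler would automate all of this for us:

```python
import subprocess
import time
import urllib.error
import urllib.request

# The EC2 instance metadata endpoint publishes a spot interruption notice
# roughly two minutes before the instance is reclaimed.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"
NODE_NAME = "ip-10-0-1-23.ec2.internal"  # hypothetical node name


def termination_notice_received() -> bool:
    """Return True once AWS has published a spot interruption notice."""
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.HTTPError, urllib.error.URLError):
        # 404 (or no response) means no interruption notice yet.
        return False


def drain_node(node: str) -> None:
    """Cordon and drain the node so pods are evicted before termination."""
    subprocess.run(
        ["kubectl", "drain", node,
         "--ignore-daemonsets", "--delete-emptydir-data",
         "--grace-period=120"],
        check=True,
    )


if __name__ == "__main__":
    while not termination_notice_received():
        time.sleep(5)  # poll well within the ~120 s notice window
    drain_node(NODE_NAME)
```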
With https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-conn-drain.html and the kubectl drain procedure we might be able to transition traffic gracefully to a healthy instance. It looks like the back-end connection-draining timeout defaults to 300 seconds before the ELB kills the requests headed to the de-registering instance; we'd probably want to lower that to 120 seconds to match the termination notice period.
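If we go that route, a sketch of setting the classic ELB connection-draining timeout via boto3 might look like the following; the load balancer name and region are placeholders:

```python
import boto3

# Classic ELBs use the 'elb' client; connection draining is a load
# balancer attribute rather than a listener setting.
elb = boto3.client("elb", region_name="us-east-1")  # placeholder region

elb.modify_load_balancer_attributes(
    LoadBalancerName="my-serving-elb",  # hypothetical ELB name
    LoadBalancerAttributes={
        "ConnectionDraining": {
            "Enabled": True,
            # Give in-flight requests up to 120 s, matching the two-minute
            # spot termination notice, instead of the 300 s default.
            "Timeout": 120,
        }
    },
)
```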