How to achieve graceful server shutdown for environments that rely on healthcheck to mark nodes unhealthy #587

niodice · 2023-05-15T17:15:04Z

Background Context

I'm working in a hosting environment that at a high level works like this:

- A load balancer monitors the /health endpoint, exposed via the admin port, to identify whether hosts are healthy and routable.
- If the health check fails for an instance some number of times in a row, the instance is marked as unhealthy and traffic is no longer routed to it.
- During deployments or instance replacements, a SIGTERM is sent to the service to trigger a shutdown.

Observed behavior

When the service receives a SIGTERM, it appears that a few things happen:
- Both the admin server and the application server handle the signal and shut down immediately
- The service shuts down because at least one of the awaitables has closed: https://github.com/twitter/finatra/blob/finatra-22.7.0/inject/inject-server/src/main/scala/com/twitter/inject/server/TwitterServer.scala#L169-L174

Pain points

This is problematic because there is a period of time where:
- The admin server has shut down and won't report healthy
- The application server has shut down and won't accept any requests
- The load balancer is still routing traffic to this instance, because more than one failed health check is required before the instance is taken out of rotation. These requests fail, and it is not always safe to retry them because they may not be idempotent.

Desired behavior

Is there a way to have fine-grained control over the shutdown sequence? Ideally, I would:
- Shut down the admin server (or tell it to report a non-200 status code) and keep the application server up for a configurable amount of time (say, 15 seconds)
- After that period, shut down the application server and then gracefully terminate the program.
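
The two steps above can be modeled without any framework. This is only a sketch of the sequencing, not Finatra or TwitterServer API: `DrainingShutdown`, `healthStatus`, `begin`, and the `closeApplication` callback are hypothetical names, and wiring `healthStatus` into the real /health endpoint is left open.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical helper (not part of Finatra/TwitterServer) modeling the desired sequence.
final class DrainingShutdown(drainMillis: Long, closeApplication: () => Unit) {
  private val healthy = new AtomicBoolean(true)
  private val started = new AtomicBoolean(false)
  // The worker thread is only created once a task is scheduled.
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  /** Status a health endpoint would report: 200 while serving, 503 once draining. */
  def healthStatus: Int = if (healthy.get) 200 else 503

  /** Step 1: fail health checks immediately. Step 2: close the application after the drain window. */
  def begin(): Unit = {
    // Only the first call has any effect; repeated signals are ignored.
    if (started.compareAndSet(false, true)) {
      healthy.set(false)
      scheduler.schedule(
        new Runnable {
          def run(): Unit =
            try closeApplication()
            finally scheduler.shutdown()
        },
        drainMillis,
        TimeUnit.MILLISECONDS
      )
    }
    ()
  }
}
```

With something like this, a SIGTERM handler would call `begin()` with a 15000 ms drain window, and `closeApplication` would be whatever actually closes the application server.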


niodice · 2023-05-15T18:32:40Z

Also posted in gitter: https://matrix.to/#/!fqPdgWhOEtxbPlasve:gitter.im/$skWDihqObMBVhErRuXgkeqIploCBKlZWmiria1xpVAg?via=gitter.im&via=matrix.org

cacoco · 2023-08-17T22:42:17Z

@niodice relying on SIGTERM (as opposed to calling close() on the server) is going to bypass a lot of the graceful shutdown mechanics, IIRC.

The most straightforward thing to do is likely to handle the signal using c.t.util.HandleSignal and then call close().

See: https://twitter.github.io/finatra/user-guide/app/index.html#an-example-of-handling-signals (note a TwitterServer is an App, so you can follow the example similarly somewhere in your Server definition)

You likely also want to adjust the grace period in the server to allow more time for closing resources, if necessary. Hope that helps.
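
For reference, a minimal sketch of that suggestion, assuming a Finatra `HttpServer`. `ExampleServer`, `ExampleServerMain`, and the 30-second grace period are illustrative and not from this thread. The sketch only covers handling the signal and closing with a grace period; it does not change what /health reports.

```scala
import com.twitter.finatra.http.HttpServer
import com.twitter.finatra.http.routing.HttpRouter
import com.twitter.util.{Duration, HandleSignal}

// Hypothetical server; real routes and filters are omitted.
class ExampleServer extends HttpServer {

  // Handle SIGTERM explicitly and close the server with a grace period.
  // The 30-second value is an assumption; tune it for your resources.
  HandleSignal("TERM") { _ =>
    close(Duration.fromSeconds(30))
    ()
  }

  override protected def configureHttp(router: HttpRouter): Unit = {
    // router.add[...] calls would go here
  }
}

object ExampleServerMain extends ExampleServer
```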
