Liveness/Readiness probes failure while using graceful shutdown #24617

Closed
paoven opened this issue Dec 28, 2020 · 9 comments
Labels
status: duplicate A duplicate of another issue

Comments

@paoven

paoven commented Dec 28, 2020

Hi! While investigating the Spring Boot graceful shutdown feature together with the liveness/readiness endpoints (to be called at a later stage by Kubernetes probes), I discovered that these endpoints become unreachable as soon as shutdown is initiated by a SIGTERM. This would cause the Kubernetes liveness probe to fail and may lead to an unclean shutdown. As the purpose of graceful shutdown is the opposite, I filed this report.

I created a sample Spring Boot project on GitHub to simplify issue testing. The service exposes a controller that, when invoked, will sleep for the desired amount of time.

Please see instructions to reproduce the issue below. Many thanks for your support, best regards

Paolo

Environment:

  • OS: Windows 10 Pro
  • JDK: openjdk 11.0.7 2020-04-14 LTS
  • Spring boot starter parent : 2.4.1
  • Servlet Engine: Apache Tomcat/9.0.41

How to reproduce:

  • Please check out the sample project https://github.com/paoven/graceful-shutdown
  • Run the project and call the endpoint (E.g. curl -H "Content-Type:application/json" -H "Accept:application/json" -XPOST -d "1" http://localhost:8080/wait?waitMs=20000)
  • Take note of the process PID and initiate a graceful shutdown by sending a SIGTERM signal within 20 seconds (e.g. kill -SIGTERM [PID]).
  • The Spring Boot service logs the graceful shutdown and the request is fulfilled, as it takes less than the configured graceful shutdown timeout (30s). The problem is that, as soon as you send the SIGTERM, the Spring Boot actuator health endpoints become unreachable (both the liveness and readiness groups). External systems relying on those endpoints for availability/health checks (such as Kubernetes) would conclude too early that the service is no longer available.
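
One way to observe the behaviour described in the last step is to poll both probe endpoints while the shutdown is in progress. The following is a minimal sketch, not part of the sample project: it assumes the service listens on http://localhost:8080 and exposes the default actuator probe paths /actuator/health/liveness and /actuator/health/readiness (the issue does not quote the exact URLs). The class name ProbePoller and the 1-second polling interval are illustrative choices.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class ProbePoller {

    // Hypothetical base URL; adjust to wherever the sample service listens.
    static final String BASE = "http://localhost:8080";

    // Assumed default actuator probe paths (not quoted in the issue).
    static final String[] PATHS = {
        "/actuator/health/liveness",
        "/actuator/health/readiness"
    };

    /** Returns "HTTP <code>" if a response arrives, or "UNREACHABLE (...)" if the request fails. */
    static String probe(HttpClient client, String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return "HTTP " + response.statusCode();
        } catch (IOException e) {
            // Covers refused connections as well as connect/request timeouts.
            return "UNREACHABLE (" + e.getClass().getSimpleName() + ")";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "INTERRUPTED";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(1))
                .build();
        // Poll once per second for 30 seconds; send the SIGTERM while this runs.
        for (int i = 0; i < 30; i++) {
            for (String path : PATHS) {
                System.out.println(path + " -> " + probe(client, BASE + path));
            }
            Thread.sleep(1000);
        }
    }
}
```

Start the poller, issue the long-running /wait request, then send the SIGTERM. If the report above is accurate, the output for both paths should switch from an HTTP status to UNREACHABLE while the pending request is still being served.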
@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Dec 28, 2020
@bclozel
Member

bclozel commented Dec 28, 2020

@paoven Have you seen the Kubernetes deployment section in the reference documentation?

We assume that the graceful shutdown sequence should start once the platform has stopped routing traffic to the application instance. The shutdown delay really depends on the platform (in your case the readiness check period, which is configurable).

We're considering adding an optional delay in #20995 - but a possible solution here is to configure a preStop hook as explained in our documentation.

@bclozel bclozel added the status: waiting-for-feedback We need additional information before we can continue label Dec 28, 2020
@paoven
Author

paoven commented Dec 30, 2020

@bclozel Thanks for the prompt answer. I was able to obtain a clean shutdown (on both the client and server side) by using a preStop hook with a sleeping thread that introduces the mentioned delay. Regarding the relation between the shutdown delay and the readiness check period: as far as I can see, Kubernetes no longer relies on liveness/readiness checks once the Pod enters the Terminating state, but the delay is still necessary to guarantee that the platform has removed the Pod reference from Services/RS and has effectively stopped sending traffic to it. Thanks again, kind regards

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Dec 30, 2020
@jcook793

The problem with preStop delays is that they are fixed amounts of time. So if I'm willing to let uploads take 3 minutes if necessary, that means every time I do a pod deployment it is going to take 3 minutes, even if there is no traffic at all.

@paoven
Author

paoven commented Dec 30, 2020

@jcook793 as far as I understood, the preStop delay is only needed to let the platform/load balancer stop routing new traffic to the service that is shutting down (in Kubernetes this should be 5/10 seconds according to this useful article).

Existing connections are not forcibly dropped at this stage. In the upload example you mentioned, when Spring Boot receives the SIGTERM signal, it will wait for pending requests to complete, up to the maximum shutdown timeout (e.g. spring.lifecycle.timeout-per-shutdown-phase configured to 3 minutes), but if there are no requests it will shut down quickly without waiting the full 3 minutes.
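
The difference between a fixed delay and a maximum wait can be illustrated with plain JDK classes. This is only an analogy built on a standard ExecutorService, not Spring Boot's actual shutdown code; the class name DrainTimeoutSketch, the 3000 ms timeout, and the 500 ms task (a stand-in for an upload) are all illustrative assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class DrainTimeoutSketch {

    /**
     * Stops accepting new work, then waits for in-flight work to finish,
     * but never longer than timeoutMs. Returns the time actually waited.
     */
    static long drain(ExecutorService pool, long timeoutMs) {
        long start = System.nanoTime();
        pool.shutdown(); // reject new tasks, let running ones complete
        try {
            pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    /** Simulates one in-flight request that needs the given time to complete. */
    static void submitSlowRequest(ExecutorService pool, long millis) {
        pool.submit(() -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    public static void main(String[] args) {
        // Idle "server": nothing in flight, so the drain returns almost immediately.
        ExecutorService idle = Executors.newFixedThreadPool(2);
        System.out.println("idle drain took ~" + drain(idle, 3000) + " ms");

        // Busy "server": one request in flight, so the drain lasts about as long as it.
        ExecutorService busy = Executors.newFixedThreadPool(2);
        submitSlowRequest(busy, 500);
        System.out.println("busy drain took ~" + drain(busy, 3000) + " ms");
    }
}
```

In both cases the configured timeout is an upper bound: the idle pool finishes well before it, and the busy pool waits only for the work that is actually pending.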

@bclozel
Member

bclozel commented Jan 4, 2021

@jcook793 See @paoven's comment - the app shuts down as soon as possible.

I'm closing this issue as a duplicate of #20995 - I don't know if we'll implement it, but in the meantime the preStop hook seems like the sensible solution here.

@bclozel bclozel closed this as completed Jan 4, 2021
@bclozel bclozel added status: duplicate A duplicate of another issue and removed status: feedback-provided Feedback has been provided status: waiting-for-triage An issue we've not yet triaged labels Jan 4, 2021
@lturcsanyi

The documentation is still wrong, because it states that during the graceful shutdown period the liveness probe should report LIVE state, but both endpoints are unreachable.
docs

@bclozel
Member

bclozel commented Jan 14, 2021

@lturcsanyi could you quote here exactly the section that states this?

@lturcsanyi

Sorry, I linked the wrong section. I meant the second table in the "Application lifecycle and Probes states" section: "When a Spring Boot application shuts down:"
I would assume this means that the liveness probe still returns the live state.

@bclozel
Member

bclozel commented Jan 14, 2021

Thanks for the feedback @lturcsanyi , I've created #24843
