Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Waiting for termination after ExecStop #13284

Open
sonOfRa opened this issue Aug 8, 2019 · 10 comments
Open

Waiting for termination after ExecStop #13284

sonOfRa opened this issue Aug 8, 2019 · 10 comments
Labels
needs-discussion 🤔 pid1 RFE 🎁 Request for Enhancement, i.e. a feature request

Comments

@sonOfRa
Copy link

sonOfRa commented Aug 8, 2019

Is your feature request related to a problem? Please describe.
I have a service which can only be terminated by sending a command to the service via an external tool. That external tool sends a "shutdown" command to the service. The service then reports back to the tool "I am now shutting down", which the tool interprets as success. At this point, the tool terminates, and the ExecStop= command terminates successfully.

At this point, the service is still in the process of shutting down, but systemd immediately sends SIGTERM (or whatever is specified in KillSignal=), and the process is killed abruptly.

Describe the solution you'd like
I would like systemd to honor TimeoutStopSec= in this case. TimeoutStopSec is honored before sending FinalKillSignal later, but it does not seem to be honored in this case.

Describe alternatives you've considered
Currently the unit file simply contains a second ExecStop=sleep 5 line, which is enough time for the service to actually terminate. This solution is obviously ugly.
Another solution would be the introduction of another Timeout-like variable that exists precisely for this use case, in order not to change behaviour in existing units.

I realize that documentation states that "The specified command should hence be a synchronous operation, not an asynchronous one.". However, at least to someone not deeply familiar with systemd internals; this behaviour was very surprising, I would completely expect TimeoutStopSec to also apply to the gap between ExecStop= and KillSignal=.

@poettering
Copy link
Member

The documented assumption is that any command configured in ExecStop= is synchronous and complete, i.e. more than an asynchronous, friendly request to shutdown. i.e. so that any processes left after ExecStop= is left-over stuff that can be killed immediately.

This assumption builds on the fact that sysvinit required such synchronous stop behaviour too to work correctly, and ExecStop= is mostly about mimicking compat with that.

Modern services should not need ExecStop= anyway: if they react correctly to SIGTERM (and exit cleanly) they don't need to define ExecStop= at all, and can set KillMode=mixed and thus get an asynchronous SIGTERM to the main PID to which they then can react asynchronously to.

So, summary: any service that implements an asynchronous ExecStop= is not in line with either systemd's nor sysvinit#s expected behaviour. I'd really prefer if the service would be fixed to follow this logic (and ideally not require any ExecStop= command at all, just by handling SIGTERM correctly, as mentioned).

@poettering poettering added needs-discussion 🤔 pid1 RFE 🎁 Request for Enhancement, i.e. a feature request labels Aug 8, 2019
@sonOfRa
Copy link
Author

sonOfRa commented Aug 8, 2019

The service in question is written in Java, which doesn't really lend itself well to signal handling:

You can register shutdown hooks, but they tend to be rather fragile and while they work reliably in single-threaded applications, they can cause problems in multithreaded applications like this one.

If this is too much of a niche case to be handled in systemd I understand, and will simply stick to the 5 second sleep as a second ExecStop= command. This works reliably as the service really doesn't need long to exit, it's just not quite immediate after the command exits.

In any case, while this is documented in the ExecStop= section, it might be a good idea to add a note to TimeoutStopSec= that this value explicitly does not wait between ExecStop= and KillSignal=. It surprised me and might surprise other readers of the documentation as well.

@arvidjaar
Copy link
Contributor

arvidjaar commented Aug 23, 2019

assumption is that any command configured in ExecStop= is synchronous and complete

This is unrealistic. There is absolutely no sane way ExecStop command can confirm that service has stopped. It obviously cannot receive confirmation after service process exited; checking for process is unreliable as well (and duplicates what systemd does). So how do you expect ExecStop command to be synchronous and complete in the first place?

Anything that ExecStop command may try will just duplicate what systemd does already and it likely does it better. Do you require every service to re-implement systemd process handling?

@poettering
Copy link
Member

This is unrealistic. There is absolutely no sane way ExecStop command can confirm that service has stopped. It obviously cannot receive confirmation after service process exited; checking for process is unreliable as well (and duplicates what systemd does). So how do you expect ExecStop command to be synchronous and complete in the first place?

Do IPC, talk to the service, and wait until it reports back that it is almost finish going down now. Then wait for the EOF on the IPC channel and assume the thing is done with that.

But quite frankly: don't use ExecStop= at all, just handle SIGTERM properly. There must be a way to do this with Java.

Anything that ExecStop command may try will just duplicate what systemd does already and it likely does it better. Do you require every service to re-implement systemd process handling?

Well, my assumption is that new-style stuff would just handle SIGTERM properly, and old style stuff already has some hacky solution in place. ExecStop= for service shutdown is a legacy concept if you so will, you don't need it for clean services that properly handle SIGTERM.

@vincentbernat
Copy link
Contributor

Another use case for this is nginx. To gracefully stops nginx, you need to send SIGSTOP, wait a bit, send SIGTERM if still needed, wait a bit, send SIGKILL if really needed. In Debian, this is implemented using start-stop-daemon:

ExecStop=-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
TimeoutStopSec=5
KillMode=mixed

It would be nice to not have to rely on Debian-specific tools. With the proposed change, we could implement the QUIT/5/TERM/5/KILL with the following snippet:

ExecStop=/usr/bin/kill -QUIT $MAINPID
TimeoutStopSec=5

nginx cannot conflates the two signals into one as they have different purposes:

  • SIGSTOP to wait for existing connections to terminate
  • SIGTERM to do a fast shutdown, which, among other tasks, flushes logs

An alternative configuration could be:

KillSignal=QUIT
TimeoutSec=30s

But if a connection is lingering for 30s, logs may not be flushed correctly due to nginx receiving SIGKILL and not SIGTERM.

@ghost
Copy link

ghost commented Nov 12, 2020

This is exactly what happened to me, I wrote a service that should cleanly shutdown a program within a screen session. For this it is necessary to send a specific text command to the program within the screen session. The ExecStop command was executed but immediately afterwards the SIGTERM ensured that the screen was closed before the program could be ended properly. At this point the documentation of TimeoutStopSecis is really a bit misleading and should be clarified. It took me some time to realize that there is no option for a timeout between ExecStop and the KillSignal.

Therefore I had the choice to put a SIGTERM wrapper around the screen or just tinker with the behavior of systemd.
For now I simply changed the KillSignal to SIGCONT which does not cause any harm to screen or the program itself to get rid of the unwanted SIGTERM. This way the program has enough time to do a clean shutdown and if it fails there is still the SIGKILL after the timeout. I would still prefer some kind of optional timeout between ExecStop and the KillSignal, but it seems that this will be prevented for reasons that are not really understandable.

Yes, in a perfect world you would simply adapt any software so that it reacts cleanly to a SIGTERM signal and would avoid yourself all the problems, but we don't live in this perfect world and have to get along with what we have, at least for the time being.

@elyscape
Copy link

While I agree that it would be ideal if everything handled SIGTERM properly, unfortunately sometimes one needs to write service units for apps provided by third parties that don't do this. As mentioned by @sonOfRa, this is particularly common with Java apps. As an extremely janky workaround, I've been doing something along these lines:

[Service]
# Type must be set correctly for $MAINPID detection to work
Type=exec
ExecStart=/path/to/myservice/start-command
ExecStop=/path/to/myservice/stop-command
ExecStop=/usr/bin/sh -c 'while kill -0 $MAINPID 2>/dev/null; do sleep 1; done'
TimeoutStopSec=15

It's ugly and terrible, and ideally it wouldn't be necessary, but it gets the job done.

@boucman
Copy link
Contributor

boucman commented Feb 24, 2021

That's definitely ugly, but a nice trick to know... I had never heard of kill -0. thx

@zabbal
Copy link

zabbal commented Feb 28, 2023

The following workaround works just fine for ugly Java app I have to work with:

[Service]
Type=simple
ExecStart=...
ExecStop=...
# Replace default systemd termination signal as daemon fails to handle it properly:
KillSignal=SIGCONT
# Wait for 2 minutes - Java apps even die slowly
TimeoutStopSec=120

I think this ticket can be safely closed - it's really not systemd responsibility to cater for broken Java apps.

@kevincox
Copy link

I would also like to see better tools here. While I agree that it would be nice if all services handled SIGTERM the fact is that the world isn't perfect and we need to live with third-party services or just languages that make this difficult.

I think the default behaviour is quite reasonable, but it would nice to see an option to change it. For me the best solution would be the following:

  1. Run ExecStop
  2. Watch the service for some timeout.
  3. If not dead send SIGTERM.
  4. Watch for TimeoutStopSec=
  5. Send SIGKILL.

Basically since systemd already has code to watch for service exit I would much rather not need to add some sort of loop in my ExecStop script that will both be slow and inefficient as it will inevitably use some sort of polling. Maybe this could be some extra option like: ExecStopAsyncTimeout=10min which would enable this wait after ExecStop where the service is expected to exit. If it doesn't then we can proceed with more direct killing measures.

Yes, this can be implemented by the user by adding a polling loop at the end of their script. But this is tedious, inefficient and probably breaks with PID reuse.

For now I will be using zabbal's workaround which is pretty close to the desired behaviour. But it would be nice to be able to still maintain a SIGTERM step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-discussion 🤔 pid1 RFE 🎁 Request for Enhancement, i.e. a feature request
Development

No branches or pull requests

8 participants