New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Warn/disallow Restart=always without preventing repeated too quickly loop
#30804
Comments
This is already the case, i.e. if start limit is exceeded, no further auto-restart will be allowed. So I'm not really sure what this is about? |
|
While the original report might not have been perfectly phrased, I think there is indeed substance to the report and would like to add my observations and specific suggestion. I published a blog post titled “systemd: enable indefinite service restarts” where I share the one important bit of systemd knowledge that I apply to every single server system I maintain: how to make services restart indefinitely, instead of having systemd give up on them: https://michael.stapelberg.ch/posts/2024-01-17-systemd-indefinite-service-restarts/ To summarize: the problem is that systemd’s default restart limits are chosen such that systemd gives up on restarting crash-looping services: I don’t know if this was an intentional decision, or if the limits were chosen arbitrarily. Either way, to give a specific suggestion, how about we change the default configuration to: With these settings, services will be restarted indefinitely, with a 5-second delay in between. This doesn’t seem too aggressive, and should result in transient failures eventually resolving themselves without human intervention. What do you think? |
|
I find the default behaviour quite appropriate; at least I oppose defaulting or even suggesting an indefinite unthrottled restart behaviour with the suggested approach. IMO it's bad advice, especially the bit with enabling it for all services. If you apply such a big blunt "retry harder" hammer, you should really know what you are doing.. If your service crashes 5 times in 10 seconds, it's just plain broken. Trying to restart it will hide it from A possible compromise might be exponential back-off with |
|
Hey Martin 👋 Good to read from you, it’s been a long time :) Hope you’re doing well.
Note that we’re only talking about changing the behavior for services that specify
Yeah, this is a good point that has also been pointed out in the discussion on Mastodon: Some people alert on failed systemd services, so if a service never fails, it won’t show up in their monitoring/alerting. Personally, I prefer monitoring/alerting based on probing the application, not specific systemd units, but I agree that this is a difference in philosophy and that a crash-looping service should be discoverable as a problem in monitoring. Maybe systemd could expose that the service is restarting, but still keep trying to start it? In my Go appliance platform gokrazy, I’m exposing crashing services with a “restarting” status. For example, in this case, the frame buffer status display program is restarting because no monitor is connected:
Slowing down restarts to a crawl — you suggest once per hour after backoff, though hopefully the exact upper limit will be configurable — sounds like a good improvement to the status quo to me. |

Component
systemd
Is your feature request related to a problem? Please describe
I've recently opened a few issues, and PRs for various projects to address failure conditions that are all related to these services using a systemd
*.servicefile that they've configured by default withRestart=always, but have not included anyStartLimitIntervalSec,StartLimitBurst, orRestartSecdirectives to prevent the service from entering an infiniteStart request repeated too quicklyloop.I don't know if there's a commonly understood best-practice for creating service units that handle failures smartly. It just seems that some projects do, while others don't.
The problem in several of the situations I've experienced, and have tried to address, is that when this endless loop happens, it manifests in a way that doesn't leave the user with any obvious understanding as to what exactly is happening, such as just a blank unresponsive screen.
LukeShortCloud/winesapOS#716
canonical/lightdm#340
https://gitlab.archlinux.org/archlinux/packaging/packages/openssh/-/issues/3
Only to help illustrate what a user might experience, I asked ChatGPT to talk me through the execution steps of the gdm3 .service unit, in a scenario where xorg.conf contains a value that will cause x11 to immediately fail 100% of the time:
https://chat.openai.com/share/ca3c9786-53f2-476e-8c21-17b4465c4327
Describe the solution you'd like
While I don't think that it is systemd's responsibility that projects use systems units in the most optimal way, I feel like a wandering lunatic roaming from project to project suggesting changes, and having to justify why this was an annoying issue that I've had to work around (such as the case with sshd.service for literally any device running Arch on my network, for whatever reason), and so I wanted to check upstream with the common piece around all this.
At a minimum, perhaps it can be agreed upon that using
Restart=alwayswithout any supporting directives may possibly result in a negative outcome, as it assumes that restarting the service will magickally fix everything?If so, maybe a check, and a warning when the user executed
daemon-reload?I don't want to get ahead of myself, and there might be something I'm misunderstanding, so I want to first verify that what I understand is accurate, and that my concern is valid. If so, I hoped to open a conversation about this to see it there is any sort of consensus that can be agreed upon.
Describe alternatives you've considered
Have some reasonable built-in default values for
StartLimitIntervalSec, andStartLimitBurstthat are used ifRestart=alwaysis specified.Or maybe there ARE default values, and the presence of the
repeated too quicklymessages means that the service had exceeded those defaults, and systemd permanently put the service in a failed state that requires intervention to restart?I'm not really clear on the behavior, but logs seem to indicate to me that there are repeated occurrences of
repeated too quickly.The systemd version you checked that didn't have the feature you are asking for
255
The text was updated successfully, but these errors were encountered: