
Warn/disallow Restart=always without preventing repeated too quickly loop #30804

Open
wallentx opened this issue Jan 6, 2024 · 4 comments
Labels: needs-reporter-feedback ❓ (there's an unanswered question, the reporter needs to answer), pid1, RFE 🎁 (Request for Enhancement, i.e. a feature request)

Comments

@wallentx

wallentx commented Jan 6, 2024

Component

systemd

Is your feature request related to a problem? Please describe

I've recently opened a few issues and PRs for various projects to address failure conditions that are all related to the same thing: these services ship a systemd *.service file configured by default with Restart=always, but without any StartLimitIntervalSec=, StartLimitBurst=, or RestartSec= directives to prevent the service from entering an infinite "Start request repeated too quickly" loop.
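To make this concrete, here is a minimal sketch of the kind of unit I keep running into (example.service and the ExecStart= path are made up; only the restart directives matter):

# example.service (hypothetical)
[Service]
ExecStart=/usr/bin/example-daemon
Restart=always
# No RestartSec=, StartLimitIntervalSec=, or StartLimitBurst= anywhere,
# so the manager's defaults decide how quickly restarts happen and when they stop.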

I don't know if there's a commonly understood best practice for creating service units that handle failures gracefully; it just seems that some projects do, while others don't.

The problem in several of the situations I've run into and tried to address is that when this endless loop happens, it manifests in a way that gives the user no obvious indication of what is actually going on, such as just a blank, unresponsive screen.

LukeShortCloud/winesapOS#716

canonical/lightdm#340

https://gitlab.archlinux.org/archlinux/packaging/packages/openssh/-/issues/3

Purely to help illustrate what a user might experience, I asked ChatGPT to talk me through the execution steps of the gdm3.service unit in a scenario where xorg.conf contains a value that causes X11 to fail immediately 100% of the time:
https://chat.openai.com/share/ca3c9786-53f2-476e-8c21-17b4465c4327

Describe the solution you'd like

While I don't think it's systemd's responsibility to make sure projects use service units in the most optimal way, I feel like a wandering lunatic roaming from project to project suggesting changes and having to justify why this is an annoying issue I've had to work around (as was the case with sshd.service on literally every device running Arch on my network, for whatever reason). So I wanted to check upstream with systemd, the common piece behind all of this.

At a minimum, perhaps it can be agreed that using Restart=always without any supporting directives may result in a negative outcome, since it assumes that restarting the service will magically fix everything?

If so, maybe add a check and print a warning when the user runs daemon-reload?

I don't want to get ahead of myself, and there might be something I'm misunderstanding, so I first want to verify that my understanding is accurate and that my concern is valid. If so, I'd like to open a conversation to see if there is any sort of consensus that can be reached.

Describe alternatives you've considered

Have some reasonable built-in default values for StartLimitIntervalSec= and StartLimitBurst= that are used if Restart=always is specified (see the sketch below for the kind of per-unit directives I mean).
Or maybe there ARE default values, and the presence of the "repeated too quickly" messages means that the service exceeded those defaults, and systemd permanently put it in a failed state that requires intervention to restart?
I'm not really clear on the behavior, but the logs seem to indicate repeated occurrences of "repeated too quickly".
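For illustration, this is roughly the per-unit shape I end up proposing downstream; the values here are just examples, not a recommendation:

[Unit]
StartLimitIntervalSec=60s
StartLimitBurst=4

[Service]
Restart=always
RestartSec=5s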

The systemd version you checked that didn't have the feature you are asking for

255

@wallentx wallentx added the RFE 🎁 Request for Enhancement, i.e. a feature request label Jan 6, 2024
@github-actions github-actions bot added the pid1 label Jan 6, 2024
@YHNdnzj
Member

YHNdnzj commented Jan 6, 2024

Or maybe there ARE default values, and the presence of the "repeated too quickly" messages means that the service exceeded those defaults, and systemd permanently put it in a failed state that requires intervention to restart?

This is already the case, i.e. if the start limit is exceeded, no further auto-restarts are attempted. So I'm not really sure what this is about?

@YHNdnzj YHNdnzj added the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Jan 6, 2024
@stapelberg
Contributor

stapelberg commented Jan 16, 2024

While the original report might not have been perfectly phrased, I think there is real substance to it, and I'd like to add my observations and a specific suggestion.

I published a blog post titled “systemd: enable indefinite service restarts” where I share the one important bit of systemd knowledge that I apply to every single server system I maintain: how to make services restart indefinitely, instead of having systemd give up on them:

https://michael.stapelberg.ch/posts/2024-01-17-systemd-indefinite-service-restarts/

To summarize: the problem is that systemd’s default restart limits are chosen such that systemd gives up on restarting crash-looping services:

DefaultRestartSec=100ms
DefaultStartLimitIntervalSec=10s
DefaultStartLimitBurst=5

I don’t know if this was an intentional decision, or if the limits were chosen arbitrarily.

Either way, to give a specific suggestion, how about we change the default configuration to:

[Manager]
DefaultRestartSec=5s
DefaultStartLimitIntervalSec=0

With these settings, services will be restarted indefinitely, with a 5-second delay in between. This doesn’t seem too aggressive, and should result in transient failures eventually resolving themselves without human intervention.
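For completeness, the same behavior can already be opted into per unit today, e.g. via a drop-in (foo.service is a placeholder name; the path below is just the usual override convention):

# /etc/systemd/system/foo.service.d/override.conf
[Unit]
StartLimitIntervalSec=0

[Service]
Restart=always
RestartSec=5s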

What do you think?

@martinpitt
Contributor

I find the default behaviour quite appropriate; at least I oppose defaulting to, or even suggesting, indefinite unthrottled restart behaviour with the suggested approach. IMO it's bad advice, especially the bit about enabling it for all services. If you apply such a big, blunt "retry harder" hammer, you should really know what you are doing.

If your service crashes 5 times in 10 seconds, it's just plain broken. Trying to restart it will hide it from systemctl --failed, obfuscate it in the logs, waste resources, possibly even cause data loss (think databases, or services that spew lots of files to disk and drive it to ENOSPC), and make it less apparent that it needs fixing.

A possible compromise might be exponential back-off with Restart=always|on-failure, so that e.g. services which crash on external dependencies (it's always DNS, right?) try to recover every hour or so after having failed a dozen times. Would that suit your use case better?
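A sketch of roughly that shape per unit, assuming the RestartSteps=/RestartMaxDelaySec= settings from recent releases (values purely illustrative):

[Unit]
StartLimitIntervalSec=0

[Service]
Restart=on-failure
RestartSec=1s
# Grow the delay from RestartSec= towards RestartMaxDelaySec= over this many steps.
RestartSteps=10
RestartMaxDelaySec=1h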

@stapelberg
Contributor

Hey Martin 👋 Good to read from you, it’s been a long time :) Hope you’re doing well.

IMO it's bad advice, especially the bit about enabling it for all services. If you apply such a big, blunt "retry harder" hammer, you should really know what you are doing.

Note that we’re only talking about changing the behavior for services that specify Restart=always|on-failure.

Trying to restart it will hide it from systemctl --failed,

Yeah, this is a good point that has also been pointed out in the discussion on Mastodon: Some people alert on failed systemd services, so if a service never fails, it won’t show up in their monitoring/alerting.

Personally, I prefer monitoring/alerting based on probing the application, not specific systemd units, but I agree that this is a difference in philosophy and that a crash-looping service should be discoverable as a problem in monitoring.

Maybe systemd could expose that the service is restarting, but still keep trying to start it?

In my Go appliance platform gokrazy, I’m exposing crashing services with a “restarting” status. For example, in this case, the frame buffer status display program is restarting because no monitor is connected:

[screenshot: gokrazy status page listing the frame buffer status display program as "restarting"]

A possible compromise might be exponential back-off with Restart=always|on-failure, so that e.g. services which crash on external dependencies (it's always DNS, right?) try to recover every hour or so after having failed a dozen times. Would that suit your use case better?

Slowing down restarts to a crawl — you suggest once per hour after backoff, though hopefully the exact upper limit will be configurable — sounds like a good improvement to the status quo to me.
