example rabbitmq-server.service.example systemd service unit should automatically restart rmq #1359

Closed
rgl opened this Issue Sep 12, 2017 · 13 comments


rgl commented Sep 12, 2017

The rabbitmq-server.service.example is not configured to automatically restart when there is an error. Is there a reason not to?

i.e. it should contain the line Restart=always or Restart=on-failure and maybe RestartSec=10.

For reference see the Restart= documentation.

Owner

lukebakken commented Sep 12, 2017

Is there a reason not to?

I suspect it is because there are (rare) failure situations where auto-restart is not a good solution, or that an auto-restart may clobber data that could be used to diagnose the failure.

@michaelklishin probably has more historical information about this.

Owner

michaelklishin commented Sep 12, 2017

@rgl there is no big idea behind not having that line. I recall a similar discussion and a similar question about the Windows service. Feel free to submit a PR that adds Restart=on-failure against the stable branch. I think we should also add a StartLimitIntervalSec value, say, 5 seconds?

Owner

lukebakken commented Sep 12, 2017

We may want to use StartLimitIntervalSec= and StartLimitBurst= to prevent too many restarts.

There is some discussion here.
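
A minimal sketch of how the settings mentioned above could be combined in a systemd drop-in, so that the unit restarts on failure but gives up after too many attempts. The function name `write_restart_dropin`, the file name `restart.conf`, and the limit values (5 starts per 300 seconds) are assumptions for illustration, not values agreed in this thread. Older systemd releases spell the interval option differently, so check `man systemd.unit` on the target host.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that enables restart-on-failure with a
# bounded restart rate. The function name, file name, and limit values are
# illustrative assumptions, not RabbitMQ defaults.
write_restart_dropin() {
    dropin_dir="$1"   # e.g. /etc/systemd/system/rabbitmq-server.service.d
    mkdir -p "$dropin_dir" || return 1
    cat >"$dropin_dir/restart.conf" <<'EOF'
[Unit]
# Stop trying after 5 start attempts within 300 seconds (illustrative values).
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
# Restart only after an unclean exit, waiting 10 seconds between attempts.
Restart=on-failure
RestartSec=10
EOF
}

# Usage (as root), then reload systemd so the drop-in takes effect:
#   write_restart_dropin /etc/systemd/system/rabbitmq-server.service.d
#   systemctl daemon-reload
```

If the limit is ever hit, `systemctl reset-failed rabbitmq-server` clears the counter so the unit can be started manually again.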

Owner

michaelklishin commented Sep 13, 2017

@rgl would you like to submit a PR or should our team handle this?

rgl commented Sep 17, 2017

Ah, those are good points! Now I'm in a catch-22 "to restart, or not to restart, that is the question" moment!

But I'm leaning towards restarting indefinitely... with Restart=always, StartLimitIntervalSec=0 (to disable rate limiting and keep restarting forever) and RestartSec=10 (wait 10 seconds before each restart, so that it does not burn CPU or flood the logs if it really does keep restarting forever). Something like:

cat >/etc/systemd/system/test.service <<'EOF'
[Unit]
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
ExecStart=/bin/bash -c "date '+%F %T.%%N';exit 1;"
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable test
systemctl start test
journalctl --follow -u test
systemctl disable test

But I'm not really sure what the default should be for RabbitMQ :-/

Owner

michaelklishin commented Sep 17, 2017

Restarting forever may be OK but not rate limiting (in particular, not limiting concurrency of possible restart attempts) sounds like a very bad idea to me. I 👎 setting StartLimitIntervalSec to 0. A single restart attempt every 10 seconds seems reasonable to me.

rgl commented Sep 18, 2017

But is there another way to keep restarting forever without disabling rate limiting with StartLimitIntervalSec=0?

Owner

michaelklishin commented Sep 18, 2017

Why do we need to restart forever? Is it really such a good idea? I don't know.

I'm no systemd expert but my reading of the docs is that they have the same "restart intensity" settings as in Erlang: a time interval and how many restart attempts ("bursts") in that time frame are considered to be reasonable.

The only caveat is

they apply to all kinds of starts (including manual)

and

Note that systemctl reset-failed will cause the restart rate counter for a service to be flushed,
which is useful if the administrator wants to manually start a unit and the start limit interferes with that

So it sounds like this will effectively restart forever unless things are so broken that it restarts more than N times in T seconds.

In which case maybe "1 restart a second", or one every two seconds, should be the limit. Because…

Note that this rate-limiting is enforced after any unit condition checks are executed,
and hence unit activations with failing conditions are not counted by this rate limiting
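
The "N times in T seconds" reasoning above can be checked with a little arithmetic. This is a rough sketch only: it assumes the service fails immediately after each start, that all values are whole seconds with RestartSec of at least 1, and it ignores exact boundary behaviour inside systemd. The helper names `starts_per_window` and `limit_trips` are made up for this example. The 10 second / 5 start figures used below are the upstream systemd defaults as I understand them; a distribution may ship different ones.

```shell
#!/bin/sh
# starts_per_window RESTART_SEC INTERVAL_SEC
# Approximate upper bound on start attempts inside one rate-limit window,
# assuming the service fails immediately after each start.
starts_per_window() {
    echo $(( $2 / $1 + 1 ))
}

# limit_trips RESTART_SEC INTERVAL_SEC BURST
# Succeeds (exit 0) if that bound exceeds the burst, i.e. systemd would
# eventually refuse further start attempts.
limit_trips() {
    [ "$(starts_per_window "$1" "$2")" -gt "$3" ]
}

# RestartSec=10 with a 10 second interval and a burst of 5:
if limit_trips 10 10 5; then echo "gives up"; else echo "restarts forever"; fi

# RestartSec=1 with the same limits:
if limit_trips 1 10 5; then echo "gives up"; else echo "restarts forever"; fi
```

Under these assumptions a 10 second restart delay never accumulates enough starts inside a 10 second window to reach the limit, which matches the observation that the unit would effectively restart forever unless the interval or burst is tuned.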

rgl commented Sep 18, 2017

Oh, I now realize that I failed to mention the reason why I initially created this issue... I had a disk outage, freed up the disk, and much later noticed rmq was stopped, and scratched my head over why rmq was never (re)started by systemd.

It turns out that rmq stopped due to the disk outage and was never restarted due to the systemd unit configuration.

This made me realize that I needed to have alerting in place and perhaps change the systemd unit to keep restarting rmq forever. Hence this issue was created.

In my particular case, having systemd restart (forever) rmq would have helped.
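
For the alerting side mentioned above, a minimal sketch of a check that a cron job or monitoring agent could run. The unit name, the `SYSTEMCTL` override variable, and the `unit_is_down` helper are assumptions for illustration. How the alert is actually delivered is left out.

```shell
#!/bin/sh
# Sketch: detect that a unit is not running so that something can alert.
# UNIT and SYSTEMCTL are overridable; the defaults are assumptions.
UNIT="${UNIT:-rabbitmq-server.service}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# unit_is_down UNIT_NAME
# Succeeds (exit 0) when the unit is NOT active.
# `systemctl is-active --quiet` exits 0 only for an active unit.
unit_is_down() {
    ! "$SYSTEMCTL" is-active --quiet "$1"
}

# Usage, e.g. from cron:
#   if unit_is_down "$UNIT"; then
#       echo "$UNIT is not running on $(hostname)" >&2
#   fi
```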

Owner

michaelklishin commented Sep 18, 2017

RabbitMQ does not intentionally stop when the disk is full: enough I/O operations fail and cause certain critically important parts to shut down. Restarting forever is a great idea on the surface, not so much in practice. We highly recommend replacing nodes that ran out of disk space.

Owner

michaelklishin commented Sep 18, 2017

With a full disk, restarts would have failed as well, possibly forever. Picking a cut-off frequency is pretty challenging for the general case, but according to the docs quoted above, one restart per second should be reasonable.

Owner

lukebakken commented Sep 18, 2017

At this time, I don't see a way in the systemd documentation to limit the total number of restart attempts. Additionally, I don't like the idea of auto-restarting RabbitMQ as that is likely to hide an issue that an operator may want to investigate.

lukebakken added a commit that referenced this issue Sep 19, 2017

Add optional Restart and RestartSec configuration
Fixes #1359

This gives guidance for those users who wish to automatically restart RabbitMQ in the event of a failure. Tested by using the `Restart=on-failure` setting, then running `rabbitmqctl eval "erlang:halt(abort)."`

michaelklishin added this to the 3.6.13 milestone Sep 19, 2017

Owner

michaelklishin commented Sep 19, 2017

@rgl we updated the example and will make Restart=on-failure the default (with a 10 second restart delay). Thank you for your feedback along the way!

lukebakken added a commit to rabbitmq/rabbitmq-server-release that referenced this issue Sep 19, 2017

Add optional Restart and RestartSec configuration
See rabbitmq/rabbitmq-server#1359

This gives guidance for those users who wish to automatically restart RabbitMQ in the event of a failure. Tested by using the `Restart=on-failure` setting, then running `rabbitmqctl eval "erlang:halt(abort)."`

lukebakken referenced this issue in rabbitmq/rabbitmq-server-release Sep 19, 2017

Merged

Add optional Restart and RestartSec configuration #49
