
{xrootd,cmsd}@.service: RestartSec=0 may lead to downtimes #1410

Closed

olifre opened this issue Feb 24, 2021 · 3 comments

olifre (Contributor) commented Feb 24, 2021

During the last XRootD package upgrade, the services were left down once the upgrade had finished, with journald having logged:

Feb 24 04:49:24 systemd[1]: Current command vanished from the unit file, execution of the command list won't be resumed.
Feb 24 04:49:24 systemd[1]: Stopping XRootD xrootd daemon instance grid...
Feb 24 04:49:24 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:24 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:24 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:24 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:24 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:25 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:25 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:25 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:25 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:26 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:26 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:26 systemd[1]: start request repeated too quickly for xrootd@grid.service
Feb 24 04:49:26 systemd[1]: Failed to start XRootD xrootd daemon instance grid.
Feb 24 04:49:26 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service failed.

This is expected behaviour with RestartSec=0 during an upgrade of all the library packages: for 1-2 seconds the library versions may mismatch, and xrootd (or potentially also cmsd) may segfault while trying to load them.
Since systemd then hits the "start request repeated too quickly" limit, the service remains in a stopped/crashed state afterwards and is not restarted automatically until it is restarted by hand. So depending on the scale of the update (e.g. to the new major version 5.1.0), the operator (or their configuration management) has to manually revive each affected service, roughly as sketched below.
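For completeness, the manual revival amounts to something like the following (a minimal sketch; the instance name grid is taken from the log above):

# Clear the recorded failed state, including the start-rate counter,
# so that the unit may be started again immediately:
systemctl reset-failed xrootd@grid.service
systemctl start xrootd@grid.service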

I wonder if RestartSec=5 (or something similar) would be more "resilient" during upgrades?
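For illustration, the proposed value could also be tried locally as a drop-in, without patching the packaged unit file (a sketch; the file name holdoff.conf is arbitrary, but the directory is systemd's standard override location):

# /etc/systemd/system/xrootd@.service.d/holdoff.conf
# Wait 5 seconds between automatic restarts, so that a short
# library-version mismatch during a package upgrade does not
# exhaust systemd's start limit (by default 5 starts within 10 s).
[Service]
RestartSec=5

A systemctl daemon-reload afterwards makes systemd pick the drop-in up.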

simonmichal (Contributor) commented

@olifre: Thanks for reporting it. What you propose sounds reasonable; let me play around with it.

olifre commented Mar 11, 2021

@simonmichal Thanks for playing with it :-).
Just to confirm: I upgraded from 5.0.3 to 5.1.1 just now, and this happened on only one machine (our redirector, running on a VM). Likely, high traffic and the higher I/O latency seen in virtual environments stretch out the upgrade window, so the crash triggers "reliably" there.

olifre commented Apr 20, 2021

Thanks! :-)
