
{xrootd,cmsd}@.service: RestartSec=0 may lead to downtimes #1410

Closed

olifre opened this issue Feb 24, 2021 · 3 comments

olifre (Contributor) commented Feb 24, 2021

During the last XRootD package upgrade, the services were left down once the upgrade had finished, with journald having logged:

Feb 24 04:49:24 systemd[1]: Current command vanished from the unit file, execution of the command list won't be resumed.
Feb 24 04:49:24 systemd[1]: Stopping XRootD xrootd daemon instance grid...
Feb 24 04:49:24 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:24 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:24 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:24 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:24 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:25 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:25 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:25 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:25 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:26 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:26 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:26 systemd[1]: start request repeated too quickly for xrootd@grid.service
Feb 24 04:49:26 systemd[1]: Failed to start XRootD xrootd daemon instance grid.
Feb 24 04:49:26 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service failed.

This is expected behaviour with RestartSec=0 during an upgrade of all the library packages: for 1-2 seconds the library versions may mismatch, and xrootd (or potentially also cmsd) may segfault while trying to load them.
Since systemd then hits the "start request repeated too quickly" limit, the service remains in a stopped/crashed state afterwards and is not restarted automatically until it is restarted by hand. So depending on the scale of the update (e.g. to the new major version 5.1.0), the operator (or their configuration management) has to manually revive each affected service, roughly as sketched below.
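For completeness, the manual revival amounts to something like the following (a minimal sketch; the instance name grid is taken from the log above):

# Clear the recorded failed state, including the start-rate counter,
# so that the unit may be started again immediately:
systemctl reset-failed xrootd@grid.service
systemctl start xrootd@grid.service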

I wonder if RestartSec=5 (or something similar) would be more "resilient" during upgrades?
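For illustration, the proposed value could also be tried locally as a drop-in, without patching the packaged unit file (a sketch; the file name holdoff.conf is arbitrary, but the directory is systemd's standard override location):

# /etc/systemd/system/xrootd@.service.d/holdoff.conf
# Wait 5 seconds between automatic restarts, so that a short
# library-version mismatch during a package upgrade does not
# exhaust systemd's start limit (by default 5 starts within 10 s).
[Service]
RestartSec=5

A systemctl daemon-reload afterwards makes systemd pick the drop-in up.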

simonmichal (Contributor) commented

@olifre: Thanks for reporting it. What you propose sounds reasonable; let me play around with it.

olifre commented Mar 11, 2021

@simonmichal Thanks for playing with it :-).
Just to confirm: I upgraded from 5.0.3 to 5.1.1 just now, and this happened on only one machine (our redirector, running on a VM). Likely, high traffic and the higher I/O latency seen in virtual environments stretch out the upgrade window, so the crash triggers "reliably" there.

olifre commented Apr 20, 2021

Thanks! :-)
