After the last XRootD package upgrade, the services stayed down, and journald had logged:
Feb 24 04:49:24 systemd[1]: Current command vanished from the unit file, execution of the command list won't be resumed.
Feb 24 04:49:24 systemd[1]: Stopping XRootD xrootd daemon instance grid...
Feb 24 04:49:24 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:24 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:24 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:24 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:24 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:24 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:24 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:25 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:25 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:25 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:25 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:25 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:25 systemd[1]: Started XRootD xrootd daemon instance grid.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service: main process exited, code=killed, status=11/SEGV
Feb 24 04:49:26 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service failed.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service has no holdoff time, scheduling restart.
Feb 24 04:49:26 systemd[1]: Stopped XRootD xrootd daemon instance grid.
Feb 24 04:49:26 systemd[1]: start request repeated too quickly for xrootd@grid.service
Feb 24 04:49:26 systemd[1]: Failed to start XRootD xrootd daemon instance grid.
Feb 24 04:49:26 systemd[1]: Unit xrootd@grid.service entered failed state.
Feb 24 04:49:26 systemd[1]: xrootd@grid.service failed.
This is expected behaviour with RestartSec=0 while all the library packages are being upgraded: for 1-2 seconds the installed library versions may mismatch, and xrootd (or potentially also cmsd) may segfault trying to load them.
Once systemd reports "start request repeated too quickly", the service stays in a stopped / failed state and is no longer restarted automatically until someone restarts it manually. So depending on the scale of the update (e.g. to the new major version 5.1.0), the operator (or their configuration management) has to revive the service by hand.
I wonder if RestartSec=5 (or something similar) would be more "resilient" during upgrades?
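For illustration, here is a minimal sketch of what such a change could look like as a local drop-in, without touching the packaged unit file. The drop-in path and file name are my assumption, not something shipped by the XRootD packages. The reasoning: the log above shows five starts within about two seconds before systemd gives up, which matches systemd's default start limit of 5 starts per 10 seconds (assuming the defaults have not been changed on the host). With a 5 second delay between restarts, at most two or three starts fit into that window, so the limit should not be reached while the libraries are briefly inconsistent.

```ini
# Hypothetical drop-in: /etc/systemd/system/xrootd@.service.d/restart.conf
# Applies to all instances of the xrootd@ template unit (e.g. xrootd@grid).
[Service]
# Wait 5 seconds before each automatic restart instead of restarting immediately.
RestartSec=5
```

After creating the file, `systemctl daemon-reload` makes systemd pick it up. A similar drop-in would presumably be needed for cmsd if it is affected in the same way.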
@simonmichal Thanks for playing with it :-).
Just to confirm, I upgraded from 5.0.3 to 5.1.1 just now, and this happened only on one machine (our redirector running on a VM). Likely, high traffic / higher I/O latency (as seen in virtual environments) during an upgrade causes this to trigger "reliably".