Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting. #8035
Comments
I am seeing this same exact assert when resuming host system with a VirtualBox linux guest with systemd v237.
@sourcejedi I prepped a fix for this in #8285. Any chance you can verify whether that fixes things for you?
So if I read this correctly, the desired outcome is just not to assert. (Which is definitely the biggest problem). You're leaving the device object until it gets cleaned up by the udev remove event. Not worrying about any intermediate weirdness visible on dbus, for example.

Ah, interesting that session_device_start() was already using a safe_close() to handle this sort of thing. The fstat() thing above is kinda fun.

Test method: let's make sure my other (non-vdagent) input devices are unbound in sysfs, suspend for 4 minutes. Check I get the watchdog timeout, no asserts, and can still use the mouse cursor through the (replugged) spice vdagent tablet.
Test method did not succeed on first attempt. systemd-logind did not get restarted by the watchdog. Only systemd-networkd did. But they both have 3 minute watchdog timers. The suspend period was 21:06:46 - 21:16:44; the gap can be observed in both host and guest logs. When I wrote up this KVM bug, I was definitely getting multiple daemons restarted by watchdog timers. Can't think of a radically simpler way to simulate this. I really need to make sure the watchdog triggers. This is not what happened on previous occasions, but I was not attempting to reproduce it with a relatively short (10 min) period.
At least looking at the log the logind crash is relatively frequent. So hopefully I will notice relatively soon if it hasn't been fixed.
Amusingly, reducing the watchdog timer (and reloading systemd) causes the watchdog to trigger, as systemd expects the shorter heartbeat but the daemon has no idea. Seems like that should be fixed.
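A rough sketch of why that mismatch can happen, assuming the daemon learns its timeout from the WATCHDOG_USEC environment variable once at startup and pings at half that interval (the convention described for sd_watchdog_enabled()). The helper names below are made up for illustration and are not taken from logind:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a WATCHDOG_USEC-style value once, at startup.
 * Returns 0 when the value is missing or is not a plain decimal number. */
static uint64_t parse_watchdog_usec(const char *value) {
        char *end = NULL;
        unsigned long long v;

        if (!value || *value < '0' || *value > '9')
                return 0;

        errno = 0;
        v = strtoull(value, &end, 10);
        if (errno != 0 || *end != '\0')
                return 0;

        return (uint64_t) v;
}

/* Assumed convention: the daemon sends a keep-alive every half timeout. */
static uint64_t ping_interval_usec(uint64_t watchdog_usec) {
        return watchdog_usec / 2;
}

/* Returns 1 if a daemon still using its startup value would ping too
 * slowly for the timeout the manager enforces after a reload. */
static int heartbeat_is_late(uint64_t daemon_watchdog_usec,
                             uint64_t manager_watchdog_usec) {
        return ping_interval_usec(daemon_watchdog_usec) >= manager_watchdog_usec;
}
```

With a startup value of 3 minutes the daemon would ping every 90 seconds, so a manager that now enforces 1 minute sees a missed heartbeat even though the daemon is healthy. The environment of an already running process does not change on reload, which is presumably why the daemon "has no idea".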
And any logind restart seems to break the Wayland session, including the keyboard. So I probably can't test real function, only the lack of an assert. Btw no log messages got redirected to

Welp. My uneasiness about the fix's approach may have been justified. In the process of this, I got another assert for an fd, in a different location.
the full log messages for this logind instance were
The subsequent 5th restart of logind succeeds. The full logs look as if it were dropping a different fd-holding object on each restart until it had gotten rid of all of them :). The first restart was not immediately after the watchdog kill. It might have been triggered by a (somewhat forced?) VT switch. As you might guess, VT2 is where my GUI session was logged in, and I believe it was my VT before I first tried to switch. But I'm not sure if it was the first VT switch which triggered the assert, or a later one.
So we can conclude I managed to get logind to keep an "uninitialized" fd field around for six minutes. If there's a way to get rid of such objects at the end of startup, that would be great. I.e. before we start processing events from the scary real world :).

I notice this is a DRM fd; the original report was for an input device fd. So it's not the exact same case. Maybe it's a second root cause that could be identified, and then it would be ok not to worry about having intermediate states involving uninitialized fd fields. I haven't gathered enough information to disprove that :).
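To illustrate the "get rid of such objects at the end of startup" idea, here is a minimal sketch. It uses a made-up `struct device_record` rather than logind's real session-device type, and it assumes that an fd of -1 marks a record that never got an fd re-attached after the restart:

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-device record restored from saved state. */
struct device_record {
        int fd;      /* -1: no fd was re-attached after the restart */
        int active;  /* whether the device was active when state was saved */
};

/* Compact the array in place, keeping only records that hold a usable fd.
 * Returns the number of records kept. Meant to run once, after all saved
 * state and passed-in fds have been loaded and before any event is handled. */
static size_t prune_unattached(struct device_record *recs, size_t n) {
        size_t kept = 0;

        for (size_t i = 0; i < n; i++) {
                if (recs[i].fd < 0)
                        continue;
                recs[kept++] = recs[i];
        }

        return kept;
}
```

Whether dropping the record is the right policy (as opposed to trying to re-open the device) is a separate question; the sketch only shows the single pass that would leave later code paths without half-initialized objects to trip over.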
$ git grep FDNAME
logind-session-device.c: ... "FDNAME=session-", sd->session->id);
logind-session-device.c: ... "FDNAME=session", sd->session->id);

Oops.

Fixes systemd#8343. Or at least a more minimal reproducer. Xorg still dies when logind is restarted, but the Xorg message says this is entirely deliberate.

(This must also have been the reason I hit systemd#8035 / systemd#8291, though I did think at least the first one could also be hit in a race condition).
$ git grep FDNAME
logind-session-device.c: ... "FDNAME=session-", sd->session->id);
logind-session-device.c: ... "FDNAME=session", sd->session->id);

Oops.

Fixes systemd#8343. Or at least a more minimal reproducer. Xorg still dies when logind is restarted, but the Xorg message says this is entirely deliberate.

(This could also be the reason I hit systemd#8035, instead of the race condition I originally suggested).
Fixes: systemd#8035 (cherry picked from commit 4d219f5) [fbui: fixes bsc#1123727]
$ git grep FDNAME
logind-session-device.c: ... "FDNAME=session-", sd->session->id);
logind-session-device.c: ... "FDNAME=session", sd->session->id);

Oops.

Fixes systemd#8343. Or at least a more minimal reproducer. Xorg still dies when logind is restarted, but the Xorg message says this is entirely deliberate.

(This could also be the reason I hit systemd#8035, instead of the race condition I originally suggested).

(cherry picked from commit b5cdfa4)
[fbui: follow-up of 4050e47]
Submission type
systemd version the issue has been seen with
I think v237, but maybe a few commits earlier.
Used distribution
Fedora 27
In case of bug report: Unexpected behaviour you saw
Jan 28 10:38:19 fedora27-vm systemd[1]: systemd-logind.service: Watchdog timeout (limit 3min)!
Jan 28 10:38:19 fedora27-vm systemd[1]: systemd-logind.service: Killing process 648 (systemd-logind) with signal SIGABRT.
Jan 28 10:38:22 fedora27-vm systemd[1]: systemd-logind.service: Main process exited, code=dumped, status=6/ABRT
Jan 28 10:38:22 fedora27-vm systemd[1]: systemd-logind.service: Failed with result 'watchdog'.
Jan 28 10:38:22 fedora27-vm systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
Jan 28 10:38:22 fedora27-vm systemd[1]: systemd-logind.service: Scheduled restart job, restart counter is at 1.
This happens due to suspending the host of a VM. It's a bug in KVM; effectively it needs to be fixed to use CLOCK_MONOTONIC and not CLOCK_BOOTTIME. But then very quickly systemd-logind restarts and then fails an assertion:
Jan 28 10:38:22 fedora27-vm systemd[1]: Starting Login Service...
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: New seat seat0.
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: Watching system buttons on /dev/input/event0 (Power Button)
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: Watching system buttons on /dev/input/event1 (AT Translated Set 2 keyboard)
Jan 28 10:38:22 fedora27-vm systemd[1]: Started Login Service.
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: New session 3 of user alan-sysop.
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: New session c1 of user gdm.
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting.
The backtrace for the assert above is
It looks like some other VM quirk, which also happens around the suspend.
The general point is, no combination of udev events (or even timejump shenanigans) should be able to cause logind to assert.
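For readers unfamiliar with the two clocks mentioned earlier: on Linux, CLOCK_BOOTTIME keeps counting while the system is suspended and CLOCK_MONOTONIC does not, so a timer driven by the former can appear to jump by the whole suspend period. The sketch below only measures that gap on the machine it runs on. It does not reproduce the KVM behaviour, and the function names are invented for the example:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

/* Read a clock as nanoseconds; returns -1 if the clock cannot be read. */
static int64_t clock_ns(clockid_t id) {
        struct timespec ts;

        if (clock_gettime(id, &ts) != 0)
                return -1;

        return (int64_t) ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Approximate time spent suspended since boot, in nanoseconds.
 * CLOCK_MONOTONIC is sampled first, so the result also includes the few
 * nanoseconds between the two reads. Returns -1 on failure. */
static int64_t suspended_ns(void) {
        int64_t mono = clock_ns(CLOCK_MONOTONIC);
        int64_t boot = clock_ns(CLOCK_BOOTTIME);

        if (mono < 0 || boot < 0)
                return -1;

        return boot - mono;
}
```

On a machine that has never been suspended the result should be close to zero; after a ten minute suspend it should be roughly ten minutes larger, which is the size of jump a 3 minute watchdog would trip over.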
My code analysis points to
which is followed by
but in this case we would skip the call to session_device_attach_fd(), and then this assert() will end up catching us, right?
I also notice, there's a comment that says systemd just forgets device fds if it notices they are ever revoked. So I'm suspicious that we don't handle that case correctly either (i.e. fds being revoked for some reason, while we were being restarted). It doesn't seem very robust for logind to end up failing an assert() in this case either. If nothing else, it falls under the aegis of trying not to blow up when we get slightly weird data loaded on restart, as it could be due to an online package update.
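The difference between an asserting close and a tolerant one, which is what the question above about skipping session_device_attach_fd() comes down to, can be sketched as follows. This is modelled loosely on the behaviour of the two helpers named in this thread (close_nointr() asserts, safe_close() is used to cope with unset fds); the names `strict_close` and `tolerant_close` are made up, and this is not the systemd source:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Strict variant: a negative fd is treated as a programming error and
 * aborts the process, like the assertion quoted in the log above. */
static int strict_close(int fd) {
        assert(fd >= 0);

        if (close(fd) >= 0)
                return 0;

        /* The state of the fd after EINTR is unspecified, so do not retry. */
        if (errno == EINTR)
                return 0;

        return -errno;
}

/* Tolerant variant: a negative fd is a no-op. Always returns -1 so the
 * caller can write `fd = tolerant_close(fd);` and leave the field marked
 * as unset. errno is preserved across the call. */
static int tolerant_close(int fd) {
        if (fd >= 0) {
                int saved_errno = errno;

                (void) strict_close(fd);
                errno = saved_errno;
        }

        return -1;
}
```

If a device object can legitimately exist with its fd field still at -1 (because no fd was handed back after the restart), then any cleanup path that uses the strict variant will abort, while the tolerant one simply leaves the field unset.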