Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting. #8035

Closed
sourcejedi opened this issue Jan 29, 2018 · 7 comments
Labels
bug 🐛 Programming errors, that need preferential fixing login
Milestone

Comments

@sourcejedi
Copy link
Contributor

sourcejedi commented Jan 29, 2018

Submission type

  • Bug report

systemd version the issue has been seen with

I think v237, but maybe a few commits earlier.

Used distribution

Fedora 27

In case of bug report: Unexpected behaviour you saw

Jan 28 10:38:19 fedora27-vm systemd[1]: systemd-logind.service: Watchdog timeout (limit 3min)!
Jan 28 10:38:19 fedora27-vm systemd[1]: systemd-logind.service: Killing process 648 (systemd-logind) with signal SIGABRT.
Jan 28 10:38:22 fedora27-vm systemd[1]: systemd-logind.service: Main process exited, code=dumped, status=6/ABRT
Jan 28 10:38:22 fedora27-vm systemd[1]: systemd-logind.service: Failed with result 'watchdog'.
Jan 28 10:38:22 fedora27-vm systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
Jan 28 10:38:22 fedora27-vm systemd[1]: systemd-logind.service: Scheduled restart job, restart counter is at 1.

this happens due to suspending the host of a VM, it's a bug in KVM, effectively it needs to be fixed to used CLOCK_MONOTONIC and not CLOCK_BOOTTIME. but then very quickly systemd-logins restarts and then fails an assertion

Jan 28 10:38:22 fedora27-vm systemd[1]: Starting Login Service...
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: New seat seat0.
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: Watching system buttons on /dev/input/event0 (Power Button)
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: Watching system buttons on /dev/input/event1 (AT Translated Set 2 keyboard)
Jan 28 10:38:22 fedora27-vm systemd[1]: Started Login Service.
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: New session 3 of user alan-sysop.
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: New session c1 of user gdm.
Jan 28 10:38:22 fedora27-vm systemd-logind[2752]: Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting.

The backtrace for the assert above is

(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f9cda892381 in __GI_abort () at abort.c:79
#2  0x00007f9cdb147c78 in log_object_internalv (level=2, error=43, file=0x7f9cdb244857 "", line=32668, 
    func=0x7f9cdb2449b8 <__PRETTY_FUNCTION__.9557+24> "me", object_field=0x7f9cdb247318 <__PRETTY_FUNCTION__.8273+8> "t_iovec", 
    object=0x7ffc97991660 "\001", extra_field=0x0, extra=0x0, format=0x0, ap=0x0) at ../src/basic/log.c:735
#3  0x00007f9cdb12fdd9 in extract_many_words (p=0x0, separators=0x0, flags=0) at ../src/basic/extract-word.c:258
#4  0x0000556809f28dc6 in session_device_free (sd=0x55680a789b70) at ../src/login/logind-session-device.c:426
#5  0x0000556809f1eaa6 in device_detach (d=0x55680a78bab0) at ../src/login/logind-device.c:63
#6  0x0000556809f1ec53 in device_free (d=0x55680a78bab0) at ../src/login/logind-device.c:78
#7  0x0000556809f1d4c7 in manager_process_seat_device (m=0x55680a76b3c0, d=0x55680a77da90) at ../src/login/logind-core.c:204
#8  0x0000556809f1981b in manager_dispatch_device_udev (s=0x55680a76d820, fd=14, revents=1, userdata=0x55680a76b3c0) at ../src/login/logind.c:587
#9  0x00007f9cdb1df87b in process_signal (e=0x55680a76b3c0, d=0x0, events=21864) at ../src/libsystemd/sd-event/sd-event.c:2197
#10 0x00007f9cdb1e0c90 in sd_event_wait (e=0x6e139687ba3bde00, timeout=140722851877680) at ../src/libsystemd/sd-event/sd-event.c:2572
#11 0x00007f9cdb1e1122 in sd_event_dispatch (e=0x55680a76c5c0) at ../src/libsystemd/sd-event/sd-event.c:2652
#12 0x0000556809f1c7a9 in manager_run (m=0x55680a76b3c0) at ../src/login/logind.c:1206
#13 0x0000556809f1ca5d in main (argc=1, argv=0x7ffc97991668) at ../src/login/logind.c:1264

it looks like some other VM quirk, which also happens around the suspend

the general point is, no combination of udev events (or even timejump shenanigans) should be able to cause logind to assert.

(gdb) down
#7  0x0000556809f1d4c7 in manager_process_seat_device (m=0x55680a76b3c0, d=0x55680a77da90) at ../src/login/logind-core.c:204
204	                device_free(device);
(gdb) p *device
$6 = {manager = 0x55680a76b3c0, sysfs = 0x55680a78c350 "/sys/devices/virtual/input/input6", seat = 0x55680a778460, master = false, timestamp = {
    realtime = 1517135902602730, monotonic = 37160749639}, devices_next = 0x55680a78c3c0, devices_prev = 0x55680a78a470, 
  session_devices = 0x55680a789b70}
(gdb) p *device->seat
$7 = {manager = 0x55680a76b3c0, id = 0x55680a7784e3 "seat0", state_file = 0x55680a7784d0 "/run/systemd/seats/seat0", devices = 0x55680a77ad70, 
  active = 0x55680a77b4a0, pending_switch = 0x0, sessions = 0x55680a77b4a0, positions = 0x55680a774d30, position_count = 8, in_gc_queue = true, 
  started = true, gc_queue_next = 0x0, gc_queue_prev = 0x0}
$ journalctl -b | grep input6
Jan 28 10:38:22 fedora27-vm kernel: input: spice vdagent tablet as /devices/virtual/input/input6

My code analysis points to

session_load_devices():

            /* The file descriptors for loaded devices will be reattached later. */
            k = session_device_new(s, dev, false, &sd);
            if (k < 0)
                    r = k;

which is followed by

manager_attach_fds():

            if (fstat(fd, &st) < 0) {
                    /* The device is allowed to go away at a random point, in which
                     * case fstat failing is expected. */
                    log_debug_errno(errno, "Failed to stat device fd for session %s: %m", id);
                    close_nointr(fd);
                    continue;
            }

but in this case we would skip the call to session_device_attach_fd(), and then this assert() will end up catching us, right?

I also notice, there's a comment that says systemd just forgets device fds if it notices they are ever revoked. So I'm suspicious that we don't handle that case correctly either (i.e. fds being revoked for some reason, while we were being restarted). It doesn't seem very robust for logind to end up failing an assert() in this case either. If nothing else, it falls under the aegis of trying not to blow up when we get slightly weird data loaded on restart, as it could be due to an online package update.

@poettering poettering added the bug 🐛 Programming errors, that need preferential fixing label Feb 1, 2018
@poettering poettering added this to the v238 milestone Feb 1, 2018
@jubos
Copy link

jubos commented Feb 8, 2018

I am seeing this same exact assert when resuming host system with a VirtualBox linux guest with systemd v237.

@poettering
Copy link
Member

@sourcejedi I prepped a fix for this in #8285. Any chance you can verify whether that fixes things for you?

@sourcejedi
Copy link
Contributor Author

So if I read this correctly, the desired outcome is just not to assert. (Which is definitely the biggest problem). You're leaving the device object until it gets cleaned up by the udev remove event. Not worrying about any intermediate weirdness visible on dbus, for example.

Ah, interesting that session_device_start() was already using a safe_close() to handle this sort of thing.

The fstat() thing above is kinda fun. stat() without an f shouldn't act that way - it should always succeed and tell you about the device node minor/major. If fstat() works on O_PATH fds it should do the same. I'd expect if fstat() succeeds you're guaranteed to get the minor/major of the device node you opened. I find it really interesting we anticipate fstat() failing when the device has gone away.

Test method: let's make sure my other (non-vdagent) input devices are unbound in sysfs, suspend for 4 minutes. Check I get the watchdog timeout, no asserts, and can still use the mouse cursor through the (replugged) spice vdagent tablet.

@sourcejedi
Copy link
Contributor Author

sourcejedi commented Feb 26, 2018

Test method did not succeed on first attempt. systemd-logind did not get restarted by the watchdog. Only systemd-networkd did. But they both have 3 minute watchdog timers. The suspend period was 21:06:46 - 21:16:44; the gap can be observed in both host and guest logs.

When I wrote up this KVM bug, I was definitely getting multiple daemons restarted by watchdog timers.

Can't think of a radically simpler way to simulate this. I really need to make sure the watchdog triggers.

This is not what happened on previous occasions, but I was not attempting to reproduce it with a relatively short (10 min) period. Maybe there is a bug in the watchdog timers? Edit: possibly there's just a race between systemd deciding to kill the daemon, and the daemon sending the heartbeat. This seems correct behaviour, it's just frustrating for testing because it's non-deterministic.

Feb 20 17:53:40 fedora27-vm systemd[1]: systemd-timesyncd.service: Watchdog timeout (limit 3min)!
Feb 20 17:53:40 fedora27-vm systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
Feb 20 17:53:40 fedora27-vm systemd[1]: systemd-networkd.service: Watchdog timeout (limit 3min)!
Feb 20 18:16:41 fedora27-vm systemd[1]: systemd-logind.service: Watchdog timeout (limit 3min)!
Feb 20 18:16:41 fedora27-vm systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
Feb 20 18:16:41 fedora27-vm systemd[1]: systemd-timesyncd.service: Watchdog timeout (limit 3min)!
Feb 26 21:16:44 fedora27-vm systemd[1]: systemd-networkd.service: Watchdog timeout (limit 3min)!

At least looking at the log the logind crash is relatively frequent. So hopefully I will notice relatively soon if it hasn't been fixed.

Jan 31 11:32:38 fedora27-vm systemd-logind[3422]: Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting.
Feb 10 22:22:32 fedora27-vm systemd-logind[5629]: Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting.
Feb 11 21:13:58 fedora27-vm systemd-logind[1490]: Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting.
Feb 14 15:16:19 fedora27-vm systemd-logind[3562]: Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting.
Feb 14 18:28:07 fedora27-vm systemd-logind[4423]: Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting.
Feb 14 18:28:07 fedora27-vm systemd-logind[4465]: Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting.
Feb 14 18:28:07 fedora27-vm systemd-logind[4484]: Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting.
Feb 18 11:12:43 fedora27-vm systemd-logind[3559]: Assertion 'fd >= 0' failed at ../src/basic/fd-util.c:43, function close_nointr(). Aborting.

@sourcejedi
Copy link
Contributor Author

Amusingly, reducing the watchdog timer (and reloading systemd) causes the watchdog to trigger, as systemd expects the shorter heartbeat but the daemon has no idea. Seems like that should be fixed.

@sourcejedi
Copy link
Contributor Author

sourcejedi commented Feb 26, 2018

And any logind restart seems to break the wayland session, including the keyboard. So I probably can't test real function, only the lack of an assert.

Btw no log messages got redirected to dmesg during the suspend test above, and the same pid of logind was still running, so I don't think I was missing anything in the suspend test by relying on the journal...

Welp. My uneasyness about the fix' approach may have been justified. In the process of this, I got another assert for an fd, in a different location.

#0  0x00007f55146da66b in raise () from /lib64/libc.so.6
#1  0x00007f55146dc381 in abort () from /lib64/libc.so.6
#2  0x00007f5514f94b15 in log_assert_failed_realm (realm=LOG_REALM_SYSTEMD, text=0x560449c29580 "fd >= 0", 
    file=0x560449c29500 "../src/login/logind-session-device.c", line=126, func=0x560449c296d0 <__PRETTY_FUNCTION__.11557> "sd_drmsetmaster")
    at ../src/basic/log.c:833
#3  0x0000560449c0e7c4 in sd_drmsetmaster (fd=-1) at ../src/login/logind-session-device.c:126
#4  0x0000560449c0ea75 in session_device_start (sd=0x56044ac64c40) at ../src/login/logind-session-device.c:200
#5  0x0000560449c0f51f in session_device_resume_all (s=0x56044ac75f20) at ../src/login/logind-session-device.c:461
#6  0x0000560449c08386 in seat_set_active (s=0x56044ac617b0, session=0x56044ac75f20) at ../src/login/logind-seat.c:257
#7  0x0000560449c0886e in seat_active_vt_changed (s=0x56044ac617b0, vtnr=2) at ../src/login/logind-seat.c:368
#8  0x0000560449c08c8a in seat_read_active_vt (s=0x56044ac617b0) at ../src/login/logind-seat.c:411
#9  0x0000560449bffdd2 in manager_dispatch_console (s=0x56044ac56080, fd=11, revents=8, userdata=0x56044ac546c0) at ../src/login/logind.c:640
#10 0x00007f55150468f0 in source_dispatch (s=0x56044ac56080) at ../src/libsystemd/sd-event/sd-event.c:2307
#11 0x00007f5515047d05 in sd_event_dispatch (e=0x56044ac558d0) at ../src/libsystemd/sd-event/sd-event.c:2667
#12 0x00007f5515048197 in sd_event_run (e=0x56044ac558d0, timeout=18446744073709551615) at ../src/libsystemd/sd-event/sd-event.c:2727
#13 0x0000560449c02a26 in manager_run (m=0x56044ac546c0) at ../src/login/logind.c:1207
#14 0x0000560449c02cda in main (argc=1, argv=0x7fffed13b418) at ../src/login/logind.c:1265

the full log messages for this logind instance were

Feb 26 21:49:11 fedora27-vm systemd[1]: Starting Login Service...
Feb 26 21:49:11 fedora27-vm systemd-logind[3973]: New seat seat0.
Feb 26 21:49:11 fedora27-vm systemd[1]: Started Login Service.
Feb 26 21:49:11 fedora27-vm systemd-logind[3973]: Loading session devices for session c1 failed: No such device
Feb 26 21:49:11 fedora27-vm systemd-logind[3973]: Watching system buttons on /dev/input/event0 (Power Button)
Feb 26 21:49:11 fedora27-vm systemd-logind[3973]: Watching system buttons on /dev/input/event1 (AT Translated Set 2 keyboard)
Feb 26 21:49:11 fedora27-vm systemd-logind[3973]: New session 4 of user alan-sysop.
Feb 26 21:49:11 fedora27-vm systemd-logind[3973]: New session c1 of user gdm.
Feb 26 21:49:11 fedora27-vm systemd-logind[3973]: New session 5 of user alan-sysop.
Feb 26 21:49:11 fedora27-vm systemd-logind[3973]: New session 3 of user alan-sysop.
Feb 26 21:49:36 fedora27-vm systemd-logind[3973]: Assertion 'fd >= 0' failed at ../src/login/logind-session-device.c:126, function sd_drmsetmaster(). 
Feb 26 21:49:38 fedora27-vm systemd[1]: systemd-logind.service: Main process exited, code=dumped, status=6/ABRT
Feb 26 21:49:38 fedora27-vm systemd[1]: systemd-logind.service: Failed with result 'core-dump'.
Feb 26 21:49:38 fedora27-vm systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
Feb 26 21:49:38 fedora27-vm systemd[1]: systemd-logind.service: Scheduled restart job, restart counter is at 4.

(the subsequent 5th restart of logind succeeds. The full logs look as if it were dropping a different fd-holding-object on each restart until it had gotten rid of all of them :).

The first restart was not immediate after the watchdog kill. Might have been triggered by a (somewhat forced?) vt switch. As you might guess, VT2 is where my GUI session was logged in and I believe was my VT before I first tried to switch it. But I'm not sure if it was the first VT switch which triggered the assert, or a later one.

Feb 26 21:42:13 fedora27-vm systemd-logind[3602]: New session 5 of user alan-sysop.
Feb 26 21:48:53 fedora27-vm systemd-logind[3602]: Assertion 'fd >= 0' failed at ../src/login/logind-session-device.c:126, function sd_drmsetmaster(). 

@sourcejedi
Copy link
Contributor Author

So we can conclude I managed to get logind to keep an "uninitialized" fd field around for six minutes. If there's a way to get rid of such objects at the end of startup, that would be great. I.e. before we start processing events from the scary real world :).

I notice this is a DRM fd, the original report was for an input device fd. So it's not the exact same case. Maybe it's a second root cause that could be identified, and then it would be ok not to worry about having intermediate states involving with uninitialized fd fields, I haven't gathered enough information to disprove that :).

sourcejedi added a commit to sourcejedi/systemd that referenced this issue Mar 3, 2018
$ git grep FDNAME
logind-session-device.c: ... "FDNAME=session-", sd->session->id);
logind-session-device.c: ... "FDNAME=session", sd->session->id);

Oops.

Fixes systemd#8343.  Or at least a more minimal reproducer.  Xorg still
dies when logind is restarted, but the Xorg message says this
is entirely deliberate.

(This must also have been the reason I hit systemd#8035 / systemd#8291,
though I did think at least the first one could also be
hit in a race condition).
sourcejedi added a commit to sourcejedi/systemd that referenced this issue Mar 3, 2018
$ git grep FDNAME
logind-session-device.c: ... "FDNAME=session-", sd->session->id);
logind-session-device.c: ... "FDNAME=session", sd->session->id);

Oops.

Fixes systemd#8343.  Or at least a more minimal reproducer.  Xorg still
dies when logind is restarted, but the Xorg message says this
is entirely deliberate.

(This could also be the reason I hit systemd#8035, instead of the race
condition I originally suggested).
Werkov pushed a commit to Werkov/systemd that referenced this issue Jun 3, 2020
Fixes: systemd#8035
(cherry picked from commit 4d219f5)

[fbui: fixes bsc#1123727]
Werkov pushed a commit to Werkov/systemd that referenced this issue Jun 3, 2020
$ git grep FDNAME
logind-session-device.c: ... "FDNAME=session-", sd->session->id);
logind-session-device.c: ... "FDNAME=session", sd->session->id);

Oops.

Fixes systemd#8343.  Or at least a more minimal reproducer.  Xorg still
dies when logind is restarted, but the Xorg message says this
is entirely deliberate.

(This could also be the reason I hit systemd#8035, instead of the race
condition I originally suggested).

(cherry picked from commit b5cdfa4)

[fbui: follow-up of 4050e47]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐛 Programming errors, that need preferential fixing login
Development

No branches or pull requests

3 participants