New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

udev fails to trigger loading of modules #11314

Closed
mbiebl opened this Issue Jan 2, 2019 · 110 comments

Comments

@mbiebl
Copy link
Contributor

mbiebl commented Jan 2, 2019

systemd version the issue has been seen with

v240

Used distribution

Debian sid

Since the update of systemd from v239 to v240, we had several downstream bug reports, where udev fails to trigger loading of kernel modules.
I think the following bug reports are basically all about the same underlying issue:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917185
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917247
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917607

So far, all reports are either from users using udev with sysvinit or with udev in the initramfs (which doesn't use systemd in Debian). I wonder if there is a race condition in udev which is more likely to trigger with sysvinit or in the initramfs.
I wasn't able to reproduce the issue myself so far.

The udev SysV init script can be found at
https://salsa.debian.org/systemd-team/systemd/blob/master/debian/udev.init
The initramfs hooks to start udevd are at
https://salsa.debian.org/systemd-team/systemd/blob/master/debian/extra/initramfs-tools/scripts/init-top/udev
https://salsa.debian.org/systemd-team/systemd/blob/master/debian/extra/initramfs-tools/scripts/init-bottom/udev

Any hints on how to further debug this would be greatly appreciated.

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 2, 2019

Does PR #11270 fix these issue?

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 2, 2019

From https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917185

I've just updated from 239-15 to 240-2; but, unfortunately, none of the problems were solved.

Hmm... The reporter rebuild initramfs?

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 2, 2019

Are there any chance to start udevd with debug logging? Otherwise, the kmod udev-builtin reports almost nothing on error.

@mbiebl

This comment has been minimized.

Copy link
Contributor

mbiebl commented Jan 2, 2019

@yuwata 240-2 does contain the patch from #11270. The initramfs is automatically rebuilt as part of the package update.

@mbiebl

This comment has been minimized.

Copy link
Contributor

mbiebl commented Jan 2, 2019

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 2, 2019

@yuwata does https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917247#45 help?

Unfortunately, no... I guess the reporter just paste only lines which contain 'systemd-udevd'. Currently, udev workers do not append 'systemd-udevd' at the beginning of the journal log. So, the reporter log does not contain any udev worker's log. Relevant fix is waiting in #11244.

@gregoa

This comment has been minimized.

Copy link

gregoa commented Jan 3, 2019

@mbiebl

This comment has been minimized.

Copy link
Contributor

mbiebl commented Jan 3, 2019

@gregoa there still seem some (initial) messages to be missing from a cursory glance.
Maybe increasing the kernel buffer size via log_buf_len=1M helps

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 3, 2019

systemd-udevd[526]: sd-device-monitor: Failed to create device from received message: Invalid argument

Hmm...

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 3, 2019

Ah, @mbiebl Debian or Ubuntu do not apply all patches in #11244 but only one commit which fixes segafault. So, still bind/unbind actions are dropped. I guess that causes so many issues, e.g. #11277 and #11299.

Please try to apply all patches, at least, b261494.

@mbiebl

This comment has been minimized.

Copy link
Contributor

mbiebl commented Jan 3, 2019

@gregoa can you try with that patch applied? I can provide deb packages if it's too much hassle for you.

@yuwata regarding this particular patch, I wasn't certain whether it's safe to pull, especially since that PR hasn't been merged yet. With all the back and forth regarding that bind/unbind problem, it wasn't immediately obvious if this revert is the right thing to do or not

@gregoa

This comment has been minimized.

Copy link

gregoa commented Jan 3, 2019

@mbiebl sure, I'll give it a try (tomorrow).

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 3, 2019

@mbiebl My previous comment does not intend to backport the patch right now, sorry. Yeah, let's wait reviews.

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 3, 2019

#11244 was merged. Let's close this, at least tentatively. If problem still exists, please reopen this.

@yuwata yuwata closed this Jan 3, 2019

@gregoa

This comment has been minimized.

Copy link

gregoa commented Jan 3, 2019

Adding b261494 on top of debian's 240-2 doesn't help, still the same failures. Logs at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917247#79

@keszybz keszybz reopened this Jan 3, 2019

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 3, 2019

@gregoa Thank you for the logs.
The module list of v240 in your first post does not contain 'evdev', but, the logs in the post#79, I found the following:

[   27.329014] [1106]: Inserted module 'evdev'
[   27.359598] [1105]: Module 'evdev' is already loaded

So, at least now evdev is correctly loaded?

@keszybz

This comment has been minimized.

Copy link
Member

keszybz commented Jan 3, 2019

Hmm, there were also my changes in ebcc52f and related commits. Maybe they are the issue.

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 3, 2019

BTW, the log provided by @gregoa does not contain, e.g., udevd processing 'acpi'. The first device processed by udevd is 'vcs2'. It seems quite strange for me. I am not sure whether it depends on the implementation of initramfs or not, but in my case, with dracut and plain systemd provided udev units, the first device is 'acpi'.

@gregoa

This comment has been minimized.

Copy link

gregoa commented Jan 3, 2019

@yuwata Please note that message 5 in the Debian bug report is not from me; I think in my prior logs (messages 45 and and 74), evdev was always present.

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 3, 2019

@gregoa Ah, sorry for that...

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 3, 2019

I am not familiar with the sysv init script, but it seems strange that it triggers only devices but not subsystems.

But init-top/udev triggers both. Is it intended?

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 3, 2019

@gregoa What happens if udevadm trigger --type=subsystems --action=add is added to udev.init?

@gregoa

This comment has been minimized.

Copy link

gregoa commented Jan 3, 2019

Wow. Looks like it just works; i.e. I have all modules back (and also no lvm warnings/timeouts). Thank you!
Quick diff for the debian init script and logs (just in case anyone is interested) at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917247#84

@mbiebl

This comment has been minimized.

Copy link
Contributor

mbiebl commented Jan 3, 2019

Fwiw, this part of the sysv init script has basically been unchanged for years (it was actually written before I started (co) maintaining systemd).
I'm still worried that we haven't actually found the root cause yet, as we have other user reports, where udev in the initramfs fails to load the modules as well (and initramfs-tools does use udevadm trigger --type=subsystems --action=add)

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 11, 2019

I've tried to create initrd by dracut without udevd: dracut -f -m 'bash base rootfs-block'. But, fortunately(?) udev works fine.
As you can see the script that dracut uses,
https://github.com/dracutdevs/dracut/blob/1ff306a3964840fc548188a8aa4862143b347e8a/modules.d/99base/init.sh#L152-L170,
dracut calls udevadm settle --timeout=0 and udevadm control --reload before triggering uevents. So, these may prevent to hit races in udevd (vs udevadm?).

p.s. Even if all lines between systemd-udevd --daemon and udevadm trigger are commented out, the issue cannot be reproduced...

@bikefrivolously

This comment has been minimized.

Copy link

bikefrivolously commented Jan 11, 2019

I did a bisect between v239 and v240 on systemd-stable
git bisect bad 1742aae <-- v240
git bisect good de7436b <-- v239

And came up with "e39b4d254de7c12be99a17cc54026695b0bd9c27 is the first bad commit".

I hope this helps narrow it down.

@bikefrivolously

This comment has been minimized.

Copy link

bikefrivolously commented Jan 11, 2019

And now I see that @lnussbaum already came to this conclusion.. welp, at least I learned something.

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 11, 2019

@bikefrivolously and @lnussbaum Thank you for bisecting.

And came up with "e39b4d254de7c12be99a17cc54026695b0bd9c27 is the first bad commit".

I've posted #11389. Could you test the PR?

@lnussbaum

This comment has been minimized.

Copy link

lnussbaum commented Jan 11, 2019

I can confirm that #11389 fixes the problem.

@lnussbaum

This comment has been minimized.

Copy link

lnussbaum commented Jan 11, 2019

I can also confirm that commit f90169d is enough to fix the problem

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 11, 2019

Wow, thanks a lot! Which commit does fix this issue?

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 11, 2019

Ah, I missed the previous comment. Thanks.

@ssolidus

This comment has been minimized.

Copy link

ssolidus commented Jan 11, 2019

@yuwata I can confirm that #11389 fixes the issue! My arch VM is booting fine from it. Thank you!

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 11, 2019

@ssolidus Thank you for testing the PR. If possible, please also test with only f90169d?

@cankoluman

This comment has been minimized.

Copy link

cankoluman commented Jan 11, 2019

@yuwata Thanks. I can also confirm that #11389 / commit f90169d fixes boot on Debian-Buster with a Gigabyte X-99 SLI (X99 Express Chipset)

Many thanks to all 👍

@ssolidus

This comment has been minimized.

Copy link

ssolidus commented Jan 11, 2019

@yuwata I confirm that f90169d fixes the boot still :)

@keszybz keszybz added this to the v241 milestone Jan 11, 2019

yuwata added a commit to yuwata/systemd that referenced this issue Jan 11, 2019

sd-device: do not try to set receive buffer size with SO_RCVBUF
v239 does not do that. But in v240, b1c097a
and e39b4d2 make set buffer size with SO_RCVBUF.
Still I do not know why that introduces systemd#11314. But, dropping that
actually fixes the issue. So, let's simply revert the relevant changes
made by those commits.

Fixes systemd#11314.
@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 11, 2019

@cankoluman and @ssolidus Thank you for testing the PR.

Uploading v240 to Debian was probably not a good idea it seems, I received quite a few angry user emails so far.
That said, not sure what more I can do :-/

@mbiebl I hope you will be freed from such emails soon :-)

yuwata added a commit to yuwata/systemd that referenced this issue Jan 11, 2019

sd-device-monitor: fix ordering of setting buffer size
By b1c097a (systemd#10239), the receive buffer
size for uevents was set by SO_RCVBUF at first, and fallback to
use SO_RCVBUFFORCE. So, as SO_RCVBUF limits to the buffer size
net.core.rmem_max, which is usually much smaller than 128MB udevd requests,
uevents buffer size was not sufficient.

This fixes the ordering of the request: SO_RCVBUFFORCE first, and
fallback to SO_RCVBUF. Then, udevd's uevent buffer size can be set to
128MB.

This also revert 9038932.

Fixes systemd#11314 and systemd#10754.
@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 11, 2019

Finally I found the root of the issue. Please test the revised version (98aed1d).

@keszybz keszybz closed this in ee0b9e7 Jan 11, 2019

@keszybz

This comment has been minimized.

Copy link
Member

keszybz commented Jan 11, 2019

Fixed by #11389.

@cankoluman

This comment has been minimized.

Copy link

cankoluman commented Jan 12, 2019

@yuwata & @keszybz, thanks again for the updates. I tested both 98aed1d and ee0b9e7 via systemd-udevd substitution - (on Debian-Buster X99 Chipset).

initramfs boots fine in both cases

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 12, 2019

Thank you for your help.

@ssolidus

This comment has been minimized.

Copy link

ssolidus commented Jan 12, 2019

@yuwata I also tested 98aed1d and confirm its working :)

@msg7086

This comment has been minimized.

Copy link

msg7086 commented Jan 15, 2019

Tested on Debian Buster VMware ESXi 6.7, and the patch seems to be working. I had to pull both systemd and udev 240-4 from sid, and since then I haven't seen any booting failures.

EDIT: Thanks!

@ewhac

This comment has been minimized.

Copy link

ewhac commented Jan 18, 2019

BTW, I didn't properly express my thanks. v240-4 from Debian Sid contains this patch, and works fine for me. Thank you kindly.

@yuwata

This comment has been minimized.

Copy link
Member

yuwata commented Jan 18, 2019

@ewhac Thank you for your help.

wkennington pushed a commit to triton/systemd that referenced this issue Jan 20, 2019

udev: open control and netlink sockets before daemonization
c4b69e9 effectively moved the initalization of socket.
Before that commit:
run → listen_fds → udev_ctrl_new → udev_ctrl_new_from_fd → socket()
After:
run → main_loop → manager_new → udev_ctrl_new_from_fd → socket()

The problem is that main_loop was called after daemonization. Move manager_new
out of main_loop and before daemonization.

Fixes systemd#11314 (hopefully ;)).

v2: Yu Watanabe
sd_event is initialized in main_loop().

(cherry picked from commit b5af8c8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment