Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

udev fails to trigger loading of modules #11314

Closed
mbiebl opened this issue Jan 2, 2019 · 110 comments
Closed

udev fails to trigger loading of modules #11314

mbiebl opened this issue Jan 2, 2019 · 110 comments
Labels
regression ⚠️ A bug in something that used to work correctly and broke through some recent commit udev
Milestone

Comments

@mbiebl
Copy link
Contributor

mbiebl commented Jan 2, 2019

systemd version the issue has been seen with

v240

Used distribution

Debian sid

Since the update of systemd from v239 to v240, we had several downstream bug reports, where udev fails to trigger loading of kernel modules.
I think the following bug reports are basically all about the same underlying issue:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917185
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917247
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917607

So far, all reports are either from users using udev with sysvinit or with udev in the initramfs (which doesn't use systemd in Debian). I wonder if there is a race condition in udev which is more likely to trigger with sysvinit or in the initramfs.
I wasn't able to reproduce the issue myself so far.

The udev SysV init script can be found at
https://salsa.debian.org/systemd-team/systemd/blob/master/debian/udev.init
The initramfs hooks to start udevd are at
https://salsa.debian.org/systemd-team/systemd/blob/master/debian/extra/initramfs-tools/scripts/init-top/udev
https://salsa.debian.org/systemd-team/systemd/blob/master/debian/extra/initramfs-tools/scripts/init-bottom/udev

Any hints on how to further debug this would be greatly appreciated.

@mbiebl mbiebl added udev regression ⚠️ A bug in something that used to work correctly and broke through some recent commit labels Jan 2, 2019
@yuwata
Copy link
Member

yuwata commented Jan 2, 2019

Does PR #11270 fix these issue?

@yuwata
Copy link
Member

yuwata commented Jan 2, 2019

From https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917185

I've just updated from 239-15 to 240-2; but, unfortunately, none of the problems were solved.

Hmm... The reporter rebuild initramfs?

@yuwata
Copy link
Member

yuwata commented Jan 2, 2019

Are there any chance to start udevd with debug logging? Otherwise, the kmod udev-builtin reports almost nothing on error.

@yuwata yuwata added the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Jan 2, 2019
@mbiebl
Copy link
Contributor Author

mbiebl commented Jan 2, 2019

@yuwata 240-2 does contain the patch from #11270. The initramfs is automatically rebuilt as part of the package update.

@mbiebl
Copy link
Contributor Author

mbiebl commented Jan 2, 2019

@yuwata
Copy link
Member

yuwata commented Jan 2, 2019

@yuwata does https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917247#45 help?

Unfortunately, no... I guess the reporter just paste only lines which contain 'systemd-udevd'. Currently, udev workers do not append 'systemd-udevd' at the beginning of the journal log. So, the reporter log does not contain any udev worker's log. Relevant fix is waiting in #11244.

@gregoa
Copy link

gregoa commented Jan 3, 2019

I tried again, new logs attached to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917247#74

@mbiebl
Copy link
Contributor Author

mbiebl commented Jan 3, 2019

@gregoa there still seem some (initial) messages to be missing from a cursory glance.
Maybe increasing the kernel buffer size via log_buf_len=1M helps

@yuwata
Copy link
Member

yuwata commented Jan 3, 2019

systemd-udevd[526]: sd-device-monitor: Failed to create device from received message: Invalid argument

Hmm...

@yuwata
Copy link
Member

yuwata commented Jan 3, 2019

Ah, @mbiebl Debian or Ubuntu do not apply all patches in #11244 but only one commit which fixes segafault. So, still bind/unbind actions are dropped. I guess that causes so many issues, e.g. #11277 and #11299.

Please try to apply all patches, at least, b261494.

@mbiebl
Copy link
Contributor Author

mbiebl commented Jan 3, 2019

@gregoa can you try with that patch applied? I can provide deb packages if it's too much hassle for you.

@yuwata regarding this particular patch, I wasn't certain whether it's safe to pull, especially since that PR hasn't been merged yet. With all the back and forth regarding that bind/unbind problem, it wasn't immediately obvious if this revert is the right thing to do or not

@gregoa
Copy link

gregoa commented Jan 3, 2019

@mbiebl sure, I'll give it a try (tomorrow).

@yuwata
Copy link
Member

yuwata commented Jan 3, 2019

@mbiebl My previous comment does not intend to backport the patch right now, sorry. Yeah, let's wait reviews.

@yuwata
Copy link
Member

yuwata commented Jan 3, 2019

#11244 was merged. Let's close this, at least tentatively. If problem still exists, please reopen this.

@yuwata yuwata closed this as completed Jan 3, 2019
@gregoa
Copy link

gregoa commented Jan 3, 2019

Adding b261494 on top of debian's 240-2 doesn't help, still the same failures. Logs at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917247#79

@keszybz keszybz reopened this Jan 3, 2019
@yuwata
Copy link
Member

yuwata commented Jan 3, 2019

@gregoa Thank you for the logs.
The module list of v240 in your first post does not contain 'evdev', but, the logs in the post#79, I found the following:

[   27.329014] [1106]: Inserted module 'evdev'
[   27.359598] [1105]: Module 'evdev' is already loaded

So, at least now evdev is correctly loaded?

@keszybz
Copy link
Member

keszybz commented Jan 3, 2019

Hmm, there were also my changes in ebcc52f and related commits. Maybe they are the issue.

@yuwata
Copy link
Member

yuwata commented Jan 3, 2019

BTW, the log provided by @gregoa does not contain, e.g., udevd processing 'acpi'. The first device processed by udevd is 'vcs2'. It seems quite strange for me. I am not sure whether it depends on the implementation of initramfs or not, but in my case, with dracut and plain systemd provided udev units, the first device is 'acpi'.

@gregoa
Copy link

gregoa commented Jan 3, 2019

@yuwata Please note that message 5 in the Debian bug report is not from me; I think in my prior logs (messages 45 and and 74), evdev was always present.

@yuwata
Copy link
Member

yuwata commented Jan 3, 2019

@gregoa Ah, sorry for that...

@yuwata
Copy link
Member

yuwata commented Jan 3, 2019

I am not familiar with the sysv init script, but it seems strange that it triggers only devices but not subsystems.

But init-top/udev triggers both. Is it intended?

@yuwata
Copy link
Member

yuwata commented Jan 3, 2019

@gregoa What happens if udevadm trigger --type=subsystems --action=add is added to udev.init?

@gregoa
Copy link

gregoa commented Jan 3, 2019

Wow. Looks like it just works; i.e. I have all modules back (and also no lvm warnings/timeouts). Thank you!
Quick diff for the debian init script and logs (just in case anyone is interested) at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917247#84

@mbiebl
Copy link
Contributor Author

mbiebl commented Jan 3, 2019

Fwiw, this part of the sysv init script has basically been unchanged for years (it was actually written before I started (co) maintaining systemd).
I'm still worried that we haven't actually found the root cause yet, as we have other user reports, where udev in the initramfs fails to load the modules as well (and initramfs-tools does use udevadm trigger --type=subsystems --action=add)

@yuwata
Copy link
Member

yuwata commented Jan 11, 2019

I've tried to create initrd by dracut without udevd: dracut -f -m 'bash base rootfs-block'. But, fortunately(?) udev works fine.
As you can see the script that dracut uses,
https://github.com/dracutdevs/dracut/blob/1ff306a3964840fc548188a8aa4862143b347e8a/modules.d/99base/init.sh#L152-L170,
dracut calls udevadm settle --timeout=0 and udevadm control --reload before triggering uevents. So, these may prevent to hit races in udevd (vs udevadm?).

p.s. Even if all lines between systemd-udevd --daemon and udevadm trigger are commented out, the issue cannot be reproduced...

@bikefrivolously
Copy link

I did a bisect between v239 and v240 on systemd-stable
git bisect bad 1742aae <-- v240
git bisect good de7436b <-- v239

And came up with "e39b4d254de7c12be99a17cc54026695b0bd9c27 is the first bad commit".

I hope this helps narrow it down.

@bikefrivolously
Copy link

And now I see that @lnussbaum already came to this conclusion.. welp, at least I learned something.

@yuwata
Copy link
Member

yuwata commented Jan 11, 2019

@bikefrivolously and @lnussbaum Thank you for bisecting.

And came up with "e39b4d254de7c12be99a17cc54026695b0bd9c27 is the first bad commit".

I've posted #11389. Could you test the PR?

@lnussbaum
Copy link

I can confirm that #11389 fixes the problem.

@lnussbaum
Copy link

I can also confirm that commit f90169d is enough to fix the problem

@yuwata
Copy link
Member

yuwata commented Jan 11, 2019

Wow, thanks a lot! Which commit does fix this issue?

@yuwata
Copy link
Member

yuwata commented Jan 11, 2019

Ah, I missed the previous comment. Thanks.

@ssolidus
Copy link

@yuwata I can confirm that #11389 fixes the issue! My arch VM is booting fine from it. Thank you!

@yuwata
Copy link
Member

yuwata commented Jan 11, 2019

@ssolidus Thank you for testing the PR. If possible, please also test with only f90169d?

@cankoluman
Copy link

@yuwata Thanks. I can also confirm that #11389 / commit f90169d fixes boot on Debian-Buster with a Gigabyte X-99 SLI (X99 Express Chipset)

Many thanks to all 👍

@ssolidus
Copy link

@yuwata I confirm that f90169d fixes the boot still :)

@keszybz keszybz added has-pr ✨ and removed needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer labels Jan 11, 2019
@keszybz keszybz added this to the v241 milestone Jan 11, 2019
yuwata added a commit to yuwata/systemd that referenced this issue Jan 11, 2019
v239 does not do that. But in v240, b1c097a
and e39b4d2 make set buffer size with SO_RCVBUF.
Still I do not know why that introduces systemd#11314. But, dropping that
actually fixes the issue. So, let's simply revert the relevant changes
made by those commits.

Fixes systemd#11314.
@yuwata
Copy link
Member

yuwata commented Jan 11, 2019

@cankoluman and @ssolidus Thank you for testing the PR.

Uploading v240 to Debian was probably not a good idea it seems, I received quite a few angry user emails so far.
That said, not sure what more I can do :-/

@mbiebl I hope you will be freed from such emails soon :-)

yuwata added a commit to yuwata/systemd that referenced this issue Jan 11, 2019
By b1c097a (systemd#10239), the receive buffer
size for uevents was set by SO_RCVBUF at first, and fallback to
use SO_RCVBUFFORCE. So, as SO_RCVBUF limits to the buffer size
net.core.rmem_max, which is usually much smaller than 128MB udevd requests,
uevents buffer size was not sufficient.

This fixes the ordering of the request: SO_RCVBUFFORCE first, and
fallback to SO_RCVBUF. Then, udevd's uevent buffer size can be set to
128MB.

This also revert 9038932.

Fixes systemd#11314 and systemd#10754.
@yuwata
Copy link
Member

yuwata commented Jan 11, 2019

Finally I found the root of the issue. Please test the revised version (98aed1d).

@keszybz
Copy link
Member

keszybz commented Jan 11, 2019

Fixed by #11389.

@cankoluman
Copy link

@yuwata & @keszybz, thanks again for the updates. I tested both 98aed1d and ee0b9e7 via systemd-udevd substitution - (on Debian-Buster X99 Chipset).

initramfs boots fine in both cases

@yuwata
Copy link
Member

yuwata commented Jan 12, 2019

Thank you for your help.

@ssolidus
Copy link

@yuwata I also tested 98aed1d and confirm its working :)

@msg7086
Copy link

msg7086 commented Jan 15, 2019

Tested on Debian Buster VMware ESXi 6.7, and the patch seems to be working. I had to pull both systemd and udev 240-4 from sid, and since then I haven't seen any booting failures.

EDIT: Thanks!

@ewhac
Copy link

ewhac commented Jan 18, 2019

BTW, I didn't properly express my thanks. v240-4 from Debian Sid contains this patch, and works fine for me. Thank you kindly.

@yuwata
Copy link
Member

yuwata commented Jan 18, 2019

@ewhac Thank you for your help.

wkennington pushed a commit to triton/systemd that referenced this issue Jan 20, 2019
c4b69e9 effectively moved the initalization of socket.
Before that commit:
run → listen_fds → udev_ctrl_new → udev_ctrl_new_from_fd → socket()
After:
run → main_loop → manager_new → udev_ctrl_new_from_fd → socket()

The problem is that main_loop was called after daemonization. Move manager_new
out of main_loop and before daemonization.

Fixes systemd#11314 (hopefully ;)).

v2: Yu Watanabe
sd_event is initialized in main_loop().

(cherry picked from commit b5af8c8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
regression ⚠️ A bug in something that used to work correctly and broke through some recent commit udev
Development

No branches or pull requests