systemd-boot: Error preparing initrd: Bad Buffer Size #25911

Closed
kernle32dll opened this issue Jan 2, 2023 · 45 comments · Fixed by #25948
Labels
bug 🐛 Programming errors, that need preferential fixing sd-boot/sd-stub/bootctl

Comments

@kernle32dll

kernle32dll commented Jan 2, 2023

systemd version the issue has been seen with

252

Used distribution

Arch Linux

Linux kernel version used

6.1.1-arch1-1

CPU architectures issue was seen on

x86_64

Component

systemd-boot, other

Expected behaviour you didn't see

A successful boot

Unexpected behaviour you saw

Unsuccessful boot after selecting the entry in systemd-boot, with a cryptic error message:

image

Steps to reproduce the problem

This happened on a Dell R420 server.

I run a fairly default setup: a FAT32 partition mounted as /boot, containing the built initramfs, etc. I don't use any modules or anything. I do use https://github.com/random-archer/mkinitcpio-systemd-tool for cryptsetup, but I believe the problem occurs much earlier.

This setup worked for about two years without any issue, but has been flaky for a few weeks (might be months) now. After chrooting into the installation and randomly reinstalling stuff, and rebuilding the boot components, it did work again briefly, but has been broken again since. I really have no idea what influences the problem.

I do have a hunch that this might be related to the server's NVRAM or EFI variables. Problems started occurring when I was tinkering around with unified kernel images (which the server won't boot either directly or via systemd-boot, but that is a different topic). In any case, I did briefly run out of space while tinkering around with UEFI boot entries using efibootmgr.

Additional program output to the terminal or log subsystem illustrating the issue

No response

@kernle32dll kernle32dll added the bug 🐛 Programming errors, that need preferential fixing label Jan 2, 2023
@kernle32dll kernle32dll changed the title from "systemd-boot: Error preparing initrd: Bad Buffer Size." to "systemd-boot: Error preparing initrd: Bad Buffer Size" Jan 2, 2023
@medhefgo
Contributor

medhefgo commented Jan 3, 2023

Does one of the initrds referenced by the entry happen to have a size of 0 by any chance? (Ideally checked via EFI shell to make sure kernel and EFI agree)

Also, if you remove the initrd lines in the .conf file and instead append initrd=\path-to-initrd-relative-to-ESP-root to the cmdline for each of them, does it boot then?

> Problems started occurring when I was tinkering around with unified kernel images (which the server won't boot either directly or via systemd-boot, but that is a different topic).

I'd like to hear about that as well, would be nice if you could create a separate issue about that.

poettering added a commit to poettering/systemd that referenced this issue Jan 3, 2023
Let's avoid calling Read() with zero-sized buffer, to avoid needless firmware
quirkiness.

See: systemd#25911
@poettering
Member

> Does one of the initrds referenced by the entry happen to have a size of 0 by any chance? (Ideally checked via EFI shell to make sure kernel and EFI agree)

Or alternatively: does the issue go away if you apply #25922?

@kernle32dll
Author

kernle32dll commented Jan 3, 2023

> Does one of the initrds referenced by the entry happen to have a size of 0 by any chance? (Ideally checked via EFI shell to make sure kernel and EFI agree)

Looks OK

image

image

> Also, if you remove the initrd lines in the .conf file and instead append initrd=\path-to-initrd-relative-to-ESP-root to the cmdline for each of them, does it boot then?

Will try that

Edit: WTF, that worked. I am double checking that I changed nothing else.

Edit 2: Works indeed. My configs:

Does not work:

title Arch Linux
linux /vmlinuz-linux
initrd /intel-ucode.img
initrd /initramfs-linux.img
options root=/dev/mapper/root1 rootflags=subvol=@ rw

Does work:

title Arch Linux
linux /vmlinuz-linux
options initrd=/intel-ucode.img initrd=/initramfs-linux.img root=/dev/mapper/root1 rootflags=subvol=@ rw

> > Problems started occurring when I was tinkering around with unified kernel images (which the server won't boot either directly or via systemd-boot, but that is a different topic).
>
> I'd like to hear about that as well, would be nice if you could create a separate issue about that.

Sure, I'm just not sure where to place it, as it doesn't seem to be a systemd problem per se. The error I get is the same regardless of whether it is booted directly or via systemd-boot.

@medhefgo
Contributor

medhefgo commented Jan 3, 2023

Well, an empty file was a wild shot. https://github.com/medhefgo/systemd/tree/boot-bad-buffer-size contains a potential fix along with some debug logging in case it doesn't work (this should be with initrd config options instead of in the cmdline).

> Edit: WTF, that worked. I am double checking that I changed nothing else.

This is expected. This just leaves the work of fetching the initrd to the kernel instead of doing it ourselves. Now we just need to figure out what the kernel does better…

> Sure, I'm just not sure where to place it, as it doesn't seem to be a systemd problem per se. The error I get is the same regardless of whether it is booted directly or via systemd-boot.

Well, telling us the error message would be a start. :D

@kernle32dll
Author

> Well, telling us the error message would be a start. :D

image

Obviously, the file is there, as it's picked up by systemd-boot without any loader config.

@kernle32dll
Author

> Well, an empty file was a wild shot. https://github.com/medhefgo/systemd/tree/boot-bad-buffer-size contains a potential fix along with some debug logging in case it doesn't work (this should be with initrd config options instead of in the cmdline).

I don't see any changes unfortunately? Give me a ping if you want me to test something 💪

bluca pushed a commit that referenced this issue Jan 3, 2023
Let's avoid calling Read() with zero-sized buffer, to avoid needless firmware
quirkiness.

See: #25911
@medhefgo
Contributor

medhefgo commented Jan 4, 2023

> I don't see any changes unfortunately? Give me a ping if you want me to test something 💪

Would've helped if I actually committed my changes. Please try again.

@kernle32dll
Author

kernle32dll commented Jan 5, 2023

image

@medhefgo there you go

Edit: Note that this test was done with the fallback initrd, but it fails for the non-fallback one as well.

@medhefgo
Contributor

medhefgo commented Jan 5, 2023

Well, that firmware is dented. The read size we give it is valid and the buffer suitably allocated.

There is a slight chance #25848 is causing this, I pulled it into the branch, just in case you wanna test this.

But more likely the firmware is one of those that cannot read large buffers, considering that the (small) ucode initrd was read without issues. You could try booting with efi=nochunk to see if the kernel would hit the same issue then too.

@medhefgo
Contributor

medhefgo commented Jan 5, 2023

Also, regarding the UKI not loading: could be the same issue at hand (when we discover it we only read small chunks from it instead of the whole file at once).

You said booting it directly without a bootloader in between also fails? Can you try starting it from the EFI shell to see if it gives the same error? (If it says nothing, you can get an error code with echo %lasterror%.)

@poettering
Member

Hmm, maybe we should load these files with EFI_LOAD_FILE_PROTOCOL or so?

@kernle32dll
Author

> There is a slight chance #25848 is causing this, I pulled it into the branch, just in case you wanna test this.

No dice

> But more likely the firmware is one of those that cannot read large buffers, considering that the (small) ucode initrd was read without issues. You could try booting with efi=nochunk to see if the kernel would hit the same issue then too.

image

Spot on

> You said booting it directly without a bootloader in between also fails? Can you try starting it from the EFI shell to see if it gives the same error? (If it says nothing, you can get an error code with echo %lasterror%.)

image

@poettering
Member

maybe we should try to load the thing in one go, and if that fails revert to chunked reads?
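
The fallback suggested here, one full-size read first and chunked reads only if that fails, could look roughly like the sketch below. It is not systemd-boot code: `MockFile`, `mock_read`, and `read_with_fallback` are hypothetical names, and the mock simply rejects any request larger than `max_read` to imitate the firmware behaviour reported in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a firmware file handle. */
typedef struct {
    const unsigned char *data; /* file contents */
    size_t size;               /* file size in bytes */
    size_t pos;                /* current read position */
    size_t max_read;           /* larger requests fail, like the buggy firmware */
} MockFile;

/* Reads up to *len bytes and stores the number actually read in *len.
 * Returns 0 on success, -1 to mimic a "bad buffer size" error. */
static int mock_read(MockFile *f, size_t *len, unsigned char *buf) {
    assert(f && len && buf);
    if (*len > f->max_read)
        return -1;
    size_t left = f->size - f->pos;
    if (*len > left)
        *len = left;
    memcpy(buf, f->data + f->pos, *len);
    f->pos += *len;
    return 0;
}

/* Try one read of the whole file; on failure, rewind and read in chunks. */
static int read_with_fallback(MockFile *f, unsigned char *buf, size_t size, size_t chunk) {
    size_t len = size;

    f->pos = 0;
    if (mock_read(f, &len, buf) == 0 && len == size)
        return 0; /* the single large read worked */

    if (chunk == 0)
        return -1; /* never issue zero-sized reads */

    f->pos = 0;
    for (size_t done = 0; done < size; done += len) {
        len = size - done < chunk ? size - done : chunk;
        if (mock_read(f, &len, buf + done) != 0 || len == 0)
            return -1; /* read error or unexpected end of file */
    }
    return 0;
}
```

With a 100-byte file and `max_read` set to 32, a call with `chunk` 16 succeeds through the fallback path, while `chunk` 64 still fails because every request exceeds the limit.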

@medhefgo
Contributor

medhefgo commented Jan 5, 2023

> Hmm, maybe we should load these files with EFI_LOAD_FILE_PROTOCOL or so?

That would require the firmware to provide that protocol on the device (it likely doesn't). And it would have to go through the broken file system code anyways.

> maybe we should try to load the thing in one go, and if that fails revert to chunked reads?

Always so impatient…

@kernle32dll Please give the PR a try.

@kernle32dll
Author

Welp, coming back with some unexpected results... I double checked

title Arch Linux (nochunk)
linux /vmlinuz-linux
options efi=nochunk initrd=/intel-ucode.img initrd=/initramfs-linux.img root=/dev/mapper/root1 rootflags=subvol=@ rw

First of all, efi=nochunk suddenly started working (even without the PR changes). I have no idea why - I made no modifications to the system, besides rebuilding systemd.

title Arch Linux (bad buffer size)
linux /vmlinuz-linux
initrd /intel-ucode.img
initrd /initramfs-linux-fallback.img
options root=/dev/mapper/root1 rootflags=subvol=@ rw

Still not working, same Bad Buffer Size error.

@kernle32dll
Author

@medhefgo Could you provide another commit with additional debug output? I would love to help debugging this further

@medhefgo
Contributor

I haven't forgotten about you. I am focusing on other areas right now and also still thinking on what to do next here.

@medhefgo
Contributor

Not sure if you are comfortable with changing the C code:

  • Maybe you can play with the chunk size (make it 1M, very small, and maybe even larger than the buf/file size).
  • Maybe the firmware will refuse to use the handle once a too large read was called. Maybe removing the GetPosition()/Read()/SetPosition() calls in front of the loop helps.

Let's maybe also rule out some other issues:

  • memtest
  • fsck
  • S.M.A.R.T. self-test
  • firmware update?

Why is it that the quirky firmware always has to be on remote machines I can't put my hands on... :(

@kernle32dll
Author

> I haven't forgotten about you. I am focusing on other areas right now and also still thinking on what to do next here.

No pressure :) Got a working workaround with the kernel options. Just eager to understand the issue.

> Not sure if you are comfortable with changing the C code:
>
> * Maybe you can play with the chunk size (make it 1M, very small, and maybe even larger than the buf/file size).
> * Maybe the firmware will refuse to use the handle once a too large read was called. Maybe removing the GetPosition()/Read()/SetPosition() calls in front of the loop helps.

Sure, will give that a try.

> Let's maybe also rule out some other issues:
>
> * memtest
> * fsck
> * S.M.A.R.T. self-test
> * firmware update?

Will do

> Why is it that the quirky firmware always has to be on remote machines I can't put my hands on... :(

Tbh, I got myself a PiKVM for exactly THAT issue 😄

@medhefgo
Contributor

Please try this new branch: https://github.com/medhefgo/systemd/tree/boot-bad-buffer-size-test

It will automatically perform a chunk size bisection for any initrds that are loaded for a given entry. It has two phases and I hope at least one of them will converge.
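
The branch itself is not shown in this thread, so the following is only a sketch of what a chunk size bisection can look like, not the code from that branch. It assumes that reads at or below some unknown limit succeed and larger ones fail; `probe_read`, `firmware_limit`, and `bisect_chunk_size` are hypothetical names, and the probe is a pure simulation rather than a real firmware call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical probe: would a single read of len bytes succeed?
 * A real probe would issue the read against the firmware instead. */
static bool probe_read(size_t len, size_t firmware_limit) {
    assert(len > 0); /* zero-sized reads are never probed */
    return len <= firmware_limit;
}

/* Returns the largest size in [1, file_size] that the probe accepts,
 * or 0 if no size works. Assumes success is monotonic in the size. */
static size_t bisect_chunk_size(size_t file_size, size_t firmware_limit) {
    assert(file_size < SIZE_MAX);

    size_t good = 0;            /* largest size known to work (0: none yet) */
    size_t bad = file_size + 1; /* treated as failing; never probed itself */

    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (probe_read(mid, firmware_limit))
            good = mid;
        else
            bad = mid;
    }
    return good;
}
```

If the monotonicity assumption does not hold on a given firmware, a search like this may not converge on a meaningful value.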

@bluca bluca added the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Jan 19, 2023
eworm-de pushed a commit to eworm-de/systemd that referenced this issue Feb 4, 2023
Let's avoid calling Read() with zero-sized buffer, to avoid needless firmware
quirkiness.

See: systemd#25911
(cherry picked from commit fd1fec5)
d-hatayama pushed a commit to d-hatayama/systemd that referenced this issue Feb 15, 2023
Let's avoid calling Read() with zero-sized buffer, to avoid needless firmware
quirkiness.

See: systemd#25911
@medhefgo
Contributor

This makes no sense. Are you sure you built and tested the correct commit (d2ab1a4)?

There's a chance you're testing a stale build artifact too. We now need python-pyelftools instead of gnu-efi, so unless it was already installed for you, you would've had to install it or no bootloader would've been built.

@kernle32dll
Author

Yeah, I checked twice. I know it built correctly, since I did not see the bisect test again. If you put in some debug code, we at least might know where it fails?

@kernle32dll
Author

@medhefgo I did some testing myself - it seems to fail to re-open the file, although I can't say why:
https://github.com/medhefgo/systemd/blob/d2ab1a4f332166c743f7160dcc575e882cb7e192/src/boot/efi/util.c#L348

@medhefgo
Contributor

The only thing I can think of is that the file handles are reused. The re-opening worked for the bisection test and we did not have the handle open twice there.

I updated the PR, please give it another try.

@kernle32dll
Author

@kernle32dll
Author

I also double checked that line again (on your previous version) https://github.com/medhefgo/systemd/blob/d2ab1a4f332166c743f7160dcc575e882cb7e192/src/boot/efi/util.c#L348

It's indeed returning the same BAD_BUFFER_SIZE error, which is odd for the open call. So something is up with the handle alright.

@medhefgo
Contributor

We're making progress. I've updated the branch to print out the size we expect vs get along with a sha256 sum of the initrd. Please give it a try (and tell me if sha256sum disagrees).

@kernle32dll
Author

Build is failing :( Same as CI:

../src/boot/efi/boot.c: In function "initrd_prepare":
../src/boot/efi/boot.c:2334:17: error: implicit declaration of function "hexdump" [-Werror=implicit-function-declaration]
 2334 |                 hexdump(u"sha256", sha256, SHA256_DIGEST_SIZE);
      |                 ^~~~~~~
../src/boot/efi/boot.c:2334:17: error: nested extern declaration of "hexdump" [-Werror=nested-externs]

@medhefgo
Contributor

medhefgo commented Mar 31, 2023

You need to pass --debug (and -Dmode=developer) to meson.

@kernle32dll
Author

Not experienced with meson - where do I put these? The configure file?

@medhefgo
Contributor

medhefgo commented Apr 1, 2023

You must have called meson at some point when compiling. Either manually or as part of your PKGBUILD. Just append it to the cmdline there. And if you have, don't use the configure script, just call meson setup $builddir directly.

@kernle32dll
Author

kernle32dll commented Apr 2, 2023

image

Followed by:

image

@medhefgo
Contributor

medhefgo commented Apr 3, 2023

Delightful. The re-opened handle will silently truncate the file. I guess we have to always do chunked-reads just like the kernel does. 😿

Please try the PR again. This should hopefully work now.

Also, is there any chance you're missing a firmware upgrade (that happens to fix this)?
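
A minimal sketch of the always-chunked approach mentioned above, with a check that the total matches the expected size so that a silently truncated file is noticed. This is not the code from the PR: `MockFile`, `mock_read`, `read_chunked`, and the `READ_*` codes are hypothetical, and the mock can be set up to deliver fewer bytes than expected to imitate the truncation seen on this machine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { READ_OK = 0, READ_TRUNCATED = -1 };

/* Hypothetical in-memory stand-in for a (re-opened) file handle. */
typedef struct {
    const unsigned char *data; /* file contents */
    size_t size;               /* bytes this handle will actually deliver */
    size_t pos;                /* current read position */
} MockFile;

/* Copies up to len bytes into buf and returns how many were delivered. */
static size_t mock_read(MockFile *f, unsigned char *buf, size_t len) {
    size_t left = f->size - f->pos;
    if (len > left)
        len = left;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return len;
}

/* Reads `expected` bytes using requests of at most `chunk` bytes each.
 * Reports truncation if the handle runs dry before `expected` is reached. */
static int read_chunked(MockFile *f, unsigned char *buf, size_t expected, size_t chunk) {
    assert(chunk > 0);

    size_t done = 0;
    while (done < expected) {
        size_t want = expected - done < chunk ? expected - done : chunk;
        size_t got = mock_read(f, buf + done, want);
        if (got == 0)
            return READ_TRUNCATED; /* fewer bytes than the caller expected */
        done += got;
    }
    return READ_OK;
}
```

The size check matters here because, per the comment above, the truncation was silent: the short read itself did not report an error.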

@kernle32dll
Author

> Delightful. The re-opened handle will silently truncate the file. I guess we have to always do chunked-reads just like the kernel does. 😿

Nice 😞

> Please try the PR again. This should hopefully work now.

Will do. I will come back with results.

> Also, is there any chance you're missing a firmware upgrade (that happens to fix this)?

Unfortunately not. The server is already fully updated. However, the server is a Dell R420 series, which is almost 10 years old by now. So I have little hope there 😢

@ElvishJerricco
Contributor

ElvishJerricco commented Apr 21, 2023

For anyone who wants to test this in VMs, here's a little patch to OVMF that causes it to exhibit (part of) this bug:

diff --git a/FatPkg/EnhancedFatDxe/ReadWrite.c b/FatPkg/EnhancedFatDxe/ReadWrite.c
index 8f525044d1..1fed0fecce 100644
--- a/FatPkg/EnhancedFatDxe/ReadWrite.c
+++ b/FatPkg/EnhancedFatDxe/ReadWrite.c
@@ -216,6 +216,10 @@ FatIFileAccess (
   Volume = OFile->Volume;
   Task   = NULL;
 
+  if (*BufferSize > (10U * 1024U * 1024U)) {
+    return EFI_BAD_BUFFER_SIZE;
+  }
+
   //
   // Write to a directory is unsupported
   //

It just makes the FAT driver return EFI_BAD_BUFFER_SIZE if you try to read or write more than 10M. Why 10M? Because if it's much smaller, LoadImage breaks on the 8.3M kernel I was testing with, and 16M would have been too big for the 13M initrd I was testing with. I suspect that if your kernel exceeds the limit on these real-world buggy firmwares, LoadImage will probably fail too. So keep your kernels small, I guess.

I did not bother trying to replicate the truncate-on-reopen behavior because that sounded like a much bigger patch.
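
To make the choice of 10M concrete, the check added by the patch can be restated as a small predicate and applied to the sizes quoted above. This is only an illustration: `rejects` is a hypothetical helper mirroring the patch's comparison, the 8.3M and 13M figures are the approximate sizes from the comment, and M is taken to mean MiB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIB (1024u * 1024u)

/* Mirrors the patched condition: requests above the limit are refused. */
static bool rejects(size_t request, size_t limit) {
    assert(limit > 0);
    return request > limit;
}

/* Approximate sizes taken from the comment above. */
static const size_t KERNEL_SIZE = 83u * MIB / 10u; /* about 8.3M */
static const size_t INITRD_SIZE = 13u * MIB;       /* 13M */
```

With a 10M limit, `rejects(KERNEL_SIZE, 10u * MIB)` is false and `rejects(INITRD_SIZE, 10u * MIB)` is true, so the kernel still loads while the initrd read fails. A 16M limit would let the initrd through and hide the bug, and an 8M limit would also reject the kernel.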

@kernle32dll
Author

@medhefgo Hey, sorry for coming back so late to you.

Tested your MR, works like a charm!

@ThomasLamprecht

ThomasLamprecht commented Jun 28, 2023

Could the commit fixing this (3ed1d96) please also be backported to the systemd-stable v252 branch, which e.g. Debian 12 Bookworm is based on?
It would then automatically be shipped in one of their future 12.x point releases, as a maintainer of the systemd package in Debian wrote.

We (Proxmox, based on Debian) will do the backport ourselves earlier in the meantime anyway, but as we got quite a few reports already I think Debian users, and users from other distros relying on the systemd-stable project, would benefit from this.

Thanks for your consideration!

@Skinner927

I figured I'll write this here since multiple forum posts link here. I was able to circumvent this issue by turning on fast boot in BIOS. Oddly my boot time actually takes longer but an OROM driver gets loaded for my raid card (hardware RAID is disabled, but the controller still exists) and the error goes away. Also a Proxmox user if that matters.

@mbiebl mbiebl removed the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Aug 22, 2023
@JackPala

I have the same issue described here with the latest official Proxmox ISO on two separate systems. Both are Dell PowerEdges with PERC RAID cards flashed to IT mode for ZFS. A third server that does not boot from ZFS did not have the issue. Proxmox 7 works flawlessly with ZFS booting.

@ElvishJerricco
Contributor

@JackPala ZFS isn't relevant to this issue. This issue is solely about systemd-boot failing to read the initrd from the FAT32 ESP. ZFS doesn't come into play until a significantly later stage during boot.

nmeyerhans pushed a commit to nmeyerhans/systemd that referenced this issue Jan 21, 2024
Fixes: systemd#25911
(cherry picked from commit f70f992)
(cherry picked from commit 1a0f2c5)
@mudler

mudler commented Feb 22, 2024

Bumping into this with UKI files - it happens in a way that is not easily reproducible. Some of the UKI files we generate occasionally fail with this error, sometimes they don't.

The interesting aspect is that I'm testing this with QEMU/edk2 - not real hardware at all.


10 participants