Systemd init fails on version 255, trying to mount a non-existent disk with no time limit #30395

Closed
ghost opened this issue Dec 9, 2023 · 12 comments
Labels
bug 🐛 Programming errors, that need preferential fixing pid1 regression ⚠️ A bug in something that used to work correctly and broke through some recent commit
Milestone

Comments

@ghost

ghost commented Dec 9, 2023

systemd version the issue has been seen with

255-1

Used distribution

Arch

Linux kernel version used

6.6.4-arch1-1

CPU architectures issue was seen on

x86_64

Component

systemd

Expected behaviour you didn't see

System being able to initialize successfully.

Unexpected behaviour you saw

System fails to boot. I've tried to go into emergency or rescue mode, but that doesn't work: systemd first tries to mount a non-existent disk with no time limit, so emergency mode is inaccessible.
I have checked the crypttab and fstab files for the entry that systemd v255 tries to mount, but it's not there. lsblk and blkid don't even show the UUID of the disk that systemd tries to mount on boot.

For now, I'm using a fallback boot entry which uses busybox as the init system. I'm using v255 on another, newer system with a LUKS2-encrypted SSD partition, which the newest systemd can initialize with no problem. And since this is a regression, the same system boots successfully with systemd v254.

Tell me if there's anything I can do to help with debugging.
This is my first issue here, so I'd happily send any logs or debug output on request :)

Steps to reproduce the problem

I don't think it's reproducible for everyone, but in my case at least:

  • Have a LUKS2-encrypted HDD root partition
  • Add the required kernel parameters like rd.luks.name, etc. and modify mkinitcpio.conf with the relevant hooks like sd-encrypt
  • Try to start up the system with systemd v255
  • The boot process gets stuck forever, trying to mount a non-existent disk

Additional program output to the terminal or log subsystem illustrating the issue

No response

@ghost ghost added the bug 🐛 Programming errors, that need preferential fixing label Dec 9, 2023
@github-actions github-actions bot added the pid1 label Dec 9, 2023
@schultetwin

schultetwin commented Dec 9, 2023

I'm also experiencing this regression. I tried downgrading the kernel (to 6.6.1-arch1-1), with no luck.

Attached is a picture of what I saw when attempting to boot.

IMG_3412

$ sudo blkid
/dev/nvme0n1p3: UUID="a86ddce0-bdc2-4b1b-a326-d84170cfb1a4" TYPE="crypto_LUKS" PARTLABEL="cryptsystem" PARTUUID="c44307ed-7be8-4f74-a0ec-a81dcf5d57b4"
/dev/nvme0n1p1: LABEL_FATBOOT="EFI" LABEL="EFI" UUID="FB64-5466" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI" PARTUUID="033b02ad-b9d5-4513-9392-141eb195436e"
/dev/mapper/cryptsystem: LABEL="system" UUID="7540596c-554e-4227-861a-0838f22dab1e" UUID_SUB="e591011c-a1c9-4752-870b-db9113bc207d" BLOCK_SIZE="4096" TYPE="btrfs"
/dev/nvme0n1p2: PARTLABEL="cryptswap" PARTUUID="c6d7de4d-59fc-4dfc-8191-d4613851dd15"

^ That's from a boot after reverting to systemd 254.

@yuwata yuwata added the regression ⚠️ A bug in something that used to work correctly and broke through some recent commit label Dec 10, 2023
@yuwata yuwata added this to the v256 milestone Dec 10, 2023
@harrythezomby

Wow, this just happened to me as well... After a few hours of troubleshooting I finally fixed it by downgrading to 254. In my case I'm not using any sort of encryption, and my fstab looked good (after all, the disk it was trying to mount wasn't even in it). Happened on both the Arch 6.6.3 and 6.6.6 kernels.

@keszybz
Member

keszybz commented Dec 12, 2023

@YHNdnzj suggested that #30438 might fix this.

Do you have /sys/firmware/efi/efivars/HibernateLocation-8cf2644b-4b0b-428f-9387-6d876050dc67 present? Please paste that and /proc/cmdline.

@keszybz keszybz added the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Dec 12, 2023
@schultetwin

schultetwin commented Dec 12, 2023

Yep, I think I came to a similar conclusion while trying to troubleshoot.

$ sudo efivar -p --name 8cf2644b-4b0b-428f-9387-6d876050dc67-HibernateLocation
[sudo] password for mark:
GUID: 8cf2644b-4b0b-428f-9387-6d876050dc67
Name: "HibernateLocation"
Attributes:
	Non-Volatile
	Boot Service Access
	Runtime Service Access
Value:
00000000  7b 00 22 00 75 00 75 00  69 00 64 00 22 00 3a 00  |{.".u.u.i.d.".:.|
00000010  22 00 33 00 62 00 35 00  61 00 61 00 63 00 65 00  |".3.b.5.a.a.c.e.|
00000020  65 00 2d 00 64 00 36 00  66 00 30 00 2d 00 34 00  |e.-.d.6.f.0.-.4.|
00000030  33 00 64 00 33 00 2d 00  39 00 34 00 62 00 37 00  |3.d.3.-.9.4.b.7.|
00000040  2d 00 62 00 35 00 30 00  63 00 35 00 63 00 34 00  |-.b.5.0.c.5.c.4.|
00000050  65 00 30 00 31 00 37 00  63 00 22 00 2c 00 22 00  |e.0.1.7.c.".,.".|
00000060  6f 00 66 00 66 00 73 00  65 00 74 00 22 00 3a 00  |o.f.f.s.e.t.".:.|
00000070  30 00 2c 00 22 00 6b 00  65 00 72 00 6e 00 65 00  |0.,.".k.e.r.n.e.|
00000080  6c 00 56 00 65 00 72 00  73 00 69 00 6f 00 6e 00  |l.V.e.r.s.i.o.n.|
00000090  22 00 3a 00 22 00 36 00  2e 00 35 00 2e 00 34 00  |".:.".6...5...4.|
000000a0  2d 00 61 00 72 00 63 00  68 00 32 00 2d 00 31 00  |-.a.r.c.h.2.-.1.|
000000b0  22 00 2c 00 22 00 6f 00  73 00 52 00 65 00 6c 00  |".,.".o.s.R.e.l.|
000000c0  65 00 61 00 73 00 65 00  49 00 64 00 22 00 3a 00  |e.a.s.e.I.d.".:.|
000000d0  22 00 61 00 72 00 63 00  68 00 22 00 7d 00 00 00  |".a.r.c.h.".}...|

Adding noresume to my kernel options fixed it for me.
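
For anyone reading the dump above: the value appears to be UTF-16LE encoded JSON followed by a NUL terminator. Below is a minimal Python sketch for decoding it. The helper name `decode_hibernate_location` is made up for illustration and is not part of systemd. Note that `efivar -p` prints the value only, whereas a file read directly from efivarfs starts with a 4-byte attribute mask, which the sketch strips when asked to.

```python
import json


def decode_hibernate_location(raw: bytes, has_attr_prefix: bool = True) -> dict:
    """Decode a HibernateLocation payload into a dict (illustrative helper)."""
    if has_attr_prefix:
        # Files under /sys/firmware/efi/efivars start with a 4-byte
        # little-endian attribute mask; the variable's value follows it.
        raw = raw[4:]
    # The value itself is UTF-16LE JSON terminated by a NUL character.
    text = raw.decode("utf-16-le").rstrip("\x00")
    return json.loads(text)


# Payload rebuilt from the hex dump above (value only, as printed by efivar).
value = (
    '{"uuid":"3b5aacee-d6f0-43d3-94b7-b50c5c4e017c","offset":0,'
    '"kernelVersion":"6.5.4-arch2-1","osReleaseId":"arch"}'
).encode("utf-16-le") + b"\x00\x00"

record = decode_hibernate_location(value, has_attr_prefix=False)
print(record["uuid"], record["kernelVersion"])
```

The `uuid` field is the device the resume logic goes looking for, so comparing it against the blkid output is a quick way to see whether the record is stale.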

@YHNdnzj
Member

YHNdnzj commented Dec 12, 2023

So assuming your UEFI implementation is good (we do successfully erase the variable after resuming), this means that the swap used for the previous hibernation is missing? Does that ring a bell?

@YHNdnzj
Member

YHNdnzj commented Dec 12, 2023

@harrythezomby: Hmm, you removed the disk to which you hibernated? That's pretty much unsupported and would cause data corruption.

@YHNdnzj
Member

YHNdnzj commented Dec 12, 2023

@keszybz: thank you for replying to the thread for me

#30438 should mitigate this a bit, since we won't be waiting indefinitely anymore. But the original problem is more likely to come from your EFI implementation or user misconfiguration.
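
To illustrate the difference between waiting forever and a bounded wait, here is a generic Python sketch. It is not taken from #30438 and does not claim to mirror how systemd waits for devices; `wait_for_device`, its parameters, and the polling approach are all assumptions made for the example.

```python
import os
import time


def wait_for_device(path: str, timeout: float, poll_interval: float = 0.05) -> bool:
    """Return True once `path` exists, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            # Give up instead of blocking the boot indefinitely.
            return False
        time.sleep(poll_interval)


# A device node that is missing makes the call return False after the timeout.
found = wait_for_device(
    "/dev/disk/by-uuid/3b5aacee-d6f0-43d3-94b7-b50c5c4e017c", timeout=0.2
)
print("device present" if found else "gave up waiting")
```

With a deadline in place, a stale record costs a delay at boot rather than a hang.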

@ghost
Author

ghost commented Dec 12, 2023

@keszybz Sure! I actually have that file too.
The HibernateLocation-... file: {"uuid":"cc047a82-e175-426f-a6ac-3be92a617f47","offset":886784,"kernelVersion":"6.5.3-arch1-1","osReleaseId":"arch"}

cmdline file: rd.luks.name=67e719fb-eb19-43f7-81f5-99817ded6087=root root=/dev/mapper/root zswap.enabled=0 rw rootfstype=ext4 loglevel=3

The first file has some characters that I have trouble pasting in here, so here's the file if those characters are of importance (I changed the extension to txt since github won't allow uploading files without extension, but otherwise it's the same file):
HibernateLocation-8cf2644b-4b0b-428f-9387-6d876050dc67.txt

Edit: It's strange though. I haven't set up hibernation for this system, and I installed it fresh with a 6.6.4 kernel. I remember setting up hibernation a few months ago on a kernel around the version specified in the HibernateLocation file, but since then I have formatted the disk entirely several times and installed Arch from scratch. Is systemd picking up on deleted files or configurations on the disk that are not overwritten by newer files? And my config should be right, I think, since the previous versions of systemd seem to be working.

Adding noresume to kernel parameters fixes the issue for me.
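
A small Python sketch for checking which hibernation-related switches a kernel command line carries. The function name `resume_settings` is hypothetical, and the whitespace-based tokenising is a simplification that ignores quoted arguments; it only looks for the `noresume`, `resume=` and `resume_offset=` kernel parameters.

```python
def resume_settings(cmdline: str) -> dict:
    """Pick out hibernation-related switches from a kernel command line."""
    settings = {"noresume": False, "resume": None, "resume_offset": None}
    for token in cmdline.split():
        # Split at the first "=" only, so resume=UUID=... keeps its full value.
        key, _, value = token.partition("=")
        if key == "noresume":
            settings["noresume"] = True
        elif key == "resume":
            settings["resume"] = value
        elif key == "resume_offset":
            settings["resume_offset"] = value
    return settings


# The command line posted above, before and after adding the workaround.
cmdline = (
    "rd.luks.name=67e719fb-eb19-43f7-81f5-99817ded6087=root "
    "root=/dev/mapper/root zswap.enabled=0 rw rootfstype=ext4 loglevel=3"
)
print(resume_settings(cmdline))
print(resume_settings(cmdline + " noresume"))
```

On a running system the input would come from /proc/cmdline. The posted command line has no `resume=` entry at all, which fits the resume target having come from somewhere else.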

@YHNdnzj
Member

YHNdnzj commented Dec 12, 2023

> I haven't set up hibernation for this system, and I installed it fresh with a 6.6.4 kernel. I remember setting up hibernation a few months ago on a kernel around the version specified in the HibernateLocation file, but since then I have formatted the disk entirely several times and installed Arch from scratch. Is systemd picking up on deleted files or configurations on the disk that are not overwritten by newer files? And my config should be right, I think, since the previous versions of systemd seem to be working.

With systemd >= 254, the resume configuration is passed through the HibernateLocation EFI variable (as you are seeing here). So even if you haven't set resume= manually, hibernation should work automatically. And since it's stored in an EFI variable rather than on disk, it's installation-independent (I mean, the whole resume mechanism must be independent of the OS, i.e. passed by the boot loader or acquired from EFI and such).

However, we tuned the variable-clearing logic several times, so it's possible that an ancient record was never cleared or used by v254 but got picked up by the new logic in v255.

Please try to remove /sys/firmware/efi/efivars/HibernateLocation-8cf2644b-4b0b-428f-9387-6d876050dc67. Then, things should work again even without noresume.
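
Before removing the variable, it may be worth confirming that the record really is stale. The sketch below is a rough Python check, assuming the `uuid` field refers to the same UUIDs that udev exposes as symlinks under /dev/disk/by-uuid; `hibernate_target_exists` and its `by_uuid_dir` parameter are hypothetical names used only for this example.

```python
import json
import os


def hibernate_target_exists(record: dict, by_uuid_dir: str = "/dev/disk/by-uuid") -> bool:
    """Check whether the device named in a HibernateLocation record is present."""
    return os.path.exists(os.path.join(by_uuid_dir, record["uuid"]))


# The record posted earlier in this thread.
record = json.loads(
    '{"uuid":"cc047a82-e175-426f-a6ac-3be92a617f47","offset":886784,'
    '"kernelVersion":"6.5.3-arch1-1","osReleaseId":"arch"}'
)

if hibernate_target_exists(record):
    print("resume device is present")
else:
    print("no such device; the record looks stale")
```

If the device is missing and the recorded kernel version is older than the current install, the record is most likely a leftover.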

@YHNdnzj YHNdnzj removed the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Dec 12, 2023
@ghost
Author

ghost commented Dec 12, 2023

I see. That's interesting! Thank you for explaining it.
I'll do just that then.

Edit: To whoever is reading this because of running into the same issue: the HibernateLocation file is likely to be immutable by default, so you can't delete it right away. First clear the immutable attribute (sudo chattr -i /path/to/HibernateLocationFile), then remove the file with "sudo rm".

@YHNdnzj YHNdnzj added hibernate-resume and removed bug 🐛 Programming errors, that need preferential fixing pid1 regression ⚠️ A bug in something that used to work correctly and broke through some recent commit labels Dec 12, 2023
@YHNdnzj
Member

YHNdnzj commented Dec 12, 2023

I'll close this here. #30438 is enough to make this less significant, and once you're in the system you can remove the EFI variable through efivarfs.

@YHNdnzj YHNdnzj closed this as completed Dec 12, 2023
@keszybz keszybz added bug 🐛 Programming errors, that need preferential fixing pid1 regression ⚠️ A bug in something that used to work correctly and broke through some recent commit and removed hibernate-resume labels Dec 13, 2023
@YHNdnzj
Member

YHNdnzj commented Dec 13, 2023

@keszybz Hmm, at least the component involved is hibernate-resume rather than pid1. Why change the tag back?


5 participants