Bug: kernel installs produce unbootable BLS entries after package removal on migrated systems #17

@rocketman-code

Current Behavior

After atomic-rollback is uninstalled from a system that has had migrate
applied, any subsequent kernel install via dnf or kernel-install produces a
BLS entry with paths that GRUB cannot resolve. The kernel package installs
successfully, but on next reboot GRUB fails to load it and drops to the GRUB
menu with the new kernel marked unbootable.

The standard Fedora kernel-install pipeline (90-loaderentry.install
invoking grub2-mkrelpath) generates linux and initrd BLS fields of the
form /root/boot/vmlinuz-X and /root/boot/initramfs-X.img for new kernels
on a migrated layout. The 90-atomic-rollback.install kernel-install hook
normally rewrites these paths via fix_bls_paths in src/kernel_hook.rs,
but the hook is removed when the package is removed, so the rewrite never
happens for kernels installed after that point.

GRUB error on reboot:

Booting `Fedora Linux (X.Y.Z-200.fc43.x86_64) 43 (Cloud Edition)'

error: ../../grub-core/fs/btrfs.c:2153:file
`/root/boot/vmlinuz-X.Y.Z-200.fc43.x86_64'
not found.
error: ../../grub-core/loader/i386/efi/linux.c:260:you need to load the
kernel first.

Press any key to continue...

Failed to boot both default and fallback entries.

Expected Behavior

A system that has had migrate applied should produce bootable BLS entries
for future kernel installs. The migration permanently changes the on-disk
layout (consumes /boot, creates symlinks at the root subvolume, updates ESP
grub.cfg, rewrites existing BLS entries). After such a permanent change,
subsequent kernel installs should not produce broken boot configurations,
regardless of whether the package that performed the migration is still
installed.

Context

Old kernels installed before package removal continue to boot correctly
because their BLS entries were rewritten at install time when the hook was
still active. The first failure surfaces only when a new kernel is installed
after removing the package. The failure is not visible until reboot, and at
that point the new kernel is the default, GRUB fails to load it, and only
manual selection of the previous kernel from the GRUB menu allows recovery.

The system is not bricked: selecting an older kernel from the GRUB menu
recovers it. However, every subsequent kernel install fails the same way
until either atomic-rollback is reinstalled and a kernel install is
retriggered, or the layout is manually un-migrated.

Reproduced on a fresh Fedora 43 cloud image with atomic-rollback v0.3.7-1.fc43
installed from the COPR.

Technical Details

Reproduction

On a fresh Fedora 43 system with the default layout (separate ext4 /boot,
btrfs root with subvolumes):

  1. sudo dnf copr enable -y rocketman-code/atomic-rollback

  2. sudo dnf install -y atomic-rollback

  3. sudo atomic-rollback setup

  4. sudo atomic-rollback migrate

  5. Confirm migration: sudo atomic-rollback check (should pass), and
    confirm /etc/fstab shows #MIGRATED: for /boot.

  6. sudo dnf remove -y atomic-rollback

  7. Confirm the kernel-install hook is gone:
    ls /usr/lib/kernel/install.d/ | grep atomic (should be empty).

  8. Install any kernel that is not already installed, for example from koji:
    sudo dnf install -y https://kojipkgs.fedoraproject.org/packages/kernel/<v>/200.fc43/x86_64/kernel-{,core-,modules-,modules-core-}<v>-200.fc43.x86_64.rpm

  9. Inspect the new BLS entry:
    sudo cat /boot/loader/entries/*<v>*.conf | grep -E '^linux|^initrd'

    The output will show paths beginning with /root/boot/:

    linux /root/boot/vmlinuz-<v>-200.fc43.x86_64
    initrd /root/boot/initramfs-<v>-200.fc43.x86_64.img
    
  10. Reboot. GRUB will report file '/root/boot/vmlinuz-<v>...' not found
    and refuse to load the new kernel. Older kernels remain selectable from
    the GRUB menu as a fallback.
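
For anyone scripting the check in step 9, the grep can be mirrored by a small parser. This is only a sketch: `boot_paths` is a hypothetical helper, not part of atomic-rollback, and it assumes each BLS entry line is a key followed by whitespace and a value.

```rust
/// Hypothetical helper (not from atomic-rollback): return the `linux` and
/// `initrd` fields of a BLS entry as (key, value) pairs, in file order.
/// Assumes each line is "key<whitespace>value"; other keys are ignored.
fn boot_paths(entry: &str) -> Vec<(String, String)> {
    entry
        .lines()
        .filter_map(|line| {
            // Split the key from the rest of the line at the first whitespace.
            let (key, value) = line.trim().split_once(char::is_whitespace)?;
            match key {
                "linux" | "initrd" => Some((key.to_string(), value.trim().to_string())),
                _ => None,
            }
        })
        .collect()
}
```

Fed the contents of the new entry from step 9, this should return the two `/root/boot/` paths shown there.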

Relevant Code

In src/kernel_hook.rs, the has_bad_path closure inside fix_bls_paths
explicitly matches the pattern that the standard Fedora kernel-install
hook generates on a migrated layout:

let has_bad_path = |v: &str| -> bool {
    v.contains("/root/boot/") || v.contains("/boot/vmlinuz-") || v.contains("/boot/initramfs-")
};

The rewrite that this closure gates is the only mechanism preventing the
standard kernel-install pipeline from producing unbootable BLS entries on a
migrated system. It runs only when kernel-install add invokes
90-atomic-rollback.install, which is installed by the atomic-rollback RPM
and removed when that RPM is uninstalled.
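
To see what the predicate flags, it can be lifted into a standalone function and run against the values from the reproduction. This is a sketch only: the predicate body is copied from the closure above, while `flagged` is a hypothetical helper added for illustration. The rewrite that fix_bls_paths applies is not reproduced here.

```rust
/// Same test as the `has_bad_path` closure in src/kernel_hook.rs,
/// written as a free function so it can be called on its own.
fn has_bad_path(v: &str) -> bool {
    v.contains("/root/boot/") || v.contains("/boot/vmlinuz-") || v.contains("/boot/initramfs-")
}

/// Hypothetical helper: keep only the field values the predicate flags.
fn flagged<'a>(values: &[&'a str]) -> Vec<&'a str> {
    values.iter().copied().filter(|v| has_bad_path(v)).collect()
}
```

Both paths from step 9 of the reproduction match, so an entry written by the default pipeline on a migrated layout is exactly what the hook would have picked up.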

Root Cause

The migration writes a permanent change to the on-disk layout. The
maintenance code that the migration depends on for ongoing correctness is
owned by a removable package. That asymmetry is the root of the problem:
removing the package removes the maintenance code, but the layout it was
maintaining remains in place. New kernel installs then run through the
default Fedora kernel-install pipeline, which generates paths appropriate
for the unmigrated layout, against a filesystem that is no longer in that
layout.
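
The asymmetry can be stated as a check: the layout is migrated, but the hook that maintains it is gone. The sketch below is hypothetical and not part of atomic-rollback. It assumes the migration marker is any /etc/fstab line containing `#MIGRATED:` (as in step 5) and that the hook is identified by the file name 90-atomic-rollback.install (as in step 7).

```rust
/// Hypothetical check (not from atomic-rollback): true when the fstab text
/// carries the `#MIGRATED:` marker but the kernel-install plugin
/// `90-atomic-rollback.install` is absent from the given file names.
/// Both the marker format and the name match are assumptions.
fn layout_is_orphaned(fstab: &str, install_d_names: &[&str]) -> bool {
    // Migrated layout: some fstab line contains the marker.
    let migrated = fstab.lines().any(|l| l.contains("#MIGRATED:"));
    // Maintenance hook: present only while the package is installed.
    let hook_present = install_d_names
        .iter()
        .any(|n| *n == "90-atomic-rollback.install");
    migrated && !hook_present
}
```

A caller would pass the contents of /etc/fstab and the file names from /usr/lib/kernel/install.d/. A `true` result describes the state reached after step 6 of the reproduction.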
