
Unexpected Linux kernel version upgrade when installing the CUDA driver in GPU VMs #13433

Open
@yanchenzhang0304

Describe the bug
We have an AKS 1.32.3 cluster running Azure Linux V3. After a node reboot, the nvidia-device-plugin pods on all GPU nodes went into CrashLoopBackOff, and we noticed that the GPU node kernel had been upgraded from 6.6.78.1-3.azl3 to 6.6.82.1-1.azl3.
The AKS node log /var/log/azure/cluster-provision.log (see Screenshots) shows that installing cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64 unexpectedly pulled in kernel 6.6.82.1-1.azl3.
This CUDA package is not compatible with kernel 6.6.82, so the GPU driver cannot be loaded and the nvidia-device-plugin pods crash.
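The mismatch described above can be checked mechanically. The following is a minimal sketch, not an official tool. It assumes (inferred only from the version strings in this report) that the part of the cuda package version after the underscore encodes the target kernel version with the `-` replaced by `.`, and that the distro tag has the form `.azlN`. The function names are hypothetical.

```python
import re

# Assumption: distro tags look like ".azl3" and sit at the end of the version.
_DIST_TAG = re.compile(r"\.azl\d+$")


def kernel_from_cuda_version(cuda_version: str) -> str:
    """Return the kernel version embedded after the underscore.

    '560.35.03-1_6.6.78.1.3.azl3' -> '6.6.78.1.3'
    """
    _, _, suffix = cuda_version.partition("_")
    if not suffix:
        raise ValueError(f"no kernel suffix in {cuda_version!r}")
    return _DIST_TAG.sub("", suffix)


def normalize_kernel(kernel_version: str) -> str:
    """Rewrite a kernel package version into the dotted form used by cuda.

    '6.6.78.1-3.azl3' -> '6.6.78.1.3'
    """
    return _DIST_TAG.sub("", kernel_version).replace("-", ".")


def cuda_matches_kernel(cuda_version: str, kernel_version: str) -> bool:
    """True when the cuda package was built for the given kernel version."""
    return kernel_from_cuda_version(cuda_version) == normalize_kernel(kernel_version)
```

With the versions from this report, `cuda_matches_kernel("560.35.03-1_6.6.78.1.3.azl3", "6.6.78.1-3.azl3")` is true, while the same cuda version against `"6.6.82.1-1.azl3"` is false.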

To Reproduce
Steps to reproduce the behavior:

  1. Create an AKS 1.32 cluster with Azure Linux 3.0 and add a GPU node pool
  2. Reboot a node in the GPU node pool

Expected behavior
The Linux kernel version stays at 6.6.78.1-3.azl3.

Screenshots

+ dnf install -y cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64
Last metadata expiration check: 0:00:01 ago on Tue Apr 15 07:01:25 2025.
Dependencies resolved.
================================================================================================
 Package                   Arch    Version                     Repository                   Size
================================================================================================
Installing:
 cuda                      x86_64  560.35.03-1_6.6.78.1.3.azl3 azurelinux-official-nvidia  246 M
 kernel                    x86_64  6.6.82.1-1.azl3             azurelinux-official-base     41 M
Installing dependencies:
 kernel-drivers-gpu        x86_64  6.6.78.1-3.azl3             azurelinux-official-base    3.0 M
 libunwind                 x86_64  1.6.2-2.azl3                azurelinux-official-base     75 k
 mlnx-ofa_kernel           x86_64  24.10-13.azl3               azurelinux-official-base     44 k
 mlnx-ofa_kernel-modules   x86_64  24.10-13.azl3               azurelinux-official-base    1.6 M
 mlnx-tools                x86_64  24.10-1.azl3                azurelinux-official-base     80 k
 ofed-scripts              x86_64  24.10-1.azl3                azurelinux-official-base     71 k
 pciutils                  x86_64  3.11.1-1.azl3               azurelinux-official-base    449 k
 pciutils-libs             x86_64  3.11.1-1.azl3               azurelinux-official-base     57 k

Transaction Summary
================================================================================================
Install  10 Packages

Total download size: 293 M
Installed size: 792 M
Downloading Packages:
(1/10): kernel-drivers-gpu-6.6.78.1-3.azl3.x86_  11 MB/s | 3.0 MB     00:00    
(2/10): libunwind-1.6.2-2.azl3.x86_64.rpm       1.7 MB/s |  75 kB     00:00    
(3/10): mlnx-ofa_kernel-24.10-13.azl3.x86_64.rp 2.1 MB/s |  44 kB     00:00    
(4/10): mlnx-ofa_kernel-modules-24.10-13.azl3.x  36 MB/s | 1.6 MB     00:00    
(5/10): mlnx-tools-24.10-1.azl3.x86_64.rpm      2.3 MB/s |  80 kB     00:00    
(6/10): ofed-scripts-24.10-1.azl3.x86_64.rpm    2.0 MB/s |  71 kB     00:00    
(7/10): pciutils-3.11.1-1.azl3.x86_64.rpm        12 MB/s | 449 kB     00:00    
(8/10): pciutils-libs-3.11.1-1.azl3.x86_64.rpm  2.7 MB/s |  57 kB     00:00    
(9/10): kernel-6.6.82.1-1.azl3.x86_64.rpm        23 MB/s |  41 MB     00:01    
(10/10): cuda-560.35.03-1_6.6.78.1.3.azl3.x86_6  65 MB/s | 246 MB     00:03    
--------------------------------------------------------------------------------
Total                                            78 MB/s | 293 MB     00:03    
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1
  Installing       : ofed-scripts-24.10-1.azl3.x86_64                      1/10
  Running scriptlet: ofed-scripts-24.10-1.azl3.x86_64                      1/10
  Installing       : mlnx-tools-24.10-1.azl3.x86_64                        2/10
  Installing       : libunwind-1.6.2-2.azl3.x86_64                         3/10
  Installing       : kernel-6.6.82.1-1.azl3.x86_64                         4/10
  Running scriptlet: kernel-6.6.82.1-1.azl3.x86_64                         4/10
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.6.82.1-1.azl3
fgrep: warning: fgrep is obsolescent; using grep -F
Found linux image: /boot/vmlinuz-6.6.78.1-3.azl3
Found initrd image: /boot/initramfs-6.6.78.1-3.azl3.img
fgrep: warning: fgrep is obsolescent; using grep -F
Warning: os-prober will not be executed to detect other bootable partitions.
Systems on them will not be added to the GRUB boot configuration.
Check GRUB_DISABLE_OS_PROBER documentation entry.
Adding boot menu entry for UEFI Firmware Settings ...
done
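The log above shows `kernel` resolved to 6.6.82.1-1.azl3 while `kernel-drivers-gpu` resolved to 6.6.78.1-3.azl3 in the same transaction. As a rough sketch, that skew could be detected by parsing the "Dependencies resolved" table from cluster-provision.log. The regular expression only assumes the five-column layout seen in this log, and the helper names are hypothetical.

```python
import re

# One table row: name, arch, version, repository, size (e.g. "246 M", "75 k").
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<arch>\S+)\s+(?P<version>\S+)\s+"
    r"(?P<repo>\S+)\s+(?P<size>\d+(?:\.\d+)?\s+[kMG])\s*$"
)

# Rows copied from the log in this issue.
SAMPLE = """\
 Package                   Arch    Version                     Repository                   Size
Installing:
 cuda                      x86_64  560.35.03-1_6.6.78.1.3.azl3 azurelinux-official-nvidia  246 M
 kernel                    x86_64  6.6.82.1-1.azl3             azurelinux-official-base     41 M
Installing dependencies:
 kernel-drivers-gpu        x86_64  6.6.78.1-3.azl3             azurelinux-official-base    3.0 M
"""


def parse_transaction(text: str) -> dict:
    """Map package name -> version for every table row found in the text."""
    packages = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            packages[match.group("name")] = match.group("version")
    return packages


def kernel_skew(packages: dict):
    """Return (kernel, kernel-drivers-gpu) versions if both exist and differ."""
    kernel = packages.get("kernel")
    gpu_drivers = packages.get("kernel-drivers-gpu")
    if kernel and gpu_drivers and kernel != gpu_drivers:
        return kernel, gpu_drivers
    return None


if __name__ == "__main__":
    print(kernel_skew(parse_transaction(SAMPLE)))
```

The header row and section lines such as `Installing:` are skipped because they do not end in a size field, so only real package rows reach the comparison.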


Labels: 3.0, AKS, bug
