Describe the bug
We have an AKS 1.32.3 cluster running Azure Linux V3. After a node reboot, the nvidia-device-plugin pods on all GPU nodes go into CrashLoopBackOff. We noticed that the GPU node's kernel version was upgraded from 6.6.78.1-3.azl3 to 6.6.82.1-1.azl3 after the reboot.
The AKS node's /var/log/azure/cluster-provision.log (see Screenshots) shows that installing cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64 also installed kernel 6.6.82.1-1.azl3, which unexpectedly upgraded the kernel.
This cuda build is not compatible with kernel 6.6.82, so the GPU driver cannot be loaded and the nvidia-device-plugin pods crash.
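The mismatch can be checked with a small shell sketch. It assumes, based only on the package names in this report, that the cuda package release embeds the kernel version it was built for, with the `-` replaced by `.` (6.6.78.1-3 becomes 6.6.78.1.3). The function names are hypothetical helpers and not part of any AKS or Azure Linux tooling.

```shell
#!/bin/sh
# Hypothetical helpers; the naming convention is inferred from this issue only.

# Extract the kernel version embedded in a cuda package name, e.g.
# cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64 -> 6.6.78.1.3
cuda_kernel_version() {
    printf '%s\n' "$1" | sed -n 's/.*_\([0-9][0-9.]*\)\.azl[0-9]*.*/\1/p'
}

# Normalize a kernel version such as 6.6.82.1-1.azl3 to the same dotted
# form (6.6.82.1.1) so the two strings can be compared directly.
normalize_kernel_version() {
    printf '%s\n' "$1" | sed -e 's/\.azl[0-9]*.*$//' -e 's/-/./g'
}

# Succeeds (exit 0) only when the cuda package targets the given kernel.
kernel_matches_cuda() {
    [ "$(cuda_kernel_version "$1")" = "$(normalize_kernel_version "$2")" ]
}

CUDA_PKG="cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64"

# Compare against the two kernel versions seen in this issue.
for KERNEL in "6.6.78.1-3.azl3" "6.6.82.1-1.azl3"; do
    if kernel_matches_cuda "$CUDA_PKG" "$KERNEL"; then
        echo "$KERNEL: matches cuda package"
    else
        echo "$KERNEL: does NOT match cuda package"
    fi
done
```

On a node, the second argument could be `"$(uname -r)"` to compare against the running kernel.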
To Reproduce
Steps to reproduce the behavior:
- Create an AKS 1.32 cluster with Azure Linux 3.0 and add a GPU node pool
- Reboot a node in the GPU node pool
Expected behavior
The Linux kernel version stays at 6.6.78.1-3.azl3 after the reboot.
Screenshots
+ dnf install -y cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64
Last metadata expiration check: 0:00:01 ago on Tue Apr 15 07:01:25 2025.
Dependencies resolved.
================================================================================================
Package Arch Version Repository Size
================================================================================================
Installing:
cuda x86_64 560.35.03-1_6.6.78.1.3.azl3 azurelinux-official-nvidia 246 M
kernel x86_64 6.6.82.1-1.azl3 azurelinux-official-base 41 M
Installing dependencies:
kernel-drivers-gpu x86_64 6.6.78.1-3.azl3 azurelinux-official-base 3.0 M
libunwind x86_64 1.6.2-2.azl3 azurelinux-official-base 75 k
mlnx-ofa_kernel x86_64 24.10-13.azl3 azurelinux-official-base 44 k
mlnx-ofa_kernel-modules x86_64 24.10-13.azl3 azurelinux-official-base 1.6 M
mlnx-tools x86_64 24.10-1.azl3 azurelinux-official-base 80 k
ofed-scripts x86_64 24.10-1.azl3 azurelinux-official-base 71 k
pciutils x86_64 3.11.1-1.azl3 azurelinux-official-base 449 k
pciutils-libs x86_64 3.11.1-1.azl3 azurelinux-official-base 57 k
Transaction Summary
================================================================================================
Install 10 Packages
Total download size: 293 M
Installed size: 792 M
Downloading Packages:
(1/10): kernel-drivers-gpu-6.6.78.1-3.azl3.x86_ 11 MB/s | 3.0 MB 00:00
(2/10): libunwind-1.6.2-2.azl3.x86_64.rpm 1.7 MB/s | 75 kB 00:00
(3/10): mlnx-ofa_kernel-24.10-13.azl3.x86_64.rp 2.1 MB/s | 44 kB 00:00
(4/10): mlnx-ofa_kernel-modules-24.10-13.azl3.x 36 MB/s | 1.6 MB 00:00
(5/10): mlnx-tools-24.10-1.azl3.x86_64.rpm 2.3 MB/s | 80 kB 00:00
(6/10): ofed-scripts-24.10-1.azl3.x86_64.rpm 2.0 MB/s | 71 kB 00:00
(7/10): pciutils-3.11.1-1.azl3.x86_64.rpm 12 MB/s | 449 kB 00:00
(8/10): pciutils-libs-3.11.1-1.azl3.x86_64.rpm 2.7 MB/s | 57 kB 00:00
(9/10): kernel-6.6.82.1-1.azl3.x86_64.rpm 23 MB/s | 41 MB 00:01
(10/10): cuda-560.35.03-1_6.6.78.1.3.azl3.x86_6 65 MB/s | 246 MB 00:03
--------------------------------------------------------------------------------
Total 78 MB/s | 293 MB 00:03
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : ofed-scripts-24.10-1.azl3.x86_64 1/10
Running scriptlet: ofed-scripts-24.10-1.azl3.x86_64 1/10
Installing : mlnx-tools-24.10-1.azl3.x86_64 2/10
Installing : libunwind-1.6.2-2.azl3.x86_64 3/10
Installing : kernel-6.6.82.1-1.azl3.x86_64 4/10
Running scriptlet: kernel-6.6.82.1-1.azl3.x86_64 4/10
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.6.82.1-1.azl3
fgrep: warning: fgrep is obsolescent; using grep -F
Found linux image: /boot/vmlinuz-6.6.78.1-3.azl3
Found initrd image: /boot/initramfs-6.6.78.1-3.azl3.img
fgrep: warning: fgrep is obsolescent; using grep -F
Warning: os-prober will not be executed to detect other bootable partitions.
Systems on them will not be added to the GRUB boot configuration.
Check GRUB_DISABLE_OS_PROBER documentation entry.
Adding boot menu entry for UEFI Firmware Settings ...
done