Unexpected Linux kernel version upgrade when install cuda driver in GPU VMs #13433

Open
yanchenzhang0304 opened this issue Apr 16, 2025 · 4 comments
Labels
3.0 PRs Destined for 3.0 AKS bug Something isn't working

Comments

@yanchenzhang0304

yanchenzhang0304 commented Apr 16, 2025

Describe the bug
We have an AKS 1.32.3 cluster running Azure Linux V3. After a node reboot, the nvidia-device-plugin pods on all GPU nodes went into CrashLoopBackOff, and we noticed that the GPU node's kernel had been upgraded from 6.6.78.1-3.azl3 to 6.6.82.1-1.azl3.
The AKS node log /var/log/azure/cluster-provision.log (see Screenshots) shows that installing cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64 unexpectedly pulled in kernel 6.6.82.1-1.azl3.
This cuda build is not compatible with kernel 6.6.82, so the GPU driver cannot be loaded and the nvidia-device-plugin pods crash.
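
The compatibility condition described above can be checked mechanically. The following is a minimal sketch, assuming the cuda package encodes its target kernel after the underscore in its version string (as in 560.35.03-1_6.6.78.1.3.azl3) with the kernel's "-" replaced by "."; the function names are hypothetical helpers, not part of any Azure Linux tooling.

```shell
# Extract the kernel part of a cuda version-release string,
# e.g. "560.35.03-1_6.6.78.1.3.azl3" -> "6.6.78.1.3.azl3".
cuda_kernel_suffix() {
    printf '%s\n' "${1#*_}"
}

# Convert a kernel release such as "6.6.78.1-3.azl3" to the dotted form
# that appears in the cuda package name ("6.6.78.1.3.azl3").
normalize_kernel() {
    printf '%s\n' "$1" | tr '-' '.'
}

# Exit status 0 when the cuda build ($1) targets the kernel release ($2).
cuda_matches_kernel() {
    [ "$(cuda_kernel_suffix "$1")" = "$(normalize_kernel "$2")" ]
}

# Example with the versions from this issue.
if cuda_matches_kernel "560.35.03-1_6.6.78.1.3.azl3" "6.6.82.1-1.azl3"; then
    echo "cuda build matches kernel"
else
    echo "cuda build does NOT match kernel"
fi
```

On a node, the second argument would typically be "$(uname -r)", assuming the running kernel reports its release in the same 6.6.78.1-3.azl3 form seen in the log.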

To Reproduce
Steps to reproduce the behavior:

  1. Create an AKS 1.32 cluster with AZLinux 3.0 and add a GPU node pool
  2. Reboot a node in the GPU node pool

Expected behavior
The Linux kernel version stays at 6.6.78.1-3.azl3.

Screenshots

+ dnf install -y cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64
Last metadata expiration check: 0:00:01 ago on Tue Apr 15 07:01:25 2025.
Dependencies resolved.
================================================================================================
 Package                   Arch    Version                     Repository                   Size
================================================================================================
Installing:
 cuda                      x86_64  560.35.03-1_6.6.78.1.3.azl3 azurelinux-official-nvidia  246 M
 kernel                    x86_64  6.6.82.1-1.azl3             azurelinux-official-base     41 M
Installing dependencies:
 kernel-drivers-gpu        x86_64  6.6.78.1-3.azl3             azurelinux-official-base    3.0 M
 libunwind                 x86_64  1.6.2-2.azl3                azurelinux-official-base     75 k
 mlnx-ofa_kernel           x86_64  24.10-13.azl3               azurelinux-official-base     44 k
 mlnx-ofa_kernel-modules   x86_64  24.10-13.azl3               azurelinux-official-base    1.6 M
 mlnx-tools                x86_64  24.10-1.azl3                azurelinux-official-base     80 k
 ofed-scripts              x86_64  24.10-1.azl3                azurelinux-official-base     71 k
 pciutils                  x86_64  3.11.1-1.azl3               azurelinux-official-base    449 k
 pciutils-libs             x86_64  3.11.1-1.azl3               azurelinux-official-base     57 k

Transaction Summary
================================================================================================
Install  10 Packages

Total download size: 293 M
Installed size: 792 M
Downloading Packages:
(1/10): kernel-drivers-gpu-6.6.78.1-3.azl3.x86_  11 MB/s | 3.0 MB     00:00    
(2/10): libunwind-1.6.2-2.azl3.x86_64.rpm       1.7 MB/s |  75 kB     00:00    
(3/10): mlnx-ofa_kernel-24.10-13.azl3.x86_64.rp 2.1 MB/s |  44 kB     00:00    
(4/10): mlnx-ofa_kernel-modules-24.10-13.azl3.x  36 MB/s | 1.6 MB     00:00    
(5/10): mlnx-tools-24.10-1.azl3.x86_64.rpm      2.3 MB/s |  80 kB     00:00    
(6/10): ofed-scripts-24.10-1.azl3.x86_64.rpm    2.0 MB/s |  71 kB     00:00    
(7/10): pciutils-3.11.1-1.azl3.x86_64.rpm        12 MB/s | 449 kB     00:00    
(8/10): pciutils-libs-3.11.1-1.azl3.x86_64.rpm  2.7 MB/s |  57 kB     00:00    
(9/10): kernel-6.6.82.1-1.azl3.x86_64.rpm        23 MB/s |  41 MB     00:01    
(10/10): cuda-560.35.03-1_6.6.78.1.3.azl3.x86_6  65 MB/s | 246 MB     00:03    
--------------------------------------------------------------------------------
Total                                            78 MB/s | 293 MB     00:03    
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1
  Installing       : ofed-scripts-24.10-1.azl3.x86_64                      1/10
  Running scriptlet: ofed-scripts-24.10-1.azl3.x86_64                      1/10
  Installing       : mlnx-tools-24.10-1.azl3.x86_64                        2/10
  Installing       : libunwind-1.6.2-2.azl3.x86_64                         3/10
  Installing       : kernel-6.6.82.1-1.azl3.x86_64                         4/10
  Running scriptlet: kernel-6.6.82.1-1.azl3.x86_64                         4/10
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.6.82.1-1.azl3
fgrep: warning: fgrep is obsolescent; using grep -F
Found linux image: /boot/vmlinuz-6.6.78.1-3.azl3
Found initrd image: /boot/initramfs-6.6.78.1-3.azl3.img
fgrep: warning: fgrep is obsolescent; using grep -F
Warning: os-prober will not be executed to detect other bootable partitions.
Systems on them will not be added to the GRUB boot configuration.
Check GRUB_DISABLE_OS_PROBER documentation entry.
Adding boot menu entry for UEFI Firmware Settings ...
done
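
The unexpected kernel row in the transaction table above could be caught before the transaction runs. The following is a rough sketch, assuming the table layout shown in this log (package name in the first column, version in the third); planned_kernel_versions and plans_kernel_change are hypothetical helper names.

```shell
# Read a dnf "Dependencies resolved" table on stdin and print the version
# of every row whose package name is exactly "kernel"
# (rows such as "kernel-drivers-gpu" are ignored).
planned_kernel_versions() {
    awk '$1 == "kernel" { print $3 }'
}

# Exit status 0 when the table plans a kernel other than the expected one ($1).
plans_kernel_change() {
    planned_kernel_versions | grep -v -x -F -q -- "$1"
}

# Example using two rows copied from the log above.
sample=' cuda                      x86_64  560.35.03-1_6.6.78.1.3.azl3 azurelinux-official-nvidia  246 M
 kernel                    x86_64  6.6.82.1-1.azl3             azurelinux-official-base     41 M'

if printf '%s\n' "$sample" | plans_kernel_change "6.6.78.1-3.azl3"; then
    echo "transaction would install a different kernel"
fi
```

In a provisioning script, the table could come from a dry run such as `dnf install --assumeno <package>`; this assumes the dry-run output uses the same table format as the log above.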

@yanchenzhang0304 yanchenzhang0304 added the bug Something isn't working label Apr 16, 2025
@smith1511

A couple of potential options here,

sudo dnf install \
  cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64 \
  --exclude=kernel\* \
  --skip-broken

Or use the versionlock plugin at VHD build time:

sudo dnf install -y dnf-plugins-core

# this pins your *running* kernel and its headers/modules
sudo dnf versionlock add kernel\*$(uname -r)

@yanchenzhang0304
Author

I believe this is an AzLinux RPM bug: a cuda package built for one kernel version shouldn't pull in a different kernel version.
Do you have an ETA for the fix?

@mfrw mfrw added AKS 3.0 PRs Destined for 3.0 labels Apr 17, 2025
@mfrw
Member

mfrw commented Apr 17, 2025

/cc @microsoft/cbl-mariner-kernel @christopherco

@christopherco
Contributor

Root Cause:
The cuda package in Azure Linux has a dependency list that specifies the compatible kernel but does not specify any version for mlnx-ofa_kernel. Consequently, when the cuda package is installed, it downloads the latest mlnx-ofa_kernel, which itself has a strict dependency on a specific kernel version (in this case kernel-6.6.82). This results in kernel-6.6.82 being downloaded to the system.

We are actively working on a fix for this issue.

Mitigations:
Since the issue is rooted in the currently installed kernel version, there are a couple of mitigation options:

  1. Perform an update prior to installing cuda to get the latest kernel on the system. This will sync up the version of the active kernel with the latest one published, and then pull the appropriate cuda package for this newer kernel.
  2. Alternatively, one could directly specify the mlnx-ofa_kernel and mlnx-ofa_kernel-modules packages with versions that correspond to the specific kernel installed already.
    For example, for kernel-6.6.78.1-3.azl3, specify mlnx-ofa_kernel-24.10-11.azl3 and mlnx-ofa_kernel-modules-24.10-11.azl3 before or during the cuda package install invocation.
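
The second mitigation can be sketched as follows. This is an illustration only, assuming the package versions quoted above are still available in the repository; pinned_cuda_packages and install_pinned_cuda are hypothetical helpers, and the script prints the command unless RUN=1 is set.

```shell
# Build the package list for a cuda install that pins both mlnx-ofa_kernel
# packages to one release.
# $1: mlnx-ofa_kernel version-release, $2: cuda package spec
pinned_cuda_packages() {
    printf '%s\n' "mlnx-ofa_kernel-$1" "mlnx-ofa_kernel-modules-$1" "$2"
}

# Dry run by default: print the dnf command. Set RUN=1 to execute it.
install_pinned_cuda() {
    # Package specs contain no whitespace, so word splitting is safe here.
    set -- $(pinned_cuda_packages "$1" "$2")
    if [ "${RUN:-0}" = "1" ]; then
        sudo dnf install -y "$@"
    else
        echo "dnf install -y $*"
    fi
}

# Versions taken from the comment above for kernel-6.6.78.1-3.azl3.
install_pinned_cuda "24.10-11.azl3" "cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64"
```

Listing the pinned mlnx packages in the same dnf invocation as cuda gives the resolver an explicit version to use, rather than letting it pick the latest mlnx-ofa_kernel on its own.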
