Unexpected Linux kernel version upgrade when installing the cuda driver in GPU VMs #13433
Comments
A couple of potential options here:
Or using the versionlock plugin during VHD build time (a rough sketch follows below).
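Here is a minimal sketch of what that could look like, assuming dnf and its versionlock plugin are usable during the VHD build or on the node; the plugin package name and the exact kernel version string below are assumptions, not verified Azure Linux names:

```python
# Hypothetical sketch: pin the current kernel with the dnf versionlock plugin so
# that installing the cuda package cannot drag in a newer kernel.
# Assumes dnf and the versionlock plugin are installable on Azure Linux 3; the
# plugin package name below is an assumption, not a verified AzLinux package name.
import subprocess

PINNED_KERNEL = "kernel-6.6.78.1-3.azl3"  # kernel observed before the unexpected upgrade


def pin_kernel(version: str) -> None:
    # Install the versionlock plugin, then lock the kernel package at `version`.
    subprocess.run(["dnf", "install", "-y", "python3-dnf-plugin-versionlock"], check=True)
    subprocess.run(["dnf", "versionlock", "add", version], check=True)


if __name__ == "__main__":
    pin_kernel(PINNED_KERNEL)
```

With the lock in place, a cuda RPM that genuinely requires a newer kernel would make the transaction fail instead of silently upgrading the kernel, which is easier to catch at VHD build time.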
I believe it's an AzLinux RPM bug, because a given version of the cuda package shouldn't depend on a different kernel version.
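One way to sanity-check that claim is to look at what the installed cuda RPM actually requires; a rough sketch (the rpm and uname invocations are standard, the filtering is just illustrative):

```python
# Hypothetical sketch: list the dependencies of the installed cuda package and
# print any that mention a kernel, so they can be compared with the running kernel.
import subprocess


def cuda_kernel_requires() -> list[str]:
    # `rpm -q --requires cuda` prints one dependency per line.
    result = subprocess.run(
        ["rpm", "-q", "--requires", "cuda"], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if "kernel" in line]


def running_kernel() -> str:
    return subprocess.run(
        ["uname", "-r"], capture_output=True, text=True, check=True
    ).stdout.strip()


if __name__ == "__main__":
    print("running kernel:", running_kernel())
    for dep in cuda_kernel_requires():
        print("cuda kernel dependency:", dep)
```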
/cc @microsoft/cbl-mariner-kernel @christopherco
Root Cause:
We are actively working on a fix for this issue.
Mitigations:
Describe the bug
We have an AKS 1.32.3 cluster with Azure Linux V3. After a node reboot, the nvidia-device-plugin pods on all GPU nodes went into CrashLoopBackOff, and we noticed the GPU node OS kernel had been upgraded from 6.6.78.1-3.azl3 to 6.6.82.1-1.azl3.
The AKS node's /var/log/azure/cluster-provision.log (see Screenshots) shows that installing cuda-0:560.35.03-1_6.6.78.1.3.azl3.x86_64 unexpectedly upgraded the kernel to 6.6.82.1-1.azl3.
That cuda version is not compatible with kernel 6.6.82, so the GPU driver cannot be loaded and the nvidia-device-plugin pods crash.
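For reference, this is roughly the check that can be run on an affected node to confirm the mismatch; the /lib/modules layout and the nvidia.ko module name are assumptions based on the usual out-of-tree driver layout:

```python
# Hypothetical sketch: check whether an nvidia kernel module exists for the kernel
# that is actually running, which is what the GPU driver (and therefore
# nvidia-device-plugin) ultimately depends on.
import subprocess
from pathlib import Path


def running_kernel() -> str:
    return subprocess.run(
        ["uname", "-r"], capture_output=True, text=True, check=True
    ).stdout.strip()


def nvidia_module_present(kernel: str) -> bool:
    # Out-of-tree modules normally live under /lib/modules/<kernel>/; the exact
    # layout on Azure Linux may differ.
    module_dir = Path("/lib/modules") / kernel
    return module_dir.is_dir() and any(module_dir.rglob("nvidia.ko*"))


if __name__ == "__main__":
    kernel = running_kernel()
    if nvidia_module_present(kernel):
        print(f"nvidia module found for running kernel {kernel}")
    else:
        print(f"no nvidia module for kernel {kernel}; the GPU driver cannot load")
```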
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Keep the Linux kernel at version 6.6.78.1-3.azl3.
Screenshots