Pods crash on GPU nodes with Ubuntu 22 #73

Open
zonca opened this issue Feb 9, 2024 · 1 comment

Comments

@zonca
Owner

zonca commented Feb 9, 2024

GPU nodes work fine on Ubuntu 20. On Ubuntu 22, however, all system pods fail intermittently. After a node is rebooted, they appear to work for a few minutes and then crash again.

NAMESPACE     NAME                                           READY   STATUS             RESTARTS         AGE   IP             NODE                       NOMINATED NODE   READINESS GATES
kube-system   coredns-588bb58b94-8jdjw                       0/1     CrashLoopBackOff   12 (10s ago)     51m   10.233.65.49   kubejetstream-k8s-node-1   <none>           <none>
kube-system   csi-cinder-controllerplugin-648ffdc6db-88b2v   0/6     CrashLoopBackOff   72 (61s ago)     50m   10.233.65.47   kubejetstream-k8s-node-1   <none>           <none>
kube-system   csi-cinder-nodeplugin-tccts                    0/3     CrashLoopBackOff   37 (23s ago)     50m   10.0.74.64     kubejetstream-k8s-node-1   <none>           <none>
kube-system   kube-flannel-x85zd                             1/1     Running            13 (3m53s ago)   51m   10.0.74.64     kubejetstream-k8s-node-1   <none>           <none>
kube-system   kube-proxy-hq9nf                               0/1     CrashLoopBackOff   13 (36s ago)     52m   10.0.74.64     kubejetstream-k8s-node-1   <none>           <none>
kube-system   nginx-proxy-kubejetstream-k8s-node-1           1/1     Running            14 (6m39s ago)   51m   10.0.74.64     kubejetstream-k8s-node-1   <none>           <none>
kube-system   nodelocaldns-kn4r6                             0/1     CrashLoopBackOff   12 (2m25s ago)   51m   10.0.74.64     kubejetstream-k8s-node-1   <none>           <none>
kube-system   nvidia-device-plugin-daemonset-h7khg           0/1     CrashLoopBackOff   5 (2m14s ago)    11m   10.233.65.46   kubejetstream-k8s-node-1   <none>           <none>
kube-system   snapshot-controller-7d445c66c9-v9z66           0/1     CrashLoopBackOff   11 (4m40s ago)   50m   10.233.65.45   kubejetstream-k8s-node-1   <none>           <none>
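
For triage, a listing like the one above can be summarized with a short script. This is only a minimal sketch, assuming the text is `kubectl get pods` table output whose columns are separated by two or more spaces; the function names (`parse_pods`, `restart_count`, `crashing`) are made up for illustration, and the sample is trimmed to the first six columns of three rows from the listing.

```python
import re

# Three rows from the listing above, trimmed to the first six columns.
# Assumption: columns are separated by at least two spaces.
SAMPLE = """\
NAMESPACE     NAME                                   READY   STATUS             RESTARTS         AGE
kube-system   coredns-588bb58b94-8jdjw               0/1     CrashLoopBackOff   12 (10s ago)     51m
kube-system   kube-flannel-x85zd                     1/1     Running            13 (3m53s ago)   51m
kube-system   nvidia-device-plugin-daemonset-h7khg   0/1     CrashLoopBackOff   5 (2m14s ago)    11m
"""


def parse_pods(text):
    """Parse table text into a list of dicts keyed by the column headers."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Split on runs of 2+ spaces so values like "12 (10s ago)" stay intact.
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for ln in lines[1:]:
        fields = re.split(r"\s{2,}", ln.strip())
        rows.append(dict(zip(header, fields)))
    return rows


def restart_count(pod):
    """Return the leading integer of the RESTARTS column."""
    return int(pod["RESTARTS"].split()[0])


def crashing(pods):
    """Return the names of pods whose STATUS is CrashLoopBackOff."""
    return [p["NAME"] for p in pods if p["STATUS"] == "CrashLoopBackOff"]


if __name__ == "__main__":
    pods = parse_pods(SAMPLE)
    for p in pods:
        print(p["NAME"], p["STATUS"], restart_count(p))
```

Looking at restart counts rather than only at STATUS is useful here: in the listing, kube-flannel and nginx-proxy are reported as Running but have 13 and 14 restarts, which is consistent with pods that work for a few minutes and then crash.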

See minimal debugging performed here: zonca/jetstream_kubespray#29 (comment)

@zonca
Owner Author

zonca commented Mar 9, 2024

Tested a CPU-only deployment with Ubuntu 22 nodes and it worked fine, so the problem seems to be specific to GPU nodes.
