kubelet.yaml set cgroupDriver: systemd instead of cgroupDriver: cgroupfs in AL2-GPU instances #3005
Callisto13 added the priority/critical label ("Should be investigated as soon as possible") on Dec 31, 2020.
A temporary solution is to add the following lines to the ClusterConfig yaml (only in GPU node groups):
After some digging here is what's going on:
So, what can be done?
(Note: Docker 20.01 has the cgroupDriver set to systemd by default, so this problem may be solved in future k8s versions.)
What happened?
I launched an unmanaged node group with a p3.2xlarge GPU instance (ami-0f23f1b20f58cc97f), but the node failed to start.
Error message:
failed to run Kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
cat /etc/eksctl/kubelet.yaml
shows cgroupDriver: systemd
However, I suspect it should be cgroupDriver: cgroupfs
The Docker cgroup driver in Amazon Linux 2 (GPU) is set to "cgroupfs" (vs. "systemd" in non-GPU versions).
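The mismatch described above can be checked directly on a node. The following is a minimal sketch, not something taken from eksctl: the helper names `kubelet_cgroup_driver` and `compare_cgroup_drivers` are hypothetical, and it assumes `cgroupDriver` appears as a top-level key in the kubelet config file.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of eksctl.

# Print the value of the top-level cgroupDriver key in a kubelet config file.
# $1: path to the config, e.g. /etc/eksctl/kubelet.yaml
kubelet_cgroup_driver() {
    sed -n 's/^cgroupDriver:[[:space:]]*//p' "$1" | tr -d '"' | head -n 1
}

# Compare the kubelet driver ($1) with the Docker driver ($2).
# Prints "match" and returns 0, or prints a mismatch line and returns 1.
compare_cgroup_drivers() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: kubelet=$1 docker=$2"
        return 1
    fi
}

# On the node itself (needs Docker running and the eksctl-written config):
# compare_cgroup_drivers \
#     "$(kubelet_cgroup_driver /etc/eksctl/kubelet.yaml)" \
#     "$(docker info --format '{{.CgroupDriver}}')"
```

On the GPU AMI from this report, the commented-out call would be expected to report a mismatch, since the kubelet config says systemd while Docker uses cgroupfs.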
How to reproduce it?
Launch a GPU node group via eksctl v0.35.0.
Anything else we need to know?
What OS are you using, are you using a downloaded binary or did you compile eksctl, what type of AWS credentials are you using (i.e. default/named profile, MFA) - please don't include actual credentials though!
Versions
Additional info
I also tried setting an older GPU AMI, "ami-0969f51a73874a795" (and also leaving the AMI unset), with the same disappointing result.
When I manually changed
/etc/systemd/system/kubelet.service.d/10-eksclt.al2.conf
to include
--cgroup-driver=cgroupfs
and restarted the service, the node registered successfully with my cluster.
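As a rough sketch of a related manual fix: instead of adding the flag to the systemd drop-in as described above, the snippet below rewrites the `cgroupDriver` key in the kubelet config file. This is a swapped-in variant that has not been verified on these AMIs. The function name `set_kubelet_cgroup_driver` is hypothetical, and it assumes the config already contains a top-level `cgroupDriver` line.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of eksctl.

# Rewrite the top-level cgroupDriver key in a kubelet config file.
# $1: path to the config, e.g. /etc/eksctl/kubelet.yaml
# $2: desired driver, e.g. cgroupfs
set_kubelet_cgroup_driver() {
    tmp_file="$(mktemp)" || return 1
    # Write the edited copy to a temp file, then copy it back in place
    # so the original file's ownership and permissions are preserved.
    sed "s/^cgroupDriver:.*/cgroupDriver: $2/" "$1" > "$tmp_file" \
        && cat "$tmp_file" > "$1"
    status=$?
    rm -f "$tmp_file"
    return "$status"
}

# On the node (as root), followed by a kubelet restart:
# set_kubelet_cgroup_driver /etc/eksctl/kubelet.yaml cgroupfs
# systemctl daemon-reload
# systemctl restart kubelet
```

Either approach is a per-node stopgap; new nodes launched from the same node group would still come up with the original configuration.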