Kubelet fails to get rootfs information error for dir /var/lib/kubelet on RancherOS #9848

Closed
galal-hussein opened this issue Sep 8, 2017 · 9 comments
Labels: area/kubernetes, kind/bug

galal-hussein commented Sep 8, 2017

Rancher versions:
rancher/server: v1.6.8
kubernetes (if applicable): v1.7.4-rancher2

Docker version: (docker version,docker info preferred)
1.12.6
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
ROS 1.0.3 and 1.0.4
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
AWS
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
single node rancher
Environment Template: (Cattle/Kubernetes/Swarm/Mesos)
Kubernetes

cAdvisor fails to get metrics for pods, and the kubelet shows the following errors on ROS 1.0.3:

9/8/2017 1:49:04 PM E0908 10:49:04.189387    3602 container_manager_linux.go:543] [ContainerManager]: Fail to get rootfs information error trying to get filesystem Device for dir /var/lib/kubelet: err: could not find device with major: 0, minor: 67 in cached partitions map
9/8/2017 1:49:05 PM E0908 10:49:05.189684    3602 container_manager_linux.go:543] [ContainerManager]: Fail to get rootfs information error trying to get filesystem Device for dir /var/lib/kubelet: err: could not find device with major: 0, minor: 67 in cached partitions map
9/8/2017 1:49:05 PM E0908 10:49:05.888345    3602 kubelet.go:1737] Failed to check if disk space is available on the root partition: failed to get fs info for "root": error trying to get filesystem Device for dir /var/lib/kubelet: err: could not find device with major: 0, minor: 67 in cached partitions map
9/8/2017 1:49:06 PM E0908 10:49:06.189867    3602 container_manager_linux.go:543] [ContainerManager]: Fail to get rootfs information error trying to get filesystem Device for dir /var/lib/kubelet: err: could not find device with major: 0, minor: 67 in cached partitions map
9/8/2017 1:49:07 PM E0908 10:49:07.190220    3602 container_manager_linux.go:543] [ContainerManager]: Fail to get rootfs information error trying to get filesystem Device for dir /var/lib/kubelet: err: could not find device with major: 0, minor: 67 in cached partitions map
9/8/2017 1:49:08 PM E0908 10:49:08.190499    3602 container_manager_linux.go:543] [ContainerManager]: Fail to get rootfs information error trying to get filesystem Device for dir /var/lib/kubelet: err: could not find device with major: 0, minor: 67 in cached partitions map

Maybe related to kubernetes/kubernetes#44059
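
When comparing hosts it can help to pull the directory and device numbers out of these error lines. The following is a minimal sketch; the regular expression is inferred from the log lines quoted above rather than from the kubelet source, and the names `DEVICE_ERROR` and `parse_device_error` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the quoted log lines, not taken from the kubelet source.
DEVICE_ERROR = re.compile(
    r"for dir (?P<dir>\S+): err: could not find device with "
    r"major: (?P<major>\d+), minor: (?P<minor>\d+)"
)


def parse_device_error(line: str) -> Optional[Tuple[str, int, int]]:
    """Return (directory, major, minor) from a kubelet error line, or None."""
    match = DEVICE_ERROR.search(line)
    if match is None:
        return None
    return match.group("dir"), int(match.group("major")), int(match.group("minor"))
```

A major number of 0 means the directory is not backed by a block device. Linux hands out such anonymous device numbers to filesystems like overlayfs, and the minor number can differ from host to host.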

galal-hussein added the area/kubernetes and kind/bug labels on Sep 8, 2017
galal-hussein added this to the September 2017 milestone on Sep 8, 2017

galal-hussein commented Sep 11, 2017

The problem can be summarized in the following points:

  • cAdvisor constructs a partition map from all mounts in kubelet.
  • cAdvisor doesn't seem to support overlayfs mounts yet, and it also skips bind mounts.
  • /var/lib/kubelet and / are both mounted as overlay mounts. This is the case for every normal mount outside of /home and /opt in ROS, because the user Docker in ROS runs in a container and uses the overlayfs storage backend.
  • After adding some debugging output, I can see that kubelet tries to check the stats of /var/lib/kubelet (the kubelet root dir). "/" is picked up by the process-mounts function in cAdvisor through a catch-all function, addSystemRootLabel, but /var/lib/kubelet is skipped because it is considered an overlayfs mount:
Filesystem      Size  Used Avail Use% Mounted on
overlay          29G  6.4G   21G  24% /
.....
overlay          29G  6.4G   21G  24% /var/lib/kubelet
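
To make the failure mode concrete, here is a simplified sketch of the behavior described in the points above. It is not cAdvisor's actual code: the set of cached filesystem types, the sample mountinfo lines, and all function names are assumptions made for illustration. Only the `/proc/self/mountinfo` field layout and the wording of the error message come from real sources.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Mount(NamedTuple):
    major: int
    minor: int
    mount_point: str
    fs_type: str
    source: str


def parse_mountinfo_line(line: str) -> Mount:
    """Parse one line in the /proc/self/mountinfo format."""
    # Before " - ": mount id, parent id, major:minor, root, mount point, options...
    # After " - ": filesystem type, mount source, super options.
    left, right = line.split(" - ", 1)
    left_fields = left.split()
    right_fields = right.split()
    major, minor = left_fields[2].split(":")
    return Mount(int(major), int(minor), left_fields[4],
                 right_fields[0], right_fields[1])


# Assumption for this sketch: only these block-device-backed types are cached.
CACHED_FS_TYPES = {"ext3", "ext4", "xfs"}


def build_partition_map(mounts: Iterable[Mount]) -> Dict[Tuple[int, int], Mount]:
    """Build a (major, minor) -> Mount map, skipping overlay and bind mounts."""
    partitions: Dict[Tuple[int, int], Mount] = {}
    for mount in mounts:
        if mount.fs_type not in CACHED_FS_TYPES:
            continue  # overlay mounts are dropped here
        # Keep the first mount seen per device; later bind mounts are skipped.
        partitions.setdefault((mount.major, mount.minor), mount)
    return partitions


def find_device(partitions: Dict[Tuple[int, int], Mount],
                major: int, minor: int) -> Mount:
    """Look up a device, failing the same way the kubelet log reports."""
    try:
        return partitions[(major, minor)]
    except KeyError:
        raise LookupError(
            f"could not find device with major: {major}, minor: {minor} "
            "in cached partitions map"
        ) from None


# Hypothetical mountinfo lines, loosely modeled on the df output above.
SAMPLE_MOUNTINFO = [
    "21 20 8:1 / /opt rw,relatime - ext4 /dev/sda1 rw",
    "22 20 0:67 / /var/lib/kubelet rw,relatime - overlay overlay rw",
]
```

With this sample, `find_device(build_partition_map(map(parse_mountinfo_line, SAMPLE_MOUNTINFO)), 0, 67)` raises a `LookupError`, because the overlay mount never made it into the map, while the lookup for device 8:1 succeeds.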


galal-hussein commented Sep 14, 2017

A fix for this will be included in future RancherOS releases. As a workaround, users can run the following command to upgrade ROS from 1.0.3 with new volume parameters for kubelet:

ros os upgrade -i rancher/os:v1.0.4 --append "rancher.services.user-volumes.volumes=[/home:/home,/opt:/opt,/var/lib/kubelet:/var/lib/kubelet]"

Applying the workaround will fix the kubelet problem, and cadvisor will be able to collect metrics for pods correctly.
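
One way to check whether the workaround took effect is to look at which filesystem type backs /var/lib/kubelet. The following is a rough sketch that picks the longest matching mount point from mountinfo-formatted text. The function name `fs_type_for` and both sample texts are hypothetical, and the filesystem type you see after the workaround depends on the host's disk.

```python
import os


def fs_type_for(path: str, mountinfo_text: str) -> str:
    """Return the filesystem type of the mount that contains `path`."""
    path = os.path.normpath(path)
    best_point, best_type = "", ""
    for line in mountinfo_text.splitlines():
        if " - " not in line:
            continue  # not a mountinfo line
        left, right = line.split(" - ", 1)
        mount_point = left.split()[4]
        fs_type = right.split()[0]
        covers = (
            mount_point == "/"
            or path == mount_point
            or path.startswith(mount_point + "/")
        )
        # ">=" lets a later mount on the same mount point shadow an earlier one.
        if covers and len(mount_point) >= len(best_point):
            best_point, best_type = mount_point, fs_type
    if not best_type:
        raise LookupError(f"no mount found for {path}")
    return best_type


# Hypothetical mount tables before and after adding the user volume.
BEFORE = """\
20 1 0:66 / / rw,relatime - overlay overlay rw
21 20 8:1 / /opt rw,relatime - ext4 /dev/sda1 rw
"""
AFTER = BEFORE + "22 20 8:1 / /var/lib/kubelet rw,relatime - ext4 /dev/sda1 rw\n"
```

On a real host, passing the contents of /proc/self/mountinfo (read from inside the container that runs kubelet) as `mountinfo_text` should return something other than `overlay` for /var/lib/kubelet once the user volume is in place. Mount points containing spaces are escaped in that file and are not handled by this sketch.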

CC @SvenDowideit

@alialshamali

RancherOS version: 1.1.0
Kubernetes version: gcr.io/google_containers/hyperkube:v1.7.4
Docker version: 1.12.6
Type/provider of hosts: Baremetal
Setup details: Custom Service Kubelet and Manifests
Environment Template: Kubernetes

Hi, I tried the workaround with RancherOS 1.1.0 and I am still getting the same issue.

I confirmed that the User-Volumes have been appended:

sudo ros config get rancher.services.user-volumes.volumes
- /home:/home
- /opt:/opt
- /var/lib/kubelet:/var/lib/kubelet

I am still seeing the error:

[ContainerManager]: Fail to get rootfs information error trying to get filesystem Device for dir /var/lib/kubelet: err: could not find device with major: 0, minor: 34 in cached partitions map
kubelet.go:1737] Failed to check if disk space is available on the root partition: failed to get fs info for "root": error trying to get filesystem Device for dir /var/lib/kubelet: err: could not find device with major: 0, minor: 34 in cached partitions map


alialshamali commented Sep 26, 2017

Confirmed: it seems to be working with RancherOS 1.0.4 for some reason, but I am getting the same error with 1.1.0 with the workaround.

@alialshamali

Confirmed: after downgrading to 1.0.4 and then upgrading to 1.1.0, the workaround works.


jc1518 commented Oct 6, 2017

@shamalco Have you figured out why downgrading to 1.0.4 and then upgrading to 1.1.0 makes the workaround work?

@galal-hussein

@shamalco I am not able to reproduce the issue exactly. I was able to downgrade from 1.1.0 to 1.0.4, add /var/lib/kubelet as a user volume, and run Kubernetes. I can see cAdvisor running normally and can get pod metrics, and I can also see metrics being gathered and shown in Grafana.

@ScottEAdams

I also had to bounce down to 1.0.4 and back up to 1.1.0 with the workaround for this to work.


cyphrsonic commented Feb 20, 2018

Has there been any further investigation into this issue for RancherOS 1.1.0? I'm using Auto-scaling groups, and having my machines roll from 1.1.0 to 1.0.4 to 1.1.0 isn't practical for my situation.
