This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

"sharing" the rkt data directory across a shared bindmount breaks rkt rm #3181

euank opened this issue Sep 9, 2016 · 13 comments

@euank
Member

euank commented Sep 9, 2016

This was uncovered while @pbx0 was working on kubelet-wrapper + rktnetes.
I'm reviving the discussion since I'd like to have a definitive answer as to whether creating shared bindmounts of the rkt data-directory should be supported or not in light of containerized-kubelet + rktlet.

I think this is something we should try to support for the "rkt in rkt, but with a shared data directory" sort of case (e.g. one pod that wants to get information about other pods).

This issue is related to #1922 and #1856.

The issue can be easily reproduced by doing the following:

# mkdir -p /var/lib/rkt2
# mount --rbind --make-shared /var/lib/rkt /var/lib/rkt2
# rkt run --uuid-file-save=/tmp/uuid --insecure-options=image docker://busybox
...
# rkt rm --uuid-file=/tmp/uuid
networking: teardown - executing net-plugin ptp
rm: unable to remove pod "d45d7728-b3d6-44e5-a95f-92fe34111b3f": remove /var/lib/rkt/pods/exited-garbage/d45d7728-b3d6-44e5-a95f-92fe34111b3f/stage1/rootfs: device or resource busy
# mount | grep $(cat /tmp/uuid)
overlay on /var/lib/rkt2/pods/exited-garbage/d45d7728-b3d6-44e5-a95f-92fe34111b3f/stage1/rootfs type overlay .....

The reason this breaks is that rkt rm and rkt gc explicitly do a remount-unshare of each mount under a pod. I think this was added because of some issue with systemd-udevd holding mounts, but I forget the details, sorry!
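
For illustration, the pattern in question looks roughly like the following. This is a minimal Go sketch of a "remount private, then unmount" pass over a pod's mounts, not rkt's actual MountGC code; the pod path is a placeholder and the mountinfo scan is an assumption about how the mounts get enumerated.

// Minimal sketch (not rkt's MountGC) of making every mount under a pod
// directory private and then detaching it.
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"

	"golang.org/x/sys/unix"
)

// mountsUnder lists mountpoints below dir by scanning /proc/self/mountinfo.
func mountsUnder(dir string) ([]string, error) {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var mounts []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		// Field 5 (index 4) is the mount point.
		if len(fields) > 4 && strings.HasPrefix(fields[4], dir) {
			mounts = append(mounts, fields[4])
		}
	}
	return mounts, s.Err()
}

func main() {
	podDir := "/var/lib/rkt/pods/exited-garbage/<uuid>" // placeholder pod path

	mounts, err := mountsUnder(podDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Tear down the innermost mounts first.
	sort.Slice(mounts, func(i, j int) bool { return len(mounts[i]) > len(mounts[j]) })

	for _, m := range mounts {
		// Make the mount private so the unmount below does not propagate
		// to any peer mounts...
		_ = unix.Mount("", m, "", unix.MS_PRIVATE|unix.MS_REC, "")
		// ...then detach it. With a shared bindmount of the data directory,
		// the peer copy under e.g. /var/lib/rkt2 keeps the same dentry busy,
		// so the later rmdir of the pod directory fails with EBUSY.
		_ = unix.Unmount(m, unix.MNT_DETACH)
	}
}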

Making this work could be useful for containerized-kubelet / kubelet-wrapper and will impact some design choices.

@pbx0 and @yifan-gu have some context on this as well.

cc @lucab @s-urbaniak, this relates to rktlet: depending on whether this could or should be supported, we might have to break out of any container we're in for all rkt CLI calls.

@alban
Member

alban commented Sep 12, 2016

The "remount unshare" (MS_PRIVATE) is done in MountGC. It was done initially for rkt fly.

I can reproduce the error "device or resource busy" without rkt with the following:

mkdir /var/lib/foo /var/lib/foo2
mount --rbind --make-shared /var/lib/foo /var/lib/foo2
cd /var/lib/foo

mkdir -p rootfs
mount -t tmpfs none rootfs
mkdir -p rootfs/rootfs
mount -t tmpfs none rootfs/rootfs

mount --make-private rootfs/rootfs
mount --make-private rootfs
umount rootfs/rootfs
umount rootfs
mountpoint rootfs # rootfs is not a mountpoint

rmdir rootfs/ # Device or resource busy

/cc @steveej

@alban
Member

alban commented Sep 12, 2016

@euank You're right, it is the same issue as the thing with systemd-udevd again:

$ cat /proc/self/mountinfo |grep /var/lib/foo/
$ cat /proc/$(pidof systemd-udevd)/mountinfo |grep /var/lib/foo/
242 45 0:49 / /var/lib/foo/rootfs rw,relatime master:87 - tmpfs none rw,seclabel
265 242 0:62 / /var/lib/foo/rootfs/rootfs rw,relatime master:94 - tmpfs none rw,seclabel

Not only systemd-udevd, but any service that is started with MountFlags=slave or equivalent:

$ for i in $(grep /var/lib/foo/ /proc/*/mountinfo|cut -d/ -f3) ; do cat /proc/$i/comm ; done|uniq
bluetoothd
NetworkManager
dhclient
colord
systemd-udevd

@peebs
Contributor

peebs commented Sep 13, 2016

Just to add, this is the current workaround we add to our rktnetes configuration in coreos-kubernetes: coreos/coreos-kubernetes@78acfe9

Any extra configuration steps required just make rktnetes more fragile to install for people not using that repo's install scripts.

@alban
Member

alban commented Sep 14, 2016

My previous comment was wrong: this is not related to systemd-udevd, bluetoothd, or the others. I tested again without services like systemd-udevd that "capture" the mount, and the same issue still happens.

This is caused by the way vfs_rmdir() and __is_local_mountpoint() are implemented: rmdir is allowed only when the dentry is not used as a mountpoint in the same mount namespace.

  • the mountpoints in the systemd-udevd mount namespace are not relevant because only the mounts in the same mount namespace are checked (after Linux v3.18 with commit torvalds/linux@8ed936b and others).
  • __is_local_mountpoint() checks the dentry and not the struct path. This means the mount containing the path being deleted is not taken into account. So we cannot delete the path /var/lib/foo/rootfs because, although it is not used as a mount point at this path, the same dentry is used as a mount point under /var/lib/foo2/rootfs.

It might be possible to fix this in the kernel by checking the path in addition to the dentry, but it's not easy because __is_local_mountpoint() does not have access to the struct path, and it would involve lots of changes in vfs (rmdir, unlink, rename), p9, nfsd...

@euank
Member Author

euank commented Sep 28, 2016

@alban what's the reason we needed to remount with the MS_PRIVATE bit in MountGC?
If we can get away without that, it seems like the simplest fix; and if it was only needed for some specific case in rkt fly, maybe we could detect needsRemount better and fix it that way.

@squeed
Contributor

squeed commented Oct 5, 2016

IIRC, the MS_PRIVATE remount is so we don't do something silly like unmount /proc when tearing down a rkt fly pod.

@euank
Member Author

euank commented Dec 10, 2016

Pinging this issue since I was reminded of it over here coreos/coreos-kubernetes#768 (comment) and it still has real impact.

I think we should definitely make non-fly stage1s not unshare mounts during rm (and that's an easy / not very scary fix).
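
A rough sketch of what that gating could look like; the flavor check and helper below are hypothetical and illustrative, not rkt's actual code.

// Hypothetical sketch of gating the unshare step on the stage1 flavor;
// the names here are illustrative, not from the rkt codebase.
package main

import "golang.org/x/sys/unix"

// needsRemountPrivate reports whether mounts under a pod should be made
// private before unmounting. Per the comments above, the MS_PRIVATE remount
// matters for the fly flavor, where an unmount that propagates could take
// down host mounts such as /proc.
func needsRemountPrivate(stage1Flavor string) bool {
	return stage1Flavor == "fly"
}

// cleanupMount tears down a single mount under a pod directory.
func cleanupMount(mountpoint, stage1Flavor string) error {
	if needsRemountPrivate(stage1Flavor) {
		if err := unix.Mount("", mountpoint, "", unix.MS_PRIVATE|unix.MS_REC, ""); err != nil {
			return err
		}
	}
	return unix.Unmount(mountpoint, unix.MNT_DETACH)
}

func main() {
	// Example (requires root):
	//   cleanupMount("/var/lib/rkt/pods/exited-garbage/<uuid>/stage1/rootfs", "coreos")
}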

@bassam

bassam commented Jan 3, 2017

Is there an update on this one? I still see it happening even with the nsenter workaround. My kubelet log is full of:

Jan 03 16:58:42 c1 kubelet-wrapper[1569]: stderr: rm: unable to remove pod "97a1eab5-0822-475b-a06e-32d47a4c3bc9": remove /var/lib/rkt/pods/exited-garbage/97a1eab5-0822-475b-a06e-32d47a4c3bc9/stage1/rootfs: device or resource busy
Jan 03 16:58:42 c1 kubelet-wrapper[1569]: , rkt: Failed to clean up rkt pod "b6d875ab-bd04-40c5-8080-eae43b09b50e": rkt: Failed to remove pod "b6d875ab-bd04-40c5-8080-eae43b09b50e": failed to run [rm b6d875ab-bd04-40c5-8080-eae43b09b50e]: exit status 254
Jan 03 16:58:42 c1 kubelet-wrapper[1569]: stdout:

and rkt list is unhappy:

core@c1 ~ $ rkt list
UUID		APP			IMAGE NAME									STATE	CREATED		STARTED		NETWORKS
022c32e8	kube-controller-manager	quay.io/coreos/hyperkube:v1.5.1_coreos.0					running	9 minutes ago	9 minutes ago	
11be175f	checkpoint		quay.io/coreos/pod-checkpointer:b4f0353cc12d95737628b8815625cc8e5cedb6fc	running	9 minutes ago	9 minutes ago	
313cd2f0	checkpoint-installer	quay.io/coreos/pod-checkpointer:b4f0353cc12d95737628b8815625cc8e5cedb6fc	running	9 minutes ago	9 minutes ago	
44e32c00	hyperkube		quay.io/coreos/hyperkube:v1.5.1_coreos.0					running	10 minutes ago	10 minutes ago	
499433c7	flannel			quay.io/coreos/flannel:v0.6.2							exited	11 minutes ago	11 minutes ago	
7b3afaa9	kube-apiserver		quay.io/coreos/hyperkube:v1.5.1_coreos.0					running	9 minutes ago	9 minutes ago	
87b0a719	kube-proxy		quay.io/coreos/hyperkube:v1.5.1_coreos.0					running	10 minutes ago	10 minutes ago	
bd5b28d7	kube-scheduler		quay.io/coreos/hyperkube:v1.5.1_coreos.0					running	9 minutes ago	9 minutes ago	
ed867da4	flannel			quay.io/coreos/flannel:v0.6.2							running	11 minutes ago	11 minutes ago	
list: 15 error(s) encountered when listing pods:
list: ----------------------------------------
list: Unable to read pod 02186a27-ae1d-47f2-b42f-446a9d68facd manifest:
  error reading pod manifest
list: ----------------------------------------
list: Unable to read pod 35d69a29-eaec-41da-9ff1-c4d0990c3082 manifest:
  error reading pod manifest
list: ----------------------------------------
list: Unable to read pod 3fcd45aa-cd40-44ad-9f85-5cc5986abc40 manifest:
  error reading pod manifest

bassam added a commit to bassam/bootkube that referenced this issue Jan 3, 2017
@lucab lucab modified the milestones: v1.23.0, v1.22.0 Jan 5, 2017
whereisaaron added a commit to whereisaaron/kube-aws that referenced this issue Jan 10, 2017
rkt has a gnarly bug (rkt/rkt#3181) that won't be fixed in a hurry (rkt/rkt#3486). It leads to continuous task failures that eventually totally wreck worker nodes (kubernetes-retired#244). In the meantime we can use docker just as easily for this simple task. This workaround was discussed in kubernetes-retired#199.
@kfirufk

kfirufk commented Jan 14, 2017

@s-urbaniak s-urbaniak modified the milestones: v1.24.0, v1.23.0 Jan 19, 2017
@s-urbaniak
Contributor

@kfirufk Do you mind rechecking with rkt v1.22.0? We landed #3486, which should resolve this issue.

@euank
Member Author

euank commented Jan 19, 2017

@s-urbaniak unfortunately the base issue here (shared bindmounts + unsharing before umount in gc) isn't fixed by v1.22.0. It's possible some of the related problems are fixed, but if it's from kubelet wrapper / rkt fly / shared bindmounts, it's less likely to be resolved.

@lucab lucab modified the milestones: v1.25.0, v1.24.0 Feb 2, 2017
@lucab
Member

lucab commented Feb 10, 2017

The comments above are spot on: the cleanup part should be much improved after #3486 (barring the mounts that are allowed to escape due to back-propagation). The remaining mess due to shared propagation is probably better tracked at #3465.

@lucab lucab modified the milestones: v1+, v1.25.0 Feb 17, 2017
@rbjorklin

Still seeing this problem while trying to use the work-around listed above together with:

$ rkt version
rkt Version: 1.25.0
appc Version: 0.8.10
Go Version: go1.7.5
Go OS/Arch: linux/amd64
Features: -TPM +SDJOURNAL

Any updates?

kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this issue Mar 27, 2018