This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

"sharing" the rkt data directory across a shared bindmount breaks rkt rm #3181

euank opened this issue Sep 9, 2016 · 13 comments

@euank
Member

euank commented Sep 9, 2016

This was uncovered while @pbx0 was working on kubelet-wrapper + rktnetes.
I'm reviving the discussion since I'd like to have a definitive answer as to whether creating shared bindmounts of the rkt data-directory should be supported or not in light of containerized-kubelet + rktlet.

I think this is something we should try to support for the "rkt in rkt, but with a shared data directory" sort of case (e.g. one pod that wants to get information about other pods).

This issue is related to #1922 and #1856.

The issue can be easily reproduced by doing the following:

# mkdir -p /var/lib/rkt2
# mount --rbind --make-shared /var/lib/rkt /var/lib/rkt2
# rkt run --uuid-file-save=/tmp/uuid --insecure-options=image docker://busybox
...
# rkt rm --uuid-file=/tmp/uuid
networking: teardown - executing net-plugin ptp
rm: unable to remove pod "d45d7728-b3d6-44e5-a95f-92fe34111b3f": remove /var/lib/rkt/pods/exited-garbage/d45d7728-b3d6-44e5-a95f-92fe34111b3f/stage1/rootfs: device or resource busy
# mount | grep $(cat /tmp/uuid)
overlay on /var/lib/rkt2/pods/exited-garbage/d45d7728-b3d6-44e5-a95f-92fe34111b3f/stage1/rootfs type overlay .....

The reason this breaks is that rkt rm and rkt gc explicitly do a remount-unshare of each mount under a pod. I think this was added because of some issue with systemd-udevd holding mounts, but I forget the details, sorry!
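
For illustration, the pattern in question looks roughly like the following. This is a minimal Go sketch of a "remount private, then unmount" pass over a pod's mounts, not rkt's actual MountGC code; the pod path is a placeholder and the mountinfo scan is an assumption about how the mounts get enumerated.

// Minimal sketch (not rkt's MountGC) of making every mount under a pod
// directory private and then detaching it.
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"

	"golang.org/x/sys/unix"
)

// mountsUnder lists mountpoints below dir by scanning /proc/self/mountinfo.
func mountsUnder(dir string) ([]string, error) {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var mounts []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		// Field 5 (index 4) is the mount point.
		if len(fields) > 4 && strings.HasPrefix(fields[4], dir) {
			mounts = append(mounts, fields[4])
		}
	}
	return mounts, s.Err()
}

func main() {
	podDir := "/var/lib/rkt/pods/exited-garbage/<uuid>" // placeholder pod path

	mounts, err := mountsUnder(podDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Tear down the innermost mounts first.
	sort.Slice(mounts, func(i, j int) bool { return len(mounts[i]) > len(mounts[j]) })

	for _, m := range mounts {
		// Make the mount private so the unmount below does not propagate
		// to any peer mounts...
		_ = unix.Mount("", m, "", unix.MS_PRIVATE|unix.MS_REC, "")
		// ...then detach it. With a shared bindmount of the data directory,
		// the peer copy under e.g. /var/lib/rkt2 keeps the same dentry busy,
		// so the later rmdir of the pod directory fails with EBUSY.
		_ = unix.Unmount(m, unix.MNT_DETACH)
	}
}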

Making this work could be useful for containerized-kubelet / kubelet-wrapper and will impact some design choices.

@pbx0 and @yifan-gu have some context on this as well.

cc @lucab @s-urbaniak, this relates to rktlet: depending on whether this could or should be supported, we might have to break out of any container we're in for all rkt CLI calls.

@alban
Member

alban commented Sep 12, 2016

The "remount unshare" (MS_PRIVATE) is done in MountGC. It was done initially for rkt fly.

I can reproduce the error "device or resource busy" without rkt with the following:

mkdir /var/lib/foo /var/lib/foo2
mount --rbind --make-shared /var/lib/foo /var/lib/foo2
cd /var/lib/foo

mkdir -p rootfs
mount -t tmpfs none rootfs
mkdir -p rootfs/rootfs
mount -t tmpfs none rootfs/rootfs

mount --make-private rootfs/rootfs
mount --make-private rootfs
umount rootfs/rootfs
umount rootfs
mountpoint rootfs # rootfs is not a mountpoint

rmdir rootfs/ # Device or resource busy

/cc @steveej

@alban
Member

alban commented Sep 12, 2016

@euank You're right, it is the same issue as the thing with systemd-udevd again:

$ cat /proc/self/mountinfo |grep /var/lib/foo/
$ cat /proc/$(pidof systemd-udevd)/mountinfo |grep /var/lib/foo/
242 45 0:49 / /var/lib/foo/rootfs rw,relatime master:87 - tmpfs none rw,seclabel
265 242 0:62 / /var/lib/foo/rootfs/rootfs rw,relatime master:94 - tmpfs none rw,seclabel

Not only systemd-udevd, but any service that is started with MountFlags=slave or equivalent:

$ for i in $(grep /var/lib/foo/ /proc/*/mountinfo|cut -d/ -f3) ; do cat /proc/$i/comm ; done|uniq
bluetoothd
NetworkManager
dhclient
colord
systemd-udevd

@peebs
Contributor

peebs commented Sep 13, 2016

Just to add, this is the current workaround we add to our rktnetes configuration in coreos-kubernetes: coreos/coreos-kubernetes@78acfe9

Any extra configuration steps required just make rktnetes more fragile to install for people not using that repo's install scripts.

@alban
Member

alban commented Sep 14, 2016

My previous comment was wrong: this is not related to systemd-udevd, bluetoothd, or the others. I tested again without services like systemd-udevd that "capture" the mount, and the same issue still happens.

This is caused by the way vfs_rmdir() and __is_local_mountpoint() are implemented: rmdir is allowed only when the dentry is not used as a mountpoint in the same mount namespace.

  • the mountpoints in the systemd-udevd mount namespace are not relevant because only the mounts in the same mount namespace are checked (after Linux v3.18 with commit torvalds/linux@8ed936b and others).
  • __is_local_mountpoint() checks the dentry and not the struct path. This means the mount containing the path being deleted is not taken into account. So we cannot delete the path /var/lib/foo/rootfs because, although it is not used as a mount point at this path, the same dentry is used as a mount point under /var/lib/foo2/rootfs.

It might be possible to fix this in the kernel by checking the path in addition to the dentry, but it's not easy because __is_local_mountpoint() does not have access to the struct path, and it would involve lots of changes in vfs (rmdir, unlink, rename), p9, nfsd...

@euank
Member Author

euank commented Sep 28, 2016

@alban what's the reason we needed to remount with the MS_PRIVATE bit in MountGC?
If we can get away without that, it seems like the simplest fix; and if it was only needed for some specific case in rkt fly, maybe we could detect needsRemount better and fix it that way.

@squeed
Contributor

squeed commented Oct 5, 2016

IIRC, the MS_PRIVATE remount is so we don't do something silly like unmount /proc when tearing down a rkt fly pod.

@euank
Member Author

euank commented Dec 10, 2016

Pinging this issue since I was reminded of it over here coreos/coreos-kubernetes#768 (comment) and it still has real impact.

I think we should definitely make non-fly stage1s not unshare mounts during rm (and that's an easy / not very scary fix).
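
A rough sketch of what that gating could look like; the flavor check and helper below are hypothetical and illustrative, not rkt's actual code.

// Hypothetical sketch of gating the unshare step on the stage1 flavor;
// the names here are illustrative, not from the rkt codebase.
package main

import "golang.org/x/sys/unix"

// needsRemountPrivate reports whether mounts under a pod should be made
// private before unmounting. Per the comments above, the MS_PRIVATE remount
// matters for the fly flavor, where an unmount that propagates could take
// down host mounts such as /proc.
func needsRemountPrivate(stage1Flavor string) bool {
	return stage1Flavor == "fly"
}

// cleanupMount tears down a single mount under a pod directory.
func cleanupMount(mountpoint, stage1Flavor string) error {
	if needsRemountPrivate(stage1Flavor) {
		if err := unix.Mount("", mountpoint, "", unix.MS_PRIVATE|unix.MS_REC, ""); err != nil {
			return err
		}
	}
	return unix.Unmount(mountpoint, unix.MNT_DETACH)
}

func main() {
	// Example (requires root):
	//   cleanupMount("/var/lib/rkt/pods/exited-garbage/<uuid>/stage1/rootfs", "coreos")
}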

@bassam

bassam commented Jan 3, 2017

Is there an update on this one? I still see it happening even with the nsenter workaround. My kubelet log is full of:

Jan 03 16:58:42 c1 kubelet-wrapper[1569]: stderr: rm: unable to remove pod "97a1eab5-0822-475b-a06e-32d47a4c3bc9": remove /var/lib/rkt/pods/exited-garbage/97a1eab5-0822-475b-a06e-32d47a4c3bc9/stage1/rootfs: device or resource busy
Jan 03 16:58:42 c1 kubelet-wrapper[1569]: , rkt: Failed to clean up rkt pod "b6d875ab-bd04-40c5-8080-eae43b09b50e": rkt: Failed to remove pod "b6d875ab-bd04-40c5-8080-eae43b09b50e": failed to run [rm b6d875ab-bd04-40c5-8080-eae43b09b50e]: exit status 254
Jan 03 16:58:42 c1 kubelet-wrapper[1569]: stdout:

and rkt list is unhappy:

core@c1 ~ $ rkt list
UUID		APP			IMAGE NAME									STATE	CREATED		STARTED		NETWORKS
022c32e8	kube-controller-manager	quay.io/coreos/hyperkube:v1.5.1_coreos.0					running	9 minutes ago	9 minutes ago	
11be175f	checkpoint		quay.io/coreos/pod-checkpointer:b4f0353cc12d95737628b8815625cc8e5cedb6fc	running	9 minutes ago	9 minutes ago	
313cd2f0	checkpoint-installer	quay.io/coreos/pod-checkpointer:b4f0353cc12d95737628b8815625cc8e5cedb6fc	running	9 minutes ago	9 minutes ago	
44e32c00	hyperkube		quay.io/coreos/hyperkube:v1.5.1_coreos.0					running	10 minutes ago	10 minutes ago	
499433c7	flannel			quay.io/coreos/flannel:v0.6.2							exited	11 minutes ago	11 minutes ago	
7b3afaa9	kube-apiserver		quay.io/coreos/hyperkube:v1.5.1_coreos.0					running	9 minutes ago	9 minutes ago	
87b0a719	kube-proxy		quay.io/coreos/hyperkube:v1.5.1_coreos.0					running	10 minutes ago	10 minutes ago	
bd5b28d7	kube-scheduler		quay.io/coreos/hyperkube:v1.5.1_coreos.0					running	9 minutes ago	9 minutes ago	
ed867da4	flannel			quay.io/coreos/flannel:v0.6.2							running	11 minutes ago	11 minutes ago	
list: 15 error(s) encountered when listing pods:
list: ----------------------------------------
list: Unable to read pod 02186a27-ae1d-47f2-b42f-446a9d68facd manifest:
  error reading pod manifest
list: ----------------------------------------
list: Unable to read pod 35d69a29-eaec-41da-9ff1-c4d0990c3082 manifest:
  error reading pod manifest
list: ----------------------------------------
list: Unable to read pod 3fcd45aa-cd40-44ad-9f85-5cc5986abc40 manifest:
  error reading pod manifest

bassam added a commit to bassam/bootkube that referenced this issue Jan 3, 2017
@lucab lucab modified the milestones: v1.23.0, v1.22.0 Jan 5, 2017
whereisaaron added a commit to whereisaaron/kube-aws that referenced this issue Jan 10, 2017
rkt has a gnarly bug (rkt/rkt#3181) that won't be fixed in a hurry (rkt/rkt#3486). It leads to continuous task failures that eventually totally wreck worker nodes (kubernetes-retired#244). In the meantime we can use docker just as easily for this simple task. This workaround was discussed in kubernetes-retired#199.
@kfirufk

kfirufk commented Jan 14, 2017

@s-urbaniak s-urbaniak modified the milestones: v1.24.0, v1.23.0 Jan 19, 2017
@s-urbaniak
Contributor

@kfirufk Do you mind rechecking with rkt v1.22.0? We landed #3486, which should resolve this issue.

@euank
Member Author

euank commented Jan 19, 2017

@s-urbaniak unfortunately the base issue here (shared bindmounts + unsharing before umount in gc) isn't fixed by v1.22.0. It's possible some of the related problems are fixed, but if it's from kubelet wrapper / rkt fly / shared bindmounts, it's less likely to be resolved.

@lucab lucab modified the milestones: v1.25.0, v1.24.0 Feb 2, 2017
@lucab
Member

lucab commented Feb 10, 2017

The comments above are spot on: the cleanup part should be much improved after #3486 (barring the mounts that are allowed to escape due to back-propagation). The remaining mess due to shared propagation is probably better tracked at #3465.

@lucab lucab modified the milestones: v1+, v1.25.0 Feb 17, 2017
@rbjorklin

Still seeing this problem while trying to use the work-around listed above together with:

$ rkt version
rkt Version: 1.25.0
appc Version: 0.8.10
Go Version: go1.7.5
Go OS/Arch: linux/amd64
Features: -TPM +SDJOURNAL

Any updates?

kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this issue Mar 27, 2018