Cgroupv2 - cgroup.subtree_control is empty in the root cgroup and can't be populated #126
Labels:
area: nsjail (Related to NsJail and its configuration)
priority: 1 - high
status: planning (Discussing details)
type: bug (Something isn't working)

By default, Docker uses a private cgroup namespace when the host system uses cgroupv2. This results in the root cgroup within the container having an empty `cgroup.subtree_control`, which means the child cgroups NsJail creates will not have any controllers enabled. Attempting to write to the root `cgroup.subtree_control` results in a "device or resource busy" error, which seems to be because the cgroup already has processes in it (it's the root cgroup, after all). However, the exact cause for this error has not been confirmed.

`docker run` has a `--cgroupns` option which can be set to `host` to use the host's cgroup namespace instead of a private one. This works around the empty `cgroup.subtree_control` but at the cost of not having a private namespace. This is in fact the default behaviour when cgroupv1 is used. Another downside is that this option cannot be configured in the Docker Compose file (compose-spec/compose-spec#148). To still use Docker Compose, this setting would need to be set globally via the `default-cgroupns-mode` Docker daemon option. Otherwise, the container would have to be started with `docker run` instead.

I looked into the `--cgroup-parent` option as well. I created a cgroup `/sys/fs/cgroup/NSJAIL.slice` and then enabled some controllers in its `cgroup.subtree_control` before starting the container. However, the cgroup Docker creates within `NSJAIL.slice` still ends up having an empty `cgroup.subtree_control`. That makes sense, since when I created `NSJAIL.slice`, `cgroup.subtree_control` also started out empty (as is documented by the various manpages on cgroupv2). I was hoping that `NSJAIL.slice` would become the root cgroup in the container, which is not the case.

As mentioned in the cgroup namespaces manpage,

However, Docker seems to create a new cgroup for each container with some hash in its name right before starting the container. Thus, there seems to be no opportunity to write to `cgroup.subtree_control` before the cgroup is populated with processes. Thus, the only solution I see currently is to rely on `--cgroupns host`.
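
For anyone wanting to reproduce the observation inside a container, here is a rough Python sketch of the checks described above. It assumes cgroupv2 is mounted at `/sys/fs/cgroup`. The helper names (`read_list`, `inspect_cgroup`, `try_enable`) are made up for illustration and are not part of NsJail or Docker. It reads the available controllers, the controllers delegated to children, and the member processes, and it can attempt the write to `cgroup.subtree_control` and report whether the kernel answered with `EBUSY` ("device or resource busy"). The cgroups(7) manpage describes a "no internal processes" rule that produces this error for cgroups with member processes, but whether that rule is the cause here is still unconfirmed.

```python
import errno
from pathlib import Path


def read_list(path: Path) -> list[str]:
    """Return the whitespace-separated entries of a cgroup interface file."""
    try:
        return path.read_text().split()
    except FileNotFoundError:
        # Not a cgroupv2 directory (or the file is absent): treat as empty.
        return []


def inspect_cgroup(cgroup_dir: str = "/sys/fs/cgroup") -> dict:
    """Summarise the state that decides whether child cgroups get controllers."""
    root = Path(cgroup_dir)
    return {
        # Controllers this cgroup could hand down to its children.
        "available": read_list(root / "cgroup.controllers"),
        # Controllers actually enabled for children (empty in the report above).
        "delegated": read_list(root / "cgroup.subtree_control"),
        # PIDs that are direct members of this cgroup.
        "pids": read_list(root / "cgroup.procs"),
    }


def try_enable(cgroup_dir: str, controllers: list[str]) -> str:
    """Try to enable controllers for child cgroups and report what happened."""
    request = " ".join(f"+{name}" for name in controllers)
    try:
        (Path(cgroup_dir) / "cgroup.subtree_control").write_text(request)
    except OSError as exc:
        if exc.errno == errno.EBUSY:
            return "busy"  # the "device or resource busy" case
        return f"failed: {exc.strerror}"
    return "enabled"


if __name__ == "__main__":
    # Read-only by default; call try_enable() explicitly to attempt the write.
    print(inspect_cgroup())
```

With a private cgroup namespace I would expect `delegated` to come back empty and `pids` to be non-empty in the container's root cgroup. Calling `try_enable("/sys/fs/cgroup", ["memory", "pids"])` there should then return `"busy"`, or a `failed: ...` message if the cgroup filesystem is mounted read-only.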