New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A multitude of cgroup fixes #10894

Merged
merged 29 commits into from Nov 26, 2018

Conversation

2 participants
@poettering
Member

poettering commented Nov 23, 2018

Mostly fixes to get #9512 fixed, but many other fixes too.

And don't be put too off by the number of patches in this patch set, most of these are fairly trivial.

/cc @htejun @cdown

poettering added some commits Nov 20, 2018

cgroup: only install cgroup release agent when we own the root cgroup
If we run in a container we shouldn't patch around this, and most likely
we can't anyway, and there's not much point in complaining about this.
Hence let's strictly say: the agent is private property of the host's
system instance, nothing else.
cgroup: add a new macro for determining log level for cgroup attr wri…
…te failures

For now, let's use it only at one place, but a follow-up commit will
make more use of it.
cgroup: add a common routine for writing to attributes, and logging a…
…bout it

We can use this at quite a few places, and this allows us to shorten our
code quite a bit.
cgroup: rename {manager_owns|unit_has}_root_cgroup() → .._host_root_c…
…group()

Let's emphasize that this function checks for the host root cgroup, i.e.
returns false for the root cgroup when we run in a container where
CLONE_NEWCGROUP is used. There has been some confusion around this
already, for example cgroup_context_apply() uses the function
incorrectly (which we'll fix in a later commit).

Just some refactoring, not change in behaviour.
cgroup: tighten manager_owns_host_root_cgroup() a bit
This tightening is not strictly necessary (as the m->cgroup_root check
further down does the same), but let's make this explicit.
cgroup: append \n to static strings we write to cgroup attributes
This is a bit cleaner since we when we format numeric limits we append
it. And this way write_string_file() doesn't have to append it.
cgroup: fine tune when to apply cgroup attributes to the root cgroup
Let's tweak when precisely to apply cgroup attributes on the root
cgroup.

With this we now follow the following rules:

1. On cgroupsv2 we never apply any regular cgroups to the host root,
   since the attributes generally do not exist there.

2. On cgroupsv1 we do not apply any "weight" or "shares" style
   attributes to the host root cgroup, since they don't make much sense
   on the top level where there's only one group, hence no need to
   compare weights against each other. The other attributes are applied
   to the host root cgroup however.

3. In any case we don't apply attributes to the root of container
   environments (and --user roots), under the assumption that this is
   managed by the manager further up. (Note that on cgroupsv2 this is
   even enforced by the kernel)

4. BPF pseudo-attributes are applied in all cases (since we can have as
   many of them as we want)
cgroup: units that aren't loaded properly should not result in cgroup…
… controllers being pulled in

This shouldn't make much difference in real life, but is a bit cleaner.
@cdown

This looks great, thank you! Just a couple of questions. The EBUSY one I think is harmless, but I wonder if we should also just check fewer things if we can get away with it to reduce the surface area for bugs

* minimize surprises here and reduce triggers for re-realization by always saying we fully
* succeeded. */
if (ret_result_mask)
*ret_result_mask = mask & supported & CGROUP_MASK_V2;

This comment has been minimized.

@cdown

cdown Nov 23, 2018

Member

Why & CGROUP_MASK_V2?

This comment has been minimized.

@cdown

cdown Nov 23, 2018

Member

Oh whoops, I was confused by the comment talking about legacy hierarchy so I forgot this is r == 0 case

This comment has been minimized.

@poettering

poettering Nov 23, 2018

Member

yeah, we never want to report that we can enable V1 or bpf pseudo-controllers this way.

i will add a comment about this

/* If we can't turn off a controller, leave it on in the reported resulting mask. This
* happens for example when we attempt to turn off a controller up in the tree that is
* used down in the tree. */
if (!FLAGS_SET(mask, bit) && r == -EBUSY)

This comment has been minimized.

@cdown

cdown Nov 23, 2018

Member

Hmm, isn't this enough without the EBUSY check? Not sure it buys us much here

This comment has been minimized.

@poettering

poettering Nov 23, 2018

Member

so, when disabling a controller EBUSY indicates "controller is enabled, but we can't disable it because something further down the tree still uses it". An error like EINVAL or EOPNOTSUPP or so very likely means "i never heard of this controller, what are you talking about?". In the former case this means we can assume a-posteriori that the controller continues to be enabled, in the latter case we cannot assume that, it probably doesn't exist at all..

i will also add a comment about that

@cdown

cdown approved these changes Nov 23, 2018

* minimize surprises here and reduce triggers for re-realization by always saying we fully
* succeeded. */
if (ret_result_mask)
*ret_result_mask = mask & supported & CGROUP_MASK_V2;

This comment has been minimized.

@cdown

cdown Nov 23, 2018

Member

Oh whoops, I was confused by the comment talking about legacy hierarchy so I forgot this is r == 0 case

poettering added some commits Nov 22, 2018

cgroup: be more careful with which controllers we can enable/disable …
…on a cgroup

This changes cg_enable_everywhere() to return which controllers are
enabled for the specified cgroup. This information is then used to
correctly track the enablement mask currently in effect for a unit.
Moreover, when we try to turn off a controller, and this works, then
this is indicates that the parent unit might succesfully turn it off
now, too as our unit might have kept it busy.

So far, when realizing cgroups, i.e. when syncing up the kernel
representation of relevant cgroups with our own idea we would strictly
work from the root to the leaves. This is generally a good approach, as
when controllers are enabled this has to happen in root-to-leaves order.
However, when controllers are disabled this has to happen in the
opposite order: in leaves-to-root order (this is because controllers can
only be enabled in a child if it is already enabled in the parent, and
if it shall be disabled in the parent then it has to be disabled in the
child first, otherwise it is considered busy when it is attempted to
remove it in the parent).

To make things complicated when invalidating a unit's cgroup membershup
systemd can actually turn off some controllers previously turned on at
the very same time as it turns on other controllers previously turned
off. In such a case we have to work up leaves-to-root *and*
root-to-leaves right after each other. With this patch this is
implemented: we still generally operate root-to-leaves, but as soon as
we noticed we successfully turned off a controller previously turned on
for a cgroup we'll re-enqueue the cgroup realization for all parents of
a unit, thus implementing leaves-to-root where necessary.
cgroup: in unit_invalidate_cgroup() actually modify invalidation mask
Previously this would manipulate the realization mask for invalidating
the realization. This is a bit ugly though as the realization mask's
primary purpose to is to reflect in which hierarchies a cgroup currently
exists, and it's probably a good idea to keep that in sync with
realities.

We nowadays have the an explicit fields for invalidating cgroup
controller information, the "cgroup_invalidated_mask", let's use this
one instead.

The effect is pretty much the same, as the main consumer of these masks
(unit_has_mask_realize()) checks both anyway.
cgroup: extend reasons when we realize the enable mask
After creating a cgroup we need to initialize its
"cgroup.subtree_control" file with the controllers its children want to
use. Currently we do so whenever the mkdir() on the cgroup succeeded,
i.e. when we know the cgroup is "fresh". Let's update the condition
slightly that we also do so when internally we assume a cgroup doesn't
exist yet, even if it already does (maybe left-over from a previous
run).

This shouldn't change anything IRL but make things a bit more robust.

poettering added some commits Nov 23, 2018

cgroup: drastically simplify caching of cgroups members mask
Previously we tried to be smart: when a new unit appeared and it only
added controllers to the cgroup mask we'd update the cached members mask
in all parents by ORing in the controller flags in their cached values.
Unfortunately this was quite broken, as we missed some conditions when
this cache had to be reset (for example, when a unit got unloaded),
moreover the optimization doesn't work when a controller is removed
anyway (as in that case there's no other way for the parent to iterate
though all children if any other, remaining child unit still needs it).
Hence, let's simplify the logic substantially: instead of updating the
cache on the right events (which we didn't get right), let's simply
invalidate the cache, and generate it lazily when we encounter it later.
This should actually result in better behaviour as we don't have to
calculate the new members mask for a whole subtree whever we have the
suspicion something changed, but can delay it to the point where we
actually need the members mask.

This allows us to simplify things quite a bit, which is good, since
validating this cache for correctness is hard enough.

Fixes: #9512
cgroup: when we unload a unit, also update all its parent's members mask
This way we can corectly ensure that when a unit that requires some
controller goes away, we propagate the removal of it all the way up, so
that the controller is turned off in all the parents too.
@poettering

This comment has been minimized.

Member

poettering commented Nov 23, 2018

Force pushed a new version, with additional comments added in addressing @cdown's questions. I figure if this comes up here it's a good thing to also comment this in the sources, for the next one reading this code.

@poettering poettering force-pushed the poettering:root-cgroup-fix branch from 599ee10 to 43738e0 Nov 23, 2018

@cdown

This comment has been minimized.

Member

cdown commented Nov 23, 2018

Thanks for the comment updates! Looks good to me.

@poettering

This comment has been minimized.

Member

poettering commented Nov 26, 2018

@cdown thanks for the review! let's merge hence.

@poettering poettering merged commit 9630d4d into systemd:master Nov 26, 2018

6 of 9 checks passed

bionic-amd64 autopkgtest finished (failure)
Details
bionic-arm64 autopkgtest finished (failure)
Details
bionic-i386 autopkgtest finished (failure)
Details
Fedora Rawhide CI x86_64 rpm build [succeeded]
Details
LGTM analysis: C/C++ No alert changes
Details
LGTM analysis: JavaScript No code changes detected
Details
bionic-s390x autopkgtest finished (success)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
semaphoreci The build passed on Semaphore.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment