User Manager for user fails to start every second time #12386
Comments
When part of the cgroup hierarchy can not be deleted (e.g. because there are still processes in it), do not exit unit_prune_cgroup early, but continue so that u->cgroup_realized is reset. Change the log severity to warning, so that we can monitor the issue of locked cgroups. Fixes systemd#12386
The patch buczek/systemd@f36092b fixed it for us. I don't suggest it as a solution, because I think people with a better overview can come up with something better.
Addendum: It might play a role that we have systemd configured with
I think your fix is actually OK. Can you please submit it as a PR?
And the message should probably be downgraded to debug level in the ENOTEMPTY case (I assume that's the errno you see, right?).
When part of the cgroup hierarchy cannot be deleted (e.g. because there are still processes in it), do not exit unit_prune_cgroup early, but continue so that u->cgroup_realized is reset. Log the known case of non-empty cgroups at debug level and other errors at warning level. Fixes systemd#12386
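A minimal sketch of the control flow this commit message describes, not the actual systemd code. `toy_unit`, `toy_trim_fn`, and `toy_prune_cgroup` are hypothetical stand-ins for the unit, `cg_trim_everywhere()`, and `unit_prune_cgroup()`; the log levels are the `syslog.h` constants. The point is that a failed trim only selects a log level and no longer skips the reset of the flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <syslog.h>

/* Hypothetical stand-in for the unit; only the flag from the issue is modeled. */
struct toy_unit {
    bool cgroup_realized;
    int last_log_level;   /* level of the last message, -1 if nothing was logged */
};

/* Hypothetical stand-in for cg_trim_everywhere(): returns 0 or a negative errno. */
typedef int (*toy_trim_fn)(void);

int toy_trim_ok(void)        { return 0; }
int toy_trim_not_empty(void) { return -ENOTEMPTY; }
int toy_trim_io_error(void)  { return -EIO; }

/* Non-empty cgroups are the known case: debug. Any other error: warning. */
int toy_trim_failure_level(int r) {
    return r == -ENOTEMPTY ? LOG_DEBUG : LOG_WARNING;
}

void toy_prune_cgroup(struct toy_unit *u, toy_trim_fn trim) {
    assert(u);
    assert(trim);

    u->last_log_level = -1;

    int r = trim();
    if (r < 0)
        /* Record the level instead of returning early. */
        u->last_log_level = toy_trim_failure_level(r);

    /* Reached on success and on failure alike. */
    u->cgroup_realized = false;
}
```

Calling `toy_prune_cgroup(&u, toy_trim_not_empty)` leaves `u.cgroup_realized` false and records `LOG_DEBUG`; with `toy_trim_io_error` it records `LOG_WARNING`.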
When part of the cgroup hierarchy cannot be deleted (e.g. because there are still processes in it), do not exit unit_prune_cgroup early, but continue so that u->cgroup_realized is reset. Log the known case of non-empty cgroups at debug level and other errors at warning level. Fixes systemd/systemd#12386 (cherry picked from commit 0219b3524f414e23589e63c6de6a759811ef8474) (cherry picked from commit 7f7b7865eefe6577fad4f94bdc5e7c83df62ac60)
In our environment we use non-interactive, public-key authenticated ssh sessions a lot for management purposes. For several weeks, ssh commands (`ssh HOST do-something`) executed by management scripts have sometimes failed, with ssh returning a non-zero exit status but nothing on stderr.
We found that `pam_systemd.so` was failing and mitigated the problem by changing it from `required` to `optional` in `pam.d/sshd`. However, the log files still indicated a sporadic problem: for example, the User Manager (`systemd --user`) is not started for every session. For systemd v242 the errors logged are:
We figured out that this happens on some systems (10 out of 200, but the set changes) and, if it happens at all, then typically on every second ssh session.
There is a long chain of problems involved. Here is an overview of what I think is happening:
`ssh HOST systemd-cgls` may look like this:
`systemd --user` completes while `(sd-pam)` still exists. Both the user's `systemd --user` and `systemd[1]` try to remove the user@0.service cgroup, but both fail because there is still a process in it. Here's some output of systemd with `log_level=debug`, plus some lines added by my favorite high-level debugging tool, printf:
unit_prune_cgroup() [1] gets an error from cg_trim_everywhere() and returns early without executing `u->cgroup_realized = false;`. The related error is only shown when systemd runs with `log_level=debug` (or in the above output, because I changed the log level to warning).
After user@0.service is stopped, `user-runtime-dir@0.service` is about to be stopped, too. When this unit completes, `user-0.slice` is cleaned up recursively. This time it succeeds, because sd-pam had some more time to exit:
`user@0.service` still has `u->cgroup_realized = true` and doesn't attempt to set up the cgroups, although the tree is no longer available:
There are multiple problems involved. E.g., why isn't ssh delivering an error message on stderr when a required PAM module fails? Should user@0.service terminate while the sd-pam process is still alive?
Anyway, what can be identified as a bug here is that `u->cgroup_realized` doesn't correctly track the state of the cgroup hierarchy. An easy fix might be to put `u->cgroup_realized = false;` at the beginning of `unit_prune_cgroup()`, before we actually attempt to delete the cgroup tree and possibly fail, maybe after succeeding in parts.
But on top of that: if other units are allowed to recursively clean up the tree from a higher position (e.g. from the user slice), a unit can never know the state of the cgroup tree based on its own actions only.
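To illustrate why a stale flag would produce a failure on every second session, and why resetting the flag first would avoid it, here is a toy model in C. It is only a sketch: every name (`toy_state`, `toy_start`, `toy_prune`, `toy_parent_cleanup`, `toy_count_failures`) is hypothetical, and it assumes that pruning a tree that is already gone succeeds and resets the flag, which is what makes the following session work again in this model.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one unit's cgroup; every name here is hypothetical. */
struct toy_state {
    bool tree_exists;       /* does the cgroup directory exist? */
    bool cgroup_realized;   /* what the unit believes */
    bool straggler;         /* a process (think sd-pam) is still in the cgroup */
};

/* Start the unit: create the tree only if the flag says it is missing.
 * Returns false if the unit would be spawned into a missing tree. */
bool toy_start(struct toy_state *s) {
    assert(s);
    if (!s->cgroup_realized) {
        s->tree_exists = true;
        s->cgroup_realized = true;
    }
    return s->tree_exists;
}

/* Stop the unit. reset_first == false mimics the early return described in
 * the issue; reset_first == true mimics resetting the flag up front. */
void toy_prune(struct toy_state *s, bool reset_first) {
    assert(s);
    if (reset_first)
        s->cgroup_realized = false;

    if (s->tree_exists && s->straggler)
        return;                     /* removal fails: cgroup not empty */

    /* Assumption: pruning an absent or empty tree succeeds. */
    s->tree_exists = false;
    s->cgroup_realized = false;
}

/* Cleanup from higher up (think user-0.slice), after the straggler exited. */
void toy_parent_cleanup(struct toy_state *s) {
    assert(s);
    s->straggler = false;
    s->tree_exists = false;
}

/* Run n sessions back to back and count the failed starts. */
int toy_count_failures(int n, bool reset_first) {
    struct toy_state s = { false, false, false };
    int failures = 0;

    for (int i = 0; i < n; i++) {
        bool ok = toy_start(&s);
        if (!ok)
            failures++;
        s.straggler = ok;           /* only a started session leaves a straggler */
        toy_prune(&s, reset_first);
        toy_parent_cleanup(&s);
    }
    return failures;
}
```

In this model, `toy_count_failures(6, false)` returns 3 (sessions 2, 4, and 6 fail), while `toy_count_failures(6, true)` returns 0. The model does nothing about the second concern: the parent can still remove the tree behind the unit's back at other times.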
I will apply a quick fix for our environment, but I don't want to attempt a fix for upstream here, because I'm unsure of the design principles and the vision regarding `u->cgroup_realized`. Is it just some kind of cache that could be removed (looking at `/sys/fs/cgroup` should be fast enough in the context of spawning external programs), or is it required for another reason?
[1] systemd/src/core/cgroup.c, line 2360 in 2d6888c
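On the question of treating the flag as a cache: a rough sketch of what "looking at `/sys/fs/cgroup`" could mean in C, using plain `stat(2)`. The helper name `toy_cgroup_dir_exists` is hypothetical and this is not how systemd does it. The check is also inherently racy, since the directory can disappear right after it returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Returns true if `path` currently names a directory.
 * A caller could ask this instead of trusting a cached flag. */
bool toy_cgroup_dir_exists(const char *path) {
    struct stat st;

    assert(path);
    if (stat(path, &st) < 0)
        return false;   /* ENOENT and any other error count as "not there" */
    return S_ISDIR(st.st_mode);
}
```

A caller would pass the unit's directory below `/sys/fs/cgroup` and recreate the tree whenever the function returns false.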