cgroup: do 'catchup' for unit cgroup inotify watch files #20199

Merged
merged 2 commits into systemd:main from unit_cgroup_catchup on Aug 12, 2021

Conversation

ddstreet
Contributor

During reexec/reload, we drop the inotify watch on cgroup file(s), so
we need to re-check them in case they changed and we missed the event.

Fixes: #20198

@ddstreet
Contributor Author

At first, I tried to serialize the cgroup.events mtime across reexec/reload, so it could be checked to see whether the cgroup.events file had changed, but unfortunately it seems the kernel doesn't update the mtime when it actually changes the cgroup.events file content, so the only thing I could see to do is just check the file content for all the files we're interested in.
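
For illustration, here is a minimal sketch of the content-check approach described above, assuming a cgroup v2 mount at /sys/fs/cgroup; the helper name and the parsing are illustrative and not the actual systemd code:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if the cgroup's cgroup.events file reports "populated 1".
 * cgroup.events is a flat keyed file, e.g. "populated 1\nfrozen 0\n". */
static bool cgroup_is_populated(const char *cgroup_path) {
        char p[4096], line[64];
        bool populated = false;

        snprintf(p, sizeof(p), "/sys/fs/cgroup%s/cgroup.events", cgroup_path);

        FILE *f = fopen(p, "re");
        if (!f)
                return false;

        while (fgets(line, sizeof(line), f))
                if (strncmp(line, "populated ", 10) == 0) {
                        populated = line[10] == '1';
                        break;
                }

        fclose(f);
        return populated;
}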

Contributor

@Werkov left a comment


Your approach seems good: more robust than timestamp checking and simpler than passing the inotify fd across the reload/reexec barrier (which could save doing the check work for each unit on the reload path).

src/core/unit.c
@Werkov
Contributor

Werkov commented Jul 14, 2021

LGTM (assuming you've successfully tested this resolves the missed exits during reload).

@ddstreet
Contributor Author

ddstreet commented Jul 14, 2021

LGTM (assuming you've successfully tested this resolves the missed exits during reload).

thanks, yep I have tested it, and I just pushed a rebase onto main (no changes) and am about to test once more (edit: just tested the latest rebase and confirmed it fixes the leak)

* file modification times, so we can't just serialize and then check
* the mtime for file(s) we are interested in. */
(void) unit_check_cgroup_events(u);
(void) unit_check_oom(u);
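
For context, a plausible reconstruction of the catchup helper this hunk comes from; the function name unit_cgroup_catchup is taken from this PR's branch, but the signature, the guard, and the start of the comment are assumptions rather than the verbatim patch:

void unit_cgroup_catchup(Unit *u) {
        assert(u);

        if (!UNIT_HAS_CGROUP_CONTEXT(u)) /* guard is an assumption */
                return;

        /* We dropped the inotify watch during reexec/reload, so we need to
         * re-check these files. Note that the kernel doesn't update cgroup
         * file modification times, so we can't just serialize and then check
         * the mtime for file(s) we are interested in. */
        (void) unit_check_cgroup_events(u);
        (void) unit_check_oom(u);
}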
Member


Calling unit_check_oom() bypasses the cgroup_oom_queue that the inotify watch would normally put the OOM-ed unit into. Shouldn't this at least check and remove the unit from cgroup_oom_queue so it keeps the list consistent?

I'm thinking of a situation where cgroup_oom_queue could have multiple items before daemon-reload, but afterwards the units would all be handled by unit_check_oom() without being removed from cgroup_oom_queue.

Contributor


AFAICS, cgroup_oom_queue isn't handled like the other queues, i.e. it's not gradually emptied from manager_clear_jobs_and_units/unit_free. (It looks like a reload with a unit in in_cgroup_oom_queue can result in a use-after-free through m->cgroup_oom_queue.)

So your remark is relevant. I'd suggest handling cgroup_oom_queue consistently with the other similar queues, i.e. empty it (without processing) before the reload and sync each unit's state anew after the reload.
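
A hedged sketch of the cleanup being suggested, modeled on how the other per-manager unit queues are detached when a unit goes away; LIST_REMOVE is systemd's intrusive-list macro from src/basic/list.h, while the helper name and the in_cgroup_oom_queue flag are assumptions:

/* Detach the unit from the manager's OOM queue, mirroring the handling
 * of the other unit queues in the unit teardown path. */
static void unit_remove_from_cgroup_oom_queue(Unit *u) {
        if (!u->in_cgroup_oom_queue)
                return;

        LIST_REMOVE(cgroup_oom_queue, u->manager->cgroup_oom_queue, u);
        u->in_cgroup_oom_queue = false;
}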

Contributor Author


Calling unit_check_oom() bypasses the cgroup_oom_queue

It's definitely strange that the unit_check_oom()/unit_add_to_cgroup_oom_queue() function pair is backwards from the unit_check_cgroup_events()/unit_add_to_cgroup_empty_queue() function pair.

unit_check_cgroup_events actually does a check on cgroup.events and adds the unit to (or removes it from) the cgroup empty queue.

unit_check_oom however does the opposite; it expects to be called from on_cgroup_oom_event and does the actual handling of that event. This function seems misnamed to me.

I pushed a new patch that changes unit_check_oom to unit_add_to_cgroup_oom_queue, which seems correct, though the unit_check_oom and related functions could probably use some renaming/refactoring to be more consistent and clear.
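
In diff form, the change being described is roughly the following (a paraphrase, not the verbatim patch):

         (void) unit_check_cgroup_events(u);
-        (void) unit_check_oom(u);
+        unit_add_to_cgroup_oom_queue(u);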

Contributor Author


I pushed a new patch that changes unit_check_oom to unit_add_to_cgroup_oom_queue,

To clarify, I didn't actually change those functions, only which one is now called from the unit_cgroup_catchup function.

@cjohnston1158

I have tested this with another scenario and here are my results:

root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 21:49:25 UTC 2021
0
root@juju-be253b-7-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
[1] 10320
[2] 10321
[3] 10322
[4] 10323
[5] 10324
[6] 10325
[7] 10326
[8] 10327
[9] 10328
[10] 10329
[11] 10330
[12] 10331
[13] 10332
[14] 10333
[15] 10334
[16] 10335
[17] 10336
[18] 10349
[19] 10351
...
root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 21:51:36 UTC 2021
27
root@juju-be253b-7-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
...
root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 21:52:31 UTC 2021
68
root@juju-be253b-7-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
...
root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 21:54:51 UTC 2021
100
root@juju-be253b-7-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
...
root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 21:56:32 UTC 2021
145
root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:05:18 UTC 2021
145

Then I applied the patch from @ddstreet via his PPA:

Possible fix:
root@juju-be253b-8-lxd-5:~# sudo add-apt-repository ppa:ddstreet/lp1934147

 More info: https://launchpad.net/~ddstreet/+archive/ubuntu/lp1934147
Press [ENTER] to continue or Ctrl-c to cancel adding it.
...
root@juju-be253b-8-lxd-5:~# sudo apt install systemd=237-3ubuntu10.49~bug1934147v20210711b1 libsystemd0=237-3ubuntu10.49~bug1934147v20210711b1 libnss-systemd=237-3ubuntu10.49~bug1934147v20210711b1 libpam-systemd=237-3ubuntu10.49~bug1934147v20210711b1 libudev1=237-3ubuntu10.49~bug1934147v20210711b1 systemd-sysv=237-3ubuntu10.49~bug1934147v20210711b1 udev=237-3ubuntu10.49~bug1934147v20210711b1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following package was automatically installed and is no longer required:
  libfreetype6
Use 'sudo apt autoremove' to remove it.
Suggested packages:
  systemd-container
The following packages will be DOWNGRADED:
  libnss-systemd libpam-systemd libsystemd0 libudev1 systemd systemd-sysv udev
0 upgraded, 0 newly installed, 7 downgraded, 0 to remove and 13 not upgraded.
Need to get 5,201 kB of archives.
After this operation, 735 kB disk space will be freed.
Do you want to continue? [Y/n] y
...
root@juju-be253b-8-lxd-5:~# dpkg -l | grep systemd
ii libnss-systemd:amd64 237-3ubuntu10.49~bug1934147v20210711b1 amd64 nss module providing dynamic user and group name resolution
ii libpam-systemd:amd64 237-3ubuntu10.49~bug1934147v20210711b1 amd64 system and service manager - PAM module
ii libsystemd0:amd64 237-3ubuntu10.49~bug1934147v20210711b1 amd64 systemd utility library
ii networkd-dispatcher 1.7-0ubuntu3.3 all Dispatcher service for systemd-networkd connection status changes
ii python3-systemd 234-1build1 amd64 Python 3 bindings for systemd
ii systemd 237-3ubuntu10.49~bug1934147v20210711b1 amd64 system and service manager
ii systemd-sysv 237-3ubuntu10.49~bug1934147v20210711b1 amd64 system and service manager - SysV links
root@juju-be253b-8-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:12:10 UTC 2021
0
root@juju-be253b-8-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
[1] 4679
[2] 4680
[3] 4681
[4] 4682
[5] 4683
[6] 4684
[7] 4685
[8] 4686
...
root@juju-be253b-8-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:13:08 UTC 2021
0
root@juju-be253b-8-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
root@juju-be253b-8-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:13:59 UTC 2021
0
root@juju-be253b-8-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
root@juju-be253b-8-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:14:46 UTC 2021
0
root@juju-be253b-8-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
root@juju-be253b-8-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:15:40 UTC 2021
0

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1934147

@Werkov
Contributor

Werkov commented Aug 4, 2021

As I mentioned in the comment above, cgroup_oom_queue should be flushed before reload (especially after these changes).
I put the change into a commit based on your branch. I think it makes sense to batch these in this PR; @ddstreet, could you cherry-pick/reset that into your branch? (Alternatively, let me know if you wish to route this via a separate PR.)

@ddstreet
Contributor Author

ddstreet commented Aug 4, 2021

As I mentioned in the comment above, cgroup_oom_queue should be flushed before reload (especially after these changes).
I put the change into a commit based on your branch. I think it makes sense to batch these in this PR; @ddstreet, could you cherry-pick/reset that into your branch? (Alternatively, let me know if you wish to route this via a separate PR.)

yep that LGTM, cherry-picked it, thanks

Dan Streetman and others added 2 commits August 5, 2021 10:35
During reexec/reload, we drop the inotify watch on cgroup file(s), so
we need to re-check them in case they changed and we missed the event.

Fixes: systemd#20198
The unit queues are not serialized/deserialized (they are recreated
after reexec/reload instead), and destroyed units are not removed from
the cgroup_oom_queue. That means the queue may contain invalid
pointers to released units.

Fix this by removing the units from cgroup_oom_queue as we do for the
other queues. While at it, sync the assert checks with the currently
existing queues and put them in order in the manager cleanup code.
@ddstreet
Contributor Author

ddstreet commented Aug 5, 2021

rebased on main in the latest push

@poettering
Member

double ouch

@poettering
Member

CI failure appears unrelated

@poettering poettering merged commit ced10d4 into systemd:main Aug 12, 2021
@ddstreet ddstreet deleted the unit_cgroup_catchup branch August 20, 2021 14:39
Development

Successfully merging this pull request may close these issues.

session leaked due to missed inotify of cgroup becoming empty
5 participants