cgroup: do 'catchup' for unit cgroup inotify watch files #20199

Merged
merged 2 commits into systemd:main from unit_cgroup_catchup on Aug 12, 2021

Conversation

ddstreet
Contributor

During reexec/reload, we drop the inotify watch on cgroup file(s), so
we need to re-check them in case they changed and we missed the event.

Fixes: #20198

@ddstreet
Contributor Author

At first, I tried to serialize the cgroup.events mtime across reexec/reload, so it could be checked to see whether the cgroup.events file had changed, but unfortunately it seems the kernel doesn't update the mtime when it actually changes the cgroup.events file content, so the only thing I could see to do is just check the file content for all the files we're interested in.
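
For illustration, here is a minimal sketch of the content-check approach described above, assuming a cgroup v2 mount at /sys/fs/cgroup; the helper name and the parsing are illustrative and not the actual systemd code:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if the cgroup's cgroup.events file reports "populated 1".
 * cgroup.events is a flat keyed file, e.g. "populated 1\nfrozen 0\n". */
static bool cgroup_is_populated(const char *cgroup_path) {
        char p[4096], line[64];
        bool populated = false;

        snprintf(p, sizeof(p), "/sys/fs/cgroup%s/cgroup.events", cgroup_path);

        FILE *f = fopen(p, "re");
        if (!f)
                return false;

        while (fgets(line, sizeof(line), f))
                if (strncmp(line, "populated ", 10) == 0) {
                        populated = line[10] == '1';
                        break;
                }

        fclose(f);
        return populated;
}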

Contributor

@Werkov left a comment


Your approach seems good: more robust than timestamp checking and simpler than passing the inotify fd across the reload/reexec barrier (which could save doing the check work for each unit on the reload path).

src/core/unit.c
@Werkov
Contributor

Werkov commented Jul 14, 2021

LGTM (assuming you've successfully tested this resolves the missed exits during reload).

@ddstreet
Contributor Author

ddstreet commented Jul 14, 2021

LGTM (assuming you've successfully tested this resolves the missed exits during reload).

thanks, yep I have tested it, and I just pushed a rebase onto main (no changes) and am about to test once more (edit: just tested the latest rebase and confirmed it fixes the leak)

* file modification times, so we can't just serialize and then check
* the mtime for file(s) we are interested in. */
(void) unit_check_cgroup_events(u);
(void) unit_check_oom(u);
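
For context, a plausible reconstruction of the catchup helper this hunk comes from; the function name unit_cgroup_catchup is taken from this PR's branch, but the signature, the guard, and the start of the comment are assumptions rather than the verbatim patch:

void unit_cgroup_catchup(Unit *u) {
        assert(u);

        if (!UNIT_HAS_CGROUP_CONTEXT(u)) /* guard is an assumption */
                return;

        /* We dropped the inotify watch during reexec/reload, so we need to
         * re-check these files. Note that the kernel doesn't update cgroup
         * file modification times, so we can't just serialize and then check
         * the mtime for file(s) we are interested in. */
        (void) unit_check_cgroup_events(u);
        (void) unit_check_oom(u);
}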
Member


Calling unit_check_oom() bypasses the cgroup_oom_queue that the inotify watch would normally put the OOM-ed unit into. Shouldn't this at least check and remove the unit from cgroup_oom_queue so it keeps the list consistent?

I'm thinking of a situation where cgroup_oom_queue could have multiple items before daemon-reload, but afterwards the units would all be handled by unit_check_oom() without being removed from cgroup_oom_queue.

Contributor


AFAICS, cgroup_oom_queue isn't handled like the other queues, i.e. it's not gradually emptied from manager_clear_jobs_and_units/unit_free. (It looks like a reload with a unit in in_cgroup_oom_queue can result in a use-after-free through m->cgroup_oom_queue.)

So your remark is relevant. I'd suggest handling cgroup_oom_queue consistently with the other similar queues, i.e. empty it (without processing) before the reload and sync each unit's state anew after the reload.
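
A hedged sketch of the cleanup being suggested, modeled on how the other per-manager unit queues are detached when a unit goes away; LIST_REMOVE is systemd's intrusive-list macro from src/basic/list.h, while the helper name and the in_cgroup_oom_queue flag are assumptions:

/* Detach the unit from the manager's OOM queue, mirroring the handling
 * of the other unit queues in the unit teardown path. */
static void unit_remove_from_cgroup_oom_queue(Unit *u) {
        if (!u->in_cgroup_oom_queue)
                return;

        LIST_REMOVE(cgroup_oom_queue, u->manager->cgroup_oom_queue, u);
        u->in_cgroup_oom_queue = false;
}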

Contributor Author


Calling unit_check_oom() bypasses the cgroup_oom_queue

It's definitely strange that the unit_check_oom()/unit_add_to_cgroup_oom_queue() function pair is backwards from the unit_check_cgroup_events()/unit_add_to_cgroup_empty_queue() function pair.

unit_check_cgroup_events actually does a check on cgroup.events and adds the unit to (or removes it from) the cgroup empty queue.

unit_check_oom however does the opposite; it expects to be called from on_cgroup_oom_event and does the actual handling of that event. This function seems misnamed to me.

I pushed a new patch that changes unit_check_oom to unit_add_to_cgroup_oom_queue, which seems correct, though the unit_check_oom and related functions could probably use some renaming/refactoring to be more consistent and clear.
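
In diff form, the change being described is roughly the following (a paraphrase, not the verbatim patch):

         (void) unit_check_cgroup_events(u);
-        (void) unit_check_oom(u);
+        unit_add_to_cgroup_oom_queue(u);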

Contributor Author


I pushed a new patch that changes unit_check_oom to unit_add_to_cgroup_oom_queue,

To clarify, I didn't actually change those functions, only which one is now called from the unit_cgroup_catchup function.

@cjohnston1158

I have tested this with another scenario and here are my results:

root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 21:49:25 UTC 2021
0
root@juju-be253b-7-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
[1] 10320
[2] 10321
[3] 10322
[4] 10323
[5] 10324
[6] 10325
[7] 10326
[8] 10327
[9] 10328
[10] 10329
[11] 10330
[12] 10331
[13] 10332
[14] 10333
[15] 10334
[16] 10335
[17] 10336
[18] 10349
[19] 10351
...
root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 21:51:36 UTC 2021
27
root@juju-be253b-7-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
...
root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 21:52:31 UTC 2021
68
root@juju-be253b-7-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
...
root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 21:54:51 UTC 2021
100
root@juju-be253b-7-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
...
root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 21:56:32 UTC 2021
145
root@juju-be253b-7-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:05:18 UTC 2021
145

Then I applied the patch from @ddstreet via his PPA:

Possible fix:
root@juju-be253b-8-lxd-5:~# sudo add-apt-repository ppa:ddstreet/lp1934147

 More info: https://launchpad.net/~ddstreet/+archive/ubuntu/lp1934147
Press [ENTER] to continue or Ctrl-c to cancel adding it.
...
root@juju-be253b-8-lxd-5:~# sudo apt install systemd=237-3ubuntu10.49~bug1934147v20210711b1 libsystemd0=237-3ubuntu10.49~bug1934147v20210711b1 libnss-systemd=237-3ubuntu10.49~bug1934147v20210711b1 libpam-systemd=237-3ubuntu10.49~bug1934147v20210711b1 libudev1=237-3ubuntu10.49~bug1934147v20210711b1 systemd-sysv=237-3ubuntu10.49~bug1934147v20210711b1 udev=237-3ubuntu10.49~bug1934147v20210711b1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following package was automatically installed and is no longer required:
  libfreetype6
Use 'sudo apt autoremove' to remove it.
Suggested packages:
  systemd-container
The following packages will be DOWNGRADED:
  libnss-systemd libpam-systemd libsystemd0 libudev1 systemd systemd-sysv udev
0 upgraded, 0 newly installed, 7 downgraded, 0 to remove and 13 not upgraded.
Need to get 5,201 kB of archives.
After this operation, 735 kB disk space will be freed.
Do you want to continue? [Y/n] y
...
root@juju-be253b-8-lxd-5:~# dpkg -l | grep systemd
ii libnss-systemd:amd64 237-3ubuntu10.49~bug1934147v20210711b1 amd64 nss module providing dynamic user and group name resolution
ii libpam-systemd:amd64 237-3ubuntu10.49~bug1934147v20210711b1 amd64 system and service manager - PAM module
ii libsystemd0:amd64 237-3ubuntu10.49~bug1934147v20210711b1 amd64 systemd utility library
ii networkd-dispatcher 1.7-0ubuntu3.3 all Dispatcher service for systemd-networkd connection status changes
ii python3-systemd 234-1build1 amd64 Python 3 bindings for systemd
ii systemd 237-3ubuntu10.49~bug1934147v20210711b1 amd64 system and service manager
ii systemd-sysv 237-3ubuntu10.49~bug1934147v20210711b1 amd64 system and service manager - SysV links
root@juju-be253b-8-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:12:10 UTC 2021
0
root@juju-be253b-8-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
[1] 4679
[2] 4680
[3] 4681
[4] 4682
[5] 4683
[6] 4684
[7] 4685
[8] 4686
...
root@juju-be253b-8-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:13:08 UTC 2021
0
root@juju-be253b-8-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
root@juju-be253b-8-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:13:59 UTC 2021
0
root@juju-be253b-8-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
root@juju-be253b-8-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:14:46 UTC 2021
0
root@juju-be253b-8-lxd-5:~# for i in {1..100}; do /snap/bin/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json & done; for i in {1..20}; do echo 'Reloading...'; sudo systemctl daemon-reload; done
root@juju-be253b-8-lxd-5:~# date ; systemctl list-units --type scope | grep snap | wc -l
Thu Jul 22 22:15:40 UTC 2021
0

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1934147

@Werkov
Contributor

Werkov commented Aug 4, 2021

As I mentioned in the comment above, cgroup_oom_queue should be flushed before reload (especially after these changes).
I put the change into a commit based on your branch. I think it makes sense to batch these in this PR; @ddstreet, could you cherry-pick/reset that into your branch? (Alternatively, let me know if you wish to route this via a separate PR.)

@ddstreet
Contributor Author

ddstreet commented Aug 4, 2021

As I mentioned in the comment above, cgroup_oom_queue should be flushed before reload (especially after these changes).
I put the change into a commit based on your branch. I think it makes sense to batch these in this PR; @ddstreet, could you cherry-pick/reset that into your branch? (Alternatively, let me know if you wish to route this via a separate PR.)

yep that LGTM, cherry-picked it, thanks

Dan Streetman and others added 2 commits August 5, 2021 10:35
During reexec/reload, we drop the inotify watch on cgroup file(s), so
we need to re-check them in case they changed and we missed the event.

Fixes: systemd#20198
The unit queues are not serialized/deserialized (they are recreated
after reexec/reload instead), and destroyed units are not removed from
the cgroup_oom_queue. That means the queue may contain invalid
pointers to released units.

Fix this by removing the units from cgroup_oom_queue as we do for the
other queues. While at it, sync the assert checks with the currently
existing queues and put them in order in the manager cleanup code.
@ddstreet
Contributor Author

ddstreet commented Aug 5, 2021

rebased on main in the latest push

@poettering
Member

double ouch

@poettering
Member

CI failure appears unrelated

@poettering poettering merged commit ced10d4 into systemd:main Aug 12, 2021
@ddstreet ddstreet deleted the unit_cgroup_catchup branch August 20, 2021 14:39
Development

Successfully merging this pull request may close these issues.

session leaked due to missed inotify of cgroup becoming empty
5 participants