cgroup: do 'catchup' for unit cgroup inotify watch files #20199
Conversation
At first, I tried to serialize the cgroup.events mtime across reexec/reload, so it could be checked to see whether the cgroup.events file had changed. Unfortunately, it seems the kernel doesn't update the mtime when the cgroup.events file content actually changes, so the only thing I could see to do is to just check the file content for all the files we're interested in.
Your approach seems good: more robust than timestamp checking, and simpler than passing the inotify fd across the reload/reexec barrier (which might have saved doing the check work for each unit on the reload path).
Force-pushed from 019d63a to 28e466f
LGTM (assuming you've successfully tested that this resolves the missed exits during reload).
Force-pushed from 28e466f to cdbfddc
Thanks, yep, I have tested it. I just pushed a rebase to main (no changes) and am about to test once more. (Edit: just tested the latest rebase and confirmed it fixes the leak.)
src/core/cgroup.c (Outdated)

         * file modification times, so we can't just serialize and then check
         * the mtime for file(s) we are interested in. */
        (void) unit_check_cgroup_events(u);
        (void) unit_check_oom(u);
Calling unit_check_oom() bypasses the cgroup_oom_queue that the inotify watch would normally put the OOM-ed unit into. Shouldn't this at least check and remove the unit from cgroup_oom_queue, so it keeps the list consistent?

I'm thinking of a situation where cgroup_oom_queue could have multiple items before daemon-reload, but afterwards the units would all be handled by unit_check_oom() without being removed from cgroup_oom_queue.
AFAICS, cgroup_oom_queue isn't handled like the other queues, i.e. it's not gradually emptied from manager_clear_jobs_and_units()/unit_free(). (It looks like a reload with a unit in in_cgroup_oom_queue could result in a UAF through m->cgroup_oom_queue.)

So your comment is a relevant remark. I'd suggest handling cgroup_oom_queue in accordance with the other similar queues, i.e. empty it (without processing) before the reload and sync each unit's state anew after the reload.
> Calling unit_check_oom() bypasses the cgroup_oom_queue

It's definitely strange that the unit_check_oom()/unit_add_to_cgroup_oom_queue() function pair is backwards from the unit_check_cgroup_events()/unit_add_to_cgroup_empty_queue() function pair.

unit_check_cgroup_events() actually does a check on cgroup.events and adds the unit to (or removes it from) the cgroup empty queue. unit_check_oom(), however, does the opposite: it expects to be called from on_cgroup_oom_event() and does the actual handling of that event. This function seems misnamed to me.

I pushed a new patch that changes unit_check_oom() to unit_add_to_cgroup_oom_queue(), which seems correct, though unit_check_oom() and the related functions could probably use some renaming/refactoring to be more consistent and clear.
> I pushed a new patch that changes unit_check_oom to unit_add_to_cgroup_oom_queue

To clarify, I didn't actually change those functions, only which one is now called from the unit_cgroup_catchup() function.
I have tested this with another scenario, and here are my results:

Then I applied the patch from @ddstreet via his PPA:
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1934147
Force-pushed from cdbfddc to 714dbc9
As I mentioned in the comment above,
Force-pushed from 714dbc9 to e7ad02f
Yep, that LGTM; cherry-picked it, thanks.
Across reexec/reload we drop the inotify watch on cgroup file(s), so we need to re-check them in case they changed and we missed the event. Fixes: systemd#20198
The unit queues are not serialized/deserialized (they are recreated after reexec/reload instead), but destroyed units are not removed from the cgroup_oom_queue. That means the queue may contain invalid pointers to released units. Fix this by removing units from cgroup_oom_queue as we do for the other queues. While at it, sync the assert checks with the currently existing queues and put them in order in the manager cleanup code.
Force-pushed from e7ad02f to 13e7210
Rebased on main in the latest push.

Double ouch.

CI failure appears unrelated.
Across reexec/reload we drop the inotify watch on cgroup file(s), so
we need to re-check them in case they changed and we missed the event.
Fixes: #20198