manager: always reap first the process for which we already got SIGCHLD #12919

msekletar · 2019-07-01T14:09:17Z

If we get a SIGCHLD we enable and eventually dispatch
sigchld_event_source where we actually reap the process. We received
SIGCHLD for the specific PID so wait for that process first.

The motivation is to prevent a problem caused by our state machine for
mount units relying on the fact that we always dispatch mountinfo
notifications before dispatching the sigchld handler for the mount.
Previously, this was racy because we might have called
manager_dispatch_sigchld() for a completely unrelated process but would
actually reap the mount process which completed in the meantime. The
sigchld handler for the mount unit would then fail the mount unit
because we hadn't dispatched the mountinfo notification yet.

event| mount         kernel              PID 1
------------------------------------------------------------------------
1    |                                   forks off mount as PID x
------------------------------------------------------------------------
2    |                                   receives SIGCHLD for PID y
------------------------------------------------------------------------
3    |                                   enables sigchld_event_source
------------------------------------------------------------------------
4    |                                   dispatches sigchld_event_source
------------------------------------------------------------------------
5    | mount()       mountinfo_notif
------------------------------------------------------------------------
6    | exit()
------------------------------------------------------------------------
7    |                                   calls waitid() with P_ALL
------------------------------------------------------------------------
8    |                                   calls sigchld_handler for mount
------------------------------------------------------------------------
9    |                                   fails the mount unit since
     |                                   mountinfo_notif wasn't
     |                                   processed yet
------------------------------------------------------------------------

Fixes #10872
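
To illustrate the ordering described above, here is a minimal sketch, not the actual patch. It assumes a plain POSIX environment without the systemd event loop; `make_zombie()` and `reap_signalled_pid_first()` are hypothetical names introduced only for this example. The idea is to try `waitid()` with `P_PID` for the PID reported by the SIGCHLD first, and only then fall back to `P_ALL`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits at once and block until it
 * is a zombie. WNOWAIT leaves the child reapable by a later waitid(). */
static pid_t make_zombie(void) {
        siginfo_t si;
        pid_t pid;

        /* Make sure exited children are not auto-reaped (SIG_IGN). */
        signal(SIGCHLD, SIG_DFL);

        pid = fork();
        if (pid == 0)
                _exit(0);
        if (pid < 0 || waitid(P_PID, pid, &si, WEXITED | WNOWAIT) < 0)
                return -1;
        return pid;
}

/* Reap signalled_pid first (if it is waitable), then everything else.
 * Records the reaped PIDs in order in out[] and returns how many there are. */
static int reap_signalled_pid_first(pid_t signalled_pid, pid_t *out, int max) {
        siginfo_t si;
        int n = 0;

        assert(out && max > 0);

        /* Step 1: the process we were explicitly told about. */
        memset(&si, 0, sizeof(si));
        if (signalled_pid > 0 &&
            waitid(P_PID, signalled_pid, &si, WEXITED | WNOHANG) == 0 &&
            si.si_pid == signalled_pid)
                out[n++] = si.si_pid;

        /* Step 2: any other children whose SIGCHLD may have been dropped. */
        while (n < max) {
                memset(&si, 0, sizeof(si));
                if (waitid(P_ALL, 0, &si, WEXITED | WNOHANG) < 0 || si.si_pid == 0)
                        break;
                out[n++] = si.si_pid;
        }

        return n;
}
```

With two exited children `a` and `b`, calling `reap_signalled_pid_first(b, out, 4)` records `b` before `a`, regardless of the order in which they exited. This sketch says nothing about event source priorities, which is what the rest of the discussion is about.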

msekletar · 2019-07-01T14:12:15Z

Do note that I was not able to reproduce the race on my test system, so I couldn't verify that the patch actually fixes the issue. I think it might, but YMMV.

Also whoever is able to reproduce #10872, it would be cool if you could give it a try.

poettering · 2019-07-08T23:36:53Z

this doesn't work... SIGCHLD is not a queue. i.e. if fifteen child processes die all in a very short time window then in theory we should get fifteen separate SIGCHLDs delivered, you'd say. But that's not how this works in the kernel, unfortunately: for each unix signal (excluding realtime signals, which are different) only a single field exists per process: if the field is empty, the SIGCHLD metadata is stored there. But if it is already set, then every new SIGCHLD just overrides the earlier data. This means if fifteen children die at once, then PID 1 might only process the SIGCHLD at a time where only the last process is actually still stored in the field, and all earlier ones have been overwritten.

Yes, UNIX is stupid.

But this means your patch doesn't fix the bug unfortunately: it might very well happen that the SIGCHLD we care for is actually one of the overwritten ones...
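
The "not a queue" part can be checked with a small experiment. This is a sketch assuming Linux and a process with no other children; `drain_pending_sigchld()` and `coalescing_demo()` are hypothetical names. SIGCHLD is kept blocked while several children exit, then the pending SIGCHLDs are counted with `sigtimedwait()` and the exited children are counted with `waitid(P_ALL)`. It only counts signals and makes no claim about which child's metadata survives.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Dequeue every currently pending SIGCHLD without blocking; return the count. */
static int drain_pending_sigchld(void) {
        struct timespec zero = { 0, 0 };
        sigset_t set;
        siginfo_t si;
        int n = 0;

        sigemptyset(&set);
        sigaddset(&set, SIGCHLD);
        while (sigtimedwait(&set, &si, &zero) == SIGCHLD)
                n++;
        return n;
}

/* Let n_children children exit while SIGCHLD is blocked, then report how many
 * SIGCHLDs were pending and how many children could be reaped. */
static int coalescing_demo(int n_children, int *ret_signals, int *ret_reaped) {
        sigset_t set, old;
        pid_t pids[64];
        siginfo_t si;
        int i;

        assert(n_children > 0 && n_children <= 64);

        signal(SIGCHLD, SIG_DFL);       /* no auto-reaping via SIG_IGN */
        sigemptyset(&set);
        sigaddset(&set, SIGCHLD);
        if (sigprocmask(SIG_BLOCK, &set, &old) < 0)
                return -1;
        (void) drain_pending_sigchld(); /* discard anything stale */

        for (i = 0; i < n_children; i++) {
                pids[i] = fork();
                if (pids[i] < 0)
                        return -1;
                if (pids[i] == 0)
                        _exit(0);
        }

        /* Wait until every child is a zombie; WNOWAIT leaves it reapable. */
        for (i = 0; i < n_children; i++)
                if (waitid(P_PID, pids[i], &si, WEXITED | WNOWAIT) < 0)
                        return -1;

        *ret_signals = drain_pending_sigchld();

        *ret_reaped = 0;
        for (;;) {
                si.si_pid = 0;
                if (waitid(P_ALL, 0, &si, WEXITED | WNOHANG) < 0 || si.si_pid == 0)
                        break;
                (*ret_reaped)++;
        }

        sigprocmask(SIG_SETMASK, &old, NULL);
        return 0;
}
```

On Linux, calling `coalescing_demo(15, &signals, &reaped)` should leave `signals` at 1 while `reaped` is 15, which is why a `waitid(P_ALL)` pass is needed in addition to reacting to individual SIGCHLDs.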

poettering · 2019-07-11T08:59:30Z

As discussed elsewhere, I figure this should work, as long as we get a guarantee that the metadata we get on the SIGCHLD is indeed the oldest metadata around, i.e. the kernel drops any new SIGCHLD, and never the already pending one if multiple are seen without them being handled.

@msekletar volunteered to prep a man page patch to document this kernel behaviour ;-)
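
The guarantee in question can be probed with a sketch like the following. It assumes Linux, where (as far as I understand signal(7)) the siginfo of an already pending standard signal is not overwritten by later instances. `spawn_zombie()` and `pending_sigchld_pid()` are hypothetical names. Two children exit one after the other while SIGCHLD is blocked, and the `si_pid` of the single pending SIGCHLD is compared against both PIDs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits at once and block until it
 * is a zombie. WNOWAIT leaves the child reapable by a later waitid(). */
static pid_t spawn_zombie(void) {
        siginfo_t si;
        pid_t pid = fork();

        if (pid == 0)
                _exit(0);
        if (pid < 0 || waitid(P_PID, pid, &si, WEXITED | WNOWAIT) < 0)
                return -1;
        return pid;
}

/* Let two children exit strictly in order while SIGCHLD is blocked and
 * report which PID the pending SIGCHLD carries. Returns 0 on success. */
static int pending_sigchld_pid(pid_t *ret_first, pid_t *ret_second, pid_t *ret_reported) {
        struct timespec zero = { 0, 0 };
        sigset_t set, old;
        siginfo_t si;
        int r = -1;

        assert(ret_first && ret_second && ret_reported);

        signal(SIGCHLD, SIG_DFL);       /* no auto-reaping via SIG_IGN */
        sigemptyset(&set);
        sigaddset(&set, SIGCHLD);
        if (sigprocmask(SIG_BLOCK, &set, &old) < 0)
                return -1;
        while (sigtimedwait(&set, &si, &zero) == SIGCHLD)
                ;                       /* discard anything stale */

        /* The first child is already a zombie before the second is forked. */
        *ret_first = spawn_zombie();
        *ret_second = spawn_zombie();

        if (*ret_first > 0 && *ret_second > 0 &&
            sigtimedwait(&set, &si, &zero) == SIGCHLD) {
                *ret_reported = si.si_pid;
                r = 0;
        }

        /* Clean up both children and restore the signal mask. */
        if (*ret_first > 0)
                (void) waitid(P_PID, *ret_first, &si, WEXITED);
        if (*ret_second > 0)
                (void) waitid(P_PID, *ret_second, &si, WEXITED);
        sigprocmask(SIG_SETMASK, &old, NULL);
        return r;
}
```

If the kernel keeps the oldest metadata, `*ret_reported` equals `*ret_first` and the second child's SIGCHLD is the one that gets dropped. Other kernels may behave differently, so this is a check of the assumption rather than proof of a portable guarantee.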

poettering · 2019-07-11T15:16:01Z

Hmm, so I wonder, does the ordering thing really make this work?

Let's say this happens:

1. Random process X dies, PID 1 gets SIGCHLD
2. PID 1 enters waitid() event handler, in order to start to process process X
2.1 PID 1 enters waitid() event handler a second time, to call waitid(P_ALL), to process pending processes whose SIGCHLD might have been dropped
2.2 (parallel to 2.1, early on) Our /bin/mount process (M) dies, PID 1 gets another SIGCHLD
2.3 Now the event handler that began in 2.1 gets to the part where it actually calls waitid(P_ALL), and gets the result of our process M, and handles that first
3. PID 1 processes /p/s/mi, too late

i.e. the change to the man page is good, but the behaviour it describes is not sufficient to make this PR here work, or what am I missing?

msekletar · 2019-07-11T17:06:21Z

Hmm, I assumed that we would dispatch the signalfd event source with higher priority if the mount already exited and we have a pending SIGCHLD, i.e. before the second round of waitid() with P_ALL. The idea is to always reap the process for which we got an explicit SIGCHLD and only then call waitid() with P_ALL.

But if that is not the case, then the question is whether we can make it so w/o breaking anything?

msekletar · 2019-07-15T10:44:31Z

@poettering you are probably right. This patch would probably improve the situation a bit but is not sufficient in itself.

yuwata added the pid1 label Jul 3, 2019

msekletar force-pushed the mountinfo-sigchld-race branch from 0c13bfb to eb653ed on July 3, 2019 10:56

msekletar closed this Jul 22, 2019
