Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-33613, test_semaphore_tracker_sigint: fix race condition #7850

Merged
merged 21 commits into from Sep 4, 2018

Conversation

pablogsal
Copy link
Member

@pablogsal pablogsal commented Jun 21, 2018

Fail test_semaphore_tracker_sigint if no warnings are expected and
one is received.

Fix race condition when the child receives SIGINT
before it can register signal handlers for it.

The race condition occurs when the parent calls
_semaphore_tracker.ensure_running() (which in turn spawns the
semaphore_tracker using _posixsubprocess.fork_exec), the child
registers the signal handlers and the parent tries to kill the child.
What seems to happen is that in some slow systems, the parent sends the
signal to kill the child before the child protects against the signal.

There is no reliable and portable solution for the parent to wait until
the child has register the signal handlers to send the signal to kill
the child so a sleep is introduced between the spawning of the child
and the parent sending the signal to give time to the child to register
the handlers.

https://bugs.python.org/issue33613

Fail `test_semaphore_tracker_sigint` if no warnings are expected and
one is received.

Fix race condition when the child receives SIGINT
before it can register signal handlers for it.

The race condition occurs when the parent calls
`_semaphore_tracker.ensure_running()` (which in turn spawns the
semaphore_tracker using `_posixsubprocess.fork_exec`), the child
registers the signal handlers and the parent tries to kill the child.
What seem to happen is that in some slow systems, the parent sends the
signal to kill the child before the child protects against the signal.

There is no reliable and portable solution for the parent to wait until
the child has register the signal handlers to send the signal to kill
the child so a `sleep` is introduced between the spawning of the child
and the parent sending the signal to give time to the child to register
the handlers.
@taleinat
Copy link
Contributor

Regarding the unrelated change of the warning check, FWIW I find the existing code (without the change in this PR) clearer.

@pablogsal
Copy link
Member Author

pablogsal commented Jun 24, 2018

@taleinat I am happy to undo that change. The problem is that the test was silently failing when SIGINT was being delivered as the test does not check that no warnings are raised. Notice that the presence of a warning means that the process died and therefore SIGINT did kill the process, which is precisely what the test checks that does not happen.

I think in order to check that the test works as intended without the need to run the suite with -Wall, we need to modify the test.

@pablogsal pablogsal force-pushed the bpo33613 branch 3 times, most recently from 4b8f76f to 9408236 Compare June 24, 2018 18:35
time.sleep(0.5) # give it time to die
old_stderr = sys.stderr
r, w = os.pipe()
try:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that we cannot use test.support.captured_stderr() as it does not support fileno().

old_stderr = sys.stderr
r, w = os.pipe()
try:
sys.stderr = open(w, "bw")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line should be before the try:.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d075473

# information.
_semaphore_tracker._send("PING", "")
with open(r, "rb") as pipe:
data = pipe.readline()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we okay with this potentially hanging indefinitely if something goes wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit d075473 implements a way to fail if reading takes too long.

@pablogsal
Copy link
Member Author

@pitrou Could you take a look at this?

Here is a summary of all the changes to make reviewing this easier:

  1. Updated the tracker to answer to a PING command to we know when is alive.

  2. The test function check_semaphore_tracker_death restarts the tracker every time to avoid interference between test runs and allow parallel testing.

  3. The test function check_semaphore_tracker_death now checks for warnings as these are indicative of the tracker being dead and restarted (this was one of the main errors and the reason tests were passing without the -W error option).

  4. Fixed the race condition in the tests: the parent was killing the child before the child could protect itself against the signals. I was able to reproduce the problem on one of the slowest buildbots (gcc112.fsffrance.org and gcc110.fsffrance.org) and I can confirm that this PR fixes the problem.

@pitrou
Copy link
Member

pitrou commented Jul 24, 2018

From a high-level POV:

Fixed the race condition in the tests: the parent was killing the child before the child could protect itself against the signals. I was able to reproduce the problem on one of the slowest buildbots (gcc112.fsffrance.org and gcc110.fsffrance.org) and I can confirm that this PR fixes the problem.

Is that important? If we have very slow buildbots, we could skip the test on them. Testing this piece of functionality on all buildbots is not critical.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we validate this approach, there are still problems with the implementation.

r, w = os.pipe()
try:
fds_to_pass.append(r)
# process will out live us, so no need to wait on pid
exe = spawn.get_executable()
args = [exe] + util._args_from_interpreter_flags()
args += ['-c', cmd % r]
args += ['-c', cmd.format(r, sys.stderr.fileno())]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work if sys.stderr isn't an actual file:

>>> io.StringIO().fileno()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
io.UnsupportedOperation: fileno

Copy link
Member Author

@pablogsal pablogsal Jul 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was relaying on the fact that the tracker seems to work under the assumption that sys.stderr has a file descriptor associated:

https://github.com/python/cpython/blob/master/Lib/multiprocessing/semaphore_tracker.py#L60

Relevant lines:

            fds_to_pass = []
            try:
                fds_to_pass.append(sys.stderr.fileno())
            except Exception:
                pass

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou I am missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the except Exception should be clear, no? :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ups, my bad. :)

Let me investigate what options do we have. Do you have a preferred approach on how to handle this?

@@ -128,6 +128,8 @@ def main(fd):
cache.add(name)
elif cmd == b'UNREGISTER':
cache.remove(name)
elif cmd == b'PING':
os.write(fd_write, b"PONG\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.write may not write all bytes. You probably want to wrap fd_write in a buffered writer as is done for fd above.

@bedevere-bot
Copy link

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@pablogsal pablogsal force-pushed the bpo33613 branch 2 times, most recently from e16e027 to 9460da4 Compare July 24, 2018 12:29
@pablogsal
Copy link
Member Author

Is that important? If we have very slow buildbots, we could skip the test on them. Testing this piece of functionality on all buildbots is not critical.

That was one of the original problems in the issue. It does not happen only on the slowest buildbots, but these are a reliable place to test for the race condition to happen. In my humble opinion, the existence of the race makes the tests more unreliable and we should fix that, but I you think there is a better approach or a different compromise, I am more than happy to go for that.

@vstinner
Copy link
Member

vstinner commented Sep 3, 2018

@pitrou wrote: "I'm a bit uneasy with this. What if spawnv_passfds takes a very long time for some reason? The user will try to stop it using ^C... and nothing will happen."

You are true that SIGINT is blocked while spawnv_passfds() is running, but spawnv_passfds() should be quick in the parent: to oversimplify, it just calls fork() which should be very quick (it's not a O(n) operating thanks to copy-on-write). I prefer to see this race condition fixed. IMHO the short time window where CTRL+c is blocked is small enough to be acceptable.

Note: signals are not lost or ignored, it's just that signals are only handled once spawnv_passfds() completes (once the signals are unblocked).

I suggest to backport the change to 2.7, 3.6 and 3.7 branches to make our buildbots more reliable.

@vstinner
Copy link
Member

vstinner commented Sep 3, 2018

@pitrou: @pablogsal updated his PR, and it now LGTM. Would you mind to have a second look?

Note: this PR has a long history since @pablogsal chose to rewrite his PR with a different approach ("ping" then pthread_sigmask) rather than creating a new PR.

IMHO pthread_sigmask is smpler and more reliable than the "ping" idea. There is no need to establish (and then close, reliably) a new communication channel with pthread_sigmask.

@pitrou
Copy link
Member

pitrou commented Sep 4, 2018

Fair enough. I think this is good to go.

@pitrou pitrou merged commit ec74d18 into python:master Sep 4, 2018
@bedevere-bot
Copy link

@pitrou: Please replace # with GH- in the commit message next time. Thanks!

@miss-islington
Copy link
Contributor

Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.6.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

@miss-islington
Copy link
Contributor

Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 2.7.
🐍🍒⛏🤖

@miss-islington
Copy link
Contributor

Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

@miss-islington
Copy link
Contributor

Sorry, @pablogsal and @pitrou, I could not cleanly backport this to 2.7 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker ec74d187f50a8a48f94eb37023300583fbd644cc 2.7

@bedevere-bot
Copy link

GH-9055 is a backport of this pull request to the 3.7 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 4, 2018
…H-7850)

Fail `test_semaphore_tracker_sigint` if no warnings are expected and one is received.

Fix race condition when the child receives SIGINT before it can register signal handlers for it.

The race condition occurs when the parent calls
`_semaphore_tracker.ensure_running()` (which in turn spawns the
semaphore_tracker using `_posixsubprocess.fork_exec`), the child
registers the signal handlers and the parent tries to kill the child.
What seem to happen is that in some slow systems, the parent sends the
signal to kill the child before the child protects against the signal.
(cherry picked from commit ec74d18)

Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>
@miss-islington
Copy link
Contributor

Sorry, @pablogsal and @pitrou, I could not cleanly backport this to 3.6 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker ec74d187f50a8a48f94eb37023300583fbd644cc 3.6

@pitrou
Copy link
Member

pitrou commented Sep 4, 2018

Please don't backport this. This isn't fixing a user-visible bug.

@vstinner
Copy link
Member

vstinner commented Sep 4, 2018

Please don't backport this. This isn't fixing a user-visible bug.

It hurts on the buildbots, and so it makes our life harder (Pablo and me who watch the random failures on buildbots).

@pitrou
Copy link
Member

pitrou commented Sep 4, 2018

Right, but we shouldn't modify library code to heal the buildbots (and risk introducing regressions).

@vstinner
Copy link
Member

vstinner commented Sep 4, 2018

If you don't want to backport the code, I suggest to remove or skip the test in 3.6 and 3.7 branches. I don't want to have known race condition in our test suite.

@pitrou
Copy link
Member

pitrou commented Sep 4, 2018

I'm ok with skipping the tests on the buildbots, as long as it's not skipped unconditionally.

@pablogsal pablogsal deleted the bpo33613 branch May 19, 2021 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants