test_os causes delayed failure on x86 gentoo buildbot: Unknown signal 32 #49220
Comments
The x86 gentoo 3.0 and 3.x buildbots have been failing for a while at the test stage, with: make: *** [buildbottest] Unknown signal 32 I noticed a common denominator with these failures, which is that they always seem to occur a few tests after Can anyone reproduce this (I can't on any of the machines I have access to), and (e.g., by trial and error) identify |
Another observation is that after test_os has been run, the first test to |
...and there's a related message from Neal Norwitz at: http://mail.python.org/pipermail/python-3000/2007-August/009944.html I suspect that python Lib/test/regrtest.py test_os test_wait3 (possibly with some additional flags to regrtest.py) should Neal, does this help to identify the problem at all? Any suggestions |
It would also be interesting to know whether Neal's system is using the |
I was unable to reproduce this using the suggested regrtest pair, even |
[...] |
srid, I'm not sure why you added your comment; a couple of sentences If you're able to reproduce this failure and have time to figure out where |
Sorry about the late response; have been busy of late. I believe this error ("Unknown signal 32") appears consistently in I am attaching the entire log file. I don't have much time to investigate into this relatively less- |
.. and here are the machine details: apy@gila: |
libc used is of version 2.3.2. apy@gila: |
Prelude has had the same problem with signal 32: To know which threads implementation your glibc is using, you can run (of course, the question is, since the signal is used by linuxthreads, |
Sridhar, Neal, I would advocate disabling (commenting out) That's the one really dirty test in test_os, it might close a file |
Forget the last comment, test_closerange is fine... |
This type of failure appears again in current builds: http://www.python.org/dev/buildbot/builders/x86%20gentoo%203.x/builds/2160/steps/test/logs/stdio |
Unfortunately, I think you mean 'still' rather than 'again'. :) Maybe it's time to do something. I propose we:
I'll try to start this this evening (no ssh access at the moment) unless someone else beats me to it. |
Should also modify regrtest to print out the result of the command getconf GNU_LIBPTHREAD_VERSION that Antoine suggested. |
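A minimal sketch of how a test runner could capture that value, assuming the `getconf` utility is on the PATH. The helper name `libpthread_version` is hypothetical, and this is not the code that was added to regrtest:

```python
import subprocess


def libpthread_version():
    """Return the output of `getconf GNU_LIBPTHREAD_VERSION`, or None.

    None is returned when getconf is missing or does not know the variable
    (for example on a non-glibc system).
    """
    try:
        result = subprocess.run(
            ["getconf", "GNU_LIBPTHREAD_VERSION"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


print("libpthread:", libpthread_version())
```

A value starting with "linuxthreads" would point at the old threading implementation; a value starting with "NPTL" would point at the newer one.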
This bugreport http://bugs.gentoo.org/28193 indeed suggests that Would it be possible for someone with an affected system to run |
Signal 32 is the first real-time signal, and is indeed used by linuxthreads, so it's very likely a linuxthreads bug, since this signal shouldn't leak to the application. |
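As a rough illustration, the lowest real-time signal number that applications are allowed to use can be inspected from Python. This is only a sketch: it assumes a platform where the `signal` module exposes `SIGRTMIN`, and the variable name `sigrtmin` is just for this example. The threading library usually reserves the lowest real-time signals for its own use, so the value reported here is typically higher than 32.

```python
import signal

# SIGRTMIN is only defined on platforms with real-time signals (e.g. Linux).
sigrtmin = getattr(signal, "SIGRTMIN", None)

if sigrtmin is None:
    print("this platform does not expose real-time signals")
else:
    # Signals below this number in the real-time range are kept back by
    # the C library, so they should never reach application code.
    print("lowest real-time signal available to applications:", int(sigrtmin))
```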
Agreed. But I think it's still worth trying to narrow down (and possibly work around) the cause of failure, just so that we can make this buildbot useful again on 3.x. Perhaps we can get by with a conditional skip of one of the test_os tests, but we have to figure out which one first. :) |
Extract of the Prelude ticket https://dev.prelude-ids.com/issues/show/133 : "commenting out sigprocmask(SIG_SETMASK, &set, NULL) seems to fixes the problem (...)" |
Results of my simple-minded strategy (see r80033-4, r80037, r80042, r80045, r80047-51): test_execvpe_with_bad_program in ExecTests by itself is enough to trigger the signal 32 error (in combination with test_wait3). See: http://www.python.org/dev/buildbot/builders/x86%20gentoo%203.x/builds/2174 If just this single test is disabled and all other tests in test_os are allowed to run, there's no problem (at least with test_wait3; I haven't tried re-enabling *all* other tests in the test suite yet). See: http://www.python.org/dev/buildbot/builders/x86%20gentoo%203.x/builds/2176 |
Here's some fairly minimal Python code that produces the signal:

### begin example ###
import os
import time
import _thread

try:
    # The body of this try block is missing from this copy of the comment.
    # Going by the explanation below, it was an os.execv call that fails;
    # the path used here is a placeholder, not the original one.
    os.execv('/no/such/program', ['no-such-program'])
except OSError:
    pass

def f():
    time.sleep(1.0)  # probably irrelevant to the failure

_thread.start_new(f, ())
### end example ###

It looks as though the failed os.execv call messes something up internally, so that any attempt thereafter to start a thread produces this signal. I can't see anything obviously wrong with the os.execv implementation (see posix_execv in Modules/posixmodule.c).

There's still the question of what changed between 2.x and 3.x: on 2.x, this buildbot seems perfectly happy. |
Okay, I think I've got as far as I can, but if anyone else wants to hack on this, please do. The branch name is py3k-issue4970 In that branch:
|
signal.__dict__: |
There are many references to "unknown signal 32" errors in Google.

Gdb mailing list, December 2002: "SIG32/SIGTRAP issues"
Gdb mailing list, September 2003: "pthread_create, Program received signal ?, Unknown signal"

There is a thread "SIGRT_0 (Unknown signal 32)" in the Linux Kernel mailing list, in July 2005:

Extract of a Debian bug report:
open("/dev/sequencer", O_WRONLY) = 10
(...)

This is not very helpful, as it means the thread is waiting for another |
NPTL was introduced in Linux kernel 2.6(.0). glibc 2.4 requires NPTL: "The LinuxThreads add-on, providing pthreads on Linux 2.4 kernels, is no longer supported. The new NPTL implementation requires Linux 2.6 kernels. For a libc and libpthread that works well on Linux 2.4 kernels, we recommend using the stable 2.3 branch." NPTL 0.1 was released in September 2002. So the bug requires a Linux kernel 2.4 (and glibc 2.3.x). |
I suggest simply skipping the "offending" test on linuxthreads platforms. |
Good idea
I would prefer to rely on confstr():

import os
try:
    # 'linuxthreads-0.10' or 'NPTL 2.10.2'
    pthread = os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    linuxthreads = pthread.startswith("linuxthreads")
except ValueError:
    linuxthreads = False

^^ this example requires the attached patch for the two CS_GNU_* constants.

Which tests should be disabled? |
Skipping test_execvpe_with_bad_program sounds good to me. I'd ideally like to understand why 3.x is failing where 2.x is happy, but life's too short to stuff a mushroom. |
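A conditional skip along these lines could look roughly like the following. This is a sketch rather than the patch that was committed: the helper `uses_linuxthreads` is a hypothetical name, the test body is only illustrative, and it assumes `os.confstr` knows the `CS_GNU_LIBPTHREAD_VERSION` name (otherwise the helper simply reports False).

```python
import os
import unittest


def uses_linuxthreads():
    """Best-effort check for a linuxthreads-based libc (hypothetical helper)."""
    try:
        version = os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    except (AttributeError, ValueError, OSError):
        # No os.confstr, unknown name, or the query failed: assume not affected.
        return False
    return bool(version) and version.startswith("linuxthreads")


class ExecTests(unittest.TestCase):
    @unittest.skipIf(uses_linuxthreads(),
                     "a failed exec breaks later thread creation on linuxthreads")
    def test_execvpe_with_bad_program(self):
        # Illustrative body: exec'ing a program that does not exist must
        # raise OSError and leave the current process running.
        self.assertRaises(OSError, os.execvpe,
                          "no-such-program", ["no-such-program"], os.environ)
```

Evaluating the check once at import time keeps the decision visible in the test report as a skip, instead of silently passing on affected machines.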
Upon execve, signal handlers are reset to default. So maybe the error makes the linuxthreads API screw up later when it tries to set up handlers for SIGRTMIN and friends. But what's weird is that when the executable given does not exist, the call should fail and return before having done anything...
I think it's simply because we didn't test a wrong program path with execve in 2.X version of test_os. |
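For reference, the behaviour being exercised, an exec call on a missing program that fails and returns control to the caller, can be sketched as below. The path and the helper name `try_exec_missing` are placeholders for this example; on a linuxthreads system this is the kind of call that was followed by the signal 32 failure.

```python
import errno
import os


def try_exec_missing(path="/no/such/program"):
    """Attempt to exec a nonexistent program and return the resulting errno."""
    try:
        os.execv(path, [path])
    except OSError as exc:
        # The exec failed, so the current process image is untouched and
        # execution simply continues here.
        return exc.errno


err = try_exec_missing()
print("execv failed with", errno.errorcode.get(err, err))
```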
Oh, we should add this test to Python2 ;-) |
D'oh! Thank you very much. I'm happy now: my mushroom's stuffed. :) |
Thanks for taking care of this guys. Sorry, I got swamped with mail Let me know if there's anything you need. I may not have access to |
Victor, that patch looks fine to me. Do you want to go ahead and apply it, and add the skip to test_execvpe_with_bad_program? The fix should be backported to 3.1, but not to 2.x (I think), since we don't have a problem there, and arguably the new os.confstr items could be considered a new feature. |
And just for reference, http://www.python.org/dev/buildbot/builders/x86%20gentoo%203.x/builds/2192 shows that the relevant versions on this machine are: glibc 2.3.4 and from http://www.python.org/dev/buildbot/builders/x86%20gentoo%203.x, it looks like the kernel version is 2.6.9 (actually, 2.6.9-gentoo-r1). |
Committed as r80108 to py3k: "Add CS_GNU_LIBC_VERSION and CS_GNU_LIBPTHREAD_VERSION constants for constr(), and disable test_execvpe_with_bad_program() of test_os if the libc uses linuxthreads to avoid the "unknown signal 32" bug (see issue bpo-4970)." I'll wait for the buildbot results before porting it to trunk (and maybe 2.6 and 3.1). |
Fix merged to release31-maint in r80119. Thanks, Victor. |