test_os causes delayed failure on x86 gentoo buildbot: Unknown signal 32 #49220

Closed
mdickinson opened this issue Jan 17, 2009 · 37 comments
Labels
tests Tests in the Lib/test dir type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@mdickinson
Member

BPO 4970
Nosy @mdickinson, @pitrou, @vstinner, @bitdancer, @skrah
Files
  • apy3.1.1-linux-x86-apy_test.log: Log of test run using ActivePython 3.1.1.2
  • confstr_libpthread.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = <Date 2010-04-16.16:34:46.245>
    created_at = <Date 2009-01-17.11:09:45.100>
    labels = ['tests', 'type-crash']
    title = 'test_os causes delayed failure on x86 gentoo buildbot: Unknown signal 32'
    updated_at = <Date 2010-04-16.16:34:46.243>
    user = 'https://github.com/mdickinson'

    bugs.python.org fields:

    activity = <Date 2010-04-16.16:34:46.243>
    actor = 'mark.dickinson'
    assignee = 'vstinner'
    closed = True
    closed_date = <Date 2010-04-16.16:34:46.245>
    closer = 'mark.dickinson'
    components = ['Tests']
    creation = <Date 2009-01-17.11:09:45.100>
    creator = 'mark.dickinson'
    dependencies = []
    files = ['14797', '16912']
    hgrepos = []
    issue_num = 4970
    keywords = ['patch', 'buildbot']
    message_count = 37.0
    messages = ['80009', '80010', '80011', '90390', '90396', '91564', '91601', '92047', '92048', '92049', '94670', '94672', '94674', '102985', '103020', '103021', '103033', '103039', '103040', '103043', '103046', '103055', '103057', '103061', '103068', '103069', '103071', '103076', '103077', '103079', '103080', '103081', '103110', '103136', '103137', '103316', '103339']
    nosy_count = 8.0
    nosy_names = ['nnorwitz', 'mark.dickinson', 'pitrou', 'vstinner', 'r.david.murray', 'srid', 'skrah', 'neologix']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue4970'
    versions = ['Python 3.0', 'Python 3.1', 'Python 3.2', 'Python 3.3']

    @mdickinson
    Member Author

    The x86 gentoo 3.0 and 3.x buildbots have been failing for a while at the test stage, with:

    make: *** [buildbottest] Unknown signal 32
    program finished with exit code 2

    I noticed a common denominator with these failures, which is that they always seem to occur a few tests after
    test_os has been run. So it looks as though something in test_os is causing this.

    Can anyone reproduce this (I can't on any of the machines I have access to), and (e.g., by trial and error) identify
    which of the test_os tests is causing this?

    @mdickinson
    Member Author

    Another observation is that after test_os has been run, the first test to
    actually cause the 'unknown signal' failure always seems to be one that
    involves threads (e.g., test_wait3, or test_queue, or test_logging...)

    @mdickinson
    Member Author

    ...and there's a related message from Neal Norwitz at:

    http://mail.python.org/pipermail/python-3000/2007-August/009944.html

    I suspect that

    python Lib/test/regrtest.py test_os test_wait3

    (possibly with some additional flags to regrtest.py) should
    be enough to reliably reproduce the failure.

    Neal, does this help to identify the problem at all? Any suggestions
    about how to go about debugging this?

    @mdickinson
    Member Author

    It would also be interesting to know whether Neal's system is using the
    LinuxThreads library, or whether it's using NPTL. If it's the former, it
    might go some way to explaining the problem.

    @bitdancer
    Member

    I was unable to reproduce this using the suggested regrtest pair, even
    if I ran -R ::, on Gentoo, kernel 2.6.30, with nptl.

    @srid
    Mannequin

    srid mannequin commented Aug 14, 2009

    [...]
    test_poll
    test_popen
    test_poplib
    stub-asunix.sh: line 238: 25474 Unknown signal 32 $PYTHON $installdir/lib/python?.?/test/regrtest.py -w -u all,-curses,-audio,-network -x $SKIPS
    stub: core Python test suite FAILED (retval: 160)

    @mdickinson
    Member Author

    srid, I'm not sure why you added your comment; a couple of sentences
    explaining where the output you posted comes from (what machine, what
    version of Python, under what circumstances) would be really useful.

    If you're able to reproduce this failure and have time to figure out where
    it's coming from, that would be fantastic.

    @srid
    Mannequin

    srid mannequin commented Aug 28, 2009

    Sorry about the late response; have been busy of late.

    I believe this error ("Unknown signal 32") appears consistently in
    3.0.1, 3.1rc1, 3.1 and 3.1.1. It appears only on Linux x86. (64-bit has
    failures of a different kind.)

    I am attaching the entire log file.

    I don't have much time to investigate this relatively less important
    issue in detail, but if you need any further information, please let me
    know. I will be happy to provide it.

    @srid
    Mannequin

    srid mannequin commented Aug 28, 2009

    .. and here are the machine details:

    apy@gila:~> uname -a
    Linux gila 2.4.21-297-default #1 Sat Jul 23 07:47:39 UTC 2005 i686 i686 i386 GNU/Linux
    apy@gila:~> cat /etc/*release
    LSB_VERSION="1.3"
    DISTRIB_ID="SuSE"
    DISTRIB_RELEASE="9.0"
    DISTRIB_DESCRIPTION="SuSE Linux 9.0 (i586)"
    SuSE Linux 9.0 (i586)
    VERSION = 9.0
    apy@gila:~>

    @srid
    Mannequin

    srid mannequin commented Aug 28, 2009

    libc used is of version 2.3.2.


    apy@gila:~> ldd rrun/tmp/autotest/ActivePython-3.1.1.2-linux-x86/INSTALLDIR/bin/python3
    libpthread.so.0 => /lib/i686/libpthread.so.0 (0x4002f000)
    libdl.so.2 => /lib/libdl.so.2 (0x40080000)
    libutil.so.1 => /lib/libutil.so.1 (0x40083000)
    libm.so.6 => /lib/i686/libm.so.6 (0x40086000)
    libc.so.6 => /lib/i686/libc.so.6 (0x400aa000)
    /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
    apy@gila:~> /lib/i686/libc.so.6
    GNU C Library stable release version 2.3.2 (20030827), by Roland
    McGrath et al.
    Copyright (C) 2003 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.
    There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
    PARTICULAR PURPOSE.
    Configured for i686-suse-linux.
    Compiled by GNU CC version 3.3.1 (SuSE Linux).
    Compiled on a Linux 2.6.0-test3 system on 2003-09-23.
    Available extensions:
    GNU libio by Per Bothner
    crypt add-on version 2.1 by Michael Glad and others
    linuxthreads-0.10 by Xavier Leroy
    NoVersion patch for broken glibc 2.0 binaries
    BIND-8.2.3-T5B
    libthread_db work sponsored by Alpha Processor Inc
    NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
    Thread-local storage support included.
    pthread library is compiled with floating stack support enabled.
    Report bugs using the `glibcbug' script to <bugs@gnu.org>.
    apy@gila:~>

    @pitrou
    Member

    pitrou commented Oct 29, 2009

    Prelude has had the same problem with signal 32:
    https://dev.prelude-ids.com/issues/show/133
    According to their research, it is due to the linuxthreads
    implementation of the pthread API.

    To know which threads implementation your glibc is using, you can run
    "getconf GNU_LIBPTHREAD_VERSION" (on a modern system, it should print
    something like "NPTL 2.9").

    (of course, the question is, since the signal is used by linuxthreads,
    why doesn't it get caught instead of killing the process?)
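
The `getconf` check above can also be run from Python. This is only a minimal sketch, assuming a `getconf` binary is on the PATH; the helper names `libpthread_version` and `uses_linuxthreads` are hypothetical and not part of any patch in this issue.

```python
import subprocess


def libpthread_version():
    """Return the output of `getconf GNU_LIBPTHREAD_VERSION`, or None if unavailable."""
    try:
        result = subprocess.run(
            ["getconf", "GNU_LIBPTHREAD_VERSION"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # getconf is missing, or the variable is not known on this platform.
        return None
    return result.stdout.strip() or None


def uses_linuxthreads(version):
    """True if a version string such as 'linuxthreads-0.10' names LinuxThreads."""
    return version is not None and version.lower().startswith("linuxthreads")


if __name__ == "__main__":
    version = libpthread_version()
    print("pthread implementation:", version)
    print("linuxthreads:", uses_linuxthreads(version))
```

Returning None instead of raising keeps the check usable on platforms where the variable is not defined.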

    @pitrou
    Member

    pitrou commented Oct 29, 2009

    Sridhar, Neal, I would advocate disabling (commenting out)
    test_closerange in Lib/test/test_os.py and see what happens.

    That's the one really dirty test in test_os, it might close a file
    handle linuxthreads is relying on.

    @pitrou
    Member

    pitrou commented Oct 29, 2009

    Forget the last comment, test_closerange is fine...

    @skrah
    Mannequin

    skrah mannequin commented Apr 12, 2010

    This type of failure appears again in current builds:

    http://www.python.org/dev/buildbot/builders/x86%20gentoo%203.x/builds/2160/steps/test/logs/stdio

    @mdickinson
    Member Author

    > This type of failure appears again in current builds:

    Unfortunately, I think you mean 'still' rather than 'again'. :)
    As far as I can tell, the failure's never gone away, though it may have been obscured by other failures from time to time.

    Maybe it's time to do something. I propose we:

    1. create a new branch py3k-issue4970
    2. Hack Lib/test/regrtest.py in that branch so that it runs
      *only* test_os and test_wait3, in that order (ignoring the -r
      flag). Check that we're still getting the failure.
    3. Do a binary search (remove half the test_os tests; trigger
      buildbot run; see if we're still getting the signal; rinse;
      repeat) to narrow down the cause to a particular test.
    4. While doing 3, ruthlessly kill all other non-trunk
      checkin-triggered buildbot runs on this machine
      to speed up the search process a bit. (Keeping trunk builds
      for the sake of the upcoming 2.7 release.)

    I'll try to start this this evening (no ssh access at the moment) unless someone else beats me to it.
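
The binary search in step 3 can be sketched roughly as follows. This is an illustration only: `find_culprit` and the `triggers_failure` callback are hypothetical names, and the sketch assumes that the list is non-empty, that exactly one test is responsible, and that each trial run (here, one buildbot run per call) gives a deterministic answer.

```python
def find_culprit(tests, triggers_failure):
    """Narrow a list of test names down to the one that triggers the failure.

    `triggers_failure(subset)` must return True when running only `subset`
    (followed by test_wait3, in the setup described above) still produces
    the signal.
    """
    candidates = list(tests)
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if triggers_failure(first_half):
            # The culprit is in the half we just ran.
            candidates = first_half
        else:
            # Otherwise it must be in the half we removed.
            candidates = candidates[middle:]
    return candidates[0]
```

Each iteration halves the candidate list, so about log2(N) trial runs are needed for N tests.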

    @mdickinson
    Member Author

    Should also modify regrtest to print out the result of the command

    getconf GNU_LIBPTHREAD_VERSION

    that Antoine suggested.

    @skrah
    Mannequin

    skrah mannequin commented Apr 13, 2010

    This bugreport http://bugs.gentoo.org/28193 indeed suggests that
    the failure occurs on systems without nptl.

    Would it be possible for someone with an affected system to run
    the test program from the bug report?

    @neologix
    Mannequin

    neologix mannequin commented Apr 13, 2010

    Signal 32 is the first real-time signal, and is indeed used by linuxthreads, so it's very likely a linuxthreads bug, since this signal shouldn't leak to the application.
    Since linuxthreads is no longer maintained, I'm afraid we can't do much about this, except check for the threading library used and say "linuxthreads is obsolete and has known issues - please upgrade your system for reliable threading support".

    @mdickinson
    Member Author

    > Since linuxthreads is no longer maintained, I'm afraid we can't do much
    > about this.

    Agreed. But I think it's still worth trying to narrow down (and possibly work around) the cause of failure, just so that we can make this buildbot useful again on 3.x. Perhaps we can get by with a conditional skip of one of the test_os tests, but we have to figure out which one first. :)

    @vstinner
    Member

    Extract of the Prelude ticket https://dev.prelude-ids.com/issues/show/133 : "commenting out sigprocmask(SIG_SETMASK, &set, NULL) seems to fixes the problem (...)"

    @mdickinson
    Member Author

    Results of my simple-minded strategy (see r80033-4, r80037, r80042, r80045, r80047-51):

    test_execvpe_with_bad_program in ExecTests by itself is enough to trigger the signal 32 error (in combination with test_wait3). See:

    http://www.python.org/dev/buildbot/builders/x86%20gentoo%203.x/builds/2174

    If just this single test is disabled and all other tests in test_os are allowed to run, there's no problem (at least with test_wait3; I haven't tried re-enabling *all* other tests in the test suite yet). See:

    http://www.python.org/dev/buildbot/builders/x86%20gentoo%203.x/builds/2176

    @mdickinson
    Member Author

    Here's some fairly minimal Python code that produces the signal:

    ### begin example ###
    import os
    import time
    import _thread

    try:
        os.execv('/usr/bin/dorothyq', ['dorothyq'])
    except OSError:
        pass

    def f():
        time.sleep(1.0)  # probably irrelevant to the failure
    
    _thread.start_new(f, ())
    ### end example ###

    It looks as though the failed os.execv call messes something up internally, so that any attempt thereafter to start a thread produces this signal. I can't see anything obviously wrong with the os.execv implementation (see posix_execv in Modules/posixmodule.c).

    There's still the question of what changed between 2.x and 3.x: on 2.x, this buildbot seems perfectly happy.

    @mdickinson
    Member Author

    Okay, I think I've got as far as I can, but if anyone else wants to hack on this, please do.

    The branch name is py3k-issue4970

    In that branch:

    @vstinner
    Member

    signal.__dict__:
    {...
    'NSIG': 65,
    'SIGABRT': 6,
    'SIGALRM': 14,
    'SIGBUS': 7,
    'SIGCHLD': 17,
    'SIGCLD': 17,
    'SIGCONT': 18,
    'SIGFPE': 8,
    'SIGHUP': 1,
    'SIGILL': 4,
    'SIGINT': 2,
    'SIGIO': 29,
    'SIGIOT': 6,
    'SIGKILL': 9,
    'SIGPIPE': 13,
    'SIGPOLL': 29,
    'SIGPROF': 27,
    'SIGPWR': 30,
    'SIGQUIT': 3,
    'SIGRTMAX': 64,
    'SIGRTMIN': 35,
    'SIGSEGV': 11,
    'SIGSTOP': 19,
    'SIGSYS': 31,
    'SIGTERM': 15,
    'SIGTRAP': 5,
    'SIGTSTP': 20,
    'SIGTTIN': 21,
    'SIGTTOU': 22,
    'SIGURG': 23,
    'SIGUSR1': 10,
    'SIGUSR2': 12,
    'SIGVTALRM': 26,
    'SIGWINCH': 28,
    'SIGXCPU': 24,
    'SIGXFSZ': 25,
    'SIG_DFL': 0,
    'SIG_IGN': 1,
    ...}

    @vstinner
    Member

    There are many references to "unknown signal 32" errors in Google.

    Gdb mailing list, December 2002: "SIG32/SIGTRAP issues"
    http://sources.redhat.com/ml/gdb/2002-12/msg00057.html

    Gdb mailing list, September 2003: "pthread_create, Program received signal ?, Unknown signal"
    http://sources.redhat.com/ml/gdb/2003-09/msg00003.html
    => extract: "A change in the definition of SIGRTMIN
    causes this symptom."

    There is a thread "SIGRT_0 (Unknown signal 32)" in the Linux Kernel mailing list, in July 2005:
    http://lkml.org/lkml/2005/7/30/93

    Extract of a Debian bug report:
    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=298982
    -------------
    There are actually three earlier instances in the trace of
    rt_sigsuspend([] <unfinished ...>
    in those cases it gets "SIGRTMIN (Unknown signal 32)" and carries on.
    The one prior to the hanging instance is

    open("/dev/sequencer", O_WRONLY) = 10
    ioctl(10, SNDCTL_SEQ_NRSYNTHS, 0x40095b20) = 0
    ioctl(10, SNDCTL_SEQ_NRMIDIS, 0x40094c20) = 0
    rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0
    write(9, " \357\24@\0\0\0\0P\370\377\277\240\341\16@\220\376\23\10"..., 148) = 148
    rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0
    rt_sigsuspend([] <unfinished ...>
    --- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
    <... rt_sigsuspend resumed> ) = -1 EINTR (Interrupted system call)
    sigreturn() = ? (mask now [RTMIN])

    (...)

    This is not very helpful, as it means the thread is waiting for another
    thread, nothing else. Could you run it with strace -f ?
    -------------

    @vstinner
    Member

    NPTL was introduced in Linux kernel 2.6(.0). glibc 2.4 requires NPTL:

    "The LinuxThreads add-on, providing pthreads on Linux 2.4 kernels, is no longer supported. The new NPTL implementation requires Linux 2.6 kernels. For a libc and libpthread that works well on Linux 2.4 kernels, we recommend using the stable 2.3 branch."

    NPTL 0.1 was released in September 2002. So the bug requires a Linux kernel 2.4 (and glibc 2.3.x).

    @pitrou
    Member

    pitrou commented Apr 13, 2010

    I suggest simply skipping the "offending" test on linuxthread platforms.
    (perhaps as simple as checking for sys.platform == "linux2" and signal.SIGRTMIN == 35)
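
A rough sketch of that heuristic, for illustration only. The function name `looks_like_linuxthreads` is hypothetical, and the value 35 is taken from the `signal.__dict__` dump earlier in this thread rather than from any documented guarantee. `sys.platform` is `"linux2"` on the Python versions discussed here and `"linux"` on later ones, so the sketch matches on the prefix.

```python
import signal
import sys


def looks_like_linuxthreads():
    """Heuristic only: Linux with SIGRTMIN == 35, as seen on the affected buildbot."""
    if not sys.platform.startswith("linux"):
        return False
    # SIGRTMIN is not defined on every platform, so look it up defensively.
    return getattr(signal, "SIGRTMIN", None) == 35


if __name__ == "__main__":
    print("SIGRTMIN:", getattr(signal, "SIGRTMIN", None))
    print("looks like linuxthreads:", looks_like_linuxthreads())
```

Because this infers the threading library from a signal number, it can misfire on systems where SIGRTMIN happens to differ for other reasons.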

    @vstinner
    Member

    > I suggest simply skipping the "offending" test on linuxthread
    > platforms.

    Good idea

    > (perhaps as simple as checking for sys.platform == "linux2"
    > and signal.SIGRTMIN == 35)

    I would prefer to rely on confstr():

    import os
    try:
        # 'linuxthreads-0.10' or 'NPTL 2.10.2'
        pthread = os.confstr("CS_GNU_LIBPTHREAD_VERSION")
        linuxthreads = pthread.startswith("linuxthreads")
    except ValueError:
        linuxthreads = False

    ^^ this example requires the attached patch for the two CS_GNU_* constants.

    Which tests should be disabled?

    @mdickinson
    Member Author

    Skipping test_execvpe_with_bad_program sounds good to me.

    I'd ideally like to understand why 3.x is failing where 2.x is happy, but life's too short to stuff a mushroom.

    @neologix
    Mannequin

    neologix mannequin commented Apr 13, 2010

    > It looks as though the failed os.execv call messes something up internally, so that any attempt thereafter to start a thread produces this signal. I can't see anything obviously wrong with the os.execv implementation (see posix_execv in Modules/posixmodule.c).

    Upon execve, signal handlers are reset to default. So maybe the error makes the linuxthreads API screw up later when it tries to set up handlers for SIGRTMIN and friends. But what's weird is that when the executable given does not exist, the call should fail and return before having done anything...

    > There's still the question of what changed between 2.x and 3.x: on 2.x, this buildbot seems perfectly happy.

    I think it's simply because we didn't test a wrong program path with execve in 2.X version of test_os.

    @vstinner
    Member

    > I think it's simply because we didn't test a wrong program path
    > with execve in 2.X version of test_os.

    Oh, we should add this test to Python2 ;-)

    @mdickinson
    Member Author

    > I think it's simply because we didn't test a wrong program path with execve in 2.X version of test_os.

    D'oh! Thank you very much.

    I'm happy now: my mushroom's stuffed. :)

    @nnorwitz
    Mannequin

    nnorwitz mannequin commented Apr 14, 2010

    Thanks for taking care of this, guys. Sorry, I got swamped with mail
    and had to archive 3,000+ messages. It looks like it's in good hands.

    Let me know if there's anything you need. I may not have access to
    the box anymore, however, I can always contact Kurt.

    @mdickinson
    Member Author

    Victor, that patch looks fine to me. Do you want to go ahead and apply it, and add the skip to test_execvpe_with_bad_program ?

    The fix should be backported to 3.1, but not to 2.x (I think), since we don't have a problem there, and arguably the new os.confstr items could be considered a new feature.

    @mdickinson
    Member Author

    And just for reference,

    http://www.python.org/dev/buildbot/builders/x86%20gentoo%203.x/builds/2192

    shows that the relevant versions on this machine are:

    glibc 2.3.4
    linuxthreads-0.10

    and from http://www.python.org/dev/buildbot/builders/x86%20gentoo%203.x, it looks like the kernel version is 2.6.9 (actually, 2.6.9-gentoo-r1).

    @vstinner
    Member

    Committed as r80108 to py3k: "Add CS_GNU_LIBC_VERSION and CS_GNU_LIBPTHREAD_VERSION constants for confstr(), and disable test_execvpe_with_bad_program() of test_os if the libc uses linuxthreads to avoid the "unknown signal 32" bug (see issue bpo-4970)."

    I'll wait for the buildbot results before porting it to trunk (and maybe 2.6 and 3.1).
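
For readers who want to see the shape of such a skip, here is a hedged sketch. It is not the contents of r80108, which this thread does not show; the helper `libc_uses_linuxthreads` and the constant `USING_LINUXTHREADS` are hypothetical names, and the test body is a simplified stand-in. It assumes `os.confstr` knows the `CS_GNU_LIBPTHREAD_VERSION` name, which is what the attached patch adds.

```python
import os
import unittest


def libc_uses_linuxthreads():
    """Return True if confstr reports a LinuxThreads pthread library."""
    try:
        # Expected values look like 'linuxthreads-0.10' or 'NPTL 2.10.2'.
        version = os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    except (AttributeError, ValueError, OSError):
        # No os.confstr on this platform, or the name is not recognized.
        return False
    return bool(version) and version.startswith("linuxthreads")


USING_LINUXTHREADS = libc_uses_linuxthreads()


class ExecTests(unittest.TestCase):
    @unittest.skipIf(USING_LINUXTHREADS,
                     "avoid triggering a linuxthreads bug: see issue 4970")
    def test_execvpe_with_bad_program(self):
        # Simplified stand-in: executing a nonexistent program must fail.
        with self.assertRaises(OSError):
            os.execvpe("no such app-", ["no such app-"], None)


if __name__ == "__main__":
    unittest.main()
```

Evaluating the check once at import time keeps the decorator cheap and makes the skip reason visible in the test report.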

    @mdickinson
    Member Author

    Fix merged to release31-maint in r80119. Thanks, Victor.

    @mdickinson mdickinson added the tests Tests in the Lib/test dir label Apr 16, 2010
    @mdickinson mdickinson added the type-crash A hard crash of the interpreter, possibly with a core dump label Apr 16, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022