
Debian bigmem buildbot hanging in test_bigmem #58315

Closed
nadeemvawda mannequin opened this issue Feb 24, 2012 · 13 comments
Labels
tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

nadeemvawda mannequin commented Feb 24, 2012

BPO 14107
Nosy @loewis, @pitrou, @vstinner
Files
  • mem_watchdog_nonblock.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-02-27.18:01:25.243>
    created_at = <Date 2012-02-24.15:17:25.851>
    labels = ['type-bug', 'tests']
    title = 'Debian bigmem buildbot hanging in test_bigmem'
    updated_at = <Date 2012-02-27.18:01:25.243>
    user = 'https://bugs.python.org/nadeemvawda'

    bugs.python.org fields:

    activity = <Date 2012-02-27.18:01:25.243>
    actor = 'neologix'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-02-27.18:01:25.243>
    closer = 'neologix'
    components = ['Tests']
    creation = <Date 2012-02-24.15:17:25.851>
    creator = 'nadeem.vawda'
    dependencies = []
    files = ['24645']
    hgrepos = []
    issue_num = 14107
    keywords = ['patch', 'buildbot']
    message_count = 13.0
    messages = ['154139', '154166', '154168', '154169', '154170', '154172', '154310', '154330', '154335', '154345', '154353', '154369', '154439']
    nosy_count = 6.0
    nosy_names = ['loewis', 'pitrou', 'vstinner', 'nadeem.vawda', 'neologix', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue14107'
    versions = ['Python 3.3']


    nadeemvawda mannequin commented Feb 24, 2012

    On the debian bigmem buildbot, test_bigmem hangs until it gets killed
    by a timeout:

    http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/134/steps/test/logs/stdio
    http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/131/steps/test/logs/stdio
    http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/129/steps/test/logs/stdio
    http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/128/steps/test/logs/stdio
    http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/123/steps/test/logs/stdio
    http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/85/steps/test/logs/stdio
    

    This has happened on every run I've checked that hasn't crashed before
    reaching test_bigmem for some other reason (e.g. the recently-fixed zlib
    bus error), as far back as the buildbot history goes.

    bpo-5438 might be related.

    @nadeemvawda nadeemvawda mannequin added tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error labels Feb 24, 2012

    python-dev mannequin commented Feb 24, 2012

    New changeset 4afcb25988c4 by Victor Stinner in branch 'default':
    Issue bpo-14107: fix bigmem tests on str.capitalize(), str.swapcase() and
    http://hg.python.org/cpython/rev/4afcb25988c4

    vstinner (Member) commented:

    I ran test_bigmem on my computer. The tests pass, but they are really slow.
    ----

    $ ./python -m test -v -M 8G test_bigmem
    (...
    Ran 164 tests in 3228.227s

    OK (skipped=44)
    1 test OK.
    ----

    One hour, even though 44 tests are skipped. The skipped tests are the ones that use the most memory, so I expect they would be even slower.

    Do you need these tests that use so much memory (and are really slow)?
    ---
    test_capitalize (test.test_bigmem.StrTest) ... skipped 'not enough memory: 24.0G minimum needed'
    test_encode_utf32 (test.test_bigmem.StrTest) ... skipped 'not enough memory: 9.0G minimum needed'
    test_format (test.test_bigmem.StrTest) ... skipped 'not enough memory: 12.0G minimum needed'
    test_repr_large (test.test_bigmem.StrTest) ... skipped 'not enough memory: 10.0G minimum needed'
    test_split_large (test.test_bigmem.StrTest) ... skipped 'not enough memory: 20.0G minimum needed'
    test_swapcase (test.test_bigmem.StrTest) ... skipped 'not enough memory: 24.0G minimum needed'
    test_title (test.test_bigmem.StrTest) ... skipped 'not enough memory: 24.0G minimum needed'
    test_translate (test.test_bigmem.StrTest) ... skipped 'not enough memory: 12.0G minimum needed'
    test_split_large (test.test_bigmem.BytesTest) ... skipped 'not enough memory: 20.0G minimum needed'
    test_compare (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 32.0G minimum needed'
    test_concat_large (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 48.0G minimum needed'
    test_concat_small (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 24.0G minimum needed'
    test_contains (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_from_2G_generator (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 9.0G minimum needed'
    test_from_almost_2G_generator (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 9.0G minimum needed'
    test_hash (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_index_and_slice (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_repeat_large (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 48.0G minimum needed'
    test_repeat_large_2 (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 12.0G minimum needed'
    test_repeat_small (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 24.0G minimum needed'
    test_repr_large (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 22.0G minimum needed'
    test_append (test.test_bigmem.ListTest) ... skipped 'not enough memory: 18.0G minimum needed'
    test_compare (test.test_bigmem.ListTest) ... skipped 'not enough memory: 32.0G minimum needed'
    test_concat_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 48.0G minimum needed'
    test_concat_small (test.test_bigmem.ListTest) ... skipped 'not enough memory: 24.0G minimum needed'
    test_contains (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_count (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_extend_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 32.0G minimum needed'
    test_extend_small (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_hash (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_index (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_index_and_slice (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_inplace_concat_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 48.0G minimum needed'
    test_inplace_concat_small (test.test_bigmem.ListTest) ... skipped 'not enough memory: 24.0G minimum needed'
    test_inplace_repeat_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 32.0G minimum needed'
    test_inplace_repeat_small (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_insert (test.test_bigmem.ListTest) ... skipped 'not enough memory: 18.0G minimum needed'
    test_pop (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_remove (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_repeat_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 48.0G minimum needed'
    test_repeat_small (test.test_bigmem.ListTest) ... skipped 'not enough memory: 24.0G minimum needed'
    test_repr_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 22.0G minimum needed'
    test_reverse (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    test_sort (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
    ---
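
    The skip messages above come from the bigmem machinery in test.support, which decorates each test with a per-item memory estimate and skips it when the `-M` limit is too low. Below is a minimal sketch of that idea; the name `bigmemtest` mirrors test.support, but the simplified logic and the hard-coded 8G limit are illustrative, not the real implementation:

    ```python
    import functools
    import unittest

    _G = 1024 ** 3
    MEMLIMIT = 8 * _G  # stand-in for the value that "-M 8G" would set

    def bigmemtest(size, memuse):
        """Run the test with `size` items only if size * memuse bytes fit
        under MEMLIMIT; otherwise skip with a message like the ones above.
        Simplified sketch of test.support.bigmemtest."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(self):
                needed = size * memuse
                if needed > MEMLIMIT:
                    raise unittest.SkipTest(
                        "not enough memory: %.1fG minimum needed" % (needed / _G))
                return func(self, size)
            return wrapper
        return decorator

    class StrSketchTest(unittest.TestCase):
        # 2G characters at an estimated 12 bytes each -> 24G needed,
        # so this is skipped under an 8G limit.
        @bigmemtest(size=2 * _G, memuse=12)
        def test_capitalize(self, size):
            self.assertEqual(('a' * size).capitalize()[0], 'A')
    ```

    Under this scheme a test only runs when the product of item count and estimated per-item memory fits the limit, which is why raising `-M` pulls in the expensive tests.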


    nadeemvawda mannequin commented Feb 25, 2012

    > I ran test_bigmem on my computer. The tests pass, but they are really slow.

    Is this with or without the fix you just pushed? How much RAM does your
    system have? I tried running with "-M 4.5G" on my 8GB machine earlier
    today, and it ate through all of my RAM plus 4GB of swap before the OOM
    killer terminated it.

    > Do you need these tests that use so much memory (and are really slow)?

    In general, I think it's important to exercise cases where 32-bit
    truncation issues and the like might arise, because otherwise they'll lie
    around undetected until they bite some unsuspecting user at an
    inopportune moment. But I don't know much about those specific tests, so
    I can't really offer much of an opinion here.

    [Still, 48GB minimum? ೦_೦ ]

    vstinner (Member) commented:

    > I ran test_bigmem on my computer. The tests pass, but they are really slow.

    > Is this with or without the fix you just pushed? How much RAM does your
    > system have? I tried running with "-M 4.5G" on my 8GB machine earlier
    > today, and it ate through all of my RAM plus 4GB of swap before the OOM
    > killer terminated it.

    My PC has 12 GB of RAM and no swap. I ran the test after my commit.

    I tried to enable swap, but writing to the swap completely crashes my
    kernel. My / uses btrfs, and btrfs doesn't seem to support swap files,
    even with a loopback device.

    nadeemvawda mannequin commented Feb 25, 2012

    > My PC has 12 GB of RAM and no swap. I ran the test after my commit.

    That explains it. I ran it earlier (obviously), so it included a bunch of
    those more-demanding tests, which must be where the OOM killer hit it.


    nadeemvawda mannequin commented Feb 26, 2012

    The buildbot still isn't happy:

    http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/144/steps/test/logs/stdio
    

    Also, I ran the test on my own machine and noticed two tests that are
    still exceeding their nominal peak memory usage:

    • test_encode_utf7: expected 4.8G, actual 7.2G

    • test_unicode_repr: expected 4.8G, actual 5.6G


    neologix mannequin commented Feb 26, 2012

    """
    Thread 0x00002ba588709700:
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/support.py", line 1168 in consumer
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/threading.py", line 682 in run
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/threading.py", line 729 in _bootstrap_inner
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/threading.py", line 702 in _bootstrap

    Current thread 0x00002ba57b2d4260:
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/support.py", line 1198 in stop
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/support.py", line 1240 in wrapper
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/case.py", line 385 in _executeTestPart
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/case.py", line 440 in run
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/case.py", line 492 in __call__
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/suite.py", line 105 in run
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/suite.py", line 67 in __call__
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/suite.py", line 105 in run
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/suite.py", line 67 in __call__
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/runner.py", line 168 in run
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/support.py", line 1369 in _run_suite
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/support.py", line 1403 in run_unittest
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/test_bigmem.py", line 1252 in test_main
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/regrtest.py", line 1221 in runtest_inner
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/regrtest.py", line 907 in runtest
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/regrtest.py", line 710 in main
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/main.py", line 13 in <module>
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/runpy.py", line 73 in _run_code
    File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/runpy.py", line 160 in _run_module_as_main
    """

    There's a problem with the _file_watchdog thread:
    if the pipe gets full (because the consumer thread doesn't get to run
    often enough), the watchdog thread will block on the write() to the
    pipe.
    Then, the main thread tries to stop the watchdog:

    """
    static void
    cancel_file_watchdog(void)
    {
        /* Notify cancellation */
        PyThread_release_lock(watchdog.cancel_event);
    
        /* Wait for thread to join */
        PyThread_acquire_lock(watchdog.running, 1);
        PyThread_release_lock(watchdog.running);
    
        /* The main thread should always hold the cancel_event lock */
        PyThread_acquire_lock(watchdog.cancel_event, 1);
    }
    """

    The cancel_event lock is released, but the watchdog thread is stuck on the
    write(). The only thing that could wake it up is a read() from the consumer
    thread, but the main thread (the one calling cancel_file_watchdog()) blocks
    when acquiring the running lock: since the GIL is not released, the consumer
    thread can't run, so it doesn't drain the pipe, and game over...

    """
    /* We can't do anything if the consumer is too slow, just bail out */
    if (write(watchdog.wfd, (void *) &x, sizeof(x)) < sizeof(x))
    break;
    if (write(watchdog.wfd, data, data_len) < data_len)
    break;
    """

    AFAICT, this can't happen, because the write end of the pipe is not in
    non-blocking mode (which would solve this issue).
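
    Putting the pipe's write end in non-blocking mode, as suggested, makes a write to a full pipe fail with EAGAIN instead of hanging the watchdog thread. A Python sketch of the same idea (the actual C fix would call fcntl() with O_NONBLOCK directly; this is illustrative, not the committed patch):

    ```python
    import fcntl
    import os

    def make_nonblocking(fd):
        # Equivalent of fcntl(fd, F_SETFL, flags | O_NONBLOCK) in C.
        flags = fcntl.fcntl(fd, fcntl.F_GETFL)
        fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

    rfd, wfd = os.pipe()
    make_nonblocking(wfd)

    # With nobody reading, filling the pipe now raises BlockingIOError
    # (EAGAIN) instead of blocking the writer forever.
    try:
        while True:
            os.write(wfd, b"x" * 65536)
    except BlockingIOError:
        pass  # bail out, as the watchdog's "break" intends

    os.close(rfd)
    os.close(wfd)
    ```

    With the write end non-blocking, the "just bail out" branch in the quoted C code can actually be reached, so the watchdog can never deadlock against a slow consumer.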

    Otherwise, there are two things I don't understand:

    1. IIUC, the goal of the watchdog thread is to collect memory consumption
       in a timely manner: that's now the case, but since the information is
       printed in a standard thread, it doesn't bring any improvement (because
       it can be delayed arbitrarily long), or am I missing something?
    2. Instead of using a thread and the faulthandler infrastructure to run
       GIL-less, why not simply use a subprocess? It could then simply parse
       /proc/<PID>/statm at a regular interval, and print stats to stdout.
       It would also solve point 1.
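
    The subprocess approach in point 2 could look like the following sketch: a small watcher process that polls /proc/<PID>/statm (Linux-only) and prints the resident set size. This is illustrative only, not the patch that eventually landed; the statm fields are, in pages: size, resident, shared, text, lib, data, dt.

    ```python
    import os
    import time

    PAGE_SIZE = os.sysconf("SC_PAGESIZE")

    def read_rss(pid):
        # /proc/<pid>/statm fields (in pages):
        # size resident shared text lib data dt
        with open("/proc/%d/statm" % pid) as f:
            fields = f.read().split()
        return int(fields[1]) * PAGE_SIZE  # resident set size, in bytes

    def watch(pid, interval=1.0):
        # Poll until the watched process exits; as a separate process,
        # this is never delayed by the test process's GIL.
        while True:
            try:
                rss = read_rss(pid)
            except FileNotFoundError:
                break  # watched process has exited
            print("rss: %.1f MB" % (rss / (1024.0 * 1024.0)), flush=True)
            time.sleep(interval)
    ```

    A test runner would spawn this watcher with subprocess before a bigmem test and collect the stats from its stdout for post facto analysis.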


    neologix mannequin commented Feb 26, 2012

    > 1. IIUC, the goal of the watchdog thread is to collect memory consumption
    >    in a timely manner: that's now the case, but since the information is
    >    printed in a standard thread, it doesn't bring any improvement (because
    >    it can be delayed arbitrarily long), or am I missing something?

    Scratch that. It does bring an improvement for a post facto analysis
    (I thought at first it was used in a pseudo-realtime way).


    pitrou commented Feb 26, 2012

    """
    /* We can't do anything if the consumer is too slow, just bail out */
    if (write(watchdog.wfd, (void *) &x, sizeof(x)) < sizeof(x))
    break;
    if (write(watchdog.wfd, data, data_len) < data_len)
    break;
    """

    AFAICT, this can't happen, because the write end of the pipe is not in
    non-blocking mode (which would solve this issue).

    I think my original plan was to put it in non-blocking mode, but I must
    have forgotten in the end.

    > 2. Instead of using a thread and the faulthandler infrastructure to run
    >    GIL-less, why not simply use a subprocess? It could then simply parse
    >    /proc/<PID>/statm at a regular interval, and print stats to stdout.
    >    It would also solve point 1.

    I can't think of any drawback off the top of my head, so that sounds
    reasonable.


    neologix mannequin commented Feb 26, 2012

    > I think my original plan was to put it in non-blocking mode, but I
    > must have forgotten in the end.

    Here's a patch.

    > I can't think of any drawback off the top of my head, so that sounds
    > reasonable.

    I'll try to write up a patch (I have only 1.2GB RAM, so it won't be easy
    to test :-).


    python-dev mannequin commented Feb 26, 2012

    New changeset 52dc4fcd0d6f by Charles-François Natali in branch 'default':
    Issue bpo-14107: test: Fix a deadlock involving the memory watchdog thread.
    http://hg.python.org/cpython/rev/52dc4fcd0d6f


    neologix mannequin commented Feb 27, 2012

    @neologix neologix mannequin closed this as completed Feb 27, 2012
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022