bsddb3 hash craps out with threads #38896

Closed
tim-one opened this issue Jul 22, 2003 · 38 comments
Assignees
Labels
extension-modules C modules in the Modules dir

Comments

@tim-one
Member

tim-one commented Jul 22, 2003

BPO 775414
Nosy @tim-one, @mhammond, @smontanaro, @gpshead
Files
  • hammer.py: Derived from Richie's test driver
  • sleepy.txt
  • studly_hammer.py
  • deadlock_hammer.py: Version of studly_hammer with a deadlock detection thread.
  • wrapped_hammer.py: hammer.py with set_lk_detect and DeadlockWrap
  • wrapped_studly_hammer.py: studly_hammer w/set_lk_detect and DeadlockWrap
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = <Date 2006-06-15.08:54:48.000>
    created_at = <Date 2003-07-22.02:29:12.000>
    labels = ['extension-modules']
    title = 'bsddb3 hash craps out with threads'
    updated_at = <Date 2006-06-15.08:54:48.000>
    user = 'https://github.com/tim-one'

    bugs.python.org fields:

    activity = <Date 2006-06-15.08:54:48.000>
    actor = 'gregory.p.smith'
    assignee = 'gregory.p.smith'
    closed = True
    closed_date = None
    closer = None
    components = ['Extension Modules']
    creation = <Date 2003-07-22.02:29:12.000>
    creator = 'tim.peters'
    dependencies = []
    files = ['974', '975', '976', '977', '978', '979']
    hgrepos = []
    issue_num = 775414
    keywords = []
    message_count = 38.0
    messages = ['17191', '17192', '17193', '17194', '17195', '17196', '17197', '17198', '17199', '17200', '17201', '17202', '17203', '17204', '17205', '17206', '17207', '17208', '17209', '17210', '17211', '17212', '17213', '17214', '17215', '17216', '17217', '17218', '17219', '17220', '17221', '17222', '17223', '17224', '17225', '17226', '17227', '17228']
    nosy_count = 8.0
    nosy_names = ['tim.peters', 'jhylton', 'mhammond', 'skip.montanaro', 'anthonybaxter', 'gregory.p.smith', 'richiehindle', 'roundeye']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue775414'
    versions = ['Python 2.5']

    @tim-one
    Member Author

    tim-one commented Jul 22, 2003

    Richie Hindle presented something like the attached
    (hammer.py) on the spambayes-dev mailing list. On
    Win98SE and Win2K w/ Python 2.3c1 I usually see this
    death pretty quickly:

    Traceback (most recent call last):
      File "hammer.py", line 36, in ?
        main()
      File "hammer.py", line 33, in main
        hammer(db)
      File "hammer.py", line 15, in hammer
        x = db[str(int(random.random() * 100000))]
      File "C:\CODE\PYTHON\lib\bsddb\__init__.py", line 86, in __getitem__
        return self.db[key]
    bsddb._db.DBRunRecoveryError: (-30982, 'DB_RUNRECOVERY: Fatal error, run database recovery -- fatal region error detected; run recovery')

    Richie also reported "illegal operation" crashes on
    Win98SE.

    It's not clear whether a bsddb3 hash *can* be used
    with threads like this. If it can't, there's a doc bug. If it
    should be able to, there's a more serious problem. Note
    that it looks like hashopen() always merges DB_THREAD
    into the flags, so the absence of specifying DB_THREAD
    probably isn't the problem.
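For readers without the attachment, the general shape of such a stress driver can be sketched as follows. This is not the attached hammer.py: the names `hammer` and `run_hammer`, the thread count, and the read/write mix are assumptions made for illustration, and a plain dict stands in for the object returned by `hashopen()` so the sketch runs without Berkeley DB.

```python
import random
import threading


def hammer(db, iterations, errors):
    """Randomly read and write string keys, recording any exception."""
    try:
        for _ in range(iterations):
            key = str(int(random.random() * 100000))
            if random.random() < 0.5:
                db[key] = key      # write
            else:
                db.get(key)        # read; a missing key is fine
    except Exception as exc:
        # With a real database this is where a failure would be captured.
        errors.append(exc)


def run_hammer(db, n_threads=4, iterations=1000):
    """Start n_threads workers on one shared mapping; return collected errors."""
    errors = []
    workers = [
        threading.Thread(target=hammer, args=(db, iterations, errors))
        for _ in range(n_threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return errors


if __name__ == "__main__":
    # A plain dict stands in for the database so the sketch runs anywhere.
    print(run_hammer({}))
```

Passing a real database object instead of the dict would turn this into an actual reproduction attempt; the dict version only shows the structure of the test.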

    @tim-one tim-one closed this as completed Jul 22, 2003
    @tim-one tim-one added the extension-modules C modules in the Modules dir label Jul 22, 2003
    @richiehindle
    Mannequin

    richiehindle mannequin commented Jul 22, 2003

    Logged In: YES
    user_id=85414

    Minor correction: I'm on Plain Old Win98, not SE.

    For what it's worth, the script seems more often than not
    to provoke an application error when there's background
    load, and a DBRunRecoveryError when there isn't.

    @gpshead
    Member

    gpshead commented Aug 13, 2003

    Logged In: YES
    user_id=413

    i'll try and reproduce this.

    @tim-one
    Member Author

    tim-one commented Sep 12, 2003

    Logged In: YES
    user_id=31435

    Greg, any luck? We're starting to see the same error ("fatal
    region error detected") in some ZODB tests using bsddb3, and
    that's an infinitely more complicated setup than this little
    program. Jeremy Hylton also sees "fatal region" errors on
    Linux, in the ZODB context.

    @jhylton
    Mannequin

    jhylton mannequin commented Sep 12, 2003

    Logged In: YES
    user_id=31392

    I'm running this test with CVS Python (built on 9/11/03) on
    RH Linux 9 with bsddb 4.1.25. I see the same error although
    it takes a relatively long time to provoke -- a minute or two.

    @jhylton
    Mannequin

    jhylton mannequin commented Sep 12, 2003

    Logged In: YES
    user_id=31392

    How does the bsddb wrapper achieve thread safety?

    I know very little about the wrapper or the underlying bsddb
    libraries. I found the following comment in the C API docs:

    http://www.sleepycat.com/docs/ref/program/mt.html#2

    When using the non-cursor Berkeley DB calls to retrieve
    key/data items (for example, DB->get), the memory to which
    the pointer stored into the Dbt refers is valid only until the
    next call using the DB handle returned by DB->open. This includes
    any use of the returned DB handle, including by another thread
    within the process.

    This suggests that a call to a self->db->get() must process
    its results (copy them into Python-owned memory) before any
    other operation on the same db object can proceed. Is that
    right?

    The bsddb wrapper releases the GIL before calling the
    low-level DB API functions and then reacquires it after the
    call returns. Is there some other lock that prevents
    multiple simultaneous calls from stomping on each other?
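The hazard described in the quoted documentation can be illustrated with a toy model. This is not Berkeley DB or the bsddb wrapper: `ReusedBufferHandle` is a hypothetical class whose `get_view()` returns a view into one shared scratch buffer, which is the situation the docs warn about. It assumes values fit in the 16-byte scratch buffer.

```python
import threading


class ReusedBufferHandle:
    """Toy handle whose get_view() returns a view into one shared scratch buffer."""

    def __init__(self, data):
        self._data = data
        self._scratch = bytearray(16)   # assumes every value fits in 16 bytes
        self.lock = threading.Lock()

    def get_view(self, key):
        value = self._data[key]
        self._scratch[:len(value)] = value
        # The view aliases the scratch buffer, so the next call overwrites it.
        return memoryview(self._scratch)[:len(value)]

    def get_copy(self, key):
        # Copy into caller-owned memory before any other call can run.
        with self.lock:
            return bytes(self.get_view(key))


handle = ReusedBufferHandle({"a": b"alpha", "b": b"bravo"})

stale = handle.get_view("a")
handle.get_view("b")          # a second call clobbers the first result
clobbered = bytes(stale)      # no longer the value stored under "a"

safe = handle.get_copy("a")
handle.get_copy("b")          # 'safe' is an independent copy and is unaffected
```

Whether the real wrapper is exposed to this depends on how it allocates its DBT structures, which is exactly the question being asked here.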

    @smontanaro
    Contributor

    Logged In: YES
    user_id=44345

    From what I got back from Sleepycat on this, I'm pretty sure the
    old bsddb interface is not going to be thread safe. Attached are
    two messages from Sleepycat.

    Is there some way for the old interface to create a default
    environment shared by all the bsddb.*open() calls and then set
    the DB_RECOVER flag in the low-level open() call?

    @gpshead
    Member

    gpshead commented Sep 12, 2003

    Logged In: YES
    user_id=413

    The old bsddb interface compatibility code could be modified to use a
    single DBEnv per process opened with the DB_SYSTEM_MEM flag. Do
    we want to do this? Shouldn't we encourage the use of the real
    pybsddb DB/DBEnv object interface for threads instead? AFAIK the old
    bsddb module + libs were not thread safe.

    @jhylton
    Mannequin

    jhylton mannequin commented Sep 12, 2003

    Logged In: YES
    user_id=31392

    Are the DB_mapping methods only used by the old interface? My
    question is about those methods, which I assumed were used
    by the old and new interfaces.

    @gpshead
    Member

    gpshead commented Sep 12, 2003

    Logged In: YES
    user_id=413

    ah, Keith's response from sleepycat assumed that we were using the
    DB 1.85 compatibility interface. We do not. The bsddb module
    emulates the old bsddb module's 1.85-ish interface using modern
    DB/DBEnv objects underneath. So his comments about that not being
    threadsafe don't apply here.

    @smontanaro
    Contributor

    Logged In: YES
    user_id=44345

    In theory, yes, we could special case the bsddb stuff. However,
    the code currently is run indirectly via the anydbm module. It
    will take a little effort on our part to do something special for
    bsddb. It would be nice if other apps using the naive interface
    were able to use multiple threads.

    @smontanaro
    Contributor

    Logged In: YES
    user_id=44345

    > The bsddb module emulates the old bsddb module's 1.85-ish
    > interface using modern DB/DBEnv objects underneath.  So his
    > comments about that not being threadsafe don't apply here.

    But the low-level open() call isn't made with a DBEnv argument
    is it? Nor is the DB_RECOVER flag set. Would the compatibility
    interface be able to do both things?

    @jhylton
    Mannequin

    jhylton mannequin commented Sep 12, 2003

    Logged In: YES
    user_id=31392

    I don't see Keith's response anywhere in this thread. Can
    you add it for the record? The only call to db->put() that
    I see is in _DB_put(). It does not look thread-safe to me.

    @tim-one
    Member Author

    tim-one commented Sep 12, 2003

    Logged In: YES
    user_id=31435

    Jeremy, Keith's response is in the sleepy.txt file attached to
    the bug report.

    @gpshead
    Member

    gpshead commented Sep 12, 2003

    Logged In: YES
    user_id=413

    Looking at bsddb/__init__.py (where the old bsddb compatibility
    interface is implemented) I don't see why the hammer.py attached
    below should cause a problem. The database is opened with
    DB_THREAD using a private environment (no DBEnv passed to DB()).

    I definitely see potential threading problems with the _DBWithCursor
    class defined there if any of the methods using a cursor are used (the
    cursor could be shared across threads; that's a no-no). But in the
    context of hammer.py that doesn't happen so I wouldn't have expected
    a problem. Unless perhaps creating the DB without a DBEnv implies
    that the DB_THREAD flag won't work. The DB_RECOVER flag is only
    useful for opening existing DBEnv's; we have none.

    I've got to pop offline for a bit now but I'll try a hammer.py modified to
    use direct DB calls (for easier playing around with and bug reporting to
    Sleepycat if it turns out to be a bug on their end) later tonight.

    PS Keith's response is in the sleepy.txt attachment if you open the
    URL to this bug report on sourceforge.

    @jhylton
    Mannequin

    jhylton mannequin commented Sep 12, 2003

    Logged In: YES
    user_id=31392

    I don't want to sound like a broken record, but I will: Can
    anyone comment on the lack of thread-safety in _DB_put()?
    It appears that there is nothing to prevent the memory used
    by one call from being stomped on by another call in a
    different thread. This problem would exist even in an
    application using the modern interface and specifying DB_THREAD.

    @richiehindle
    Mannequin

    richiehindle mannequin commented Sep 12, 2003

    Logged In: YES
    user_id=85414

    Sorry to muddy the waters, but I'm 99% sure that this
    is not a threading issue. Today I had the same
    DBRunRecoveryError for my Spambayes POP3 proxy
    classifier database, which only ever gets accessed
    from the main program thread.

    @smontanaro
    Contributor

    Logged In: YES
    user_id=44345

    The sleepycat mails (there are two of them - Keith's is
    second) are in the attached sleepy.txt file.

    @gpshead
    Member

    gpshead commented Sep 12, 2003

    Logged In: YES
    user_id=413

    I don't see any problem in _bsddb.c:_DB_put(), what memory
    are you talking about? All of the DBT key and data
    parameters are allocated on the local stack on the various
    DB methods that call _DB_put. What do you see that could be
    clobbered?

    @smontanaro
    Contributor

    Logged In: YES
    user_id=44345

    If hammer.py fails for you, please try this slightly modified
    version (studly_hammer.py).

    @gpshead
    Member

    gpshead commented Sep 27, 2003

    Logged In: YES
    user_id=413

    I just committed a change to bsddb/__init__.py (file rev 1.10) that adds the creation of a thread-safe DBEnv object for each hashopen, btopen or rnopen database. hammer.py has been running for 5 minutes on my linux/alpha system using BerkeleyDB 4.1.25. (admittedly my test is running on python 2.2.2, but as this isn't a python core related change i doubt that matters).

    After others have tested this on other platforms with success I believe we can close this bug. This patch would probably be good for python 2.3.2.

    @anthonybaxter
    Mannequin

    anthonybaxter mannequin commented Sep 28, 2003

    Logged In: YES
    user_id=29957

    Could you check that it (and the test_bsddb3) works on
    Solaris? There's a couple of solaris boxes on the SF compile
    farm (cf.sf.net). I was unable to get test_bsddb3 to complete
    at all on Solaris 2.6, 7 or 8, when using DB 4.1.25.

    As far as 2.3.2, I really really don't think it's appropriate to
    throw it in at this late point. Particularly given the 2.3.1
    screwups, I don't want to risk it.

    @tim-one
    Member Author

    tim-one commented Sep 29, 2003

    Logged In: YES
    user_id=31435

    About studly_hammer.py:

    [Skip Montanaro]

    ...
    Attached is a modified version of the hammer.py script which seems to
    not fail for me on either Windows run from IDLE (Python 2.3, BDB
    4.1.6) or Mac OS X (Python CVS, BDB 4.2.1). The original script
    failed for me on Windows but not Mac OS X. Can some other people for
    whom the original script fails please try it? (I also attached it to
    bug bpo-775414.)

    On Win98SE with current Python 2.3.1, it doesn't fail, but it
    never seemed to finish for me either. Staring at WinTop
    showed that the Python process stopped accumulating
    cycles. Can't be killed with Ctrl+C (no visible effect). Can be
    killed with Ctrl+Break.

    Dumping

        print "%s %s" % (thread.get_ident(), i)
    

    at the top of the hammer loop showed that the threads get
    through several hundred iterations, then all printing stops.

    Attaching to a debug-build Python from the debugger when a
    freeze occurs isn't terribly illuminating. One thread's stack
    shows

    _BSDDB_D! __db_win32_mutex_lock + 134 bytes
    _BSDDB_D! __lock_get + 2264 bytes
    _BSDDB_D! __lock_get + 197 bytes
    _BSDDB_D! __ham_get_meta + 120 bytes
    _BSDDB_D! __ham_c_dup + 4201 bytes
    _BSDDB_D! __db_c_put + 2544 bytes
    _BSDDB_D! __db_put + 507 bytes
    _DB_put(DBObject * 0x016cff88, __db_txn * 0x016d0000, __db_dbt * 0x016cc000, __db_dbt * 0x50d751fe, int 0) line 562 + 35 bytes

    The main thread's stack shows

    _BSDDB_D! __db_win32_mutex_lock + 134 bytes
    _BSDDB_D! __lock_get + 2264 bytes
    _BSDDB_D! __lock_get + 197 bytes
    _BSDDB_D! __db_lget + 365 bytes
    _BSDDB_D! __ham_lock_bucket + 105 bytes
    _BSDDB_D! __ham_get_cpage + 195 bytes
    _BSDDB_D! __ham_item_next + 25 bytes
    _BSDDB_D! __ham_call_hash + 2479 bytes
    _BSDDB_D! __ham_c_dup + 4307 bytes
    _BSDDB_D! __db_c_put + 2544 bytes
    _BSDDB_D! __db_put + 507 bytes
    _DB_put(DBObject * 0x008fe2e8, __db_txn * 0x00000000, __db_dbt * 0x0062f230, __db_dbt * 0x0062f248, int 0) line 562 + 35 bytes
    DB_ass_sub(DBObject * 0x008fe2e8, _object * 0x00b83178, _object * 0x00b83370) line 2330 + 23 bytes
    PyObject_SetItem(_object * 0x008fe2e8, _object * 0x00b83178, _object * 0x00b83370) line 123 + 18 bytes
    eval_frame(_frame * 0x00984948) line 1448 + 17 bytes
    ...

    The other threads are somewhere in the OS kernel and don't
    have useful tracebacks. This varies from run to run, but all
    threads with a useful stack are always stuck at the same
    place in __db_win32_mutex_lock.

    All in all, looks like it's simply deadlocked.

    @tim-one
    Member Author

    tim-one commented Sep 29, 2003

    Logged In: YES
    user_id=31435

    Running the original hammer.py under current CVS Python
    freezes in the same way (as in my immediately preceding
    note) now too; again Win98SE.

    @gpshead
    Member

    gpshead commented Sep 29, 2003

    Logged In: YES
    user_id=413

    Deadlocks only occurring under DOS-based "windows"
    (win95/98/me) aren't something the python module can
    prevent. I suggest submitting the sample code and info from
    studly_hammer.py to sleepycat. They're usually very
    responsive to questions of that nature.

    btw, i'll give things a go on solaris later this week. if
    the test suite never completes i again suspect it is a
    berkeleydb library issue on that platform rather than python
    module.

    @gpshead
    Member

    gpshead commented Sep 29, 2003

    Logged In: YES
    user_id=413

    anthony - if we don't put this patch into python 2.3.2, the
    python 2.3.x bsddb module documentation should be updated to
    say that multithreaded access is not supported and will
    cause problems, possibly even python interpreter crashes.

    @anthonybaxter
    Mannequin

    anthonybaxter mannequin commented Sep 29, 2003

    Logged In: YES
    user_id=29957

    I'd be much happier with a documentation fix for 2.3.2.

    Note that when I said "fails to complete" on Solaris, I
    meant that it crashes out, not that it deadlocks. I can post
    the tracebacks here if you'd like.

    @tim-one
    Member Author

    tim-one commented Sep 29, 2003

    Logged In: YES
    user_id=31435

    Greg, I'm in a constant state of debugging (in other apps)
    thread problems that *appear* unique to Win9x. But in years
    of this, I have yet to see one that actually is unique to
    Win9x -- in the end, they always turn out to be legit races in
    the app I'm debugging, and can always be reproduced on
    other platforms if the test is made stressful enough and/or
    run long enough. Win9x appears especially good at provoking
    thread problems just because its scheduling is erratic, often
    acting like a Linux system under extreme load that way.

    IOW, unless there's a bug in Sleepycat's implementation of
    locking on Win9x, I bet dollars to doughnuts this program will
    eventually deadlock everywhere. In Python's lifetime, across
    dozens of miserable thread problems, we haven't pinned the
    blame once on Win9x. That wasn't for lack of trying <wink>.

    @smontanaro
    Contributor

    Logged In: YES
    user_id=44345

    I built from CVS head on a Solaris machine. bsddb.__version__
    reports '4.2.1'. When run, the studly_hammer.py script
    completes the dbenv.open() call, but appears to hang during
    the hashopen() call. Adding some print statements to hashopen()
    indicates that it hangs during d.open().

    I don't know what to make of this. If others have some
    suggestions, I'll be happy to try them out.

    @smontanaro
    Contributor

    Logged In: YES
    user_id=44345

    Forgot to mention that without the DBEnv() object, the script on
    Solaris 8 either seg faults pretty quickly (within 10,000
    iterations for each thread) or raises
    bsddb._db.DBRunRecoveryError.

    @roundeye
    Mannequin

    roundeye mannequin commented Oct 5, 2003

    Logged In: YES
    user_id=58334

    This is also showing up in Syncato
    (http://www.syncato.org/), and the database isn't
    recoverable using the Berkeley DB db_recover utility (even
    using the "catastrophic" flag).

    Does anyone know of a reliable way to recover?

    Rick Bradley

    @gpshead
    Member

    gpshead commented Oct 6, 2003

    Logged In: YES
    user_id=413

    if you believe your application is properly using BerkeleyDB
    and you are having DB_RUNRECOVERY issues I suggest
    contacting sleepycat.

    @mhammond
    Contributor

    mhammond commented Nov 4, 2005

    Logged In: YES
    user_id=14198

    Sadly, I believe bsddb is working "as designed". Quoting
    from http://www.sleepycat.com/docs/api_c/env_open.html

    "When the DB_INIT_LOCK flag is specified, it is usually
    necessary to run a deadlock detector, as well."

    So I dug into my bsddb build tree, and found
    db_deadlock.exe. Sure enough, once studly_hammer.py had
    deadlocked, executing db_deadlock in the DB directory got
    things running again - although the threads all eventually
    died with:

    bsddb._db.DBLockDeadlockError: (-30996, 'DB_LOCK_DEADLOCK:
    Locker killed to resolve a deadlock')

    Obviously it is a PITA to need to run an external daemon, and
    as Python doesn't distribute db_deadlock.exe, the sleepycat
    license may mean not all applications are allowed to
    distribute it. This program also polls for deadlocks,
    meaning your app may hang as long as the poll period. All
    in all, it seems to suck :)

    @mhammond
    Contributor

    mhammond commented Nov 4, 2005

    Logged In: YES
    user_id=14198

    The db_deadlock program ends up being equivalent to a thread
    repeatedly calling:
    dbenv.lock_detect(bsddb.db.DB_LOCK_DEFAULT, 0)

    For completeness, I attach deadlock_hammer.py - a version
    that uses yet another thread to perform this lock detection.
    It also catches the deadlock exceptions, printing but
    ignoring them. Also, due to the way shutdown is less than
    graceful, I found I needed to add DB_RECOVER_FATAL to the
    env flags, otherwise I would often hang on open unless I
    clobbered the DB directory. On both my box (where it took a
    little while to see a deadlock) and on a dual-processor box
    (which provoked it much quicker), this version seems to run
    forever (although with sporadic performance)
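The polling approach described above can be sketched generically. This is not the attached deadlock_hammer.py: `start_lock_detector` and its `interval` parameter are hypothetical names, and the detection call is passed in as a callable so the sketch runs without Berkeley DB. With a real environment the callable would presumably be something like `lambda: dbenv.lock_detect(bsddb.db.DB_LOCK_DEFAULT, 0)`, as quoted in the comment.

```python
import threading


def start_lock_detector(detect, interval=0.05):
    """Call detect() every `interval` seconds on a daemon thread.

    Returns (thread, stop_event); set the event to end the loop.
    """
    stop = threading.Event()

    def loop():
        # Event.wait doubles as an interruptible sleep.
        while not stop.wait(interval):
            detect()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread, stop


# A counter stands in for the real lock_detect call.
calls = []
thread, stop = start_lock_detector(lambda: calls.append(1), interval=0.01)
threading.Event().wait(0.2)   # let the poller run a few times
stop.set()
thread.join()
```

The polling interval bounds how long a deadlocked worker waits before being released, which is the same trade-off noted for db_deadlock.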

    @gpshead
    Member

    gpshead commented Nov 5, 2005

    Logged In: YES
    user_id=413

    oh good i see you already suggested the simple thread
    calling lock_detect that I was about to suggest. :)

    regardless a thread isn't needed. see dbenv.set_lk_detect which
    tells BerkeleyDB to run deadlock detection automatically
    anytime a lock conflict occurs.

    http://www.sleepycat.com/docs/api_c/env_set_lk_detect.html

    Just add e.set_lk_detect(db.DB_LOCK_DEFAULT) to
    bsddb/__init__.py's _openDBEnv() function.

    That causes hammer.py to get DBLockDeadlockError exceptions
    as expected (dying if the main thread gets one). No
    lock_detect thread needed.

    The bsddb legacy interface in __init__.py could have all of
    its database accesses wrapped in the
    bsddb.dbutils.DeadlockWrap function to prevent this.
    (testing now)

    @gpshead
    Member

    gpshead commented Nov 5, 2005

    Logged In: YES
    user_id=413

    modifying bsddb/__init__.py to wrap all calls with
    DeadlockWrap will be a bit of a pita (but would be doable).
    I've attached an example wrapped_hammer.py that
    demonstrates the _openDBEnv change as well as DeadlockWrap
    wrapping to work properly.
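The idea behind wrapping a call so that a deadlock victim simply tries again can be sketched as below. This is a minimal illustration, not the source of bsddb.dbutils.DeadlockWrap or of wrapped_hammer.py: `retry_on_deadlock`, its `max_retries` and `first_sleep` parameters, and `FakeDeadlockError` (standing in for DBLockDeadlockError) are all assumptions, and the backoff policy is one plausible choice.

```python
import time


class FakeDeadlockError(Exception):
    """Hypothetical stand-in for bsddb.db.DBLockDeadlockError."""


def retry_on_deadlock(func, *args, max_retries=5, first_sleep=0.001, **kwargs):
    """Call func, retrying with exponential backoff when it reports deadlock."""
    sleep_time = first_sleep
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except FakeDeadlockError:
            if attempt == max_retries:
                raise                  # give up and let the caller see it
            time.sleep(sleep_time)
            sleep_time *= 2


attempts = []


def flaky_put(key, value):
    """Pretend to be picked as the deadlock victim twice, then succeed."""
    attempts.append(key)
    if len(attempts) < 3:
        raise FakeDeadlockError("locker killed to resolve a deadlock")
    return (key, value)


result = retry_on_deadlock(flaky_put, "k", "v")
```

Retrying is only safe when the wrapped call has no side effects that survive a failed attempt, which is why the wrapper is applied per database call rather than around larger units of work.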

    @gpshead
    Member

    gpshead commented Jun 8, 2006

    Logged In: YES
    user_id=413

    I've added

     e.set_lk_detect(db.DB_LOCK_DEFAULT)

    to the _openDBEnv function so that at the very least the bad
    threaded behaviour is upgraded from a deadlock to an exception.

    I'm trying to decide if doing all the DeadlockWrap wrapping
    is worth it (and if there's an automatic python classish way
    to do it rather than manually identifying and wrapping all
    db calls)
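One "classish" option is a proxy object that wraps every method it forwards. The sketch below is an illustration only and is not what was eventually committed: `RetryingProxy`, `retry`, `FlakyDB`, and `FakeDeadlockError` are hypothetical names, and the retry helper omits the sleep a real one would want.

```python
class FakeDeadlockError(Exception):
    """Hypothetical stand-in for bsddb.db.DBLockDeadlockError."""


def retry(func, *args, max_retries=5, **kwargs):
    """Minimal retry helper; a real one would also sleep between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except FakeDeadlockError:
            if attempt == max_retries:
                raise


class RetryingProxy:
    """Forward attribute access to `target`, wrapping callables in retry()."""

    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapped(*args, **kwargs):
            return retry(attr, *args, **kwargs)

        return wrapped


class FlakyDB:
    """Toy database whose put() fails once with a deadlock before succeeding."""

    def __init__(self):
        self.data = {}
        self.failures_left = 1

    def put(self, key, value):
        if self.failures_left:
            self.failures_left -= 1
            raise FakeDeadlockError()
        self.data[key] = value


raw = FlakyDB()
db = RetryingProxy(raw)
db.put("spam", "eggs")   # first attempt raises, the retry succeeds
```

One limitation of this approach: special methods such as `__getitem__` are looked up on the type, so `db[key]` does not go through `__getattr__`. A mapping-style interface would still need those methods defined explicitly on the proxy.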

    @gpshead
    Member

    gpshead commented Jun 15, 2006

    Logged In: YES
    user_id=413

    Fixed. I wrapped all DB calls in Lib/bsddb/__init__.py with
    bsddb.dbutils.DeadlockWrap() and the hammer.py test now
    works rather than deadlocking.

    python svn commit r46969

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022