SIGSEGV in _BTree_get() #21

Closed
ml31415 opened this issue Jan 17, 2016 · 7 comments

@ml31415

ml31415 commented Jan 17, 2016

I recently switched from running ZODB in-process to a client-server solution. Since I switched, I keep getting segfaults within the client process that look like this:

#0 _BTree_get(self=0x7f298..., keyarg=0x7f2988... has_key=0) at BTrees/BTreeTemplate.c:268
    child = <optimized out>
    key = 0x7f29....
    result = 0x0
    copied = <optimized out>
#1 0x00000000000000... in PyEval_EvalFrameEx ()
.....

The client is using gevent and accesses the database via ZlibStorage/ClientStorage. I'm not sure whether the gevent side is relevant in terms of thread safety and whether the crash may be related to that, but looking at several of these stack traces, they all seem to happen in this _BTree_get function.
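For reference, the stack is wired up roughly like this (hostname and port are placeholders, and this is just the usual ZlibStorage-around-ClientStorage pattern, not my literal setup):

# Rough sketch of the client-side stack; hostname/port are placeholders and the
# wrapper order is just the usual zc.zlibstorage-around-ClientStorage pattern.
from ZEO.ClientStorage import ClientStorage
from zc.zlibstorage import ZlibStorage
from ZODB import DB

client = ClientStorage(('zeo.example.com', 8100))  # hypothetical ZEO address
db = DB(ZlibStorage(client))                        # compressed client storage
conn = db.open()
root = conn.root()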

If I can somehow provide more helpful debug output, please let me know.

Edit:

Doing some more experiments, I caught another segfault, this time in rangeSearch:

Program received signal SIGSEGV, Segmentation fault.
BTree_rangeSearch (self=0x7fffe1f74c50, args=<optimized out>, kw=<optimized out>, type=<optimized out>) at src/BTrees/BTreeTemplate.c:1595
1595    src/BTrees/BTreeTemplate.c: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame 
  16   Thread 0x7fffbb7fe700 (LWP 22623) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffac000bc0) at sem_waitcommon.c:42
  15   Thread 0x7fffbbfff700 (LWP 22622) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffb4000e00) at sem_waitcommon.c:42
  14   Thread 0x7fffd8ff9700 (LWP 22621) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffc0000e30) at sem_waitcommon.c:42
  13   Thread 0x7fffd97fa700 (LWP 22620) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffbc000bf0) at sem_waitcommon.c:42
  12   Thread 0x7fffd9ffb700 (LWP 22619) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffc8000e00) at sem_waitcommon.c:42
  11   Thread 0x7fffda7fc700 (LWP 22618) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffc4001090) at sem_waitcommon.c:42
  10   Thread 0x7fffdaffd700 (LWP 22617) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffd0000bc0) at sem_waitcommon.c:42
  9    Thread 0x7fffdb7fe700 (LWP 22616) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffcc000e10) at sem_waitcommon.c:42
  8    Thread 0x7fffdbfff700 (LWP 22615) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffd4000bf0) at sem_waitcommon.c:42
  7    Thread 0x7fffe0ea6700 (LWP 22614) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffdc000cd0) at sem_waitcommon.c:42
  6    Thread 0x7fffee64a700 (LWP 22607) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  5    Thread 0x7fffeee4b700 (LWP 22606) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  4    Thread 0x7fffef64c700 (LWP 22605) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  3    Thread 0x7ffff41fb700 (LWP 22604) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
* 1    Thread 0x7ffff7fca700 (LWP 22596) "python" BTree_rangeSearch (self=0x7fffe1f74c50, args=<optimized out>, kw=<optimized out>, type=<optimized out>) at src/BTrees/BTreeTemplate.c:1595

So I'm still as clueless as before about how to fix it, but I suspect it's gevent+ClientStorage not playing well together rather than BTrees, so I'll close the issue here.

ml31415 closed this as completed Jan 17, 2016
@jamadden
Member

We use BTrees+gevent+ClientStorage in some configurations, and I haven't seen any crashes like this.

@ml31415
Author

ml31415 commented Jan 17, 2016

Any ideas how to dig deeper into this? It's fairly reproducible. As soon as there is more activity going on with the DB (like on every program startup cycle), I see the usual startup, then the activity peak, then a bunch of new threads being spawned, and then the segfault:

(random last log line)
[New Thread 0x7fffdadfd700 (LWP 23777)]
[New Thread 0x7fffda5fc700 (LWP 23778)]
[New Thread 0x7fffd9dfb700 (LWP 23779)]
[New Thread 0x7fffd95fa700 (LWP 23780)]
[New Thread 0x7fffd8df9700 (LWP 23781)]
[New Thread 0x7fffbbfff700 (LWP 23782)]
[New Thread 0x7fffbb7fe700 (LWP 23783)]

Program received signal SIGSEGV, Segmentation fault.
_BTree_get (self=0x7fffe002aef0, keyarg=(('GRYDER', ('AARON', 'T')), 'M'), has_key=0) at BTrees/BTreeTemplate.c:268
268 BTrees/BTreeTemplate.c: No such file or directory.

These new threads also get spawned when I run the database in-process, so I'm not sure what they're doing, but they don't seem to be causing the issue.

What I also noticed: especially on startup I get KeyErrors from BTrees for keys that are supposed to be present, and that do show up as present when I simply retry. Have you ever noticed something like that?

@tseaver
Member

tseaver commented Jan 17, 2016

@ml31415 if you can build Python with debug enabled and run your app under gdb, you might be able to get some more clues (e.g., see the contents of self->data).

Another question: are you trying to share your database connection across threads? The ZODB doesn't expect that: in the stock model each thread would check out a connection as needed from the pool managed by the database (e.g., at the start of a web request), and then return it when finished (at the end of a request, for instance).
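The stock pattern looks roughly like this (illustrative names, not tied to your app):

# Rough sketch of the stock per-thread pattern: each worker checks out its own
# connection from the database's pool and returns it when done. Illustrative only.
import transaction

def handle_request(db):
    conn = db.open()       # check a connection out of the database's pool
    try:
        root = conn.root()
        # ... read / modify persistent objects for this request ...
        transaction.commit()
    except Exception:
        transaction.abort()
        raise
    finally:
        conn.close()       # return the connection to the pool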

@ml31415
Author

ml31415 commented Jan 17, 2016

I use gevent in only one thread; no multithreading is intended. The first line of the program does the monkey-patching, so all the spawned threads must be created somewhere at the C level. I'm not sure exactly what they're doing. From gevent, I have about 50-100 greenlets that access the database without further synchronisation. With plain FileStorage, this worked flawlessly.
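The general shape is roughly this; the per-greenlet work and the assumption that each greenlet opens its own connection are simplifications, not my literal code:

# Simplified shape of the program: monkey-patch first, then spawn greenlets
# that each touch the database. Whether a connection is shared or opened per
# greenlet is an assumption in this sketch.
from gevent import monkey; monkey.patch_all()  # first line of the program
import gevent

def worker(db, n):
    conn = db.open()       # one connection per greenlet in this sketch
    try:
        root = conn.root()
        # ... infrequently read/modify a handful of objects ...
    finally:
        conn.close()

# db is the ZlibStorage/ClientStorage stack shown in the first comment
# gevent.joinall([gevent.spawn(worker, db, n) for n in range(100)])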

About self->data, I'm not sure which object you're talking about, but I'll give it a try with the debug symbols.

@jamadden
Member

all the spawned threads must be created somewhere at the C level. I'm not sure exactly what they're doing.

By default, gevent uses a threadpool to handle hostname (DNS) lookups, and optionally certain types of I/O with FileObjectThread. Chances are the threads you see are pool threads that did DNS lookups, such as when you connect to the database.
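If you want to confirm that, something like the following shows which resolver is active and how large the hub's threadpool is; the attribute names are from gevent 1.x and may differ in other versions:

# Quick check of gevent's resolver and threadpool (gevent 1.x attribute names;
# treat them as an assumption for other versions).
import gevent

hub = gevent.get_hub()
print(type(hub.resolver))      # the default is a thread-based resolver
print(hub.threadpool.size)     # pool threads currently kept alive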

@ml31415
Author

ml31415 commented Jan 17, 2016

Yeah, the program does indeed do a bunch of DNS lookups at that time, though the DB itself is accessed via a socket. So I guess these threads are unrelated to the problem then.

@ml31415
Author

ml31415 commented Jan 22, 2016

Just for info, here are the results of my further experiments with this:

  • Threading is definitely unrelated; setting the gevent resolver to ares disabled any threading, but the problem persisted
  • The problem is not reproducible with a freshly initialized (nearly empty) database
  • The problem is not reproducible with direct in-process FileStorage access
  • The problem is still present after packing the database
  • The problem also happens with older versions; tested with 3.10.5 and recent versions
  • In parallel with the segfault, there are errors about keys not being found in the BTree even though they are supposed to be there. This may or may not be related (again, not present with direct FileStorage access)
  • The ZEO server process never reports any error
  • It seems easier to reproduce the more greenlets are accessing and modifying the database in parallel, ranging from not reproducible with fewer than about 50 greenlets to nearly instant with 150+. None of them create heavy load, though; each just infrequently modifies a handful of objects
  • All my attempts to write a slim and simple reproducer have failed so far; the rough shape I've been trying is sketched below
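For reference, the reproducer attempts look roughly like this (address, greenlet count and timings are made up, and this version has not triggered the crash for me yet):

# Rough shape of the reproducer attempts: many greenlets infrequently modifying
# a few BTree entries each through ClientStorage. Address, counts and sleeps
# are made up; this has not reproduced the crash so far.
from gevent import monkey; monkey.patch_all()
import random
import gevent
import transaction
from BTrees.OOBTree import OOBTree
from ZEO.ClientStorage import ClientStorage
from ZODB import DB
from ZODB.POSException import ConflictError

db = DB(ClientStorage(('localhost', 8100)))     # placeholder ZEO address

conn = db.open()                                # set up a shared BTree once
if 'tree' not in conn.root():
    conn.root()['tree'] = OOBTree()
    transaction.commit()
conn.close()

def worker(i):
    conn = db.open()
    try:
        tree = conn.root()['tree']
        for _ in range(20):                     # touch a handful of keys
            tree[random.randrange(10000)] = i
            try:
                transaction.commit()
            except ConflictError:
                transaction.abort()
            gevent.sleep(random.random())
    finally:
        conn.close()

gevent.joinall([gevent.spawn(worker, i) for i in range(150)])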

I'm afraid this isn't too helpful yet, so I'll make some more attempts at getting something reproducible together. My thoughts so far; please correct me if I'm drawing the wrong conclusions:

  • Data corruption seems unlikely, as the problem persists after packing and in-process FileStorage access works fine
  • There must be some (rare?) edge case where the BTree gets different data through ClientStorage/the ZEO server than directly from FileStorage, which causes the segfault in _BTree_get
