SIGSEGV in _BTree_get() #21

Closed
ml31415 opened this issue Jan 17, 2016 · 7 comments

@ml31415

ml31415 commented Jan 17, 2016

I recently switched from running ZODB in-process to a client-server solution. Since I switched, I keep getting segfaults within the client process that look like this:

#0 _BTree_get(self=0x7f298..., keyarg=0x7f2988... has_key=0) at BTrees/BTreeTemplate.c:268
    child = <optimized out>
    key = 0x7f29....
    result = 0x0
    copied = <optimized out>
#1 0x00000000000000... in PyEval_EvalFrameEx ()
.....

The client is using gevent and accesses the database via ZlibStorage/ClientStorage. I'm not sure whether the gevent side is relevant in terms of thread safety and whether the crash may be related to that, but looking at several of these stack traces, they all seem to happen in this _BTree_get function.
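For reference, the stack is wired up roughly like this (hostname and port are placeholders, and this is just the usual ZlibStorage-around-ClientStorage pattern, not my literal setup):

# Rough sketch of the client-side stack; hostname/port are placeholders and the
# wrapper order is just the usual zc.zlibstorage-around-ClientStorage pattern.
from ZEO.ClientStorage import ClientStorage
from zc.zlibstorage import ZlibStorage
from ZODB import DB

client = ClientStorage(('zeo.example.com', 8100))  # hypothetical ZEO address
db = DB(ZlibStorage(client))                        # compressed client storage
conn = db.open()
root = conn.root()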

If I can somehow provide more helpful debug output, please let me know.

Edit:

Doing some more experiments, I caught another segfault, this time in rangeSearch:

Program received signal SIGSEGV, Segmentation fault.
BTree_rangeSearch (self=0x7fffe1f74c50, args=<optimized out>, kw=<optimized out>, type=<optimized out>) at src/BTrees/BTreeTemplate.c:1595
1595    src/BTrees/BTreeTemplate.c: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame 
  16   Thread 0x7fffbb7fe700 (LWP 22623) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffac000bc0) at sem_waitcommon.c:42
  15   Thread 0x7fffbbfff700 (LWP 22622) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffb4000e00) at sem_waitcommon.c:42
  14   Thread 0x7fffd8ff9700 (LWP 22621) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffc0000e30) at sem_waitcommon.c:42
  13   Thread 0x7fffd97fa700 (LWP 22620) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffbc000bf0) at sem_waitcommon.c:42
  12   Thread 0x7fffd9ffb700 (LWP 22619) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffc8000e00) at sem_waitcommon.c:42
  11   Thread 0x7fffda7fc700 (LWP 22618) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffc4001090) at sem_waitcommon.c:42
  10   Thread 0x7fffdaffd700 (LWP 22617) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffd0000bc0) at sem_waitcommon.c:42
  9    Thread 0x7fffdb7fe700 (LWP 22616) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffcc000e10) at sem_waitcommon.c:42
  8    Thread 0x7fffdbfff700 (LWP 22615) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffd4000bf0) at sem_waitcommon.c:42
  7    Thread 0x7fffe0ea6700 (LWP 22614) "python" 0x00007ffff7bca0c9 in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fffdc000cd0) at sem_waitcommon.c:42
  6    Thread 0x7fffee64a700 (LWP 22607) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  5    Thread 0x7fffeee4b700 (LWP 22606) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  4    Thread 0x7fffef64c700 (LWP 22605) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  3    Thread 0x7ffff41fb700 (LWP 22604) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
* 1    Thread 0x7ffff7fca700 (LWP 22596) "python" BTree_rangeSearch (self=0x7fffe1f74c50, args=<optimized out>, kw=<optimized out>, type=<optimized out>) at src/BTrees/BTreeTemplate.c:1595

So I'm still as clueless as before about how to fix it, but I suspect it's gevent+ClientStorage not playing well together rather than BTrees, so I'll close the issue here.

ml31415 closed this as completed Jan 17, 2016
@jamadden
Member

We use BTrees+gevent+ClientStorage in some configurations, and I haven't seen any crashes like this.

@ml31415
Author

ml31415 commented Jan 17, 2016

Any ideas how to dig deeper into this? It's fairly reproducible. As soon as there is more activity going on with the DB (like on every program startup cycle), I see the usual startup, then the activity peak, then a bunch of new threads being spawned, and then the segfault:

(random last log line)
[New Thread 0x7fffdadfd700 (LWP 23777)]
[New Thread 0x7fffda5fc700 (LWP 23778)]
[New Thread 0x7fffd9dfb700 (LWP 23779)]
[New Thread 0x7fffd95fa700 (LWP 23780)]
[New Thread 0x7fffd8df9700 (LWP 23781)]
[New Thread 0x7fffbbfff700 (LWP 23782)]
[New Thread 0x7fffbb7fe700 (LWP 23783)]

Program received signal SIGSEGV, Segmentation fault.
_BTree_get (self=0x7fffe002aef0, keyarg=(('GRYDER', ('AARON', 'T')), 'M'), has_key=0) at BTrees/BTreeTemplate.c:268
268 BTrees/BTreeTemplate.c: No such file or directory.

These new threads also get spawned when I run the database in-process, so I'm not sure what they're doing, but they don't seem to be causing the issue.

What I also noticed: especially on startup I get KeyErrors from BTrees for keys that are supposed to be present, and that do show up as present when I simply retry. Have you ever noticed something like that?

@tseaver
Member

tseaver commented Jan 17, 2016

@ml31415 if you can build Python with debug enabled and run your app under gdb, you might be able to get some more clues (e.g., see the contents of self->data).

Another question: are you trying to share your database connection across threads? The ZODB doesn't expect that: in the stock model each thread would check out a connection as needed from the pool managed by the database (e.g., at the start of a web request), and then return it when finished (at the end of a request, for instance).
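The stock pattern looks roughly like this (illustrative names, not tied to your app):

# Rough sketch of the stock per-thread pattern: each worker checks out its own
# connection from the database's pool and returns it when done. Illustrative only.
import transaction

def handle_request(db):
    conn = db.open()       # check a connection out of the database's pool
    try:
        root = conn.root()
        # ... read / modify persistent objects for this request ...
        transaction.commit()
    except Exception:
        transaction.abort()
        raise
    finally:
        conn.close()       # return the connection to the pool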

@ml31415
Author

ml31415 commented Jan 17, 2016

I use gevent in only one thread; no multithreading is intended. The first line of the program does the monkey-patching, so all the spawned threads must be created somewhere at the C level. I'm not sure exactly what they're doing. From gevent, I have about 50-100 greenlets that access the database without further synchronisation. With plain FileStorage, this worked flawlessly.
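The general shape is roughly this; the per-greenlet work and the assumption that each greenlet opens its own connection are simplifications, not my literal code:

# Simplified shape of the program: monkey-patch first, then spawn greenlets
# that each touch the database. Whether a connection is shared or opened per
# greenlet is an assumption in this sketch.
from gevent import monkey; monkey.patch_all()  # first line of the program
import gevent

def worker(db, n):
    conn = db.open()       # one connection per greenlet in this sketch
    try:
        root = conn.root()
        # ... infrequently read/modify a handful of objects ...
    finally:
        conn.close()

# db is the ZlibStorage/ClientStorage stack shown in the first comment
# gevent.joinall([gevent.spawn(worker, db, n) for n in range(100)])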

About self->data, I'm not sure which object you're talking about, but I'll give it a try with the debug symbols.

@jamadden
Member

all the spawned threads must be created somewhere at the C level. I'm not sure exactly what they're doing.

By default, gevent uses a threadpool to handle hostname (DNS) lookups, and optionally certain types of I/O with FileObjectThread. Chances are the threads you see are pool threads that did DNS lookups, such as when you connect to the database.
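If you want to confirm that, something like the following shows which resolver is active and how large the hub's threadpool is; the attribute names are from gevent 1.x and may differ in other versions:

# Quick check of gevent's resolver and threadpool (gevent 1.x attribute names;
# treat them as an assumption for other versions).
import gevent

hub = gevent.get_hub()
print(type(hub.resolver))      # the default is a thread-based resolver
print(hub.threadpool.size)     # pool threads currently kept alive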

@ml31415
Author

ml31415 commented Jan 17, 2016

Yeah, the program does indeed do a bunch of DNS lookups at that time, though the DB itself is accessed via a socket. So I guess these threads are unrelated to the problem then.

@ml31415
Author

ml31415 commented Jan 22, 2016

Just for info, here are the results of my further experiments with this:

  • Threading is definitely unrelated; setting the gevent resolver to ares disabled any threading, but the problem persisted
  • The problem is not reproducible with a freshly initialized (nearly empty) database
  • The problem is not reproducible with direct in-process FileStorage access
  • The problem is still present after packing the database
  • The problem also happens with older versions; tested with 3.10.5 and recent versions
  • In parallel with the segfault, there are errors about keys not being found in the BTree even though they are supposed to be there. This may or may not be related (again, not present with direct FileStorage access)
  • The ZEO server process never reports any error
  • It seems easier to reproduce the more greenlets are accessing and modifying the database in parallel, ranging from not reproducible with fewer than about 50 greenlets to nearly instant with 150+. None of them create heavy load, though; each just infrequently modifies a handful of objects
  • All my attempts to write a slim and simple reproducer have failed so far; the rough shape I've been trying is sketched below
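For reference, the reproducer attempts look roughly like this (address, greenlet count and timings are made up, and this version has not triggered the crash for me yet):

# Rough shape of the reproducer attempts: many greenlets infrequently modifying
# a few BTree entries each through ClientStorage. Address, counts and sleeps
# are made up; this has not reproduced the crash so far.
from gevent import monkey; monkey.patch_all()
import random
import gevent
import transaction
from BTrees.OOBTree import OOBTree
from ZEO.ClientStorage import ClientStorage
from ZODB import DB
from ZODB.POSException import ConflictError

db = DB(ClientStorage(('localhost', 8100)))     # placeholder ZEO address

conn = db.open()                                # set up a shared BTree once
if 'tree' not in conn.root():
    conn.root()['tree'] = OOBTree()
    transaction.commit()
conn.close()

def worker(i):
    conn = db.open()
    try:
        tree = conn.root()['tree']
        for _ in range(20):                     # touch a handful of keys
            tree[random.randrange(10000)] = i
            try:
                transaction.commit()
            except ConflictError:
                transaction.abort()
            gevent.sleep(random.random())
    finally:
        conn.close()

gevent.joinall([gevent.spawn(worker, i) for i in range(150)])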

I'm afraid this isn't too helpful yet, so I'll make some more attempts at getting something reproducible together. My thoughts so far; please correct me if I'm drawing the wrong conclusions:

  • Data corruption seems unlikely, as the problem persists after packing and in-process FileStorage access works fine
  • There must be some (rare?) edge case where the BTree gets different data through ClientStorage/the ZEO server than directly from FileStorage, which causes the segfault in _BTree_get
