`ProgrammingError` after reconnecting. #32

luhn · 2014-07-18T15:57:04Z

I'm using ConnectionPool with DeadConnectionDetector. Here's my connection function, if it helps.

This morning, my primary database failed, which happened to be during a period of high-frequency querying. So of course, psycopg2.OperationalError appears several times, before it was replaced with ConnectionDead. Shortly thereafter, my standby server was promoted to master. txpostgres was then able to reconnect, which [I assume] it did. But then, immediately, the following error starts: psycopg2.ProgrammingError: execute cannot be used while an asynchronous query is underway. As far as I can tell, no queries were successfully executed until I restarted the app.

I've been digging around the txpostgres code to try to find how this could happen, but I haven't had any enlightenment so far. I was wondering if you might have any ideas.

Here are some tracebacks, but—as is usual for Twisted—they aren't very helpful.

The text was updated successfully, but these errors were encountered:

wulczer · 2014-07-21T08:31:51Z

Hi,

First of all, thanks for the report!

Which version of Twisted are you using? The culprit seems to be this exception:

ValueError: file descriptor cannot be a negative integer (-1)

which is something that has been plaguing me for a long time - there's at least one Twisted ticket that has been opened in connection with this problem:

http://twistedmatrix.com/trac/ticket/4539

As you can see, the ticket is now solved, but I've seen this error happen on our production system once, even though we're running recent Twisted. OTOH, we've had a master/slave fallback recently and everything kept working.

It would be very good if you could come up with a way to reproduce the problem. At this point, I'm not even sure if it's Twisted's or psycopg2's fault... The -1 file descriptor error is most probably Twisted, but the asynchronous query underway one - that's coming from psycopg2. txpostgres calls close() on cursor detected to be in an error state, but perhaps under some circumstances that doesn't really clear the error...

Without a reproducible test case it's hard to debug; I might try to do some tests with hammering an instance with queries while restarting the database, but in the meantime I'm flying blind :(

luhn · 2014-07-21T15:10:33Z

Which version of Twisted are you using?

I'm using Twisted 13.2.0. Not the most recent, but within the past year.

OTOH, we've had a master/slave fallback recently and everything kept working.

I should note that I've also had failovers that have worked fine, this is the first time something went wrong.

Without a reproducible test case it's hard to debug; I might try to do some tests with hammering an instance with queries while restarting the database, but in the meantime I'm flying blind :(

I'll play around when I get a chance and see if I can come up with anything.

luhn · 2014-07-22T15:59:54Z

After doing some mild investigation: It looks like Ticket #4539 is broken again. The test_foo.py you wrote fails on Ubuntu. (Works fine on my mac. Different reactor?)

luhn · 2014-07-22T16:55:53Z

It appears that the tests created for the previous fix (removedFromReactor, negativeOneFileDescriptor) are passing.

wulczer · 2014-07-23T07:03:15Z

Thanks for the investigation! I just tried with Twisted trunk and test_foo.py failed on my Debian laptop.

I tried with various reactors and both epoll (the default on Linux) and poll failed, while select worked - and AFAICS select is what's used on OSX.

I'll try digging deeper into that, although I'm also a bit concerned about this problem being a red herring. I'm not 100% positive that it's what has been causing the reconnection problem you got; it would be great to have a reproducible test case for that so we can be sure that fixing the -1 FD problem fixes the txpostgres bug.

wulczer · 2014-07-23T07:08:20Z

OK, that one was easy: for epoll, fileno() is called from inside addReader, so it's obvious why my test_foo.py was failing. If I move self.insideReactor = True to just above self.reactor.addReader(self) the test passes.

I'll try to write a program to bombard a Postgres instance with queries and kill the database - let's see if I manage to get a similar error to yours.

wulczer · 2014-07-23T08:32:19Z

I did a few more tests with txpostgres and Twisted trunk and have been able to reproduce the -1 error, but it still managed to recover and continue querying. Here's the code:

https://gist.github.com/fa25fbae2855e8e5d2d6

I'm running it in one terminal and in another one doing:

sudo pg_ctlcluster -m fast 9.1 main restart

This gives me

Unhandled Error
Traceback (most recent call last):
  File "/home/wulczer/src/twisted.git/twisted/python/log.py", line 88, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/home/wulczer/src/twisted.git/twisted/python/log.py", line 73, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/home/wulczer/src/twisted.git/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/wulczer/src/twisted.git/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/home/wulczer/src/twisted.git/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
    why = selectable.doRead()
  File "/home/wulczer/src/txpostgres/txpostgres/txpostgres.py", line 683, in doRead
    self.reactor.addReader(self)
  File "/home/wulczer/src/twisted.git/twisted/internet/epollreactor.py", line 271, in addReader
    EPOLLIN, EPOLLOUT)
  File "/home/wulczer/src/twisted.git/twisted/internet/epollreactor.py", line 257, in _add
    self._poller.register(fd, flags)
exceptions.ValueError: file descriptor cannot be a negative integer (-1)

but after that I get a reconnection and the queries keep running.

I start to suspect that the -1 FD is indeed a red herring. We should focus on reproducing the actual problem with the asynchronous query underway error.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`ProgrammingError` after reconnecting. #32

`ProgrammingError` after reconnecting. #32

luhn commented Jul 18, 2014

wulczer commented Jul 21, 2014

luhn commented Jul 21, 2014

luhn commented Jul 22, 2014

luhn commented Jul 22, 2014

wulczer commented Jul 23, 2014

wulczer commented Jul 23, 2014

wulczer commented Jul 23, 2014

ProgrammingError after reconnecting. #32

ProgrammingError after reconnecting. #32

Comments

luhn commented Jul 18, 2014

wulczer commented Jul 21, 2014

luhn commented Jul 21, 2014

luhn commented Jul 22, 2014

luhn commented Jul 22, 2014

wulczer commented Jul 23, 2014

wulczer commented Jul 23, 2014

wulczer commented Jul 23, 2014

`ProgrammingError` after reconnecting. #32

`ProgrammingError` after reconnecting. #32