New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround for "gevent missed a libzmq recv event on 17!" #229

Closed
gtaylor opened this Issue Jun 28, 2012 · 28 comments

Comments

Projects
None yet
8 participants
@gtaylor

gtaylor commented Jun 28, 2012

I saw some activity regarding this bug, but wasn't sure of the correct way to work around it. This is the simple application that is seeing this error:

https://github.com/gtaylor/EVE-Market-Data-Relay/blob/master/emdr/daemons/announcer/main.py

stderr shows:

BUG: gevent missed a libzmq recv event on 17!

Given what I have, what is the correct way to avoid this bug? I see something about using a poller, but wasn't sure if that was a fix.

@minrk

This comment has been minimized.

Member

minrk commented Jun 28, 2012

If I knew a workaround, I wouldn't have given up and put a print statement. If anything, this is already a workaround because it's introducing a maximum timeout of one second on missed events, rather than infinite as it was in gevent_zeromq.

I cannot say with certainty, but I believe this is a bug outside pyzmq, either in libzmq or gevent. Are you using libzmq-3.x?

@gtaylor

This comment has been minimized.

gtaylor commented Jun 29, 2012

I'm using libzmq2. Does this behavior manifest in libzmq3?

Is there any chance an exception could be raised when this happens, so I can react to it?

@minrk

This comment has been minimized.

Member

minrk commented Jun 29, 2012

This message only indicates that gevent did not wake the event watcher as soon as it was supposed to. All events are actually processed correctly, there was only a delay of up to one second in processing. I'm not sure what exception I would raise here.

@gtaylor

This comment has been minimized.

gtaylor commented Jun 29, 2012

We've noticed that sometimes after seeing this print in stdout, some of our SUB clients seem to disconnect and fail to reconnect. I'm not sure if it's due to this issue, or something else, but we had two separate clients in two different data centers on very different hardware end up in the same state at the same time shortly after one of these BUG entries appeared in stdout on the upstream PUB gevent+pyzmq process.

@michelp

This comment has been minimized.

Contributor

michelp commented Jun 29, 2012

@gtaylor is it possible to provide us with a small reproducible test case? I'd love to squash this bug as well. Also what version of gevent are you using?

@gtaylor

This comment has been minimized.

gtaylor commented Jun 29, 2012

@michelp This will eventually manifest the error on some machines, but I haven't figured out the pattern yet:

https://github.com/gtaylor/EVE-Market-Data-Relay/blob/master/emdr/daemons/announcer/main.py

@michelp

This comment has been minimized.

Contributor

michelp commented Jun 30, 2012

Hmm, it looks normal enough. I wonder if you change it so that only a a single greenlet is spawned around run(), and the while loop iterated in that greenlet it would change the occurrence of the error. Not sure how it would be related but maybe threre is some kind of race around spawning and event triggering.

Also what version of gevent are you using?

@tmc

This comment has been minimized.

tmc commented Jun 30, 2012

FWIW after some digging it appears this is likely a libevent issue as I can consistently /not/ produce it with gevent 1.0 (beta) releases (which has switched to libev).

@gtaylor

This comment has been minimized.

gtaylor commented Jul 1, 2012

@michelp I'm running this: https://launchpad.net/ubuntu/precise/amd64/libevent-2.0-5

@traviscline Interesting. And pyzmq is all happy with gevent 1.0?

@spahl

This comment has been minimized.

spahl commented Jul 5, 2012

I also confirm that I only see this issue with libevent and not libev.

@minrk

This comment has been minimized.

Member

minrk commented Aug 2, 2012

Should I close this as just a gevent bug, for which the 'workaround' is to use gevent 1.0-dev?

@gtaylor

This comment has been minimized.

gtaylor commented Aug 2, 2012

If it's documented somewhere on the Sphinx docs. It probably should be visible somewhere, at least, before calling it "fixed".

@minrk

This comment has been minimized.

Member

minrk commented Aug 2, 2012

I will make a note there, thanks.

@minrk minrk closed this in c676027 Aug 19, 2012

minrk added a commit that referenced this issue Aug 19, 2012

Merge pull request #248 from minrk/release
Prep for 2.2.0-1 release

This will be the first release after merging gevent_zeromq and pyzmq-static.

* workaround setuptools bug
* update constants from current state of libzmq-3.2.x
* minor upstream pull from tornado.ioloop
* minor pxd fix

Still some docs to do

closes #246
closes #245
closes #229
closes #226
@awestendorf

This comment has been minimized.

awestendorf commented Jan 22, 2013

I can confirm that I still see this problem using both gevent 1.0b1 and 1.0rc2 combined with pyzmq 2.2.0.1

@minrk

This comment has been minimized.

Member

minrk commented Jan 22, 2013

Well, crap then. There does not appear to be anything pyzmq can do about it. Can you post code that reliably encounters this gevent bug?

@awestendorf

This comment has been minimized.

awestendorf commented Jan 22, 2013

I'll try, it's wrapped up in a larger app. May take me a bit given my current work queue.

@ultrabug

This comment has been minimized.

ultrabug commented Jan 24, 2013

I'm also affected by this bug. I just discovered it as I'm migrating from gevent_zeromq to zmq.green.
More details :

  • no BUG using gevent-zeromq-0.2.5 + pyzmq-2.2.0.1 + zeromq-2.2.0 + gevent-1.0_beta2
  • BUG using pyzmq-2.2.0.1 + zeromq-3.2.2 + gevent-1.0_beta2
@minrk

This comment has been minimized.

Member

minrk commented Jan 24, 2013

@ultrabug - that's good to know, that means I may have introduced a bug. What about zmq.green in pyzmq-2.2.0.1?

@ultrabug

This comment has been minimized.

ultrabug commented Jan 24, 2013

@minrk : Not sure what you're asking me. The BUG log appeared, using the same version of pyzmq (2.2.0.1), when I migrated my code from gevent_zeromq to zmq.green.

@minrk

This comment has been minimized.

Member

minrk commented Jan 24, 2013

bah, sorry, I saw 3.2.2, and I thought you meant pyzmq-3.2 dev. Not reading properly. Can you check with current pyzmq master as well? Also, a test that reliably reproduces the issue would be extremely helpful (the pyzmq test suite used to, but does not anymore).

@minrk

This comment has been minimized.

Member

minrk commented Jan 24, 2013

(also double-dumb, because pyzmq master won't be marked as 3.2dev until PR #296 is merged)

@minrk

This comment has been minimized.

Member

minrk commented Jan 24, 2013

@ultrabug - one more case for you to test:

  • zmq.green in pyzmq-2.2.0.1 + zeromq-2.2.0 + gevent-1.0_beta2

Because if it is the case the libzmq has introduced its own bug (entirely possible), there's not a lot we can do about it here.

@ultrabug

This comment has been minimized.

ultrabug commented Jan 25, 2013

@minrk : just ran the test on one node of my cluster, unfortunately downgrading zeromq does NOT solve the problem.

@awestendorf

This comment has been minimized.

awestendorf commented Jan 25, 2013

In my case, this may be a false positive. I tried to create an isolated test of a gevent wsgi server that mimicked our application. We're using gevent, inproc zmq and native threads to parallelize blocking decryption operations we perform on a subset of our HTTP traffic. The HTTP greenlet writes to a shared zmq context with the encrypted data, then waits on a semaphore while a separate greenlet reads from the zmq socket and wakes semaphores as threads return decrypted objects.

I tried to repeat this by passing the headers to separate threads and running crypt or zip operations, sometimes in a loop, but could not repeat the bug using apachebench and similar load as I was running on our application.

I'll continue to try but it may simply be a case that our actual decryption routines are, at load, blocking long enough to trigger the bug in question. I may have to dig in deeper to find out which thread is actually printing the statement in case it's the decrypters printing it out. It's also possible that our application, which is a gateway to AMQP, is causing the bug because it is performing additional IO to communicate with our backend and possibly starving zmq. I hope that nothing is sitting around for a second or more, so if it's our application design we certainly have to look carefully at it.

@asteven

This comment has been minimized.

Contributor

asteven commented Jan 31, 2013

I'm a zmq and gevent noob, so maybe this is not related.

But whatever versions of pyzmq/libevent/gevent I try, I can reproduce the problem with the script at [1].

Maybe the whole problem is related to gevent.sleep() ?

[1] http://armstrong.cc/~steven/tmp/pyzmq_issue_229.py

@asteven

This comment has been minimized.

Contributor

asteven commented Jan 31, 2013

Versions tested:
pyzmq 13.0dev
gevent 0.13.8
libevent 2.0.21-1
libev 4.11-1

pyzmq 13.0dev
gevent 1.0.0.dev
libevent 2.0.21-1
libev 4.11-1

@minrk

This comment has been minimized.

Member

minrk commented Feb 1, 2013

@asteven I can reproduce the message with your script, but it actually seems to be due to a coincidence of the fact that your messages come every one second, and that's the timeout used for detecting the bug.

The sequence of events:

  1. (t=0) message arrives, and is properly handled
  2. t=ε calls recv, which starts a one second timer
  3. t=1 no message, timeout fires
  4. t=1+ε, check for message, which has arrived! triggers BUG message

When I change the timeout to two seconds, all messages are processed promptly

@minrk

This comment has been minimized.

Member

minrk commented Feb 1, 2013

Here's what I'm going to do:

Since I still do not have any evidence that the actual bug persists in gevent 1.0, only that it's easy to trigger the bug message with a simple timing coincidence, I am going to greatly increase the timeout (> 10 seconds), and introduce a debug flag, so that the messages are off by default.

minrk added a commit to minrk/pyzmq that referenced this issue Feb 1, 2013

tweak gevent bug workarounds
timeout is configurable, and now long (> 10 seconds)
debug message is off by default

ref: zeromq#229

@minrk minrk referenced this issue Feb 1, 2013

Merged

tweak gevent bug workarounds #311

2 of 2 tasks complete

minrk added a commit that referenced this issue Feb 1, 2013

Merge pull request #311 from minrk/greenbarf
tweak gevent bug workarounds

- [x] timeout is configurable, and now long (11.6 seconds, to avoid timeout coincidences)
- [x] debug message is off by default

ref: #229
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment