benchmark.py throws SystemError #28

Closed
MichaelAz opened this Issue Jul 12, 2014 · 13 comments

Contributor

MichaelAz commented Jul 12, 2014

When running benchmark.py as found in master, a SystemError is raised:

SystemError: (libev) select: Unknown error
Traceback (most recent call last):
  File "C:\Users\zeev\Desktop\goless\benchmark.py", line 111, in <module>
    main()
  File "C:\Users\zeev\Desktop\goless\benchmark.py", line 105, in main
    prime()
  File "C:\Users\zeev\Desktop\goless\benchmark.py", line 100, in prime
    bench_selects()
  File "C:\Users\zeev\Desktop\goless\benchmark.py", line 68, in bench_selects
    took_nodefault = bench_select(False)
  File "C:\Users\zeev\Desktop\goless\benchmark.py", line 62, in bench_select
    selecting.select(cases)
  File "C:\Users\zeev\Desktop\goless\goless\selecting.py", line 93, in select
    _be.yield_()
  File "C:\Users\zeev\Desktop\goless\goless\backends.py", line 138, in yield_
    gevent.sleep()
  File "C:\Python27\lib\site-packages\gevent\hub.py", line 73, in sleep
    waiter.get()
  File "C:\Python27\lib\site-packages\gevent\hub.py", line 569, in get
    return self.hub.switch()
  File "C:\Python27\lib\site-packages\gevent\hub.py", line 332, in switch
    return greenlet.switch(self)
SystemError: (libev) select: Unknown error

would_deadlock passes, and this consistently happens at iteration 495 (at least for me).

I'm running the code on Windows 7 with the gevent backend, and gevent==1.0.1, greenlet==0.4.2.

Owner

rgalanakis commented Jul 12, 2014

Not getting this behavior on Linux with those versions. Will get a Windows7 virtualbox set up to try things out.

Owner

rgalanakis commented Jul 12, 2014

If you just run: from goless.backends import current; current.yield_(), (something like that) what happens? gevent has never had an issue yielding on the last greenlet, so I don't know where this behavior is coming from... (I am adding some tests to verify this behavior).
Will probably be late Sunday when I am able to look into this on Windows, have weekend plans.

Contributor

MichaelAz commented Jul 12, 2014

That code runs fine. I'll investigate further, see if I can find anything useful.

Contributor

MichaelAz commented Jul 12, 2014

So, something interesting right off the bat.
The benchmark contains this code:

def main():
    prime()
    bench_channels()
    bench_selects()

prime just runs the benchmarks without writing any output, so we can ignore it, but an interesting thing happens when we comment out bench_channels - the error raised by bench_selects magically transforms into a Deadlock error.

The reason is that running bench_channels changes where the error occurs.
When it's run, the error happens at selecting.py, line 93, in the statement _be.yield_().
When it isn't run, the error happens at selecting.py, line 92, in the statement return c, c.exec_().
exec_ causes a send/receive, which is wrapped by the _as_deadlock decorator and thus produces a sane error.
yield_ isn't wrapped by that decorator, so we get the cryptic error.
So perhaps we should think about wrapping exceptions thrown in yield_. Next.
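A minimal sketch of that wrapping idea, written for Python 3 and not taken from the goless source: the Deadlock class, the as_deadlock signature, and the failing yield_ body are stand-ins assumed for illustration.

```python
import functools


class Deadlock(Exception):
    """Stand-in for the library's deadlock error (name assumed here)."""


def as_deadlock(*error_types):
    """Decorator factory: re-raise any of error_types as Deadlock."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except error_types as exc:
                # Chain the original exception so its traceback stays visible.
                raise Deadlock(repr(exc)) from exc
        return wrapper
    return decorator


@as_deadlock(SystemError)
def yield_():
    # Hypothetical backend call that fails like the traceback in this issue.
    raise SystemError('(libev) select: Unknown error')
```

With this in place, a caller would see Deadlock instead of the raw SystemError, with the backend error still attached as the cause.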

Inside bench_selects, it is specifically the call to bench_select(False) that raises the exception.
The reason for this difference in behavior is that passing True to bench_select causes a dcase to be added to the case list, so when none of the other channels are ready the script doesn't throw but uses that default case instead.

There's some subtle race condition here, I believe, with sending to a full channel, because switching to a buffered channel with buffer size 2.
I honestly have no idea what's going on here but I re-wrote it from scratch and it seems to work now. Unless you find a better explanation for this behavior, I think I'll commit the re-written version.

@rgalanakis rgalanakis added a commit that referenced this issue Jul 14, 2014

@rgalanakis rgalanakis Address #28 in some part.
as_deadlock raises with the original traceback as well.
gevent yield_ calls are wrapped in as_deadlock, and stackless yield_ will pass. updated yield_ docstring to reflect behavior.
2421dfa
Owner

rgalanakis commented Jul 14, 2014

Ok, I've improved the behavior of as_deadlock to include the original stacktrace, and yield_ should not raise if it's the last tasklet. I'll dig into this on Windows now.

Owner

rgalanakis commented Jul 14, 2014

May take a while to get my Windows box set up for development... in the meantime, could you try with the tip of gevent in github?

There's some subtle race condition here, I believe, with sending to a full channel, because switching to a buffered channel with buffer size 2.

Yes very likely. We suspect this is why the pypystackless tests don't work either. I will work through this code and see.

Also, going from the gevent docs, it appears libev has some problems on Windows: not just bugs but also unknown errors. There could also be some gevent->libev bugs on Windows.

Owner

rgalanakis commented Jul 14, 2014

Ok so here's some progress for the morning. A bit of a mind-dump, maybe writing it out will help uncover something?

I can repro easily (on Windows only) by taking the bench_select code into a script and running that. Unfortunately the behavior disappears within a test framework or under the debugger!

This has nothing to do with a deadlock, so I've removed the as_deadlock catch for SystemError. We are putting gevent/libev into a bad state somehow. I suspect the same thing is happening that is causing pypystackless to be in a bad state. It's the same sort of thing: the symptom is that there's no runnable tasklet or whatever, but that cannot really be. Solving one may solve the other! (See #2 )

This is where it gets interesting. On my machine, I consistently fail at iteration 997. However, if instead of:

def sender():
    while True:
        c.send(0)
        c.recv()

I have (you may need to import backends first):

def sender():
    while True:
        c.send(0)
        backends.current.yield_()
        c.recv()

I fail on iteration 499, which is about half of 997. Do you get the same behavior @MichaelAz , or is that just coincidence on my end? I suspect you are spot on, that the problem is send/recv to a full channel and the behavior that goes on there. The semantics are not totally clear: a blocked send will of course yield, but how about an unblocked send? I can't remember if it's tested, or even defined. There are some potential problems to work through. Will keep the thread updated over the next few days.

Updates:

  • Update 1: Ah, looking at BackendChannelSenderReceiverPriorityTest, maybe something lies there...
  • Update 2: Wrote a test to verify that a successful send or recv does not yield control. Found an issue! Investigating; feel like I'm on the right track.
  • Update 3: Added tests to verify behavior of send/recv priority. Also found that I do need to catch SystemError on Windows in the event of a deadlock. from gevent.queue import Channel; Channel().get() will raise a SystemError on Windows but LoopExit on Linux.
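The platform difference in the last update could be handled roughly as below. This is only a sketch: LoopExit here is a local stand-in so the snippet runs without gevent installed, and deadlock_errors is a hypothetical helper name, not something from goless.

```python
import sys


class LoopExit(Exception):
    """Local stand-in for gevent's LoopExit (assumed for this sketch)."""


def deadlock_errors(platform=sys.platform):
    """Return the exception types to treat as a deadlock on `platform`."""
    errors = (LoopExit,)
    if platform == 'win32':
        # In this thread, the same blocked get() surfaced as SystemError on Windows.
        errors += (SystemError,)
    return errors
```

The resulting tuple can be passed straight to an except clause, so the Linux path never swallows a SystemError it was not expected to see.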

@rgalanakis rgalanakis added a commit that referenced this issue Jul 14, 2014

@rgalanakis rgalanakis Still working on #28. Added tests verifying when a sender/receiver yi…
…elds control or not on successful send/recv.
512ff20

@rgalanakis rgalanakis added a commit that referenced this issue Jul 14, 2014

@rgalanakis rgalanakis SystemError is used as a deadlock error again on Windows. Happens in …
…certain cases. See issue #28.
62345db
Owner

rgalanakis commented Jul 14, 2014

Okay, confirmed a few things. Basically, something that will deadlock or run perfectly well on Linux will raise on Windows:

from gevent.queue import Channel
import gevent
c = Channel()
def sender():
    while True:
        c.put(0)
gevent.spawn(c.put, 1)
for i in range(1000):
    gevent.sleep(0)

Will exit fine on Linux, will error on Windows. I also cannot replicate in all cases, like under a test runner.

I can catch the error in select and ignore it to replicate the Linux behavior on Windows. Other than a performance cost and the risk of more Windows bugs in the future, I'm not sure what else we can do. It's up to libev/gevent to fix.

Contributor

MichaelAz commented Jul 16, 2014

It's been a crazy week, I'll go over your updates more thoroughly tomorrow evening/Friday morning.

Contributor

MichaelAz commented Jul 18, 2014

This is where it gets interesting. On my machine, I consistently fail at iteration 997. However, if instead of:

def sender():
    while True:
        c.send(0)
        c.recv()

I have (you may need to import backends first):

def sender():
    while True:
        c.send(0)
        backends.current.yield_()
        c.recv()

I'm getting the same behavior.

If this is really a bug in gevent on Windows we ought to open an issue with them. But, since this is (probably) related to the pypystackless bug - perhaps we're at fault. I really don't know.
Could you link to the docs you mentioned about gevent having problems on windows?

Owner

rgalanakis commented Jul 18, 2014

When I dug into it, I don't think the pypystackless and gevent-windows problems are related. I think this is genuinely a bug in gevent/libev on Windows, as I was able to repro it in a purely gevent environment (see my previous comment).

Regarding the links, I wish I had taken better notes. I can only find a few pages, mostly concerning gevent's switch from libevent to libev, and libev's inferior Windows support:

Specifically, there was a page I cannot now find that said something like "There should be fewer unknown errors on Windows". I think it was for gevent (a changelog?) but it could have been for libev as well.
I will open a ticket with the gevent repo.

@MichaelAz MichaelAz added a commit to MichaelAz/goless that referenced this issue Jul 19, 2014

@MichaelAz MichaelAz Fix issue #28 642bdcd
Contributor

MichaelAz commented Jul 19, 2014

As discussed in surfly/gevent#459, adding a call to import socket seems to solve the issue.

I'm creating a PR for this, even though the solution is extremely hacky.
I'd say calling WSAStartup with ctypes is less hacky but it'll just require us to re-implement the relevant part of socket and that's not DRY.
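For reference, the workaround amounts to something like the sketch below. The function name init_winsock is made up for illustration, and the actual change in the PR may look different; the only behavior relied on is that importing socket on Windows performs the Winsock initialization as a side effect.

```python
import sys


def init_winsock():
    """Import socket on Windows for its side effect; return True if done here."""
    if sys.platform != 'win32':
        # Nothing to do on other platforms.
        return False
    import socket  # noqa: F401  (imported only for its initialization side effect)
    return True
```

Calling it once at backend import time, before any gevent sleep or channel operation, would match the intent described above.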

@rgalanakis rgalanakis added a commit that referenced this issue Jul 19, 2014

@rgalanakis rgalanakis Merge pull request #30 from MichaelAz/system_error
Fix issue #28
a787883
Owner

rgalanakis commented Jul 19, 2014

Confirmed this fixed the issue on my Windows virtualbox.
I am flabbergasted by this issue. Hopefully gevent fixes the actual problem. Any solution on our side is 'hacky', so don't worry about importing socket not being optimal.

@rgalanakis rgalanakis closed this Jul 19, 2014
