
try piggy-backing on tornado for proactor loop support #1524

Merged
merged 4 commits into zeromq:main from tornado-asyncio on May 13, 2021

Conversation

@minrk (Member) commented May 9, 2021

Tornado 6.1 added support for the proactor event loop by running a separate selector loop in a thread.

This PR tries piggy-backing on that functionality by using tornado's AddThreadSelectorEventLoop when someone attempts to use zmq.asyncio with the proactor loop.

I went with vendoring SelectorThread from tornadoweb/tornado#3029, so no new dependency is added.

closes #1521
closes #1423
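
A rough usage sketch (not part of the changeset itself) of what this enables: zmq.asyncio on the default Windows proactor loop, with the selector thread registered automatically behind the scenes.

    import asyncio

    import zmq
    import zmq.asyncio

    async def main():
        ctx = zmq.asyncio.Context()
        pull = ctx.socket(zmq.PULL)
        push = ctx.socket(zmq.PUSH)
        pull.bind("tcp://127.0.0.1:5555")
        push.connect("tcp://127.0.0.1:5555")
        await push.send(b"hello")
        # recv works even though the proactor loop lacks add_reader,
        # because a selector thread handles the readiness callbacks
        print(await pull.recv())
        push.close()
        pull.close()
        ctx.term()

    # On Windows, Python 3.8+ defaults to the proactor event loop
    asyncio.run(main())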

minrk added 2 commits May 10, 2021 13:02
use vendored copy of tornado's AddThread as a separate SelectorThread object
try to avoid leaking loop closers
minrk merged commit 3faf9e4 into zeromq:main on May 13, 2021
minrk deleted the tornado-asyncio branch on May 13, 2021 at 12:45
@Jeducious

@minrk Quick question: when using asyncio on Windows with the proactor event loop, I get a warning even though I have tornado 6.1 installed. Is this expected?

RuntimeWarning: Proactor event loop does not implement add_reader family of methods required for zmq. Registering an additional selector thread for add_reader support via tornado. Use 'asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())' to avoid this warning

I am having a persistent issue on windows where some tasks seem to suddenly stop receiving messages. I am implementing the MDP protocol in python, and I have automated tests that create multiple workers as asyncio tasks to simulate a busy server.

As the first few workers complete, the rest of the workers suddenly stop reporting heartbeats and the test hangs forever.

I imagine this is something I have done, painting myself into a corner somehow. But it would be great to know if any of this sounds suspect ;)

@minrk (Member, Author) commented Aug 29, 2021

You can try calling asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy()) before invoking any asyncio methods to see if that helps. If it does, that would indicate this change is relevant.
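
For example (a minimal sketch, assuming Windows and Python 3.8+), run this before any event loop is created:

    import asyncio
    import sys

    # Must run before the first event loop is created (e.g. before asyncio.run()
    # or creating a zmq.asyncio.Context), or the proactor loop may already exist.
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())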

@Jeducious

Thanks! I did try that; the warning goes away, at least. I'm still having issues where the test progresses for a while but eventually hangs. I can't say for sure that it's due to this, though, since I'm also having problems on Linux, so for the moment I can't really prove that the hang is due to the event loop policy.

If that changes I'll report back.

@minrk (Member, Author) commented Aug 30, 2021

If it still hangs after changing the policy, then I think it's probably not that but something else, possibly related to edge-triggering issues. These things can be hard to track down!

@Jeducious

Indeed! I am digging, but the problem is difficult to reproduce reliably. There are other things in here besides pyzmq, for example the Python logging module. I am currently removing all logging to check that it is not a factor.

So I'm proceeding to eliminate things by removing them where I can. Will let you know if anything points back at ZMQ.

@Jeducious

@minrk

OK, I have a question: I am seeing an error on Linux now which suggests I am exhausting the file descriptor quota. I had a look at the offending process, and it is indeed accumulating fds, but I wondered if you could tell me whether this looks like something the asyncio pyzmq sockets might use. The majority of the fds in use are of type eventfd.

The man page on eventfd is here.

It basically says these are used as an event wait/notify mechanism by user-space applications, so I am guessing this is either:

  1. Asyncio tasks doing this
  2. Pyzmq sockets... maybe?
  3. Something else entirely that I am missing (a catch-all, had to throw that in to cover my ignorance).

It seems like they are not being released, but I can't confirm. The process hung, so they might have been released if it had closed cleanly :)

python3 33202 ubuntu   60u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   61u  a_inode               0,14        0  10299 [eventpoll]
python3 33202 ubuntu   62u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   63u  a_inode               0,14        0  10299 [eventpoll]
python3 33202 ubuntu   64u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   65u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   66u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   67u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   68u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   69u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   70u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   71u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   72u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   73u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   74u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   75u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   76u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   77u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   78u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   79u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   80u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   81u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   82u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   83u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   84u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   85u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   86u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   87u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   88u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   89u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   90u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   91u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   92u  a_inode               0,14        0  10299 [eventfd]
python3 33202 ubuntu   93u  a_inode               0,14        0  10299 [eventfd]
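
For reference, a rough sketch (just an illustration, not something from my test code) of counting eventfd descriptors from inside the process on Linux, to watch whether they keep accumulating:

    import os

    def count_eventfds() -> int:
        """Count open eventfd descriptors for the current process (Linux only)."""
        count = 0
        for fd in os.listdir("/proc/self/fd"):
            try:
                # eventfd descriptors resolve to "anon_inode:[eventfd]"
                target = os.readlink(f"/proc/self/fd/{fd}")
            except OSError:
                continue  # fd closed between listdir and readlink
            if "eventfd" in target:
                count += 1
        return count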

@minrk (Member, Author) commented Aug 31, 2021

Certainly possible, but I can't be sure. I don't know exactly what operations create these.

You might check asyncio.all_tasks() to see all the asyncio tasks you have running.

It's conceivable you have launched some task/future and lost track of it without awaiting or cancelling it. This could be due to your code, or even a pyzmq bug.
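
For example, a quick sketch (assuming Python 3.8+) that dumps whatever the running loop still knows about, from inside a coroutine:

    import asyncio

    async def dump_tasks():
        # Anything unexpected here may be a task that was launched
        # and never awaited or cancelled.
        for task in asyncio.all_tasks():
            state = "done" if task.done() else "pending"
            print(task.get_name(), state, task.get_coro())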

@Jeducious

OK, I think I am getting close (though honestly, concurrent programming can certainly prove me wrong, it seems).

I have several workers, each of which runs as an asyncio task. Each worker has a zmq.DEALER socket, plus I create a monitor socket for each dealer using get_monitor_socket.

During shutdown I call cancel on each worker task; this triggers a shutdown handler which calls the disable_monitor() method on the dealer socket. This is where the loop hangs.

It seems a little bit random: sometimes a few workers are all able to shut down cleanly, but then one hangs the loop on the call to disable_monitor.

I get the feeling that I may have abused disable_monitor, or sockets, or both here.

Is there a right way to clean up a socket and its monitor socket? I am willing to bet that when multiple sockets with monitors attached are involved, I am probably not doing it right.

@Jeducious

Jeducious commented Aug 31, 2021

OK, so, no need to wait: I decided to simply comment out the line that called disable_monitor and "give it a ripper of a go", so to speak.

Now the loop no longer hangs; in fact, the entire test suite seems to be passing consistently. So it seems calling disable_monitor was the wrong thing to do? I just don't know why.

Should I:

  1. Just leave disable_monitor commented out and live on in blissful ignorance now that it apparently works?
  2. Call close on the monitor socket instead of disable?
  3. Something else?

@minrk (Member, Author) commented Aug 31, 2021

If disable_monitor causes a hang, this suggests to me that there is a LINGER or ordering issue - that perhaps there are some messages not yet consumed by the monitor socket receiver, and the sender is blocking waiting for messages to be delivered.

That's a bit of a guess, though.

From this discussion, you need to call disable before close on the monitor socket (disable closes the socket that bound, which is handled internally by libzmq, while you need to manage closing the socket that connects to listen for monitor messages).

@Jeducious

Thanks :)

I eventually got the test suite to pass on macOS and Windows using the following:

        self.zmq_socket.disable_monitor()   # stop monitoring; closes the internal socket that libzmq bound
        self.mon_sock.close(linger=0)       # close the listener socket returned by get_monitor_socket
        self.zmq_socket.close(linger=0)     # finally close the DEALER socket itself

This matches the discussion you just referred to, which I noticed I have actually been part of. Seems not paying attention to it came back to bite me!

Confirming this now works fine on Windows, macOS, and Linux.

I still have a runaway condition with too many open fds on Linux, but that's a story for another day, I think.
