-
-
Notifications
You must be signed in to change notification settings - Fork 177
Fix callbacks race in SelectorLoop.sock_connect. #366
Conversation
Another change: I had to make |
The more I'm trying to fix this thing, the more tests break in interesting ways. Maybe we should just partially revert ed17848. |
Call in the author of that commit?
|
Yes, I put Jesse in cc. I was the one who reviewed it though, so the responsibility is on me. |
Don't beat yourself up. Bugs happen. Life goes on.
|
Alright, the tests are passing; please review. |
Thanks for continued guidance and effort on this, @1st1. I'll review it soon, if you like. I keep thinking, though, maybe we should revert this whole idea: the idea of skipping getaddrinfo if we can detect that the address is already resolved? It seems to be a bug factory. First because getaddrinfo's AI_NUMERICHOST is harder to simulate in Python, in all platforms, than we thought. Second, because my attempt to fallback to getaddrinfo in ed17848 introduced an additional yield, which caused the current race condition. Now that we've made getaddrinfo concurrent on Mac and BSD, getaddrinfo with AI_NUMERICHOST is no longer such a bottleneck. Would rolling back this whole line of changes be simpler and safer than continuing to whack bugs? |
Probably yes. On the other had, I like how |
I like Jesse's idea. It's good to be thinking about the maintainability of
this code.
…--Guido (mobile)
|
This LGTM, I think. It's getting pretty complicated and my confidence is shaken that either of us can detect bugs by inspection. I propose we merge this--it's valuable both to fix the bug for now, and to add tests to guard against regression. Then, I think (and Guido agrees) that we should revert this whole idea. So let's keep the new tests, as much as possible, but stop trying to simulate getaddrinfo's AI_NUMERICHOST. I regret this idea was released in Python 3.5.2. I'm glad it's still "provisional". =) |
I'm -1 on reverting anything TBH.
To conclude - I think we should just leave things as is (after this PR is merged). |
This fix will go in b2. |
Merged in d6dcf25. |
The test fails at least on one buildbot, "AMD64 FreeBSD CURRENT Non-Debug 3.x", please have a look: |
Note that we've removed |
The test has been removed from CPython and this repo. Closing this PR. |
While testing uvloop on recent CPython 3.5.2 I found a regression in
loop.sock_connect
, introduced in ed17848.The bug breaks
loop.sock_*
in a very serious way, making programs that use those methods prone to random hangs after socket is connected.How to trigger
Let's imagine we have a server, that sends some data (let's say
b'hello'
) to the client immediately after connect. And the client program is the following:If the
PAYLOAD
is big enough, the client program will hang forever.Explanation
The cause of the hang is a race between callbacks -- one related to
loop.sock_connect
and one tosock_sendall
.Here's the relevant piece of code from
selector_events.py
:Before ed17848,
sock_connect
called_sock_connect
directly:sock_connect
created afut
Future._sock_connect
, which attached a callback to thefut
--_sock_connect_done
.sock_connect
then returnedfut
to the caller.asyncio.Task
. Therefore,fut
now have two callbacks attached to it:[_sock_connect_done, Task._wakeup]
After that commit:
sock_connect
creates afut
Future._ensure_resolved
(linkedfut
to the result of that call's Future).sock_connect
returnsfut
to the caller.Task
will add a callback to thefut
, eventually resulting in this:[Task._wakeup, _sock_connect_done]
Therefore, after ed17848,
_sock_connect_done
can be called afterawait loop.sock_connect()
line. If the program callsloop.sock_sendall
aftersock_connect
,_sock_connect_done
will remove writer callback thatsock_sendall
set up./cc @gvanrossum @ajdavis @Haypo