Skip to content
This repository has been archived by the owner on Nov 23, 2017. It is now read-only.

Fix callbacks race in SelectorLoop.sock_connect. #366

Closed
wants to merge 3 commits into from

Conversation

1st1
Copy link
Member

@1st1 1st1 commented Jun 28, 2016

While testing uvloop on recent CPython 3.5.2 I found a regression in loop.sock_connect, introduced in ed17848.

The bug breaks loop.sock_* in a very serious way, making programs that use those methods prone to random hangs after socket is connected.

How to trigger

Let's imagine we have a server, that sends some data (let's say b'hello') to the client immediately after connect. And the client program is the following:

data = await self.recv_all(sock, 5)
assert data == b'hello'
await self.loop.sock_sendall(sock, PAYLOAD)

If the PAYLOAD is big enough, the client program will hang forever.

Explanation

The cause of the hang is a race between callbacks -- one related to loop.sock_connect and one to sock_sendall.

Here's the relevant piece of code from selector_events.py:

    def sock_connect(self, sock, address):
        """Connect to a remote socket at address.

        This method is a coroutine.
        """
        if self._debug and sock.gettimeout() != 0:
            raise ValueError("the socket must be non-blocking")

        fut = self.create_future()
        if hasattr(socket, 'AF_UNIX') and sock.family == socket.AF_UNIX:
            self._sock_connect(fut, sock, address)
        else:
            resolved = base_events._ensure_resolved(
                address, family=sock.family, proto=sock.proto, loop=self)
            resolved.add_done_callback(
                lambda resolved: self._on_resolved(fut, sock, resolved))

        return fut

    def _on_resolved(self, fut, sock, resolved):
        try:
            _, _, _, _, address = resolved.result()[0]
        except Exception as exc:
            fut.set_exception(exc)
        else:
            self._sock_connect(fut, sock, address)

    def _sock_connect(self, fut, sock, address):
        fd = sock.fileno()
        try:
            sock.connect(address)
        except (BlockingIOError, InterruptedError):
            # Issue #23618: When the C function connect() fails with EINTR, the
            # connection runs in background. We have to wait until the socket
            # becomes writable to be notified when the connection succeed or
            # fails.
            fut.add_done_callback(functools.partial(self._sock_connect_done,
                                                    fd))
            self.add_writer(fd, self._sock_connect_cb, fut, sock, address)
        except Exception as exc:
            fut.set_exception(exc)
        else:
            fut.set_result(None)

    def _sock_connect_done(self, fd, fut):
        self.remove_writer(fd)

    def _sock_connect_cb(self, fut, sock, address):
        if fut.cancelled():
            return

        try:
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if err != 0:
                # Jump to any except clause below.
                raise OSError(err, 'Connect call failed %s' % (address,))
        except (BlockingIOError, InterruptedError):
            # socket is still registered, the callback will be retried later
            pass
        except Exception as exc:
            fut.set_exception(exc)
        else:
            fut.set_result(None)

Before ed17848, sock_connect called _sock_connect directly:

  1. sock_connect created a fut Future.
  2. If the address wasn't already resolved it raised an error.
  3. If the address was resolved, it called _sock_connect, which attached a callback to the fut -- _sock_connect_done.
  4. sock_connect then returned fut to the caller.
  5. If the caller is a coroutine, it's wrapped in asyncio.Task. Therefore, fut now have two callbacks attached to it: [_sock_connect_done, Task._wakeup]

After that commit:

  1. sock_connect creates a fut Future.
  2. Then calls _ensure_resolved (linked fut to the result of that call's Future).
  3. sock_connect returns fut to the caller.
  4. If the caller is a coroutine, its Task will add a callback to the fut, eventually resulting in this: [Task._wakeup, _sock_connect_done]

Therefore, after ed17848, _sock_connect_done can be called after await loop.sock_connect() line. If the program calls loop.sock_sendall after sock_connect, _sock_connect_done will remove writer callback that sock_sendall set up.

/cc @gvanrossum @ajdavis @Haypo

@1st1
Copy link
Member Author

1st1 commented Jun 28, 2016

Another change: I had to make sock_connect and actual coroutine. Before this PR it returned a Future (although it was documented that the result of the call is coroutine).

@1st1
Copy link
Member Author

1st1 commented Jun 28, 2016

The more I'm trying to fix this thing, the more tests break in interesting ways. Maybe we should just partially revert ed17848.

@gvanrossum
Copy link
Member

gvanrossum commented Jun 28, 2016 via email

@1st1
Copy link
Member Author

1st1 commented Jun 28, 2016

Yes, I put Jesse in cc. I was the one who reviewed it though, so the responsibility is on me.

@gvanrossum
Copy link
Member

gvanrossum commented Jun 28, 2016 via email

@1st1
Copy link
Member Author

1st1 commented Jun 28, 2016

Alright, the tests are passing; please review.

@ajdavis
Copy link

ajdavis commented Jun 28, 2016

Thanks for continued guidance and effort on this, @1st1. I'll review it soon, if you like.

I keep thinking, though, maybe we should revert this whole idea: the idea of skipping getaddrinfo if we can detect that the address is already resolved? It seems to be a bug factory. First because getaddrinfo's AI_NUMERICHOST is harder to simulate in Python, in all platforms, than we thought. Second, because my attempt to fallback to getaddrinfo in ed17848 introduced an additional yield, which caused the current race condition.

Now that we've made getaddrinfo concurrent on Mac and BSD, getaddrinfo with AI_NUMERICHOST is no longer such a bottleneck. Would rolling back this whole line of changes be simpler and safer than continuing to whack bugs?

@1st1
Copy link
Member Author

1st1 commented Jun 28, 2016

Would rolling back this whole line of changes be simpler and safer than continuing to whack bugs?

Probably yes. On the other had, I like how sock_connect works now, accepting any kind of address. Please take a look at my patch.

@gvanrossum
Copy link
Member

gvanrossum commented Jun 28, 2016 via email

@ajdavis
Copy link

ajdavis commented Jul 3, 2016

This LGTM, I think. It's getting pretty complicated and my confidence is shaken that either of us can detect bugs by inspection. I propose we merge this--it's valuable both to fix the bug for now, and to add tests to guard against regression.

Then, I think (and Guido agrees) that we should revert this whole idea. So let's keep the new tests, as much as possible, but stop trying to simulate getaddrinfo's AI_NUMERICHOST.

I regret this idea was released in Python 3.5.2. I'm glad it's still "provisional". =)

@1st1
Copy link
Member Author

1st1 commented Jul 4, 2016

Then, I think (and Guido agrees) that we should revert this whole idea. So let's keep the new tests, as much as possible, but stop trying to simulate getaddrinfo's AI_NUMERICHOST.

I'm -1 on reverting anything TBH.

  1. Trying to parsing with pton in create_connection and create_server is harmless, at least with the current implementation (that ships with 3.5.2). That's considered to be a good practice actually.
  2. For sock_connect - parsing vs not parsing is not an issue, now we just call loop.getaddrinfo. It's actually a different question: do we want to sock_connect to require resolved addresses or we can make it to resolve them. I think that the latter is much more preferable.

To conclude - I think we should just leave things as is (after this PR is merged).

@1st1
Copy link
Member Author

1st1 commented Sep 12, 2016

This fix will go in b2.

@1st1
Copy link
Member Author

1st1 commented Sep 15, 2016

Merged in d6dcf25.

@1st1 1st1 closed this Sep 15, 2016
@vstinner
Copy link
Member

The test fails at least on one buildbot, "AMD64 FreeBSD CURRENT Non-Debug 3.x", please have a look:
http://bugs.python.org/issue28176#msg276736

@vstinner vstinner reopened this Sep 22, 2016
@berkerpeksag
Copy link
Member

Note that we've removed test_sock_connect_sock_write_race in http://bugs.python.org/issue28283, but it was added back in the last sync. Perhaps we should add a comment?

@1st1
Copy link
Member Author

1st1 commented Oct 5, 2016

The test has been removed from CPython and this repo. Closing this PR.

@1st1 1st1 closed this Oct 5, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants