test_asyncio: test_start_tls_server_1() fails on Python on x86 Windows7 3.7 and 3.x #77875

vstinner · 2018-05-30T11:19:38Z

BPO	33694
Nosy	@vstinner, @ned-deily, @miss-islington
PRs	WIP: bpo-33694, asyncio: fix proactor transport pause/resume #7479 bpo-33694: asyncio: Fix race in Proactor's Transport.set_protocol() #7486 bpo-33694: Fix race condition on proactor recv() #7498 [3.7] bpo-33694: Fix race condition in asyncio proactor (GH-7498) #7499 bpo-33694: Fix typo in asyncio helper function name #7522 [3.7] bpo-33694: Fix typo in helper function name (GH-7522) #7523 bpo-33694: Remove test_asyncio ProactorDatagramTransportTests #16288 [3.8] bpo-33694: Remove test_asyncio ProactorDatagramTransportTests (GH-16288) #16305
Files	Screen Shot 2018-05-30 at 11.38.38 AM.png race.py

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2018-06-09.16:44:39.217>
created_at = <Date 2018-05-30.11:19:38.205>
labels = ['3.8', '3.7', 'tests', 'expert-asyncio']
title = 'test_asyncio: test_start_tls_server_1() fails on Python on x86 Windows7 3.7 and 3.x'
updated_at = <Date 2019-09-20.21:27:31.745>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2019-09-20.21:27:31.745>
actor = 'miss-islington'
assignee = 'none'
closed = True
closed_date = <Date 2018-06-09.16:44:39.217>
closer = 'vstinner'
components = ['Tests', 'asyncio']
creation = <Date 2018-05-30.11:19:38.205>
creator = 'vstinner'
dependencies = []
files = ['47626', '47632']
hgrepos = []
issue_num = 33694
keywords = ['patch']
message_count = 32.0
messages = ['318164', '318165', '318193', '318194', '318195', '318209', '318424', '318569', '318705', '318707', '318809', '318869', '318872', '318873', '318874', '318875', '318876', '318915', '318921', '318977', '318979', '318993', '319034', '319035', '319155', '319175', '319209', '319220', '319299', '319450', '352891', '352894']
nosy_count = 3.0
nosy_names = ['vstinner', 'ned.deily', 'miss-islington']
pr_nums = ['7479', '7486', '7498', '7499', '7522', '7523', '16288', '16305']
priority = None
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue33694'
versions = ['Python 3.7', 'Python 3.8']

vstinner · 2018-05-30T11:19:38Z

test_asyncio.test_start_tls_server_1() got many fixes recently: see bpo-32458 and bpo-33674... but it still fails on Python on x86 Windows7 3.x at revision bb9474f which contains all these fixes.

The test fails even when test_asyncio is re-run alone (not when other tests run in parallel).

http://buildbot.python.org/all/#/builders/58/builds/930

======================================================================
ERROR: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_asyncio\test_sslproto.py", line 467, in test_start_tls_server_1
    self.loop.run_until_complete(run_main())
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\asyncio\base_events.py", line 566, in run_until_complete
    raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed.

======================================================================
FAIL: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_asyncio\functional.py", line 42, in tearDown
    self.fail('unexpected calls to loop.call_exception_handler()')
AssertionError: unexpected calls to loop.call_exception_handler()

The test fails also on x86 Windows7 3.7:

======================================================================
ERROR: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_asyncio\test_sslproto.py", line 467, in test_start_tls_server_1
    self.loop.run_until_complete(run_main())
  File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\asyncio\base_events.py", line 566, in run_until_complete
    raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed.

======================================================================
FAIL: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_asyncio\functional.py", line 42, in tearDown
    self.fail('unexpected calls to loop.call_exception_handler()')
AssertionError: unexpected calls to loop.call_exception_handler()

Moreover, 3.7 got an additional failure:

======================================================================
ERROR: test_pipe_handle (test.test_asyncio.test_windows_utils.PipeTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_asyncio\test_windows_utils.py", line 73, in test_pipe_handle
    raise RuntimeError('expected ERROR_INVALID_HANDLE')
RuntimeError: expected ERROR_INVALID_HANDLE

The Windows7 buildbot is known to be slow. For example, test_value() of test_multiprocessing_spawn failed with a timeout (15 min) on x86 Windows7 3.x:

0:48:12 [276/416/1] test_multiprocessing_spawn crashed (Exit code 1)
Timeout (0:15:00)!
Thread 0x000008e4 (most recent call first):
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\multiprocessing\popen_spawn_win32.py", line 80 in wait
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\multiprocessing\process.py", line 140 in join
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\_test_multiprocessing.py", line 1925 in test_value

Maybe something is wrong with my karma, I don't know. Or test_asyncio just hate me.

vstinner · 2018-05-30T11:20:51Z

The test fails also on x86 Windows7 3.7:
http://buildbot.python.org/all/#/builders/111/builds/297

1st1 · 2018-05-30T15:43:27Z

I can't reproduce test_start_tls_server_1 fails when I do (screenshot attached)

run test_asyncio
run test_asyncio.test_sslproto
run test_asyncio.test_sslproto -m test_start_tls_server_1

I run them in Windows 7 VM on Mac OS.

All other tests pass fine for me except test_sendfile_close_peer_in_the_middle_of_receiving. Andrew, please look at this, this is your test, and I'm not sure I follow why we use SNDBUF at all. IMO it should be possible to write a test without using that (or if not, maybe time to use mocks).

With regards to start_tls tests -- I give up, I don't have time to look into this before 3.7.0. I suggest to mask the test on the specific win buildbot it crashes on and fix it before 3.7.1 (or maybe when Windows users come up with a reliable way to reproduce the bug).

asvetlov · 2018-05-30T15:49:36Z

I used SNDBUF to enforce send buffer overloading.
It is not required by sendfile tests but I thought that better to have non-mocked way to test such situations.
We can remove the socket buffers size manipulation at all without any problem.

1st1 · 2018-05-30T15:50:28Z

We can remove the socket buffers size manipulation at all without any problem.

When I tried to do that I think I was having more failures with that test. But really up to you.

vstinner · 2018-05-30T21:38:12Z

I used SNDBUF to enforce send buffer overloading.

What is that? Would you mind to elaborate? I'm curious :-)

vstinner · 2018-06-01T14:55:30Z

The test also failed on AMD64 Windows8.1 Refleaks 3.7. This buildbot is also very slow because it runs reference leak hunting. It confirms that the remaining issue is a race condition which is only seen when the system becomes slow.

http://buildbot.python.org/all/#/builders/132/builds/154

======================================================================
ERROR: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "D:\buildarea\3.7.ware-win81-release.refleak\build\lib\test\test_asyncio\test_sslproto.py", line 467, in test_start_tls_server_1
    self.loop.run_until_complete(run_main())
  File "D:\buildarea\3.7.ware-win81-release.refleak\build\lib\asyncio\base_events.py", line 566, in run_until_complete
    raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed.

======================================================================
FAIL: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "D:\buildarea\3.7.ware-win81-release.refleak\build\lib\test\test_asyncio\functional.py", line 42, in tearDown
    self.fail('unexpected calls to loop.call_exception_handler()')
AssertionError: unexpected calls to loop.call_exception_handler()

pablogsal · 2018-06-03T16:18:03Z

The test fails also on x86 Windows7 3.x:
http://buildbot.python.org/all/#/builders/58/builds/947

vstinner · 2018-06-04T21:07:32Z

A variant of the bug?

AMD64 Windows8.1 Non-Debug 3.x:
http://buildbot.python.org/all/#/builders/12/builds/948

======================================================================
ERROR: test_create_connection_ssl_failed_certificate (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "D:\buildarea\3.x.ware-win81-release\build\lib\test\test_asyncio\test_sslproto.py", line 623, in test_create_connection_ssl_failed_certificate
    self.loop.run_until_complete(client(srv.addr))
  File "D:\buildarea\3.x.ware-win81-release\build\lib\asyncio\base_events.py", line 566, in run_until_complete
    raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed.

vstinner · 2018-06-04T21:36:24Z

AMD64 Windows8.1 Non-Debug 3.x
ERROR: test_create_connection_ssl_failed_certificate (test.test_asyncio.test_sslproto.ProactorStartTLSTests)

Not surprising: same error on AMD64 Windows8.1 Non-Debug 3.7:
http://buildbot.python.org/all/#builders/133/builds/304

vstinner · 2018-06-06T10:30:09Z

Another variant seen on AppVeyor (PR 7439, master branch):

https://ci.appveyor.com/project/python/cpython/build/3.8build16980

======================================================================
ERROR: test_create_connection_ssl_failed_certificate (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "C:\projects\cpython\lib\test\test_asyncio\test_sslproto.py", line 623, in test_create_connection_ssl_failed_certificate
    self.loop.run_until_complete(client(srv.addr))
  File "C:\projects\cpython\lib\asyncio\base_events.py", line 566, in run_until_complete
    raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed.

vstinner · 2018-06-07T00:58:25Z

The bug can be reproduced on Linux using this patch:

diff --git a/Lib/test/test_asyncio/test_sslproto.py b/Lib/test/test_asyncio/test_sslproto.py
index 78ab1eb822..735313152c 100644
--- a/Lib/test/test_asyncio/test_sslproto.py
+++ b/Lib/test/test_asyncio/test_sslproto.py
@@ -471,6 +471,8 @@ class BaseStartTLS(func_tests.FunctionalTestCaseMixin):
 
             self.assertEqual(proto.data, b'')
 
+            await asyncio.sleep(1)
+
             new_tr = await self.loop.start_tls(
                 tr, proto, server_context,
                 server_side=True,

vstinner · 2018-06-07T01:08:32Z

Call stack when the asyncio client connects to the server when the bug occurs:

SSLProtocol.connection_made()
-> SSLProtocol._start_handshake()
-> SSLProtocol._process_write_backlog()
-> SSLPipe.do_handshake()
-> SSLPipe.feed_ssldata(b'', only_handshake=True)
-> SSLObject.do_handshake()
-> C SSL_do_handshake() returns immediately because the socket is non-blocking

It seems like SSLObject.do_handshake() is only attempted once and... then nothing. The client is supposed to send data, retry the handshake, or something, but it does *nothing*. So the handshake never completes and the test hangs to later fail with a timeout.

vstinner · 2018-06-07T01:26:08Z

The issue occurs when the server calls start_tls() because the client calls start_tls(). In that case, ServerProto.data_received() is called with the server TLS handshake and so the client SSLProtocol never this this handshake from server...

vstinner · 2018-06-07T01:44:49Z

Sorry, the patch to reproduce the issue on Linux is wrong: it introduces a bug in the test. Ignore this comment.

--

It seems like I found the root cause: pause_reading() / resume_reading() on the transport is not safe. Sometimes, we loose data. See the "TODO" below...

class _ProactorReadPipeTransport(_ProactorBasePipeTransport,
                                 transports.ReadTransport):
    """Transport for read pipes."""
    (...)
    def pause_reading(self):
        if self._closing or self._paused:
            return
        self._paused = True

        if self._read_fut is not None and not self._read_fut.done():
            # TODO: This is an ugly hack to cancel the current read future
            # *and* avoid potential race conditions, as read cancellation
            # goes through `future.cancel()` and `loop.call_soon()`.
            # We then use this special attribute in the reader callback to
            # exit *immediately* without doing any cleanup/rescheduling.
            self._read_fut.__asyncio_cancelled_on_pause__ = True

            self._read_fut.cancel()
            self._read_fut = None
            self._reschedule_on_resume = True

        if self._loop.get_debug():
            logger.debug("%r pauses reading", self)

If you remove the "ugly hack", the test no longer hangs...

vstinner · 2018-06-07T02:26:38Z

_ProactorReadPipeTransport.set_transport():

        if self.is_reading():
            # reset reading callback / buffers / self._read_fut
            self.pause_reading()
            self.resume_reading()

This method cancelled the pending overlapped WSARecv(), and then create a new overlapped WSARecv().

Even after CancelIoEx(old overlapped), the IOCP loop still gets an event for the completion of the recv. Problem: since the Python future is cancelled, the even is ignored and so 176 bytes are lost.

I'm surprised that an overlapped WSARecv() cancelled by CancelIoEx() still returns data when IOCP polls for events.

vstinner · 2018-06-07T02:52:02Z

Something else. The bug occurs when CancelIoEx() (on the current overlapped WSARecv()) fails internally with ERROR_NOT_FOUND. According to overlapped.c, it means:

/* CancelIoEx returns ERROR_NOT_FOUND if the I/O completed in-between */

HasOverlappedIoCompleted() returns 0 in that case.

The problem is that currently, Overlapped.cancel() also returns None in that case, and later the asyncio IOCP loop ignores the completion event and so drops incoming received data.

vstinner · 2018-06-07T08:37:16Z

Yury, Andrew, Ned: I set the priority to release blocker because I'm scared by what I saw. The START TLS has a race condition in its ProactorEventLoop implementation. But the bug doesn't see to be specific to START TLS, but rather to transport.set_transport(), and even more generally to transport.pause_reading() / transport.resume_reading(). The bug is quite severe: we loose data and it's really hard to know why (I spent a few hours to add many many print and try to reproduce on a very tiny reliable unit test). As an asyncio user, I expect that transports are 100% reliable, and I would first look into my look (like looking into start_tls() implementation in my case).

If the bug was very specific to start_tls(), I would suggest to "just" "disable" start_tls() on ProactorEventLoop (sorry, Windows!). But since the data loss seems to concern basically any application using ProactorEventLoop, I don't see any simple workaround.

My hope is that a fix can be written shortly to not block the 3.7.0 final release for too long :-(

Yury, Andrew: Can you please just confirm that it's a regression and that a release blocker is justified?

vstinner · 2018-06-07T10:21:06Z

race.py: simple echo client and server sending packets in both directions. Pause/resume reading the client transport every 100 ms to trigger the bug.

Using ProactorEventLoop and 2000 packets of 16 KiB, I easily reproduce the bug.

So again, it's nothing related to start_tls(), start_tls() was just one way to spot the bug.

The bug is in Proactor transport: the cancellation of overlapped WSARecv() sometime drops packets. The bug occurs when CancelIoEx() fails with ERROR_NOT_FOUND which means that the I/O (WSARecv()) completed.

One solution would be to not cancel WSARecv() on pause_reading(): wait until the current WSARecv() completes, store data something but don't pass it to protocol.data_received()!, and no schedule a new WSARecv(). Once reading is resumed: call protocol.data_received() and schedule a new WSARecv().

That would be a workaround. I don't know how to really fix WSARecv() cancellation without loosing data. A good start would be to modify Overlapped.cancel() to return a boolean to notice if the overlapped I/O completed even if we just cancelled it. Currently, the corner case (CancelIoEx() fails with ERROR_NOT_FOUND) is silently ignored, and then the IOCP loop silently ignores the event of completed I/O...

vstinner · 2018-06-07T22:25:55Z

New changeset 79790bc by Victor Stinner in branch 'master':
bpo-33694: Fix race condition in asyncio proactor (GH-7498)
79790bc

vstinner · 2018-06-07T22:49:37Z

New changeset 8fa398d by Victor Stinner (Miss Islington (bot)) in branch '3.7':
bpo-33694: Fix race condition in asyncio proactor (GH-7498) (GH-7499)
8fa398d

vstinner · 2018-06-08T00:13:26Z

I fixed the root issue, a race condition in ProactorEventLoop. But I prefer to keep the issue open a few days to see if the bug is really gone from all CIs.

vstinner · 2018-06-08T08:32:13Z

New changeset ff6c077 by Victor Stinner in branch 'master':
bpo-33694: Fix typo in helper function name (GH-7522)
ff6c077

miss-islington · 2018-06-08T08:48:55Z

New changeset 17beebc by Miss Islington (bot) in branch '3.7':
bpo-33694: Fix typo in helper function name (GH-7522)
17beebc

vstinner · 2018-06-09T16:44:39Z

I ran attached race.py on the 3.6 branch:

2000 packets of 16 KiB: ok
50 packets of 16 MiB: ok

It seems like Python 3.6 (at least the 3.6 development branch) doesn't have this bug.

It didn't see the bug recently, so I close the issue.

Python 3.6 has a different issue: pause_reading() cannot be called twice, same issue with resume_reading(). But I'm not sure that it can be called a bug. Python 3.7 allows to call these methods twice and to call these methods while closing: they just do nothing in that case.

I close the issue.

1st1 · 2018-06-09T19:51:36Z

pause_reading and resume_reading became idempotent only in 3.7

vstinner · 2018-06-10T11:51:01Z

pause_reading and resume_reading became idempotent only in 3.7

Do you consider that it's a new feature or a bugfix? Should we backport this change to 3.6? Currently, race.py fails with an error because of the 3.6 behaviour. (I modified asyncio to be able to run race.py on 3.6.)

1st1 · 2018-06-10T13:18:49Z

Since it a minor change we can reconsider it as s bug fix. Feel free to make a pr.

vstinner · 2018-06-11T11:55:56Z

"Since it a minor change we can reconsider it as s bug fix. Feel free
to make a pr."

Ok, I created #7629 for bpo-32356

vstinner · 2018-06-13T10:25:50Z

New changeset 142e3c0 by Victor Stinner in branch '3.6':
[3.6] bpo-32356: idempotent pause_/resume_reading (GH-4914) (GH-7629)
142e3c0

vstinner · 2019-09-20T21:08:02Z

New changeset bc2256e by Victor Stinner in branch 'master':
bpo-33694: Remove test_asyncio ProactorDatagramTransportTests (GH-16288)
bc2256e

miss-islington · 2019-09-20T21:27:32Z

New changeset b50edac by Miss Islington (bot) in branch '3.8':
bpo-33694: Remove test_asyncio ProactorDatagramTransportTests (GH-16288)
b50edac

vstinner added 3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir topic-asyncio labels May 30, 2018

vstinner added the release-blocker label Jun 7, 2018

vstinner closed this as completed Jun 9, 2018

vstinner removed the release-blocker label Jun 9, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_asyncio: test_start_tls_server_1() fails on Python on x86 Windows7 3.7 and 3.x #77875

test_asyncio: test_start_tls_server_1() fails on Python on x86 Windows7 3.7 and 3.x #77875

vstinner commented May 30, 2018

vstinner commented May 30, 2018

vstinner commented May 30, 2018

1st1 commented May 30, 2018

asvetlov commented May 30, 2018

1st1 commented May 30, 2018

vstinner commented May 30, 2018

vstinner commented Jun 1, 2018

pablogsal commented Jun 3, 2018

vstinner commented Jun 4, 2018

vstinner commented Jun 4, 2018

vstinner commented Jun 6, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 8, 2018

vstinner commented Jun 8, 2018

miss-islington commented Jun 8, 2018

vstinner commented Jun 9, 2018

1st1 commented Jun 9, 2018

vstinner commented Jun 10, 2018

1st1 commented Jun 10, 2018

vstinner commented Jun 11, 2018

vstinner commented Jun 13, 2018

vstinner commented Sep 20, 2019

miss-islington commented Sep 20, 2019

test_asyncio: test_start_tls_server_1() fails on Python on x86 Windows7 3.7 and 3.x #77875

test_asyncio: test_start_tls_server_1() fails on Python on x86 Windows7 3.7 and 3.x #77875

Comments

vstinner commented May 30, 2018

vstinner commented May 30, 2018

vstinner commented May 30, 2018

1st1 commented May 30, 2018

asvetlov commented May 30, 2018

1st1 commented May 30, 2018

vstinner commented May 30, 2018

vstinner commented Jun 1, 2018

pablogsal commented Jun 3, 2018

vstinner commented Jun 4, 2018

vstinner commented Jun 4, 2018

vstinner commented Jun 6, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 7, 2018

vstinner commented Jun 8, 2018

vstinner commented Jun 8, 2018

miss-islington commented Jun 8, 2018

vstinner commented Jun 9, 2018

1st1 commented Jun 9, 2018

vstinner commented Jun 10, 2018

1st1 commented Jun 10, 2018

vstinner commented Jun 11, 2018

vstinner commented Jun 13, 2018

vstinner commented Sep 20, 2019

miss-islington commented Sep 20, 2019