-
-
Notifications
You must be signed in to change notification settings - Fork 30.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test_asyncio: test_start_tls_server_1() fails on Python on x86 Windows7 3.7 and 3.x #77875
Comments
test_asyncio.test_start_tls_server_1() got many fixes recently: see bpo-32458 and bpo-33674... but it still fails on Python on x86 Windows7 3.x at revision bb9474f which contains all these fixes. The test fails even when test_asyncio is re-run alone (not when other tests run in parallel). http://buildbot.python.org/all/#/builders/58/builds/930 ====================================================================== Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_asyncio\test_sslproto.py", line 467, in test_start_tls_server_1
self.loop.run_until_complete(run_main())
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\asyncio\base_events.py", line 566, in run_until_complete
raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed. ====================================================================== Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_asyncio\functional.py", line 42, in tearDown
self.fail('unexpected calls to loop.call_exception_handler()')
AssertionError: unexpected calls to loop.call_exception_handler() The test fails also on x86 Windows7 3.7: ====================================================================== Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_asyncio\test_sslproto.py", line 467, in test_start_tls_server_1
self.loop.run_until_complete(run_main())
File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\asyncio\base_events.py", line 566, in run_until_complete
raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed. ====================================================================== Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_asyncio\functional.py", line 42, in tearDown
self.fail('unexpected calls to loop.call_exception_handler()')
AssertionError: unexpected calls to loop.call_exception_handler() Moreover, 3.7 got an additional failure: ====================================================================== Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_asyncio\test_windows_utils.py", line 73, in test_pipe_handle
raise RuntimeError('expected ERROR_INVALID_HANDLE')
RuntimeError: expected ERROR_INVALID_HANDLE The Windows7 buildbot is known to be slow. For example, test_value() of test_multiprocessing_spawn failed with a timeout (15 min) on x86 Windows7 3.x: 0:48:12 [276/416/1] test_multiprocessing_spawn crashed (Exit code 1) Maybe something is wrong with my karma, I don't know. Or test_asyncio just hate me. |
The test fails also on x86 Windows7 3.7: |
I can't reproduce test_start_tls_server_1 fails when I do (screenshot attached)
I run them in Windows 7 VM on Mac OS. All other tests pass fine for me except test_sendfile_close_peer_in_the_middle_of_receiving. Andrew, please look at this, this is your test, and I'm not sure I follow why we use SNDBUF at all. IMO it should be possible to write a test without using that (or if not, maybe time to use mocks). With regards to start_tls tests -- I give up, I don't have time to look into this before 3.7.0. I suggest to mask the test on the specific win buildbot it crashes on and fix it before 3.7.1 (or maybe when Windows users come up with a reliable way to reproduce the bug). |
I used SNDBUF to enforce send buffer overloading. |
When I tried to do that I think I was having more failures with that test. But really up to you. |
What is that? Would you mind to elaborate? I'm curious :-) |
The test also failed on AMD64 Windows8.1 Refleaks 3.7. This buildbot is also very slow because it runs reference leak hunting. It confirms that the remaining issue is a race condition which is only seen when the system becomes slow. http://buildbot.python.org/all/#/builders/132/builds/154 ====================================================================== Traceback (most recent call last):
File "D:\buildarea\3.7.ware-win81-release.refleak\build\lib\test\test_asyncio\test_sslproto.py", line 467, in test_start_tls_server_1
self.loop.run_until_complete(run_main())
File "D:\buildarea\3.7.ware-win81-release.refleak\build\lib\asyncio\base_events.py", line 566, in run_until_complete
raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed. ====================================================================== Traceback (most recent call last):
File "D:\buildarea\3.7.ware-win81-release.refleak\build\lib\test\test_asyncio\functional.py", line 42, in tearDown
self.fail('unexpected calls to loop.call_exception_handler()')
AssertionError: unexpected calls to loop.call_exception_handler() |
The test fails also on x86 Windows7 3.x: |
A variant of the bug? AMD64 Windows8.1 Non-Debug 3.x: ====================================================================== Traceback (most recent call last):
File "D:\buildarea\3.x.ware-win81-release\build\lib\test\test_asyncio\test_sslproto.py", line 623, in test_create_connection_ssl_failed_certificate
self.loop.run_until_complete(client(srv.addr))
File "D:\buildarea\3.x.ware-win81-release\build\lib\asyncio\base_events.py", line 566, in run_until_complete
raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed. |
Not surprising: same error on AMD64 Windows8.1 Non-Debug 3.7: |
Another variant seen on AppVeyor (PR 7439, master branch): https://ci.appveyor.com/project/python/cpython/build/3.8build16980 ====================================================================== Traceback (most recent call last):
File "C:\projects\cpython\lib\test\test_asyncio\test_sslproto.py", line 623, in test_create_connection_ssl_failed_certificate
self.loop.run_until_complete(client(srv.addr))
File "C:\projects\cpython\lib\asyncio\base_events.py", line 566, in run_until_complete
raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed. |
The bug can be reproduced on Linux using this patch: diff --git a/Lib/test/test_asyncio/test_sslproto.py b/Lib/test/test_asyncio/test_sslproto.py
index 78ab1eb822..735313152c 100644
--- a/Lib/test/test_asyncio/test_sslproto.py
+++ b/Lib/test/test_asyncio/test_sslproto.py
@@ -471,6 +471,8 @@ class BaseStartTLS(func_tests.FunctionalTestCaseMixin):
self.assertEqual(proto.data, b'')
+ await asyncio.sleep(1)
+
new_tr = await self.loop.start_tls(
tr, proto, server_context,
server_side=True, |
Call stack when the asyncio client connects to the server when the bug occurs: SSLProtocol.connection_made()
-> SSLProtocol._start_handshake()
-> SSLProtocol._process_write_backlog()
-> SSLPipe.do_handshake()
-> SSLPipe.feed_ssldata(b'', only_handshake=True)
-> SSLObject.do_handshake()
-> C SSL_do_handshake() returns immediately because the socket is non-blocking It seems like SSLObject.do_handshake() is only attempted once and... then nothing. The client is supposed to send data, retry the handshake, or something, but it does *nothing*. So the handshake never completes and the test hangs to later fail with a timeout. |
The issue occurs when the server calls start_tls() because the client calls start_tls(). In that case, ServerProto.data_received() is called with the server TLS handshake and so the client SSLProtocol never this this handshake from server... |
Sorry, the patch to reproduce the issue on Linux is wrong: it introduces a bug in the test. Ignore this comment. -- It seems like I found the root cause: pause_reading() / resume_reading() on the transport is not safe. Sometimes, we loose data. See the "TODO" below... class _ProactorReadPipeTransport(_ProactorBasePipeTransport,
transports.ReadTransport):
"""Transport for read pipes."""
(...)
def pause_reading(self):
if self._closing or self._paused:
return
self._paused = True
if self._read_fut is not None and not self._read_fut.done():
# TODO: This is an ugly hack to cancel the current read future
# *and* avoid potential race conditions, as read cancellation
# goes through `future.cancel()` and `loop.call_soon()`.
# We then use this special attribute in the reader callback to
# exit *immediately* without doing any cleanup/rescheduling.
self._read_fut.__asyncio_cancelled_on_pause__ = True
self._read_fut.cancel()
self._read_fut = None
self._reschedule_on_resume = True
if self._loop.get_debug():
logger.debug("%r pauses reading", self) If you remove the "ugly hack", the test no longer hangs... |
_ProactorReadPipeTransport.set_transport(): if self.is_reading():
# reset reading callback / buffers / self._read_fut
self.pause_reading()
self.resume_reading() This method cancelled the pending overlapped WSARecv(), and then create a new overlapped WSARecv(). Even after CancelIoEx(old overlapped), the IOCP loop still gets an event for the completion of the recv. Problem: since the Python future is cancelled, the even is ignored and so 176 bytes are lost. I'm surprised that an overlapped WSARecv() cancelled by CancelIoEx() still returns data when IOCP polls for events. |
Something else. The bug occurs when CancelIoEx() (on the current overlapped WSARecv()) fails internally with ERROR_NOT_FOUND. According to overlapped.c, it means: /* CancelIoEx returns ERROR_NOT_FOUND if the I/O completed in-between */ HasOverlappedIoCompleted() returns 0 in that case. The problem is that currently, Overlapped.cancel() also returns None in that case, and later the asyncio IOCP loop ignores the completion event and so drops incoming received data. |
Yury, Andrew, Ned: I set the priority to release blocker because I'm scared by what I saw. The START TLS has a race condition in its ProactorEventLoop implementation. But the bug doesn't see to be specific to START TLS, but rather to transport.set_transport(), and even more generally to transport.pause_reading() / transport.resume_reading(). The bug is quite severe: we loose data and it's really hard to know why (I spent a few hours to add many many print and try to reproduce on a very tiny reliable unit test). As an asyncio user, I expect that transports are 100% reliable, and I would first look into my look (like looking into start_tls() implementation in my case). If the bug was very specific to start_tls(), I would suggest to "just" "disable" start_tls() on ProactorEventLoop (sorry, Windows!). But since the data loss seems to concern basically any application using ProactorEventLoop, I don't see any simple workaround. My hope is that a fix can be written shortly to not block the 3.7.0 final release for too long :-( Yury, Andrew: Can you please just confirm that it's a regression and that a release blocker is justified? |
race.py: simple echo client and server sending packets in both directions. Pause/resume reading the client transport every 100 ms to trigger the bug. Using ProactorEventLoop and 2000 packets of 16 KiB, I easily reproduce the bug. So again, it's nothing related to start_tls(), start_tls() was just one way to spot the bug. The bug is in Proactor transport: the cancellation of overlapped WSARecv() sometime drops packets. The bug occurs when CancelIoEx() fails with ERROR_NOT_FOUND which means that the I/O (WSARecv()) completed. One solution would be to not cancel WSARecv() on pause_reading(): wait until the current WSARecv() completes, store data something but don't pass it to protocol.data_received()!, and no schedule a new WSARecv(). Once reading is resumed: call protocol.data_received() and schedule a new WSARecv(). That would be a workaround. I don't know how to really fix WSARecv() cancellation without loosing data. A good start would be to modify Overlapped.cancel() to return a boolean to notice if the overlapped I/O completed even if we just cancelled it. Currently, the corner case (CancelIoEx() fails with ERROR_NOT_FOUND) is silently ignored, and then the IOCP loop silently ignores the event of completed I/O... |
I fixed the root issue, a race condition in ProactorEventLoop. But I prefer to keep the issue open a few days to see if the bug is really gone from all CIs. |
I ran attached race.py on the 3.6 branch:
It seems like Python 3.6 (at least the 3.6 development branch) doesn't have this bug. It didn't see the bug recently, so I close the issue. Python 3.6 has a different issue: pause_reading() cannot be called twice, same issue with resume_reading(). But I'm not sure that it can be called a bug. Python 3.7 allows to call these methods twice and to call these methods while closing: they just do nothing in that case. I close the issue. |
pause_reading and resume_reading became idempotent only in 3.7 |
Do you consider that it's a new feature or a bugfix? Should we backport this change to 3.6? Currently, race.py fails with an error because of the 3.6 behaviour. (I modified asyncio to be able to run race.py on 3.6.) |
Since it a minor change we can reconsider it as s bug fix. Feel free to make a pr. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: