test_ssl: test_wrong_cert_tls13() and test_pha_required_nocert() fail randomly on Windows #88087
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
assignee = 'https://github.com/tiran' closed_at = None created_at = <Date 2021-04-23.10:56:18.501> labels = ['expert-SSL', 'type-bug', 'tests', '3.10'] title = 'test_ssl: test_wrong_cert_tls13() and test_pha_required_nocert() fail randomly on Windows' updated_at = <Date 2021-06-04.12:30:13.828> user = 'https://github.com/pablogsal'
activity = <Date 2021-06-04.12:30:13.828> actor = 'vstinner' assignee = 'christian.heimes' closed = False closed_date = None closer = None components = ['Tests', 'SSL'] creation = <Date 2021-04-23.10:56:18.501> creator = 'pablogsal' dependencies =  files =  hgrepos =  issue_num = 43921 keywords = ['patch'] message_count = 28.0 messages = ['391677', '391717', '391725', '391775', '394317', '394318', '394320', '394340', '394343', '394346', '394397', '394398', '394399', '394400', '394407', '394408', '394916', '394924', '394944', '394955', '394967', '394968', '395011', '395043', '395044', '395045', '395072', '395078'] nosy_count = 4.0 nosy_names = ['christian.heimes', 'steve.dower', 'pablogsal', 'miss-islington'] pr_nums = ['25561', '25574', '26489', '26494', '26501', '26502', '26518', '26520'] priority = None resolution = None stage = 'needs patch' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue43921' versions = ['Python 3.10']
The text was updated successfully, but these errors were encountered:
Extract from the logs:
The Buildbot has detected a new failure on builder AMD64 Windows8.1 Non-Debug 3.x while building python/cpython.
Buildbot URL: https://buildbot.python.org/all/
Worker for this Build: ware-win81-release
Build Reason: <unknown>
BUILD FAILED: failed test (failure)
Summary of the results of the build (if available):
== Tests result: FAILURE then FAILURE ==
395 tests OK.
10 slowest tests:
1 test failed:
30 tests skipped:
1 re-run test:
Total duration: 9 min 15 sec
Traceback (most recent call last): File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\test\test_ssl.py", line 2333, in wrap_conn self.sslconn = self.server.context.wrap_socket( File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\ssl.py", line 518, in wrap_socket return self.sslsocket_class._create( File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\ssl.py", line 1070, in _create self.do_handshake() File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\ssl.py", line 1339, in do_handshake self._sslobj.do_handshake() ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:969) Traceback (most recent call last): File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\test\test_ssl.py", line 255, in wrapper return func(*args, **kw) File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\test\test_ssl.py", line 3171, in test_wrong_cert_tls13 self.fail("Use of invalid cert should have failed!") AssertionError: Use of invalid cert should have failed!
Another TLS 1.3 client cert auth test is flaky, too.
Traceback (most recent call last): File "D:\buildarea\pull_request.bolen-windows10\build\lib\test\test_ssl.py", line 4412, in test_pha_required_nocert s.recv(1024) AssertionError: SSLError not raised
Also, I think the solution in:
(skipping the test) is not acceptable for the next beta.
Christian, I know that this is complicated for you yo reproduce and that this is a hard issue to debug, but this seems linked to the changes in PEP-644 and the reality is that a release manager I don't know the cause of these new failures and the facts are that this is not happening in 3.9 or before.
If we don't fix this for the next beta, I will be forced to revert PEP-644 until we have a permanent fix.
I can't reproduce these on my own (Windows) machine either.
Looking at the output, I think the tests are just going to be inherently flakey. It's not testing the specific scenario directly enough, and relying heavily on implicit synchronization.
We probably just need a different approach to testing these. I think at the very least starting the server on the main test thread and using a helper to send the request is going to be easier to keep things in sync.
But that's not going to be done in time to unblock. Skipping the test seems like an okay compromise.
My notes to debug race conditions:
In general, you should run the same test in a loop in many processes in parallel *and* stress the machine with a random workload.
My favorite recipe:
Sadly, there is no silver bullet for -j20: sometimes, the machine must be "more idle" to trigger the bug (ex: -j5), sometimes the machine must almost die, be more stressed (-j100).
But 3.9 and 3.8 are not failing on test_ssl, so do something has clearly changed. Are the failing tests new?
They're flakey in the sense that they're not failing reliably, when they clearly should be :)
If something changed about how any part of the response is written into the buffer, it could change when the OS decides to dispatch it and unblock the receiving thread, which is going to affect the state observed by the test at the time it observes it (and as we already know, there are inherent race conditions that sometimes recognise the close of the connection before it's finished reading). These are difficult tests to write, because there are so many weird interactions like this.
In terms of this actual change, my best guess is that one of the removed checks actually mattered on Windows, probably because of a different compile flag. We just use the defaults as generated by their Perl scripts, and I'd hope that there'd be compile errors if anything too blatant was changed (I don't see anything obvious looking through 39258d3). But short of that, I don't see anything deterministic changed here.
Thanks for the explanation!
Yeah, I do agree with you in this, but I am still not happy that they *realiably* pass in 3.9 and 3.8 in two different OS where now they fail: Windows and MacOS
Well, but the test *deterministically* fail in 3.10 and main and don't fail in 3.9 or 3.8. And it also happens in MacOS.
I really don't want to be annoying here, I really don't. But please, understand my position as RM when I say that I feel uncomfortable with this situation as the test pass without any problem or complication in previous releases and now they don't and there has been a huge change in OpenSSL in the middle.
Also, the asyncio tests for SSL now hang a lot in the refleak buildbots. For example:
I feel uncomfortable with a solution that basically is "skip the tests", when the test before used to pass deterministically and without problems. I am not an expert in the SSL area by far, so I don't know exactly what is the risk here.
I really apologize if I am insisting on this, but I think is important.
On the main branch, I can reproduce test_pha_required_nocert() failure:
vstinner@DESKTOP-DK7VBIL C:\vstinner\python\main>python -m test test_ssl -u all -v -F -j5 -m test_pha_required_nocert
test_pha_required_nocert (test.test_ssl.TestPostHandshakeAuth) ... server: new connection from ('127.0.0.1', 57456) client cert is None client did not provide a cert server: connection cipher is now ('TLS_AES_256_GCM_SHA384', 'TLSv1.3', 256) TLS: (<ssl.SSLSocket fd=644, family=AF_INET, type=SOCK_STREAM, proto=0, laddr=('127.0.0.1', 57455), raddr=('127.0.0.1', 57456)>, 'write', TLSVersion.TLSv1_3, _TLSContentType.ALERT, _TLSAlertType.CERTIFICA TE_REQUIRED, b'\x02t') Test server failure: Traceback (most recent call last): File "C:\vstinner\python\main\lib\test\test_ssl.py", line 2444, in run msg = self.read() File "C:\vstinner\python\main\lib\test\test_ssl.py", line 2421, in read return self.sslconn.read() File "C:\vstinner\python\main\lib\ssl.py", line 1131, in read return self._sslobj.read(len) ssl.SSLError: [SSL: PEER_DID_NOT_RETURN_A_CERTIFICATE] peer did not return a certificate (_ssl.c:2522) FAIL
Traceback (most recent call last): File "C:\vstinner\python\main\lib\test\test_ssl.py", line 4458, in test_pha_required_nocert with self.assertRaisesRegex( AssertionError: SSLError not raised
See my comparison of read() and write() errors on Linux vs Windows:
I wrote PR 26502 to fix test_wrong_cert_tls13() on Windows (currently, the test is skipped).
On Linux, read() always raises an exception when the connection is reset.
On Windows, read() sometimes fails with SSL_ERROR_SYSCALL+WSAECONNRESET, and in this case the internal C function raises a SSLEOFError. But the outer Python wrapper method converts SSLEOFError to an empty string because the SSL socket is created with suppress_ragged_eofs=True by default.
I don't know why on Linux read() only fails with SSL_ERROR_SSL with ERR_peek_last_error()=0x14094418, whereas it's not the case on Windows. It may be an implementation detail, different between Windows socket and Linux socket.
In Python 3.9, test_pha_required_nocert() looks more strict, it requires read() to fail with the 'tlsv13 alert certificate required' error message:
In the main branch, it tolerates "EOF occurred" error:
# send empty Certificate + Finish s.write(b'HASCERT') # receive alert s.recv(1024)
Christian: I mostly care about buildbots. I cannot reproduce the issue in 3.9. If it's not broken, I don't want to touch the code :-)
For #70707, I explained that I don't want to backport it (it's only in main). Moreover, you asked for automated backports, but it failed. If you consider that it should be backported, please go ahead :-)
"GH-26502 is missing backport to 3.9. I also don't consider the changeset a proper fix. It's a patch that makes the test pass when something goes wrong. We have not yet figured out why something goes wrong on Windows sometimes."
I am not convinced that getting an SSLEOFError on Windows is a bug. I cannot explain it why 3.10 and main branches behave differently, but I'm not interested to investigate.