Bugfix: Send the TLS alert to the peer upon a ssl.CertificateError. #3413

Open
sasozivanovic wants to merge 8 commits into python-trio:main from sasozivanovic:ssl_errors

Conversation

@sasozivanovic

Disclaimer: I'm new to networking, async stuff, and especially SSL (just a regular programmer who wants to inform the client that their certificate expired), while SSLStream._retry is claimed to be "probably the single trickiest function in Trio" in the code comments. I tried my best, and learned a lot on the way, but this pull request really has to be reviewed, and thoroughly, by someone who knows very well what they are doing. 😁

This commit (partially) addresses python-trio/trio-websocket#199, which I traced to a bug in Trio itself. A corresponding bugfix relying on the present bugfix will be submitted to the Trio Websocket repo.

From the user's perspective, the issue was that a TLS client hung after submitting an invalid (e.g. expired) client certificate.

Before this fix, when SSLStream._retry caught a ssl.CertificateError, the error was immediately re-raised (wrapped in a trio.BrokenResourceError). The TLS alert prepared by the ssl library and waiting in the outgoing MemoryBIO was therefore never sent to the peer.

Now, upon catching a ssl.CertificateError, we first check whether we have any pending outgoing data (self._outgoing.pending). If so, the exception is stashed away and only raised (again, wrapped in a trio.BrokenResourceError) after sending out the pending data (which should contain the alert).

(I had first tried an alternative implementation, where the CertificateError was not stashed away. Rather, the error was raised immediately, but only if there was no pending outgoing data. The idea was to rely on the loop in SSLStream._retry. Upon seeing CertificateError for the first time, there would be pending data, so the exception would not be reraised and the pending data would be sent out, while upon seeing the CertificateError for the second time, there would be no pending data, so the exception would be re-raised. However, it turned out that the loop did not always continue (I'm not sure why), so there was no second time in some situations.)
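The stash-and-flush control flow described above can be illustrated with a minimal, self-contained sketch. To be clear, this is not trio's actual `_retry` (which is far trickier): `FakeBIO`, `retry_sketch`, and the local `BrokenResourceError` are invented stand-ins used only to show the ordering of flush vs. raise.

```python
import ssl

class FakeBIO:
    """Stand-in for ssl.MemoryBIO, tracking pending outgoing bytes."""
    def __init__(self):
        self._buf = b""
    @property
    def pending(self):
        return len(self._buf)
    def write(self, data):
        self._buf += data
    def read(self):
        data, self._buf = self._buf, b""
        return data

class BrokenResourceError(Exception):
    """Stand-in for trio.BrokenResourceError."""

def retry_sketch(op, outgoing, send_all):
    """On CertificateError: if the ssl layer queued outgoing bytes (the
    alert), stash the error, flush the bytes, and only then raise."""
    stashed = None
    try:
        result = op()
    except ssl.CertificateError as exc:
        if outgoing.pending:
            stashed = exc  # defer the raise until the alert is on the wire
        else:
            raise BrokenResourceError from exc
    if outgoing.pending:
        send_all(outgoing.read())
    if stashed is not None:
        raise BrokenResourceError from stashed
    return result

# Demo: the operation fails verification, leaving alert bytes in the BIO.
sent = []
bio = FakeBIO()
def failing_op():
    bio.write(b"\x15\x03\x03")  # pretend the ssl module queued an alert
    raise ssl.CertificateError("certificate has expired")
try:
    retry_sketch(failing_op, bio, sent.append)
except BrokenResourceError:
    pass
assert sent, "the alert bytes were flushed before the error propagated"
```

The key design point is that the exception is held in a local variable rather than re-raised in place, so the existing "send pending data" step runs first.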

Tests in test_ssl.py

test_ssl_client_basics (modified)

Here we test whether the TLS alert sent out by the client reaches the (blocking) server.

The second (no CA file) and the third (wrong host name) subtest of this test were modified to check that the server encounters the correct SSL error. In the old code, the server encountered UNEXPECTED_EOF_WHILE_READING (protocol error) in both subtests. After the fix, it correctly receives TLSV1_ALERT_UNKNOWN_CA and SSLV3_ALERT_BAD_CERTIFICATE, respectively.

To facilitate the modified test, the function ssl_echo_serve_sync (called by ssl_echo_server_raw, which is called by this test) now accepts a special value, "raise", for the keyword argument expect_fail: the error is still expected, but is re-raised so that it can be caught and inspected in the test code.

test_client_certificate (new)

Here we test whether the TLS alert sent out by the server reaches the (blocking) client. The test is modeled on test_ssl_server_basics.

The server is configured to require client authentication (SERVER_CTX_REQ, defined at the top of the file). In the first subtest, the client submits a valid certificate; in the second subtest, an expired one.

There is a complication with the second subtest. If the client does not send out the TLS alert (like before the bugfix), the server hangs, but we don't want the test to hang. I could think of no other way to test whether the server hangs than imposing an arbitrary (small) timeout, and there is a (very) small chance that even the correct code will not finish within the allotted time, resulting in a false negative.
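The hang-detection idea above can be sketched with a plain threading.Event wait. This is a stdlib-only stand-in for the test's actual thread and trio.Event machinery, and `wait_or_fail` is an invented name; it just shows the pattern of turning a potential hang into a test failure, with a generous timeout to reduce the false-negative risk on slow CI runners.

```python
import threading

def wait_or_fail(got_alert: threading.Event, timeout: float = 1.0):
    """Wait for the alert; fail the test instead of hanging forever.
    A too-small timeout risks false negatives on slow machines, so
    err on the generous side."""
    if not got_alert.wait(timeout):
        raise AssertionError("peer never delivered the TLS alert")

# Demo: a background "client" that does deliver the alert in time.
ev = threading.Event()
threading.Timer(0.05, ev.set).start()
wait_or_fail(ev, timeout=2.0)
```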

@codecov

codecov bot commented Mar 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00000%. Comparing base (82af3ab) to head (71fe831).

Additional details and impacted files
@@               Coverage Diff               @@
##                 main        #3413   +/-   ##
===============================================
  Coverage   100.00000%   100.00000%           
===============================================
  Files             128          128           
  Lines           19417        19482   +65     
  Branches         1318         1321    +3     
===============================================
+ Hits            19417        19482   +65     
Files with missing lines    | Coverage               | Δ
src/trio/_ssl.py            | 100.00000% <100.00000%> | (ø)
src/trio/_tests/test_ssl.py | 100.00000% <100.00000%> | (ø)

@sasozivanovic
Author

On the coverage report. The missing code will never be reached assuming that every ssl.CertificateError produces pending outgoing data (a TLS alert). I have no idea whether this is the case.

sasozivanovic added a commit to sasozivanovic/trio-websocket that referenced this pull request Mar 28, 2026
This commit relies on the bugfix proposed in PR python-trio/trio#3413 to
address issue python-trio#199.

From the user's perspective, the issue was that a TLS client hung after
submitting an invalid (e.g. expired) client certificate.

The trio bugfix makes sure that the server (in general, peer) sends out the TLS
alert after catching a `ssl.CertificateError`.  In this commit, we do two
things.

First, we intercept `ssl.SSLError`s created by the alerts when receiving data
from the stream (`WebSocketConnection._reader_task`) and sending data to the
stream (`WebSocketConnection._send`), and wrap them into a `HandshakeError`.

Second, we catch and ignore these `HandshakeError`s in
`WebSocketServer._handle_connection`, so that a certificate error does not
crash the server.  Here, I'm unsure whether we should ignore *all* SSL
errors (currently, this is the case), or maybe only those generated by alerts,
or only those having to do with certificates, and neither do I know how to
compile a comprehensive list of the latter, if that is what should be done.
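The two steps described in this commit message could look roughly like the following stand-alone sketch. `HandshakeError`, `recv_wrapped`, and `handle_connection` are invented stand-ins, not trio-websocket's actual methods; the point is only the wrap-then-swallow shape.

```python
import ssl

class HandshakeError(Exception):
    """Stand-in for a trio-websocket handshake-failure exception."""

def recv_wrapped(recv):
    # Step 1: translate SSL errors raised while talking to the stream
    # into a handshake failure the caller can recognize.
    try:
        return recv()
    except ssl.SSLError as exc:
        raise HandshakeError("TLS handshake failed") from exc

def handle_connection(run_handler):
    # Step 2: on the server side, a client's bad certificate must not
    # crash the server, so the handshake failure is swallowed here.
    try:
        run_handler()
    except HandshakeError:
        pass  # real code would likely log this instead

# Demo: a read that fails with an alert-generated SSL error.
def bad_recv():
    raise ssl.SSLError("TLSV1_ALERT_UNKNOWN_CA")

handle_connection(lambda: recv_wrapped(bad_recv))
print("server still running")
```

Whether to swallow *all* SSL errors or only certificate-related ones is exactly the open question raised above; this sketch swallows everything that was wrapped.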
@A5rocks
Contributor

A5rocks commented Mar 28, 2026

(similar to you, I don't especially know much about networking....)

I could think of no other way to test whether the server hangs than imposing an arbitrary (small) timeout, and there is a (very) small chance that even the correct code will not finish within the allotted time, resulting in a false negative.

You can drop this -- we already have general timeouts for tests. We have other tests which do similar evil things (specifically the REPL, if I remember correctly).

However, it turned out that the loop did not always continue (I'm not sure why), so there was no second time in some situations

Huh I'm surprised the issue isn't instead that CertificateError always adds pending data, meaning an infinite loop! Strange.

On the coverage report. The missing code will never be reached assuming that every ssl.CertificateError produces pending outgoing data (a TLS alert). I have no idea whether this is the case.

IMO add an assert instead of handling the case, even though the behavior seems right.


Looks like CI is failing! Also please rename the newsfragment, I think 3413 instead of NNNN should work. Thanks for the PR!

According to the coverage report, `self._outgoing.pending` is always non-empty.
`datetime.UTC` was only introduced in Python 3.11, so the CIs were failing on
setups with Python 3.10.
The additional semantics for `expect_fail` still seems less invasive than a new
keyword argument like `silent`.
@sasozivanovic
Author

Thanks for the quick reply!

I could think of no other way to test whether the server hangs than imposing an arbitrary (small) timeout, and there is a (very) small chance that even the correct code will not finish within the allotted time, resulting in a false negative.

You can drop this -- we already have general timeouts for tests. We have other tests which do similar evil things (specifically the REPL, if I remember correctly).

I'm not sure about this. The general timeout is not only rather long (1 minute), it also doesn't do the job: the test client works in a thread, but this thread is not abandoned upon the general timeout, so the test hangs.

I followed the rest of your suggestions — thanks for the ideas and reminders! I also looked at the CI reports, and I managed to fix all issues but one (occurring twice):

src/trio/_tests/test_ssl.py:482: error: Incompatible types in assignment (expression has type "ExceptionInfo[BrokenResourceError]", variable has type "ExceptionInfo[ExceptionGroup[SSLError]]") [assignment]

I first thought this was an issue with exception group nesting, so I switched from `with pytest.raises(BrokenResourceError)` to `pytest.RaisesGroup(BrokenResourceError, allow_unwrapped=True, flatten_subgroups=True)` in 02aee02, but this was a wrong move. The real problem is BrokenResourceError vs SSLError. I find this very weird: CI / Ubuntu (3.13, check formatting) is the only check which complains ... and if I switch from BrokenResourceError to SSLError, the test doesn't pass on my machine — which I understand: expecting a BrokenResourceError is what makes sense, because `._retry` (called by `.send_all`) should wrap the SSLError into a BrokenResourceError. I'll look more into this, but help would be greatly appreciated.

@A5rocks
Contributor

A5rocks commented Mar 29, 2026

My first guess about the 3 mypy errors (not looking at the code) is that the first is because you forgot to explicitly provide a type somewhere, and the two you mention I suspect are because the variable name is reused? (mypy is very picky that variables always mean the same type! at least by default, I forget if we changed the right settings...)


The general timeout is not only rather long (1 minute), it also doesn't do the job: the test client works in a thread, but this thread is not abandoned upon the general timeout, so the test hangs.

Hm, yeah, I suppose. There's other tests that hang rather than fail, though that's definitely a worse experience (but... as long as CI fails if this is broken, all's well!). I guess it's up to you, though make sure to size the timeout large enough because the free GitHub Actions runners we use are... pretty bad -- a sleep(1) can sleep 2 seconds!

Contributor

@A5rocks A5rocks left a comment


Hopefully these specific suggestions help? It's possible there's a mypy bug or something weirder going on...

@sasozivanovic
Author

@A5rocks Many thanks, I'll look into this right away!

@sasozivanovic
Author

sasozivanovic commented Mar 29, 2026

@A5rocks Fantastic, your suggestions worked like a charm!

@sasozivanovic
Author

The general timeout is not only rather long (1 minute), it also doesn't do the job: the test client works in a thread, but this thread is not abandoned upon the general timeout, so the test hangs.

Hm, yeah, I suppose. There's other tests that hang rather than fail, though that's definitely a worse experience (but... as long as CI fails if this is broken, all's well!). I guess it's up to you, though make sure to size the timeout large enough because the free GitHub Actions runners we use are... pretty bad -- a sleep(1) can sleep 2 seconds!

I'm thinking of increasing the timeout from 0.1s to 1s, just to be on the safe side. The thing is, this is not a timeout on the entire test, but on a very specific part of it, between the client receiving the TLS alert and setting the trio.Event.
