Skip to content

Add Support for openssl v3.5 as a TLS Backend #4959

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 5, 2025

Conversation

nhorman
Copy link
Contributor

@nhorman nhorman commented Mar 31, 2025

Description

This commit enhances msquic to allow it to build using the canonical openssl sources as a TLS backend.

Testing

All existing tests should cover this code as far as I can see.

I've run this code base against the quic-interop-runner and it passes all supported test cases there (save for zerortt I believe) which fails with all TLS backends at the moment I think, but we have apis to support early data, so we should be able to get that up and running shortly.

All unit tests pass under windows currently, save for WithHandshakeArgs4.RandomLossResumeRejection/4 and WithHandshakeArgs4.RandomLossResumeRejection/5 which were failing scholastically. I believe I have those fixed by adjusting some of the asserts when built in debug mode which (If I read the code base right) assume that keys are updated in lock step, which the openssl quic callback api doesn't always do, instead updating keys in independent callbacks.

Documentation

Possibly. currently this PR renames openssl support (which initially pointed to quictls), to actually be named quictls[3] and takes over the QUIC_TLS=openssl configuration to use the canonical openssl sources. That may require build doc updates to reflect that, but I've not clearly found any (yet).

@nhorman nhorman requested a review from a team as a code owner March 31, 2025 16:20
@nhorman
Copy link
Contributor Author

nhorman commented Mar 31, 2025

@microsoft-github-policy-service agree [company="OpenSSL"]
@microsoft-github-policy-service agree
@microsoft-github-policy-service agree company="Microsoft"

@nhorman
Copy link
Contributor Author

nhorman commented Mar 31, 2025

@microsoft-github-policy-service agree company="OpenSSL"
@microsoft-github-policy-service agree

@guhetier
Copy link
Contributor

I do think that renaming openssl -> quictls is a good change.

I have some concern with the fact that it would cause a use to go from building with quitls to building with openssl without any action on their end or warning.

  • Is it something we want?
  • Could we find a slightly different name for the "openssl" parameter to avoid re-using the existing name?
    • maybe tagging everything with the major version number, so we have quictls1, quictls3, openssl3?

@nhorman
Copy link
Contributor Author

nhorman commented Mar 31, 2025

@guhetier I think such a change would be fine, though I'd be curious to see what others thought about the version being added to the name - i.e. is doing this committing to adding an openssl4/5/6 in the future? Given that msquic builds against whatever openssl variant is defined in the associated submodule, I'm not sure thats needed, though if you want to support multiple openssl versions, it might be beneficial.

@guhetier
Copy link
Contributor

@nhorman All these are great questions, we need to clarify how we see openssl upgrades happen in the future.

A different name pattern might also be an option - my concern is mainly about avoid the re-use of the same parameter with a different meaning. Lets gather some opinions about how to handle this best.

@nhorman
Copy link
Contributor Author

nhorman commented Mar 31, 2025

@guhetier I understand what you're getting at. I don't have a whole lot of skin in that game, but I'll gladly implement what you and the rest of the dev team think is best.

Just to float another idea for you to consider (and I grant this is additional departure from your current build approach, and so may not be viable), but you might consider:

  1. Removing the TLS submodules from the tree entirely
  2. Creating a cmake variable that points to an openssl-like tree
  3. using a configure time test in cmake to ascertain what platform adaptation layer to use based on what symbols are available in the library pointed to by (2)

i.e. check in cmake to see if SSL_provide_quic_data exists, if it does, then build the quictls platform adaptation layer, if SSL_set_quic_tls_cbs exists, then build the openssl platform adaptation layer.

That would disconnect you from needing to select a specific implementation manually at configure time.

In the interim, I'll start looking at these CI failures. Looks like I did something wrong in the openssl/quictls migration for the workflow files

Copy link

codecov bot commented Mar 31, 2025

Codecov Report

Attention: Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.

Project coverage is 86.81%. Comparing base (65660c4) to head (102bc39).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/core/crypto.c 77.77% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4959      +/-   ##
==========================================
- Coverage   87.38%   86.81%   -0.57%     
==========================================
  Files          59       59              
  Lines       18037    18045       +8     
==========================================
- Hits        15761    15666      -95     
- Misses       2276     2379     +103     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nhorman
Copy link
Contributor Author

nhorman commented Mar 31, 2025

ok, comments and CamelCase/snake_case issues fixed. Will start looking at CI failures next

@nhorman nhorman force-pushed the openssl-quic-api branch 2 times, most recently from af45354 to 3cbded3 Compare April 1, 2025 20:00
@nhorman
Copy link
Contributor Author

nhorman commented Apr 1, 2025

FYI, I have most of the cargo CI jobs building currently, but the windows static builds are failing, seemingly due to last nights windows-latest github runner update, which migrated from cmake 3 to cmake 4 which is having various build problems, that may be causing you trouble in other pull requests. We hit some similar problems in openssl this morning.

@guhetier
Copy link
Contributor

guhetier commented Apr 4, 2025

Thanks for fixing that on top of everything else!
We are seeing the issue on other PR and didn't get the time to fix it yet.

Sorry for the delay in reviewing the PR, I'll get to it as soon as I can.

@nhorman
Copy link
Contributor Author

nhorman commented Apr 4, 2025

don't rush, I know you all are busy. Feel free to cherry-pick any of the cmake/rust fixes to another PR if you want to prioritize those.

@nibanks nibanks added TLS: OpenSSL external Proposed by non-MSFT labels Apr 4, 2025
Comment on lines 1485 to 1487
// Note: Nominally, if we have the write key, we should have the read key
// but there might be a delay with openssl as it yields read and write
// secrets independently.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be abstracted from the core QUIC layer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had initially attempted to do so by delaying the installation of new keys until both read and write were available, but doing so resulted in other failures when the quic stack expected keys be available on return from ProcessData. I can play with it some more, but if you have additional input here as to how best to do this, I would appreciate it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand why openssl would give a read key without a write key?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nhorman any comments on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will eventually give both of course, but there is a difference in how quictls provides keys vs how the openssl api does. in quictls the yield secret callback derives both keys, and provides them in a single callback, so they are always updated synchronously, whereas in openssl we use the state machine call path which derives the keys independenly and provides the in two separate callbacks, so there may be a period of time when the read key is updated but the write key is not, or vice versa

Copy link
Contributor Author

@nhorman nhorman Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm experimenting with that now (and have in the past), as well as with the idea of "cheating", I.e. assuming that keys are symmetric, and so when I get one callback in either direction, installing that as the key for both directions, and skipping subsequent calls for the same protection level.

Referencing the former, its a bit tricky, as it appears the QUIC stack expects various key transitions after various calls to CxPlatTlsProcessData, and by buffering the keys, thats not always honored.

Referencing the latter specifically, it seems to generally work, but locally I'm hitting several different failures here, most notably in TlsTest.HandshakeResumption, which fails in on lines 418 of Test.cpp indicating that The BufferKey attempting to be sent (at level 0) doesn't match the Connection Read Key Level of 0RTT (level 1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nibanks I'd ask that @mattcaswell chime in again here, but just as a datapoint, If I back out the change that removed this particular assert, it occurs when the write key level is 0RTT and the read key level is HANDSHAKE.

Adding a bit of printf-ery in, it appears that we get this pattern in key callbacks When Running the Handshake/WithHandshakeArgs4.RandomLossResumeRejection (Which is whats failing in CI):

Installing secret SSL 0x7f639802a890 (Server) for Write, level 2
Installing secret SSL 0x7f639802a890 (Server) for Read, level 2
Installing secret SSL 0x7f639802a890 (Server) for Write, level 3
Installing secret SSL 0x7f63a0006d70 (Client) for Read, level 2
Installing secret SSL 0x7f63a0006d70 (Client) for Write, level 2
Installing secret SSL 0x7f63a0006d70 (Client) for Read, level 3
Installing secret SSL 0x7f63a0006d70 (Client) for Write, level 3
Installing secret SSL 0x7f639802a890 (Server) for Read, level 3
Installing secret SSL 0x7f63a000b280 (Client) for Write, level 1   <=====HERE
Installing secret SSL 0x7f639403b130 (Server) for Write, level 2
Installing secret SSL 0x7f639403b130 (Server) for Read, level 2
Installing secret SSL 0x7f639403b130 (Server) for Write, level 3
Installing secret SSL 0x7f63a000b280 (Client) for Read, level 2 <==== AND HERE

It appears that the client can receive the write key for 0RTT, then the read key for HANDSHAKE, which (maybe) makes sense? @mattcaswell?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming that keys are symmetric, and so when I get one callback in either direction, installing that as the key for both directions,

This is not a valid assumption.

For all other TLS libraries we've used, here in the completion processing code for QUIC, we never expected to have a Read key that is more recent than a write key. Can you explain the scenario where OpenSSL breaks that assumption, generally in wire-format terms?

We don't make any guarantees about the order of availability of keys. On the client side I think it turns out to be true that you will never get a read key installed after the write key for any particular protection level. On the server side I don't think this holds true. For example we install the app data write keys immediately after we have written the server Finished - but we don't install the app data read keys until we have processed the client Finished.

Copy link
Contributor

@anrossi anrossi May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't make any guarantees about the order of availability of keys

But why was this design chosen? To offer maximum flexibility for future algorithms?
We're trying to understand how it's possible to receive TLS 1.3 records such that the handshake or app keys wouldn't be available together.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design is driven by when the keys are made available for "standard" TLS. It's the TLS implementation that decides to release keys (independent of whether we are doing QUIC, standard TLS or DTLS). In the example I gave above we install app data write keys immediately after we have written the server Finished because it is possible for the server to start writing app data immediately, even though the client has not been authenticated yet (so called 0.5RTT data). At least this is true for standard TLS, I don't know what the QUIC RFC has to say about it. But the app data read keys aren't released until the client Finished has been received (which may occur sometime later) and therefore the client has been authenticated.

@nibanks
Copy link
Contributor

nibanks commented Apr 4, 2025

I'm thinking that we should move away from submodules for TLS dependencies. I never likely having two for quictls/openssl v1.1 and v3, but having 3 submodules only makes this worse. I'm open to suggestions, but I'm thinking about having CMake dynamically get the submodule and put it in a well-known path (perhaps still submodules/openssl, or perhaps the build folder). Any better ideas?

P.S. Depending on how we do this (likely in a separate PR first), we might be able to isolate most of the changes in this PR (all the name changes)...

//
// TODO: Remove this when the above tests are tolerant of addition of ML-KEM keyshares
//
SSL_set1_groups_list(TlsContext->Ssl, "secp256r1:x25519");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest trying "?*X25519:?secp256r1 / ?X448:?secp384r1:?secp521r1 / ?ffdhe2048:?ffdhe3072" as this will set the same configuration as in default settings but without enabling the hybrid key share .

@nhorman
Copy link
Contributor Author

nhorman commented May 23, 2025

Note I've created:
#5116
#5117

To track these remaining issues

@nhorman
Copy link
Contributor Author

nhorman commented May 23, 2025

Success!

Of the remaining tests that failed:

  1. Codecov - @nibanks I think you mentioned that this test was somewhat overly agressive, and I think in this case that holds. The patch its complaining about is a number of comments, with one extra conditional error case added, that we don't fall into nominally, so I think this can be ignored. Please tell me if you agree

  2. The stress failure. I think this error is unrelated to this PR, as its a memory leak in QuicSentPacketPoolGetPacketMetadata, and I see this failure in other test runs (notably here https://github.com/microsoft/msquic/actions/runs/15090263875/job/42417843480)

So I think at this point we're generally in good shape.

At this point I think the next steps are:
a) We need to get the openssl changes for this support effort merged (from this openssl PR openssl/openssl#27646). Its generally ready to go, I need to add a test for it, which I'm doing today, and expect that to go in sometime next week

b) Once (a) is done, I need to update this PR to stop using my personal branch with those openssl changes, and update the openssl submodule to pull from the openssl-3.5 branch again

c) I should probably rebase this PR to master, and squash all my intervening changes down to something more readable and reviewable, and run it through CI one more time to ensure that nothing new broke from that rebase effort.

Please tell me if you disagree, but at that point I think we're good to go.

@nibanks
Copy link
Contributor

nibanks commented May 23, 2025

So I think at this point we're generally in good shape.

At this point I think the next steps are: a) We need to get the openssl changes for this support effort merged (from this openssl PR openssl/openssl#27646). Its generally ready to go, I need to add a test for it, which I'm doing today, and expect that to go in sometime next week

b) Once (a) is done, I need to update this PR to stop using my personal branch with those openssl changes, and update the openssl submodule to pull from the openssl-3.5 branch again

c) I should probably rebase this PR to master, and squash all my intervening changes down to something more readable and reviewable, and run it through CI one more time to ensure that nothing new broke from that rebase effort.

Please tell me if you disagree, but at that point I think we're good to go.

I agree! Really awesome job so far! Thanks so much! @anrossi please help review the latest and let's get this merged.

@nhorman
Copy link
Contributor Author

nhorman commented May 27, 2025

forgive the delay here, we've found some issues with openssl/openssl#27646 that I'm addressing. Hope to have them taken care of shortly.

@nhorman
Copy link
Contributor Author

nhorman commented May 30, 2025

@mattcaswell has an alternate fix for key yielding in openssl that I'd like to try on this PR if we could please

@nhorman
Copy link
Contributor Author

nhorman commented May 30, 2025

Note, the ubuntu/quictls/android failure:
https://github.com/microsoft/msquic/actions/runs/15347134043/job/43185942723?pr=4959
seems completely unrelated to this PR, likely some change in the master branch to CI that the workflow in this PR isn't handling, or a transient failure, should come out in the wash when I rebase

@nhorman
Copy link
Contributor Author

nhorman commented Jun 3, 2025

ok!
openssl/openssl#27732 has been merged to the openssl repo

I've cleaned up this pr, squashing all the various commits down to something logical, and updated the openssl submodule to the latest on the 3.5 branch, so as to pull in openssl/openssl#27732

It passes all the unit tests locally on my linux and windows system here, so I think we just need to re-run CI to make sure I haven't screwed up any corner cases and I think we're good to go.

nhorman added 9 commits June 4, 2025 08:49
Openssl, in supporting ML-KEM, can trip over a corner case in asserting
what key level we're at in this code.
Just Marking a TODO for later, in that the fuzzer should support TLS
records that span multiple datagrams (which is the case when ML-KEM is
in use)
BufferKey is only set at the start of the test, But some TLS implementations
may update their keys while processing the data passed into this function
specifically observed on openssl, Sending a buffer with a ServerHello to a client
will yield handshake keys immediately, and following data will cause this to fail
so only check once at the start of the test to ensure we are in the right state
@nhorman nhorman force-pushed the openssl-quic-api branch from 6e3ad70 to 102bc39 Compare June 4, 2025 12:51
@nhorman
Copy link
Contributor Author

nhorman commented Jun 4, 2025

one more rebase to correct the recent conflict with the main branch

@nhorman
Copy link
Contributor Author

nhorman commented Jun 4, 2025

ok, I think we're ready for review here. We had 3 failures in CI. Two from code coverage due to 2 exercised lines of code, which I think can be ignored, and what appears to be a transient failure on quictls that I think is probably out of scope for this PR. Please review at your convenience @nibanks @anrossi

@nibanks
Copy link
Contributor

nibanks commented Jun 4, 2025

ok, I think we're ready for review here. We had 3 failures in CI. Two from code coverage due to 2 exercised lines of code, which I think can be ignored, and what appears to be a transient failure on quictls that I think is probably out of scope for this PR. Please review at your convenience @nibanks @anrossi

Awesome! Thanks so much for your tireless work here! Really appreciated!

@nhorman
Copy link
Contributor Author

nhorman commented Jun 4, 2025

no sweat, it was a good exercise!

Copy link
Contributor

@anrossi anrossi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for your all your hard work here! This has been a monumental effort, and we really appreciate your work, patience, and persistence on this PR.

@nibanks nibanks merged commit cc93053 into microsoft:main Jun 5, 2025
487 of 493 checks passed
@nhorman
Copy link
Contributor Author

nhorman commented Jun 5, 2025

Thank you all! You're help on this was greatly appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
external Proposed by non-MSFT TLS: OpenSSL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants