Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't sign X.509 certs #34202

Closed

Conversation

ripatel-fd
Copy link
Contributor

@ripatel-fd ripatel-fd commented Nov 22, 2023

Problem

  • Addresses a TODO comment regarding a vulnerability
  • Removes the pkcs8 dependency
  • Removes logic to create X.509 signatures (X.509 signatures are ignored)
  • Hardcodes the SAN (SAN is ignored)
  • Removes the rcgen dependency

Summary of Changes

Fixes #

@mergify mergify bot added community Community contribution need:merge-assist labels Nov 22, 2023
@mergify mergify bot requested a review from a team November 22, 2023 00:05
@ripatel-fd ripatel-fd changed the title streamer: remove pkcs8 dep and clarify self-signing don't sign X.509 certs Nov 22, 2023
@ripatel-fd
Copy link
Contributor Author

Related Firedancer patch: firedancer-io/firedancer@851e73f

@lijunwangs
Copy link
Contributor

Can you elaborate on what problems this PR solves? Why do we need this?

@ripatel-fd
Copy link
Contributor Author

@lijunwangs There is lots of background here. It starts with the messy integration of QUIC into Solana's peer-to-peer protocols. The Rust ecosystem still lacks supports for some modern TLS standards, like RFC 7250 RawPublicKeys. So, in order to establish a TLS connection, the validator needs to work with X.509 certificates. Currently, it uses self-signed certificates. These certs are almost entirely redundant, except for the fact that they carry the peer's Ed25519 public key.

Ever since QUIC was first added to the validator, there has been a historical security concern with how these certificates were created. Currently, the Labs validator uses the node identity key to sign X.509 certificates. The signature is useless however. Validators don't verify it. This form of signing introduces risk though: It exposes the node's private key to a handful of unaudited third-party dependencies. The signing payload could also be ambiguous with other messages signed by the identity key, such as Solana gossip packets or vote transactions.

So, this PR removes a lot of complexity. It removes several third-party crates used to construct the X.509 certificate with a single hardcoded template. (As mentioned above, the content of the cert doesn't matter, so it is fine to hardcode). The signature contained in the certificate is now deliberately invalid, as nodes don't verify it anyways. For similar reasons, the SAN and serial number are removed too.

What remains is a template holding a byte array with a DER serialization of an X.509 certificate. The pubkey field is then patched with the node's identity key as needed.

@@ -97,8 +93,7 @@ pub struct QuicConfig {

impl NewConnectionConfig for QuicConfig {
fn new() -> Result<Self, ClientError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this becomes infallible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@t-nelson No, unfortunately not. There are some trait impls of NewConnectionConfig for the legacy UDP code path that can still return an IOError.

quic-client/src/lib.rs Outdated Show resolved Hide resolved
streamer/src/tls_certificates.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes lgtm now. gonna tap in @lijunwangs and @behzadnouri for rationale

@behzadnouri
Copy link
Contributor

  • Can you please confirm that this change is backward compatible with current mainnet?
    i.e. if I spin up a staked node with this code on mainnet, other nodes still accept connections from my node and correctly recognize its pubkey?

  • This is probably a dumb question, but regarding

    The signature contained in the certificate is now deliberately invalid, as nodes don't verify it anyways.

    what exactly prevents pubkey spoofing with this code? i.e. a node claiming a pubkey which they don't possess the private key of.

@ripatel-fd
Copy link
Contributor Author

  • Can you please confirm that this change is backward compatible with current mainnet?
    i.e. if I spin up a staked node with this code on mainnet, other nodes still accept connections from my node and correctly recognize its pubkey?

I'm pretty sure it's safe, but I cannot 100% confirm this, as I haven't tested a build from this PR against an old Solana build.

Is there a simple CLI tool I can run that spins up a solana-streamer client or server? That would help with testing.
Ideally the test setup is reproducible so you can confirm yourself as well.

  • This is probably a dumb question, but regarding

    The signature contained in the certificate is now deliberately invalid, as nodes don't verify it anyways.

    what exactly prevents pubkey spoofing with this code? i.e. a node claiming a pubkey which they don't possess the private key of.

That's a great question, and I should have clarified! The TLS 1.3 CertificateVerify handshake message protects the connection. It is created via an overcomplicated hash tree that includes the ECDH key exchange secret and a hash of all of the incoming and outgoing TLS messages, called the "transcript hash". The transcript hash is then signed with Ed25519. Both parties verify that the signature matches the public key in the peer's X.509 certificate (in the "SubjectPublicKeyInfo" field of the "TBSCertificate" structure)

Both client and server deterministically compute the transcript hash and then do the verification.
So, in the case of a MitM attack where some handshake messages are modified or the pubkeys are swapped, the signature verification fails, and the QUIC connection terminates.

The client and server also put in a 32 byte random value in their ClientHello and ServerHello messages to prevent replay attacks. (So it has some properties of a challenge-response mechanism)

All TLS connections do this procedure described above. X.509 signatures are only used for the certificate chain, but they do not secure the connection itself. (e.g. I could easily take the github.com certificate and signatures and put it in my own web server. The peer will think the certificate is valid. But since I don't have the private key for it, I can't sign a valid TLS 1.3 CertificateVerify message)

In the web, we need both TLS CertificateVerifys (to prove cert ownership) and X.509 certs (to prove cert authenticity against the root CAs via cert chains). In P2P, we only need TLS CertificateVerify, as the blockchain state does the authentication part.

See RFC 8446:

4.4.3. Certificate Verify

This message is used to provide explicit proof that an endpoint
possesses the private key corresponding to its certificate. The
CertificateVerify message also provides integrity for the handshake
up to this point. Servers MUST send this message when authenticating
via a certificate. Clients MUST send this message whenever
authenticating via a certificate (i.e., when the Certificate message
is non-empty). When sent, this message MUST appear immediately after
the Certificate message and immediately prior to the Finished
message.

For completeness, here is a pretty picture describing exactly how the TLS CertificateVerify is created:

tls13_server drawio

@behzadnouri
Copy link
Contributor

Is there a simple CLI tool I can run that spins up a solana-streamer client or server? That would help with testing.
Ideally the test setup is reproducible so you can confirm yourself as well.

I think the easiest would be to run a staked node on testnet and verify quic connections are ok, and other nodes extract the correct pubkey from the connection.

behzadnouri
behzadnouri previously approved these changes Nov 29, 2023
core/src/repair/quic_endpoint.rs Outdated Show resolved Hide resolved
turbine/src/quic_endpoint.rs Outdated Show resolved Hide resolved
behzadnouri
behzadnouri previously approved these changes Nov 29, 2023
@lijunwangs lijunwangs added the CI Pull Request is ready to enter CI label Nov 29, 2023
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Nov 29, 2023
Copy link

codecov bot commented Nov 29, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (935e06f) 81.9% compared to head (57a35df) 81.9%.
Report is 426 commits behind head on master.

❗ Current head 57a35df differs from pull request most recent head 2b1c177. Consider uploading reports for the commit 2b1c177 to get more accurate results

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #34202     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         819      819             
  Lines      219765   219689     -76     
=========================================
- Hits       180094   179987    -107     
- Misses      39671    39702     +31     

.to_der()
.expect("Failed to convert keypair to DER")
.to_der();
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a link to where the prefix is documented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.rfc-editor.org/rfc/rfc8410#section-10.3 is probably a better example reference.

Nodes currently don't verify X.509 self-signed certificates
because peer authentication is done via TLS 1.3 CertificateVerify.
Thus, encodes an invalid signature in the X.509 certificate instead.
.to_der()
.expect("Failed to convert keypair to DER")
.to_der();
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.rfc-editor.org/rfc/rfc8410#section-10.3 is probably a better example reference.

cert_params
.distinguished_name
.push(DnType::CommonName, "Solana node");
let mut cert_der = Vec::<u8>::with_capacity(0xf4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference link please for the following sections. And instead of hard coding a certificate, can we not generate it without giving the key_pair?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with reference link? The X.509 spec? This code is functionally no different from what was there before. The previous version just used a crate to produce the same cert on all nodes, with just the serial number, SAN, and public key different. Here, we do the same, but precompute the ASN.1 serialization. I honestly thought this was worth getting rid of thousands of lines of dependency code, considering using X.509 is a mistake in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not generate it without giving the key_pair

Holding the public key is the only purpose of the cert, everything else is arbitrary. Without it, authentication is not possible. So we cannot remove the public key.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the reference doc's link, like ITU-T X.690. This hard code is just very hard to review. I really need to understand each code to approve this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lijunwangs Maybe it would help to have a test case that asserts the hardcoded serialization decodes to the structure mentions in the comment. We can use a dev-dependency for this. What do you think?

@ripatel-fd
Copy link
Contributor Author

@lijunwangs I wrote up an explanation of this change: https://forum.solana.com/t/deprecate-x-509-certs-for-p2p-connections/762. We could link it in the code if you think it helps.

Copy link
Contributor

@lijunwangs lijunwangs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to fully review the changes in this PR and test the changes to verify they are safe and working:

  1. I tested against the mainnet -- connections to the existing nodes are okay
  2. Tested pubkey spoofing in the certificate (impersonating someone else's pubkey in the X.509 cert) -- the validator is correctly rejecting with BadSignature during TLS1.3 handshaking.
  3. I created a manual test tool to compare the cert and key generated using this code and without the code and use openssl to compare the generated DER file. Do not see worthy aberrations other than the issuer name.
    The PR while making it safer by eliminating a third party crate does make the code a lot harder to review and maintain -- so the referential links are important.
    The code need to be rebased.

// RelativeDistinguishedName SET (1 elem)
// AttributeTypeAndValue SEQUENCE (2 elem)
// type AttributeType OBJECT IDENTIFIER 2.5.4.3 commonName (X.520 DN component)
// value AttributeValue [?] UTF8String Solana
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep using the old "Solana node" -- it is more appropriate. It is issued by this node -- not Solana networks.

@github-actions github-actions bot added stale [bot only] Added to stale content; results in auto-close after a week. and removed stale [bot only] Added to stale content; results in auto-close after a week. labels Dec 28, 2023
@github-actions github-actions bot added stale [bot only] Added to stale content; results in auto-close after a week. and removed stale [bot only] Added to stale content; results in auto-close after a week. labels Jan 12, 2024
@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 30, 2024
@lijunwangs
Copy link
Contributor

Superseded by #34896. Thanks for your efforts @ripatel-fd

@lijunwangs lijunwangs closed this Jan 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Community contribution need:merge-assist stale [bot only] Added to stale content; results in auto-close after a week.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants