Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add config option to specify server name for inter-node TLS verification #1507

Closed
jtackaberry opened this issue Dec 20, 2023 · 18 comments · Fixed by #1509 or #1520
Closed

Add config option to specify server name for inter-node TLS verification #1507

jtackaberry opened this issue Dec 20, 2023 · 18 comments · Fixed by #1509 or #1520

Comments

@jtackaberry
Copy link
Contributor

jtackaberry commented Dec 20, 2023

For highly dynamic environments like Kubernetes, configuring inter-node TLS is heavily complicated by the fact that all IP addresses for the cluster need to be decided in advance. This can be especially difficult in K8s clusters that don't support custom IP pool reservations (via a CNI driver such as Calico).

But, at present, failing to do this will result in connectivity issues:

[cluster-join] 2023/12/19 23:44:50 failed to join via node at [2001:470:b0d6:fd:8a42:53cc:a14a:5a32]:4002: factory is not able to fill the pool: tls: failed to verify certificate: x509: cannot validate certificate for 2001:470:b0d6:fd:8a42:53cc:a14a:5a32 because it doesn't contain any IP SANs

This can be avoided by -node-no-verify, but in addition to skipping the server name check, it also skips verification of the certificate chain, which isn't desirable in this case.

One not entirely uncommon solution is to provide a custom server name that the client uses for cert verification instead of the address it connected to. This way, we can issue a certificate with a fixed Common Name such as rqlite.db.svc.cluster.local, use that on all nodes in the cluster, and then have rqlite verify the cert contains that name instead of an IP SAN. (The name itself becomes arbitrary but still admin-controlled.)

This approach could also enable the option to automate certificate provisioning/rotation with Let's Encrypt. (Although, that would require rqlite to support dynamic reloading of cert files on disk, which I don't believe it does. At least, it doesn't support that for the user auth file in my testing.)

So I propose a new flag -node-verify-server-name=<name> as a means to override the server name for cert verification.

This value can then be passed to https://pkg.go.dev/crypto/tls#Config.ServerName along with the other options in:

func createBaseTLSConfig(noverify bool) *tls.Config {

@otoolep otoolep linked a pull request Dec 21, 2023 that will close this issue
@otoolep
Copy link
Member

otoolep commented Dec 21, 2023

Thanks for the suggestion, sounds reasonable.

To be clear, it only applies to inter-node comms?

@otoolep
Copy link
Member

otoolep commented Dec 21, 2023

And should both ends use ServerName if running mutual TLS?

@jtackaberry
Copy link
Contributor Author

To be clear, it only applies to inter-node comms?

Yes, because that's the only time rqlite needs to validate server certificate Common Name/SANs. i.e. it only establishes outgoing connections for inter-node comms.

Of course this does raise the tangential question of how cert verification is done for mTLS, both for the raft and http ports. I imagine the cert DN is entirely ignored and only the issuer chain is validated?

mTLS for client connections is typically done because you don't want to manage passwords (to avoid coordination between client and server for rotations), but rqlite currently lacks a means of mapping client cert DNs (or at least the CN within the DN) to an internal rqlite username. That would be required to do away with passwords. (Alternatively, with sufficient motivation, one could shim a reverse proxy like nginx in front of rqlite to take care of that and inject the HTTP Basic Auth header based on client's DN.)

Barring that, AFAICT mTLS is only a minor extra layer of security on top of passwords, unless you wanted to go with a separate CA per cluster, which fine for personal use or small companies, but can be high friction for larger organizations with overbearing infosec policies that tend not to appreciate ad hoc PKIs. (My own is learning to live with it thanks to Istio. :))

And should both ends use ServerName if running mutual TLS?

You mean is the ServerName field in the Config struct also used for client cert verification in the mTLS case? I'm afraid I don't know offhand how Go works here. The name "ServerName" would be overloaded if so (the proper client/server-agnostic term would be Subject DN, CN or SAN) , but I don't know this aspect of Go well enough to know if it is. I'm curious though, will do some Googling.

@jtackaberry
Copy link
Contributor Author

but I don't know this aspect of Go well enough to know if it is.

I have done this with openssl though, some years ago, and I know it allows a custom verification callback function that can be used to apply some custom logic to validate the peer DN (for both client and server cert checks). I have to imagine Go provides a similar facility, or else at least provides some equivalent to ServerName (or, like I said, overloads it) for verifying the the CN portion of the client cert DN.

@jtackaberry
Copy link
Contributor Author

@jtackaberry
Copy link
Contributor Author

The VerifyConnection callback receives tls.ConnectionState for both client and server scenarios, and the ServerName field comment says "it's available both on the server and on the client side." I don't really understand this. It's not clear to me what ServerName would be in the client cert verification context. But the PeerCertificates field would unambiguously make the subject DN available.

@otoolep
Copy link
Member

otoolep commented Dec 21, 2023

I'm going to support the narrow use case of telling nodes what server name to expect in a cert they receive from another node.

@jtackaberry
Copy link
Contributor Author

jtackaberry commented Dec 21, 2023

@otoolep I'm having some difficulties validating the new feature.

First, without -node-verify-server-name specified, I'm confused by the logs (fragments below):

[rqlited] 2023/12/21 18:18:35 launch command: /bin/rqlited -node-id rqlite-0 -http-addr 0.0.0.0:4001 -http-adv-addr rqlite-0.rqlite-headless.data.svc.cluster.local:4001 -raft-addr 0.0.0.0:4002 -raft-adv-addr rqlite-0.rqlite-headless.data.svc.cluster.local:4002 -auth=/config/sensitive/users.json -join-as=_system_rqlite -node-cert=/config/sensitive/node.crt -node-key=/config/sensitive/node.key -node-ca-cert=/config/sensitive/node-ca.crt -http-cert=/config/client-tls/tls.crt -http-key=/config/client-tls/tls.key -http-ca-cert=/config/sensitive/client-ca.crt -disco-mode=dns-srv -disco-config={"name":"rqlite-headless","service":"raft"} -bootstrap-expect=3 -join-interval=1s -join-attempts=120 -raft-shutdown-stepdown -fk=true /rqlite
[rqlited] 2023/12/21 18:18:35 enabling node-to-node encryption with cert: /config/sensitive/node.crt, key: /config/sensitive/node.key, CA cert /config/sensitive/node-ca.crt, mutual TLS enabled
[...]
[cluster-join] 2023/12/21 18:18:35 failed to join via node at [2001:470:b0d6:fd:8a42:53cc:a14a:5a30]:4002: factory is not able to fill the pool: tls: failed to verify certificate: x509: certificate is not valid for any names, but wanted to match /config/sensitive/node-ca.crt

/config/sensitive/node-ca.crt is the value passed to -node-ca-cert. This doesn't seem like it should be related here. Without node-verify-server-name I would have expected rqlite to want it to match the the remote address, in this case 2001:470:b0d6:fd:8a42:53cc:a14a:5a30?

Anyway, onto the more relevant part, now with -node-verify-server-name specified:

[rqlited] 2023/12/21 18:22:25 launch command: /bin/rqlited -node-id rqlite-2 -http-addr 0.0.0.0:4001 -http-adv-addr rqlite-2.rqlite-headless.data.svc.cluster.local:4001 -raft-addr 0.0.0.0:4002 -raft-adv-addr rqlite-2.rqlite-headless.data.svc.cluster.local:4002 -auth=/config/sensitive/users.json -join-as=_system_rqlite -node-cert=/config/sensitive/node.crt -node-key=/config/sensitive/node.key -node-verify-server-name=rqlite.data.svc.cluster.local -node-ca-cert=/config/sensitive/node-ca.crt -http-cert=/config/client-tls/tls.crt -http-key=/config/client-tls/tls.key -http-ca-cert=/config/sensitive/client-ca.crt -disco-mode=dns-srv -disco-config={"name":"rqlite-headless","service":"raft"} -bootstrap-expect=3 -join-interval=1s -join-attempts=120 -raft-shutdown-stepdown -fk=true /rqlite
[rqlited] 2023/12/21 18:22:25 enabling node-to-node encryption with cert: /config/sensitive/node.crt, key: /config/sensitive/node.key, CA cert /config/sensitive/node-ca.crt, mutual TLS enabled
[...]
[rqlited] 2023/12/21 18:22:25 failed to create cluster client: failed to create TLS config for cluster dialer: open rqlite.data.svc.cluster.local: no such file or directory

Looking at the code, in rqlited/main.go:

		dialerTLSConfig, err = rtls.CreateClientConfig(cfg.NodeX509Cert, cfg.NodeX509Key,
			cfg.NodeVerifyServerName, cfg.NodeX509CACert, cfg.NoNodeVerify)

But the signature of CreateClientConfig() is

func CreateClientConfig(certFile, keyFile, caCertFile, serverName string, noverify bool) (*tls.Config, error) {

So the server name is being passed in place of caCertFile. These two issues are probably related, now that I glance at the code.

BTW if it's better for you, I am happy to test directly out of master or a topic branch, before you merge or release.

@jtackaberry
Copy link
Contributor Author

I have time this afternoon, I can work on a PR.

@otoolep
Copy link
Member

otoolep commented Dec 21, 2023

Silly error.

@jtackaberry
Copy link
Contributor Author

Silly error.

Isn't it always. :)

I can take the fix-var-order branch out for the spin.

@otoolep
Copy link
Member

otoolep commented Dec 21, 2023

Thanks, fixed in 8.13.2. For larger changes going forward I might have you build master to avoid the churn.

@otoolep otoolep reopened this Dec 21, 2023
@otoolep
Copy link
Member

otoolep commented Dec 21, 2023

Let me know if you find any other issues.

@otoolep otoolep closed this as completed Dec 21, 2023
@jtackaberry
Copy link
Contributor Author

Let me know if you find any other issues.

Unfortunately I did. I submitted PR #1513.

@otoolep otoolep reopened this Dec 22, 2023
@otoolep
Copy link
Member

otoolep commented Dec 22, 2023

This is the problem with my lack of testing in this area. Unit testing obviously isn't sufficient.

I'm going to add more extensive testing first, then fix it. I'll take a look at the PR you sent up -- thanks.

@otoolep
Copy link
Member

otoolep commented Dec 22, 2023

Yeah, I think I found the issue. It's due to the awkward way the code was creating TLS configs for internode comms. I'm separating it out here: #1517

@jtackaberry
Copy link
Contributor Author

Yeah, I think I found the issue. It's due to the awkward way the code was creating TLS configs for internode comms.

Ah, excellent. So indeed the listener's tls.Config was being used for outgoing peer connections. That does seem to fully explain the mystery.

@otoolep
Copy link
Member

otoolep commented Dec 23, 2023

This should finally be working now -- end-to-end testing in place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants