Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA abstraction #46

Closed
wants to merge 5 commits into from
Closed

CA abstraction #46

wants to merge 5 commits into from

Conversation

tombentley
Copy link
Member

This PR is for proposing an abstraction of Strimzi's existing certificate handling.

It is currently a bit high level. We'll need to discuss a lot of the details (for example, I've not described in detail which Secrets would be used to store certificates in various states), but this describes a vision which we can start discussing the merits of, and whether is solves all the right use cases.

Signed-off-by: Tom Bentley <tbentley@redhat.com>
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. But looks like a good start. I think elaborating a bit more on what secrets will be used to store what and how to store the state machines would be good for better understanding how it all fits together.


1. The issuing operator needs to figure out which EE certificates are missing (or in their renewal period, or otherwise need replacing, e.g. via annotation). This need for a certificate might exist independently of the end entity. For example, if we're going to scale up the Kafka cluster then we need a certificate for the new broker before the new broker exists. **The issuing operator puts the end entity in the `NEW` state.**
2. (**See step 1 in the diagram.**) The issuing operator requests a new certificate. **The issuing operator puts the end entity in the `REQUESTED` state.**
3. (**See step 2 in the diagram.**) The issuing operator cannot assume that issuance will happen quickly (i.e. that `pollForRequestedCertificate()` returns a non-empty `Optional` immediately), so it needs to be able persist the token obtained from `requestCertificate()` in a `Secret`, _and possibly complete the current reconciliation_ (i.e. issuance can be very asynchronous).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not really clear what token are you talking about here. Maybe you should explain what certificate management systems you considered and what are their workflows.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to couple the operator's reconciliation loop to an external system. It's not so much of a problem when bootstrapping a cluster, but blocking reconciliation because we need to renew a certificate feels like a bad idea, since it prevents other actions on the cluster. So rather than polling pollForRequestedCertificate within a single reconciliation, waiting for a non-empty result, the idea is to allow the reconciliation to proceed and call pollForRequestedCertificate again in the later reconciliation. That way we don't have to reason about how quickly a given certificate management system might issue a certificate. Its availability wouldn't matter much. There could be a human in the loop checking each CSR and the operator would still function just fine. It would be possible to implement the external issuer mentioned in the rejected alternatives and the operator would still function correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree that we should not stop the reconcile but why not using the same approach we have with the Cruise Control rebalance process where, while waiting for a proposal Ready and it's still in PendingProposal, there is a polling on the CC REST API to check the status, instead of waiting for the next reconcile loop that could come after 2 mins by default, or sooner or later if user changes the default value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With CC there's nothing else that might need to be done by that operator, so it's OK to poll the CC REST indefinitely. That's not the case with the Kafka operator.


#### Process for ensuring CA trust

When an EE certificate is issued via a `CertificateIssuer` it might have been signed by a CA that is not yet trusted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like the idea how we trust any CA which comes via the certificate issuer. Trusting the CA means trusting all certificates issued by it. So it might be a way how to sneak in undesired CA. Have you considered separating the EE certificate process from the CA trust process and have users explicitly specify CAs they trust? The EE process can then just check if it is trusted or not and decide whether to continue or error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could certainly add an allowlist of trusted CA certificates, and not trust CA certificates not in that list. We'd need a NOT_ALLOWED state in the CA certificate trust state machine. Then the user would need to take some action, since requesting another certificate from that issuer is likely to result in the same unacceptable CA certificate being used. Would that be sufficent @scholzj?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. It seems more secure to require the trust to be explicitly configured rather the expect not-trust to be configured.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the person configuring the Kafka CR necessarily know the root CA certificate which their configured certificate issuer is going to use? If they do then they can configure the Kafka CR to trust EE certificates signed with that root CA certificate. But if they don't then they can't configure the Kafka CR at all.

Even when they do know, it creates extra problems: What should happen when that root CA certificate expires? The operator then is stuck. So we need extra management code to add warning conditions when that's going to happen etc. I'm sure we can do that, but I'm not sure we want the complication in the initial implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not necessarily about the person knowing. The provider implementation should know. As explain in the main comment => you already have a chunk of this information in the CA configuration. In case of the internal issuer, it would be just the Kubernetes Secret as today. In case of the CertManager, it is the Issuer what defines what CA is used. So the provider implementation knows how to get it.

Imagine the difference like this:

  • If you take the CAs from the issued certificate, as a bad actor I need just to subvert one issued certificate to inject Strimzi with my own CA and gain access. The only way you would notice is by checking the CAs trusted by Strimzi Kafka cluster.
  • If you take the CA from the provider implementation instead of from the issued certificates, the bad actor would need to modify the Issuer to get Strimzi to trust some new CA. That has IMHO better chance of being noticed because the issuer might be used by other apps as well etc.

035-ca-abstraction.md Outdated Show resolved Hide resolved

When an EE certificate is issued via a `CertificateIssuer` it might have been signed by a CA that is not yet trusted.
The cluster operator (for the cluster CA), or the user operator (for the clients CA) needs to detect this lack of established trust and ask the cluster operator to trust the new CA.
The cluster operator and the user operator run in separate processes, so this will be done via a shared `Secret`, for example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a bit to the comment above -> there is some demand for having different trusted clients CA per listener. Going in that direction might IHMO require better separation of UO and CO. I think to keep it simple, you could for example establish trust to the clients CA via the Kafka CR. The User Operator does not really care about it, it can just request the EE from whatever issue is configred in the KafkaUser resource and leave it up to user to make sure the issuers CA is also trusted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That assumes that the user knows beforehand which CA cert will be used by the issuer. Is that necessarily always true? They might know today, but what about some future time when that CA cert is close to expiring and the CA starts using a new one? If the person operating the cluster hasn't configured trust in that clients CA cert then the UO would happily issue user EE certs which won't work. Or should the CO pass the set of trusted certs to the UO, so the UO can check the issued certs will be trusted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be that it's a mistake to try to have any common abstraction that covers the cluster CA and the clients CA. From the PoV of the Kafka cluster it would work for the user to specify a set of CA certs to trust (and roll the brokers when that changed). And for the UO just make it a requirement that there is a CA (possibly intermediate) cert and key that can be given to the UO to issue certificates. In other words, not use this CA abstraction at all for the UO. That would certainly simplify a number of things, like the need for the shared Secret. At the cost of making the UO less broadly applicable, since it would still be making this assumption about being able to issue certificates itself.

EE certificate expiry is driven by the cert path validation described above.
The issuing operator will perform cert path validation and if the end entity certificate is currently expired, or will be expired on `now + renewal period` then a new EE certificate will be issued as described in the previous section.

#### Intermediate or root CA cert expiry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deserves a bit more detail on how would it work. For example in my past investigation, Cert Manager didn't really allow a smooth migration from old CA to new CA because it just replaced the CA. That makes this a bit complicated and that is also why as mentioned in one of the comments above you might need a clear separartion of the secrets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I plan to update with which Secretss are updated in which steps.

035-ca-abstraction.md Show resolved Hide resolved
@katheris
Copy link
Contributor

katheris commented Dec 7, 2021

Overall the proposal sounds good to me. The main section that I feel like has the most unknowns is using the secret state to drive the logic flow, if there are multiple certificate updates happening in quick succession it isn't immediately obvious to me how the system would cope with that.


1. The issuing operator needs to figure out which EE certificates are missing (or in their renewal period, or otherwise need replacing, e.g. via annotation). This need for a certificate might exist independently of the end entity. For example, if we're going to scale up the Kafka cluster then we need a certificate for the new broker before the new broker exists. **The issuing operator puts the end entity in the `NEW` state.**
2. (**See step 1 in the diagram.**) The issuing operator requests a new certificate. **The issuing operator puts the end entity in the `REQUESTED` state.**
3. (**See step 2 in the diagram.**) The issuing operator cannot assume that issuance will happen quickly (i.e. that `pollForRequestedCertificate()` returns a non-empty `Optional` immediately), so it needs to be able persist the token obtained from `requestCertificate()` in a `Secret`, _and possibly complete the current reconciliation_ (i.e. issuance can be very asynchronous).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree that we should not stop the reconcile but why not using the same approach we have with the Cruise Control rebalance process where, while waiting for a proposal Ready and it's still in PendingProposal, there is a polling on the CC REST API to check the status, instead of waiting for the next reconcile loop that could come after 2 mins by default, or sooner or later if user changes the default value.

### Interface for certificate issuance

```java
public interface CertificateIssuer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU this is an interface that a component used by CO or UO should implement to issue certificate, right?
Can you make it clear so that in the next sections when you mention the "issuing operator" it's more clear?

```java
public interface CertificateIssuer {

byte[] requestCertificate(Subject subject, int requestedValidity) throws CertificateIssuanceException;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no information about how to generate the private key? key size, algorithm to use, ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this interface, no. The reason being is that different issuers (cert manager, vault) might not expose that.
This could be handled in the API (see later) where we select which implementation is going to be used. For example with the internal type we could have

spec:
  clusterCa:
    validityDays: <integer>
    generateCertificateAuthority: <boolean>
    generateSecretOwnerReference: <boolean>
    renewalDays: <integer>
    certificateExpirationPolicy: <renew-certificate|replace-key>
    keySize: 4096
    keyType: RSA # (or EC)

The internal type could also support an implementation (e.g. openssl vs bouncycastle), but it's not part of this proposal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the validity really differ from key size, type etc.? Different providers might have limited availabilities of that as well I guess.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, maybe this should be configured away too. I can't see a reason why we need it in the interface for the cluster CA (it's not like we want ZK certs to have different validity to broker certs for example). For the clients CA I suppose it is possible that some KafkaUser could support an explicit validity period (possibly even end dates) which would allow time-limited accounts. Generally speaking there are better ways of achieving that with Kafka than doing it at the certificate level.

Because of the current assumptions, the CO generates cluster CA certificates (if they're missing or expired) within a reconciliation, before it knows whether any EE certificate issuance is needed in that reconciliation. Then if it needs to issue EE certificates it just uses the CA certificates that must exist by that point.

The steps required by the `CertificateIssuer` contract are more complicated, and will be described over the following sections.
The term "issuing operator" will be used to refer to the operator that's issuing the certificate. This could be the User or Cluster Operator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general point ... what is a really confusing is that we say something like "CO issues certificates" or "issuing operator". While it's true with the current openssl implementation, it's not so true with the overall proposal offloading the certificate generation outside of the CO. What CO is doing is kind of "adopting" the certificate: it asks for a certificate and an external component (or human) returns to it what it asked for.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the language is tricky. I'm often using "issuing" to mean "using the CertificateIssuer interface to issue". I'll adjust the text here to make this clearer, but I'm not sure there's a better term that "issuing operator" which we could use throughout the text.

IN_USE // the certificate is in use
|
v
NOT_NEEDED // a certificate for the end entity is not needed any more
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does "is not needed any more" exactly mean? For example when a Kafka cluster is scaled down and you need to get rid of the EE certificate for the broker that doesn't run anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really good question.

For the CO it's relatively easy, because at the end of the day everything is in the same process and the set of required certificates is deterministic from the Kafka CR. So the CO can basically do a set subtraction to figure of the certs which exist but are not needed, transition then to NO_NEEDED and (the important thing) if those were the last which depended on some root CA certificate, then that root cert can be removed from the trust stores too (by transition its state to ZOMBIE).

For the UO it's a bit harder. Determining which root CA certs are no longer needed requires a full reconciliation of all the KafkaUsers. That doesn't really scale well. But if it can do that then it can also figure out which root certs might not be needed and transition them to ZOMBIE

In either case the root certs being in ZOMBIE would result in removal from the trust stores.

TBH the process around getting rid of EE certs and particularly CA certs when they're no longer required is the area which I didn't really cover enough in my PoC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to figure out the interaction with the certs management system (certs-manager, vault, ..) and what does it mean in this case. Who decides that the certificate is not needed anymore? Is it something that the user would do directly on the certs management system and maybe he is allowed to delete the cert, or the interaction is always through some CRD in the CO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I was thinking about this at the time was things like scale-down of #brokers.

Certificate revocation is listed as a non-goal of this proposal (I was trying to limit the scope), but I wonder whether we should at least consider how that would work in terms of the state machine, even if we didn't support it initially. I suspect we'd probably want distinct states (even if the handling turned out to be the same) to cover revocation. Taken to the extreme we could have an end state for every CRLReason.

1. The issuing operator needs to figure out which EE certificates are missing (or in their renewal period, or otherwise need replacing, e.g. via annotation). This need for a certificate might exist independently of the end entity. For example, if we're going to scale up the Kafka cluster then we need a certificate for the new broker before the new broker exists. **The issuing operator puts the end entity in the `NEW` state.**
2. (**See step 1 in the diagram.**) The issuing operator requests a new certificate. **The issuing operator puts the end entity in the `REQUESTED` state.**
3. (**See step 2 in the diagram.**) The issuing operator cannot assume that issuance will happen quickly (i.e. that `pollForRequestedCertificate()` returns a non-empty `Optional` immediately), so it needs to be able persist the token obtained from `requestCertificate()` in a `Secret`, _and possibly complete the current reconciliation_ (i.e. issuance can be very asynchronous).
(An example of how this could happen is where a `CertificateIssuer` implementation uses cert-manager; `requestCertificate()` creates a cert-manager `CertificateRequest` CR but while the cert manager operator is not running no certificate will be issued).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it just Certificate CR on the cert-manager side? [1]
From the cert-manager API doc it seems that CertificateRequest is created by the cert-manager but the user interaction happens via Certificate, or?
When I used cert-manager in the past I interacted with it by using Certificate CR.

[1] https://cert-manager.io/docs/usage/certificate/

[1] https://cert-manager.io/docs/usage/certificate/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use Certificate, in which case cert manager will keep a Secret up to date for you. I don't think that really helps us though, because we still need to orchestrate the roll out of the EE certificate (and possible the roll out of a new CA cert too). So we'd still need similar machinery in the CO, it's just we'd be needing to spot changes in the Secret which cert manager updates spontaneously, rather than being in control of renewal ourselves (and thus being able to respect the maintenance times, for example).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use Certificate, in which case cert manager will keep a Secret up to date for you.

Mhhh ... but by using an external certificates manager, so cert-manager for example, aren't we leveraging it to generate certificates and storing them in Secrets? At this point it's not clear to me how we are interacting with a cert-manager, how we get the generated certificates to put them in our own Secrets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference, as I understand it, is that with Certificate cert manager will renew it automatically (basically creating a CertificateRequest behind the scenes) and update the Secret. Whereas CertificateRequest is a one time deal, no renewal on expiry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ppatierno did that clarify your question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so aren't we leveraging to the external cert manager to do the renewal process for us automatically? we want to control the cert manager to do the renewal when we want to, is that right? And it means interacting with cert-manager, for example, at lower level (CertificateRequest vs Certificate)?

035-ca-abstraction.md Outdated Show resolved Hide resolved
name: <string> # used for CertificateRequest.spec.duration
kind: <string>
group: <string>
generateSecretOwnerReference: <boolean>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this in the current API. It controls whether the Secrets containing the certs and keys get an owner reference generated or not, therefore whether they get GCd with the Kafka CR or not. I think a similar thing would make sense for other issuers, though I haven't really thought through precisely which Secrets would get the owner ref set when this was true. Ill mention this in the text.

Signed-off-by: Tom Bentley <tbentley@redhat.com>
@tombentley
Copy link
Member Author

I've addresses many of the comments, and added a section the goes into the Secrets a bit, if you want to take another look?


2. The issuing operator requests a new certificate. **The issuing operator puts the end entity in the `REQUESTED` state.**

3. The issuing operator cannot assume that issuance will happen quickly (i.e. that `pollForRequestedCertificate()` returns a non-empty `Optional` immediately). In fact we explicitly decouple the operators reconciliation loop from the waiting for certificate issuance. If, after some number of retries within the current reconciliation, the certificate has not been issued, the token obtained from `requestCertificate()` is persisted in a `Secret` and the polling is completed in a subsequent reconciliation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact we explicitly decouple the operators reconciliation loop from the waiting for certificate issuance.

Doesn't this sentence means a kind of polling (while waiting for certificate issuance) as I mentioned in another comment about CC REST API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall the details of theCC interaction. What I'd trying to describe in this proposal is:

  1. Operator calls issueCertificate
  2. Operator calls pollForRequestedCertificate, it's not ready.
  3. Operator completes the rest of the reconciliation, making do without the requested certs (e.g. for renewal continuing to use the existing ones, for creation then just completing the reconciliation without having a functional cluster)
  4. Operator starts a new reconciliation (by default 2 minutes later).
  5. Operator calls pollForRequestedCertificate, it's not ready.
  6. Operator completes the rest of the reconciliation making do without the requested certs
  7. Operator starts a new reconciliation (by default 2 minutes later).
  8. Operator calls pollForRequestedCertificate, this time it is ready
  9. Operator does what's necessary to use the newly issued certs, maybe orchestrating trust, and then making use of the issued certificates.

The point is that steps 4–6 could repeat many times, and in the mean time the user can update the CR and those changes get reflected, rather than cert renewal blocking changes to the CR for an arbitrarily long time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ppatierno did that clarify your question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe :-) ... the text says after some number of retries within the current reconciliation it seems that anyway the step 2, 5 and 8 so calling pollForRequestedCertificate are executed more times (retries) in the same reconcile. So we have retries in the same reconcile and if not ready, retries across reconciles.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what I was proposing (but it doesn't have to be that way). The reason: Often issuance will be quick, so it's worth trying once or twice within the reconcile. But we can't be sure it will always be quick, so we need to support the across-reconcile case too. If we had to only do one I would suggest keeping across-reconcile and dropping the within reconcile. I don't think there's a huge difference code-wise, but there might be test-wise.

Before describing the process let's introduce two state machines. First the "EE certificate state machine"

```
NEW // a certificate is needed for the end entity, but doesn't exist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would REQUIRED or NEEDED be less confusing than NEW? NEW might suggest it already exists.

UNTRUSTED // not yet trusted everywhere it needs to be
|
v
TRUSTED_NEW // trusted everywhere it needs to be, but no EE certificates yet issued
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe TRUSTED_UNUSED would be better?

REQUESTED // a certificate has been requested; requestCertificate() has been called
|
v
PENDING // a certificate has been issued; pollForRequestedCertificate() has returned non-empty,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe TRUST_PENDING would be more fitting?

The cluster operator and the user operator run in separate processes, so this will be done via a shared `Secret`, for example:

2. The issuing operator has placed a root CA certificate in the shared `Secret` in the `UNTRUSTED` state.
3. The cluster operator sees the root CA certificate orchestrates changes to the brokers (reconfig or restart) (and, for the Cluster CA the other components) so that the certificate is trusted by all components that will observe the EE certificates.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How intelligent you expect this to be? Following the logic described here which centers around the EE certs and not CAs ... if UO EE cert is issued and is signed by CA-X, the CA-X is needed really only in one place => in the truststore of the 9091 listener in the Kafka broker. The other compoenents such as ZooKeeper, TO, KE or CC will not really care about this CA if their EE certs use CA-Y. Do you expect it to have a detailed handing like this? Or will it simply roll the CA to all operands distinguishing only between Cluster CAs and Client CAs?

(Note: I can see why for simplicity we might want to just roll them everywhere => that is fine. I really raise this mainly for clarity)

035-ca-abstraction.md Outdated Show resolved Hide resolved
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this technically works. My main concern is how it sees the CA just as a property of the EE certificate rather than a first level citizen. It basically assumes relation ship is:

--------------
|    Cert    |       -----------
| Management | ====> | EE Cert |  
|    Tool    |       -----------
--------------            |  
                          |
                          V
                       -------  
                       | CAs |
                       -------

But it seems to me that the actual relation ship is more like this:

--------------
|    Cert    |       ------      -----------
| Management | ====> | CA | ===> | EE Cert |
|    Tool    |       ------      -----------
--------------  

(Where CA is not necessarily a single private and public key but a Virtual CA which might be renewed over the time etc.)


I think technically, this proposal would work fine while treating the CAs just as a property of the EE certificate. But I think there are practical consequences:

  • The fact that the role of the CA is different and it is not just property of the EE certificate leaks into the API design in the CA configuration examples for cert-manager where the Issuer largely determines the CAs used by the EE certificates => the Issuer is the CA in my drawings above.
  • The trust is on the CA level. So you do not trust the EE certificate. You trust the CA and by that you trust all the certificates issued by it. Having the trust to the CA kind of hidden might make that not clear enough to unexperienced users. This mechanism would be applied to the Cluster CA, but also to the Clients CA where you might expect that there is no network isolation and the client listeners might be exposed to much wider network areas.
  • How do you secure the certificate issuing chain? I think it needs a mechanism to verify that the certificate was really issued by a trusted counter party. Otherwise, how is the system protected against a bad actor who sees that a certificate is being requested and injects its own CA into the Kafka cluster by for example creating a secret before the CertManager would do so? Is this something what we should be worried about? Or is this a risk in any case with any other implementation as well?

I guess the security question is really important. If we do not think this is a bigger security threat than a model following more CA oriented approach, I'm fine with this proposal at the high level => it would at the end in most cases iterrate to the same result, just in a way wich is maybe a bit different than I would expect.

Validity is a property of the whole certificate chain, not just the EE certificate.
For example it is possible that a root or intermediate certificate expires before an EE certificate that depends on it.
So for the operator to reason correctly about validity it needs to perform PKIX cert path validation.
Fortunately Java comes with APIs for doing this in `java.security.cert`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the API do? You made me curious by mentioning it. :-D

Does it give enough insights to understand that the full path will not be valid anymore in 30 days? Or does it just return true/false? That would be possibly too late for taking some action.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very configurable. You can configure it with your own effective date and it will give you detailed results about which cert is invalid and why. See this guide and the API docs.

035-ca-abstraction.md Show resolved Hide resolved
${fingerprint}.${state}: <PEM encoded Secret>
```

The `fingerprint` is the fingerprint/thumbprint of the certificate. I.e. the SHA-1 has of the ASN.1 DER-encoded form of the X509 certificate. The `state` can be any of the states of the CA certificate trust state machine described above.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo I guess?

Suggested change
The `fingerprint` is the fingerprint/thumbprint of the certificate. I.e. the SHA-1 has of the ASN.1 DER-encoded form of the X509 certificate. The `state` can be any of the states of the CA certificate trust state machine described above.
The `fingerprint` is the fingerprint/thumbprint of the certificate. I.e. the SHA-1 hash of the ASN.1 DER-encoded form of the X509 certificate. The `state` can be any of the states of the CA certificate trust state machine described above.

Also, is the DER format safe enough in terms of how it uses whitespaces etc. to make sure the SHA-1 hash of it works well enough? Or does it need to use the fingerprint from the certificate itself (which I at least guess is unique)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHA-1 of the DER is precisely what the fingerprint of the certificate is. My understanding is that the DER encoding is deterministic so the hash should be completely stable.

Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
@scholzj
Copy link
Member

scholzj commented Jan 13, 2022

One quick thought to consider ... how would this affect the .status.listeners[].certificates part of the Kafka CR status.


```java
class IssuedCertificate {
private final private byte[] key;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many private :-)

035-ca-abstraction.md Show resolved Hide resolved
IN_USE // the certificate is in use
|
v
NOT_NEEDED // a certificate for the end entity is not needed any more
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to figure out the interaction with the certs management system (certs-manager, vault, ..) and what does it mean in this case. Who decides that the certificate is not needed anymore? Is it something that the user would do directly on the certs management system and maybe he is allowed to delete the cert, or the interaction is always through some CRD in the CO?


2. The issuing operator requests a new certificate. **The issuing operator puts the end entity in the `REQUESTED` state.**

3. The issuing operator cannot assume that issuance will happen quickly (i.e. that `pollForRequestedCertificate()` returns a non-empty `Optional` immediately). In fact we explicitly decouple the operators reconciliation loop from the waiting for certificate issuance. If, after some number of retries within the current reconciliation, the certificate has not been issued, the token obtained from `requestCertificate()` is persisted in a `Secret` and the polling is completed in a subsequent reconciliation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe :-) ... the text says after some number of retries within the current reconciliation it seems that anyway the step 2, 5 and 8 so calling pollForRequestedCertificate are executed more times (retries) in the same reconcile. So we have retries in the same reconcile and if not ready, retries across reconciles.

1. The issuing operator needs to figure out which EE certificates are missing (or in their renewal period, or otherwise need replacing, e.g. via annotation). This need for a certificate might exist independently of the end entity. For example, if we're going to scale up the Kafka cluster then we need a certificate for the new broker before the new broker exists. **The issuing operator puts the end entity in the `NEW` state.**
2. (**See step 1 in the diagram.**) The issuing operator requests a new certificate. **The issuing operator puts the end entity in the `REQUESTED` state.**
3. (**See step 2 in the diagram.**) The issuing operator cannot assume that issuance will happen quickly (i.e. that `pollForRequestedCertificate()` returns a non-empty `Optional` immediately), so it needs to be able persist the token obtained from `requestCertificate()` in a `Secret`, _and possibly complete the current reconciliation_ (i.e. issuance can be very asynchronous).
(An example of how this could happen is where a `CertificateIssuer` implementation uses cert-manager; `requestCertificate()` creates a cert-manager `CertificateRequest` CR but while the cert manager operator is not running no certificate will be issued).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so aren't we leveraging to the external cert manager to do the renewal process for us automatically? we want to control the cert manager to do the renewal when we want to, is that right? And it means interacting with cert-manager, for example, at lower level (CertificateRequest vs Certificate)?

${fingerprint}.${state}: <PEM encoded Secret>
```

The `fingerprint` is the fingerprint/thumbprint of the certificate. I.e. the SHA-1 hash of the ASN.1 DER-encoded form of the X509 certificate. The `state` can be any of the states of the CA certificate trust state machine described above.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two concerns ... first are we sure all characters valid in a certificate fingerprint are valid for a key in the data section of a Secret? the second one is about how I can get use of this information just watching at the data ... I mean how can I recognize what's the CA cert just seeing a fingerprint and not a Subject or more human readable info like this? Or do you think that because it's all automated we don't really need human readibility of this?

@mattia-crypto
Copy link

Is there an estimated timeframe for this to be implemented?

@cconnert
Copy link

cconnert commented Jun 9, 2022

I stumbled across this proposal while trying to use cert-manager for client/cluster CA with strimzi. In my, layman, option there is no need to deeply integrated with third party certificate management tools (please correct me if I am wrong).

To my understanding strimzi puts strict requirements on the secrets and their structure (https://strimzi.io/docs/operators/latest/configuring.html#installing-your-own-ca-certificates-str) and does not provide a way to integrated a different structure like e.g. provided by cert-manager. Briefly it requires:

  • that key and certificates are provided by different secrets.
  • that the follow a naming convention (ca.key, ca.crt, ca.p12, ca.password)
  • that the certificate is provided in PEM and PKCS 12 format and the password of the PKCS 12 is provided as well.

So what we need is a way to map different structure to the required structure. This functionality is already implemented for the brokerCertChainAndKey configuration. There we are able to remap a secret into the internal structure and strimzi takes care of managing the PKCS 12 store for us.

Thus, naively, I would argue all it takes to address this issue is to reuse the functionality of brokerCertChainAndKey for clusterCa and clientsCa by e.g.: introducing a new configuration customCaSecret. When the new configuration is present it entails that generateCertificateAuthority: false. That way backwards compatibility can be ensured.

@scholzj
Copy link
Member

scholzj commented Jun 9, 2022

@cconnert I do not think that on its own is sufficient for multiple reasons:

  • It is not just about the structure of the secrets but also about the information they provide. For example when migrating to a new CA etc.
  • For many people, Kubernetes Secrets are not the thing they want to use because they are not really that secret.

@iMajna
Copy link

iMajna commented Jun 9, 2022

For many people, Kubernetes Secrets are not the thing they want to use because they are not really that secret.

I second that. There was vulnerability in Grafana where you could get all secrets from k8s in plaintext. If you have external CA private key there as a k8s secret which Strimzi uses for generating certs you most certainly want to put Chinese wall in front of it - aka place it somewhere where it can be secure.

@tombentley
Copy link
Member Author

@cconnert thanks for the input! I have also wondered whether we really need to go to these lengths, because this design is not simple.

I think Jakub's point is a good one:

It is not just about the structure of the secrets but also about the information they provide. For example when migrating to a new CA etc.

A new customCaSecret configuration could, I'm sure, be made to work in the basic case of standing up a cluster having configured customCaSecret. But I think we should be aiming to also solve the problem of how those certificates get renewed. Given the trust relationships between the components that the CO manages doing that without requiring downtime is not trivial. You could come up with some way to do it manually, with a documented procedure to follow, but it's the sort of thing people regularly get wrong. This is a cost borne by every (production) user. So I think there is significant value in having the operator do it for you in a way that tested and really hard to get wrong. It also enables other things that people sometime like, such as more frequent cert rotations, which are disincentivised by leaving the process to users.

@cconnert
Copy link

@scholzj, @iMajna,@tombentley thanks for getting back to me. I do see your points. Just wanted to raise the question if the solutions could be simplified. Also I wanted to understand what you aim for by this proposal so I can figure out if my use-case can be solved.

CA migration / renewal is definitely a difficult problem which likely requires a complex solution. cert-manager has an open issue on that cert-manager/cert-manager#2478 and I am looking forward to see it implemented. A zero-downtime constraint makes this task even harder. If "client" side communication is also secured by TLS I would argue that is something an operator cannot solve solely, but I guess that is out of scope of this proposal.

@scholzj
Copy link
Member

scholzj commented Jan 25, 2024

Triaged on the Strimzi Community call on 25.1.2024: @katheris is getting back to work on this topic. However, the expectation is that there will be a new PR (although based on the content from this PR). This should be closed for the time being.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants