cert renewal can't succeed if account key has been recreated #18251
Description
Since #15595, which added the ARI 'replaces' extension field to cert renewal, renewal orders fail if the ACME account key used to create the renewal order does not match the one with which the old cert was issued (see the check in Boulder). Note that this check only runs for orders that carry the 'replaces' extension.
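The following toy Go model shows the account-mismatch check described above and why it only bites when 'replaces' is set. It is not Boulder's code. It assumes the server records which account obtained each cert. `order`, `checkReplaces` and `errWrongAccount` are hypothetical names, certIDs and account IDs are plain strings, and every other check a real CA performs on 'replaces' is ignored.

```go
package main

import (
	"errors"
	"fmt"
)

// order is a hypothetical, stripped-down renewal order.
type order struct {
	accountID string // account that submits the new order
	replaces  string // ARI certID of the cert being replaced; empty if unset
}

// errWrongAccount is a hypothetical error; real CA error text differs.
var errWrongAccount = errors.New("replaced certificate was issued to a different account")

// checkReplaces models only the account-match rule. issuedBy maps a
// certID to the account that obtained that cert.
func checkReplaces(o order, issuedBy map[string]string) error {
	if o.replaces == "" {
		// No 'replaces' field: the check is skipped entirely.
		return nil
	}
	if owner, ok := issuedBy[o.replaces]; ok && owner != o.accountID {
		return errWrongAccount
	}
	return nil
}

func main() {
	issuedBy := map[string]string{"certA": "acct-old"}

	// The account key was recreated, so the renewal comes from a new account.
	renewal := order{accountID: "acct-new", replaces: "certA"}
	fmt.Println(checkReplaces(renewal, issuedBy))

	// The same renewal without 'replaces' passes this check.
	renewal.replaces = ""
	fmt.Println(checkReplaces(renewal, issuedBy))
}
```

Running it prints the mismatch error for the first order and `<nil>` for the second. This matches the situation described below, where a recreated account key can renew only if 'replaces' is omitted.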
Our issuance logic simply creates an ACME account if needed, so it is possible to end up in a situation where the account key got deleted but the cert did not. When renewal time comes, we create a new account key, attempt to renew the cert with a 'replaces' order, and fail. Currently the only fix is to delete the old cert, which means downtime.
Users can hit this with the Kubernetes operator's HA Ingress. There, account keys are stored in a Secret associated with the hosts (ProxyGroup Pods), but the certs for each Ingress are stored in a separate Secret whose lifecycle is tied to the Ingress. If users delete and recreate the ProxyGroup but not the Ingress resources, the account key is recreated but the certs are not. (The reasoning is that we don't want to unnecessarily delete certs, whose issuance is limited, but we also don't want to leave old Secrets lying around.)
However, this flow currently leaves users with broken renewals further down the line, and they cannot recover from this without deleting the certs.
What we could do:
- Figure out whether it is safe to add the 'replaces' extension to the order in the core cert client. I think this can't be done, though: AFAIK we don't have a way to verify whether a cert and an account key we have are a match. We could retry without the field on errors or error-string matches, but that seems too hacky.
- Add a config knob that allows us to opt out of 'replaces' in the operator. This would be the easiest to do.
- Put ProxyGroup account keys in a separate Secret that we never delete, not even when the ProxyGroup has been deleted. That would have to be one Secret per operator, because we cannot assume that users will recreate the ProxyGroup with the same name etc. We could do this as well, but it's probably not good enough UX. For example, if users explicitly deleted the account key Secret while debugging something, renewals could start failing months later, and the only way to recover would be to delete the certs.
- Opt out of the 'replaces' extension altogether. Users could have deleted the account key for whatever reason, and failing renewals possibly months after that is not good UX.
As a side note, it would be nice if the ARI API also offered a method that helps in cases like ours, for example one that lets a client check whether a cert was issued to a given account.