[RFC] Forced Rotation and Revocation #1934

Open
evan2645 opened this issue Oct 24, 2020 · 7 comments
Labels
priority/backlog Issue is approved and in the backlog

Comments

@evan2645
Member

evan2645 commented Oct 24, 2020

Introduction

Forced rotation and revocation has been a roadmap item for a while, and the time has come to scope it and put forth a proposal. The goal of this roadmap item is to provide a rapid, reliable, and automated mechanism for recovering from key compromise. This represents a very significant improvement over the current situation, which involves manual surgery on the datastore as well as deep knowledge of SPIRE internals.

As our first steps into this territory, I'd like to propose the following scope, which I believe is the smallest reasonable scope that can still accomplish our goal:

  • Premature authority rotation (i.e. rotation of signing keys on demand)
  • Authority revocation (i.e. safe removal of key(s) from the bundle)

The following items would be out-of-scope:

  • Leaf SVID revocation
  • CRLs and JTI deny lists

The major tradeoff considered in the scope of this proposal is operator action over AllowList/DenyList maintenance and distribution. In the event of a compromise, an operator must "rotate away" from the compromised keys, as opposed to distributing a list of distrusted keys to all consumers. This tradeoff is consistent with existing SPIFFE philosophies. It is also more reliable, as it exercises existing pathways that must be working.

A Four Step Operation

To safely rotate away from a compromised key, four distinct operations take place (from a user's perspective). First, we must prepare a new signing key. This creates the key to be rotated to, adding it to the bundle to begin its propagation. Second, we must activate the new signing key. This step shifts active signing operations off of the compromised key and onto the new key. The amount of time it takes to move from step one to step two must be as small as is safely possible.

In the third step, we must declare that the compromised key is about to be revoked. This declaration causes agents to proactively rotate any SVIDs they are managing that are associated with the to-be-revoked key. Finally, in the fourth step, we actively revoke the key. This removes the key from the bundle, an update that propagates to all agents and workloads. The amount of time it takes to move from step three to step four must be as small as is safely possible.

There may be an opportunity to combine step two and step three if the implementation and experience can easily allow for it.
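The ordering of the four steps can be pictured as a small per-key state machine. This is only a sketch under assumptions: the state names (`Prepared`, `Active`, `Tainted`, `Revoked`) and the `Advance` function are hypothetical and do not correspond to any existing SPIRE type. The new key moves through steps one and two, while the compromised key moves through steps three and four.

```go
package main

import "fmt"

// KeyState is a hypothetical lifecycle state for a signing key (authority).
type KeyState int

const (
	Prepared KeyState = iota // key generated and published in the bundle (step 1)
	Active                   // key used for signing operations (step 2)
	Tainted                  // revocation announced; agents rotate away (step 3)
	Revoked                  // key removed from the bundle (step 4)
)

// next lists the only forward transition this sketch allows out of each state.
var next = map[KeyState]KeyState{
	Prepared: Active,
	Active:   Tainted,
	Tainted:  Revoked,
}

// Advance moves a key to the target state, rejecting skipped or reversed steps.
func Advance(current, target KeyState) (KeyState, error) {
	if allowed, ok := next[current]; !ok || allowed != target {
		return current, fmt.Errorf("invalid transition %d -> %d", current, target)
	}
	return target, nil
}
```

The sketch deliberately leaves out any rollback transitions; combining steps two and three would amount to advancing both keys in a single operator action.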

Step 1: Prepare a New Signing Key

This step involves generating a new signing key and injecting it into the bundle for distribution. The "next" signing key may already be prepared, but an operator may want to step past that already-prepared key and prepare a new one (if, for example, the prepared key is also compromised). Therefore, the prepare operation should allow an operator to provide the equivalent of a -f (force) flag, which prepares a new key and overwrites the previously prepared key.
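A minimal sketch of the force semantics, assuming a hypothetical `KeySlots` holder with one active and one prepared key. The names are illustrative, and publishing the key into the bundle is omitted.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"errors"
)

// ErrAlreadyPrepared is returned when a prepared key exists and force is false.
var ErrAlreadyPrepared = errors.New("a prepared key already exists; use force to replace it")

// KeySlots is a hypothetical holder for the active and prepared ("next") keys.
type KeySlots struct {
	Active   *ecdsa.PrivateKey
	Prepared *ecdsa.PrivateKey
}

// Prepare generates a new "next" key. With force set, it replaces a previously
// prepared key, for example because that key is also suspected of compromise.
func (s *KeySlots) Prepare(force bool) error {
	if s.Prepared != nil && !force {
		return ErrAlreadyPrepared
	}
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	s.Prepared = key
	return nil
}
```

Refusing to overwrite by default keeps an accidental second call from silently discarding a key that has already started propagating.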

Step 2: Activate a New Signing Key

This step involves shifting signing operations away from the old or compromised key, and towards the newly generated key. To do this safely, we need to have some level of understanding of how many agents have picked up the new key (as it was generated in step 1). It then becomes an operator decision as to what level of propagation is "good enough".

Another factor to consider is how likely it is that other (federated) trust domains have picked up the new key. At the very least, this can be inferred from the amount of time elapsed since step 1, compared to the SPIFFE bundle refresh hint that is currently set. Considering this period, as well as the relative number of agents that have picked up the new key, warnings should be reported/logged if necessary (e.g. "Are you sure you want to do this?").
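The two signals described above (agent propagation and time elapsed versus the refresh hint) could feed a pre-activation check along these lines. This is a sketch only: `ActivationWarnings` and all of its parameters are hypothetical, and the threshold is an operator choice rather than anything the proposal fixes.

```go
package main

import (
	"fmt"
	"time"
)

// ActivationWarnings returns warnings an operator might see before activating
// a prepared key. All inputs are hypothetical:
//
//	agentSeqs    latest bundle sequence number reported by each agent
//	targetSeq    sequence number of the first bundle containing the new key
//	minFraction  operator-chosen fraction of agents considered "good enough"
//	sincePrepare time elapsed since step 1
//	refreshHint  bundle refresh hint published to federated trust domains
func ActivationWarnings(agentSeqs []uint64, targetSeq uint64, minFraction float64,
	sincePrepare, refreshHint time.Duration) []string {
	var warnings []string

	upToDate := 0
	for _, seq := range agentSeqs {
		if seq >= targetSeq {
			upToDate++
		}
	}
	if len(agentSeqs) > 0 {
		fraction := float64(upToDate) / float64(len(agentSeqs))
		if fraction < minFraction {
			warnings = append(warnings, fmt.Sprintf(
				"only %d of %d agents have loaded the new key", upToDate, len(agentSeqs)))
		}
	}

	// Federated consumers cannot acknowledge; elapsed time is the only signal.
	if sincePrepare < refreshHint {
		warnings = append(warnings, fmt.Sprintf(
			"%s elapsed since prepare, less than the %s refresh hint", sincePrepare, refreshHint))
	}
	return warnings
}
```

An empty result does not prove that activation is safe; it only means that neither heuristic raised a concern.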

Step 3: Signal an Impending Key Revocation

This step involves distributing knowledge of an impending key revocation. This is important because we need agents and workloads to actively (and rapidly) move off of key validation paths that include the compromised key. To do this, agents will need to be able to recognize this condition, audit their caches to see which SVIDs need renewal, and proactively request new SVIDs. The assumption is that by this point, signing operations on the key-to-be-revoked are no longer active (thus avoiding the chance that a renewed SVID is signed by the old key).

This operation may cause a flood of renewals. The load will need to be managed appropriately.
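As an illustration of the agent-side audit and one simple way to pace the resulting renewals, consider the sketch below. `CachedSVID`, `NeedsRenewal`, and `Batches` are hypothetical names, and batching is just one assumed load-management strategy among several (rate limiting on the server would be another).

```go
package main

// CachedSVID is a hypothetical agent cache entry. AuthorityID identifies the
// signing key at the top of the SVID's validation path, e.g. a CA SKID.
type CachedSVID struct {
	SPIFFEID    string
	AuthorityID string
}

// NeedsRenewal returns cache entries that chain to the authority being revoked.
func NeedsRenewal(cache []CachedSVID, taintedAuthority string) []CachedSVID {
	var stale []CachedSVID
	for _, svid := range cache {
		if svid.AuthorityID == taintedAuthority {
			stale = append(stale, svid)
		}
	}
	return stale
}

// Batches splits renewals into groups of at most size entries, so that a
// caller can pace requests to the server instead of sending them all at once.
func Batches(stale []CachedSVID, size int) [][]CachedSVID {
	if size < 1 {
		size = 1
	}
	var out [][]CachedSVID
	for len(stale) > size {
		out = append(out, stale[:size])
		stale = stale[size:]
	}
	if len(stale) > 0 {
		out = append(out, stale)
	}
	return out
}
```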

Step 4: Revoke the Compromised Key

This step involves completely removing the old key from the bundle. This bundle update will propagate downwards, and be pushed into workloads attached to the workload API. At this time, we want to be as certain as is reasonably possible that agents have completed their rotation away from the affected validation path. Ideally, there is a feedback mechanism to understand which agents have completed this task and which have not.

Some operators may prefer availability over the continued use of a potentially-compromised key. To accommodate this case, undoing the action should be straightforward and trivial.
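One way to keep the undo trivial is to retain the removed authority alongside the bundle so that it can be put back. The sketch below assumes a hypothetical in-memory `Bundle` keyed by an authority identifier (such as a SKID); it does not reflect how SPIRE actually stores bundles.

```go
package main

import "errors"

// Bundle is a hypothetical local bundle, mapping an authority ID (e.g. a SKID)
// to its public material.
type Bundle struct {
	authorities map[string][]byte
	revoked     map[string][]byte // removed authorities, kept so revocation can be undone
}

// NewBundle returns an empty bundle.
func NewBundle() *Bundle {
	return &Bundle{authorities: map[string][]byte{}, revoked: map[string][]byte{}}
}

// Add publishes an authority in the bundle.
func (b *Bundle) Add(id string, material []byte) {
	b.authorities[id] = material
}

// Has reports whether the authority is currently part of the bundle.
func (b *Bundle) Has(id string) bool {
	_, ok := b.authorities[id]
	return ok
}

// Revoke removes an authority from the bundle but remembers it for Restore.
func (b *Bundle) Revoke(id string) error {
	material, ok := b.authorities[id]
	if !ok {
		return errors.New("authority not found in bundle")
	}
	delete(b.authorities, id)
	b.revoked[id] = material
	return nil
}

// Restore puts a previously revoked authority back into the bundle.
func (b *Bundle) Restore(id string) error {
	material, ok := b.revoked[id]
	if !ok {
		return errors.New("authority was not revoked")
	}
	delete(b.revoked, id)
	b.authorities[id] = material
	return nil
}
```

Restoring re-trusts a potentially compromised key, so in this sketch it is an explicit operator call rather than something that happens automatically.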

New API Things

It is necessary to introduce some new APIs in order to achieve the process described above. In this section, I propose some APIs for this purpose.

It should be noted that the APIs proposed herein are geared towards generality, as opposed to the specific task at hand. I imagine that these APIs will grow over time, and will likely result in convenient avenues on which further (unrelated) features may be built.

Agent Status

In order to accomplish step two, and likely step four, it is necessary to have some insight into agent state. This insight is best achieved at the server level, for multiple reasons. One reason is that there may be different teams or organizations that are managing agents vs servers. Another reason is that this key manipulation process takes place at the server level - having to interrogate every agent directly will only slow things down. In the end, an operator performing these steps may not be in a position to collect information from every agent.

To solve this, I propose that we introduce a new RPC on the existing Agent API. This new RPC will allow agents to periodically post their status to the server, which in turn stores the latest reported state. For the purpose of this RFC, all we really need is the sequence number of the bundle that the agent has loaded, but it is easy to imagine further use of this update (e.g. agent version number).
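A rough sketch of what the server side of such a status report might look like, assuming a hypothetical `AgentStatus` payload and an in-memory `StatusStore`. The field and method names are illustrative and are not part of the existing Agent API.

```go
package main

import "sync"

// AgentStatus is a hypothetical payload for the proposed status RPC.
type AgentStatus struct {
	AgentID        string
	BundleSequence uint64 // sequence number of the bundle the agent has loaded
	Version        string // example of a possible further field
}

// StatusStore keeps only the latest status reported by each agent.
type StatusStore struct {
	mu     sync.Mutex
	latest map[string]AgentStatus
}

// NewStatusStore returns an empty store.
func NewStatusStore() *StatusStore {
	return &StatusStore{latest: make(map[string]AgentStatus)}
}

// Report records a status update, replacing any earlier report from the agent.
func (s *StatusStore) Report(st AgentStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.latest[st.AgentID] = st
}

// CountAtLeast returns how many agents report a bundle sequence number of at
// least seq, along with the total number of agents that have reported.
func (s *StatusStore) CountAtLeast(seq uint64) (upToDate, total int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, st := range s.latest {
		if st.BundleSequence >= seq {
			upToDate++
		}
	}
	return upToDate, len(s.latest)
}
```

Because agents push their status to the server, nothing here requires the agent to listen on the network.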

The introduction of this API will not only help us accomplish the task at hand, but will also light the path for a much better agent management experience.

Key Management API

SPIRE currently has an API for bundle management, which covers CRUD-like operations on bundle resources stored by SPIRE. It is important to note that this management extends beyond just the local trust domain's bundle - it also applies to the management of federated bundles.

What is missing in this API is the ability to manipulate keys. Bundles are strictly "public"... and while one of those bundles (the local bundle) has ties to locally-managed keys, there is no way to manage the keys or the bundle that represents them, in lockstep.

Rather than overload the bundle API with key management operations, I propose the introduction of a new API. This API would be responsible for handling both step one and step two actions described in this proposal - the early preparation and activation of locally managed keys. It can also expose an interface for listing keys etc, which is not available today.
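To make the shape of such an API a little more concrete, here is a sketch of an interface with a toy in-memory implementation. The `KeyManagement` interface, its method names, and the state strings are all assumptions made for illustration; in particular, key IDs are supplied by the caller here only to keep the example small.

```go
package main

import "errors"

// KeyInfo describes one locally managed key. State is one of "prepared",
// "active", or "old" in this sketch.
type KeyInfo struct {
	ID    string
	State string
}

// KeyManagement is a hypothetical shape for the proposed API.
type KeyManagement interface {
	PrepareKey(id string) error  // step 1
	ActivateKey(id string) error // step 2
	ListKeys() []KeyInfo
}

// memoryKeys is a toy in-memory implementation used to exercise the interface.
type memoryKeys struct {
	prepared, active, old string
}

func (m *memoryKeys) PrepareKey(id string) error {
	if id == "" {
		return errors.New("key ID must not be empty")
	}
	m.prepared = id
	return nil
}

func (m *memoryKeys) ActivateKey(id string) error {
	if id == "" || id != m.prepared {
		return errors.New("only the prepared key can be activated")
	}
	// The previously active key stays listed as "old" until it is revoked.
	m.old, m.active, m.prepared = m.active, id, ""
	return nil
}

func (m *memoryKeys) ListKeys() []KeyInfo {
	var keys []KeyInfo
	for _, k := range []KeyInfo{{m.active, "active"}, {m.prepared, "prepared"}, {m.old, "old"}} {
		if k.ID != "" {
			keys = append(keys, k)
		}
	}
	return keys
}
```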

Bundle Revocation and Signaling

In order to support steps three and four, we need an interface by which we can both signal for removal and actually remove an authority from the local bundle. I propose that this manifest as one or more RPC additions to the existing Bundle API.

Due to the way that bundle information is stored in SPIRE, this task may be more involved than it seems. The requirement put forth by this proposal is that an operator completing step three of the process must be able to easily complete step four with the information that they already have on hand (e.g. the operator uses the SKID of the CA certificate to be removed in both step three and four).

The Bigger Picture

I feel that the absence of available interfaces for accomplishing this task is indicative of a larger problem. There are aspects of day-to-day SPIRE management that are not covered by the existing APIs, which (to me, at least) illustrates that we have overlooked some areas of functionality when designing them. As a result, I've made my best effort to model this proposal around generalized APIs that will be useful for future endeavors.

All of this is to say that, regardless of whether this proposal is accepted or not, the ultimate solution to the problem at hand should result in generalized interfaces and functionality rather than specialized.

Request for Comments

  • Does the scope seem reasonable?
  • Do the API proposals feel directionally correct?
  • Are there implementation details that may make this approach less desirable than another?
  • What other options might there be to achieve this goal faster and with less work?
  • Any concerns over taking this approach to revocation over something else?

Any/all comments are appreciated!

@bri365
Contributor

bri365 commented Oct 26, 2020

Thanks, Evan. The presented scope feels like the right first step - it accomplishes the goals without undue complication or burden. Also, exercising "existing pathways that must be working" along with the proactive nature of the proposed operations feels preferable. Revocation lists feel more likely to increase polling activities which may lead to scaling and stability concerns.

Keeping key operations separate from bundle operations feels safer than combining them. It may be worth noting that the proposed agent status update preserves the security posture of no agent network listener - an important element in our overall security design.

I am a little concerned about the ambiguous nature of the timing of completeness in steps 2 and 4. I believe strong instrumentation and reporting are critical to successful implementation of this solution, along with perhaps some configurable threshold defaults for the percentage of agent completion (through status updates). This may also help with future automation efforts as well for large scale rotation.

What positive acknowledgment mechanisms do/can we have for federated trust domains?

@MarcosDY
Collaborator

MarcosDY commented Oct 26, 2020

Is the Key Management API going to have the same access (TCP + UDS) as the regular server API?

This whole process can result in a lot of work for the server, since we'll force rotation of all agent and workload SVIDs. Maybe we can add a rate limit, or only rotate x SVIDs every n seconds. With the current approach we'll have more control over when we force rotation, but once it is forced, all SVIDs will rotate again at the same time (if they all use the same TTL).

@evan2645
Member Author

> I am a little concerned about the ambiguous nature of the timing of completeness in steps 2 and 4. I believe strong instrumentation and reporting are critical to successful implementation of this solution, along with perhaps some configurable threshold defaults for the percentage of agent completion (through status updates)

Yes, I agree - the agent status API described in this proposal is the main avenue by which we can gain visibility into the safety of steps 2 and 4. I should note that steps 1, 2, and 4 are all operations that occur in SPIRE today - as such, we're very familiar with the behavior... the difference is that today, safety is provided by delaying these operations for long enough that we can be reasonably certain that everything has picked up the changes. In this proposal, we want to (safely) speed that up. Having an operator present makes it easier.

> What positive acknowledgment mechanisms do/can we have for federated trust domains?

Unfortunately, none. SPIFFE does not require bundle endpoint clients to authenticate, nor does it require servers to have knowledge of their consumers. One thing we can do is warn based on the SPIFFE bundle refresh hint... though this does assume that bundle endpoint clients are respecting the hint.

This does bring up an additional point - we may want the ability to roll back step 2.

> Is the Key Management API going to have the same access (TCP + UDS) as the regular server API?

I think so?

> This whole process can result in a lot of work for the server, since we'll force rotation of all agent and workload SVIDs. Maybe we can add a rate limit, or only rotate x SVIDs every n seconds. With the current approach we'll have more control over when we force rotation, but once it is forced, all SVIDs will rotate again at the same time (if they all use the same TTL).

Yes this is a good point. One thing we could do is add some jitter to SVID TTLs when we sign them. I think this has been brought up before as a generally needed feature.
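A minimal sketch of TTL jitter at signing time, assuming a hypothetical `JitteredTTL` helper. It only ever shortens the TTL, so an SVID never outlives the configured value; the fraction is an assumed tuning parameter.

```go
package main

import (
	"math/rand"
	"time"
)

// JitteredTTL shortens ttl by a random amount of up to maxFraction of ttl
// (e.g. 0.1 for 10%). It never extends the TTL. A maxFraction outside the
// open interval (0, 1) is treated as "no jitter".
func JitteredTTL(ttl time.Duration, maxFraction float64) time.Duration {
	if ttl <= 0 || maxFraction <= 0 || maxFraction >= 1 {
		return ttl
	}
	maxJitter := time.Duration(float64(ttl) * maxFraction)
	if maxJitter <= 0 {
		return ttl
	}
	// Int63n returns a value in [0, n), so add 1 to include maxJitter itself.
	return ttl - time.Duration(rand.Int63n(int64(maxJitter)+1))
}
```

With a one hour TTL and a fraction of 0.1, SVIDs forced to rotate together would expire spread across a six minute window, so subsequent renewals arrive staggered rather than all at once.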

@MarcosDY
Collaborator

My comment about the key manager API is because that one is like a red button, and maybe some people will complain about who can press it, and about whether it is OK for it to be open over HTTP or whether it should only be allowed for a local service.

Another question about that key manager API: we'll have key managers (we allow plugins to be created for that). Based on a quick look, those key managers create new keys on demand, so it is not possible for them to update a key by themselves and notify us. But what happens if that communication fails (no key could be generated on demand)? In that case we'll just log the failure and notify the admin to try again, displaying the error from the key manager, right?

@evan2645
Member Author

> My comment about the key manager API is because that one is like a red button, and maybe some people will complain about who can press it, and about whether it is OK for it to be open over HTTP or whether it should only be allowed for a local service.

Hmm yea. If we have to, I suppose we could limit calls to local clients... but I'd really hate to start splitting things up like that. In the end, I think the real problem is lack of granular permissions for admin SVIDs (#716). I was told yesterday that we should see a renewed proposal coming through for that soon.

> Another question about that key manager API: we'll have key managers (we allow plugins to be created for that). Based on a quick look, those key managers create new keys on demand, so it is not possible for them to update a key by themselves and notify us. But what happens if that communication fails (no key could be generated on demand)? In that case we'll just log the failure and notify the admin to try again, displaying the error from the key manager, right?

Yes, I think so. The operations in this procedure all benefit from tight feedback loops, so I'd expect the call to the rotate API to be blocking, and bubble up any errors that we get from the keymanager during the process.
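A sketch of what "blocking and bubbling up errors" could look like, assuming a hypothetical `KeyGenerator` interface standing in for a key manager plugin. None of these names come from SPIRE's plugin interfaces.

```go
package main

import (
	"errors"
	"fmt"
)

// KeyGenerator is a hypothetical stand-in for a key manager plugin.
type KeyGenerator interface {
	GenerateKey(id string) error
}

// ErrKeyManagerUnavailable is an example error a key manager might return.
var ErrKeyManagerUnavailable = errors.New("key manager unavailable")

// Rotate blocks until the key manager has generated the key. On failure it
// returns the key manager's error, wrapped with context, so that the operator
// sees the cause directly and can retry.
func Rotate(km KeyGenerator, id string) error {
	if err := km.GenerateKey(id); err != nil {
		return fmt.Errorf("preparing key %q: %w", id, err)
	}
	return nil
}

// failingKeyManager always fails; it exists to exercise the error path.
type failingKeyManager struct{}

func (failingKeyManager) GenerateKey(string) error {
	return ErrKeyManagerUnavailable
}
```

Because the error is wrapped with `%w`, a caller can still match the underlying cause with `errors.Is(err, ErrKeyManagerUnavailable)` while displaying the full message to the admin.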


This issue is stale because it has been open for 365 days with no activity.

@github-actions github-actions bot added the stale label Apr 17, 2024
@evan2645 evan2645 removed the stale label Apr 18, 2024
@evan2645
Member Author

This work is active and being tracked as part of GH project https://github.com/orgs/spiffe/projects/21

7 participants