Document process for creating a new CT log shard #589

Closed
Tracked by #766
haydentherapper opened this issue May 17, 2022 · 14 comments
Labels
enhancement New feature or request ga-blocker

Comments

@haydentherapper
Contributor

Description

@haydentherapper haydentherapper added the enhancement New feature or request label May 17, 2022
@haydentherapper haydentherapper self-assigned this Jun 1, 2022
@haydentherapper
Contributor Author

To rotate a log, the following needs to occur:

  • Spin up a new Trillian instance (log server and signer) and MySQL database
    • The log should use the same signing key unless there was a compromise. If there was a compromise, the log's verification key must first be distributed to clients via TUF
    • The log's prefix will be changed to the current year. Currently, the prefix is test. For the first shard, it will be 2022.
  • Verify the log's health
  • Update Fulcio's configuration to point to the new log
  • Roll out Fulcio. Fulcio will not dual write to both logs. One instance of Fulcio may write to a different log than another instance as it's rolling out, but this is not an issue.
  • Freeze the old log

This is a simpler process than Rekor's, since we don't maintain a virtual index in front of all shards. Our tooling does not access the logs directly; it simply verifies SCTs on signing and verification. As long as the log signing key does not change, SCTs will continue to be verified without issue for all shards.

@haydentherapper
Contributor Author

We should also add a prober that pings ct/v1/get-sth for each log shard.
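A minimal sketch of what such a prober could look like, assuming each shard is served under its own prefix on a common base URL. The function names and the set of shards are hypothetical; the response field names are the ones RFC 6962 defines for get-sth.

```python
import json
import urllib.request

# Field names that RFC 6962 (section 4.3) defines for a get-sth response.
STH_FIELDS = ("tree_size", "timestamp", "sha256_root_hash", "tree_head_signature")


def sth_url(base_url, shard):
    """Build the get-sth URL for one shard, e.g. <base>/2022/ct/v1/get-sth."""
    return f"{base_url.rstrip('/')}/{shard.strip('/')}/ct/v1/get-sth"


def parse_sth(body):
    """Decode a get-sth response body and check the expected fields exist."""
    sth = json.loads(body)
    if not isinstance(sth, dict):
        raise ValueError("STH response is not a JSON object")
    missing = [name for name in STH_FIELDS if name not in sth]
    if missing:
        raise ValueError(f"STH is missing fields: {missing}")
    return sth


def probe(base_url, shards, timeout=10.0):
    """Fetch the STH of every shard; map shard -> tree size, or None on failure."""
    results = {}
    for shard in shards:
        try:
            with urllib.request.urlopen(sth_url(base_url, shard), timeout=timeout) as resp:
                results[shard] = parse_sth(resp.read())["tree_size"]
        except (OSError, ValueError):
            # Network errors and malformed responses both count as a failed probe.
            results[shard] = None
    return results
```

Something like `probe("https://ctfe.sigstore.dev", ["test", "2022"])` would then report one tree size per shard. A real prober would presumably also verify the STH signature and export metrics, which this sketch leaves out.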

@haydentherapper
Contributor Author

Chatted with @k4leung4 about the process for sharding a CT log. To summarize, we need to add support for creating an arbitrary number of CT log instances, where each will have its own Trillian tree and configmap.

One option is to create separate GCP instances for the database that backs Trillian. I opened up an issue to discuss separating Rekor and the CT log's infrastructure first. I'd be fine with having all of the CT logs' trees in a single DB, but I would prefer it be isolated from Rekor.

Assuming any Terraform changes are done outside of the scope of this work, we will focus on updating the Helm configurations. We will need to:

  • Add support to run an arbitrary number of CT log instances that share the same Trillian backend (Hopefully we can follow an example from the sharding work)
  • Handle ingress routing - Each log will have its own prefix, e.g., ctfe.sigstore.dev/2022, ctfe.sigstore.dev/2023
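To make the prefix idea concrete, here is a rough sketch of the mapping an ingress rule would need to express. It assumes all shards sit behind the in-cluster service name used elsewhere in this thread; the `route` function and the set of known shards are hypothetical.

```python
# In-cluster base URL, taken from the --ct-log-url example in this thread.
INTERNAL_BASE = "http://ctlog.ctlog-system.svc"


def route(path, shards):
    """Map a public request path such as /2022/ct/v1/get-sth to an in-cluster URL.

    The first path segment is the shard prefix; unknown prefixes are rejected.
    """
    prefix, _, rest = path.lstrip("/").partition("/")
    if prefix not in shards:
        raise KeyError(f"unknown log shard prefix: {prefix!r}")
    return f"{INTERNAL_BASE}/{prefix}/{rest}"
```

In practice this would live in the Helm ingress templates rather than in code. The point is only that the shard prefix alone selects the backend.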

For freezing the log, looks like we've already set up infrastructure to do this, which is documented in the sharding playbook and uses the updatetree job.

@vaikas
Contributor

vaikas commented Aug 11, 2022

I'd be happy to help with this effort if help is needed :)

Since this issue is under Fulcio, I'd like to clarify the discussion about having multiple CT Log instances and 'ingress routing'. I'm not clear if we're talking about adding support for a single Fulcio to be able to write to multiple CTLogs based on some criteria (hence the question about ingress routing). Bear with me while I get my understanding of what's left to do :)

Today the CTLog endpoint is a flag like:

--ct-log-url=http://ctlog.ctlog-system.svc/sigstorescaffolding

Question: Are we expecting (as part of this effort or in the future) to be able to write to multiple CTLogs? Just trying to make sure I understand if this requires changes to Fulcio or not.

For the 'ingress routing', is that different from the flag above? As in, any Fulcio instance has 1:1 to a CTLog, or again are there some changes required to Fulcio?

But, from the comment above: #589 (comment)

I think we are saying that "we" as in Sigstore needs to be able to handle operating / writing to multiple CT Log instances. If we have multiple Fulcio instances running at the same time, each of them would still be writing 1:1 to a CT Log instance. Is that correct?

@haydentherapper
Contributor Author

haydentherapper commented Aug 11, 2022

@vaikas That would be very appreciated if you would like to help! My knowledge of Helm is lacking :) Happy to sync with you to chat more about this and review any PRs.

Question: Are we expecting (as part of this effort or in the future) to be able to write to multiple CTLogs? Just trying to make sure I understand if this requires changes to Fulcio or not.

No, this is not in scope. Fulcio only needs to write to one CT log. Maybe we'd consider writing to an external one at a later point, but that should be a simple change, to just make ct-log-url repeated.
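For illustration only, this is the shape a repeated flag takes in Python's argparse. Fulcio is written in Go and parses its flags differently, so this is not Fulcio code, and the `ct_log_urls` attribute name is made up.

```python
import argparse


def parse_args(argv):
    """Parse a --ct-log-url flag that may be given more than once."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--ct-log-url",
        action="append",      # each occurrence is appended to a list
        dest="ct_log_urls",   # hypothetical attribute name
        required=True,
        help="CT log endpoint; repeat the flag to write to several logs",
    )
    return parser.parse_args(argv)
```

With a single occurrence the list has one element, so existing single-log deployments would keep working unchanged.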

I think we are saying that "we" as in Sigstore needs to be able to handle operating / writing to multiple CT Log instances. If we have multiple Fulcio instances running at the same time, each of them would still be writing 1:1 to a CT Log instance. Is that correct?

This is correct. The purpose of this work is to be able to rotate in fresh shards so we don't indefinitely grow a single CT instance (which will have performance degradations over time). One instance of Fulcio writes to a single CT log instance at a point in time.

For the 'ingress routing', is that different from the flag above? As in, any Fulcio instance has 1:1 to a CTLog, or again are there some changes required to Fulcio?

When I talked about routing, I was referring to the public URL for accessing the CT log, ctfe.sigstore.dev/<id>, currently ctfe.sigstore.dev/test. Here's how I view it, lemme know if this sounds good:

  • There is a CT log, publicly accessible on ctfe.sigstore.dev/<id>, accessible within the cluster at http://ctlog.ctlog-system.svc/<id>
  • Fulcio is configured to make requests to http://ctlog.ctlog-system.svc/<id>
  • Each year, we will need to create a new CT log instance, accessible at ctfe.sigstore.dev/<other ID> and http://ctlog.ctlog-system.svc/<other ID> (we'll use the current year for the ID)
  • We will create the CT log instance manually, and it will be unused until we update the Fulcio configuration
  • Once the new CT log instance is up, we will update the Fulcio configuration to point to the new CT log (<other ID>).
  • We will not turn down old logs - This is critical, old logs must still be publicly accessible for monitors. (We'll decide the life of the log later, probably 5 years).
  • Once the Fulcio configuration has rolled out, the old log should be put into a read-only mode.
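The ordering in the list above can be sketched as a small state model. This is a toy, assuming hypothetical `Shard` and `Deployment` types; it only encodes the invariants described (a new shard starts unused, the shard Fulcio writes to is never frozen, and old shards are frozen but never removed).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Shard:
    shard_id: str
    frozen: bool = False  # True once the log is read-only


@dataclass
class Deployment:
    shards: Dict[str, Shard] = field(default_factory=dict)
    fulcio_target: Optional[str] = None  # shard Fulcio currently writes to

    def create_shard(self, shard_id):
        """Bring up a new shard; it stays unused until Fulcio points at it."""
        if shard_id in self.shards:
            raise ValueError(f"shard {shard_id} already exists")
        self.shards[shard_id] = Shard(shard_id)

    def point_fulcio(self, shard_id):
        """Update the Fulcio configuration to write to the given shard."""
        if self.shards[shard_id].frozen:
            raise ValueError(f"shard {shard_id} is frozen")
        self.fulcio_target = shard_id

    def freeze(self, shard_id):
        """Put a shard into read-only mode; it remains publicly readable."""
        if shard_id == self.fulcio_target:
            raise ValueError("refusing to freeze the shard Fulcio writes to")
        self.shards[shard_id].frozen = True

    def rotate(self, new_id):
        """Create the new shard, repoint Fulcio, then freeze the old shard."""
        old_id = self.fulcio_target
        self.create_shard(new_id)
        self.point_fulcio(new_id)
        if old_id is not None:
            self.freeze(old_id)
        return old_id
```

There is deliberately no method for deleting a shard, mirroring the requirement that old logs stay available for monitors.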

One other detail - In the same vein as https://github.com/sigstore/public-good-instance/issues/343, we should ideally use a separate database for each CT log instance so we don't have to indefinitely grow the same database.

@vaikas
Contributor

vaikas commented Aug 11, 2022

That all sounds great, thanks! That's how I roughly understood things, but got confused by some comment in some other bug, so just wanted to double-check :)

The one other thing (that's probably discussed elsewhere) is the "reverse" of this. When Fulcio Cert rotates, the new cert must be added to the trusted certs on the CT Log side. I looked quickly, but didn't see an issue for this, is there one for it somewhere?

read-only mode for the logs == 'freeze' of the trillian, or is there a knob for that in CTLog also?

@vaikas
Contributor

vaikas commented Aug 11, 2022

Re: separate database, if we do that, then we'll basically have 1:1 of
Fulcio - CTLog - Trillian - mysql

So all four get operated as a "single entity"?
If so, kneejerk response is that I think it makes things easier to operate.

@haydentherapper
Contributor Author

The one other thing (that's probably discussed elsewhere) is the "reverse" of this. When Fulcio Cert rotates, the new cert must be added to the trusted certs on the CT Log side. I looked quickly, but didn't see an issue for this, is there one for it somewhere?

That is a good question. Right now, the root is automatically fetched when createctconfig is run. https://github.com/haydentherapper/scaffolding/blob/079be7cd54dd47bb0df9ac1af3193f765986f3bc/cmd/ctlog/createctconfig/main.go#L106
Can it be rerun to fetch the latest root? If not, can you create an issue for this?

read-only mode for the logs == 'freeze' of the trillian, or is there a knob for that in CTLog also?

It should be the same configuration since CT is backed by Trillian.

So all four get operated as a "single entity"?

Yes, that would be the plan. Right now, the same Trillian (and mysql) instance operates Rekor and CT. There's an open conversation right now about whether we will take on the work of separating the two before GA, but I think we can if we set up the CT log sharding to use separate Trillian/mysql instances.

@vaikas
Contributor

vaikas commented Aug 12, 2022

Yeah, I remember that code :) That's part of the reason I was asking. In particular, if we have the 1:1 stack that gets operated as a single entity, then we'll have a case where we might need to rotate a Fulcio key. Would that trigger a new Stack creation (new ctlog, trillian, etc.), or do we merely upgrade the cert for Fulcio and roll it out? If we do that, then we need to add that new cert of Fulcio to CTLog roots PEM. Current code works great but assumes there's only one. I think if we rotate the key and launch new instances, then we have to add the new one so ctlog will accept from both old and new, and then eventually we'll need to remove the old one once the rollout completes, I think.
So, I think the question really is: If we need to rotate fulcio, will that create a new stack or not?
If not, we'll need to do some work in createctconfig as well as add a cleanup step after fulcio rollout completes.

@haydentherapper
Contributor Author

haydentherapper commented Aug 12, 2022

So, I think the question really is: If we need to rotate fulcio, will that create a new stack or not?

I would say no. I separate the two - Fulcio cert would be rotated due to expiration for example, which might happen mid year, or in an emergency due to compromise. The log sharding will happen yearly to keep size down (or in the event of a compromise of the CT log key).

I think we need to do the work you specified in the ticket. Maybe allowing for you to manually specify the root certificates in addition to fetching the certificate from Fulcio? Something like:

  • Create new Fulcio root
  • Manual job to append Fulcio root to trusted CT log roots
  • Change configuration for Fulcio, redeploy with new root
  • Manual job to re-sync Fulcio root to CT log (removing the old root)
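As a rough sketch of the two manual jobs, assuming the trusted roots are kept as a single PEM bundle. The helper names are hypothetical and this is not how createctconfig is implemented; certificates are compared as text, which is enough for illustration but not a real identity check.

```python
import re

# Matches one PEM-encoded certificate block, including its BEGIN/END markers.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_bundle(bundle):
    """Split a PEM bundle into its individual certificate blocks."""
    return [match.group(0).strip() for match in PEM_BLOCK.finditer(bundle)]


def join_bundle(certs):
    """Join certificate blocks back into a single PEM bundle."""
    return "".join(cert + "\n" for cert in certs)


def append_root(bundle, new_root):
    """First job: trust the new Fulcio root in addition to the existing ones."""
    certs = split_bundle(bundle)
    for cert in split_bundle(new_root):
        if cert not in certs:
            certs.append(cert)
    return join_bundle(certs)


def resync_roots(current_roots):
    """Second job: trust exactly what Fulcio serves now, dropping old roots."""
    return join_bundle(split_bundle(current_roots))
```

Between the two jobs the log accepts chains from both the old and the new root, which covers the window in which old and new Fulcio instances run side by side.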

Is that doable? I'm not familiar with scaffolding so there might be a better way.

@haydentherapper
Contributor Author

Something to mention: root rotation will be very infrequent. Fulcio is configured with an intermediate certificate - that might change if we change the signing key for Fulcio, but the intermediate doesn’t have to be distributed to the log. We still need a mechanism in place, but it won’t be used very often.

@haydentherapper
Contributor Author

Haven’t dug into this much to see if it’s useful, but there are some configuration options for limiting when logs will accept entries: https://github.com/google/certificate-transparency-go/blob/master/trillian/docs/Operation.md#temporal-sharding
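As I understand the linked document, a temporally sharded log only accepts certificates whose NotAfter falls inside a window, with an inclusive start and an exclusive limit. A small sketch of that check, assuming calendar-year windows in UTC (the yearly boundaries and function names are my assumption, not something the linked document prescribes):

```python
from datetime import datetime, timezone


def year_window(year):
    """Return the (start, limit) NotAfter window for a calendar-year shard."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    limit = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    return start, limit


def accepts(not_after, start, limit):
    """True if the certificate's NotAfter is in [start, limit)."""
    return start <= not_after < limit
```

Because the bounds are half-open, consecutive yearly windows neither overlap nor leave a gap, so every certificate would land in exactly one shard.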

@vaikas
Contributor

vaikas commented Oct 11, 2022 via email
