Improving Sharding Downtime #806

Closed
priyawadhwa opened this issue May 3, 2022 · 11 comments
Labels
enhancement New feature or request

Comments

@priyawadhwa
Contributor

priyawadhwa commented May 3, 2022

Right now, sharding the log requires about 10-20 minutes of downtime.

The sharding process currently looks like this:

  1. Create a new trillian tree
  2. Mark the current tree as frozen (at this point, the log is readable but not writable)
  3. Update rekor config with the following:
    • Change the active tree to the tree created in (1)
    • Update the sharding config with the now-inactive tree ID and length
  4. The log is ready to go again!

Downtime occurs between freezing the log & updating the sharding config. We require the length of the inactive shard in the config to properly compute virtual indexes across shards.
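
For reference, the virtual index is just a shard-local index offset by the combined length of every earlier shard. A rough sketch of the computation (the names here are illustrative, not Rekor's actual code):

// virtualIndex maps an index that is local to one shard onto the global
// index space by adding up the lengths of all earlier shards.
// shardLengths must be in the same order as the sharding config.
func virtualIndex(shardLengths []int64, shard int, localIndex int64) int64 {
    var offset int64
    for i := 0; i < shard; i++ {
        offset += shardLengths[i]
    }
    return offset + localIndex
}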

FAQ

Why can't we migrate traffic from one shard to the next and then update the sharding config?

In this case, the inactive shard wouldn't be correctly queryable for some portion of time. That's because Rekor relies on the sharding config to determine which shards to search and how to compute the "virtual index" across shards. In the period between switching over traffic and updating the sharding config, we could append entries to the new log, but we wouldn't be able to access entries in the frozen log: almost all verifications would fail, and get-by-log-index lookups would return incorrect results.

What do we use the sharding config for?

The sharding config keeps track of inactive shards; for each one it records:

  • ID
  • Length
  • Associated public key (needed if we rotate keys for new shards)
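
For illustration, an inactive shard entry in the config looks roughly like the following (the field names are approximate and may not exactly match the current Rekor schema):

# one entry per inactive shard, listed in virtual-index order
- treeID: 1234567890123456
  treeLength: 3244712
  encodedPublicKey: <base64-encoded public key, only needed if the signing key rotated>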

Instead of manually creating the sharding config, why don't we automatically search for shards and compute the config in code?

We'd have to guess at how to order the shards (the ordering determines how virtual log indexes are computed). We could assume shards should be ordered chronologically; if we're comfortable with that assumption, it could reduce downtime.

@priyawadhwa priyawadhwa added the enhancement New feature or request label May 3, 2022
@lkatalin
Contributor

lkatalin commented May 3, 2022

Thanks for this recap, @priyawadhwa ! I would be a fan of trying to automate as much of this as possible instead of doing a manual shard config / restart. The manual portions introduce the most downtime, and I think they also leave room for user error down the line if the project continues to grow and scale. Could the issue with trillian trees not necessarily being chronologically ordered be fixed by Rekor code ordering them and keeping track?

@lkatalin
Contributor

lkatalin commented May 3, 2022

Other ideas floated in the meeting:

  • Instead of using the size of the logs to calculate the virt index, have a pointer to the next log
  • Create a mapping of log index <--> UUID only during migration-to-new-shard periods

@priyawadhwa
Contributor Author

Could the issue with trillian trees not necessarily being chronologically ordered be fixed by Rekor code ordering them and keeping track?

We can order trees chronologically at runtime; there's a CreateTime field on the Tree struct.
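
A rough sketch of that ordering against the Trillian admin API (assuming the ListTrees RPC and the generated Go client; not necessarily how Rekor would wire it up):

import (
    "context"
    "sort"

    "github.com/google/trillian"
)

// listTreesByAge fetches all trees from the Trillian admin server and
// orders them oldest-first using the CreateTime field on each Tree.
func listTreesByAge(ctx context.Context, admin trillian.TrillianAdminClient) ([]*trillian.Tree, error) {
    resp, err := admin.ListTrees(ctx, &trillian.ListTreesRequest{})
    if err != nil {
        return nil, err
    }
    trees := resp.Tree
    sort.Slice(trees, func(i, j int) bool {
        return trees[i].CreateTime.AsTime().Before(trees[j].CreateTime.AsTime())
    })
    return trees, nil
}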

I think we can achieve way less downtime by keeping the config, but removing the requirement for knowing the size of the inactive shard ahead of time. We can figure out the size during runtime. Sharding would look something like this:

  1. Create a new tree
  2. Add the old tree ID to the sharding config (w/ public key if we are rotating signers)
  3. Update the Rekor configmap to point to the new tree ID
  4. Redeploy Rekor to pick up these new changes
  5. During runtime, Rekor will get the sizes of the inactive shards (maintaining the order in the config) to determine the virtual log index for entries
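
A minimal sketch of what step 5 could look like against the Trillian log API (assuming GetLatestSignedLogRoot and the LogRootV1 type; illustrative only, not the actual implementation):

import (
    "context"

    "github.com/google/trillian"
    "github.com/google/trillian/types"
)

// shardSize asks Trillian for the latest signed log root of a frozen
// shard and returns its tree size, which becomes that shard's length
// when computing virtual log indexes.
func shardSize(ctx context.Context, log trillian.TrillianLogClient, treeID int64) (uint64, error) {
    resp, err := log.GetLatestSignedLogRoot(ctx, &trillian.GetLatestSignedLogRootRequest{LogId: treeID})
    if err != nil {
        return 0, err
    }
    var root types.LogRootV1
    if err := root.UnmarshalBinary(resp.SignedLogRoot.LogRoot); err != nil {
        return 0, err
    }
    return root.TreeSize, nil
}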

WDYT @lkatalin ?

@lkatalin
Contributor

lkatalin commented May 4, 2022

@priyawadhwa I read this a couple of times, at first thinking you meant these steps would be implemented in the code (except redeployment) but now realizing that steps 1-4 are manual and 5 is implemented in code - is this right? So the change would be that instead of passing in tree sizes in the config, Rekor can calculate them at runtime, but the config is still used to specify and order the shards. I think that in itself would be a good change.

My first (I think mistaken) read of your idea was that these steps would be automated and that led me to thinking about the utility of eventually having an API endpoint for sharding commands that could perform some of these functions. Then we could avoid having to do a redeployment to update the config, and instead the new config could be picked up during runtime. This would be in addition to having Rekor compute the inactive shard sizes separately from the config. Something like:

rekor-server shard --create-config mynewconfig.yaml #generates updated config (other params tbd)
# next, someone can manually edit the config if desired or even skip the previous step
rekor-server shard --use-config mynewconfig.yaml
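
If we went that route, the subcommand itself could be a thin wrapper (purely hypothetical sketch using cobra; the flag names match the example above but the handler is a placeholder):

import (
    "fmt"

    "github.com/spf13/cobra"
)

// shardCmd is a hypothetical "rekor-server shard" subcommand; the real
// flag set and behavior would need a proper design.
var shardCmd = &cobra.Command{
    Use:   "shard",
    Short: "generate or apply a sharding config without a full redeploy",
    RunE: func(cmd *cobra.Command, args []string) error {
        createCfg, _ := cmd.Flags().GetString("create-config")
        useCfg, _ := cmd.Flags().GetString("use-config")
        // placeholder: generate a config, or tell the running server to
        // pick up the new one, depending on which flag was passed
        fmt.Println("create:", createCfg, "use:", useCfg)
        return nil
    },
}

func init() {
    shardCmd.Flags().String("create-config", "", "path to write a generated sharding config to")
    shardCmd.Flags().String("use-config", "", "path of a sharding config for the server to load")
}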

In addition to not needing to do a redeployment, the benefits would be less work for whoever is creating the config or updating the server to use it, and less room for human error while still giving the option to do overrides.

I think the step you described of having Rekor get the sizes of inactive shards during runtime is more important, and this could be additional if we want to do it. There may be some problems as I'm still thinking through it, but it seems beneficial to skip mandatory redeployment if possible. What do others think?

@priyawadhwa
Contributor Author

but now realizing that steps 1-4 are manual and 5 is implemented in code - is this right?

Yep! I think that removing the tree length requirement is the minimal change we can make to get sharding to work without downtime. I'll probably do some manual testing to confirm that.

an API endpoint for sharding commands that could perform some of these functions.

This could be really useful, we'd just need some way to make sure only authorized users could hit this endpoint!

@priyawadhwa
Contributor Author

priyawadhwa commented May 10, 2022

Update: With #810 we've pretty much removed downtime; instead there is a small race condition while the Rekor deployment is spinning down old Pods (which point to the old shard) and spinning up new Pods (which point to the new shard). The risk is that the LoadBalancer would send requests to the old shard, and the new Pods may not pick up those new entries as they come up.

We have gotten around this for now by scaling the Deployment down to 1 Pod for the turnover and by sharding staging when few requests are being made (which is most of the time for staging 😅). For a seamless experience in production I think we could do the following (none of which should require big changes):

  1. Add support for automatically marking shards as FROZEN in the code. As soon as Rekor is redeployed and a "new" Pod spins up, it will mark old shards inactive. It will then get the length of each shard, which is immutable once the shard is marked FROZEN, so we can be confident that the length we get is correct (see the sketch after this list). If writes are directed to the old Pod in the ~20 seconds it takes for it to spin down, they will fail. This can be addressed by:
  2. @lukehinds' idea of adding retries. If we retry after ~20 seconds, this should pretty much cover the small window while we are redeploying.
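
A minimal sketch of the freeze step against the Trillian admin API (assuming the UpdateTree RPC with a field mask on tree_state; not Rekor's actual code):

import (
    "context"

    "github.com/google/trillian"
    "google.golang.org/protobuf/types/known/fieldmaskpb"
)

// freezeTree flips an old shard's tree_state to FROZEN so its length can
// no longer change; a rollback would set it back to ACTIVE the same way.
func freezeTree(ctx context.Context, admin trillian.TrillianAdminClient, treeID int64) (*trillian.Tree, error) {
    return admin.UpdateTree(ctx, &trillian.UpdateTreeRequest{
        Tree: &trillian.Tree{
            TreeId:    treeID,
            TreeState: trillian.TreeState_FROZEN,
        },
        UpdateMask: &fieldmaskpb.FieldMask{Paths: []string{"tree_state"}},
    })
}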

@lkatalin WDYT?

@var-sdk

var-sdk commented May 12, 2022

Regarding the proposal around seamless updates, is it rollback safe? That is, if something goes wrong with the new Rekor deployment during the rollout, or for some window after the rollout has completed, can the system quickly and easily be restored to the previous state?

With the current approach, there is a small window of downtime but it does seem to have rollback safety at the cost of a small window of data loss.

@priyawadhwa
Contributor Author

Regarding the proposal around seamless updates, is it rollback safe?

That's a good point. If we're automatically marking logs as FROZEN when they're in the sharding config, we would need to automatically mark them as ACTIVE again if we rolled back.

@priyawadhwa
Contributor Author

priyawadhwa commented May 16, 2022

Re: my initial idea around marking shards as FROZEN; I think an easier way of achieving the same thing would be to set the Rekor Deployment strategy to Recreate, so that all old pods are terminated before new ones come up. The current strategy is the default RollingUpdate. The nice thing about this is that if something goes wrong, we can easily run kubectl rollout undo to get back to the old state.
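
Concretely, that's just the strategy field on the Deployment (standard Kubernetes, shown here for reference):

# rekor-server Deployment: terminate all old pods before new ones start
spec:
  strategy:
    type: Recreate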

@haydentherapper
Contributor

@priyawadhwa This is complete, correct?

@priyawadhwa
Contributor Author

Yep!
