Improving Sharding Downtime #806

Closed
priyawadhwa opened this issue May 3, 2022 · 11 comments
Labels
enhancement New feature or request

Comments

@priyawadhwa
Contributor

priyawadhwa commented May 3, 2022

Right now, sharding the log requires about 10-20 minutes of downtime.

The sharding process currently looks like this:

  1. Create a new trillian tree
  2. Mark the current tree as frozen (at this point, the log is readable but not writable)
  3. Update rekor config with the following:
    • Change the active tree to the tree created in (1)
    • Update the sharding config with the now-inactive tree ID and length
  4. The log is ready to go again!

Downtime occurs between freezing the log & updating the sharding config. We require the length of the inactive shard in the config to properly compute virtual indexes across shards.
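
For reference, the virtual index is just a shard-local index offset by the combined length of every earlier shard. A rough sketch of the computation (the names here are illustrative, not Rekor's actual code):

// virtualIndex maps an index that is local to one shard onto the global
// index space by adding up the lengths of all earlier shards.
// shardLengths must be in the same order as the sharding config.
func virtualIndex(shardLengths []int64, shard int, localIndex int64) int64 {
    var offset int64
    for i := 0; i < shard; i++ {
        offset += shardLengths[i]
    }
    return offset + localIndex
}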

FAQ

Why can't we migrate traffic from one shard to the next and then update the sharding config?

In this case, the inactive shard wouldn't be correctly queryable for some portion of time. That's because Rekor relies on the sharding config to determine which shards to search and how to compute the "virtual index" across shards. In the period between switching over traffic and updating the sharding config, we could append entries to the new log, but we wouldn't be able to access entries in the frozen log: almost all verifications would fail, and get-by-log-index lookups would return incorrect results.

What do we use the sharding config for?

The sharding config keeps track of inactive shards; for each one it records:

  • ID
  • Length
  • Associated public key (needed if we rotate keys for new shards)
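
For illustration, an inactive shard entry in the config looks roughly like the following (the field names are approximate and may not exactly match the current Rekor schema):

# one entry per inactive shard, listed in virtual-index order
- treeID: 1234567890123456
  treeLength: 3244712
  encodedPublicKey: <base64-encoded public key, only needed if the signing key rotated>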

Instead of manually creating the sharding config, why don't we automatically search for shards and compute the config in code?

We'd have to guess at how to order the shards (the ordering determines how virtual log indexes are computed). We could assume shards should be ordered chronologically; if we're comfortable with that assumption, it could reduce downtime.

@priyawadhwa priyawadhwa added the enhancement New feature or request label May 3, 2022
@lkatalin
Contributor

lkatalin commented May 3, 2022

Thanks for this recap, @priyawadhwa ! I would be a fan of trying to automate as much of this as possible instead of doing a manual shard config / restart. The manual portions introduce the most downtime, and I think they also leave room for user error down the line if the project continues to grow and scale. Could the issue with trillian trees not necessarily being chronologically ordered be fixed by Rekor code ordering them and keeping track?

@lkatalin
Contributor

lkatalin commented May 3, 2022

Other ideas floated in the meeting:

  • Instead of using the size of the logs to calculate the virt index, have a pointer to the next log
  • Create a mapping of log index <--> UUID only during migration-to-new-shard periods

@priyawadhwa
Contributor Author

Could the issue with trillian trees not necessarily being chronologically ordered be fixed by Rekor code ordering them and keeping track?

We can order trees chronologically at runtime; there's a CreateTime field on the Tree struct.
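
A rough sketch of that ordering against the Trillian admin API (assuming the ListTrees RPC and the generated Go client; not necessarily how Rekor would wire it up):

import (
    "context"
    "sort"

    "github.com/google/trillian"
)

// listTreesByAge fetches all trees from the Trillian admin server and
// orders them oldest-first using the CreateTime field on each Tree.
func listTreesByAge(ctx context.Context, admin trillian.TrillianAdminClient) ([]*trillian.Tree, error) {
    resp, err := admin.ListTrees(ctx, &trillian.ListTreesRequest{})
    if err != nil {
        return nil, err
    }
    trees := resp.Tree
    sort.Slice(trees, func(i, j int) bool {
        return trees[i].CreateTime.AsTime().Before(trees[j].CreateTime.AsTime())
    })
    return trees, nil
}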

I think we can achieve way less downtime by keeping the config, but removing the requirement for knowing the size of the inactive shard ahead of time. We can figure out the size during runtime. Sharding would look something like this:

  1. Create a new tree
  2. Add the old tree ID to the sharding config (w/ public key if we are rotating signers)
  3. Update the Rekor configmap to point to the new tree ID
  4. Redeploy Rekor to pick up these new changes
  5. During runtime, Rekor will get the sizes of the inactive shards (maintaining the order in the config) to determine the virtual log index for entries
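
A minimal sketch of what step 5 could look like against the Trillian log API (assuming GetLatestSignedLogRoot and the LogRootV1 type; illustrative only, not the actual implementation):

import (
    "context"

    "github.com/google/trillian"
    "github.com/google/trillian/types"
)

// shardSize asks Trillian for the latest signed log root of a frozen
// shard and returns its tree size, which becomes that shard's length
// when computing virtual log indexes.
func shardSize(ctx context.Context, log trillian.TrillianLogClient, treeID int64) (uint64, error) {
    resp, err := log.GetLatestSignedLogRoot(ctx, &trillian.GetLatestSignedLogRootRequest{LogId: treeID})
    if err != nil {
        return 0, err
    }
    var root types.LogRootV1
    if err := root.UnmarshalBinary(resp.SignedLogRoot.LogRoot); err != nil {
        return 0, err
    }
    return root.TreeSize, nil
}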

WDYT @lkatalin ?

@lkatalin
Contributor

lkatalin commented May 4, 2022

@priyawadhwa I read this a couple of times, at first thinking you meant these steps would be implemented in the code (except redeployment) but now realizing that steps 1-4 are manual and 5 is implemented in code - is this right? So the change would be that instead of passing in tree sizes in the config, Rekor can calculate them at runtime, but the config is still used to specify and order the shards. I think that in itself would be a good change.

My first (I think mistaken) read of your idea was that these steps would be automated and that led me to thinking about the utility of eventually having an API endpoint for sharding commands that could perform some of these functions. Then we could avoid having to do a redeployment to update the config, and instead the new config could be picked up during runtime. This would be in addition to having Rekor compute the inactive shard sizes separately from the config. Something like:

rekor-server shard --create-config mynewconfig.yaml #generates updated config (other params tbd)
# next, someone can manually edit the config if desired or even skip the previous step
rekor-server shard --use-config mynewconfig.yaml
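
If we went that route, the subcommand itself could be a thin wrapper (purely hypothetical sketch using cobra; the flag names match the example above but the handler is a placeholder):

import (
    "fmt"

    "github.com/spf13/cobra"
)

// shardCmd is a hypothetical "rekor-server shard" subcommand; the real
// flag set and behavior would need a proper design.
var shardCmd = &cobra.Command{
    Use:   "shard",
    Short: "generate or apply a sharding config without a full redeploy",
    RunE: func(cmd *cobra.Command, args []string) error {
        createCfg, _ := cmd.Flags().GetString("create-config")
        useCfg, _ := cmd.Flags().GetString("use-config")
        // placeholder: generate a config, or tell the running server to
        // pick up the new one, depending on which flag was passed
        fmt.Println("create:", createCfg, "use:", useCfg)
        return nil
    },
}

func init() {
    shardCmd.Flags().String("create-config", "", "path to write a generated sharding config to")
    shardCmd.Flags().String("use-config", "", "path of a sharding config for the server to load")
}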

In addition to not needing to do a redeployment, the benefits would be less work for whoever is creating the config or updating the server to use it, and less room for human error while still giving the option to do overrides.

I think the step you described of having Rekor get the sizes of inactive shards during runtime is more important, and this could be additional if we want to do it. There may be some problems as I'm still thinking through it, but it seems beneficial to skip mandatory redeployment if possible. What do others think?

@priyawadhwa
Contributor Author

but now realizing that steps 1-4 are manual and 5 is implemented in code - is this right?

Yep! I think that removing the tree length requirement is the minimal change we can make to get sharding to work without downtime. I'll probably do some manual testing to confirm that.

an API endpoint for sharding commands that could perform some of these functions.

This could be really useful, we'd just need some way to make sure only authorized users could hit this endpoint!

@priyawadhwa
Contributor Author

priyawadhwa commented May 10, 2022

Update: With #810 we've pretty much removed downtime; instead there is a small race condition while the Rekor deployment is spinning down old Pods (which point to the old shard) and spinning up new Pods (which point to the new shard). The risk is that the LoadBalancer would send requests to the old shard, and the new Pods may not pick up those new entries as they come up.

We have gotten around this for now by scaling the Deployment down to 1 Pod for the turnover and by sharding staging when few requests are being made (which is most of the time for staging 😅). For a seamless experience in production I think we could do the following (none of which should require big changes):

  1. Add support for automatically marking shards as FROZEN in the code. As soon as Rekor is redeployed and a "new" Pod spins up, it will mark old shards inactive. It will then get the length of each shard, which is immutable once the shard is marked FROZEN, so we can be confident that the length we get is correct (see the sketch after this list). If writes are directed to the old Pod in the ~20 seconds it takes for it to spin down, they will fail. This can be addressed by:
  2. @lukehinds' idea of adding retries. If we retry after ~20 seconds, this should pretty much cover the small window while we are redeploying.
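
A minimal sketch of the freeze step against the Trillian admin API (assuming the UpdateTree RPC with a field mask on tree_state; not Rekor's actual code):

import (
    "context"

    "github.com/google/trillian"
    "google.golang.org/protobuf/types/known/fieldmaskpb"
)

// freezeTree flips an old shard's tree_state to FROZEN so its length can
// no longer change; a rollback would set it back to ACTIVE the same way.
func freezeTree(ctx context.Context, admin trillian.TrillianAdminClient, treeID int64) (*trillian.Tree, error) {
    return admin.UpdateTree(ctx, &trillian.UpdateTreeRequest{
        Tree: &trillian.Tree{
            TreeId:    treeID,
            TreeState: trillian.TreeState_FROZEN,
        },
        UpdateMask: &fieldmaskpb.FieldMask{Paths: []string{"tree_state"}},
    })
}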

@lkatalin WDYT?

@var-sdk

var-sdk commented May 12, 2022

Regarding the proposal around seamless updates, is it rollback safe? That is, if something goes wrong with the new Rekor deployment during the rollout, or for some window after the rollout has completed, can the system quickly and easily be restored to the previous state?

With the current approach, there is a small window of downtime but it does seem to have rollback safety at the cost of a small window of data loss.

@priyawadhwa
Contributor Author

Regarding the proposal around seamless updates, is it rollback safe?

That's a good point. If we're automatically marking logs as FROZEN when they're in the sharding config, we would need to automatically mark them as ACTIVE again if we rolled back.

@priyawadhwa
Contributor Author

priyawadhwa commented May 16, 2022

Re: my initial idea around marking shards as FROZEN; I think an easier way of achieving the same thing would be to set the Rekor Deployment strategy to Recreate, so that all old pods are terminated before new ones come up. The current strategy is the default RollingUpdate. The nice thing about this is that if something goes wrong, we can easily run kubectl rollout undo to get back to the old state.
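
Concretely, that's just the strategy field on the Deployment (standard Kubernetes, shown here for reference):

# rekor-server Deployment: terminate all old pods before new ones start
spec:
  strategy:
    type: Recreate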

@haydentherapper
Contributor

@priyawadhwa This is complete, correct?

@priyawadhwa
Contributor Author

Yep!
