Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Token renewal failure time bomb 馃挘 #48

Closed
moertel opened this issue Sep 23, 2020 · 5 comments 路 Fixed by #49
Closed

Token renewal failure time bomb 馃挘 #48

moertel opened this issue Sep 23, 2020 · 5 comments 路 Fixed by #49

Comments

@moertel
Copy link
Contributor

moertel commented Sep 23, 2020

I'm using token-based auth and am wondering whether there's a design issue with how tokens are renewed.

Whatever the value of VAULT_TOKEN_LEASE_DURATION, with each failure to renew the token, the initial "buffer" (of the recommended 24h) will get smaller and smaller until it hits a lower limit where there's only one chance (two if you're lucky) for renewal until the token expires:

func RenewToken() {
for {
log.Info("Renew Vault token")
_, err := client.Auth().Token().RenewSelf(tokenLeaseDuration)
if err != nil {
log.Error(err, "Could not renew token")
}
time.Sleep(time.Duration(float64(tokenLeaseDuration)*0.5) * time.Second)
}
}

Example for a token with initial TTL of 24h and VAULT_TOKEN_LEASE_DURATION set to 6h:

  • Sep 23 00:00:00 -- ttl=24h, renew fails
  • Sep 23 03:00:00 -- ttl=21h, renew fails
  • Sep 23 06:00:00 -- ttl=18h, renew fails
  • Sep 23 09:00:00 -- ttl=15h, renew fails
  • Sep 23 12:00:00 -- ttl=12h, renew fails
  • Sep 23 15:00:00 -- ttl=9h, renew fails
  • Sep 23 18:00:00 -- ttl=6h, renew fails
  • Sep 23 21:00:00 -- ttl=3h, renew fails
  • Sep 24 00:00:00 -- ttl=0h, renew succeeds --> new TTL is 6h
  • Sep 24 03:00:00 -- ttl=3h, renew fails
  • Sep 24 06:00:00 -- ttl=0h (let's hope we have some headroom here)

So in a sense, the initial validity of the token buys us an error budget. In this case, it would be 9 or 10 failures until the token could expire (depending on whether we run into a race condition if the token is "almost expired").

While it is an unlikely event, especially in a HA Vault setup, it's definitely possible (and more likely as time progresses) and actually pretty tragic that the sleep is that long even in case of an error. 馃槄 I'm wondering whether you'd be open to adding a VAULT_TOKEN_RENEWAL_INTERVAL variable that makes this configurable, or by reducing the sleep in case of errors.

@ricoberger
Copy link
Owner

Hi @moertel, thank you for the detailed issue description. I would prefer your second solution, to keep the code simpler and that I haven't to maintain two logics (one with the VAULT_TOKEN_RENEWAL_INTERVAL and one without the environment variable, to not introduce a breaking change).

Maybe we can use an exponential backoff algorithm for the sleep time in case of an error. So that we retry after 10, 20, 40, 80, etc. seconds. When the renewal succeeds we can reset this. Other solution would be to try it every 30 seconds or so.

What do you think?

@moertel
Copy link
Contributor Author

moertel commented Sep 25, 2020

Both are fine with me. Personally, I always lean on the "making-it-configurable" side of things. One idea regarding backwards-compatibility: letting VAULT_TOKEN_RENEWAL_INTERVAL default to VAULT_TOKEN_LEASE_DURATION*0.5 would keep the code working exactly like it is today.

Exponential backoff sounds neat but as we can assume that we still have a valid token for quite some time it's probably overkill to try to optimise the time it takes to renew it. Let's just go with a 30-second interval. It won't hammer the Vault instance but still leave plenty of time to renew the token.

I'll make PRs for both things, if you don't mind, and if you feel the introduction of VAULT_TOKEN_RENEWAL_INTERVAL is just too much of a hassle, feel free to close it.

moertel added a commit to moertel/vault-secrets-operator that referenced this issue Sep 25, 2020
By default, token renewals take place after 50% of the token TTL have
passed. In certain scenarios this would lead to only having a single
chance to renew the token successfully before it would expire.

This commit decreases the time until the next renewal attempt in case
the last renewal failed.

Fixes ricoberger#48
@ricoberger
Copy link
Owner

Thanks for your contribution. The idea for the backwards compatibility sounds good. If you want to come up with another PR for this, you are welcome.

moertel added a commit to moertel/vault-secrets-operator that referenced this issue Sep 27, 2020
For particular Vault setups, the default values for token renewals might
not be a good fit. This commit adds a parametrisation, so that the time
between a successful or failed token renewal and the next renewal attempt
can be controlled via optional environment variables.

VAULT_TOKEN_RENEWAL_INTERVAL:
    The time (in seconds) between a successful token renewal and the
    next renewal attempt. Defaults to 50% of the token lease duration.

VAULT_TOKEN_RENEWAL_RETRY_INTERVAL:
    The time (in seconds) between a failed token renewal and the next
    renewal attempt. Defaults to 30 seconds.

Refs ricoberger#48
@ricoberger
Copy link
Owner

Hi, your two PRs are included in the 1.8.0 release.

@moertel
Copy link
Contributor Author

moertel commented Sep 28, 2020

Perfect, thanks a lot for the quick feedback and turnaround @ricoberger 馃憤

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants